Version: 0.4

OpenAI Provider

CubePi ships two OpenAI providers covering the two API surfaces:

OpenAIProvider — Chat Completions API (/v1/chat/completions). Use this for the GPT-4/5 family and most OpenAI-compatible servers (vLLM, LiteLLM, DeepSeek, Qwen, MiniMax, DouBao, …).
OpenAIResponsesProvider — Responses API (/v1/responses). Use this when you want server-side state and reasoning summaries.

Both implement the same Provider protocol; pick one per agent.

Chat Completions: `OpenAIProvider`

from cubepi import Model
from cubepi.providers.openai import OpenAIProvider

provider = OpenAIProvider(
    api_key="sk-…",      # or reads OPENAI_API_KEY
    base_url=None,        # set for OpenAI-compatible servers
    extra_body=None,      # merged into every request
    extra_headers=None,
    payload_quirks=None,  # ["max_completion_tokens_alias", …]
)

model = Model(
    id="gpt-5",
    provider="openai",
    reasoning=True,        # enables thinking level mapping
    max_tokens=8192,
    context_window=128_000,
)

Thinking on Chat Completions

OpenAI exposes reasoning content through delta.reasoning_content on o-series and gpt-5 models. CubePi captures it as ThinkingContent and emits thinking_* events identically to Anthropic. The same ThinkingLevel enum ("off" → "high") works.

Many OpenAI-compatible OSS backends emit reasoning under different fields. CubePi understands three in priority order:

delta.reasoning_content (DeepSeek, Qwen, DouBao)
delta.reasoning (vLLM)
delta.reasoning_details (MiniMax)

No configuration needed — the provider picks whichever field is present.

`extra_body` for OSS quirks

Most OpenAI-compatible servers accept extensions through the request body. Set them once at construction:

provider = OpenAIProvider(
    api_key="…",
    base_url="https://api.deepseek.com/v1",
    extra_body={"enable_thinking": True, "stream_options": {"include_usage": True}},
)

If you need per-request mutation, use on_payload (see below).

`payload_quirks`

Some servers require max_tokens instead of max_completion_tokens:

provider = OpenAIProvider(
    api_key="…",
    payload_quirks=["max_completion_tokens_alias"],
)

CubePi renames the key on the way out.

Pointing at vLLM / LiteLLM / DeepSeek

provider = OpenAIProvider(
    api_key="dummy",                                    # vLLM ignores it
    base_url="http://localhost:8000/v1",
    extra_headers={"Authorization": "Bearer dummy"},
)

For LiteLLM:

provider = OpenAIProvider(
    api_key=os.environ["LITELLM_KEY"],
    base_url="https://litellm.internal/v1",
)

Responses API: `OpenAIResponsesProvider`

from cubepi.providers.openai_responses import OpenAIResponsesProvider

provider = OpenAIResponsesProvider(api_key="sk-…")
model = Model(id="gpt-5", provider="openai_responses", reasoning=True)

The Responses API keeps state server-side (referenced by previous_response_id). CubePi tracks AssistantMessage.response_id and feeds it back automatically — your code looks identical to the Chat Completions path.

Use the Responses provider when:

You want reasoning summaries (not just text) surfaced as thinking blocks.
You're using the o-series and want the server to hold the reasoning chain across turns (smaller payloads, faster reuse).

Stay on OpenAIProvider when you want full control over the message list and prompt caching strategy.

`on_payload` / `on_response`

Same shape as the Anthropic provider. The payload dict differs (messages instead of messages + system separately, OpenAI-style tools schema), so inspect it once before mutating.

async def add_user_metadata(payload, model):
    payload["user"] = "u-42"     # billable user attribution
    return payload

agent = Agent(provider=provider, model=model, on_payload=add_user_metadata)

Tool calling

Tool definitions are auto-converted to OpenAI's {"type": "function", "function": {...}} shape. The streaming format emits incremental JSON arguments under toolcall_delta; CubePi buffers and parses them through cubepi.utils.json_parse.parse_streaming_json so partials always validate to the closest well-formed object.

Multiple parallel tool calls in one assistant message just work — they're routed through the same parallel executor as the Anthropic provider.

Common pitfalls

stream_options.include_usage rejected — Some compatibles reject the whole stream_options field. on_payload cannot fix this: cubepi 0.3 calls kwargs.setdefault("stream_options", {}) after your callback runs, so deleting the key in on_payload is silently undone. Workarounds:
- Subclass OpenAIProvider and override stream() to skip the setdefault for your backend.
- Set include_usage=False in on_payload (the field still goes out, but is usually accepted as a no-op even by strict backends).
- Open an issue against cubepi to add a payload_quirks entry such as "no_stream_options" for native opt-out.
Thinking events but no thinking_* events — Your backend surfaces reasoning under a non-standard field. Either add a fourth branch via PR or transcode it with on_payload.
Mixed providers in one process — Each provider holds its own HTTP client. Reuse a single instance per (base_url, api_key) pair instead of creating one per agent.
Usage shows 0 input tokens — Most compatibles omit usage entirely or only emit it on the final chunk. Inspect the trailing chunk in on_payload for a hint, or treat token counts as best-effort on those backends.

Chat Completions: OpenAIProvider​

Thinking on Chat Completions​

extra_body for OSS quirks​

payload_quirks​

Pointing at vLLM / LiteLLM / DeepSeek​

Responses API: OpenAIResponsesProvider​

on_payload / on_response​

Tool calling​

Common pitfalls​

See also​