Chat completions

OpenAI-compatible chat completions with streaming, tools, and the same request/response shape your existing code expects.

POST /api/v1/chat/completions

curl https://api.pendra.ai/api/v1/chat/completions \
  -H "Authorization: Bearer pdr_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:27b",
    "messages": [
      {
        "role": "user",
        "content": "Explain GPUs in one sentence."
      }
    ]
  }'

from pendra import Pendra

client = Pendra()  # reads PENDRA_API_KEY

response = client.chat.completions.create(
    model="qwen3.6:27b",
    messages=[{"role": "user", "content": "Explain GPUs in one sentence."}],
)
print(response.choices[0].message.content)

import Pendra from 'pendra';

const client = new Pendra(); // reads PENDRA_API_KEY

const response = await client.chat.completions.create({
  model: 'qwen3.6:27b',
  messages: [{ role: 'user', content: 'Explain GPUs in one sentence.' }],
});
console.log(response.choices[0].message.content);

200

{
  "id": "chatcmpl-9f2b1c8e",
  "object": "chat.completion",
  "created": 1715346123,
  "model": "qwen3.6:27b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "GPUs are massively parallel processors."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 24,
    "total_tokens": 36
  }
}

New to chat? Start with the Chat guide. This page is the field-by-field reference. Related: Thinking, Tool calling, Structured outputs, Vision.

Body application/json

model string required

Model ID, e.g. qwen3.6:27b, llama3.3:70b, gpt-oss:120b. Browse available models.

messages array required

The conversation — an array of { role, content } objects (system, user, assistant, tool).

stream boolean default: false

Set true for server-sent events (see Streaming below).

tools array

Function-calling tools, OpenAI-shaped. Pair with tool_choice and parallel_tool_calls. See Tool calling.

response_format object

Set to { "type": "json_object" } or a json_schema for structured output. See Structured outputs.

reasoning_effort string

"low" / "medium" / "high" for reasoning-capable models, or "none" to turn thinking off. See Thinking.

enable_thinking boolean

Set to false to make a reasoning model answer directly, without its chain-of-thought. Defaults to true. See Thinking.

Other optional fields

temperature, top_p, top_k, min_p, max_tokens, stop, seed: sampling controls. Most follow the OpenAI spec; top_k and min_p are extensions that OpenAI's own API does not define.
frequency_penalty, presence_penalty, logit_bias: repetition and token-bias controls.
logprobs, top_logprobs: request token log-probabilities (model permitting).

Any other standard OpenAI Chat Completions field is forwarded to the serving worker as-is. Whether a given field takes effect depends on the model serving the request.
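As a rough illustration of how the body fields above combine, the sketch below assembles a request body with a few optional fields and drops any option left as None so that server defaults apply. The helper name build_chat_body is hypothetical and not part of any Pendra SDK; only the field names come from this page.

```python
import json


def build_chat_body(model, messages, **options):
    """Return a JSON string for POST /api/v1/chat/completions.

    Hypothetical helper: options set to None are left out of the body,
    so the server falls back to its own defaults for them.
    """
    body = {"model": model, "messages": messages}
    body.update({key: value for key, value in options.items() if value is not None})
    return json.dumps(body)


payload = build_chat_body(
    "qwen3.6:27b",
    [{"role": "user", "content": "List three GPU vendors as JSON."}],
    response_format={"type": "json_object"},  # structured output
    reasoning_effort="low",                   # reasoning-capable models only
    max_tokens=256,
    temperature=None,                         # omitted from the body
)
print(payload)
```

The resulting string can be sent as the request body with any HTTP client, alongside the Authorization and Content-Type headers shown in the curl example.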

Response

Non-streaming responses come back as a single OpenAI-shaped chat.completion object (see example). usage is always populated; finish_reason is "stop" on a natural finish or "length" when capped by max_tokens.

For a reasoning model, the chain-of-thought comes back on message.reasoning_content (mirrored on message.reasoning), separate from the answer in message.content. In streaming responses it arrives as delta.reasoning_content. See Thinking.
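A minimal sketch of reading these fields from a non-streaming response that has already been decoded into a dict. The function name split_reply and the sample payload are made up for illustration; the field names are the ones described above.

```python
def split_reply(response: dict) -> dict:
    """Separate the answer from the chain-of-thought in a chat.completion dict."""
    choice = response["choices"][0]
    message = choice["message"]
    # reasoning_content is mirrored on reasoning, so either may be read.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    return {
        "answer": message.get("content") or "",
        "reasoning": reasoning,
        # "length" means the reply was capped by max_tokens.
        "truncated": choice.get("finish_reason") == "length",
    }


# Hypothetical response from a reasoning model, for illustration only.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "4",
                "reasoning_content": "2 + 2 is 4.",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 8, "total_tokens": 18},
}

reply = split_reply(sample)
print(reply["answer"])
```

For a model that does not reason, reasoning is simply None and the rest of the result is unchanged.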

Streaming

With stream: true, Pendra returns Server-Sent Events matching OpenAI's format: each event is a data: { ... } line containing a delta chunk, and the stream ends with data: [DONE]. Pendra flushes each chunk immediately, so tokens arrive as they're generated.

curl https://api.pendra.ai/api/v1/chat/completions \
  -H "Authorization: Bearer pdr_sk_..." \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "qwen3.6:27b",
    "stream": true,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

stream = client.chat.completions.create(
    model="qwen3.6:27b",
    stream=True,
    messages=[{"role": "user", "content": "Hello"}],
)
for event in stream:
    print(event.choices[0].delta.content or "", end="")

const stream = await client.chat.completions.create({
  model: 'qwen3.6:27b',
  stream: true,
  messages: [{ role: 'user', content: 'Hello' }],
});
for await (const event of stream) {
  process.stdout.write(event.choices[0]?.delta?.content ?? '');
}

Each chunk is an OpenAI-shaped chat.completion.chunk. The final chunk before [DONE] carries usage because Pendra always sets stream_options.include_usage = true server-side.

data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":8,"completion_tokens":2,"total_tokens":10}}

data: [DONE]
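If you consume the raw event stream without an SDK, the events above can be folded into the final text and usage roughly as follows. This is a sketch that assumes the stream has already been split into lines; collect_stream is a hypothetical name, and the sample lines are the events shown above.

```python
import json


def collect_stream(lines):
    """Accumulate delta content and the usage object from SSE lines."""
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            text_parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(text_parts), usage


sample_lines = [
    'data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":8,"completion_tokens":2,"total_tokens":10}}',
    "",
    "data: [DONE]",
]

text, usage = collect_stream(sample_lines)
print(text, usage)
```

Chunks with an empty delta contribute no text, so they pass through this loop harmlessly.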

Web tool steps

When the worker serving your request has Web tools enabled, the model can fetch pages and run web searches while it answers. Each of those steps is reported back on the response under a Pendra-specific pendra field, so you can show what the model looked at. It sits alongside choices and is safe to ignore — the OpenAI SDKs skip unknown fields, so your existing code is unaffected.

Non-streaming responses carry the full list as pendra.web_tool_steps; streaming responses emit one pendra.web_tool_step event per step (a "call" when the model asks, then a "result" once it runs) on a chunk whose choices delta is empty. Each step has: index, name (web_fetch or web_search), arguments (a JSON-encoded string), ok, result (a truncated preview), result_chars (the length of the full result), and error when the step failed.

{
  "choices": [{ "index": 0, "message": { "role": "assistant", "content": "…" }, "finish_reason": "stop" }],
  "usage": { "prompt_tokens": 812, "completion_tokens": 43, "total_tokens": 855 },
  "pendra": {
    "web_tool_steps": [
      { "index": 0, "name": "web_fetch", "arguments": "{\"url\":\"https://example.com\"}",
        "ok": true, "result": "Example Domain…", "result_chars": 129 }
    ]
  }
}
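One way to surface these steps in a UI or log is sketched below. It works on a non-streaming response decoded into a dict; describe_web_steps is a hypothetical helper, and the sample mirrors the response above.

```python
def describe_web_steps(response: dict) -> list:
    """Return one human-readable line per entry in pendra.web_tool_steps."""
    steps = response.get("pendra", {}).get("web_tool_steps", [])
    lines = []
    for step in steps:
        if step.get("ok"):
            status = "ok"
        else:
            status = "failed: " + str(step.get("error"))
        lines.append(
            f"{step['name']}({step['arguments']}) -> {status}, "
            f"{step.get('result_chars', 0)} chars"
        )
    return lines


sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "…"}, "finish_reason": "stop"}
    ],
    "pendra": {
        "web_tool_steps": [
            {
                "index": 0,
                "name": "web_fetch",
                "arguments": '{"url":"https://example.com"}',
                "ok": True,
                "result": "Example Domain…",
                "result_chars": 129,
            }
        ]
    },
}

for line in describe_web_steps(sample):
    print(line)
```

A response without a pendra field yields an empty list, so the same call is safe on workers that have Web tools disabled.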

Notices

The same pendra field also carries the occasional advisory about a response, under pendra.notice. Today there's one: when a reasoning model spends its whole max_tokens budget thinking and returns no answer (content empty, finish_reason: "length"), the response includes pendra.notice with code: "truncated_during_reasoning" and a human-readable message. Streaming responses emit it on a chunk whose choices delta is empty, just before the final chunk. Detect it to retry with a larger budget or with thinking off (see Thinking). Like everything under pendra, it's safe to ignore — OpenAI SDKs skip unknown fields.
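A sketch of the detect-and-retry pattern described above, operating on plain dicts. The helper names and the fallback budget of 1024 tokens are assumptions made for illustration, not documented Pendra defaults.

```python
def truncated_during_reasoning(response: dict) -> bool:
    """True when pendra.notice reports that thinking used the whole budget."""
    notice = response.get("pendra", {}).get("notice") or {}
    return notice.get("code") == "truncated_during_reasoning"


def retry_request(request: dict, grow: int = 2, thinking_off: bool = False) -> dict:
    """Return a copy of the request adjusted for a retry.

    Either multiplies max_tokens by `grow` or turns thinking off.
    The 1024 fallback is an assumed placeholder, not a Pendra default.
    """
    retry = dict(request)
    if thinking_off:
        retry["reasoning_effort"] = "none"
    else:
        retry["max_tokens"] = request.get("max_tokens", 1024) * grow
    return retry


# Hypothetical truncated response, for illustration only.
response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": ""}, "finish_reason": "length"}
    ],
    "pendra": {"notice": {"code": "truncated_during_reasoning", "message": "…"}},
}
request = {"model": "qwen3.6:27b", "max_tokens": 512, "messages": []}

if truncated_during_reasoning(response):
    request_for_retry = retry_request(request)
    print(request_for_retry["max_tokens"])
```

Whether to grow the budget or turn thinking off depends on the task; doubling is only one possible policy.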

Response headers

Non-streaming chat responses carry these headers (streaming responses don't):

Header	Meaning
`X-Request-Id`	UUID. Quote this to support when reporting an issue with a request.
`X-Worker-Id`	Which GPU worker served the request.
`X-Worker-Name`	Human-readable worker name from the console.
`Server-Timing`	Per-request performance for this generation: `ttft` (time to first token, ms), `tps` (tokens per second, in the metric's `desc`), and `queue` (time the request waited for a free slot, ms). Members are present only when measured.

The Server-Timing header lets you read per-request latency and throughput without an extra call, which is handy when benchmarking or sizing concurrency. For example:

Server-Timing: ttft;dur=42, tps;dur=0;desc="58.3", queue;dur=5

The same figures are also recorded against each request in the console's Usage view, including for streaming requests (a streaming response can't carry the header, because headers are sent before the first token).
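The header value can be turned into numbers with a few lines of string handling. The sketch below assumes the layout shown in the example above (durations in dur, tokens per second in the desc of tps); parse_server_timing is a hypothetical helper, and it does not handle commas inside quoted descriptions.

```python
def parse_server_timing(header: str) -> dict:
    """Parse a Server-Timing value into {"ttft_ms", "tps", "queue_ms"} floats.

    Members missing from the header are simply absent from the result.
    """
    metrics = {}
    for member in header.split(","):
        parts = [part.strip() for part in member.split(";")]
        name = parts[0]
        params = {}
        for part in parts[1:]:
            key, _, value = part.partition("=")
            params[key.strip()] = value.strip().strip('"')
        if name == "tps":
            # Throughput is carried in desc; dur is a placeholder.
            if "desc" in params:
                metrics["tps"] = float(params["desc"])
        elif name and "dur" in params:
            metrics[name + "_ms"] = float(params["dur"])
    return metrics


timing = parse_server_timing('ttft;dur=42, tps;dur=0;desc="58.3", queue;dur=5')
print(timing)
```

Because members are present only when measured, callers should read the result with .get() rather than assuming every key exists.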

Timeouts

A single non-streaming chat request can run up to ~30 minutes. While a slow or large model works, Pendra automatically keeps the connection alive, so you no longer need to switch to streaming just to avoid a timeout on a long generation. stream: true is still the best choice for interactive UIs — it shows partial tokens as they're generated rather than waiting for the whole reply.

OpenAI SDK compatibility

Point the OpenAI SDK at Pendra by setting OPENAI_BASE_URL=https://api.pendra.ai/api/v1 and OPENAI_API_KEY=pdr_sk_…. No other code changes needed.

The OpenAI convention https://api.pendra.ai/v1 works too, so if you already have a base URL ending in /v1 you can leave it as-is — chat/completions, embeddings, and models all resolve at both /api/v1 and /v1.