API

Chat completions

POST /api/v1/chat/completions is OpenAI-compatible. The request and response bodies match the OpenAI schema, so an existing OpenAI SDK works as soon as you give it a Pendra base URL and API key.

Request

curl
curl https://api.pendra.ai/api/v1/chat/completions \
  -H "Authorization: Bearer pdr_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:27b",
    "messages": [{"role": "user", "content": "Explain GPUs in one sentence."}]
  }'
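If you would rather not add an SDK, the same request can be made with Python's standard library. This is a minimal sketch. `build_chat_request` and `chat` are hypothetical helper names, not part of any Pendra library, and the 120-second client timeout is an arbitrary choice.

```python
import json
import urllib.request

CHAT_URL = "https://api.pendra.ai/api/v1/chat/completions"


def build_chat_request(body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(body: dict, api_key: str) -> dict:
    """Send the request and decode the JSON response body."""
    # Client-side timeout (assumption): set above the ~100 s server idle cut-off.
    with urllib.request.urlopen(build_chat_request(body, api_key), timeout=120) as resp:
        return json.load(resp)
```

Usage: `chat({"model": "qwen3.6:27b", "messages": [{"role": "user", "content": "Explain GPUs in one sentence."}]}, os.environ["PENDRA_API_KEY"])`.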

Required fields

  • model — model ID, e.g. qwen3.6:27b, llama3.3:70b, gpt-oss:120b. Browse available models.
  • messages — array of { role, content } objects.

Common optional fields

  • stream — set true for server-sent events.
  • temperature, top_p, max_tokens, stop — standard OpenAI sampling controls.
  • tools, tool_choice — function calling, OpenAI-shaped.
  • response_format — set to { "type": "json_object" } for JSON-mode output (model permitting).
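The fields above can be assembled into a request body with a small helper. This is a sketch under a few assumptions. `chat_body` is a hypothetical name, and it checks only the two required fields. Optional fields are passed through unchanged, because the server, not this helper, decides what a given model supports.

```python
from typing import Any, Dict, List


def chat_body(model: str, messages: List[Dict[str, Any]], **options: Any) -> Dict[str, Any]:
    """Assemble a chat completions request body.

    Checks the required fields; optional fields (stream, temperature, top_p,
    max_tokens, stop, tools, tool_choice, response_format, ...) are passed
    through, skipping any left as None.
    """
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages must be a non-empty array")
    for i, msg in enumerate(messages):
        if "role" not in msg or "content" not in msg:
            raise ValueError(f"messages[{i}] needs both role and content")
    body: Dict[str, Any] = {"model": model, "messages": messages}
    body.update({k: v for k, v in options.items() if v is not None})
    return body


# JSON-mode request with a few sampling controls (model permitting).
example = chat_body(
    "qwen3.6:27b",
    [{"role": "user", "content": "List three GPU vendors as JSON."}],
    temperature=0.2,
    max_tokens=200,
    response_format={"type": "json_object"},
    stop=None,  # None values are left out of the body
)
```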

Response

Non-streaming responses come back as a single OpenAI-shaped chat.completion object. usage is always populated; finish_reason is "stop" when the model finished naturally or "length" when capped by max_tokens.

{
  "id": "chatcmpl-9f2b1c8e",
  "object": "chat.completion",
  "created": 1715346123,
  "model": "qwen3.6:27b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "GPUs are specialised processors built for the kind of massively parallel arithmetic that graphics and AI workloads need."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 24,
    "total_tokens": 36
  }
}
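Reading the fields described above from a decoded response only takes dictionary lookups. This sketch assumes the body has already been parsed with json.loads, and `summarize_completion` is a hypothetical helper. It treats "length" as truncation, following the finish_reason description above. The sample is a trimmed copy of the example response.

```python
import json
from typing import Any, Dict


def summarize_completion(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the reply text, stop reason and token count out of a chat.completion."""
    choice = resp["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means max_tokens cut the reply short.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": resp["usage"]["total_tokens"],
    }


raw = """{"id": "chatcmpl-9f2b1c8e", "object": "chat.completion",
 "model": "qwen3.6:27b",
 "choices": [{"index": 0,
   "message": {"role": "assistant", "content": "GPUs are specialised processors."},
   "finish_reason": "stop"}],
 "usage": {"prompt_tokens": 12, "completion_tokens": 24, "total_tokens": 36}}"""

summary = summarize_completion(json.loads(raw))
```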

Streaming

With stream: true, Pendra returns Server-Sent Events in OpenAI's format: each event is a data: { ... } line carrying a delta chunk, and the stream ends with data: [DONE]. The proxy in front of the API (Caddy) flushes every event immediately, so tokens arrive as they are generated. With curl, pass -N to turn off curl's own output buffering.

curl
curl https://api.pendra.ai/api/v1/chat/completions \
  -H "Authorization: Bearer pdr_sk_..." \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "qwen3.6:27b",
    "stream": true,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Streaming response

Each chunk is an OpenAI-shaped chat.completion.chunk. The final chunk before [DONE] carries usage because Pendra always sets stream_options.include_usage = true server-side.

data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","created":1715346123,"model":"qwen3.6:27b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","created":1715346123,"model":"qwen3.6:27b","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","created":1715346123,"model":"qwen3.6:27b","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-9f2b","object":"chat.completion.chunk","created":1715346123,"model":"qwen3.6:27b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":8,"completion_tokens":2,"total_tokens":10}}

data: [DONE]
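The event stream above can also be consumed without an SDK by reading it line by line. This is a minimal accumulator sketch. It assumes each event fits on a single data: line, as in the example, and `collect_stream` is a hypothetical name. The sample lines are the example chunks with the id, object, created and model fields removed.

```python
import json
from typing import Any, Dict, Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[str], Optional[Dict[str, Any]]]:
    """Join delta content from SSE lines; return (text, finish_reason, usage)."""
    parts = []
    finish_reason = None
    usage = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
        # Pendra attaches usage to the final chunk before [DONE].
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), finish_reason, usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":8,"completion_tokens":2,"total_tokens":10}}',
    "",
    "data: [DONE]",
]

text, reason, usage = collect_stream(sample)
```

With urllib, you can pass `(raw.decode("utf-8") for raw in resp)`, where resp is the object returned by urllib.request.urlopen. Iterating that object yields the body one line at a time as bytes.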

Python (using the Pendra SDK)

stream.py
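from pendra import Pendra

client = Pendra()  # reads PENDRA_API_KEY

stream = client.chat.completions.create(
    model="qwen3.6:27b",
    stream=True,
    messages=[{"role": "user", "content": "Write a haiku."}],
)
for event in stream:
    # Skip chunks with an empty choices list; flush so each token shows immediately.
    if event.choices:
        print(event.choices[0].delta.content or "", end="", flush=True)
print()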

Response headers

Every chat response, streaming or not, carries these headers:

Header          Meaning
X-Request-Id    UUID. Quote this to support when reporting an issue with a request.
X-Worker-Id     Which GPU worker served the request.
X-Worker-Name   Human-readable worker name from the console.
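It helps to log these headers with your own request records so support can trace a failing call. This sketch uses the standard library. urllib's response.headers is an http.client.HTTPMessage, a subclass of email.message.Message, whose get() lookups are case-insensitive. `worker_info` is a hypothetical helper. The demo builds a Message by hand instead of making a network call, and its values are made up.

```python
from email.message import Message
from typing import Dict, Optional

PENDRA_HEADERS = ("X-Request-Id", "X-Worker-Id", "X-Worker-Name")


def worker_info(headers: Message) -> Dict[str, Optional[str]]:
    """Return the Pendra tracing headers; missing ones map to None."""
    return {name: headers.get(name) for name in PENDRA_HEADERS}


# Stand-in for resp.headers from urllib.request.urlopen(...); values are made up.
demo = Message()
demo["x-request-id"] = "123e4567-e89b-12d3-a456-426614174000"
demo["X-Worker-Id"] = "worker-7"
info = worker_info(demo)
```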

Tool / function calling

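Tools work the same way as in OpenAI's API. Pass a tools array; the model may reply with an assistant message containing a tool_calls array. Run each requested function yourself, append that assistant message followed by one {role: "tool", tool_call_id, content} message per call, and call /chat/completions again with the extended messages array.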
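Here is a sketch of one tool round trip, written as a helper that runs the requested calls and returns the messages for the next request. It assumes the assistant message is an OpenAI-shaped dict, such as one obtained from json.loads of the response, and that arguments arrive as a JSON string, as in the OpenAI schema. The `get_time` tool, `TOOLS` registry, `TOOL_SPECS` and `tool_round` are made-up examples.

```python
import json
from typing import Any, Callable, Dict, List

# OpenAI-shaped definition to send as the request's "tools" array.
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Current local time in a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_time(city: str) -> str:
    """Hypothetical local tool used only for this example."""
    return f"12:00 in {city}"


TOOLS: Dict[str, Callable[..., Any]] = {"get_time": get_time}


def tool_round(messages: List[Dict[str, Any]], assistant: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return messages + the assistant turn + one tool message per tool call."""
    out = messages + [assistant]  # the assistant message with tool_calls comes first
    for call in assistant.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"] or "{}")
        result = fn(**args)
        out.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": result if isinstance(result, str) else json.dumps(result),
        })
    return out


history = [{"role": "user", "content": "What time is it in Lisbon?"}]
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_time", "arguments": "{\"city\": \"Lisbon\"}"},
    }],
}
next_messages = tool_round(history, assistant_msg)
```

Send next_messages, together with the same tools array, in the next /chat/completions call. Repeat until the reply has no tool_calls.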

Timeouts

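Non-streaming chat requests are cut off after about 100 seconds of connection idleness, so a long generation can be dropped before its full response is ready. Use stream: true for long generations: the partial tokens keep arriving and keep the socket alive.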

OpenAI SDK compatibility

Point the OpenAI SDK at Pendra by setting OPENAI_BASE_URL=https://api.pendra.ai/api/v1 and OPENAI_API_KEY=pdr_sk_…. No other code changes are needed.
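You can also configure this in code instead of through environment variables. This sketch uses the official openai Python package (v1 or later), whose OpenAI client accepts base_url and api_key arguments. Reading the key from PENDRA_API_KEY is an assumption borrowed from the Pendra SDK example, and `make_client` is a hypothetical helper.

```python
import os
from typing import Optional

PENDRA_BASE_URL = "https://api.pendra.ai/api/v1"


def make_client(api_key: Optional[str] = None):
    """Return an OpenAI SDK client that talks to Pendra (needs `pip install openai`)."""
    from openai import OpenAI  # imported lazily so this module loads without the SDK

    return OpenAI(
        base_url=PENDRA_BASE_URL,
        # Assumption: the key lives in PENDRA_API_KEY, as the Pendra SDK expects.
        api_key=api_key or os.environ["PENDRA_API_KEY"],
    )
```

Usage: `make_client().chat.completions.create(model="qwen3.6:27b", messages=[{"role": "user", "content": "Hello"}])`. The reply text is in `.choices[0].message.content`.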