API Reference

Errors & rate limits

Pendra returns standard HTTP status codes. Errors come back as JSON with a short detail message, and non-streaming inference responses carry an X-Request-Id header you can quote in a support ticket.

Status codes

Code	Meaning	What to do
`400`	Bad request — malformed JSON, an invalid field, or a prompt that's too long for the model's context window.	Fix the request body. The `detail` / `error.message` field describes the problem (e.g. `prompt exceeds context window`).
`413` / `422`	The request reached a worker but was rejected as too large or unprocessable (e.g. an input over the model's limit).	Shorten the input or split it into smaller requests. The error message names the limit that was exceeded.
`401`	Missing or invalid API key.	Check the `Authorization` or `x-api-key` header. Rotate the key from the console if needed.
`403`	Authenticated but not allowed — e.g. owner-only operation called by a member.	Use a key from a user with the right role.
`404`	Unknown route, or the requested model isn't installed on any connected worker.	Check the URL prefix (`/api/v1` vs `/v1`), and that the model is installed and a worker serving it is online. Browse available models.
`429`	Rate limited, or every worker serving the model is busy. Pendra queues your request for up to ~30 seconds waiting for a free slot before returning this, so a brief traffic spike usually clears on its own.	Back off and retry: honour the `Retry-After` header (seconds) when present, otherwise use exponential backoff with jitter. Higher plans get higher limits; if you self-host, add workers to handle more concurrent requests.
`500`	Unexpected server error.	Retry once; if it persists, send the `X-Request-Id` to support.
`503`	No worker that can serve the model is currently online. This is distinct from `429`, which means workers are online but busy.	Retry shortly, honouring the `Retry-After` header; Pendra routes to a worker as soon as one comes online. If you self-host, check that your worker is running.
`502`	Gateway error: Pendra couldn't get a usable response from any worker (the worker dropped the connection or returned something Pendra couldn't parse). A worker that rejects your request for a specific reason returns that reason's own code (e.g. `400`) instead.	The `X-Worker-Id` / `X-Worker-Name` headers identify the worker. Retry; Pendra will pick a different worker.
`504`	Timeout — a single request can run up to ~30 minutes before Pendra gives up. Pendra keeps the connection alive while a slow model works, so you rarely hit this.	If you do see it, the model or worker is likely overloaded — retry, or split a very large request into smaller ones.
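
As a rough illustration of the table above, a client can sort status codes into those worth retrying and those that need a changed request. This is a minimal sketch, not part of any Pendra SDK: the names `RETRYABLE`, `FIX_REQUEST`, and `next_step` are hypothetical, and treating unlisted codes as "unknown" is an assumption.

```python
# Codes the table suggests retrying (note that 500 is "retry once").
RETRYABLE = {429, 500, 502, 503, 504}
# Codes that call for a corrected request, key, role, URL, or model.
FIX_REQUEST = {400, 401, 403, 404, 413, 422}


def next_step(status: int) -> str:
    """Return "ok", "retry", "fix", or "unknown" for an HTTP status code."""
    if 200 <= status < 300:
        return "ok"
    if status in RETRYABLE:
        return "retry"
    if status in FIX_REQUEST:
        return "fix"
    # Assumption: codes the table does not list are left to the caller.
    return "unknown"
```

A caller would typically cap retries separately, for example allowing a single retry on `500` before escalating with the `X-Request-Id`.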

Error body shape

{
  "detail": "Invalid or expired token"
}

For OpenAI-compatible endpoints, Pendra also returns the OpenAI-shaped error envelope when the worker produces one — e.g. { "error": { "message": "...", "type": "..." } }.
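
Because two body shapes are possible, client code may want a single helper that reads either one. The following is a sketch under the assumption that the body is the raw response text; `error_message` is a hypothetical name, and the fallback to the raw body is a design choice rather than documented behaviour.

```python
import json


def error_message(body: str) -> str:
    """Extract a readable message from either error shape, else return the raw body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return body  # not JSON at all; show the raw text
    if not isinstance(data, dict):
        return body
    # OpenAI-shaped envelope: {"error": {"message": "...", "type": "..."}}
    err = data.get("error")
    if isinstance(err, dict) and isinstance(err.get("message"), str):
        return err["message"]
    # Default shape: {"detail": "..."}
    detail = data.get("detail")
    if isinstance(detail, str):
        return detail
    return body
```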

Helpful response headers

Header	On which responses	Use
`Retry-After`	`429` and `503` responses	How many seconds to wait before retrying. Honour it instead of retrying immediately.
`X-Request-Id`	Non-streaming inference responses (chat, embeddings, images, audio)	Quote this on a support ticket so we can look up the exact request.
`X-Worker-Id`	Non-streaming inference responses	Identifies which GPU worker served the request.
`X-Worker-Name`	Non-streaming inference responses	Human-readable worker name from the console.
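
To make these headers easy to attach to a log line or support ticket, a client could collect whichever of them are present. This is a minimal sketch assuming the headers arrive as a plain mapping of names to values; `SUPPORT_HEADERS` and `support_context` are hypothetical names. Header names are matched case-insensitively, since HTTP libraries differ in how they normalise them.

```python
from typing import Dict, Mapping

# Diagnostic headers listed in the table above.
SUPPORT_HEADERS = ("X-Request-Id", "X-Worker-Id", "X-Worker-Name")


def support_context(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return the diagnostic headers that are present, keyed by canonical name."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        name: lowered[name.lower()]
        for name in SUPPORT_HEADERS
        if name.lower() in lowered
    }
```

For example, `support_context({"x-request-id": "req_123"})` returns `{"X-Request-Id": "req_123"}`, where `req_123` is a made-up value.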

Rate limits

Pendra applies per-organisation rate limits scaled to your subscription plan. The default is generous and covers most production workloads. When you hit the limit you'll get a 429 with a detail message describing the bucket; back off with jitter and retry.

A 429 can also mean every worker serving your model is at capacity right now. Pendra holds the request briefly (up to ~30 seconds) waiting for a slot to free up, so short bursts usually go through without you noticing; you only see the 429 when the squeeze lasts longer than that. Either way, the Retry-After header, when present, tells you how long to wait. If you self-host, adding workers relieves the capacity squeeze.
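
The retry guidance above can be sketched as follows: wait for `Retry-After` when the header is present, otherwise fall back to exponential backoff with full jitter. Everything here is illustrative rather than a Pendra client: `send` is a hypothetical callable returning a status code and a header mapping with lowercase names, and the 1-second base, 30-second cap, and five attempts are assumed defaults.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status, headers with lowercase names)


def retry_delay(headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 30.0,
                rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before the next attempt (attempt counts from 0)."""
    raw = headers.get("retry-after")
    if raw is not None:
        try:
            return max(0.0, float(raw))  # the header is documented as seconds
        except ValueError:
            pass  # unparseable value: fall through to backoff
    # Full jitter: a random delay between 0 and the capped exponential step.
    return rng() * min(cap, base * 2 ** attempt)


def send_with_retries(send: Callable[[], Response], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it returns something other than 429/503, or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts - 1):
        status, headers = send()
        if status not in (429, 503):
            return status, headers
        sleep(retry_delay(headers, attempt))
    return send()  # final attempt: return whatever comes back
```

Passing `sleep` and `rng` as parameters keeps the helper easy to test without real delays; in production the defaults apply.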

If you expect to sustain very high throughput, contact sales to discuss a dedicated plan.

Idempotency & retries

All inference endpoints are safe to retry. Chat completions and image generation are non-deterministic, so a retry produces a fresh sample rather than a copy of the earlier result. Duplicate billing is not a concern, because a request is billed only when it succeeds.