Pendra Workers
Run open-weight LLMs on your own GPU infrastructure without changing a line of code. One install command, the same Pendra API, and inference happens entirely on your hardware.
Two Ways to Deploy
Managed Infrastructure
Pendra hosts and manages the GPU clusters. You call the API, we handle everything else. Best for teams that want zero operational overhead.
Self-Hosted Workers
Install a Pendra Worker on your own GPUs. Inference runs entirely on your hardware with secure routing via the Pendra API. Best for teams with strict data residency requirements.
Both options use the same API, the same SDKs, and the same dashboard. You can run managed and self-hosted workers side by side — route sensitive workloads to on-premises hardware and everything else to Pendra's managed fleet.
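Because managed and self-hosted workers sit behind one API, the request your application builds is identical in both cases. The sketch below illustrates this, assuming an OpenAI-style chat-completions payload; the endpoint path, header names, and model name are illustrative assumptions, not the documented Pendra API.

```python
# Hypothetical sketch: the request is the same whether a managed or
# self-hosted worker serves it — routing is decided server-side.
# Endpoint, headers, and model name below are assumptions.

PENDRA_API_URL = "https://api.pendra.ai/v1/chat/completions"  # assumed path

def build_chat_request(api_key: str, model: str, prompt: str) -> dict:
    """Assemble an HTTP chat-completion request.

    Application code does not change based on where inference runs;
    only server-side routing differs.
    """
    return {
        "url": PENDRA_API_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# Identical calls; an org-level routing rule may send the first to
# on-prem hardware and the second to the managed fleet.
sensitive = build_chat_request("sk-...", "llama-3-70b",
                               "Summarise this patient record")
general = build_chat_request("sk-...", "llama-3-70b",
                             "Draft a customer support reply")
```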
Why Self-Host
For organisations handling the most sensitive data — patient records, classified material, privileged legal communications — even sovereign managed infrastructure may not satisfy every compliance requirement. Self-hosted workers close that gap entirely.
Security & Compliance Benefits
- Prompts and completions are processed entirely on your hardware — never sent to an external endpoint
- Secure, transient routing via the Pendra API — no persistent data paths
- Centralised load balancing with local compute — scale across your own GPU fleet
- Deploy and configure workers in minutes with a single install command
- Unified oversight of access, usage, and routing from the Pendra dashboard
Self-hosted workers give you full control over where inference happens, with the convenience of centralised API management, load balancing, and usage tracking through Pendra.
How It Works
A Pendra Worker is a lightweight runtime that runs inference on your hardware and connects back to the Pendra control plane for model management, API key validation, and usage tracking. The control plane never sees your prompts or completions.
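The separation described above means the worker reports only operational metadata to the control plane — which models it serves, capacity, and usage counters. A minimal sketch of what such a metadata-only message might contain (field names and message shape are invented for illustration, not the actual worker protocol):

```python
# Illustrative sketch of the control-plane separation: the worker sends
# operational metadata only — prompt and completion text never leave
# the host. All field names here are assumptions.

import time

def build_heartbeat(worker_id: str, models: list[str],
                    tokens_served: int, gpu_util: float) -> dict:
    """Metadata-only status message for the control plane."""
    return {
        "worker_id": worker_id,
        "models": models,                # models this worker can serve
        "tokens_served": tokens_served,  # aggregate usage for tracking
        "gpu_utilization": gpu_util,     # load-balancing signal
        "timestamp": int(time.time()),
        # Deliberately absent: any prompt or completion content.
    }
```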
```shell
# One command to install and register your worker
curl -fsSL https://get.pendra.ai/worker | bash
```

The installer checks prerequisites, prompts for your worker private key (from the Pendra dashboard), lets you choose a backend and models, then pulls images and starts the stack. The entire process takes under two minutes.
Requirements
| Component | Minimum |
|---|---|
| Docker | Docker Engine + Compose v2 |
| GPU | NVIDIA GPU recommended for production workloads (CPU-only mode available for testing) |
| OS | Linux or macOS — any system that runs Docker |
| Network | Outbound WSS to api.pendra.ai for secure routing and load balancing |
Larger models benefit from multi-GPU setups. The worker automatically detects and uses all available GPUs on the host via the NVIDIA Container Toolkit.
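How a runtime can enumerate the GPUs the NVIDIA driver exposes is sketched below, using the real `nvidia-smi --query-gpu` CSV output format. This is an illustration of the detection idea, not the worker's actual implementation.

```python
# Sketch of host GPU enumeration via nvidia-smi's CSV query output.
# (How the Pendra Worker itself detects GPUs is not documented here.)

import subprocess

def parse_gpu_csv(csv_text: str) -> list[dict]:
    """Parse lines like '0, NVIDIA A100-SXM4-80GB, 81920' as produced by
    `nvidia-smi --query-gpu=index,name,memory.total
     --format=csv,noheader,nounits`."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, mem_mib = (field.strip() for field in line.split(",", 2))
        gpus.append({"index": int(index), "name": name,
                     "memory_mib": int(mem_mib)})
    return gpus

def detect_gpus() -> list[dict]:
    """Query the NVIDIA driver; returns [] when no GPU is available
    (the CPU-only testing mode mentioned in the requirements table)."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)
```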
Hybrid Deployments
Most organisations don't need to self-host everything. Pendra supports hybrid routing so you can direct traffic based on sensitivity, cost, or performance requirements.
Example Routing
- On-prem: Patient record summarisation, classified document analysis, privileged legal review
- Managed: Internal knowledge bases, customer support drafts, code generation, general-purpose tasks
Routing is configured at the organisation level through the Pendra dashboard. Your application code stays exactly the same regardless of where inference runs.
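The routing model above can be sketched as a simple tag-to-pool policy. The tag names and policy structure below are invented for illustration; in practice this lives in org-level dashboard configuration, not in application code.

```python
# Hedged sketch of tag-based hybrid routing. Tags and pool names are
# illustrative assumptions — actual routing is configured in the
# Pendra dashboard, not in code.

ROUTING_POLICY = {
    # sensitivity tag -> worker pool
    "phi": "on-prem",               # patient records
    "classified": "on-prem",        # classified document analysis
    "legal-privileged": "on-prem",  # privileged legal review
    "default": "managed",           # everything else
}

def select_pool(workload_tag: str, policy: dict = ROUTING_POLICY) -> str:
    """Resolve a workload tag to a worker pool, falling back to the
    managed fleet for untagged or general-purpose traffic."""
    return policy.get(workload_tag, policy["default"])
```

The application never branches on this: it sends the same request either way, and the policy decides where inference runs.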
Ready to deploy on your infrastructure?
Get in touch and we'll help you plan a deployment that fits your security and compliance requirements.