Worker
Run inference on your own GPUs
Install a single binary or Docker container. Your prompts and completions stay on your hardware — routed through the same API, the same SDKs, the same console.
Why self-host
Full data control
Prompts and completions are processed entirely on your hardware. Requests route through the Pendra API, but inference never leaves your environment.
Unified orchestration
Centralised load balancing across your GPU fleet. One API, one console, one set of SDKs.
Hybrid ready
Run managed and self-hosted workers side by side. Route sensitive workloads to your on-premises workers and everything else to Pendra.
Installation
Install with the native package for your OS or run as a Docker container. Open the Pendra console → Workers → Add worker for an OS-aware download with checksums, or grab the file straight from get.pendra.ai/worker/archives/latest/. Full per-OS instructions live in Workers → Install.
# macOS (Apple Silicon)
# Drag Pendra.app into /Applications, then launch it
open Pendra-<v>-arm64.dmg
# Windows — run the signed installer, no UAC needed
PendraSetup-<v>.exe
# Linux — pick CPU / CUDA / Vulkan to match your GPU
sudo apt install ./pendra-cuda_<v>_linux_amd64.deb
sudo pendra setup # writes /var/lib/pendra/config.yaml and restarts pendra.service
Each installer registers the OS service (LaunchAgent on macOS, Run key on Windows, systemd on Linux), so the worker comes back up after a reboot. macOS builds are Apple Silicon only; Intel Macs are no longer supported because the in-process Metal path needs Apple Silicon. Linux ships three build variants (CPU baseline, CUDA, Vulkan) for both amd64 and arm64.
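The console download is published with checksums, so it is worth verifying an installer before running it. The sketch below is one way to do that, not Pendra's own tooling. It assumes the published checksum is a SHA-256 hex digest (the page does not name the algorithm), and the function names are illustrative.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published one.

    Assumption: the published value is SHA-256 in hex. Surrounding whitespace
    and letter case in the published value are ignored.
    """
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Call `matches_checksum` with the path of the downloaded package and the checksum string copied from the console, and only proceed with the install if it returns `True`.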
CLI reference
The pendra CLI manages your worker. Config is stored in ~/.pendra/config.yaml. The full env-var reference is at Workers → Configuration.
| Command | Description |
|---|---|
| pendra setup | Interactive setup wizard: enter your key, discover backends, save config |
| pendra models install <model> | Pull a catalogue model into the worker's inference backend |
| pendra run | Start the worker and begin serving inference requests |
| pendra models | List all models available on your configured backends |
| pendra status | Show connection status, backend health, and active models |
| pendra config | View resolved configuration (env + file + defaults) |
| pendra config set KEY VAL | Set a configuration value in ~/.pendra/config.yaml |
| pendra logs | Tail the worker's log buffer; -f follows. Use systemctl status pendra / launchctl print to manage the OS-supervised service installed by the .deb / .rpm / .dmg / .exe package. |
| pendra update | Self-update to the latest version |
| pendra version | Show version, Go version, and platform |
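The table describes `pendra config` as showing the configuration resolved from env, file, and defaults. The page does not state the precedence or the variable naming, so the following is only a sketch of how such layering commonly works. It assumes environment variables override the file, which overrides defaults. The `PENDRA_` prefix and the keys `log_level` and `backend` are hypothetical.

```python
import os

# Hypothetical defaults; the real worker's keys are not listed on this page.
DEFAULTS = {"log_level": "info", "backend": "pendra"}


def resolve_config(file_values, environ=None, prefix="PENDRA_", defaults=DEFAULTS):
    """Merge three layers, lowest priority first: defaults, file, environment.

    Assumption: env beats file beats defaults. Environment variables are
    matched by a hypothetical prefix and mapped to lower-case keys.
    """
    environ = os.environ if environ is None else environ
    resolved = dict(defaults)
    resolved.update(file_values)
    for name, value in environ.items():
        if name.startswith(prefix):
            resolved[name[len(prefix):].lower()] = value
    return resolved
```

With a file value of `log_level: debug` and `PENDRA_BACKEND=ollama` in the environment, this sketch resolves `log_level` from the file and `backend` from the environment. Check Workers → Configuration for the real variable names and order.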
Inference backends
Every Pendra worker ships with the Pendra backend built in — it serves catalogue chat models directly. You don't need to install anything else to start serving chat completions.
You can optionally connect external backends on the same machine for capabilities the Pendra backend doesn't cover today — image generation (Ollama) and audio transcription (Speaches). The worker auto-discovers them on startup. See the backend capability matrix for what each supports.
Pendra Built-in
The default. Bundled with the worker — no external service to install. Chat models from the catalogue install with one click.
Ollama
Optional external backend. Supports chat, embedding, and image models. Full curated install/uninstall.
vLLM
Optional external backend. High-throughput serving for HuggingFace models. Continuous batching, PagedAttention.
Requirements
| Component | Requirement |
|---|---|
| OS | Linux (x86_64, arm64), macOS (Apple Silicon only), or Windows (amd64). Full matrix: system requirements. |
| GPU | NVIDIA recommended for production; the Pendra backend ships CUDA, Metal, and Vulkan builds. AMD ROCm is supported via Ollama. CPU-only mode is supported for testing. |
| Backend | The Pendra backend is built in — no separate install. Optionally add Ollama, vLLM, LM Studio, or Speaches on the same machine. |
| Network | Outbound WSS to api.pendra.ai. No inbound ports needed. |
| Docker | Only needed if running the worker as a container. Not required for binary install. |
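The network requirement above is outbound-only, so a quick pre-install check is to confirm that the host can open a TCP connection to the API endpoint. This is a minimal sketch rather than a Pendra tool. It assumes WSS is served on the standard TLS port 443, which the table does not state, and it tests TCP reachability only, not the WebSocket handshake.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if an outbound TCP connection to host:port succeeds.

    Assumption: port 443 is the standard port for WSS. DNS failures,
    refused connections, and timeouts all count as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_reach("api.pendra.ai")` on the worker host gives a first indication of whether a firewall or proxy blocks the outbound connection. A `False` result usually points at egress rules or DNS rather than at the worker itself.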
Hybrid deployments
Route traffic based on sensitivity, cost, or performance. Your application code stays the same regardless of where inference runs.
Self-hosted
Patient record summarisation, classified document analysis, privileged legal review.
Pendra-managed
Internal knowledge bases, customer support drafts, code generation, general-purpose tasks.
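The split above can be pictured as a small routing rule keyed on workload sensitivity. The page does not describe how routing is actually configured in Pendra, so everything in this sketch is hypothetical: the tag names, the `Workload` type, and the two target labels are illustrative only.

```python
from dataclasses import dataclass

SELF_HOSTED = "self-hosted"
MANAGED = "pendra-managed"

# Hypothetical sensitivity tags, loosely based on the examples listed above.
SENSITIVE_TAGS = frozenset({"patient-records", "classified", "privileged-legal"})


@dataclass(frozen=True)
class Workload:
    name: str
    tags: frozenset


def choose_target(workload):
    """Send any workload carrying a sensitive tag to self-hosted workers."""
    if workload.tags & SENSITIVE_TAGS:
        return SELF_HOSTED
    return MANAGED
```

A workload tagged `patient-records` would go to self-hosted workers under this rule, while an untagged support draft would go to Pendra-managed workers. The calling code stays the same in both cases.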
Want us to handle it instead?
Let us run your workers
If you'd rather not manage your own GPUs, we can run dedicated workers for you on Pendra-managed infrastructure. Same API, zero operational overhead.