Pendra Workers
Run open-weight LLMs on your own GPU infrastructure without changing a line of code. One install command, the same Pendra API, and inference happens entirely on your hardware.
Two Ways to Deploy
Managed Infrastructure
Pendra hosts and manages the GPU clusters. You call the API, we handle everything else. Best for teams that want zero operational overhead.
Self-Hosted Workers
Install a Pendra Worker on your own GPUs. Inference runs entirely on your hardware with secure routing via the Pendra API. Best for teams with strict data residency requirements.
Both options use the same API, the same SDKs, and the same dashboard. You can run managed and self-hosted workers side by side — route sensitive workloads to on-premises hardware and everything else to Pendra's managed fleet.
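Because managed and self-hosted workers sit behind one API, the request your application builds is identical in both cases. The sketch below illustrates this, assuming an OpenAI-style chat-completions payload; the endpoint path, header names, and model name are illustrative assumptions, not the documented Pendra API.

```python
# Hypothetical sketch: the request is the same whether a managed or
# self-hosted worker serves it — routing is decided server-side.
# Endpoint, headers, and model name below are assumptions.

PENDRA_API_URL = "https://api.pendra.ai/v1/chat/completions"  # assumed path

def build_chat_request(api_key: str, model: str, prompt: str) -> dict:
    """Assemble an HTTP chat-completion request.

    Application code does not change based on where inference runs;
    only server-side routing differs.
    """
    return {
        "url": PENDRA_API_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# Identical calls; an org-level routing rule may send the first to
# on-prem hardware and the second to the managed fleet.
sensitive = build_chat_request("sk-...", "llama-3-70b",
                               "Summarise this patient record")
general = build_chat_request("sk-...", "llama-3-70b",
                             "Draft a customer support reply")
```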
Why Self-Host
For organisations handling the most sensitive data — patient records, classified material, privileged legal communications — even sovereign managed infrastructure may not satisfy every compliance requirement. Self-hosted workers close that gap entirely.
Security & Compliance Benefits
- Prompts and completions are processed entirely on your hardware — never sent to an external endpoint
- Secure, transient routing via the Pendra API — no persistent data paths
- Centralised load balancing with local compute — scale across your own GPU fleet
- Deploy and configure workers in minutes with a single install command
- Unified oversight of access, usage, and routing from the Pendra dashboard
Self-hosted workers give you full control over where inference happens, with the convenience of centralised API management, load balancing, and usage tracking through Pendra.
How It Works
A Pendra Worker is a lightweight runtime that runs inference on your hardware and connects back to the Pendra control plane for model management, API key validation, and usage tracking. The control plane never sees your prompts or completions.
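The separation described above means the worker reports only operational metadata to the control plane — which models it serves, capacity, and usage counters. A minimal sketch of what such a metadata-only message might contain (field names and message shape are invented for illustration, not the actual worker protocol):

```python
# Illustrative sketch of the control-plane separation: the worker sends
# operational metadata only — prompt and completion text never leave
# the host. All field names here are assumptions.

import time

def build_heartbeat(worker_id: str, models: list[str],
                    tokens_served: int, gpu_util: float) -> dict:
    """Metadata-only status message for the control plane."""
    return {
        "worker_id": worker_id,
        "models": models,                # models this worker can serve
        "tokens_served": tokens_served,  # aggregate usage for tracking
        "gpu_utilization": gpu_util,     # load-balancing signal
        "timestamp": int(time.time()),
        # Deliberately absent: any prompt or completion content.
    }
```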
```shell
# One command to install and register your worker
curl -fsSL https://get.pendra.ai/worker | bash
```

The installer checks prerequisites, prompts for your worker private key (from the Pendra dashboard), lets you choose a backend and models, then pulls images and starts the stack. The entire process takes under two minutes.
Requirements
| Component | Minimum |
|---|---|
| Docker | Docker Engine + Compose v2 |
| GPU | NVIDIA GPU recommended for production workloads (CPU-only mode available for testing) |
| OS | Linux or macOS — any system that runs Docker |
| Network | Outbound WSS to api.pendra.ai for secure routing and load balancing |
Larger models benefit from multi-GPU setups. The worker automatically detects and uses all available GPUs on the host via the NVIDIA Container Toolkit.
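How a runtime can enumerate the GPUs the NVIDIA driver exposes is sketched below, using the real `nvidia-smi --query-gpu` CSV output format. This is an illustration of the detection idea, not the worker's actual implementation.

```python
# Sketch of host GPU enumeration via nvidia-smi's CSV query output.
# (How the Pendra Worker itself detects GPUs is not documented here.)

import subprocess

def parse_gpu_csv(csv_text: str) -> list[dict]:
    """Parse lines like '0, NVIDIA A100-SXM4-80GB, 81920' as produced by
    `nvidia-smi --query-gpu=index,name,memory.total
     --format=csv,noheader,nounits`."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, mem_mib = (field.strip() for field in line.split(",", 2))
        gpus.append({"index": int(index), "name": name,
                     "memory_mib": int(mem_mib)})
    return gpus

def detect_gpus() -> list[dict]:
    """Query the NVIDIA driver; returns [] when no GPU is available
    (the CPU-only testing mode mentioned in the requirements table)."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)
```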
Hybrid Deployments
Most organisations don't need to self-host everything. Pendra supports hybrid routing so you can direct traffic based on sensitivity, cost, or performance requirements.
Example Routing
- On-prem: Patient record summarisation, classified document analysis, privileged legal review
- Managed: Internal knowledge bases, customer support drafts, code generation, general-purpose tasks
Routing is configured at the organisation level through the Pendra dashboard. Your application code stays exactly the same regardless of where inference runs.
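The routing model above can be sketched as a simple tag-to-pool policy. The tag names and policy structure below are invented for illustration; in practice this lives in org-level dashboard configuration, not in application code.

```python
# Hedged sketch of tag-based hybrid routing. Tags and pool names are
# illustrative assumptions — actual routing is configured in the
# Pendra dashboard, not in code.

ROUTING_POLICY = {
    # sensitivity tag -> worker pool
    "phi": "on-prem",               # patient records
    "classified": "on-prem",        # classified document analysis
    "legal-privileged": "on-prem",  # privileged legal review
    "default": "managed",           # everything else
}

def select_pool(workload_tag: str, policy: dict = ROUTING_POLICY) -> str:
    """Resolve a workload tag to a worker pool, falling back to the
    managed fleet for untagged or general-purpose traffic."""
    return policy.get(workload_tag, policy["default"])
```

The application never branches on this: it sends the same request either way, and the policy decides where inference runs.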
Ready to deploy on your infrastructure?
Get in touch and we'll help you plan a deployment that fits your security and compliance requirements.