System requirements
The Pendra daemon is a pure-Go static binary with no runtime dependencies. It runs on every major desktop and server OS; the bottleneck is whatever inference backend you point it at.
Operating system support
| Capability | macOS | Windows | Linux server | Linux desktop |
|---|---|---|---|---|
| Headless daemon (pendra) | ✓ | ✓ | ✓ | ✓ |
| Menu-bar / tray GUI | ✓ Pendra.app | ✓ PendraSetup.exe | — | planned |
| Distribution format | .dmg / .app.tar.gz | PendraSetup.exe | .deb / .rpm / Docker | .deb / .rpm |
| One-line installer | ✓ | (downloads .exe) | ✓ | ✓ |
| Native service manager | launchd | Windows SCM | systemd | systemd |
| Auto-launch at login | LaunchAgent | HKCU\Run\PendraGui | (servers use systemd) | XDG autostart |
| Self-update | Ed25519-signed appcast | Ed25519-signed appcast | Ed25519-signed appcast | Ed25519-signed appcast |
| Code signing | Apple Developer ID + notarised | Authenticode (Azure Key Vault) | n/a | n/a |
CPU & runtime
- Linux server builds are compiled with CGO_ENABLED=0: a static binary that works on both glibc and musl.
- The macOS and Windows GUIs are built with CGO (for the system-tray library); the worker helper inside the bundle is still pure Go.
- Memory footprint of the daemon itself is small — <50 MB resident in normal use.
GPU & inference hardware
The Pendra worker ships with a built-in inference backend (the Pendra backend), and can also proxy to external backends — Ollama, vLLM, LM Studio, or Speaches — when you add them. Hardware requirements are dictated by whichever backend serves the model, not by Pendra itself.
Rough guidance:
- NVIDIA with a recent CUDA driver works across every backend, on RTX 30/40-series consumer cards and A/H/L data-centre cards alike. The Pendra backend ships CUDA builds for these.
- Apple Silicon works well via the Pendra backend (Metal), Ollama, and LM Studio. M-series unified memory eliminates the host-to-GPU copy.
- AMD ROCm works via Ollama on supported cards. The Pendra backend uses Vulkan on AMD.
- CPU-only is supported but slow: fine for embeddings and small chat models, not for 70B-class models.
Backend capability matrix
See the dedicated Backend capability matrix for which backends Pendra can talk to, what each supports, and which ones the curated catalogue can install models into.
Disk space
The daemon binary is ~30 MB. Models live in the backend's own
directory (Ollama uses ~/.ollama, LM Studio uses
~/.lmstudio, etc.) — those are what consume disk. Modern
quantised chat models are 4–80 GB each; budget accordingly.
Network
The daemon needs outbound HTTPS/WebSocket on port 443 to
api.pendra.ai. No inbound ports need to be open. The
self-updater also fetches from get.pendra.ai.
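As a rough pre-flight check, the Go sketch below tests whether an outbound TCP connection can be opened to the two hostnames named above. It is not part of Pendra: the names requiredHosts, tcpProbe, and unreachable are hypothetical, and port 443 for get.pendra.ai is an assumption (the text states the port only for api.pendra.ai). A successful TCP connect also does not prove that TLS or WebSocket traffic survives a corporate proxy.

```go
package main

import (
	"net"
	"time"
)

// requiredHosts is a hypothetical list built from the hostnames in this
// section. Port 443 for get.pendra.ai is an assumption.
var requiredHosts = []string{"api.pendra.ai:443", "get.pendra.ai:443"}

// tcpProbe returns nil if an outbound TCP connection to address can be
// opened within three seconds, and the dial or close error otherwise.
func tcpProbe(address string) error {
	conn, err := net.DialTimeout("tcp", address, 3*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

// unreachable returns every address for which probe reported an error.
// The probe is passed in so it can be swapped for a stub in tests.
func unreachable(addresses []string, probe func(string) error) []string {
	var failed []string
	for _, a := range addresses {
		if err := probe(a); err != nil {
			failed = append(failed, a)
		}
	}
	return failed
}
```

Calling unreachable(requiredHosts, tcpProbe) from a small main function returns the hosts that could not be reached; an empty result suggests the firewall allows the outbound connections the daemon needs.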
Concurrency tuning
The worker defaults to serving one inference request at a time. That
matches how most self-hosted GPUs are sized — a single in-flight
generation gets the full memory bandwidth, and consumer cards don't OOM
under parallel pressure. The knob is not exposed in
pendra setup or the console.
If your hardware has clear headroom for parallel generations (modern
data-centre GPUs, multi-GPU machines), set max_concurrent
in ~/.pendra/config.yaml or export
MAX_CONCURRENT. Accepted values are 1 to 64;
anything outside that range falls back to the default of 1.
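The range rule can be sketched in Go as follows. This is an illustration of the documented behaviour, not Pendra's actual code: resolveMaxConcurrent and the constants are hypothetical names, and the sketch assumes that non-numeric input is treated the same way as an out-of-range number. It takes the raw string (for example the value of MAX_CONCURRENT) and says nothing about precedence between the environment variable and config.yaml, which this section does not specify.

```go
package main

import (
	"strconv"
	"strings"
)

const (
	defaultMaxConcurrent = 1  // documented default: one request at a time
	minMaxConcurrent     = 1  // lower bound of the accepted range
	maxMaxConcurrent     = 64 // upper bound of the accepted range
)

// resolveMaxConcurrent is a hypothetical helper. It parses raw as an
// integer and returns it when it lies in [1, 64]. Anything else, including
// empty or non-numeric input (an assumption), yields the default.
func resolveMaxConcurrent(raw string) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n < minMaxConcurrent || n > maxMaxConcurrent {
		return defaultMaxConcurrent
	}
	return n
}
```

Under this reading, a value of 8 is used as-is, while 0, 65, or an empty string all result in a single in-flight request.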
Local control channel
The CLI (pendra status, pendra doctor) and the
menu-bar GUI talk to the running daemon over a local OS-level channel —
a Unix socket at ~/.pendra/pendra.sock on macOS / Linux, or
a named pipe \\.\pipe\pendra on Windows. Both are
permission-restricted to the current user, so no tokens or shared
secrets cross the wire.
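To make the per-platform endpoints concrete, here is a minimal Go sketch that picks the channel address and checks whether anything is listening on the Unix socket. The function names controlEndpoint and daemonListening are hypothetical, and treating every non-Windows OS as using the Unix socket is an assumption (the text names only macOS and Linux). The wire protocol spoken over the channel is not documented here, so the sketch only connects and disconnects. Dialing a Windows named pipe needs platform-specific code and is left out.

```go
package main

import (
	"net"
	"path"
	"time"
)

// controlEndpoint returns the kind of channel ("unix" or "pipe") and its
// address for a given GOOS value and home directory. Hypothetical helper:
// any OS other than "windows" is assumed to use the Unix socket.
func controlEndpoint(goos, home string) (kind, address string) {
	if goos == "windows" {
		return "pipe", `\\.\pipe\pendra`
	}
	return "unix", path.Join(home, ".pendra", "pendra.sock")
}

// daemonListening reports whether a process accepts connections on the
// Unix socket at socketPath. It sends no data over the connection.
func daemonListening(socketPath string) bool {
	conn, err := net.DialTimeout("unix", socketPath, time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
```

A caller would typically pass runtime.GOOS and the result of os.UserHomeDir() to controlEndpoint, then hand the returned address to daemonListening on macOS or Linux.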