
System requirements

The Pendra daemon is a pure-Go static binary with no runtime dependencies. It runs on every major desktop and server OS; the bottleneck is whatever inference backend you point it at.

Operating system support

Capability               | macOS                          | Windows                        | Linux server           | Linux desktop
Headless daemon (pendra) | yes                            | yes                            | yes                    | yes
Menu-bar / tray GUI      | Pendra.app                     | PendraSetup.exe                |                        | planned
Distribution format      | .dmg / .app.tar.gz             | PendraSetup.exe                | .deb / .rpm / Docker   | .deb / .rpm
One-line installer       |                                | (downloads .exe)               |                        |
Native service manager   | launchd                        | Windows SCM                    | systemd                | systemd
Auto-launch at login     | LaunchAgent                    | HKCU\Run\PendraGui             | (servers use systemd)  | XDG autostart
Self-update              | Ed25519-signed appcast         | Ed25519-signed appcast         | Ed25519-signed appcast | Ed25519-signed appcast
Code signing             | Apple Developer ID + notarised | Authenticode (Azure Key Vault) | n/a                    | n/a

CPU & runtime

  • Linux server builds are compiled with CGO_ENABLED=0, producing a static binary that runs on both glibc and musl systems.
  • The macOS and Windows GUIs are built with cgo (required by the system-tray library); the worker helper inside the bundle is still pure Go.
  • The daemon's own memory footprint is small: under 50 MB resident in normal use.

GPU & inference hardware

The Pendra worker ships with a built-in inference backend (the Pendra backend), and can also proxy to external backends — Ollama, vLLM, LM Studio, or Speaches — when you add them. Hardware requirements are dictated by whichever backend serves the model, not by Pendra itself.

Rough guidance:

  • NVIDIA with a recent CUDA driver works across every backend. RTX 30/40-series consumer cards or A/H/L data-centre cards all work. The Pendra backend ships CUDA builds for these.
  • Apple Silicon works well via the Pendra backend (Metal), Ollama, and LM Studio. M-series unified memory eliminates the host-to-GPU copy.
  • AMD ROCm works via Ollama on supported cards. The Pendra backend uses Vulkan on AMD.
  • CPU-only is supported but slow: fine for embeddings and small chat models, not for 70B-class models.
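
As a rough way to see why model size drives the hardware choice, the weights alone need about parameters × bits per weight ÷ 8 bytes. The Go sketch below applies that arithmetic. approxWeightGB is a hypothetical helper, and the estimate deliberately ignores KV cache, activations, and per-format quantisation overhead, so treat the result as a lower bound rather than a sizing guarantee.

```go
package main

import "fmt"

// approxWeightGB estimates memory for model weights only, in decimal GB.
// paramsBillions * 1e9 weights * bitsPerWeight / 8 bytes, divided by 1e9,
// simplifies to paramsBillions * bitsPerWeight / 8.
func approxWeightGB(paramsBillions, bitsPerWeight float64) float64 {
	return paramsBillions * bitsPerWeight / 8
}

func main() {
	// Illustrative model sizes and precisions, chosen for the example.
	cases := []struct {
		name   string
		params float64 // billions of parameters
		bits   float64 // bits per weight
	}{
		{"8B at 4-bit", 8, 4},
		{"70B at 4-bit", 70, 4},
		{"70B at 16-bit", 70, 16},
	}
	for _, c := range cases {
		fmt.Printf("%-14s ~%.0f GB for weights alone\n", c.name, approxWeightGB(c.params, c.bits))
	}
}
```

By this estimate a 70B-class model needs roughly 35 GB at 4-bit before any working memory, which is why it is out of reach for CPU-only hosts and most single consumer cards.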

Backend capability matrix

For which backends Pendra can talk to, what each supports, and which ones the curated catalogue can install models into, see the dedicated Backend capability matrix.

Disk space

The daemon binary is ~30 MB. Models live in the backend's own directory (Ollama uses ~/.ollama, LM Studio uses ~/.lmstudio, etc.) — those are what consume disk. Modern quantised chat models are 4–80 GB each; budget accordingly.

Network

The daemon needs outbound HTTPS/WebSocket on port 443 to api.pendra.ai. No inbound ports need to be open. The self-updater also fetches from get.pendra.ai.

Concurrency tuning

The worker defaults to serving one inference request at a time. That matches how most self-hosted GPUs are sized: a single in-flight generation gets the full memory bandwidth, and consumer cards are not pushed into out-of-memory (OOM) errors by parallel requests. The concurrency limit is not exposed in pendra setup or the console.

If your hardware has clear headroom for parallel generations (modern data-centre GPUs, multi-GPU machines), set max_concurrent in ~/.pendra/config.yaml or export MAX_CONCURRENT. Accepted values are 1 to 64; anything outside that range falls back to the default of 1.
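
The validation rule can be sketched in Go as follows. This is an illustration rather than Pendra's actual code: resolveMaxConcurrent and the constants are hypothetical names, and the sketch assumes that an unparsable or out-of-range value falls back to the default of one request rather than being pinned to the nearest bound.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	defaultMaxConcurrent = 1  // one in-flight request, the documented default
	minMaxConcurrent     = 1  // lower bound of the accepted range
	maxMaxConcurrent     = 64 // upper bound of the accepted range
)

// resolveMaxConcurrent turns a raw setting (for example the value of
// MAX_CONCURRENT) into an effective limit. Empty, non-numeric, and
// out-of-range inputs all yield the default.
func resolveMaxConcurrent(raw string) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n < minMaxConcurrent || n > maxMaxConcurrent {
		return defaultMaxConcurrent
	}
	return n
}

func main() {
	for _, raw := range []string{"4", "64", "0", "128", "", "lots"} {
		fmt.Printf("%q -> %d\n", raw, resolveMaxConcurrent(raw))
	}
}
```

Under this assumption, a setting of 128 yields an effective limit of 1, not 64, so it is worth confirming the effective value after changing the setting.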

Local control channel

The CLI (pendra status, pendra doctor) and the menu-bar GUI talk to the running daemon over a local OS-level channel — a Unix socket at ~/.pendra/pendra.sock on macOS / Linux, or a named pipe \\.\pipe\pendra on Windows. Both are permission-restricted to the current user, so no tokens or shared secrets cross the wire.