System requirements

The Pendra daemon is a small static binary with no runtime dependencies. It runs on every major desktop and server OS.

Operating system support

| | macOS | Windows | Linux |
|---|---|---|---|
| Architectures | Apple Silicon | x86_64 | x86_64 / ARM64 |
| Download | `.dmg` | `PendraSetup.exe` | `apt` / `yum` repo, `.deb` / `.rpm`, or Docker |
| One-line installer | ✓ | (downloads the installer) | ✓ |
| Menu-bar / tray app | ✓ | ✓ | planned |
| Runs in the background | ✓ (starts at login) | ✓ (starts at login) | ✓ (`systemd`) |
| Updates | Automatic, signed | Automatic, signed | `apt` / `dnf upgrade` |
| Code-signed | ✓ Apple-notarised | ✓ Authenticode | — |

CPU & runtime

The pendra binary is a static build that runs anywhere (glibc or musl).
The in-process inference runtime needs glibc 2.38+ (Ubuntu 24.04+, Debian 13+). On older Linux it won't load, and pendra doctor flags it. NVIDIA hosts on Linux also need the CUDA 13 runtime installed: the .deb/.rpm packages bundle the CUDA inference libraries, not the runtime. The macOS and Windows installers bundle everything, including the CUDA runtime on Windows, so a clean Windows host with only the NVIDIA driver installed needs nothing extra.
The macOS and Windows tray GUIs use a native system-tray library; the worker helper inside the bundle is the same static binary.
Memory footprint of the daemon itself is small — <50 MB resident in normal use.
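
As a rough pre-install check for the glibc requirement above, the sketch below compares the host's libc against the 2.38 minimum. It is an illustration only: the helper names are hypothetical, it uses Python's standard `platform.libc_ver()`, and it does not replace pendra doctor, which remains the authoritative check.

```python
import platform

MIN_GLIBC = (2, 38)  # minimum stated above for the in-process inference runtime


def parse_version(text):
    """Turn '2.39' into (2, 39); return None if text is not a dotted number."""
    try:
        return tuple(int(part) for part in text.split("."))
    except ValueError:
        return None


def glibc_supports_inference(lib, version, minimum=MIN_GLIBC):
    """True only for glibc at or above the minimum; musl and unknown libcs fail."""
    parsed = parse_version(version)
    return lib == "glibc" and parsed is not None and parsed >= minimum


# libc_ver() returns e.g. ('glibc', '2.39'), or ('', '') when it cannot tell.
lib, version = platform.libc_ver()
ok = glibc_supports_inference(lib, version)
print(f"libc={lib or 'unknown'} version={version or '?'} inference_ok={ok}")
```

On macOS and Windows the check reports an unknown libc, which is expected: the glibc requirement applies to Linux only.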

GPU & inference hardware

The Pendra worker runs inference in-process and ships GPU builds for every major platform. Hardware requirements are dictated by the model you install — its size and quantization — not by Pendra itself.

Rough guidance:

NVIDIA is the production recommendation. RTX 30/40-series consumer cards or A/H/L data-centre cards all work, and Pendra ships CUDA builds for them. The host needs a working NVIDIA driver installed and loaded before the worker can use the GPU — verify with nvidia-smi, which should list your GPU and a CUDA version.
Apple Silicon works well (Metal). M-series unified memory eliminates the host-to-GPU copy, and nothing extra is needed — Metal is built in.
AMD / Intel are supported via Vulkan.
CPU-only is supported but slow: fine for embeddings and small chat models, not for 70B-class models.

An NVIDIA GPU host with no driver installed runs CPU-only, and does so silently. Because the GPU is optional, Pendra falls back to CPU rather than failing: without a working driver the worker can't see the card and runs every request on the CPU. Install the driver first if you expect GPU inference, then confirm with nvidia-smi. By platform:

Linux: install NVIDIA driver R580 or newer and the CUDA 13 runtime. The .deb/.rpm packages bundle the inference libraries but not the driver or runtime. See the install guide for the exact steps and the install-time check.
Windows: the installer bundles the CUDA runtime, so you only need the driver, but cloud Windows images (GCP and similar) often ship without one. Install the driver and reboot first.
macOS / Apple Silicon: nothing to install — Metal is built in.

After setup, pendra doctor reports whether the GPU backend actually loaded, so you can catch a CPU fallback before sending traffic.
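
The nvidia-smi check described above can also be scripted. The sketch below is a minimal illustration, not part of Pendra: the function names are hypothetical, and it relies only on nvidia-smi's standard `--query-gpu` and `--format=csv,noheader,nounits` options. It assumes GPU names contain no commas and that every card reports a numeric memory total.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 580  # Linux minimum stated above (R580 or newer)

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpus(csv_text):
    """Parse nvidia-smi CSV rows into dicts with name, driver and vram_mib."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, driver, vram = (field.strip() for field in line.split(","))
        gpus.append({"name": name, "driver": driver, "vram_mib": int(vram)})
    return gpus


def driver_ok(driver, minimum_major=MIN_DRIVER_MAJOR):
    """Compare only the major component, e.g. '580.65.06' -> 580."""
    return int(driver.split(".")[0]) >= minimum_major


def detect_gpus():
    """Return the parsed GPU list, or [] when nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    return parse_gpus(result.stdout) if result.returncode == 0 else []
```

An empty list from `detect_gpus()` on a machine that physically has an NVIDIA card is the silent CPU-fallback situation described above.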

If you install a model that's larger than the worker's GPUs can hold, the worker runs it partially offloaded — it keeps as many layers (and, for mixture-of-experts models, as many experts) on the GPU as fit and runs the rest on the CPU, rather than refusing the model. It still serves requests; it's just slower than a model that fits entirely in VRAM. The worker's detail page in the dashboard marks such a model "Partial offload" so you can spot it at a glance. For full speed, install the model on a worker with more GPU memory or pick a smaller quantization. (Image and vision models are the exception: they must fit on a single GPU, so a vision model that needs more than one card's memory is marked "Won't fit here" and isn't sent requests.)

On Apple Silicon and other unified-memory hosts — where the GPU and CPU share one pool of memory — moving layers to the CPU gives no relief, because it's the same memory either way. So a model whose full size would exceed the host's memory is marked "Won't fit here" and isn't sent requests, rather than loading and risking the machine's stability. To run it, pick a smaller model or quantization, or use a host with more memory.
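
The placement rules in the two paragraphs above can be summarised as a small decision function. This is a simplified sketch, not Pendra's actual logic: the function name and the "Fits" label are hypothetical, sizes are compared directly, and the extra memory needed for context and runtime overhead is ignored.

```python
def placement(model_gb, pools_gb, unified=False, vision=False):
    """Classify a model with the dashboard labels described above.

    pools_gb lists per-GPU VRAM on a discrete-GPU host, or holds the single
    shared pool on a unified-memory host.
    """
    if not pools_gb:
        raise ValueError("need at least one memory pool")
    if unified:
        # GPU and CPU share one pool, so moving layers to the CPU frees nothing.
        return "Fits" if model_gb <= sum(pools_gb) else "Won't fit here"
    if vision:
        # Image and vision models cannot be split across cards.
        return "Fits" if model_gb <= max(pools_gb) else "Won't fit here"
    # Other models spill the remainder to the CPU instead of being refused.
    return "Fits" if model_gb <= sum(pools_gb) else "Partial offload"


print(placement(70, [24, 24]))               # Partial offload
print(placement(30, [24, 24], vision=True))  # Won't fit here
print(placement(80, [64], unified=True))     # Won't fit here
```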

Separately, if a worker's GPU faults while it's running, a model can fall back to running entirely on CPU. It keeps serving requests, just far slower. The dashboard flags this with a "Running on CPU" badge (distinct from "Partial offload", which is about a model that simply didn't fit). To restore GPU acceleration, run pendra restart-backend on that worker: it restarts just the inference engine, without stopping the worker, and the next request runs on the GPU again. Pendra also tries to recover automatically when it detects this, but the command is there if you want to force it.

Host RAM

For best results, give the host at least as much system RAM as your largest GPU's VRAM, and ideally as much as the total VRAM you intend to use. Models load from disk into memory before they reach the GPU, and a host with less RAM than a single card's VRAM has to lean entirely on the OS file cache to page a model in, which makes cold loads slower and less predictable. Sizing host RAM to at least your largest GPU's VRAM keeps model loads fast and predictable.
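
The sizing rule above reduces to two comparisons. The sketch below is only an illustration; the function name and the three labels are hypothetical and do not appear in Pendra.

```python
def host_ram_advice(host_ram_gb, gpu_vram_gb):
    """Rate host RAM against the per-GPU VRAM list, following the rule above."""
    largest, total = max(gpu_vram_gb), sum(gpu_vram_gb)
    if host_ram_gb >= total:
        return "ideal"       # at least the total VRAM in use
    if host_ram_gb >= largest:
        return "adequate"    # at least the largest single card
    return "undersized"      # cold loads lean on the OS file cache


print(host_ram_advice(32, [24, 24]))  # adequate
```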

This matters most on multi-GPU boxes with modest per-card memory. Each GPU also has its own per-card ceiling for image and vision models, which must fit on a single GPU (they can't be split across cards) — so a vision model needs to fit one card's VRAM even if the box has plenty in total. pendra doctor flags any vision model that won't fit, any model running partially offloaded, and a worker that has fallen back to running on CPU, so you can catch them before sending traffic.

Disk space

Pendra itself takes little disk space; the models are what consume it. They live in the worker's models directory (~/.pendra/models by default), and modern quantized chat models are 4–80 GB each, so budget accordingly.
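
To budget disk space before pulling a large model, something like the following can report how much the models directory already uses and whether a download of a given size would fit. It is a sketch using only the Python standard library; the helper names are hypothetical, and the default path is the one stated above.

```python
import os
import shutil

MODELS_DIR = os.path.expanduser("~/.pendra/models")  # default location stated above


def directory_size_bytes(path):
    """Sum the sizes of regular files under path (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_room_for(path, model_bytes):
    """True if the filesystem holding path (which must exist) has model_bytes free."""
    return shutil.disk_usage(path).free >= model_bytes


used_gb = directory_size_bytes(MODELS_DIR) / 1e9
print(f"{MODELS_DIR}: {used_gb:.1f} GB used")
```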

Network

The daemon needs outbound HTTPS/WebSocket on port 443 to api.pendra.ai. No inbound ports need to be open. Updates and model downloads are fetched from get.pendra.ai (the apt/yum package repository on Linux, the signed update feed on macOS and Windows).
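
A quick way to confirm that a firewall allows the outbound connections described above is to attempt a TCP connection to each host on port 443. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it checks TCP reachability only, not TLS or the WebSocket upgrade.

```python
import socket

REQUIRED_HOSTS = ["api.pendra.ai", "get.pendra.ai"]  # hostnames named above


def can_connect(host, port=443, timeout=5.0):
    """True if an outbound TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals and timeouts
        return False
```

Calling `can_connect(host)` for each entry in `REQUIRED_HOSTS` from the worker machine shows whether either hostname is blocked.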

Concurrency tuning

The worker defaults to serving one inference request at a time. That matches how most self-hosted GPUs are sized: a single in-flight generation gets the full memory bandwidth, and consumer cards don't run out of memory (OOM) under parallel load.

If your hardware has clear headroom for parallel generations (modern data-centre GPUs, multi-GPU machines), raise Max concurrent requests under Workers → your worker → Settings in the console (org owner). It sets how many requests the worker runs in parallel — and how many Pendra routes to it before picking another worker — and applies to the running worker within a few seconds, no restart needed. Values are clamped to [1, 64].
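
The clamping and the routing limit described above behave roughly like the sketch below. The function names are hypothetical and the capacity check is a simplification of how a router might use the limit, not Pendra's implementation.

```python
MIN_CONCURRENT, MAX_CONCURRENT = 1, 64  # range stated above


def clamp_concurrency(requested):
    """Force a requested value into the allowed [1, 64] range."""
    return max(MIN_CONCURRENT, min(MAX_CONCURRENT, int(requested)))


def has_capacity(in_flight, limit):
    """True while the worker is below its limit and can take another request."""
    return in_flight < clamp_concurrency(limit)


print(clamp_concurrency(200))  # 64
print(has_capacity(1, 1))      # False
```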

Context size

Chat defaults to auto context sizing: the worker picks the largest context window (n_ctx) each model can safely run in the memory the system can actually dedicate to it, then clamps it to the model's trained context. On NVIDIA that memory is the real free GPU VRAM, summed across every GPU because the model is split across them; on Apple unified-memory and CPU hosts it is the working set the runtime can use. In both cases the budget is what remains after the loaded model's weights. Sizing accounts for the model itself, so a compact model gets a larger window than a heavy one on the same card. More usable memory means a larger window, and auto grows into available headroom on every architecture, including Apple Silicon, which is no longer flatly capped at a few thousand tokens.
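
To make the sizing idea concrete, the sketch below estimates a window from a memory budget. It is not Pendra's algorithm, which isn't specified here: the function names and the `reserve_bytes` parameter are hypothetical, and the per-token cost uses the standard transformer key/value cache estimate (keys plus values, per layer, per KV head).

```python
GIB = 2**30


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys plus values for one token across all layers (16-bit by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def auto_n_ctx(usable_bytes, weights_bytes, per_token_bytes, trained_ctx,
               reserve_bytes=0):
    """Largest window that fits after the weights, clamped to the trained context."""
    headroom = usable_bytes - weights_bytes - reserve_bytes
    if headroom <= 0:
        return 0
    return min(trained_ctx, headroom // per_token_bytes)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 5 GiB of weights.
per_token = kv_bytes_per_token(32, 8, 128)          # 131072 bytes per token
print(auto_n_ctx(8 * GIB, 5 * GIB, per_token, 131072))   # 24576
print(auto_n_ctx(24 * GIB, 5 * GIB, per_token, 8192))    # 8192
```

The first call is limited by memory, the second by the trained context, which mirrors why a compact model gets a larger window than a heavy one on the same card.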

Auto also self-heals: if a model ever proves unstable at a window, the worker steps it down to a safe value to keep serving and retries a larger window later, so a one-off problem can't leave a model stuck small. The console shows the window Auto resolved to for each model, flags any window it reduced after an instability, and notes when your recent traffic would fit a smaller window (a hint, not a change).

Context is tuned per model. To pin a specific window, open Workers → your worker → Models & context in the console (owner-only), expand the model to open its settings, switch it to Custom, and choose any value from 1,024 to 1,048,576 tokens (clamped to that model's trained context). A larger window uses more memory, so raise it deliberately on hosts with headroom — and only on the models that need it, leaving the rest on Auto. There's no worker-wide config-file or environment-variable override. Embeddings auto-size their own window and aren't configurable.
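
The bounds on a Custom window can be expressed as a two-step clamp. This sketch assumes out-of-range values are clamped rather than rejected, which the console may handle differently; the function name is hypothetical.

```python
CUSTOM_MIN, CUSTOM_MAX = 1_024, 1_048_576  # range stated above


def resolve_custom_n_ctx(requested, trained_ctx):
    """Clamp to the allowed range, then to the model's trained context."""
    bounded = max(CUSTOM_MIN, min(CUSTOM_MAX, requested))
    return min(bounded, trained_ctx)


print(resolve_custom_n_ctx(200_000, 131_072))  # 131072
```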

Local control channel

The CLI (pendra status, pendra doctor) and the menu-bar GUI talk to the running daemon over a local OS-level channel — a Unix socket at ~/.pendra/pendra.sock on macOS / Linux, or a named pipe \\.\pipe\pendra on Windows. Both are permission-restricted to the current user, so no tokens or shared secrets cross the wire.
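
For troubleshooting a CLI that can't reach the daemon, a script along these lines can locate the channel and, on macOS or Linux, confirm the socket exists and is private to the current user. It is a sketch with hypothetical helper names. It assumes the restriction shows up as file mode bits on the socket itself, whereas the daemon may instead restrict the parent directory, and it does not speak the daemon's protocol.

```python
import os
import platform
import stat


def control_channel_path(system=None, home=None):
    """Return the channel location stated above for the given OS name."""
    system = system or platform.system()
    if system == "Windows":
        return r"\\.\pipe\pendra"
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".pendra", "pendra.sock")


def is_private_socket(path):
    """Unix only: True if path is a socket owned by this user with no group/other access."""
    try:
        info = os.stat(path)
    except OSError:
        return False
    return (
        stat.S_ISSOCK(info.st_mode)
        and info.st_uid == os.getuid()
        and (info.st_mode & 0o077) == 0
    )
```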