Worker

Configuration

Worker configuration comes from three sources, merged in order (later wins):

1. Built-in defaults compiled into the daemon.
2. Config file at `~/.pendra/config.yaml` (Unix) or `%ProgramData%\Pendra\config.yaml` (Windows).
3. Environment variables.

`pendra setup` writes the config file interactively; `pendra config set <key> <value>` mutates it directly. Environment variables are re-read on every CLI invocation and at daemon startup.
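
The precedence described above can be illustrated with a minimal Python sketch. It is not the daemon's code: the `resolve_config` helper, the `DEFAULTS` dictionary, and the staging URL are hypothetical, and only a handful of the documented env vars are mapped. It also assumes an env var counts as set whenever it is present, even if empty.

```python
import os

# Hypothetical built-in defaults; only the URL default is taken from the docs.
DEFAULTS = {
    "app_server_ws_url": "wss://api.pendra.ai",
    "insecure_tls": "false",
}

# Env var name -> config key, for a few entries from the settings reference.
ENV_TO_KEY = {
    "APP_SERVER_WS_URL": "app_server_ws_url",
    "WORKER_ID": "worker_id",
    "WORKER_NAME": "worker_name",
    "PENDRA_MODELS_DIR": "models_dir",
    "INSECURE_TLS": "insecure_tls",
}


def resolve_config(file_values, environ=None):
    """Merge defaults, config-file values, and env vars; later sources win."""
    environ = os.environ if environ is None else environ
    merged = dict(DEFAULTS)       # 1. built-in defaults
    merged.update(file_values)    # 2. config file overrides defaults
    for env_name, key in ENV_TO_KEY.items():
        if env_name in environ:   # 3. env vars override both
            merged[key] = environ[env_name]
    return merged


# Hypothetical inputs: values parsed from config.yaml and a fake environment.
file_values = {
    "worker_name": "gpu-01",
    "app_server_ws_url": "wss://staging.example.test",
}
env = {"WORKER_NAME": "gpu-02"}

config = resolve_config(file_values, env)
print(config["worker_name"])        # gpu-02 (env beats file)
print(config["app_server_ws_url"])  # wss://staging.example.test (file beats default)
print(config["insecure_tls"])       # false (default, nothing overrides it)
```

Passing the environment as a parameter keeps the sketch easy to test without touching the real process environment.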

Example config file

# ~/.pendra/config.yaml
app_server_ws_url: wss://api.pendra.ai
gpu_worker_private_key: <base64-ed25519-private-key>
worker_id: wrk-a1b2c3d4
worker_name: gpu-01
models_dir: ~/.pendra/models
# optional — private inference (set by 'pendra keygen --save'):
workload_private_key_file: ~/.pendra/workload_key

The file holds the Ed25519 private key in plaintext, so the daemon enforces mode `0600` on Unix and warns at startup if the file's permissions are wider than that.
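
To check the permissions yourself before starting the daemon, something like the following Python sketch may help. It is an illustration of what "wider than `0600`" means, not the daemon's own check; the `is_wider_than_0600` helper is hypothetical, and the demo runs against a temporary file rather than the real config. Unix only.

```python
import os
import stat
import tempfile


def is_wider_than_0600(path):
    """Return True if any permission bit outside owner read/write is set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Mask away owner read (0o400) and owner write (0o200); anything left
    # means group/other access, an execute bit, or a special bit.
    return (mode & ~0o600) != 0


# Demonstrate on a throwaway file instead of the real config.yaml.
fd, demo_path = tempfile.mkstemp()
os.close(fd)

os.chmod(demo_path, 0o644)
print(is_wider_than_0600(demo_path))  # True: group and others can read

os.chmod(demo_path, 0o600)
print(is_wider_than_0600(demo_path))  # False: owner read/write only

os.remove(demo_path)
```

If the check returns `True` for your config file, `chmod 600 ~/.pendra/config.yaml` tightens it.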

Settings reference

Env var	Config key	Default	Purpose
`APP_SERVER_WS_URL`	`app_server_ws_url`	`wss://api.pendra.ai`	Base URL the daemon dials; `/ws/gpu` is appended automatically.
`GPU_WORKER_PRIVATE_KEY`	`gpu_worker_private_key`	—	Base64 Ed25519 private key from the console. The daemon refuses to start without one.
`PENDRA_WORKLOAD_PRIVATE_KEY_FILE`	`workload_private_key_file`	—	Path to the private-inference workload key file. `pendra keygen --save` writes a PEM X25519 key beside `config.yaml` and sets this; you can equally point it at a PEM X25519 key you generated yourself (e.g. `openssl genpkey -algorithm X25519`). Leave it unset and drop a key file at the default location (`~/.pendra/workload_key`) and the worker picks it up automatically. Off unless a key is configured or found there — see private inference.
`PENDRA_WORKLOAD_PRIVATE_KEY_FILE_PREVIOUS`	`workload_private_key_file_previous`	—	Path to the previous workload key during a key rotation (same base64 or PEM encodings as the primary). Set automatically when `pendra keygen --save` rotates an existing key, so the worker keeps serving clients still pinned to the old fingerprint until they re-pin.
`WORKER_ID`	`worker_id`	auto `wrk-<hex>`	Stable ID across restarts. Persisted on first run.
`WORKER_NAME`	`worker_name`	`hostname`	Friendly name shown in the console.
`MODELS`	`models`	(serve all)	Filter to a subset of discovered models. JSON array of IDs or `{id}` objects.
`PENDRA_MODELS_DIR`	`models_dir`	`~/.pendra/models` (Unix), `%ProgramData%\Pendra\models` (Windows)	Directory the worker serves GGUFs from. `pendra models install` downloads into this directory; `pendra models dir` prints the resolved path.
`PENDRA_DISABLE_LLAMACPP`	—	`false` (on)	Set `1`/`true`/`yes` to turn off Pendra's in-process chat and embeddings.
`PENDRA_DISABLE_WHISPERCPP`	—	`false` (on)	Set `1`/`true`/`yes` to turn off Pendra's in-process audio transcription. Transcription accepts `wav`/`mp3`/`flac` and runs on Linux, macOS, and Windows workers.
`PENDRA_DISABLE_STABLEDIFFUSION`	—	`false` (on)	Set `1`/`true`/`yes` to turn off Pendra's in-process image generation. Serves `/v1/images/generations` from a diffusion `.gguf` in your models directory.
`PENDRA_ALLOW_METAL_BF16`	—	`false` (refuse)	Apple Silicon Macs only. A few models whose weights use the bf16 number format are declined on a Mac worker by default, because they can crash it. Set `1`/`true`/`yes` to run one anyway once you've confirmed it works on your Mac. No effect on other platforms.
`PENDRA_BLOCK_METAL_BF16_PROJECTOR`	—	`false` (allowed)	Apple Silicon Macs only. Vision models run on Mac workers by default. If a vision model's image analysis fails to start on an older Mac, set `1`/`true`/`yes` to serve those models as text-only (image input off). This only affects image input — it won't make a model that can't run at all (e.g. a too-large or unsupported one) start running. No effect on other platforms.
`INSECURE_TLS`	`insecure_tls`	`false`	Skip TLS cert verification for the WebSocket. Local dev only.
`PENDRA_LOG_MAX_FILES`	—	`5`	How many log files to keep in `~/.pendra/logs/`. The worker starts a fresh log on each restart and keeps this many (the live file plus older sessions), so the log from before a crash or restart is still there to read. Minimum 2.
`PENDRA_LOG_MAX_SIZE_MB`	—	`20`	Size in MB at which the current log file rolls over to a new one mid-session. The total kept is still bounded by `PENDRA_LOG_MAX_FILES`. Minimum 1.
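
Two value formats in the table above are easy to get wrong: the `1`/`true`/`yes` switches and the `MODELS` JSON array. The Python sketch below shows one plausible reading of each. It is not the daemon's parser: `flag_enabled` and `parse_models_filter` are hypothetical helpers, the model IDs are made up, case-insensitive matching is an assumption, and `{id}` objects are assumed to mean JSON objects with an `"id"` key.

```python
import json

# The three truthy spellings listed in the settings reference.
TRUTHY = {"1", "true", "yes"}


def flag_enabled(value):
    """True when an env var holds one of the documented truthy values."""
    # Assumption: matching ignores case and surrounding whitespace.
    return value is not None and value.strip().lower() in TRUTHY


def parse_models_filter(raw):
    """Return the model IDs named in a MODELS value, or None to serve all."""
    if raw is None or not raw.strip():
        return None  # unset: no filter, serve every discovered model
    ids = []
    for entry in json.loads(raw):
        if isinstance(entry, str):
            ids.append(entry)        # bare ID string
        elif isinstance(entry, dict) and "id" in entry:
            ids.append(entry["id"])  # {"id": "..."} object
        else:
            raise ValueError(f"unsupported MODELS entry: {entry!r}")
    return ids


print(flag_enabled("yes"))   # True
print(flag_enabled("0"))     # False
print(parse_models_filter('["model-a", {"id": "model-b"}]'))  # ['model-a', 'model-b']
print(parse_models_filter(None))  # None
```

Because `MODELS` is JSON, quote it as a single shell argument when exporting it, for example `export MODELS='["model-a"]'`.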

Per-worker settings in the console

A few worker-wide tuning options live in the console rather than in `config.yaml` or an env var. Open Workers → your worker → Settings (you must be the org owner). Changes apply to the running worker within a few seconds, with no restart needed.

Setting	Default	Purpose
Speculative decoding cap	`2`	How many tokens the draft proposes per step when a model runs speculative decoding (range `1`–`6`). Turning speculative decoding on or off is a separate per-model setting (Models & context); this tunes the depth when it's on.
Allow remote image URLs	Off	Let vision requests reference `http(s)` image URLs, which the worker fetches for you (private/loopback hosts are always blocked). Off by default; pass images inline as base64 `data:` URIs.
Web tools	Off	Let tool-capable models fetch live web pages and run web searches during a chat, so answers can use up-to-date information (private/loopback hosts are always blocked). Off by default, since turning it on adds extra work per request. When it's on, the chat response also lists the pages fetched and searches run under a `pendra.web_tool_steps` field, so you can see what the model looked at.
Task timeout	`30m`	Wall-clock limit for a single non-streaming request (chat, embeddings, rerank, image, transcription), from 1 to 120 minutes. Raise it for very large or slow models; lower it to fail faster. Streaming chat is unaffected.
Max concurrent requests	`1`	How many requests this worker runs in parallel (range `1`–`64`), which is also how many Pendra sends it before routing to another worker. Defaults to one at a time; raise it on hardware that can serve more without running out of memory.
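
If you record intended console settings in your own tooling, a small range check can catch out-of-bounds values before you enter them. The Python sketch below copies the ranges from the table above; the key names and the `validate_setting` helper are hypothetical, since the console is the only documented way to change these settings.

```python
# Ranges copied from the settings table; the key names are hypothetical.
LIMITS = {
    "speculative_decoding_cap": (1, 6),
    "task_timeout_minutes": (1, 120),
    "max_concurrent_requests": (1, 64),
}


def validate_setting(name, value):
    """Return value if it is an integer inside the documented range."""
    low, high = LIMITS[name]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be an integer, got {value!r}")
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


print(validate_setting("max_concurrent_requests", 4))  # 4

try:
    validate_setting("speculative_decoding_cap", 7)
except ValueError as err:
    print(err)  # speculative_decoding_cap must be between 1 and 6, got 7
```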
 
 
CLI commands

Command	Purpose
`pendra` / `pendra run`	Start the worker daemon.
`pendra status`	Connection and model info for the running worker.
`pendra restart-backend`	Restart just the inference engine (a fresh process starts on the next request) without stopping the worker. Use it to recover a stalled GPU that has fallen back to running on CPU.
`pendra config`	Print resolved config and the file path it loaded from.
`pendra config set <key> <value>`	Mutate the config file.
`pendra setup`	Interactive setup wizard.
`pendra models`	List the models installed on the worker.
`pendra models install <model>`	Install a catalogue model onto the worker.
`pendra logs`	Tail the worker's log buffer (`-f` follows). The OS-supervised service is managed with `systemctl` / `launchctl` / Services.msc.
`pendra doctor`	Diagnostics: checks the inference runtime, config, and the live worker connection. On a Linux service install, re-runs itself with `sudo` when it needs to read the service's config.
 
 
Related

Install: get the worker running.
System requirements: OS support, hardware.
Choosing a context size: how the context window is sized.
