vLLM
vLLM is a high-throughput inference server tuned for production-scale serving. It's the right backend when you need to saturate a data-centre GPU. Pendra treats vLLM as externally managed: you decide what model it serves, Pendra just connects.
What's supported
| Capability | Status |
|---|---|
| Chat completions | ✓ |
| Embeddings | ✓ |
| Image generation | — |
| Audio transcription | — |
| Model install | ✗ — managed externally |
| Model uninstall | ✗ — managed externally |
Connection
- Default port: 8000
- Auto-discovery probes: http://localhost:8000, then http://host.docker.internal:8000
- Verification: the worker calls /version and /v1/models (both served by vLLM's OpenAI-compatible server).
- Override: set VLLM_ENDPOINT in worker config.
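The probe order and verification step above can be sketched roughly as follows. This is a minimal illustration, not Pendra's actual worker code: the function names (`http_get_json`, `verify`, `discover`), the timeout, and the shape of the returned dictionary are all assumptions made for the example. It assumes `/version` returns a JSON object with a `version` field and `/v1/models` returns an OpenAI-style list under `data`.

```python
import http.client
import json
import urllib.request

# Probe order taken from the list above; everything else here is hypothetical.
DEFAULT_CANDIDATES = (
    "http://localhost:8000",
    "http://host.docker.internal:8000",
)


def http_get_json(url, timeout=2.0):
    """Fetch a URL and decode its JSON body; return None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        return None


def verify(base_url, fetch=http_get_json):
    """Return endpoint details if both verification calls succeed, else None."""
    version = fetch(base_url + "/version")
    models = fetch(base_url + "/v1/models")
    if not isinstance(version, dict) or not isinstance(models, dict):
        return None
    data = models.get("data")
    if not isinstance(data, list):
        return None
    ids = [m["id"] for m in data if isinstance(m, dict) and "id" in m]
    return {"endpoint": base_url, "version": version.get("version"), "models": ids}


def discover(configured_endpoint=None, fetch=http_get_json):
    """Try the configured endpoint if one is set, otherwise the default probes in order."""
    if configured_endpoint:
        candidates = [configured_endpoint.rstrip("/")]
    else:
        candidates = list(DEFAULT_CANDIDATES)
    for base_url in candidates:
        result = verify(base_url, fetch)
        if result is not None:
            return result
    return None
```

Passing `fetch` as a parameter keeps the probe logic testable without a running vLLM instance. In this sketch, `configured_endpoint` stands in for whatever value the worker reads from VLLM_ENDPOINT.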
Why no model install?
vLLM loads a single model per process and is typically launched via
Docker or systemd with command-line flags like
--model meta-llama/Llama-3.1-70B-Instruct and tensor-parallel
settings. There's no REST hook to swap models at runtime, so install /
uninstall happen at the OS level (restart vLLM with a different
--model, or kill and start a new container).
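To make the "swap means restart" point concrete, here is a small sketch of how an operator's own tooling might assemble the launch command. The helper `build_vllm_argv` is hypothetical and is not part of Pendra or vLLM; it assumes vLLM's OpenAI-compatible server entry point and its `--model`, `--tensor-parallel-size`, and `--port` flags.

```python
def build_vllm_argv(model, tensor_parallel_size=1, port=8000):
    """Build the argument list for launching one vLLM server process (hypothetical helper)."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]


# Serving a different model means stopping the old process and starting a new
# one with a different --model value; nothing changes inside the running server.
argv = build_vllm_argv("meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
print(" ".join(argv))
```

The resulting list could be handed to a process supervisor, a systemd unit, or a container entrypoint, depending on how the deployment is managed.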
Pendra picks up whatever model vLLM is currently serving via
/v1/models and surfaces it in the console. The
capability lights up the moment vLLM is responsive on the configured port.
Model metadata
The worker enriches vLLM models with context length (from
max_model_len) and parameter size (regex-extracted from the
model ID). HuggingFace metadata enrichment is applied where the model ID
matches a public HF repo — set HF_TOKEN on the API to enrich
gated models too.
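A rough sketch of the first two enrichment steps, under stated assumptions: the regular expression, the function names, and the output field names below are illustrative guesses, not the worker's real implementation. It assumes each entry returned by /v1/models carries an `id` and, where available, a `max_model_len` field. HuggingFace enrichment is left out.

```python
import re

# Matches a number directly followed by "B" or "b", e.g. "70B" or "0.5b".
# The lookbehind avoids starting mid-number; the lookahead skips words like "8bit".
PARAM_SIZE_RE = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?)[bB](?![A-Za-z])")


def parameter_size_billions(model_id):
    """Extract the parameter count in billions from a model ID, or None if absent."""
    match = PARAM_SIZE_RE.search(model_id)
    return float(match.group(1)) if match else None


def enrich(model_entry):
    """Combine context length and parameter size for one /v1/models entry."""
    model_id = model_entry["id"]
    return {
        "id": model_id,
        "context_length": model_entry.get("max_model_len"),
        "parameter_size_b": parameter_size_billions(model_id),
    }


entry = {"id": "meta-llama/Llama-3.1-70B-Instruct", "max_model_len": 131072}
print(enrich(entry))
```

Extracting the size from the ID is a heuristic: IDs that do not follow the "number plus B" naming pattern simply yield no parameter size in this sketch.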