
vLLM

vLLM is a high-throughput inference server tuned for production-scale serving. It's the right backend when you need to saturate a data-centre GPU. Pendra treats vLLM as externally managed: you decide which model it serves, and Pendra just connects.

What's supported

Capability             Status
Chat completions
Embeddings
Image generation
Audio transcription
Model install          ✗ (managed externally)
Model uninstall        ✗ (managed externally)

Connection

  • Default port: 8000
  • Auto-discovery probes: http://localhost:8000, then http://host.docker.internal:8000
  • Verification: the worker calls /version (a vLLM-specific endpoint) and /v1/models (OpenAI-compatible).
  • Override: set VLLM_ENDPOINT in worker config.
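
The probe order and verification above can be sketched roughly as follows. This is an illustration, not Pendra's actual worker code: the function names, the idea that VLLM_ENDPOINT arrives as an environment-style mapping, and the assumption that an override replaces auto-discovery entirely are all hypothetical.

```python
import json
import os
import urllib.request
from typing import Callable, Mapping, Optional

# Probe order taken from the list above.
DEFAULT_CANDIDATES = ["http://localhost:8000", "http://host.docker.internal:8000"]


def http_get_json(url: str, timeout: float = 2.0) -> dict:
    """Fetch a URL and decode the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def discover_vllm(
    env: Optional[Mapping[str, str]] = None,
    fetch: Callable[[str], dict] = http_get_json,
) -> Optional[str]:
    """Return the first base URL that answers both /version and /v1/models."""
    env = os.environ if env is None else env
    override = env.get("VLLM_ENDPOINT")
    # Assumption: an explicit override disables auto-discovery.
    candidates = [override] if override else DEFAULT_CANDIDATES
    for base in candidates:
        base = base.rstrip("/")
        try:
            version = fetch(f"{base}/version")
            models = fetch(f"{base}/v1/models")
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: try the next one.
            continue
        if "version" in version and isinstance(models.get("data"), list):
            return base
    return None
```

Passing fetch as a parameter keeps the probe logic testable without a running vLLM instance.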

Why no model install?

vLLM loads a single model per process and is typically launched via Docker or systemd with command-line flags like --model meta-llama/Llama-3.1-70B-Instruct and tensor-parallel settings. There's no REST hook to swap the served model at runtime, so install and uninstall happen at the OS level: restart vLLM with a different --model, or stop the container and start a new one.
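
As a rough illustration of what "restart with a different --model" means, the helper below assembles the command line for vLLM's OpenAI-compatible server. The helper name and its defaults are hypothetical; how you actually run the command (Docker, systemd, a shell) is up to your deployment.

```python
from typing import List


def build_vllm_command(model: str, tensor_parallel_size: int = 1, port: int = 8000) -> List[str]:
    """Build an argv list that launches vLLM serving a single model."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]


# Swapping models means stopping the old process and starting a new one
# with a different --model value.
print(" ".join(build_vllm_command("meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)))
```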

Pendra picks up whatever model vLLM is currently serving via /v1/models and surfaces it in the console. The capability lights up the moment vLLM is responsive on the configured port.
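
To make the /v1/models step concrete, here is a minimal sketch that pulls the served model IDs out of an OpenAI-style model list. The function name is hypothetical and the sample payload is trimmed for illustration; a real response carries more fields.

```python
from typing import List


def served_model_ids(payload: dict) -> List[str]:
    """Extract model IDs from an OpenAI-compatible /v1/models response."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


# Trimmed example of the response shape (illustrative values only).
sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.1-70B-Instruct", "object": "model"}],
}
print(served_model_ids(sample))
```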

Model metadata

The worker enriches each vLLM model with its context length (from max_model_len) and parameter size (regex-extracted from the model ID). Hugging Face metadata is also applied when the model ID matches a public Hugging Face repo. To enrich gated models as well, set HF_TOKEN on the API.
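
The two local enrichment steps might look something like the sketch below. The regex is an assumption about what "regex-extracted" could mean, not the worker's actual pattern, and the function names and output keys are hypothetical.

```python
import re
from typing import Optional

# Matches sizes such as "70B", "1.5B", or "125m". The lookbehind keeps the
# pattern from starting inside a version number like "3.1".
_PARAM_SIZE = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?)([BbMm])(?![A-Za-z])")


def parameter_size(model_id: str) -> Optional[str]:
    """Return the first size token found in the model ID, or None."""
    match = _PARAM_SIZE.search(model_id)
    if match is None:
        return None
    return f"{match.group(1)}{match.group(2).upper()}"


def enrich(model: dict) -> dict:
    """Attach context length and parameter size to one /v1/models entry."""
    return {
        "id": model["id"],
        "context_length": model.get("max_model_len"),
        "parameter_size": parameter_size(model["id"]),
    }
```

Model IDs that carry no size token, such as bert-base-uncased, simply get a parameter size of None under this sketch.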
