# vLLM

## Overview

vLLM serves LLM inference via an OpenAI-compatible API. It runs inside Docker with GPU access and is only reachable through the nginx proxy — no port is published directly.


```
Client → nginx (port 8100) → vLLM (internal, port 8000)
```
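The proxy rule behind that flow might look roughly like the following. This is a hypothetical sketch: the upstream hostname `vllm`, and the exact `location` layout, are assumptions — the real nginx config may route `/health` and `/v1/*` with additional rules.

```nginx
# Hypothetical sketch of the nginx side of the diagram above.
# "vllm" is an assumed Docker service name resolving to the vLLM container.
server {
    listen 8100;

    # Full pass-through: strip the /vllm/ prefix and forward to vLLM.
    location /vllm/ {
        proxy_pass http://vllm:8000/;
    }
}
```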

## Startup

vLLM is started by `vllm/start_vllm.sh`, which builds the `vllm serve` command from environment variables, enabling prefix caching, applying a custom chat template at `/chat_template.jinja`, and mounting the API under `/vllm/`.
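A minimal sketch of how such a startup script could assemble the command from env vars. The variable names (`MODEL_NAME`, `MAX_MODEL_LEN`) and default values are assumptions, not the real script's contents; the flags shown (`--enable-prefix-caching`, `--chat-template`, `--root-path`) are standard `vllm serve` options matching the behavior described above.

```shell
#!/bin/sh
# Hypothetical sketch of start_vllm.sh; env var names are assumptions.
MODEL_NAME="${MODEL_NAME:-my-org/my-model}"   # placeholder model ID
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"

# Build the vllm serve invocation as positional parameters.
set -- vllm serve "$MODEL_NAME" \
  --host 0.0.0.0 --port 8000 \
  --enable-prefix-caching \
  --chat-template /chat_template.jinja \
  --root-path /vllm \
  --max-model-len "$MAX_MODEL_LEN"

echo "launching: $*"
# exec "$@"   # commented out so the sketch runs without vLLM installed
```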


## Endpoints

vLLM exposes an OpenAI-compatible API. Through nginx, the following are reachable:

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check (proxied from vLLM) |
| GET | `/v1/models` | List loaded models |
| POST | `/v1/chat/completions` | Chat inference |
| POST | `/v1/completions` | Text completion |
| ANY | `/vllm/*` | Full pass-through to vLLM (all routes) |