vLLM
Overview
vLLM serves LLM inference through an OpenAI-compatible API. It runs inside Docker with GPU access and is reachable only through the nginx reverse proxy; no container port is published directly.
Client → nginx (port 8100) → vLLM (internal, port 8000)
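A rough sketch of what the nginx side of this chain could look like. Only the two ports and the route list come from this document; the upstream name `vllm` (an assumed Docker service name), the prefix-stripping behavior, and all other directives are assumptions, not the deployed config:

```nginx
server {
    listen 8100;

    # Selected OpenAI-style routes proxied explicitly (see table below).
    location ~ ^/(health|v1/) {
        proxy_pass http://vllm:8000;
        proxy_buffering off;   # streamed completions need buffering off
    }

    # Full pass-through: /vllm/foo -> vLLM's /foo. Whether the prefix is
    # stripped here depends on how the app's root path is configured;
    # stripping is shown as one plausible choice.
    location /vllm/ {
        proxy_pass http://vllm:8000/;
        proxy_buffering off;
    }
}
```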
Startup
vLLM is started by `vllm/start_vllm.sh`, which assembles the `vllm serve` command from environment variables: prefix caching is enabled, a custom chat template is loaded from `/chat_template.jinja`, and the API is mounted under the `/vllm/` root path.
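A minimal sketch of what such a launcher could look like. The env var names (`MODEL_NAME`, `MAX_MODEL_LEN`) and the default model id are assumptions, not the real script's; `--enable-prefix-caching`, `--chat-template`, and `--root-path` are genuine `vllm serve` flags. The sketch echoes the assembled command instead of handing off, so the composition is visible:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Env vars with defaults; these names are assumptions for illustration.
MODEL_NAME="${MODEL_NAME:-meta-llama/Llama-3.1-8B-Instruct}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"

# Assemble the serve command as described above: prefix caching on,
# custom chat template, API rooted under /vllm/, internal port 8000.
CMD=(vllm serve "$MODEL_NAME"
     --max-model-len "$MAX_MODEL_LEN"
     --enable-prefix-caching
     --chat-template /chat_template.jinja
     --root-path /vllm
     --port 8000)

echo "launching: ${CMD[*]}"
# exec "${CMD[@]}"   # hand off to vLLM (commented out in this sketch)
```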
Endpoints
Through nginx, the following routes of vLLM's OpenAI-compatible API are reachable:
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check (proxied from vLLM) |
| GET | /v1/models | List loaded models |
| POST | /v1/chat/completions | Chat inference |
| POST | /v1/completions | Text completion |
| ANY | /vllm/* | Full pass-through to vLLM (all routes) |
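Assuming nginx listens on `localhost:8100` as in the diagram above, a chat request is an ordinary OpenAI-style call. The model name below is a placeholder; the actual value must match what `GET /v1/models` reports:

```shell
BASE_URL="${BASE_URL:-http://localhost:8100}"

# JSON body for /v1/chat/completions; "model" is a placeholder.
PAYLOAD='{"model": "my-model", "messages": [{"role": "user", "content": "Say hi"}]}'

# Probe health first, then send the chat request. The fallbacks keep the
# script alive when the server is down (e.g. in a dry run).
curl -sf "$BASE_URL/health" || echo "vLLM unreachable via nginx"
curl -s "$BASE_URL/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d "$PAYLOAD" || true
```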