# vLLM

## Overview

vLLM serves LLM inference via an OpenAI-compatible API. It runs inside Docker with GPU access and is only reachable through the nginx proxy — no port is published directly.


```
Client → nginx (port 8100) → vLLM (internal, port 8000)
```
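The proxy rule behind that flow might look roughly like the following. This is a hypothetical sketch: the upstream hostname `vllm`, and the exact `location` layout, are assumptions — the real nginx config may route `/health` and `/v1/*` with additional rules.

```nginx
# Hypothetical sketch of the nginx side of the diagram above.
# "vllm" is an assumed Docker service name resolving to the vLLM container.
server {
    listen 8100;

    # Full pass-through: strip the /vllm/ prefix and forward to vLLM.
    location /vllm/ {
        proxy_pass http://vllm:8000/;
    }
}
```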

## Startup

vLLM is started by `vllm/start_vllm.sh`, which builds the `vllm serve` command from environment variables, enabling prefix caching, applying a custom chat template at `/chat_template.jinja`, and mounting the API under `/vllm/`.
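A minimal sketch of how such a startup script could assemble the command from env vars. The variable names (`MODEL_NAME`, `MAX_MODEL_LEN`) and default values are assumptions, not the real script's contents; the flags shown (`--enable-prefix-caching`, `--chat-template`, `--root-path`) are standard `vllm serve` options matching the behavior described above.

```shell
#!/bin/sh
# Hypothetical sketch of start_vllm.sh; env var names are assumptions.
MODEL_NAME="${MODEL_NAME:-my-org/my-model}"   # placeholder model ID
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"

# Build the vllm serve invocation as positional parameters.
set -- vllm serve "$MODEL_NAME" \
  --host 0.0.0.0 --port 8000 \
  --enable-prefix-caching \
  --chat-template /chat_template.jinja \
  --root-path /vllm \
  --max-model-len "$MAX_MODEL_LEN"

echo "launching: $*"
# exec "$@"   # commented out so the sketch runs without vLLM installed
```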


## Endpoints

vLLM exposes an OpenAI-compatible API. Through nginx, the following are reachable:

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check (proxied from vLLM) |
| GET | `/v1/models` | List loaded models |
| POST | `/v1/chat/completions` | Chat inference |
| POST | `/v1/completions` | Text completion |
| ANY | `/vllm/*` | Full pass-through to vLLM (all routes) |