vLLM API Usage

Base URL: https://llm.kaisens.fr

All POST endpoints require a Bearer token:


Authorization: Bearer <VLLM_API_KEY>
X-Client-Name: <your-app-name>   # recommended — used for per-client quota

Chat Completions

POST /v1/chat/completions


curl https://llm.kaisens.fr/v1/chat/completions \
  -H "Authorization: Bearer <VLLM_API_KEY>" \
  -H "Content-Type: application/json" \
  -H "X-Client-Name: my-app" \
  -d '{
    "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Once upon a time" }
    ],
    "max_tokens": 20
  }'

Example response:


{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
  "choices": [
    {
      "message": { "role": "assistant", "content": "there was a brave knight who..." },
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
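In Python, the assistant's reply can be pulled out of this response shape with the standard library alone. A minimal sketch; the JSON string below simply mirrors the example response above (in practice it would come from the HTTP response body):

```python
import json

# Example response body (as shown above); in practice this comes
# from the HTTP response.
raw = '''{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
  "choices": [
    {
      "message": { "role": "assistant", "content": "there was a brave knight who..." },
      "index": 0,
      "finish_reason": "length"
    }
  ]
}'''

resp = json.loads(raw)
# The generated text lives at choices[0].message.content.
reply = resp["choices"][0]["message"]["content"]
print(reply)  # there was a brave knight who...
```

A `finish_reason` of "length" means generation stopped because `max_tokens` was reached, not because the model finished its answer.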

Text Completions

POST /v1/completions


curl https://llm.kaisens.fr/v1/completions \
  -H "Authorization: Bearer <VLLM_API_KEY>" \
  -H "Content-Type: application/json" \
  -H "X-Client-Name: my-app" \
  -d '{
    "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
    "prompt": "Once upon a time",
    "max_tokens": 20
  }'
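The same call can be assembled from Python. A sketch of a helper that builds the URL, headers, and payload (the API key and client name are placeholders you must supply; the helper name is not part of the API):

```python
# Sketch: assemble the pieces of a POST /v1/completions request.
def build_completion_request(api_key: str, client_name: str,
                             prompt: str, max_tokens: int = 20):
    """Return (url, headers, payload) for a text-completion call."""
    url = "https://llm.kaisens.fr/v1/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-Client-Name": client_name,  # recommended: enables per-client quota
    }
    payload = {
        "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return url, headers, payload

url, headers, payload = build_completion_request(
    "<VLLM_API_KEY>", "my-app", "Once upon a time")
# Send with any HTTP client, e.g.:
#   requests.post(url, headers=headers, json=payload)
```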

Models

GET /v1/models — no auth required


curl https://llm.kaisens.fr/v1/models
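The model IDs can be extracted from the listing in Python. This sketch assumes the usual OpenAI-compatible list shape (`{"object": "list", "data": [...]}`), which is an assumption, not something shown above; the sample body is illustrative:

```python
import json

def model_ids(body: str) -> list:
    """Extract model IDs from a /v1/models response body
    (assumed OpenAI-compatible list shape)."""
    return [m["id"] for m in json.loads(body)["data"]]

# Illustrative body in the assumed shape:
sample = ('{"object": "list", "data": '
          '[{"id": "solidrust/Mistral-7B-Instruct-v0.3-AWQ", "object": "model"}]}')
print(model_ids(sample))  # ['solidrust/Mistral-7B-Instruct-v0.3-AWQ']
```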

Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| model | string | Model name to use |
| prompt | string | Input text (completions only) |
| messages | array | Conversation history (chat only) |
| max_tokens | int | Maximum tokens to generate |
| temperature | float | Sampling temperature (0.0–2.0, default 1.0) |
| top_p | float | Nucleus sampling threshold (default 1.0) |
| stream | bool | Stream response chunks (default false) |
| stop | string/array | Stop sequence(s) |
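These parameters combine into a single request body. A sketch of a chat payload using several of them at once (all values here are illustrative choices, not recommended defaults):

```python
payload = {
    "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "max_tokens": 100,
    "temperature": 0.7,   # lower than 1.0 = less random sampling
    "top_p": 0.9,         # nucleus sampling cutoff
    "stop": ["\n\n"],     # stop at the first blank line
    "stream": False,
}
```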

Streaming

Set "stream": true to receive the response as server-sent events (SSE):


curl https://llm.kaisens.fr/v1/chat/completions \
  -H "Authorization: Bearer <VLLM_API_KEY>" \
  -H "Content-Type: application/json" \
  -H "X-Client-Name: my-app" \
  -d '{
    "model": "solidrust/Mistral-7B-Instruct-v0.3-AWQ",
    "messages": [{ "role": "user", "content": "Tell me a story" }],
    "max_tokens": 100,
    "stream": true
  }'
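Each SSE event arrives as a `data: ` line carrying a JSON chunk, terminated by a final `data: [DONE]` line. A sketch of accumulating the streamed text from those lines (the sample chunks below follow the OpenAI-compatible streaming shape and are illustrative, not captured from this server):

```python
import json

def collect_stream(lines):
    """Accumulate assistant text from chat-completion SSE lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank lines / comments between events
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break  # sentinel: stream is finished
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))  # role-only deltas add nothing
    return "".join(parts)

# Illustrative chunks in the OpenAI-compatible streaming shape:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"Once"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":" upon"},"index":0}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # Once upon
```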

Error Responses

| Status | Meaning |
|--------|---------|
| 401 | Missing or invalid Bearer token |
| 403 | IP not in whitelist |
| 404 | Endpoint not allowlisted |
| 429 | Per-client quota exceeded (X-Client-Name or IP rate limit) |
| 503 | vLLM backend is down |
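A client can branch on these codes: 429 and 503 are transient (quota or backend), while 401/403/404 indicate a configuration problem that retrying will not fix. A sketch of a retry policy (the backoff values are arbitrary choices, not server requirements):

```python
RETRYABLE = {429, 503}  # quota exceeded, backend down

def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry transient failures only; auth/allowlist errors need a config fix."""
    return status in RETRYABLE and attempt < max_attempts

def backoff_seconds(attempt: int) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... (arbitrary base)."""
    return 2.0 ** attempt
```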