vLLM Cheat Sheet
Overview
vLLM is a production-grade LLM serving system that achieves high throughput through PagedAttention, a memory management algorithm inspired by OS paging that stores the KV cache in fixed-size blocks and so largely eliminates fragmentation. The vLLM authors report KV cache memory waste under 4% and up to 24× higher throughput than a plain HuggingFace Transformers serving loop.
Key features: OpenAI-compatible REST API (drop-in replacement), continuous batching (dynamically fills GPU with requests), tensor parallelism for multi-GPU inference, quantization support (AWQ, GPTQ, FP8, INT8), speculative decoding for latency reduction, and LoRA adapter serving.
Minimum requirements: NVIDIA GPU with compute capability 7.0 or higher (Volta/Turing and newer; Ampere+ for FlashAttention and bfloat16, Ada Lovelace/Hopper for native FP8 compute), CUDA 11.8+, Python 3.9+.
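To make the paging idea concrete, here is a minimal sketch of a block-table allocator. It is a toy model, not vLLM's implementation: the class name `ToyBlockAllocator`, the block size of 16 tokens, and the pool size are illustrative assumptions. It shows that unused slots can only occur in the last block of each sequence.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class ToyBlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared free pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_tokens(self, seq_id: str, num_tokens: int) -> None:
        """Grow a sequence, allocating new blocks only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        blocks_needed = math.ceil(new_len / BLOCK_SIZE)
        if blocks_needed - len(table) > len(self.free_blocks):
            raise MemoryError("no free KV blocks")
        while len(table) < blocks_needed:
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated-but-unused token slots (only in each last block)."""
        return sum(
            len(self.block_tables[s]) * BLOCK_SIZE - self.seq_lens[s]
            for s in self.block_tables
        )


alloc = ToyBlockAllocator(num_blocks=8)
alloc.append_tokens("a", 20)  # needs 2 blocks
alloc.append_tokens("b", 16)  # fills exactly 1 block
alloc.append_tokens("a", 5)   # 25 tokens still fit in the same 2 blocks
print(alloc.wasted_slots())
```

Because blocks need not be contiguous, a finished sequence's blocks can be reused immediately by any other sequence, which is what keeps fragmentation low.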
Installation
# Standard install (CUDA 12.1)
pip install vllm
# Specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
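# CPU-only (for testing, not production). This package name is unverified;
# the official docs describe a source build with VLLM_TARGET_DEVICE=cpu
pip install vllm-cpu
# ROCm (AMD GPU). The PyTorch index only supplies ROCm builds of torch;
# the official docs point to a ROCm Docker image or a source build for vLLM itself
pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.1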
# Docker (recommended for production)
docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3.1-8B-Instruct
# Verify
python -c "import vllm; print(vllm.__version__)"
Core CLI Commands
| Command | Description |
|---|---|
| vllm serve <model> | Start OpenAI-compatible API server |
| vllm serve <model> --port 8000 | Custom port |
| vllm serve <model> --tensor-parallel-size 4 | Multi-GPU inference |
| vllm serve <model> --quantization awq | Load AWQ quantized model |
| vllm serve <model> --max-model-len 4096 | Limit context window |
| vllm serve <model> --gpu-memory-utilization 0.95 | GPU memory fraction |
| vllm serve <model> --dtype bfloat16 | Override dtype |
| vllm serve <model> --enable-lora --max-loras 4 | Enable LoRA serving |
| vllm serve <model> --api-key token-abc123 | Require API key |
| vllm serve <model> --served-model-name my-model | Custom model name |
| vllm bench throughput --model <model> | Benchmark throughput |
| vllm bench latency --model <model> | Benchmark latency |
Configuration
# Single GPU server
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--dtype auto
# Multi-GPU (tensor parallel)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--dtype bfloat16
# AWQ quantized model
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
--quantization awq \
--max-model-len 4096
# With LoRA support
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 4 \
--max-lora-rank 32
# Environment variables
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export CUDA_VISIBLE_DEVICES=0,1,2,3
export HF_TOKEN=hf_... # for gated models
Core API Reference
| API | Description |
|---|---|
| LLM(model, ...) | Offline inference engine |
| llm.generate(prompts, sampling_params) | Batch generate completions |
| llm.chat(messages, sampling_params) | Chat-style generation |
| SamplingParams(temperature, top_p, max_tokens, ...) | Generation config |
| llm.encode(prompts) | Get embeddings (embedding models only) |
| AsyncLLMEngine | Async engine for server integration |
| AsyncLLMEngine.generate(prompt, params, request_id) | Async streaming generation |
| GET /v1/models | List available models (API server) |
| POST /v1/completions | OpenAI completions endpoint |
| POST /v1/chat/completions | OpenAI chat completions endpoint |
| POST /v1/embeddings | Embeddings endpoint |
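Since the HTTP endpoints follow the OpenAI wire format, they can also be called without any client library. The sketch below only builds the request for POST /v1/chat/completions using the standard library; the helper name `build_chat_request` and the `max_tokens` value are assumptions for illustration, and the commented lines show how it might be sent to a running server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str,
                       api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 64,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with --api-key
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "Hello!",
    api_key="token-abc123",
)
print(req.get_method(), req.full_url)
# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```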
Advanced Usage
Offline Batch Inference
from vllm import LLM, SamplingParams
# Load model
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
dtype="bfloat16",
max_model_len=8192,
gpu_memory_utilization=0.90,
tensor_parallel_size=1, # number of GPUs
)
# Sampling parameters
params = SamplingParams(
temperature=0.7,
top_p=0.9,
top_k=50,
max_tokens=512,
repetition_penalty=1.1,
stop=["</s>", "<|eot_id|>"], # stop sequences
)
# Batch inference (highly efficient — vLLM schedules all at once)
prompts = [
"Explain quantum computing in simple terms.",
"Write a Python function to reverse a linked list.",
"What are the main causes of World War I?",
"Summarize the theory of relativity.",
]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print()
Chat API (OpenAI-compatible Client)
from openai import OpenAI
# Point to vLLM server instead of OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="token-abc123", # any string if no auth configured
)
# List models
models = client.models.list()
print([m.id for m in models.data])
# Chat completion
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain PagedAttention."},
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
# Streaming
stream = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Write a poem about GPUs."}],
stream=True,
max_tokens=256,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
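When the full reply is needed after streaming (for logging or post-processing), the deltas have to be accumulated. A minimal sketch follows; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the attribute layout used in the streaming loop above so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate delta.content pieces, skipping empty or missing ones."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensive: a chunk may carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for one streamed chunk (hypothetical test helper)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(None),
    fake_chunk("Hello"),
    fake_chunk(", "),
    fake_chunk("GPU"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(fake_stream))
```

With a real server, the same function could be passed the `stream` object returned by `client.chat.completions.create(..., stream=True)`.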
Multi-GPU Tensor Parallelism
from vllm import LLM, SamplingParams
# 70B model across 4 GPUs
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
tensor_parallel_size=4, # split model across 4 GPUs
pipeline_parallel_size=1, # pipeline stages (for very large models)
dtype="bfloat16",
max_model_len=4096,
gpu_memory_utilization=0.90,
)
# Tensor + pipeline parallelism (for models too large for one node's GPUs with
# tensor parallelism alone; 16 GPUs normally means a multi-node setup)
llm = LLM(
model="meta-llama/Meta-Llama-3.1-405B",
tensor_parallel_size=8,
pipeline_parallel_size=2, # 8 × 2 = 16 GPUs total
dtype="bfloat16",
)
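A quick way to choose the parallel sizes is to estimate the weight memory each GPU must hold. The sketch below is back-of-the-envelope arithmetic under stated assumptions: weights split evenly across all GPUs, and no allowance for KV cache, activations, or framework overhead, all of which need additional headroom. The function name is hypothetical.

```python
def weight_gib_per_gpu(num_params: float, bytes_per_param: float,
                       tensor_parallel_size: int = 1,
                       pipeline_parallel_size: int = 1) -> float:
    """Approximate weight memory per GPU in GiB, assuming an even split."""
    total_bytes = num_params * bytes_per_param
    num_gpus = tensor_parallel_size * pipeline_parallel_size
    return total_bytes / num_gpus / 2**30


# 70B parameters in bfloat16 (2 bytes each) across 4 GPUs
print(round(weight_gib_per_gpu(70e9, 2, tensor_parallel_size=4), 1))
# 405B parameters in bfloat16 across 8 x 2 = 16 GPUs
print(round(weight_gib_per_gpu(405e9, 2, 8, 2), 1))
```

If the per-GPU figure is close to the card's capacity, little room is left for the KV cache, so a larger parallel size or quantization is worth considering.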
Quantization
from vllm import LLM
# AWQ (Activation-aware Weight Quantization): 4-bit weights, near-lossless
llm = LLM(
model="TheBloke/Llama-2-70B-Chat-AWQ",
quantization="awq",
dtype="half",
)
# GPTQ — 4-bit quantization
llm = LLM(
model="TheBloke/Llama-2-13B-GPTQ",
quantization="gptq",
dtype="float16",
)
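# FP8 (native FP8 compute needs compute capability 8.9+, i.e. Ada Lovelace or
# Hopper such as H100/H800; Ampere can run FP8 checkpoints as weight-only)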
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
quantization="fp8",
dtype="bfloat16",
)
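# INT8 (W8A8: weights and activations in 8-bit).
# NOTE: vLLM normally loads W8A8 from a pre-quantized checkpoint and detects the
# scheme from the model config; "int8" may not be an accepted value in your version.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="int8",
)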
# bitsandbytes (4-bit NF4 — compatible but slower than AWQ)
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B",
quantization="bitsandbytes",
load_format="bitsandbytes",
)
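For intuition about what 8-bit quantization does to the numbers, here is a toy round trip using symmetric absmax scaling. This is a generic textbook scheme shown as an illustration only; it is not the AWQ or GPTQ algorithm, and the function names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to int8 codes."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to the symmetric int8 range after rounding to the nearest code
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.02, -0.5, 0.31, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why a single large outlier (it inflates the scale) hurts accuracy and why methods such as AWQ treat salient weights specially.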
LoRA Adapter Serving
# Start server with LoRA support and register an adapter under a name
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-lora \
--lora-modules sql-adapter=/adapters/sql-lora \
--max-loras 4 \
--max-lora-rank 32 \
--lora-extra-vocab-size 256
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
# Select the adapter by passing its registered name as the model;
# requests that name the base model run without any adapter
response = client.chat.completions.create(
    model="sql-adapter",
    messages=[{"role": "user", "content": "Write SQL to find top 10 customers"}],
)
print(response.choices[0].message.content)
Speculative Decoding
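from vllm import LLM
# Draft-model speculative decoding (lower latency for large models).
# The keyword arguments below follow older vLLM releases; newer releases take
# the same settings as a single speculative_config dict, so check your version.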
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct", # target (large)
speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct", # draft (small)
num_speculative_tokens=5, # tokens to draft ahead
speculative_draft_tensor_parallel_size=1,
tensor_parallel_size=4,
)
# n-gram speculative decoding (no draft model needed)
llm = LLM(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
speculative_model="[ngram]",
num_speculative_tokens=5,
ngram_prompt_lookup_min=4,
ngram_prompt_lookup_max=4,
)
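The n-gram variant needs no draft model because the draft comes from the context itself. The following is a simplified sketch of that prompt-lookup idea, not vLLM's implementation; the function name and the sample token ids are assumptions. The target model would then verify the proposed tokens and keep only the accepted prefix.

```python
def propose_ngram_draft(token_ids, ngram_size, num_speculative_tokens):
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Returns the tokens that followed the most recent earlier match, or an
    empty list when the trailing n-gram has not occurred before.
    """
    if len(token_ids) <= ngram_size:
        return []
    suffix = token_ids[-ngram_size:]
    # Scan start positions right to left, excluding the suffix itself
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == suffix:
            follow = start + ngram_size
            return token_ids[follow:follow + num_speculative_tokens]
    return []


context = [5, 6, 7, 8, 9, 10, 11, 3, 5, 6, 7, 8]
print(propose_ngram_draft(context, ngram_size=4, num_speculative_tokens=5))
```

This works best when the output repeats spans of the input (summarization, code editing, retrieval-augmented answers); on free-form text there are few matches and little speedup.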
Async Engine for Production Servers
import asyncio
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm import SamplingParams
engine_args = AsyncEngineArgs(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
dtype="bfloat16",
max_model_len=8192,
gpu_memory_utilization=0.90,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
async def generate(prompt: str, request_id: str) -> str:
    params = SamplingParams(temperature=0.7, max_tokens=256)
    results_generator = engine.generate(prompt, params, request_id)
    full_output = ""
    # Keep only the final result; earlier iterations are partial outputs
    async for request_output in results_generator:
        if request_output.finished:
            full_output = request_output.outputs[0].text
    return full_output
# Run concurrent requests efficiently
async def main():
    tasks = [
        generate(f"Question {i}: ...", f"req-{i}")
        for i in range(100)
    ]
    results = await asyncio.gather(*tasks)
    return results
if __name__ == "__main__":
    asyncio.run(main())
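The engine queues requests it cannot schedule yet, but a caller submitting a very large job may still want to cap how many requests are pending at once. One possible pattern is a semaphore around each call, sketched below. `fake_generate` is a hypothetical stand-in for an engine-backed coroutine so the example runs without a GPU, and the limit of 8 is an arbitrary assumption.

```python
import asyncio


async def fake_generate(prompt: str, request_id: str) -> str:
    """Hypothetical stand-in for the engine-backed generate() coroutine."""
    await asyncio.sleep(0.01)
    return f"{request_id}: answer to {prompt!r}"


async def run_bounded(prompts, max_in_flight: int = 8):
    """Run all prompts, keeping at most max_in_flight requests pending."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def worker(index: int, prompt: str) -> str:
        async with semaphore:
            return await fake_generate(prompt, f"req-{index}")

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(
        *(worker(i, p) for i, p in enumerate(prompts))
    )


results = asyncio.run(run_bounded([f"Question {i}" for i in range(20)]))
print(len(results), results[0])
```

Setting the cap too low starves the batch scheduler, so it should stay comfortably above the number of sequences the server can run concurrently.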
Common Workflows
Workflow 1: Benchmark Throughput Before Deployment
# Throughput benchmark (requests/second and tokens/second)
vllm bench throughput \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--num-prompts 1000 \
--input-len 512 \
--output-len 256 \
--dtype bfloat16
# Latency benchmark (end-to-end latency of one batch; time-to-first-token and
# TPOT come from a serving benchmark against a running server)
vllm bench latency \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--input-len 512 \
--output-len 256 \
--batch-size 1 \
--num-iters 50
Workflow 2: Docker Production Deployment
# docker-compose.yml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - CUDA_VISIBLE_DEVICES=0,1,2,3
    volumes:
      - ${HF_CACHE}:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Meta-Llama-3.1-70B-Instruct
      --tensor-parallel-size 4
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --dtype bfloat16
      --served-model-name llama-70b
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
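Loading a 70B model can take several minutes, so a deployment script usually waits for the server before routing traffic to it. A minimal readiness poll might look like the sketch below. It assumes the server's /health endpoint is reachable on the published port 8000; the function names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts: int = 60, delay_s: float = 5.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


# Usage against the compose service (assumes port 8000 is published):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8000/health"))
```

Passing the probe in as a callable keeps the retry logic testable without a running server.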
Workflow 3: Custom OpenAI-Compatible Integration
# Point the existing OpenAI client at the vLLM server; only the base URL and key change
import os
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"  # variable read by openai>=1.0
os.environ["OPENAI_API_KEY"] = "dummy"
# Your existing OpenAI code works unchanged
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
)
Tips and Best Practices
- gpu_memory_utilization=0.90 is a safe default; increase to 0.95 if you have headroom, decrease if you see CUDA OOM errors.
- max_model_len caps the context length per request, which bounds the KV cache each sequence can claim. Set it to your actual maximum prompt + output length, not the model's theoretical maximum.
- Continuous batching is automatic: vLLM dynamically combines in-flight requests, so you don't need to batch requests yourself.
- AWQ quantization provides the best throughput/quality tradeoff for 4-bit; prefer it over GPTQ for inference.
- Tensor parallelism scales well up to 8 GPUs within one node, though rarely perfectly linearly because of inter-GPU communication; use pipeline_parallel_size for models that don't fit even with tensor parallelism.
- Speculative decoding can reduce latency by roughly 1.5–3× for output-heavy chatbot workloads; it generally doesn't help batch/throughput use cases.
- LoRA hot-swapping lets you serve dozens of fine-tunes from one GPU, which is far more efficient than running a separate server per adapter.
- Monitor KV cache utilization via the /metrics Prometheus endpoint (running and waiting request counts, gpu_cache_usage_perc).
- --max-num-seqs limits the number of sequences scheduled concurrently; tune it if you see scheduling delays under high load.
- Flash Attention is used automatically on Ampere+ GPUs. On H100s, use bfloat16 rather than explicitly setting --dtype half.
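The memory-related tips above come down to simple arithmetic: each cached token costs a fixed number of bytes, so the KV budget determines how many tokens (and therefore how many concurrent requests of a given length) fit. The sketch below assumes a Llama-3.1-8B-style shape (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache and a hypothetical 40 GiB left for the KV cache after weights; the function names are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(kv_budget_gib: float, per_token_bytes: int) -> int:
    """How many tokens fit in a KV-cache budget given in GiB."""
    return int(kv_budget_gib * 2**30) // per_token_bytes


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                            # 131072 bytes, i.e. 128 KiB per token
total_tokens = max_cached_tokens(40.0, per_token)
print(total_tokens)                         # 327680 tokens in a 40 GiB budget
print(total_tokens // 8192)                 # 40 full-length requests at max_model_len=8192
```

Halving max_model_len does not change the total token capacity, but it doubles the number of worst-case requests that can be resident at once.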