vLLM Cheat Sheet

Overview

vLLM is a production-grade LLM serving system that achieves high throughput through PagedAttention, a memory management algorithm inspired by OS paging that stores the KV cache in fixed-size blocks and nearly eliminates fragmentation. This keeps KV cache memory waste close to zero and, in the vLLM project's own benchmarks, yields up to 24× higher throughput than serving with HuggingFace Transformers.
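
The paging idea can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the class name `ToyBlockAllocator`, the 16-token block size, and the 8-block pool are assumptions chosen for illustration. Each sequence owns a list of fixed-size blocks, so the only unused memory is the tail of its last block.

```python
import math


class ToyBlockAllocator:
    """Toy paged KV cache (hypothetical, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_tokens(self, seq_id: str, num_tokens: int) -> None:
        """Grow a sequence, allocating a new block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("no free KV blocks; a real scheduler would queue or preempt")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]

    def wasted_slots(self) -> int:
        """Token slots allocated but not yet filled."""
        return sum(
            len(table) * self.block_size - self.seq_lens[seq_id]
            for seq_id, table in self.block_tables.items()
        )


alloc = ToyBlockAllocator(num_blocks=8, block_size=16)
alloc.append_tokens("a", 20)  # two blocks, second one partly filled
alloc.append_tokens("b", 16)  # one full block
alloc.append_tokens("a", 12)  # fills a's second block, no new allocation
print(alloc.block_tables, alloc.wasted_slots())
```

Because blocks need not be contiguous, a freed block can be reused by any other sequence, which is what removes external fragmentation.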

Key features: OpenAI-compatible REST API (drop-in replacement), continuous batching (dynamically fills GPU with requests), tensor parallelism for multi-GPU inference, quantization support (AWQ, GPTQ, FP8, INT8), speculative decoding for latency reduction, and LoRA adapter serving.
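
To see why continuous batching helps, the following toy simulation counts decode steps for static batching and for a scheduler that refills a slot as soon as a request finishes. It is a simplified sketch under stated assumptions: the function names are hypothetical, every request needs at least one output token, and prefill cost and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """A finished request's slot is refilled from the queue before the next step."""
    waiting = deque(output_lens)
    running = []  # remaining output tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step advances every running request by one token
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [2, 8, 2, 8, 2, 2, 2, 2]  # output lengths in tokens (made-up workload)
print(static_batching_steps(lens, 4), continuous_batching_steps(lens, 4))
```

The short requests no longer wait for the long ones in their batch, so the same work finishes in fewer steps.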

Minimum requirements: NVIDIA GPU with compute capability 7.0 or higher (Volta/Turing and newer; Ampere+ for FlashAttention, Ada Lovelace or Hopper for native FP8 compute), CUDA 11.8+, Python 3.9+.

Installation

# Standard install (CUDA 12.1)
pip install vllm

# Specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

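# CPU-only (for testing, not production)
# Note: the official docs describe building from source (or using a CPU Docker image)
# for CPU support; verify this package name before relying on it.
pip install vllm-cpu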

# ROCm (AMD GPU)
pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.1

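# Docker (recommended for production)
# --ipc=host shares the host's shared memory, which PyTorch uses between processes;
# HF_TOKEN is needed for gated models such as Llama 3.1.
docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all \
  --ipc=host \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct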

# Verify
python -c "import vllm; print(vllm.__version__)"

Core CLI Commands

| Command | Description |
|---|---|
| vllm serve <model> | Start OpenAI-compatible API server |
| vllm serve <model> --port 8000 | Custom port |
| vllm serve <model> --tensor-parallel-size 4 | Multi-GPU inference |
| vllm serve <model> --quantization awq | Load AWQ quantized model |
| vllm serve <model> --max-model-len 4096 | Limit context window |
| vllm serve <model> --gpu-memory-utilization 0.95 | GPU memory fraction |
| vllm serve <model> --dtype bfloat16 | Override dtype |
| vllm serve <model> --enable-lora --max-loras 4 | Enable LoRA serving |
| vllm serve <model> --api-key token-abc123 | Require API key |
| vllm serve <model> --served-model-name my-model | Custom model name |
| vllm bench throughput --model <model> | Benchmark throughput |
| vllm bench latency --model <model> | Benchmark latency |

Configuration

# Single GPU server
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --dtype auto

# Multi-GPU (tensor parallel)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --dtype bfloat16

# AWQ quantized model
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq \
  --max-model-len 4096

# With LoRA support
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 32

# Environment variables
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export CUDA_VISIBLE_DEVICES=0,1,2,3
export HF_TOKEN=hf_...                     # for gated models

Core API Reference

| API | Description |
|---|---|
| LLM(model, ...) | Offline inference engine |
| llm.generate(prompts, sampling_params) | Batch generate completions |
| llm.chat(messages, sampling_params) | Chat-style generation |
| SamplingParams(temperature, top_p, max_tokens, ...) | Generation config |
| llm.encode(prompts) | Get embeddings (embedding models only) |
| AsyncLLMEngine | Async engine for server integration |
| AsyncLLMEngine.generate(prompt, params, request_id) | Async streaming generation |
| GET /v1/models | List available models (API server) |
| POST /v1/completions | OpenAI completions endpoint |
| POST /v1/chat/completions | OpenAI chat completions endpoint |
| POST /v1/embeddings | Embeddings endpoint |

Advanced Usage

Offline Batch Inference

from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
    tensor_parallel_size=1,         # number of GPUs
)

# Sampling parameters
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_tokens=512,
    repetition_penalty=1.1,
    stop=["</s>", "<|eot_id|>"],   # stop sequences
)

# Batch inference (highly efficient — vLLM schedules all at once)
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to reverse a linked list.",
    "What are the main causes of World War I?",
    "Summarize the theory of relativity.",
]

outputs = llm.generate(prompts, params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text}")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print()
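
What `temperature`, `top_k`, and `top_p` do to the next-token distribution can be sketched in plain Python. This is an illustrative simplification, not vLLM's sampler: the function name `filter_distribution` is hypothetical, and real engines work on tensors and may differ in how they order and renormalize the filters.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: prob} after temperature scaling, top-k, then top-p filtering.

    temperature must be > 0; top_k=0 disables top-k.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    # Sort (probability, token_id) pairs from most to least likely.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:
        probs = probs[:top_k]
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


logits = [2.0, 1.0, 0.0, -1.0]
print(filter_distribution(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lower temperatures sharpen the distribution, while `top_k` and `top_p` cut off the unlikely tail before a token is drawn.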

Chat API (OpenAI-compatible Client)

from openai import OpenAI

# Point to vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",              # any string if no auth configured
)

# List models
models = client.models.list()
print([m.id for m in models.data])

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain PagedAttention."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about GPUs."}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
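
Under the hood, `stream=True` delivers server-sent events: lines of the form `data: {json}` ending with `data: [DONE]`. The sketch below reassembles text from such lines without the OpenAI client; the function name `collect_stream_text` and the sample payloads are illustrative assumptions, trimmed to the fields the code reads.

```python
import json


def collect_stream_text(sse_lines):
    """Join the content deltas from OpenAI-style 'data: ...' event lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:  # the first chunk often carries only the role
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": ", GPU"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))
```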

Multi-GPU Tensor Parallelism

from vllm import LLM, SamplingParams

# 70B model across 4 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,         # split model across 4 GPUs
    pipeline_parallel_size=1,       # pipeline stages (for very large models)
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)

# Tensor + pipeline parallelism (for models that do not fit on one node's GPUs
# with tensor parallelism alone; 16 GPUs normally means multiple nodes)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,       # 8 × 2 = 16 GPUs total
    dtype="bfloat16",
)
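
A rough way to choose parallelism settings is to estimate the weight memory each GPU must hold. The helpers below are a back-of-the-envelope sketch: the function names are hypothetical, the split is assumed to be perfectly even, and KV cache, activations, and runtime overhead are not included. The 64 attention heads used in the example are an assumption taken from the published Llama 3.1 70B configuration.

```python
def weight_gib_per_gpu(params_billion, bytes_per_param,
                       tensor_parallel_size, pipeline_parallel_size=1):
    """Approximate weight memory per GPU in GiB, assuming an even split."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    num_gpus = tensor_parallel_size * pipeline_parallel_size
    return total_bytes / num_gpus / 2**30


def valid_tensor_parallel_sizes(num_attention_heads, max_gpus=8):
    """Tensor-parallel sizes that divide the attention head count evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


# 70B parameters in bfloat16 (2 bytes each) over 4 GPUs
print(round(weight_gib_per_gpu(70, 2, 4), 1))
# 405B parameters in bfloat16 over 8 x 2 = 16 GPUs
print(round(weight_gib_per_gpu(405, 2, 8, 2), 1))
# Candidate tensor-parallel sizes for a model with 64 attention heads
print(valid_tensor_parallel_sizes(64))
```

Whatever remains of each GPU's memory budget after weights is what the KV cache can use, so a tighter fit means fewer concurrent sequences.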

Quantization

from vllm import LLM

# AWQ (Activation-aware Weight Quantization): 4-bit weights, small accuracy loss
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    quantization="awq",
    dtype="half",
)

# GPTQ — 4-bit quantization
llm = LLM(
    model="TheBloke/Llama-2-13B-GPTQ",
    quantization="gptq",
    dtype="float16",
)

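# FP8 (native FP8 compute needs Ada Lovelace or Hopper GPUs, e.g. L40S/H100;
# Ampere can run FP8 checkpoints as weight-only via Marlin kernels)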
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization="fp8",
    dtype="bfloat16",
)

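# INT8 (W8A8: weights and activations in 8-bit). "int8" is not a quantization
# value vLLM accepts; W8A8 is loaded from a pre-quantized checkpoint (for example
# one made with llm-compressor) and detected from the checkpoint's config.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",  # example W8A8 checkpoint; verify the name on the Hub
)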

# bitsandbytes (4-bit NF4 — compatible but slower than AWQ)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
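
The basic idea behind 8-bit weight quantization can be shown with a symmetric round-to-nearest scheme. This sketch is deliberately simpler than AWQ or GPTQ, which choose scales using calibration data; the function names are hypothetical and the per-tensor scale is an assumption made for brevity.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each weight is stored in one byte plus a shared scale, and the round-trip error per weight is at most half the scale.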

LoRA Adapter Serving

# Start server with LoRA support and register an adapter under a name
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-adapter=/adapters/sql-lora \
  --max-loras 4 \
  --max-lora-rank 32 \
  --lora-extra-vocab-size 256

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

# Select the adapter by passing its registered name as the model
response = client.chat.completions.create(
    model="sql-adapter",
    messages=[{"role": "user", "content": "Write SQL to find top 10 customers"}],
)
print(response.choices[0].message.content)
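
Why many adapters can share one base model is easier to see from the LoRA forward pass itself: the base weight `W` stays fixed and each adapter adds a low-rank correction `(alpha / rank) * B @ A`. The following is a minimal pure-Python sketch with hypothetical helper names and tiny made-up matrices, not how vLLM batches adapters internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x), with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, rank):
    """Parameters an adapter adds for one projection matrix."""
    return rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity, for illustration)
A = [[1.0, 1.0]]              # rank-1 down projection
B = [[1.0], [0.0]]            # rank-1 up projection
print(lora_forward(W, A, B, [2.0, 3.0], alpha=2, rank=1))
print(lora_param_count(4096, 4096, 32), 4096 * 4096)
```

For a 4096 × 4096 projection, a rank-32 adapter adds 262,144 parameters against 16,777,216 in the base matrix, which is why `--max-lora-rank` matters more for memory than the number of adapters registered.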

Speculative Decoding

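from vllm import LLM

# Draft model speculative decoding (lower latency for large models).
# These keyword arguments match older vLLM releases; newer releases take a
# speculative_config dict instead, so check the docs for your version.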
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",   # target (large)
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # draft (small)
    num_speculative_tokens=5,   # tokens to draft ahead
    speculative_draft_tensor_parallel_size=1,
    tensor_parallel_size=4,
)

# n-gram speculative decoding (no draft model needed)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_min=4,
    ngram_prompt_lookup_max=4,
)
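
The draft-then-verify loop behind speculative decoding can be sketched with two toy "models" that map a context to its next token. This is the greedy variant only, under stated assumptions: the function and model names are hypothetical, and a real system verifies all drafted positions in one batched forward pass of the target model rather than one call per token.

```python
def speculative_step(target_next, draft_next, context, k):
    """One greedy speculative step; returns the tokens accepted in this step."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target model checks each proposed position in order.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all k accepted: the target adds one more token
    return accepted


def toy_target(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [0], k=5))
```

The accepted tokens always match what the target alone would have produced greedily, so the draft only changes speed, not output. The gain depends on how often the draft agrees with the target.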

Async Engine for Production Servers

import asyncio
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm import SamplingParams

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

async def generate(prompt: str, request_id: str):
    params = SamplingParams(temperature=0.7, max_tokens=256)
    results_generator = engine.generate(prompt, params, request_id)
    full_output = ""
    async for request_output in results_generator:
        if request_output.finished:
            full_output = request_output.outputs[0].text
    return full_output

# Run concurrent requests efficiently
async def main():
    tasks = [
        generate(f"Question {i}: ...", f"req-{i}")
        for i in range(100)
    ]
    results = await asyncio.gather(*tasks)
    return results

if __name__ == "__main__":
    results = asyncio.run(main())
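
When forwarding engine output to a client as a stream, it is usually the newly generated text that should be sent, not the whole text so far. The sketch below converts cumulative snapshots into deltas. It uses a stand-in async generator instead of a real engine, and it assumes each snapshot contains the full text generated so far, which should be checked against the vLLM version in use.

```python
import asyncio


async def fake_cumulative_stream(pieces):
    """Stand-in for an engine stream: yields the full text so far after each piece."""
    text = ""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        text += piece
        yield text


async def stream_deltas(cumulative_stream):
    """Turn cumulative snapshots into incremental deltas."""
    sent = 0
    async for text in cumulative_stream:
        delta = text[sent:]
        sent = len(text)
        if delta:
            yield delta


async def demo():
    return [d async for d in stream_deltas(fake_cumulative_stream(["Hel", "lo", " GPU"]))]


print(asyncio.run(demo()))
```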

Common Workflows

Workflow 1: Benchmark Throughput Before Deployment

# Throughput benchmark (requests/s and tokens/s)
# Older releases ship these as benchmarks/benchmark_throughput.py and
# benchmarks/benchmark_latency.py in the vLLM repository.
vllm bench throughput \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 256 \
  --dtype bfloat16

# Latency benchmark (end-to-end latency of a fixed-size batch)
vllm bench latency \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 256 \
  --batch-size 1 \
  --num-iters 50
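
The headline throughput numbers follow from the run parameters and the measured wall-clock time. The helper below shows the arithmetic; its name is hypothetical and the 100-second elapsed time is a placeholder, not a measurement.

```python
def summarize_run(num_requests, input_len, output_len, elapsed_s):
    """Derive throughput figures from a fixed-length benchmark run."""
    total_tokens = num_requests * (input_len + output_len)
    return {
        "requests_per_s": num_requests / elapsed_s,
        "output_tokens_per_s": num_requests * output_len / elapsed_s,
        "total_tokens_per_s": total_tokens / elapsed_s,
    }


# Same shape as the throughput command above; elapsed_s is a made-up placeholder.
print(summarize_run(num_requests=1000, input_len=512, output_len=256, elapsed_s=100.0))
```

Comparing output tokens per second across configurations is usually more informative than requests per second, because it does not depend on the chosen output length.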

Workflow 2: Docker Production Deployment

# docker-compose.yml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ipc: host   # share host shared memory; tensor-parallel workers rely on it
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - CUDA_VISIBLE_DEVICES=0,1,2,3
    volumes:
      - ${HF_CACHE}:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Meta-Llama-3.1-70B-Instruct
      --tensor-parallel-size 4
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --dtype bfloat16
      --served-model-name llama-70b
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

Workflow 3: Custom OpenAI-Compatible Integration

# Point the openai client at the vLLM server; the calling code stays the same
import os
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"  # read by openai>=1.0 (older clients used OPENAI_API_BASE)
os.environ["OPENAI_API_KEY"] = "dummy"

# Your existing OpenAI code works unchanged
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

Tips and Best Practices

  • gpu_memory_utilization=0.90 is a safe default; increase to 0.95 if not OOMing, decrease if you see CUDA OOM errors.
  • max_model_len caps the tokens (prompt + output) a single request may use, and therefore how much KV cache each sequence can claim; set it to your actual maximum, not the model’s theoretical maximum.
  • Continuous batching is automatic — vLLM dynamically combines in-flight requests; you don’t need to batch requests yourself.
  • AWQ is a common choice for 4-bit inference and usually holds quality well; benchmark it against GPTQ on your own workload before committing.
  • Tensor parallelism scales well within a single node (typically up to 8 GPUs) but not perfectly linearly, because every layer adds all-reduce communication; add pipeline_parallel_size for models that still don’t fit.
  • Speculative decoding can reduce latency by roughly 1.5–3× for latency-bound, output-heavy workloads such as chat; it generally doesn’t help throughput-bound batch jobs.
  • LoRA hot-swapping lets you serve dozens of fine-tunes from one GPU — far more efficient than running separate servers per adapter.
  • Monitor load and KV cache utilization via the /metrics Prometheus endpoint (for example vllm:num_requests_running, vllm:num_requests_waiting, vllm:gpu_cache_usage_perc; exact metric names vary by version).
  • --max-num-seqs limits concurrent requests; tune this if you see scheduling delays under high load.
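  • FlashAttention is selected automatically on Ampere+ GPUs and supports both FP16 and BF16. For models trained in BF16, prefer --dtype bfloat16 (or auto) over --dtype half on GPUs with native BF16 support such as H100, since FP16’s narrower range can overflow.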
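
The max_model_len and gpu_memory_utilization tips can be made concrete by estimating KV cache size per token. The sketch below uses the standard formula (keys plus values, per layer, per KV head); the function name is hypothetical, and the model dimensions are assumptions taken from the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128).

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (the factor 2 covers key and value)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_sequence = per_token * 8192  # one sequence at max_model_len=8192

print(per_token // 1024, "KiB per token")
print(per_sequence / 2**30, "GiB per full-length sequence")
```

Under these assumptions a full 8192-token sequence needs about 1 GiB of KV cache in bfloat16, so the memory left after loading weights bounds how many such sequences can run concurrently.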