SGLang Cheat Sheet
Overview
SGLang (Structured Generation Language) is a high-performance serving framework for large language models. Its RadixAttention mechanism automatically reuses the KV cache across requests that share a prefix, which speeds up batch inference and multi-turn conversations. SGLang also provides a frontend language for programming LLM interactions, with primitives for generation, selection, branching, and parallelism.
The framework is optimized for both throughput and latency, supporting continuous batching, tensor parallelism, quantization (AWQ, GPTQ, FP8), and structured output generation (JSON mode, regex constraints). SGLang serves as both a production inference server and a programming framework for complex LLM workflows.
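The prefix reuse behind RadixAttention can be pictured with a toy token-level trie. This is only a sketch for intuition: `ToyPrefixCache` and the token ids are hypothetical, and SGLang's real radix tree stores KV tensors, compresses edges, and evicts entries, none of which is modeled here.

```python
from typing import Dict, List


class PrefixCacheNode:
    """One node per cached token; children are keyed by the next token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixCacheNode"] = {}


class ToyPrefixCache:
    """Toy trie for illustration only, not SGLang's radix tree."""

    def __init__(self) -> None:
        self.root = PrefixCacheNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record a processed sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixCacheNode())


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]             # hypothetical ids for a shared system prompt
cache.insert(system_prompt + [10, 11])   # first request: nothing cached yet

second = system_prompt + [20, 21, 22]    # second request shares the system prompt
reused = cache.match_prefix(second)
print(f"reused {reused} of {len(second)} tokens")  # reused 4 of 7 tokens
```

Only the unmatched suffix would need a fresh prefill, which is why workloads with long shared prompts benefit most.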
Installation
pip install "sglang[all]"
# With FlashInfer (recommended for speed)
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
Server Usage
Launch Server
# Basic server launch
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--port 30000
# With tensor parallelism (multi-GPU)
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 4 \
--port 30000
# With quantization (pre-quantized AWQ checkpoint; substitute any AWQ repo you use)
python -m sglang.launch_server \
--model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--quantization awq \
--port 30000
# FP8 quantization
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--port 30000
# With memory optimization
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--mem-fraction-static 0.85 \
--max-running-requests 64 \
--port 30000
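To get a feel for what a given --mem-fraction-static leaves for the KV cache, a back-of-the-envelope estimate helps. The sketch below assumes the static budget covers model weights plus the KV pool and ignores activations, CUDA graphs, and fragmentation; the helper names, the 80 GiB GPU, and the 15 GiB weight size are assumptions for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def estimate_kv_tokens(gpu_mem_gib: float, mem_fraction_static: float,
                       weights_gib: float, bytes_per_token: int) -> int:
    """Rough token capacity of the KV pool once the weights are loaded."""
    pool_bytes = (gpu_mem_gib * mem_fraction_static - weights_gib) * 1024**3
    return max(0, int(pool_bytes // bytes_per_token))


# Llama-3.1-8B: 32 layers, 8 KV heads (GQA), head dim 128, 16-bit cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Assumed hardware: 80 GiB GPU, about 15 GiB of 16-bit weights.
tokens = estimate_kv_tokens(80, 0.85, 15, per_token)
print(tokens)  # upper bound on tokens the pool could hold under these assumptions
```

Treat the result as an upper bound; the server logs the real pool size at startup.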
OpenAI-Compatible API
# Chat completions
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "What is RAG?"}
],
"max_tokens": 256,
"temperature": 0.7
}'
# Streaming
curl -X POST http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Write a story"}],
"stream": true,
"max_tokens": 500
}'
# Completions (non-chat)
curl -X POST http://localhost:30000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "The capital of France is",
"max_tokens": 32
}'
Python Client (OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
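The streaming request shown with curl can also be consumed from the OpenAI SDK. The following is a minimal sketch: `collect_stream` and `stream_story` are hypothetical helper names, and `stream_story` needs a running server on port 30000.

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print and concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:          # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:                      # role-only and final chunks have no text
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)


def stream_story(base_url: str = "http://localhost:30000/v1") -> str:
    """Request a streamed completion; the SDK is imported lazily on purpose."""
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="none")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Write a story"}],
        max_tokens=500,
        stream=True,
    )
    return collect_stream(stream)
```

Calling `stream_story()` prints tokens as they arrive and returns the full text once the stream ends.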
SGLang Frontend
Basic Programs
import sglang as sgl

@sgl.function
def qa_program(s, question):
    s += sgl.system("You are a helpful AI assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

# Point the frontend at a running server, then run the program
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa_program.run(question="What is machine learning?")
print(state["answer"])
Structured Generation
import json
import sglang as sgl

@sgl.function
def extract_info(s, text):
    s += sgl.user(f"Extract structured information from: {text}")
    s += sgl.assistant(sgl.gen("result",
        max_tokens=500,
        regex=r'\{"name": "[^"]+", "age": \d+, "city": "[^"]+"\}'
    ))

# JSON mode (the schema is passed as a JSON string)
answer_schema = json.dumps({
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
        "sources": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["answer", "confidence"]
})

@sgl.function
def json_output(s, query):
    s += sgl.user(query)
    s += sgl.assistant(sgl.gen("output",
        max_tokens=500,
        json_schema=answer_schema
    ))
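Because the regex constraint fixes the exact shape of the output, the result can be checked and parsed on the client with the same pattern. A minimal sketch, where `parse_person` is a hypothetical helper and the sample string is hand-written rather than model output:

```python
import json
import re

# Same pattern as the regex constraint used in extract_info.
PERSON_PATTERN = re.compile(r'\{"name": "[^"]+", "age": \d+, "city": "[^"]+"\}')


def parse_person(raw: str) -> dict:
    """Check a generated string against the pattern, then parse it as JSON."""
    if not PERSON_PATTERN.fullmatch(raw):
        raise ValueError(f"output does not match the constraint: {raw!r}")
    return json.loads(raw)


sample = '{"name": "Ada", "age": 36, "city": "London"}'  # hand-written example
person = parse_person(sample)
print(person["age"] + 1)  # 37
```

A check like this is also a quick way to debug a pattern before sending it to the server.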
Selection (Constrained Choice)
@sgl.function
def classify(s, text):
    s += sgl.user(f"Classify the sentiment: {text}")
    s += sgl.assistant(sgl.select("sentiment", ["positive", "negative", "neutral"]))

state = classify.run(text="I love this product!")
print(state["sentiment"])  # one of the three choices, e.g. "positive"
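Conceptually, a constrained choice scores each candidate string under the model and returns the best one. The sketch below uses length-normalized log-probabilities, which is one common scoring rule; treat it as an assumption about the idea rather than a description of SGLang's internals. `pick_choice` and the numbers are made up.

```python
from typing import Dict, List


def pick_choice(token_logprobs: Dict[str, List[float]]) -> str:
    """Return the choice whose tokens have the highest mean log-probability."""
    return max(
        token_logprobs,
        key=lambda choice: sum(token_logprobs[choice]) / len(token_logprobs[choice]),
    )


# Made-up per-token log-probabilities for each candidate continuation.
scores = {
    "positive": [-0.2, -0.1],
    "negative": [-2.5, -1.9],
    "neutral": [-1.2],
}
print(pick_choice(scores))  # positive
```

Normalizing by length keeps multi-token choices from being penalized just for being longer.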
Branching and Parallelism
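@sgl.function
def multi_perspective(s, topic):
    s += sgl.user(f"Analyze {topic}")
    # Parallel branches: each fork shares the prompt above and opens its
    # answer with a different angle so the three generations diverge
    angles = ["technical", "business", "ethical"]
    forks = s.fork(len(angles))
    for f, angle in zip(forks, angles):
        f += sgl.assistant(f"From a {angle} perspective: " + sgl.gen(angle, max_tokens=200))
    forks.join()
    # Fork outputs are not appended to s automatically, so add them explicitly
    s += sgl.assistant("\n".join(f"{a}: {f[a]}" for f, a in zip(forks, angles)))
    s += sgl.user("Summarize all perspectives")
    s += sgl.assistant(sgl.gen("summary", max_tokens=300))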
Server Configuration
Launch Options
| Parameter | Description | Default |
|---|---|---|
| --model-path | Hugging Face model ID or local path | Required |
| --port | Server port | 30000 |
| --host | Server host | 127.0.0.1 |
| --tp | Tensor parallelism degree | 1 |
| --dp | Data parallelism degree | 1 |
| --quantization | Quantization method (awq, gptq, fp8) | None |
| --mem-fraction-static | Fraction of GPU memory for static allocation (model weights + KV cache pool) | 0.88 (version- and TP-dependent) |
| --max-running-requests | Max concurrent requests | Auto |
| --context-length | Override model context length | Model default |
| --schedule-policy | Scheduling policy (lpm, random, fcfs) | lpm |
| --disable-radix-cache | Disable RadixAttention | False |
| --chunked-prefill-size | Chunk size for prefill (tokens) | 8192 |
| --log-level | Logging level | info |
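When the server is started from a script, the options in the table can be kept in a dictionary and turned into a command line. A minimal sketch, assuming a hypothetical `build_launch_cmd` helper; only flags listed above are used.

```python
import shlex
from typing import Dict, List, Union

OptionValue = Union[str, int, float, bool]


def build_launch_cmd(options: Dict[str, OptionValue]) -> List[str]:
    """Turn {'tp': 4, 'disable-radix-cache': True} into an argv list."""
    cmd = ["python", "-m", "sglang.launch_server"]
    for name, value in options.items():
        flag = f"--{name}"
        if isinstance(value, bool):
            if value:                  # boolean switches take no argument
                cmd.append(flag)
        else:
            cmd += [flag, str(value)]
    return cmd


cmd = build_launch_cmd({
    "model-path": "meta-llama/Llama-3.1-8B-Instruct",
    "port": 30000,
    "mem-fraction-static": 0.85,
    "disable-radix-cache": False,      # omitted from argv because it is False
})
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.Popen(cmd)` to start the server.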
Multi-GPU Serving
# Tensor parallelism across 4 GPUs
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 4 \
--port 30000
# Data parallelism (multiple replicas)
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--dp 2 \
--port 30000
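A quick sanity check before launching: the sketch below assumes each data-parallel replica is sharded across --tp GPUs, so the total is tp × dp, and that tensor parallelism needs a degree that divides the attention head count. Both helper names are hypothetical.

```python
def gpus_required(tp: int = 1, dp: int = 1) -> int:
    """Each of the dp replicas is sharded across tp GPUs."""
    if tp < 1 or dp < 1:
        raise ValueError("tp and dp must be positive integers")
    return tp * dp


def tp_divides_heads(num_attention_heads: int, tp: int) -> bool:
    """Tensor parallelism splits attention heads, so tp should divide their count."""
    return num_attention_heads % tp == 0


print(gpus_required(tp=4))          # 4
print(gpus_required(tp=2, dp=2))    # 4
print(tp_divides_heads(64, 4))      # True (Llama-3.1-70B has 64 attention heads)
print(tp_divides_heads(64, 3))      # False
```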
Advanced Usage
Batch Inference
import sglang as sgl
@sgl.function
def process(s, text):
    s += sgl.user(f"Summarize: {text}")
    s += sgl.assistant(sgl.gen("summary", max_tokens=100))

# Batch processing
texts = ["Document 1 content...", "Document 2 content...", "Document 3 content..."]
states = process.run_batch(
    [{"text": t} for t in texts],
    num_threads=16,
    progress_bar=True
)
for state in states:
    print(state["summary"])
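The shape of a threaded batch call can be illustrated with the standard library alone. This is an analogy under assumptions, not SGLang's implementation: `run_batch_sketch` and `fake_summarize` are hypothetical, and the placeholder function stands in for a model call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch_sketch(fn: Callable[..., str], batch: List[Dict],
                     num_threads: int = 4) -> List[str]:
    """Call fn(**kwargs) for every item concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda kwargs: fn(**kwargs), batch))


def fake_summarize(text: str) -> str:
    """Placeholder for a model call."""
    return text[:10].upper()


results = run_batch_sketch(fake_summarize, [{"text": "alpha doc"}, {"text": "beta doc"}])
print(results)  # ['ALPHA DOC', 'BETA DOC']
```

As with the list of states returned by run_batch, results line up with the inputs by position.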
Embedding Server
# Launch embedding model
python -m sglang.launch_server \
--model-path BAAI/bge-large-en-v1.5 \
--is-embedding \
--port 30000
# Get embeddings
curl -X POST http://localhost:30000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "BAAI/bge-large-en-v1.5",
"input": ["What is RAG?", "Vector databases"]
}'
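Embeddings are typically compared with cosine similarity. The sketch below assumes an OpenAI-style response body (a `data` list of items with `index` and `embedding`) and uses a tiny hand-made response in place of real output.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return dot / norm


def vectors_from_response(response: dict) -> List[List[float]]:
    """Pull vectors out of an OpenAI-style embeddings response, in input order."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Hand-made response; real vectors have hundreds of dimensions.
fake = {"data": [{"index": 1, "embedding": [0.0, 1.0]},
                 {"index": 0, "embedding": [1.0, 0.0]}]}
v0, v1 = vectors_from_response(fake)
print(cosine_similarity(v0, v1))  # 0.0
print(cosine_similarity(v0, v0))  # 1.0
```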
Docker Deployment
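# --shm-size and --ipc=host give the container enough shared memory;
# HF_TOKEN is needed for gated models such as Llama
docker run --gpus all --shm-size 32g --ipc=host -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000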
Benchmarking
# Benchmark throughput
python -m sglang.bench_serving \
--backend sglang \
--port 30000 \
--dataset-name sharegpt \
--num-prompts 1000 \
--request-rate 10
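The headline numbers of a serving benchmark are simple ratios over the run. The sketch below shows the arithmetic on made-up per-request records; `summarize` and its field names are hypothetical and do not mirror the tool's output format.

```python
import statistics
from typing import Dict, List


def summarize(latencies_s: List[float], output_tokens: List[int],
              wall_time_s: float) -> Dict[str, float]:
    """Aggregate per-request measurements into throughput and latency figures."""
    return {
        "request_throughput": len(latencies_s) / wall_time_s,       # requests/s
        "output_token_throughput": sum(output_tokens) / wall_time_s,  # tokens/s
        "mean_latency_s": statistics.mean(latencies_s),
        "median_latency_s": statistics.median(latencies_s),
    }


# Four made-up requests completed within a 4-second window.
stats = summarize([0.5, 1.0, 1.5, 2.0], [100, 200, 300, 400], wall_time_s=4.0)
print(stats)
```

Throughput depends on total wall time, not on the sum of latencies, because requests overlap under continuous batching.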
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce --mem-fraction-static, use quantization |
| Model loading fails | Check HuggingFace token, verify model path |
| Slow first request | Expected: the first request pays for warmup and finds an empty prefix cache; later requests that share a prefix reuse cached KV entries |
| Port already in use | Use different --port value |
| FlashInfer errors | Install correct version for your CUDA/PyTorch |
| TP mismatch | Ensure --tp does not exceed the number of visible GPUs and evenly divides the model's attention head count |
| Low throughput | Increase --max-running-requests, tune batch size |
| Structured gen fails | Check regex/JSON schema syntax |
# Check server health
curl http://localhost:30000/health
# Get model info
curl http://localhost:30000/v1/models
# Monitor metrics (Prometheus format; launch the server with --enable-metrics)
curl http://localhost:30000/metrics
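In scripts, the health endpoint is handy for waiting until a freshly launched server is ready. A minimal standard-library sketch; `http_ok` and `wait_until_ready` are hypothetical helper names and the timeouts are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll probe() until it succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Typical usage would be `wait_until_ready(lambda: http_ok("http://localhost:30000/health"))` before sending the first request.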