SGLang Cheat Sheet

Overview

SGLang (Structured Generation Language) is a high-performance serving framework for large language models. Its RadixAttention mechanism automatically reuses the KV cache across requests that share a prefix, which avoids recomputing the shared tokens and speeds up batch inference and multi-turn conversations. SGLang also provides a frontend language for programming LLM interactions, with primitives for generation, selection, branching, and parallelism.

The framework is optimized for both throughput and latency, supporting continuous batching, tensor parallelism, quantization (AWQ, GPTQ, FP8), and structured output generation (JSON mode, regex constraints). SGLang serves as both a production inference server and a programming framework for complex LLM workflows.
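
To make the prefix-reuse idea behind RadixAttention concrete, here is a minimal sketch. It is not SGLang's implementation: the class names and token IDs are hypothetical, and a plain token-level trie stands in for the real radix tree and its eviction policy. It only shows how a second request can skip the tokens it shares with an earlier one.

```python
from typing import Dict, Sequence


class PrefixNode:
    """One trie node; each outgoing edge is a single token ID."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    """Hypothetical stand-in for a prefix cache (no eviction, no KV tensors)."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_and_insert(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: everything from here on is new and must be computed.
                child = PrefixNode()
                node.children[tok] = child
            else:
                matched += 1
            node = child
        return matched


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]  # made-up token IDs for a shared system prompt
first = cache.match_and_insert(system_prompt + [10, 11])
second = cache.match_and_insert(system_prompt + [20, 21])
print(first, second)  # 0 4
```

The second request reuses the four shared tokens, so only its two new tokens would need prefill in this toy model.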

Installation

pip install "sglang[all]"

# With FlashInfer (recommended for speed; pick the wheel index that matches your CUDA and PyTorch versions)
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

Server Usage

Launch Server

# Basic server launch
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# With tensor parallelism (multi-GPU)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --port 30000

# With quantization
python -m sglang.launch_server \
  --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --port 30000

# FP8 quantization
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --port 30000

# With memory optimization
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.85 \
  --max-running-requests 64 \
  --port 30000

OpenAI-Compatible API

# Chat completions
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "What is RAG?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

# Streaming
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a story"}],
    "stream": true,
    "max_tokens": 500
  }'

# Completions (non-chat)
curl -X POST http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }'
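
With "stream": true, the response arrives as server-sent events rather than one JSON body. The sketch below shows one way to pull the text out of those events. It assumes the usual OpenAI-style chunk layout (`data: {...}` lines, a `choices[].delta.content` field, and a final `data: [DONE]` marker); the helper name and the sample lines are illustrative, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample in the assumed chunk format.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Once"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " upon"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Once upon
```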

Python Client (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers"}],
    max_tokens=256,
    temperature=0.7
)
print(response.choices[0].message.content)
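
For multi-turn conversations, resending the full history each turn keeps the earlier turns as a shared prefix, which is the case the prefix cache is meant to help with. A minimal sketch, assuming a client object shaped like the OpenAI SDK client above; `chat_turn` is a hypothetical helper, not part of SGLang or the SDK.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def chat_turn(
    client: Any,
    history: List[Message],
    user_text: str,
    model: str = "meta-llama/Llama-3.1-8B-Instruct",
) -> str:
    """Send one user turn and append both sides of the exchange to `history`."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model=model, messages=history, max_tokens=256
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage against a running server (not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
#   history = [{"role": "system", "content": "You are helpful."}]
#   chat_turn(client, history, "What is RAG?")
#   chat_turn(client, history, "Give one concrete example.")
```

Each call sends the previous messages unchanged plus one new turn, so successive requests differ only in their suffix.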

SGLang Frontend

Basic Programs

import sglang as sgl

@sgl.function
def qa_program(s, question):
    s += sgl.system("You are a helpful AI assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

# Run program
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa_program.run(question="What is machine learning?")
print(state["answer"])

Structured Generation

import sglang as sgl

@sgl.function
def extract_info(s, text):
    s += sgl.user(f"Extract structured information from: {text}")
    s += sgl.assistant(sgl.gen("result",
        max_tokens=500,
        regex=r'\{"name": "[^"]+", "age": \d+, "city": "[^"]+"\}'
    ))

# JSON schema constraint (the schema is passed to sgl.gen as a JSON string)
import json

answer_schema = json.dumps({
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
        "sources": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["answer", "confidence"]
})

@sgl.function
def json_output(s, query):
    s += sgl.user(query)
    s += sgl.assistant(sgl.gen("output", max_tokens=500, json_schema=answer_schema))
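
Constrained decoding returns a string, so the caller still has to parse it. The sketch below re-checks an output against the same regex used in `extract_info` and then loads it as JSON; `parse_person` and `missing_required` are hypothetical helpers, and the sample strings are made up rather than model output.

```python
import json
import re
from typing import Any, Dict, List

# Same pattern as the regex constraint in extract_info above.
PERSON_PATTERN = re.compile(r'\{"name": "[^"]+", "age": \d+, "city": "[^"]+"\}')


def parse_person(raw: str) -> Dict[str, Any]:
    """Validate a constrained output against the pattern, then parse it."""
    if PERSON_PATTERN.fullmatch(raw) is None:
        raise ValueError(f"output does not match the expected pattern: {raw!r}")
    return json.loads(raw)


def missing_required(obj: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """List the schema's required keys that are absent from a parsed object."""
    return [key for key in schema.get("required", []) if key not in obj]


person = parse_person('{"name": "Ada", "age": 36, "city": "London"}')
print(person["age"])  # 36
print(missing_required({"answer": "42"}, {"required": ["answer", "confidence"]}))  # ['confidence']
```

A check like this is mainly useful when generation stops early at `max_tokens` and leaves the output truncated.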

Selection (Constrained Choice)

@sgl.function
def classify(s, text):
    s += sgl.user(f"Classify the sentiment: {text}")
    s += sgl.assistant(sgl.select("sentiment", ["positive", "negative", "neutral"]))

state = classify.run(text="I love this product!")
print(state["sentiment"])  # one of the three choices, e.g. "positive"
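
One way to picture what `select` does: score every candidate string under the model and return the best one. The sketch below assumes length-normalized scoring (the mean per-token log-probability), which may differ from the method your SGLang version uses. The function name and the log-probability values are invented for illustration.

```python
from typing import Dict, List


def pick_choice(token_logprobs: Dict[str, List[float]]) -> str:
    """Return the choice with the highest mean per-token log-probability."""

    def score(choice: str) -> float:
        logprobs = token_logprobs[choice]
        return sum(logprobs) / len(logprobs)

    return max(token_logprobs, key=score)


# Made-up per-token log-probabilities for each candidate continuation.
fake_logprobs = {
    "positive": [-0.2, -0.1],       # mean -0.15
    "negative": [-2.5],             # mean -2.5
    "neutral": [-1.0, -1.4, -0.9],  # mean -1.1
}
print(pick_choice(fake_logprobs))  # positive
```

Normalizing by length keeps multi-token choices from being penalized just for having more tokens.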

Branching and Parallelism

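@sgl.function
def multi_perspective(s, topic):
    s += sgl.system("You are a concise analyst.")
    perspectives = ["technical", "business", "ethical"]

    # Parallel branches: each fork shares the prefix but gets its own prompt
    forks = s.fork(len(perspectives))
    for f, p in zip(forks, perspectives):
        f += sgl.user(f"Analyze {topic} from a {p} perspective.")
        f += sgl.assistant(sgl.gen(p, max_tokens=200))
    forks.join()

    # Feed the branch results back into the parent state
    combined = "\n\n".join(f"{p}: {f[p]}" for f, p in zip(forks, perspectives))
    s += sgl.user(f"Summarize these perspectives on {topic}:\n\n{combined}")
    s += sgl.assistant(sgl.gen("summary", max_tokens=300))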

Server Configuration

Launch Options

Parameter	Description	Default
`--model-path`	HuggingFace model path	Required
`--port`	Server port	30000
`--host`	Server host	127.0.0.1
`--tp`	Tensor parallelism degree	1
`--dp`	Data parallelism degree	1
`--quantization`	Quantization method (awq, gptq, fp8)	None
`--mem-fraction-static`	Fraction of GPU memory for static allocation (model weights plus KV cache pool)	Auto (typically near 0.88 on one GPU; varies by version)
`--max-running-requests`	Max concurrent requests	Auto
`--context-length`	Override model context length	Model default
`--schedule-policy`	Scheduling policy (lpm, random, fcfs, dfs-weight)	Version-dependent (lpm in older releases, fcfs in newer ones)
`--disable-radix-cache`	Disable RadixAttention	False
`--chunked-prefill-size`	Chunk size for prefill	8192
`--log-level`	Logging level	info
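
When the server is launched from a script, it can help to build the command from a dictionary of options instead of pasting flags by hand. A minimal sketch: `build_launch_command` is a hypothetical helper that only assembles an argument list from the flags in the table above and does not check them against the installed SGLang version.

```python
import shlex
import sys
from typing import List, Union

FlagValue = Union[str, int, float, bool]


def build_launch_command(model_path: str, **options: FlagValue) -> List[str]:
    """Turn keyword options into an sglang.launch_server argument list."""
    cmd = [sys.executable, "-m", "sglang.launch_server", "--model-path", model_path]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            # Boolean options are bare switches such as --disable-radix-cache.
            if value:
                cmd.append(flag)
        else:
            cmd.extend([flag, str(value)])
    return cmd


cmd = build_launch_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    port=30000,
    mem_fraction_static=0.85,
    disable_radix_cache=False,
)
print(shlex.join(cmd))
# To start the server: subprocess.Popen(cmd)
```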

Multi-GPU Serving

# Tensor parallelism across 4 GPUs
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --port 30000

# Data parallelism (multiple replicas)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp 2 \
  --port 30000
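
Before launching, it is worth checking that the parallelism settings fit the machine. The sketch below assumes two rules: a deployment uses `tp * dp` GPUs in total, and the tensor-parallel degree must evenly divide the model's attention head count. `check_parallel_plan` is a hypothetical helper, and the head count would come from the model's `config.json`.

```python
from typing import List


def check_parallel_plan(
    tp: int, dp: int, visible_gpus: int, num_attention_heads: int
) -> List[str]:
    """Return a list of problems with a --tp/--dp plan (empty means it looks fine)."""
    problems = []
    needed = tp * dp  # each of the dp replicas spans tp GPUs
    if needed > visible_gpus:
        problems.append(f"needs {needed} GPUs but only {visible_gpus} are visible")
    if num_attention_heads % tp != 0:
        problems.append(
            f"tp={tp} does not evenly divide {num_attention_heads} attention heads"
        )
    return problems


# 64 is the value to read from num_attention_heads in the model config.
print(check_parallel_plan(tp=4, dp=1, visible_gpus=4, num_attention_heads=64))  # []
print(check_parallel_plan(tp=4, dp=2, visible_gpus=4, num_attention_heads=64))
```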

Advanced Usage

Batch Inference

import sglang as sgl

@sgl.function
def process(s, text):
    s += sgl.user(f"Summarize: {text}")
    s += sgl.assistant(sgl.gen("summary", max_tokens=100))

# Batch processing
texts = ["Document 1 content...", "Document 2 content...", "Document 3 content..."]
states = process.run_batch(
    [{"text": t} for t in texts],
    num_threads=16,
    progress_bar=True
)

for state in states:
    print(state["summary"])

Embedding Server

# Launch embedding model
python -m sglang.launch_server \
  --model-path BAAI/bge-large-en-v1.5 \
  --is-embedding \
  --port 30000

# Get embeddings
curl -X POST http://localhost:30000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-large-en-v1.5",
    "input": ["What is RAG?", "Vector databases"]
  }'
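
The embeddings endpoint returns one vector per input, and a common next step is to compare them. The sketch below assumes the OpenAI-style response layout (a `data` list whose items carry `index` and `embedding`). The tiny two-dimensional vectors are made up so the arithmetic is easy to follow.

```python
import math
from typing import Any, Dict, List


def extract_embeddings(response: Dict[str, Any]) -> List[List[float]]:
    """Return embedding vectors ordered by their input index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-written response in the assumed layout, not real model output.
fake_response = {
    "data": [
        {"index": 1, "embedding": [1.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
    ]
}
first_vec, second_vec = extract_embeddings(fake_response)
print(round(cosine_similarity(first_vec, second_vec), 4))  # 0.7071
```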

Docker Deployment

docker run --gpus all --shm-size 32g --ipc=host -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<your-token>" \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000

Benchmarking

# Benchmark throughput
python -m sglang.bench_serving \
  --backend sglang \
  --port 30000 \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --request-rate 10

Troubleshooting

Issue	Solution
CUDA out of memory	Reduce `--mem-fraction-static`, `--max-running-requests`, or `--chunked-prefill-size`; use quantization
Model loading fails	Check HuggingFace token, verify model path
Slow first request	One-time warm-up (kernel loading, empty prefix cache); subsequent requests are faster
Port already in use	Use different `--port` value
FlashInfer errors	Install correct version for your CUDA/PyTorch
TP mismatch	Ensure `--tp` does not exceed the number of visible GPUs and evenly divides the model's attention head count
Low throughput	Increase `--max-running-requests`, tune batch size
Structured gen fails	Check regex/JSON schema syntax

# Check server health
curl http://localhost:30000/health

# Get model info
curl http://localhost:30000/v1/models

# Monitor metrics (Prometheus format; launch the server with --enable-metrics)
curl http://localhost:30000/metrics
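
Scripts that start the server usually need to wait until it is ready before sending traffic. A minimal polling sketch built on the `/health` endpoint above: `http_ok` and `wait_until_healthy` are hypothetical helpers, and the timeout values are arbitrary defaults rather than recommendations.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or a non-2xx status


def wait_until_healthy(
    url: str,
    max_wait: float = 300.0,
    interval: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or `max_wait` seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Usage (not executed here):
#   if wait_until_healthy("http://localhost:30000/health"):
#       print("server is ready")
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a live server.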