
SGLang Cheat Sheet

Overview

SGLang (Structured Generation Language) is a high-performance serving framework for large language models. It introduces RadixAttention, which automatically reuses the KV cache across requests that share a prefix; this speeds up batch inference and multi-turn conversations, where prompts typically repeat the same system prompt or chat history. SGLang also provides a frontend language for programming LLM interactions with primitives for generation, selection, branching, and parallelism.
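To make the prefix-reuse idea concrete, here is a toy sketch of a token-level prefix tree. It is not SGLang's implementation (a real radix cache stores KV tensors and handles eviction); the class `PrefixCache` and its methods are hypothetical names used only for illustration.

```python
class PrefixCache:
    """Toy token-level trie; each node stands in for one cached KV entry."""

    def __init__(self):
        self.root = {}

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a processed sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system_prompt = ["You", "are", "helpful", "."]

first = system_prompt + ["What", "is", "RAG", "?"]
reused_first = cache.match_prefix(first)    # nothing cached yet
cache.insert(first)

second = system_prompt + ["Explain", "transformers"]
reused_second = cache.match_prefix(second)  # the shared system prompt is reused

print(reused_first, reused_second)  # 0 4
```

Only the tokens after the matched prefix would need a fresh prefill pass, which is where the savings for shared system prompts and chat history come from.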

The framework is optimized for both throughput and latency, supporting continuous batching, tensor parallelism, quantization (AWQ, GPTQ, FP8), and structured output generation (JSON mode, regex constraints). SGLang serves as both a production inference server and a programming framework for complex LLM workflows.

Installation

pip install "sglang[all]"

# With FlashInfer (recommended for speed)
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

Server Usage

Launch Server

# Basic server launch
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# With tensor parallelism (multi-GPU)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --port 30000

# With quantization (AWQ expects a checkpoint that is already AWQ-quantized)
python -m sglang.launch_server \
  --model-path TheBloke/Llama-3.1-8B-Instruct-AWQ \
  --quantization awq \
  --port 30000

# FP8 quantization
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --port 30000

# With memory optimization
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.85 \
  --max-running-requests 64 \
  --port 30000
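As a rough aid for choosing --mem-fraction-static, the arithmetic below estimates how many tokens fit in the KV cache pool. It is a back-of-the-envelope sketch, not how SGLang sizes its pool: it assumes the static fraction covers weights plus KV cache, uses the published Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache, and the 80 GiB GPU and 15 GiB weight figures are hypothetical inputs.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_kv_tokens(gpu_mem_gib, mem_fraction_static, weights_gib, bytes_per_token):
    """Rough token capacity of the KV pool once the weights are loaded."""
    pool_bytes = (gpu_mem_gib * mem_fraction_static - weights_gib) * 1024**3
    return max(0, int(pool_bytes // bytes_per_token))


# Llama-3.1-8B: 32 layers, 8 KV heads (GQA), head dimension 128, 16-bit cache
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Hypothetical 80 GiB GPU holding roughly 15 GiB of 16-bit weights
capacity = estimate_kv_tokens(80, 0.85, 15, per_token)
print(capacity)
```

Lowering the fraction shrinks the pool (fewer cached tokens) but leaves more headroom for activations, which is the trade-off behind the out-of-memory advice in the troubleshooting table.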

OpenAI-Compatible API

# Chat completions
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are helpful."},
      {"role": "user", "content": "What is RAG?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

# Streaming
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a story"}],
    "stream": true,
    "max_tokens": 500
  }'

# Completions (non-chat)
curl -X POST http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }'

Python Client (OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers"}],
    max_tokens=256,
    temperature=0.7
)
print(response.choices[0].message.content)

SGLang Frontend

Basic Programs

import sglang as sgl

@sgl.function
def qa_program(s, question):
    s += sgl.system("You are a helpful AI assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

# Run program
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa_program.run(question="What is machine learning?")
print(state["answer"])

Structured Generation

import sglang as sgl

@sgl.function
def extract_info(s, text):
    s += sgl.user(f"Extract structured information from: {text}")
    s += sgl.assistant(sgl.gen("result",
        max_tokens=500,
        regex=r'\{"name": "[^"]+", "age": \d+, "city": "[^"]+"\}'
    ))

# JSON mode (the schema is passed to sgl.gen as a JSON string)
import json

ANSWER_SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
        "sources": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["answer", "confidence"]
})

@sgl.function
def json_output(s, query):
    s += sgl.user(query)
    s += sgl.assistant(sgl.gen("output",
        max_tokens=500,
        json_schema=ANSWER_SCHEMA
    ))
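Constrained decoding restricts which tokens can be emitted, but a generation cut off by max_tokens can still end mid-object, so a defensive check on the client side is cheap insurance. The sketch below reuses the regex from the extraction example above; `parse_person` and the sample strings are hypothetical.

```python
import json
import re

PERSON_PATTERN = re.compile(r'\{"name": "[^"]+", "age": \d+, "city": "[^"]+"\}')


def parse_person(raw):
    """Return the parsed record, or None if the text does not fully match."""
    if PERSON_PATTERN.fullmatch(raw) is None:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None  # matched the pattern but is not valid JSON (e.g. stray escapes)


# Hypothetical model outputs, for illustration only
complete = '{"name": "Ada", "age": 36, "city": "London"}'
truncated = '{"name": "Ada", "age": 36, "ci'

print(parse_person(complete))   # {'name': 'Ada', 'age': 36, 'city': 'London'}
print(parse_person(truncated))  # None
```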

Selection (Constrained Choice)

@sgl.function
def classify(s, text):
    s += sgl.user(f"Classify the sentiment: {text}")
    s += sgl.assistant(sgl.select("sentiment", ["positive", "negative", "neutral"]))

state = classify.run(text="I love this product!")
print(state["sentiment"])  # "positive"
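One way a constrained choice like this can be made is to score every candidate by the log-probability the model assigns to it and normalize by token count so longer options are not penalized. The sketch below illustrates that idea only; it is not taken from SGLang's source, `pick_choice` is a hypothetical name, and the log-probabilities are made up.

```python
def pick_choice(choice_logprobs):
    """Pick the choice with the highest mean per-token log-probability."""
    def mean(values):
        return sum(values) / len(values)

    return max(choice_logprobs, key=lambda choice: mean(choice_logprobs[choice]))


# Made-up per-token log-probabilities for each candidate continuation
scores = {
    "positive": [-0.2],
    "negative": [-2.5],
    "neutral": [-1.9, -0.4],   # two tokens, mean -1.15
}
print(pick_choice(scores))  # positive
```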

Branching and Parallelism

@sgl.function
def multi_perspective(s, topic):
    s += sgl.user(f"Analyze {topic}")

    # Parallel branches: each fork continues from the shared prompt prefix
    names = ["technical", "business", "ethical"]
    forks = s.fork(len(names))
    for fork, name in zip(forks, names):
        fork += sgl.assistant(
            f"From a {name} perspective: " + sgl.gen(name, max_tokens=200)
        )

    # Fork text is not appended to s automatically, so pass the results in
    analyses = "\n\n".join(f"{name}: {fork[name]}" for fork, name in zip(forks, names))
    s += sgl.assistant(analyses)
    s += sgl.user("Summarize all perspectives")
    s += sgl.assistant(sgl.gen("summary", max_tokens=300))

Server Configuration

Launch Options

Parameter                 Description                                                    Default
--model-path              Hugging Face model ID or local path                            Required
--port                    Server port                                                    30000
--host                    Server host                                                    127.0.0.1
--tp                      Tensor parallelism degree                                      1
--dp                      Data parallelism degree                                        1
--quantization            Quantization method (awq, gptq, fp8)                           None
--mem-fraction-static     GPU memory fraction for static use (weights + KV cache pool)   0.88
--max-running-requests    Max concurrent requests                                        Auto
--context-length          Override model context length                                  Model default
--schedule-policy         Scheduling policy (lpm, random, fcfs)                          lpm
--disable-radix-cache     Disable RadixAttention prefix caching                          False
--chunked-prefill-size    Chunk size for prefill (tokens)                                8192
--log-level               Logging level                                                  info
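When the server is started from a script, it can be convenient to build the command line from a dictionary of the options listed above. This is a minimal sketch; `build_launch_command` is a hypothetical helper, and it only formats flags, it does not validate them.

```python
import shlex
import sys


def build_launch_command(model_path, **options):
    """Turn keyword options into a launch_server argument list.

    Underscores become dashes; True adds a bare flag; None and False are skipped.
    """
    cmd = [sys.executable, "-m", "sglang.launch_server", "--model-path", model_path]
    for key, value in options.items():
        if value is None or value is False:
            continue
        cmd.append("--" + key.replace("_", "-"))
        if value is not True:
            cmd.append(str(value))
    return cmd


cmd = build_launch_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    port=30000,
    tp=1,
    mem_fraction_static=0.85,
    disable_radix_cache=False,   # skipped: the flag is only added when True
)
print(shlex.join(cmd[1:]))
# To start the server: subprocess.run(cmd, check=True)
```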

Multi-GPU Serving

# Tensor parallelism across 4 GPUs
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --port 30000

# Data parallelism (multiple replicas)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp 2 \
  --port 30000
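When the two flags are combined, each data-parallel replica needs its own tensor-parallel group, so the GPU count is the product of the two degrees (for example, --dp 2 --tp 2 occupies 4 GPUs). The check below is a small sketch of that arithmetic; `gpus_required` and `check_topology` are hypothetical helpers, and the visible GPU count is supplied by the caller.

```python
def gpus_required(tp=1, dp=1):
    """Each of the dp replicas spans tp GPUs."""
    return tp * dp


def check_topology(tp, dp, visible_gpus):
    """Raise early if the requested layout cannot fit on this machine."""
    needed = gpus_required(tp, dp)
    if needed > visible_gpus:
        raise ValueError(
            f"--tp {tp} --dp {dp} needs {needed} GPUs, only {visible_gpus} visible"
        )
    return needed


print(check_topology(tp=2, dp=2, visible_gpus=4))  # 4
```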

Advanced Usage

Batch Inference

import sglang as sgl

@sgl.function
def process(s, text):
    s += sgl.user(f"Summarize: {text}")
    s += sgl.assistant(sgl.gen("summary", max_tokens=100))

# Batch processing
texts = ["Document 1 content...", "Document 2 content...", "Document 3 content..."]
states = process.run_batch(
    [{"text": t} for t in texts],
    num_threads=16,
    progress_bar=True
)

for state in states:
    print(state["summary"])

Embedding Server

# Launch embedding model
python -m sglang.launch_server \
  --model-path BAAI/bge-large-en-v1.5 \
  --is-embedding \
  --port 30000

# Get embeddings
curl -X POST http://localhost:30000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-large-en-v1.5",
    "input": ["What is RAG?", "Vector databases"]
  }'
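The embeddings endpoint returns an OpenAI-style payload with one entry per input under `data`, each carrying an `index` and an `embedding`. A short sketch of pulling the vectors out and comparing them follows; the helper names and the tiny three-dimensional response are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_from_response(payload):
    """Extract vectors from an OpenAI-style response, ordered by `index`."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Tiny made-up response; real vectors have hundreds of dimensions
fake_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0, 0.0]},
        {"index": 0, "embedding": [1.0, 1.0, 0.0]},
    ]
}
first, second = embeddings_from_response(fake_response)
print(round(cosine_similarity(first, second), 4))  # 0.7071
```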

Docker Deployment

docker run --gpus all \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

Benchmarking

# Benchmark throughput
python -m sglang.bench_serving \
  --backend sglang \
  --port 30000 \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --request-rate 10

Troubleshooting

Issue                   Solution
CUDA out of memory      Reduce --mem-fraction-static, use quantization
Model loading fails     Check Hugging Face token, verify model path
Slow first request      Warm-up cost and an empty prefix cache; subsequent requests are faster
Port already in use     Use a different --port value
FlashInfer errors       Install the build that matches your CUDA/PyTorch versions
TP mismatch             Ensure --tp does not exceed the visible GPUs and evenly divides the model's attention heads
Low throughput          Increase --max-running-requests, tune batch size
Structured gen fails    Check regex/JSON schema syntax

# Check server health
curl http://localhost:30000/health

# Get model info
curl http://localhost:30000/v1/models

# Monitor metrics (Prometheus format; launch the server with --enable-metrics)
curl http://localhost:30000/metrics
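The metrics endpoint serves Prometheus text exposition, which is easy to read from a script. The parser below is a minimal sketch that assumes no trailing timestamps on sample lines; `parse_metrics` is a hypothetical helper and the metric names in the sample are placeholders, not real SGLang metric names.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_with_labels: float}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue                            # ignore lines we cannot parse
    return metrics


# Made-up sample; the metric names are placeholders
sample = """\
# HELP example_running_requests Requests currently running
# TYPE example_running_requests gauge
example_running_requests{model="demo"} 3
example_generated_tokens_total 1024
"""
parsed = parse_metrics(sample)
print(parsed['example_running_requests{model="demo"}'])  # 3.0
print(parsed["example_generated_tokens_total"])          # 1024.0
```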