LMDeploy - LLM Compression & Serving Cheatsheet
LMDeploy (from the InternLM/OpenMMLab ecosystem) is a toolkit for compressing, quantizing, and serving large language models. Its high-performance TurboMind engine delivers strong throughput via persistent batching, blocked KV cache, and optimized CUDA kernels, and it also offers a PyTorch backend. It supports 4-bit AWQ weight quantization and KV-cache quantization, an OpenAI-compatible API server, and vision-language models (VLMs).
Installation
| Method | Command |
|---|---|
| pip | pip install lmdeploy |
| With all optional extras | pip install "lmdeploy[all]" |
| Docker | docker run --gpus all openmmlab/lmdeploy:latest |
| Requirements | NVIDIA GPU + CUDA |
| Verify | lmdeploy --version |
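
The install can also be checked from Python rather than the shell. The following is a small sketch using only the standard library; the helper name `installed_version` is made up for illustration and is not part of LMDeploy.

```python
from importlib import metadata


def installed_version(package="lmdeploy"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("lmdeploy is not installed in this environment")
else:
    print(f"lmdeploy {version} is installed")
```

This reads package metadata only, so it does not load CUDA libraries or need a GPU.
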
Quick Inference (CLI)
# Interactive chat in the terminal (TurboMind engine)
lmdeploy chat internlm/internlm2_5-7b-chat
# Batch/pipeline inference in Python (below)
Python Pipeline
from lmdeploy import pipeline
pipe = pipeline("internlm/internlm2_5-7b-chat")
resp = pipe(["Explain RAG in one sentence."])
print(resp[0].text)
| Call | Description |
|---|---|
| pipeline(model) | Load a model with the default (TurboMind) engine |
| pipe([prompts]) | Batch inference |
| GenerationConfig(...) | Sampling params (temperature, top_p, max_new_tokens) |
| TurbomindEngineConfig(...) | Engine tuning (tp, cache, session len) |
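
The calls in the table can be combined roughly as follows. This is a sketch rather than a recommended setup: `SAMPLING` and `generate` are illustrative names, and the sampling values are arbitrary. The `lmdeploy` import is deferred into the function so the settings can be inspected on a machine without a GPU.

```python
# Sampling settings kept as a plain dict so they can be logged or tweaked
# before any model is loaded. The values are arbitrary examples.
SAMPLING = {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 256}


def generate(prompts, model="internlm/internlm2_5-7b-chat", sampling=SAMPLING):
    """Run batch inference with explicit sampling parameters."""
    # Imported lazily: loading a model requires lmdeploy and a supported GPU.
    from lmdeploy import GenerationConfig, pipeline

    pipe = pipeline(model)
    gen_config = GenerationConfig(**sampling)
    responses = pipe(prompts, gen_config=gen_config)
    return [response.text for response in responses]
```

Calling `generate(["Explain RAG in one sentence."])` should return a list with one string per prompt, assuming the model loads on the local GPU.
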
Serving an API
# OpenAI-compatible server on port 23333
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333
# Query it ("model" must match an id returned by GET /v1/models)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"internlm/internlm2_5-7b-chat","messages":[{"role":"user","content":"hi"}]}'
| Command | Description |
|---|---|
| lmdeploy serve api_server MODEL | Start the OpenAI-compatible server |
| lmdeploy serve gradio MODEL | Launch a Gradio web UI |
| --server-port | API port |
| --tp N | Tensor parallelism across N GPUs |
| --session-len N | Max context length |
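
The same request can be sent from Python. The sketch below uses only the standard library and assumes a server is already listening on `http://localhost:23333`; the function names are illustrative. Listing `/v1/models` first avoids guessing the model id the server registered.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_models(base_url):
    """Return the model ids the server reports at /v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return [entry["id"] for entry in json.load(resp)["data"]]


def chat(base_url, prompt):
    """Send one user message to the first served model and return the reply text."""
    model = list_models(base_url)[0]
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:23333", "hi")` should return the assistant's reply as a string.
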
Quantization (AWQ)
# 4-bit AWQ weight-only quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --work-dir internlm2_5-7b-chat-4bit
# Serve the quantized model
lmdeploy serve api_server internlm2_5-7b-chat-4bit --model-format awq
| Command | Description |
|---|---|
| lmdeploy lite auto_awq MODEL | Produce a 4-bit AWQ model |
| lmdeploy lite calibrate MODEL | Run the calibration step on its own |
| --model-format awq | Serve an AWQ-quantized model |
| --quant-policy 4 or 8 | INT4 or INT8 KV cache (set when serving or building a pipeline) |
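
To get a feel for what 4-bit weight quantization saves, a back-of-the-envelope estimate helps. The sketch below assumes a round 7 billion parameters (not the exact count of any particular model) and ignores the per-group scales and zero points that AWQ stores, as well as any layers left unquantized, so real 4-bit checkpoints are somewhat larger than it suggests.

```python
N_PARAMS = 7e9  # assumed round figure for a "7B" model, for illustration only


def weight_memory_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions, the weights shrink from roughly 13 GiB at 16 bits to a little over 3 GiB at 4 bits, a 4x reduction before metadata overhead.
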
Engine Configuration
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline(
    "internlm/internlm2_5-7b-chat",
    backend_config=TurbomindEngineConfig(
        tp=2, session_len=8192, cache_max_entry_count=0.8, quant_policy=8),
)
| Option | Effect |
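|---|---|
| tp | Number of GPUs used for tensor parallelism |
| session_len | Max sequence length per session (prompt plus generated tokens) |
| cache_max_entry_count | Fraction of free GPU memory (after weights are loaded) reserved for the KV cache |
| quant_policy | KV cache quantization: 0 = off (default), 4 = INT4, 8 = INT8 |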
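
How `cache_max_entry_count` and `quant_policy` interact can be estimated with simple arithmetic. The sketch below is a rough model, not LMDeploy's allocator: the layer and head counts are assumed values for a 7B-class model with grouped-query attention (check the model's `config.json`), the 10 GiB of free memory is hypothetical, and the quantization scales stored alongside an INT4/INT8 cache are ignored.

```python
# Assumed model shape for illustration; read the real values from config.json.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bits // 8


def max_cached_tokens(free_bytes, cache_fraction, bytes_per_token):
    """Tokens that fit when `cache_fraction` of free memory goes to the KV cache."""
    return int(free_bytes * cache_fraction) // bytes_per_token


FREE_BYTES = 10 * 2**30  # hypothetical free GPU memory after loading weights

for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, bits)
    tokens = max_cached_tokens(FREE_BYTES, 0.8, per_token)
    print(f"{label}: {per_token // 1024} KiB per token, about {tokens} tokens cached")
```

With these assumed shapes a token costs 128 KiB at FP16, 64 KiB at INT8, and 32 KiB at INT4, so each halving of KV precision roughly doubles how many tokens fit in the same memory budget.
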
Vision-Language Models
| Capability | Note |
|---|---|
| VLM support | Serve models like InternVL, LLaVA, Qwen-VL |
| Same API | Multimodal messages via the OpenAI-compatible server |
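
A multimodal request follows the OpenAI chat format, with the user message's `content` given as a list of text and image parts. The sketch below only builds the JSON body; the function name is illustrative, the model id is an example of a VLM that would need to be the one actually served, and the image URL is a placeholder.

```python
import json


def build_vlm_payload(model, prompt, image_url):
    """Build a chat-completions body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


payload = build_vlm_payload(
    model="OpenGVLab/InternVL2-8B",            # example id; use the served model's id
    prompt="Describe this image.",
    image_url="https://example.com/cat.jpg",   # placeholder URL
)
print(json.dumps(payload, indent=2))
```

The resulting body can be POSTed to `/v1/chat/completions` in the same way as a text-only request.
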
LMDeploy vs Other Engines
| Aspect | LMDeploy | vLLM | Aphrodite |
|---|---|---|---|
| Engine | TurboMind + PyTorch | PagedAttention | vLLM fork |
| Quantization | AWQ + KV cache quant | AWQ, GPTQ, FP8, and others | Broadest formats |
| VLM support | Strong | Yes | Yes |
| Best for | High-throughput + AWQ + VLMs | Standard serving | Community quant formats |
Resources