LMDeploy - LLM Compression & Serving Cheatsheet

LMDeploy (from the InternLM/OpenMMLab ecosystem) is a toolkit for compressing, deploying, and serving large language models. It ships two inference engines: TurboMind, a CUDA engine built for throughput through persistent (continuous) batching, a blocked KV cache, and optimized kernels, and a PyTorch engine. It supports 4-bit AWQ weight-only quantization, INT4/INT8 KV-cache quantization, an OpenAI-compatible API server, and vision-language models (VLMs).

Installation

Method	Command
pip	`pip install lmdeploy`
With all optional extras	`pip install "lmdeploy[all]"`
Docker	`docker run --gpus all openmmlab/lmdeploy:latest`
Requirements	NVIDIA GPU + CUDA
Verify	`lmdeploy --version`
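
The Verify step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and it reads package metadata instead of importing LMDeploy, so it runs on a machine without a GPU.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "lmdeploy") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"lmdeploy {version}" if version else "lmdeploy is not installed")
```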

Quick Inference (CLI)

# Interactive chat in the terminal (TurboMind engine)
lmdeploy chat internlm/internlm2_5-7b-chat

# Batch/pipeline inference in Python (below)

Python Pipeline

from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
resp = pipe(["Explain RAG in one sentence."])
print(resp[0].text)

Call	Description
`pipeline(model)`	Load a model with the default (TurboMind) engine
`pipe([prompts])`	Batch inference
`GenerationConfig(...)`	Sampling params (temperature, top_p, max_new_tokens)
`TurbomindEngineConfig(...)`	Engine tuning (tp, cache, session len)
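
To show how `GenerationConfig` plugs into the pipeline, here is a hedged sketch. The helper names `sampling_params` and `generate` and the default values are assumptions chosen for illustration; only `pipeline`, `GenerationConfig`, and the three sampling fields come from the table above. LMDeploy is imported inside `generate` so the validation helper can be used without a GPU.

```python
from typing import Dict, List, Union


def sampling_params(temperature: float = 0.7,
                    top_p: float = 0.9,
                    max_new_tokens: int = 256) -> Dict[str, Union[int, float]]:
    """Validate and collect the three sampling fields named in the table."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be >= 1")
    return {"temperature": temperature,
            "top_p": top_p,
            "max_new_tokens": max_new_tokens}


def generate(prompts: List[str],
             model: str = "internlm/internlm2_5-7b-chat",
             **sampling: float) -> List[str]:
    """Run batch inference and return the generated text for each prompt."""
    # Imported lazily: loading LMDeploy and the model requires an NVIDIA GPU.
    from lmdeploy import GenerationConfig, pipeline

    gen_config = GenerationConfig(**sampling_params(**sampling))
    pipe = pipeline(model)
    return [response.text for response in pipe(prompts, gen_config=gen_config)]
```

A call such as `generate(["Explain RAG in one sentence."], temperature=0.2)` would return a list with one string. Depending on the LMDeploy release, sampling may have to be enabled explicitly before `temperature` and `top_p` take effect, so check the `GenerationConfig` reference for your version.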

Serving an API

# OpenAI-compatible server on port 23333
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

# Query it ("model" must match an id returned by GET /v1/models)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"internlm2_5-7b-chat","messages":[{"role":"user","content":"hi"}]}'

Command	Description
`lmdeploy serve api_server MODEL`	Start the OpenAI-compatible server
`lmdeploy serve gradio MODEL`	Launch a Gradio web UI
`--server-port`	API port
`--tp N`	Tensor parallelism across N GPUs
`--session-len N`	Max context length
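
The same request can be sent from Python. This is a minimal sketch that uses only the standard library and assumes the server from the command above is listening on `localhost:23333`; the function names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI chat-completions response shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str,
         timeout: float = 60.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:23333", served_model_name, "hi")` returns the reply as a string, where `served_model_name` is whatever the server lists under `/v1/models`.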

Quantization (AWQ)

# 4-bit AWQ weight-only quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --work-dir internlm2_5-7b-chat-4bit

# Serve the quantized model
lmdeploy serve api_server internlm2_5-7b-chat-4bit --model-format awq

Command	Description
`lmdeploy lite auto_awq MODEL`	Produce a 4-bit AWQ model
`lmdeploy lite calibrate MODEL`	Calibration step
`--model-format awq`	Serve an AWQ-quantized model
KV-cache quant	`--quant-policy 4` or `8` for INT4/INT8 KV cache
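
As a rough illustration of why 4-bit weights matter, the sketch below estimates weight memory at different precisions. It is back-of-the-envelope arithmetic, not an LMDeploy API: the 7e9 parameter count is a round number rather than the exact size of any model, and the per-group scales and zero points that AWQ stores alongside the weights are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a nominal 7-billion-parameter model.
fp16_gb = weight_memory_gb(7e9, 16)
int4_gb = weight_memory_gb(7e9, 4)
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB "
      f"({fp16_gb / int4_gb:.0f}x smaller)")
# prints: FP16: 14.0 GB, 4-bit: 3.5 GB (4x smaller)
```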

Engine Configuration

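from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=2,                       # shard the model across 2 GPUs
    session_len=8192,           # max context length in tokens
    cache_max_entry_count=0.8,  # share of GPU memory for the KV cache
    quant_policy=8)             # INT8 KV cache
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_config)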

Option	Effect
`tp`	Tensor-parallel GPUs
`session_len`	Context length
`cache_max_entry_count`	Fraction of GPU memory reserved for the KV cache (in recent releases, of the memory left free after weights are loaded)
`quant_policy`	KV cache quantization (4/8)
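
To see how `cache_max_entry_count` and `quant_policy` interact, the sketch below estimates how many tokens fit in the KV cache. This is plain arithmetic, not LMDeploy code: the model shape (32 layers, 8 KV heads, head dimension 128) and the 10 GiB of free memory are illustrative assumptions, and the scales and zero points that quantized caches also store are ignored.

```python
# Bits per cached value for each quant_policy setting (0 = no quantization, FP16).
BITS_BY_QUANT_POLICY = {0: 16, 8: 8, 4: 4}


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bits_per_value: int) -> float:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_value / 8


def kv_cache_tokens(free_gpu_bytes: float, cache_max_entry_count: float,
                    bytes_per_token: float) -> int:
    """Tokens that fit in the share of free memory given to the KV cache."""
    return int(free_gpu_bytes * cache_max_entry_count // bytes_per_token)


if __name__ == "__main__":
    free_bytes = 10 * 2**30  # assumption: 10 GiB free after loading weights
    for policy, bits in BITS_BY_QUANT_POLICY.items():
        per_token = kv_bytes_per_token(32, 8, 128, bits)
        tokens = kv_cache_tokens(free_bytes, 0.8, per_token)
        print(f"quant_policy={policy}: {per_token / 1024:.0f} KiB/token, "
              f"about {tokens} tokens")
```

Under these assumptions, halving the bits per value doubles the number of tokens the cache can hold, which is the main reason to enable KV-cache quantization for long contexts or large batches.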

Vision-Language Models

Capability	Note
VLM support	Serve models like InternVL, LLaVA, Qwen-VL
Same API	Multimodal messages via the OpenAI-compatible server
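
A multimodal request uses the same chat-completions endpoint, with the message content given as a list of parts. The sketch below builds such a payload in the OpenAI message format; the helper name `build_vlm_payload`, the model name, and the image URL are placeholders chosen for illustration.

```python
import json
from typing import Any, Dict


def build_vlm_payload(model: str, prompt: str, image_url: str) -> Dict[str, Any]:
    """Build a chat-completions payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


if __name__ == "__main__":
    payload = build_vlm_payload(
        model="OpenGVLab/InternVL2-8B",            # placeholder model name
        prompt="Describe this image.",
        image_url="https://example.com/cat.jpg",   # placeholder image URL
    )
    print(json.dumps(payload, indent=2))
```

The resulting JSON can be sent as the request body to `/v1/chat/completions` on a server started with a VLM.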

LMDeploy vs Other Engines

Aspect	LMDeploy	vLLM	Aphrodite
Engine	TurboMind + PyTorch	PagedAttention	vLLM fork
Quantization	AWQ + KV cache quant	Growing	Broadest formats
VLM support	Strong	Yes	Yes
Best for	High-throughput + AWQ + VLMs	Standard serving	Community quant formats
