LMDeploy - LLM Compression & Serving Cheatsheet

LMDeploy (from the InternLM/OpenMMLab ecosystem) is a toolkit for compressing, quantizing, and serving large language models. Its TurboMind engine targets high throughput through persistent (continuous) batching, a blocked KV cache, and optimized CUDA kernels; a PyTorch backend is also available. LMDeploy supports 4-bit AWQ weight quantization, KV-cache quantization, an OpenAI-compatible API server, and vision-language models (VLMs).

Installation

| Method | Command |
| --- | --- |
| pip | pip install lmdeploy |
| pip (all optional extras) | pip install "lmdeploy[all]" |
| Docker | docker run --gpus all openmmlab/lmdeploy:latest |
| Requirements | NVIDIA GPU + CUDA |
| Verify | lmdeploy --version |

Quick Inference (CLI)

# Interactive chat in the terminal (TurboMind engine)
lmdeploy chat internlm/internlm2_5-7b-chat

# Batch/pipeline inference in Python (below)

Python Pipeline

from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
resp = pipe(["Explain RAG in one sentence."])
print(resp[0].text)

| Call | Description |
| --- | --- |
| pipeline(model) | Load a model with the default (TurboMind) engine |
| pipe([prompts]) | Batch inference |
| GenerationConfig(...) | Sampling params (temperature, top_p, max_new_tokens) |
| TurbomindEngineConfig(...) | Engine tuning (tp, cache, session len) |
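
Sampling parameters are passed per call through gen_config. The following is a minimal sketch, assuming lmdeploy is installed and a supported GPU is available; the helper name generate and the specific values are illustrative, not recommendations.

```python
def generate(prompts, model="internlm/internlm2_5-7b-chat"):
    """Run batch inference with explicit sampling parameters."""
    # Imported lazily so this file can be imported on machines without lmdeploy.
    from lmdeploy import GenerationConfig, pipeline

    gen_config = GenerationConfig(
        max_new_tokens=256,  # cap on generated tokens per prompt
        temperature=0.7,     # illustrative value
        top_p=0.9,           # illustrative value
    )
    pipe = pipeline(model)
    # The pipeline returns one response object per prompt, in input order.
    return [r.text for r in pipe(prompts, gen_config=gen_config)]
```

Calling generate(["Explain RAG in one sentence."]) returns a list with one string. Loading the model is the expensive step, so real code would typically build the pipeline once and reuse it across calls.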

Serving an API

# OpenAI-compatible server on port 23333
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

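# Query it ("model" must match a name listed by GET /v1/models;
# by default that is the model path passed to api_server)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"internlm/internlm2_5-7b-chat","messages":[{"role":"user","content":"hi"}]}'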

| Command | Description |
| --- | --- |
| lmdeploy serve api_server MODEL | Start the OpenAI-compatible server |
| lmdeploy serve gradio MODEL | Launch a Gradio web UI |
| --server-port | API port |
| --tp N | Tensor parallelism across N GPUs |
| --session-len N | Max context length |
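
The same chat-completions request can be issued from Python using only the standard library. This is a sketch, assuming the server from the example above is listening on localhost:23333; build_chat_request and chat are hypothetical helper names, not part of LMDeploy.

```python
import json
import urllib.request

BASE_URL = "http://localhost:23333"  # matches --server-port 23333


def build_chat_request(model, user_text, base_url=BASE_URL):
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, user_text, base_url=BASE_URL, timeout=60):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(model, user_text, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Standard OpenAI chat-completions response shape.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("internlm/internlm2_5-7b-chat", "hi") returns the reply as a string. Splitting request construction from sending keeps the payload easy to inspect without a live server.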

Quantization (AWQ)

# 4-bit AWQ weight-only quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --work-dir internlm2_5-7b-chat-4bit

# Serve the quantized model
lmdeploy serve api_server internlm2_5-7b-chat-4bit --model-format awq

| Command | Description |
| --- | --- |
| lmdeploy lite auto_awq MODEL | Produce a 4-bit AWQ model |
| lmdeploy lite calibrate MODEL | Calibration step |
| --model-format awq | Serve an AWQ-quantized model |
| --quant-policy N | KV-cache quantization: 4 for INT4, 8 for INT8 |
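
To see roughly what 4-bit weight-only quantization saves, the arithmetic below compares FP16 and 4-bit weight storage. It is a back-of-the-envelope sketch under stated assumptions: a nominal 7 billion parameters, every weight quantized (in practice some layers such as embeddings usually stay in higher precision), and group-wise quantization with a group size of 128 storing one FP16 scale and one 4-bit zero point per group.

```python
def weight_gib(n_params, bits_per_weight, group_size=None):
    """Approximate weight storage in GiB.

    If group_size is given, add one FP16 scale and one 4-bit zero point
    per group (an assumed layout for group-wise quantization).
    """
    bits = n_params * bits_per_weight
    if group_size:
        bits += (n_params / group_size) * (16 + 4)
    return bits / 8 / 2**30


N_PARAMS = 7e9  # nominal "7B"; the real parameter count differs somewhat

fp16 = weight_gib(N_PARAMS, 16)
w4 = weight_gib(N_PARAMS, 4, group_size=128)
print(f"FP16 weights:  {fp16:.2f} GiB")
print(f"4-bit weights: {w4:.2f} GiB ({fp16 / w4:.1f}x smaller)")
```

Under these assumptions the weights shrink from about 13 GiB to about 3.4 GiB. Activations and the KV cache are not included, so total GPU memory use at serving time is higher.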

Engine Configuration

from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("internlm/internlm2_5-7b-chat",
    backend_config=TurbomindEngineConfig(
        tp=2, session_len=8192, cache_max_entry_count=0.8, quant_policy=8))
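
| Option | Effect |
| --- | --- |
| tp | Number of GPUs for tensor parallelism |
| session_len | Max context length in tokens |
| cache_max_entry_count | Fraction of free GPU memory (after weights are loaded) reserved for the KV cache |
| quant_policy | KV-cache quantization (4 = INT4, 8 = INT8) |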
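
The interaction between cache_max_entry_count, quant_policy, and session_len is easier to reason about with a rough KV-cache budget. The sketch below is illustrative arithmetic only: the model shape (32 layers, 8 KV heads, head dimension 128) and the 10 GiB of available memory are assumptions, and the small overhead of quantization scales is ignored.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bits // 8


def kv_token_capacity(free_bytes, cache_fraction, per_token_bytes):
    """Tokens that fit in the share of memory given to the cache."""
    return int(free_bytes * cache_fraction) // per_token_bytes


# Assumed model shape and available memory; substitute real values.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
FREE_BYTES = 10 * 2**30  # 10 GiB available for caching (assumption)

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits)
    tokens = kv_token_capacity(FREE_BYTES, 0.8, per_token)
    print(f"{label}: {per_token} bytes/token, about {tokens} tokens")
```

Under these assumptions, halving the KV precision doubles the number of cached tokens. Dividing the token budget by session_len gives a rough upper bound on how many full-length sessions can be resident at once.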

Vision-Language Models

| Capability | Note |
| --- | --- |
| VLM support | Serve models like InternVL, LLaVA, Qwen-VL |
| Same API | Multimodal messages via the OpenAI-compatible server |
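
A multimodal message in the OpenAI chat format mixes text and image parts inside one user turn. The sketch below only builds the JSON payload; the model name and image URL are placeholders, and build_vlm_payload is a hypothetical helper rather than an LMDeploy function.

```python
import json


def build_vlm_payload(model, question, image_url):
    """Return an OpenAI-style chat payload with one text and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


payload = build_vlm_payload(
    model="OpenGVLab/InternVL2-8B",           # placeholder served model name
    question="Describe this image.",
    image_url="https://example.com/cat.png",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

The resulting JSON can be POSTed to /v1/chat/completions on a server started with a vision-language model, in the same way as a text-only request.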

LMDeploy vs Other Engines

| Aspect | LMDeploy | vLLM | Aphrodite |
| --- | --- | --- | --- |
| Engine | TurboMind + PyTorch | PagedAttention | vLLM fork |
| Quantization | AWQ + KV cache quant | Growing | Broadest formats |
| VLM support | Strong | Yes | Yes |
| Best for | High-throughput + AWQ + VLMs | Standard serving | Community quant formats |

Resources