LMDeploy - LLM Compression & Serving Cheatsheet

LMDeploy (from the InternLM/OpenMMLab ecosystem) is a toolkit for compressing, quantizing, and serving large language models. Its TurboMind engine targets high throughput through persistent batching, a blocked KV cache, and optimized CUDA kernels; a PyTorch backend is also available. It supports 4-bit AWQ weight quantization and KV-cache quantization, an OpenAI-compatible API server, and vision-language models (VLMs).

Installation

Method	Command
pip	`pip install lmdeploy`
With all extras	`pip install lmdeploy[all]`
Docker	`docker run --gpus all openmmlab/lmdeploy:latest`
Requirements	NVIDIA GPU + CUDA
Verify	`lmdeploy --version`

Quick Inference (CLI)

# Interactive chat in the terminal (TurboMind engine)
lmdeploy chat internlm/internlm2_5-7b-chat

# Batch/pipeline inference in Python (below)

Python Pipeline

from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
resp = pipe(["Explain RAG in one sentence."])
print(resp[0].text)

Call	Description
`pipeline(model)`	Load a model with the default (TurboMind) engine
`pipe([prompts])`	Batch inference
`GenerationConfig(...)`	Sampling parameters (temperature, top_p, max_new_tokens)
`TurbomindEngineConfig(...)`	Engine tuning (tp, cache, session length)
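
To make the sampling parameters concrete, here is a minimal pure-Python sketch of what `temperature` and `top_p` conventionally do to a next-token distribution. It is a generic illustration, not LMDeploy's implementation; the helper names `apply_temperature` and `top_p_filter` and the toy logits are assumptions made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p.

    Returns {token_index: renormalized_probability}.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


# Toy 4-token vocabulary (hypothetical logits).
probs = apply_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.7)
print(top_p_filter(probs, top_p=0.9))
```

`max_new_tokens` is simply the cap on how many such sampling steps run for one response.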

Serving an API

# OpenAI-compatible server on port 23333
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

# Query it ("model" must match a name listed by GET /v1/models)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"internlm2_5-7b-chat","messages":[{"role":"user","content":"hi"}]}'

Command	Description
`lmdeploy serve api_server MODEL`	Start the OpenAI-compatible server
`lmdeploy serve gradio MODEL`	Start a Gradio web UI
`--server-port`	API port
`--tp N`	Tensor parallelism across N GPUs
`--session-len N`	Maximum context length
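
The same chat-completions request can be issued from Python using only the standard library. The sketch below mirrors the curl call above; the helper names `build_chat_request` and `send` are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; send() does.
req = build_chat_request("http://localhost:23333", "internlm2_5-7b-chat", "hi")
print(req.full_url)
```

With the server from above running, `print(send(req))` would print the reply. If the server rejects the model name, check the names it reports at `/v1/models`.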

Quantization (AWQ)

# 4-bit AWQ weight-only quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --work-dir internlm2_5-7b-chat-4bit

# Serve the quantized model
lmdeploy serve api_server internlm2_5-7b-chat-4bit --model-format awq

Command	Description
`lmdeploy lite auto_awq MODEL`	Produce a 4-bit AWQ model
`lmdeploy lite calibrate MODEL`	Run the calibration step
`--model-format awq`	Serve an AWQ-quantized model
KV-cache quantization	`--quant-policy 4` or `8` for an INT4/INT8 KV cache
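
A rough back-of-the-envelope calculation shows why 4-bit weights matter for memory. This sketch assumes a nominal 7e9 parameters, a group size of 128, and one 16-bit scale plus one 4-bit zero point stored per group; all of these are illustrative assumptions rather than values taken from LMDeploy. It also ignores layers that may stay in 16-bit, so real checkpoints will be somewhat larger.

```python
def awq_bits_per_weight(weight_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight when each group carries one scale and one zero point."""
    return weight_bits + (scale_bits + zero_bits) / group_size


def weight_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7e9  # nominal "7B" size, an assumption for illustration

fp16_gib = weight_gib(NUM_PARAMS, 16)
awq_gib = weight_gib(NUM_PARAMS, awq_bits_per_weight())
print(f"FP16 weights: {fp16_gib:.2f} GiB")
print(f"4-bit AWQ weights: {awq_gib:.2f} GiB ({fp16_gib / awq_gib:.2f}x smaller)")
```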

Engine Configuration

from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("internlm/internlm2_5-7b-chat",
    backend_config=TurbomindEngineConfig(
        tp=2, session_len=8192, cache_max_entry_count=0.8, quant_policy=8))

Option	Effect
`tp`	Number of tensor-parallel GPUs
`session_len`	Context length
`cache_max_entry_count`	Fraction of GPU memory given to the KV cache (in recent releases, of the memory left free after the weights load)
`quant_policy`	KV-cache quantization (4 or 8)
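
The interaction between `cache_max_entry_count` and `quant_policy` can be estimated with simple arithmetic. The sketch below is an approximation under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128) and the 10 GiB of free memory are hypothetical, `quant_policy=0` is taken to mean an unquantized 16-bit cache, and the small overhead of quantization scales is ignored.

```python
# Bits per cached value for each quant_policy; 0 = no quantization (assumed 16-bit).
KV_BITS = {0: 16, 4: 4, 8: 8}


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, quant_policy):
    """Bytes of KV cache per token: K and V, for every layer and KV head."""
    values = 2 * num_layers * num_kv_heads * head_dim
    return values * KV_BITS[quant_policy] // 8


def token_capacity(free_bytes, cache_fraction, bytes_per_token):
    """How many tokens fit in the cache budget."""
    return int(free_bytes * cache_fraction) // bytes_per_token


# Hypothetical model shape and memory budget, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
FREE_BYTES = 10 * 2**30

for policy in (0, 8, 4):
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, policy)
    tokens = token_capacity(FREE_BYTES, 0.8, per_token)
    print(f"quant_policy={policy}: {per_token} bytes/token, about {tokens} cached tokens")
```

Under these assumptions, halving the bits per cached value doubles the number of tokens the same budget can hold, which is the main reason to enable KV-cache quantization for long contexts or large batches.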

Vision-Language Models

Capability	Note
VLM support	Serve models such as InternVL, LLaVA, Qwen-VL
Same API	Multimodal messages go through the OpenAI-compatible server
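
A multimodal message in the OpenAI chat format mixes text and image parts in one `content` list. The sketch below only builds the JSON payload; the model name `my-vlm`, the image URL, and the helper `image_message` are placeholders invented for this example, and the payload would be posted to `/v1/chat/completions` like any text request.

```python
import json


def image_message(prompt, image_url):
    """Build one user message with a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


payload = {
    "model": "my-vlm",  # hypothetical; use a name the server actually reports
    "messages": [
        image_message("Describe this image.", "https://example.com/cat.png"),
    ],
}
print(json.dumps(payload, indent=2))
```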

LMDeploy vs Other Engines

Aspect	LMDeploy	vLLM	Aphrodite
Engine	TurboMind + PyTorch	PagedAttention	vLLM fork
Quantization	AWQ + KV-cache quantization	Growing	Broadest format support
VLM support	Strong	Yes	Yes
Best for	High throughput + AWQ + VLMs	Standard serving	Community quantization formats
