LMDeploy - LLM Compression and Serving

LMDeploy (from the InternLM/OpenMMLab ecosystem) is a toolkit for compressing, quantizing, and serving large language models. Its high-performance TurboMind engine achieves high throughput through persistent (continuous) batching, a blocked KV cache, and optimized CUDA kernels; a PyTorch backend is also available. It supports 4-bit AWQ weight quantization, KV cache quantization, an OpenAI-compatible API server, and vision-language models (VLMs).

Installation

Method	Command
pip	`pip install lmdeploy`
With all optional extras	`pip install "lmdeploy[all]"`
Docker	`docker run --gpus all openmmlab/lmdeploy:latest`
Requirements	NVIDIA GPU + CUDA
Verify	`lmdeploy --version`
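
The same check can be done from Python, which is handy inside a notebook or a deployment script. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "lmdeploy"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"lmdeploy {found}" if found else "lmdeploy is not installed")
```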

Quick inference (CLI)

# Interactive chat in the terminal (TurboMind engine)
lmdeploy chat internlm/internlm2_5-7b-chat

# Batch/pipeline inference in Python (below)

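Python pipeline

from lmdeploy import pipeline

# Load the model with the default engine (TurboMind when the model is supported)
pipe = pipeline("internlm/internlm2_5-7b-chat")
# A list of prompts runs as one batch; one response is returned per prompt
resp = pipe(["Explain RAG in one sentence."])
print(resp[0].text)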

Call	Description
`pipeline(model)`	Load a model with the default engine (TurboMind)
`pipe([prompts])`	Batch inference
`GenerationConfig(...)`	Sampling parameters (temperature, top_p, max_new_tokens)
`TurbomindEngineConfig(...)`	Engine tuning (tp, KV cache, session_len)
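
To show how `GenerationConfig` plugs into the pipeline call, here is a minimal sketch. The sampling values (0.7, 0.9, 128) and the wrapper name `generate` are illustrative assumptions, not recommended defaults.

```python
# Sampling settings kept in a plain dict so they can be inspected or logged.
# The values are illustrative assumptions, not tuned defaults.
SAMPLING = {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 128}


def generate(prompts, model="internlm/internlm2_5-7b-chat"):
    """Run a batch of prompts with explicit sampling settings.

    lmdeploy is imported lazily so this module can be loaded on a machine
    without a GPU; calling the function still needs a working install.
    """
    from lmdeploy import GenerationConfig, pipeline

    pipe = pipeline(model)
    responses = pipe(prompts, gen_config=GenerationConfig(**SAMPLING))
    return [r.text for r in responses]
```

Usage would look like `generate(["Explain RAG in one sentence."])`, which returns one string per prompt.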

Serving an API

# OpenAI-compatible server on port 23333
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

# Query it ("model" must match a name listed by GET /v1/models)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"internlm/internlm2_5-7b-chat","messages":[{"role":"user","content":"hi"}]}'

Command	Description
`lmdeploy serve api_server MODEL`	Start the OpenAI-compatible server
`lmdeploy serve gradio MODEL`	Launch a Gradio web UI
`--server-port`	API port
`--tp N`	Tensor parallelism across N GPUs
`--session-len N`	Maximum context length
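
The curl call above can also be made from Python with the standard library alone. The sketch below assumes the server runs on `localhost:23333`; the helper names `build_chat_request` and `chat` are hypothetical. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:23333"  # assumes --server-port 23333


def build_chat_request(model, user_text, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, user_text, base_url=BASE_URL, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, user_text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("internlm/internlm2_5-7b-chat", "hi")` should return the reply as a string, provided the model name matches one reported by `/v1/models`.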

Quantization (AWQ)

# 4-bit weight-only AWQ quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --work-dir internlm2_5-7b-chat-4bit

# Serve the quantized model
lmdeploy serve api_server internlm2_5-7b-chat-4bit --model-format awq

Command	Description
`lmdeploy lite auto_awq MODEL`	Produce a 4-bit AWQ model
`lmdeploy lite calibrate MODEL`	Calibration step
`--model-format awq`	Serve an AWQ-quantized model
`--quant-policy 4` or `8`	Quantize the KV cache to INT4 or INT8
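
A rough back-of-the-envelope estimate shows why 4-bit weights matter for fitting a model on one GPU. The sketch below is simplified arithmetic, not a measurement. It assumes a round 7 billion parameters, a quantization group size of 128, and one FP16 scale plus one 4-bit zero point per group, and it ignores layers that usually stay in FP16 (such as embeddings).

```python
GIB = 1024 ** 3


def weight_gib(n_params, bits_per_weight, group_size=None,
               scale_bits=16, zero_bits=4):
    """Estimate weight storage in GiB.

    With group-wise quantization, each group of `group_size` weights also
    stores a scale and a zero point, which adds a small per-weight overhead.
    """
    bits = bits_per_weight
    if group_size:
        bits += (scale_bits + zero_bits) / group_size
    return n_params * bits / 8 / GIB


N_PARAMS = 7_000_000_000  # assumed round figure for a 7B-class model

fp16 = weight_gib(N_PARAMS, 16)
awq4 = weight_gib(N_PARAMS, 4, group_size=128)
print(f"FP16: {fp16:.1f} GiB, AWQ 4-bit: {awq4:.1f} GiB, ratio {fp16 / awq4:.2f}x")
```

Under these assumptions the weights shrink by a little under 4x, because the per-group scales and zero points keep the effective width slightly above 4 bits.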

Engine configuration

from lmdeploy import pipeline, TurbomindEngineConfig

# Two-GPU tensor parallelism, 8K-token context, INT8 KV cache
pipe = pipeline("internlm/internlm2_5-7b-chat",
    backend_config=TurbomindEngineConfig(
        tp=2, session_len=8192, cache_max_entry_count=0.8, quant_policy=8))

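Option	Effect
`tp`	Number of GPUs for tensor parallelism
`session_len`	Maximum context length (tokens)
`cache_max_entry_count`	Fraction of GPU memory reserved for the KV cache (in recent releases, a fraction of the memory left free after the weights load)
`quant_policy`	KV cache quantization (4 = INT4, 8 = INT8)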
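
To see how `cache_max_entry_count` and `quant_policy` interact, it helps to estimate how many tokens the KV cache can hold. The sketch below is plain arithmetic under stated assumptions: a 7B-class model with 32 layers, 8 KV heads, and head dimension 128, plus 10 GiB of free GPU memory. All of these numbers are illustrative, and the small extra storage for quantization scales and zero points is ignored.

```python
GIB = 1024 ** 3

# quant_policy -> bytes per cached element (0 means unquantized FP16)
BYTES_PER_ELEMENT = {0: 2.0, 8: 1.0, 4: 0.5}


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, quant_policy=0):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    elements = 2 * num_layers * num_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[quant_policy]


def cache_capacity_tokens(free_bytes, cache_max_entry_count, bytes_per_token):
    """Tokens that fit in the share of free memory given to the KV cache."""
    return int(free_bytes * cache_max_entry_count // bytes_per_token)


# Assumed model shape and free memory, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
FREE_BYTES = 10 * GIB

for policy in (0, 8, 4):
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, policy)
    tokens = cache_capacity_tokens(FREE_BYTES, 0.8, per_token)
    print(f"quant_policy={policy}: {per_token / 1024:.0f} KiB/token, {tokens} tokens")
```

Each halving of the element width doubles the number of tokens the cache can hold, which translates into more concurrent sessions or longer contexts for the same memory budget.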

Vision-language models

Capability	Note
VLM support	Serve models such as InternVL, LLaVA, and Qwen-VL
Same API	Multimodal messages through the OpenAI-compatible server
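
A multimodal request uses the OpenAI chat format, where the message content is a list of typed parts instead of a plain string. The sketch below only builds the payload. The model name `OpenGVLab/InternVL2-8B` and the image URL are placeholders chosen for illustration, and the helper `build_vlm_messages` is hypothetical.

```python
import json


def build_vlm_messages(question, image_url):
    """Build one user message that combines a text part and an image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]


payload = {
    "model": "OpenGVLab/InternVL2-8B",  # placeholder; use the name your server reports
    "messages": build_vlm_messages(
        "Describe this image.",
        "https://example.com/cat.jpg",  # placeholder image URL
    ),
}
print(json.dumps(payload, indent=2))
```

The resulting JSON can be POSTed to `/v1/chat/completions` in the same way as a text-only request.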

LMDeploy vs. other engines

Aspect	LMDeploy	vLLM	Aphrodite
Engine	TurboMind + PyTorch	PagedAttention	vLLM-derived
Quantization	AWQ + KV cache quantization	Growing set of formats	Broadest range of formats
VLM support	Strong	Yes	Yes
Best for	High throughput + AWQ + VLMs	Standard serving	Community quantization formats
