LMDeploy - LLM Compression and Serving Cheat Sheet

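LMDeploy (from the InternLM/OpenMMLab ecosystem) is a toolkit for compressing, quantizing, and serving large language models. Its high-performance TurboMind engine achieves high throughput through continuous batching, a blocked KV cache, and optimized CUDA kernels; a PyTorch backend is also available. LMDeploy supports 4-bit AWQ weight quantization, KV cache quantization, an OpenAI-compatible API server, and vision-language models (VLMs).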

Installation

Method	Command
pip	`pip install lmdeploy`
With CUDA extensions	`pip install "lmdeploy[all]"`
Docker	`docker run --gpus all openmmlab/lmdeploy:latest`
Requirements	NVIDIA GPU + CUDA
Verify	`lmdeploy --version`

Quick Inference (CLI)

# Interactive chat in the terminal (TurboMind engine)
lmdeploy chat internlm/internlm2_5-7b-chat

# For batch/pipeline inference from Python, see the next section

Python Pipeline

from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
resp = pipe(["Explain RAG in one sentence."])
print(resp[0].text)

Call	Description
`pipeline(model)`	Load the model with the default (TurboMind) engine
`pipe([prompts])`	Batch inference
`GenerationConfig(...)`	Sampling parameters (temperature, top_p, max_new_tokens)
`TurbomindEngineConfig(...)`	Engine tuning (tp, cache, session len)
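
The table lists `GenerationConfig` without showing it in use. The following is a minimal sketch, assuming `pipe(...)` accepts a `gen_config` argument; the helper names `sampling_kwargs` and `generate` and the default values are illustrative choices, not part of LMDeploy. The `lmdeploy` import is deferred into `generate` so that the validation helper can be tried on a machine without a GPU.

```python
def sampling_kwargs(temperature=0.7, top_p=0.9, max_new_tokens=256):
    """Validate and collect the three sampling parameters listed in the table.

    The default values are illustrative, not LMDeploy defaults.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    return {
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
    }


def generate(prompts, model="internlm/internlm2_5-7b-chat", **sampling):
    """Run batch inference with explicit sampling settings (needs lmdeploy and a GPU)."""
    # Imported lazily so sampling_kwargs() stays usable without lmdeploy installed.
    from lmdeploy import GenerationConfig, pipeline

    pipe = pipeline(model)
    gen_config = GenerationConfig(**sampling_kwargs(**sampling))
    responses = pipe(prompts, gen_config=gen_config)
    return [response.text for response in responses]
```

On a GPU machine, `generate(["Explain RAG in one sentence."], temperature=0.2)` would return a list with one string per prompt.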

API Serving

# OpenAI-compatible server on port 23333
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

# Query it; "model" must match a name listed by GET /v1/models
# (by default, the model path passed to api_server)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"internlm/internlm2_5-7b-chat","messages":[{"role":"user","content":"hi"}]}'

Command	Description
`lmdeploy serve api_server MODEL`	Start the OpenAI-compatible server
`lmdeploy serve gradio MODEL`	Launch the Gradio web UI
`--server-port`	API port
`--tp N`	Tensor parallelism across N GPUs
`--session-len N`	Maximum context length
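
The same chat request can be issued from Python using only the standard library. This is a sketch under a few assumptions: the server runs on `localhost:23333` as in the command above, the `model` value matches a name the server reports at `/v1/models`, and the response follows the usual OpenAI chat-completion shape. The function names `build_chat_request` and `send` are illustrative.

```python
import json
import urllib.request


def build_chat_request(user_text, model="internlm/internlm2_5-7b-chat",
                       base_url="http://localhost:23333"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the standard OpenAI response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send(build_chat_request("hi")))` prints the model's reply. Keeping request construction separate from sending makes the payload easy to inspect or log before any network call.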

Quantization (AWQ)

# 4-bit AWQ weight-only quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --work-dir internlm2_5-7b-chat-4bit

# Serve the quantized model
lmdeploy serve api_server internlm2_5-7b-chat-4bit --model-format awq

Command	Description
`lmdeploy lite auto_awq MODEL`	Produce a 4-bit AWQ model
`lmdeploy lite calibrate MODEL`	Run the calibration step
`--model-format awq`	Serve an AWQ-quantized model
`--quant-policy 4` or `--quant-policy 8`	INT4 / INT8 KV cache quantization
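
To get a feel for what 4-bit weight quantization saves, the arithmetic can be sketched as follows. The assumptions are loud ones: a round 7 billion parameters, every weight quantized (in practice embeddings and normalization layers typically stay in higher precision), and a group size of 128 where each group stores one FP16 scale and one 4-bit zero point. Treat the result as a rough lower bound, not a measured figure.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for num_params weights at the given average bit width."""
    return num_params * bits_per_weight / 8


def awq_bits_per_weight(w_bits=4, group_size=128, scale_bits=16):
    """Average bits per weight, including per-group metadata.

    Assumption: each group of `group_size` weights shares one scale
    (scale_bits wide) and one zero point (w_bits wide).
    """
    return w_bits + (scale_bits + w_bits) / group_size


params = 7_000_000_000  # round illustrative figure, not an exact parameter count
fp16_gb = weight_bytes(params, 16) / 1e9
awq_gb = weight_bytes(params, awq_bits_per_weight()) / 1e9
print(f"FP16: {fp16_gb:.2f} GB, 4-bit AWQ: {awq_gb:.2f} GB")
# prints: FP16: 14.00 GB, 4-bit AWQ: 3.64 GB
```

Under these assumptions the weights shrink to roughly a quarter of their FP16 size, which leaves more GPU memory for the KV cache.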

Engine Configuration

from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("internlm/internlm2_5-7b-chat",
    backend_config=TurbomindEngineConfig(
        tp=2, session_len=8192, cache_max_entry_count=0.8, quant_policy=8))

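Option	Effect
`tp`	Number of GPUs for tensor parallelism
`session_len`	Maximum context length
`cache_max_entry_count`	Fraction of GPU memory reserved for the KV cache
`quant_policy`	KV cache quantization (4 = INT4, 8 = INT8)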
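
How `cache_max_entry_count` and `quant_policy` interact can be illustrated with a back-of-the-envelope estimate of how many tokens fit in the KV cache. This sketch rests on stated assumptions: the cache budget is the given fraction of the GPU memory still free after the weights are loaded (the exact accounting depends on the LMDeploy version), quantization overhead such as scales is ignored, and the model shape (32 layers, 8 KV heads, head dimension 128) and the 20 GiB of free memory are illustrative values to be replaced with real ones from the model's `config.json` and your GPU.

```python
GIB = 1024 ** 3

# Bytes per cached element for each quant_policy value (0 = unquantized FP16).
BYTES_PER_ELEMENT = {0: 2.0, 8: 1.0, 4: 0.5}


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, quant_policy=0):
    """KV cache bytes needed per token."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[quant_policy]


def max_cached_tokens(free_bytes, cache_max_entry_count, **model):
    """Tokens that fit when the cache gets a fraction of the free memory."""
    budget = free_bytes * cache_max_entry_count
    return int(budget // kv_bytes_per_token(**model))


# Illustrative shape and free memory; both are assumptions, not measured values.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
for policy in (0, 8, 4):
    tokens = max_cached_tokens(20 * GIB, 0.8, quant_policy=policy, **shape)
    print(f"quant_policy={policy}: about {tokens} cached tokens")
```

With these numbers the estimate is 131072 tokens for FP16, 262144 for INT8, and 524288 for INT4, so each step down in KV precision roughly doubles the number of tokens that can be cached across all concurrent sessions.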

Vision-Language Models

Feature	Notes
VLM support	Serves models such as InternVL, LLaVA, and Qwen-VL
Same API	Multimodal messages through the OpenAI-compatible server
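
A multimodal message for the OpenAI-compatible endpoint might look like the sketch below, which follows the OpenAI convention of a `content` list mixing `text` and `image_url` parts. The model name and the image URL are placeholders chosen for illustration, and `build_vlm_payload` is a hypothetical helper; use whatever model name your server reports and a real image location.

```python
import json


def build_vlm_payload(model, question, image_url):
    """Build a chat-completions payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


payload = build_vlm_payload(
    model="OpenGVLab/InternVL2-8B",           # example model, not from the source
    question="Describe this image.",
    image_url="https://example.com/cat.jpg",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

The resulting JSON is posted to `/v1/chat/completions` exactly like a text-only request; only the shape of `content` changes.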

LMDeploy vs. Other Engines

Aspect	LMDeploy	vLLM	Aphrodite
Engine	TurboMind + PyTorch	PagedAttention	vLLM fork
Quantization	AWQ + KV cache quantization	Expanding	Broadest
VLM support	Strong	Yes	Yes
Best for	High throughput + AWQ + VLM	Standard serving	Community quant formats

Resources