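TensorRT-LLM is an NVIDIA library for optimizing LLM inference on GPUs. It compiles models into optimized TensorRT engines that combine quantization, KV cache management, in-flight batching, and tensor/pipeline parallelism to maximize throughput.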
Installation
# Install via pip (requires NVIDIA GPU and CUDA toolkit)
pip install tensorrt-llm
# Or use the NVIDIA container (recommended): pull the official image
docker pull nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# Or build from source
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
pip install -e .
# Verify installation
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
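The verification step can also be scripted so that a setup job fails early when the package is missing. The following is a minimal sketch using only the standard library; the helper names `is_installed` and `installed_version` are hypothetical and not part of TensorRT-LLM.

```python
import importlib.util
from importlib import metadata


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(dist_name: str):
    """Return the installed distribution version string, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report the state of the current environment.
print("tensorrt_llm importable:", is_installed("tensorrt_llm"))
print("tensorrt-llm version:", installed_version("tensorrt-llm"))
```

Locating the module instead of importing it keeps the check cheap, since the import itself loads the native libraries.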
Model Conversion and Engine Build
# Convert HuggingFace checkpoint to TRT-LLM format
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint \
--dtype float16
# Build TensorRT engine from checkpoint
trtllm-build \
--checkpoint_dir ./tllm_checkpoint \
--output_dir ./engine_outputs \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 2048 \
--max_seq_len 4096
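# Build with quantization (INT8 weight-only); weight-only settings are applied at checkpoint conversion
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_int8 \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_int8 \
--output_dir ./engine_int8 \
--gemm_plugin float16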
# Build with FP8 quantization (Hopper GPUs)
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_fp8 \
--output_dir ./engine_fp8 \
--gemm_plugin float16 \
--strongly_typed
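The limits passed to `trtllm-build` are fixed in the engine, so requests that exceed them have to be rejected or truncated before they reach the runtime. The sketch below shows one way to check a batch against those limits. It assumes that `--max_seq_len` bounds prompt plus generated tokens; the class and function names are hypothetical, and the default values mirror the first build command above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EngineLimits:
    """Limits chosen at build time (defaults mirror the first build command)."""
    max_batch_size: int = 8
    max_input_len: int = 2048
    max_seq_len: int = 4096


def check_request(limits: EngineLimits, input_lens: List[int],
                  max_new_tokens: int) -> List[str]:
    """Return human-readable violations; an empty list means the batch fits."""
    problems = []
    if len(input_lens) > limits.max_batch_size:
        problems.append(
            f"batch size {len(input_lens)} exceeds max_batch_size "
            f"{limits.max_batch_size}")
    for i, n in enumerate(input_lens):
        if n > limits.max_input_len:
            problems.append(
                f"request {i}: input length {n} exceeds max_input_len "
                f"{limits.max_input_len}")
        # Assumption: max_seq_len counts prompt tokens plus generated tokens.
        if n + max_new_tokens > limits.max_seq_len:
            problems.append(
                f"request {i}: {n} + {max_new_tokens} tokens exceeds "
                f"max_seq_len {limits.max_seq_len}")
    return problems
```

For example, `check_request(EngineLimits(), [100, 2000], 256)` returns an empty list, while a 3000-token prompt is reported as exceeding `max_input_len`.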
Quantization Methods
# AWQ quantization (4-bit)
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_awq \
--dtype float16 \
--qformat int4_awq \
--calib_size 512
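# GPTQ quantization: quantize.py does not produce GPTQ weights, so convert a
# pre-quantized GPTQ checkpoint instead (the .safetensors path is a placeholder)
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_gptq \
--dtype float16 \
--quant_ckpt_path /path/to/gptq-checkpoint.safetensors \
--use_weight_only \
--weight_only_precision int4_gptq \
--per_group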
# SmoothQuant (INT8)
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_sq \
--dtype float16 \
--qformat int8_sq \
--calib_size 512
# FP8 quantization (requires Hopper GPU)
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_fp8 \
--dtype float16 \
--qformat fp8 \
--calib_size 512
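The formats above differ mainly in how many bits each weight occupies, which sets a floor on GPU memory for the weights. The following is a rough sketch of that arithmetic. It ignores scaling factors, zero points, and any layers left unquantized, and the 8e9 parameter count is a round-number assumption for an 8B model, not a figure from this guide.

```python
# Nominal storage width per weight for each format (bits).
BITS_PER_WEIGHT = {"float16": 16, "int8": 8, "fp8": 8, "int4": 4}


def weight_gib(num_params: int, fmt: str) -> float:
    """Approximate weight footprint in GiB for a given storage format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


# Assumed parameter count for an 8B-class model.
NUM_PARAMS = 8_000_000_000

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>8}: {weight_gib(NUM_PARAMS, fmt):.2f} GiB")
```

Halving the bit width halves the weight footprint, which is why the 4-bit formats (AWQ, GPTQ) take roughly a quarter of the FP16 footprint before overheads.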
Multi-GPU Engine Build
# Tensor parallelism (split model across GPUs)
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-70B-Instruct \
--output_dir ./tllm_checkpoint_tp4 \
--dtype float16 \
--tp_size 4
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_tp4 \
--output_dir ./engine_tp4 \
--gemm_plugin float16 \
--max_batch_size 16 \
--max_input_len 2048 \
--max_seq_len 4096
# Pipeline parallelism
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-70B-Instruct \
--output_dir ./tllm_checkpoint_pp2 \
--dtype float16 \
--pp_size 2
# Combined tensor + pipeline parallelism
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-70B-Instruct \
--output_dir ./tllm_checkpoint_tp4pp2 \
--dtype float16 \
--tp_size 4 \
--pp_size 2
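When both degrees are set, the number of GPUs required is the product of `--tp_size` and `--pp_size`. The sketch below works through that arithmetic for the combined example. The rank layout (tensor-parallel ranks adjacent within a pipeline stage) and the 80-layer count for a 70B Llama model are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def world_size(tp_size: int, pp_size: int) -> int:
    """Total number of ranks (GPUs) needed for the given parallel degrees."""
    return tp_size * pp_size


def rank_layout(tp_size: int, pp_size: int) -> Dict[int, Tuple[int, int]]:
    """Map each global rank to (pp_rank, tp_rank).

    Assumed layout: ranks in the same pipeline stage are contiguous.
    """
    return {rank: (rank // tp_size, rank % tp_size)
            for rank in range(world_size(tp_size, pp_size))}


def layers_per_stage(num_layers: int, pp_size: int) -> int:
    """Number of transformer layers held by each pipeline stage."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by pp_size")
    return num_layers // pp_size


# Combined example from above: tp_size=4, pp_size=2.
print("GPUs needed:", world_size(4, 2))
print("layers per stage:", layers_per_stage(80, 2))  # 80 layers is assumed
for rank, (pp_rank, tp_rank) in rank_layout(4, 2).items():
    print(f"rank {rank}: stage {pp_rank}, tensor shard {tp_rank}")
```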
Running Inference
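import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner
# Load the tokenizer that matches the model the engine was built from
tokenizer = AutoTokenizer.from_pretrained("/path/to/Llama-3.1-8B-Instruct")
# Some tokenizers define no pad token; fall back to EOS in that case
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
# Load engine
runner = ModelRunner.from_dir(
    engine_dir="./engine_outputs",
    rank=0,
)
# Single request: generate() takes a list of 1-D int32 token tensors
input_ids = torch.tensor(tokenizer.encode("What is deep learning?"), dtype=torch.int32)
outputs = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    end_id=tokenizer.eos_token_id,
    pad_id=pad_id,
)
# Decode output (indexed as [batch][beam]; the sequence includes the prompt tokens)
text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print(text)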
Batch Inference
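import torch
# Process multiple requests simultaneously (reuses `tokenizer` and `runner` from the previous section)
prompts = [
    "Explain neural networks:",
    "What is backpropagation?",
    "Define gradient descent:",
    "What is a transformer?",
]
batch_input_ids = [torch.tensor(tokenizer.encode(p), dtype=torch.int32) for p in prompts]
outputs = runner.generate(
    batch_input_ids=batch_input_ids,
    max_new_tokens=256,
    temperature=0.7,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Response: {tokenizer.decode(output[0], skip_special_tokens=True)}\n")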
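Decoding the returned sequence directly usually echoes the prompt back in the response. The sketch below separates the generated tokens from the prompt using plain lists. It assumes that each returned sequence begins with the prompt tokens and is padded with `end_id` once generation stops; that layout is an assumption worth verifying on your TensorRT-LLM version, and `strip_prompt` is a hypothetical helper.

```python
from typing import List, Sequence


def strip_prompt(output_ids: Sequence[int], input_len: int,
                 end_id: int) -> List[int]:
    """Return only the generated tokens from one output sequence.

    Drops the first `input_len` tokens (the prompt), then cuts the
    remainder at the first `end_id`, discarding trailing padding.
    """
    generated = list(output_ids[input_len:])
    if end_id in generated:
        generated = generated[:generated.index(end_id)]
    return generated


# Toy example: prompt [1, 2, 3], generated [10, 11], then end_id padding (0).
print(strip_prompt([1, 2, 3, 10, 11, 0, 0], input_len=3, end_id=0))
```

With the batch example above, this would be applied per prompt, for instance as `strip_prompt(output[0].tolist(), len(batch_input_ids[i]), tokenizer.eos_token_id)` before decoding.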
KV Cache Configuration
# Enable paged KV cache
trtllm-build \
--checkpoint_dir ./tllm_checkpoint \
--output_dir ./engine_paged \
--gemm_plugin float16 \
--paged_kv_cache enable \
--max_batch_size 32 \
--max_input_len 2048 \
--max_seq_len 4096
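# KV cache in FP8 (reduces memory); the KV cache dtype is chosen when the checkpoint is quantized
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_fp8_kv \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--calib_size 512
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_fp8_kv \
--output_dir ./engine_kv_fp8 \
--gemm_plugin float16 \
--paged_kv_cache enable \
--max_batch_size 64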
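A back-of-envelope estimate helps when choosing `--max_batch_size` and the KV cache precision. The sketch below computes the worst-case cache size, where every sequence in the batch reaches the maximum length. The layer and head counts are assumptions taken from the public Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128), not from this guide, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The factor 2 accounts for one key and one value vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(batch_size: int, seq_len: int, per_token_bytes: int) -> float:
    """Worst-case cache size in GiB if every sequence reaches seq_len."""
    return batch_size * seq_len * per_token_bytes / 2**30


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head_dim 128.
fp16_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
fp8_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

# Limits from the paged KV cache build: batch 32, 4096 tokens per sequence.
print("FP16 KV cache:", kv_cache_gib(32, 4096, fp16_per_token), "GiB")
print("FP8 KV cache: ", kv_cache_gib(32, 4096, fp8_per_token), "GiB")
```

Under these assumptions FP8 halves the cache footprint relative to FP16. A paged cache allocates blocks on demand, so actual usage is typically below this worst case.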
Triton Inference Server Integration
# Set up model repository structure
mkdir -p model_repo/tensorrt_llm/1
cp -r ./engine_outputs/* model_repo/tensorrt_llm/1/
# Launch Triton with TRT-LLM backend
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $(pwd)/model_repo:/models \
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
tritonserver --model-repository=/models
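# Send request to Triton. Note: text_input and max_tokens are inputs of the ensemble model in the
# standard tensorrtllm_backend repository layout; the bare tensorrt_llm model expects token IDs
curl -X POST http://localhost:8000/v2/models/tensorrt_llm/generate \
-H "Content-Type: application/json" \
-d '{"text_input": "What is AI?", "max_tokens": 100}'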
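The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper names are hypothetical, and the endpoint path, model name, and payload fields are copied from that call rather than verified against a particular server configuration.

```python
import json
import urllib.request


def build_generate_request(prompt: str, max_tokens: int,
                           host: str = "localhost", port: int = 8000,
                           model: str = "tensorrt_llm") -> urllib.request.Request:
    """Build the POST request for the generate endpoint without sending it."""
    url = f"http://{host}:{port}/v2/models/{model}/generate"
    body = json.dumps({"text_input": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, max_tokens: int = 100, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_generate_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server):
#   print(generate("What is AI?"))
```

Separating request construction from sending keeps the payload easy to inspect and test without a live server.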
Benchmarking
# Throughput benchmark
python benchmarks/benchmark.py \
--engine_dir ./engine_outputs \
--tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
--dataset_path ShareGPT_V3_unfiltered.json \
--num_requests 1000 \
--max_input_len 2048 \
--max_output_len 512
# Latency benchmark
python benchmarks/benchmark.py \
--engine_dir ./engine_outputs \
--tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
--batch_size 1 \
--input_len 128 \
--output_len 128
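Throughput and latency answer different questions, so it helps to report both from the same run. The sketch below shows how the usual summary figures can be derived from per-request measurements. It is independent of the benchmark script above; the function name and input format are hypothetical, and the 99th percentile uses the simple nearest-rank method.

```python
import math
import statistics
from typing import Dict, List


def summarize(latencies_s: List[float], output_tokens: List[int],
              wall_time_s: float) -> Dict[str, float]:
    """Summarize a run from per-request latencies and output token counts."""
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: smallest value with at least 99% at or below it.
    p99_index = min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "requests_per_s": len(ordered) / wall_time_s,
        "tokens_per_s": sum(output_tokens) / wall_time_s,
        "mean_latency_s": statistics.fmean(ordered),
        "p50_latency_s": statistics.median(ordered),
        "p99_latency_s": ordered[p99_index],
    }


# Toy run: four requests finishing within 4 seconds of wall time.
report = summarize([1.0, 2.0, 3.0, 4.0], [100, 100, 100, 100], wall_time_s=4.0)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Throughput is computed against wall time rather than summed latencies because batched requests overlap in time.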
Supported Models
| Model | Quantization Support |
|---|---|
| Llama 2/3/3.1 | FP16, INT8, FP8, AWQ, GPTQ |
| Mistral / Mixtral | FP16, INT8, FP8, AWQ |
| Falcon | FP16, INT8, AWQ |
| GPT-J / GPT-NeoX | FP16, INT8 |
| Phi-2/3 | FP16, INT8, FP8 |
| Qwen 1.5/2 | FP16, INT8, FP8, AWQ |
| Gemma | FP16, INT8, FP8 |
| ChatGLM | FP16, INT8 |
Build Options Reference
| Flag | Description |
|---|---|
| --checkpoint_dir | Converted checkpoint directory |
| --output_dir | Engine output directory |
| --gemm_plugin | GEMM plugin precision (float16, bfloat16) |
| --max_batch_size | Maximum batch size |
| --max_input_len | Maximum input sequence length |
| --max_seq_len | Maximum total sequence length |
| --tp_size | Tensor parallelism degree |
| --pp_size | Pipeline parallelism degree |
| --use_weight_only | Enable weight-only quantization |
| --paged_kv_cache | Enable paged KV cache |
| --strongly_typed | Enable for FP8 models |
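
When builds are driven from a script, keeping the options in a mapping makes variants easier to compare than hand-edited shell lines. The sketch below turns such a mapping into an argument list suitable for `subprocess.run`. The helper name `build_command` is hypothetical, and the convention that `True` marks a value-less switch is a choice made for this example.

```python
import shlex
from typing import Dict, List, Union

OptionValue = Union[str, int, bool]


def build_command(options: Dict[str, OptionValue]) -> List[str]:
    """Turn a flag-to-value mapping into a trtllm-build argument list."""
    args = ["trtllm-build"]
    for flag, value in options.items():
        args.append(f"--{flag}")
        # True marks a switch that takes no value (e.g. strongly_typed).
        if value is not True:
            args.append(str(value))
    return args


command = build_command({
    "checkpoint_dir": "./tllm_checkpoint",
    "output_dir": "./engine_outputs",
    "gemm_plugin": "float16",
    "max_batch_size": 8,
    "max_input_len": 2048,
    "max_seq_len": 4096,
})
# Print the shell-quoted command for review before running it.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting issues with paths that contain spaces.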