TensorRT-LLM is NVIDIA's library for optimizing LLM inference on GPUs. It compiles models into optimized TensorRT engines and supports quantization, continuous (in-flight) batching, tensor parallelism, and paged attention.
Installation
# Install via pip (requires NVIDIA GPU and CUDA toolkit); wheels are served from NVIDIA's package index
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com
# Or pull the official NVIDIA container (recommended): Triton Inference Server with the TensorRT-LLM backend
docker pull nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
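# Or build from source (the repository uses Git LFS and submodules; the wheel is built by a helper script)
git lfs install
git clone --recurse-submodules https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
python3 ./scripts/build_wheel.py
pip install ./build/tensorrt_llm*.whl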
# Verify installation
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
Model conversion and engine building
# Convert HuggingFace checkpoint to TRT-LLM format
# (each model family has its own convert_checkpoint.py under examples/ in the repository)
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint \
--dtype float16
# Build TensorRT engine from checkpoint
trtllm-build \
--checkpoint_dir ./tllm_checkpoint \
--output_dir ./engine_outputs \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 2048 \
--max_seq_len 4096
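# Build with quantization (INT8 weight-only); in recent releases the weight-only flags
# belong to convert_checkpoint.py, and trtllm-build consumes the quantized checkpoint
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_int8 \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_int8 \
--output_dir ./engine_int8 \
--gemm_plugin float16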
# Build with FP8 quantization (Hopper or Ada GPUs); expects a checkpoint already quantized to FP8 with quantize.py
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_fp8 \
--output_dir ./engine_fp8 \
--gemm_plugin float16 \
--strongly_typed
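The length limits passed to trtllm-build are fixed once the engine is compiled, so it can help to check requests against them before calling the runtime. The following is a minimal sketch, assuming that `--max_seq_len` bounds prompt plus generated tokens and `--max_input_len` bounds the prompt alone; `EngineLimits` and `check_request` are hypothetical helper names, not part of TensorRT-LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineLimits:
    """Build-time limits, mirroring the trtllm-build flags of the first example."""
    max_batch_size: int = 8
    max_input_len: int = 2048
    max_seq_len: int = 4096


def check_request(limits: EngineLimits, input_len: int, max_new_tokens: int) -> list:
    """Return a list of problems; an empty list means the request fits the engine."""
    problems = []
    if input_len > limits.max_input_len:
        problems.append(
            f"prompt has {input_len} tokens, limit is {limits.max_input_len}"
        )
    total = input_len + max_new_tokens
    if total > limits.max_seq_len:
        problems.append(
            f"prompt + output is {total} tokens, limit is {limits.max_seq_len}"
        )
    return problems


limits = EngineLimits()
print(check_request(limits, input_len=1500, max_new_tokens=256))   # fits
print(check_request(limits, input_len=3000, max_new_tokens=2000))  # breaks both limits
```

Rejecting oversized requests up front avoids a runtime error after the prompt has already been tokenized and queued.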
Quantization methods
# AWQ quantization (4-bit weights, 16-bit activations); quantize.py lives in examples/quantization/
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_awq \
--dtype float16 \
--qformat int4_awq \
--calib_size 512
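# GPTQ quantization: quantize.py has no GPTQ format; instead, convert a checkpoint
# that was already quantized with GPTQ (the .safetensors path below is a placeholder)
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_gptq \
--dtype float16 \
--quant_ckpt_path /path/to/gptq-checkpoint.safetensors \
--use_weight_only \
--weight_only_precision int4_gptq \
--per_group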
# SmoothQuant (INT8)
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_sq \
--dtype float16 \
--qformat int8_sq \
--calib_size 512
# FP8 quantization (requires a GPU with FP8 support: Hopper or Ada)
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_fp8 \
--dtype float16 \
--qformat fp8 \
--calib_size 512
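To get a feel for what each format buys, the weight footprint can be estimated from the parameter count and the bits stored per weight. This is a rough sketch: the parameter count of 8 billion is an assumption for an 8B-class model, and the estimate ignores scales, zero-points, activations, and the KV cache.

```python
# Bits stored per weight; per-group scales add a small overhead that is ignored here.
BITS_PER_WEIGHT = {
    "FP16": 16,
    "INT8 (SmoothQuant)": 8,
    "FP8": 8,
    "4-bit (AWQ/GPTQ)": 4,
}


def weight_gib(num_params: float, fmt: str) -> float:
    """Approximate weight memory in GiB for a given storage format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


NUM_PARAMS = 8.0e9  # assumption: roughly 8 billion parameters

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>20}: {weight_gib(NUM_PARAMS, fmt):5.1f} GiB")
```

The numbers are only a lower bound on GPU memory, but they show why 4-bit formats let a model fit on a card where FP16 would not.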
Multi-GPU engine building
# Tensor parallelism (split model across GPUs)
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-70B-Instruct \
--output_dir ./tllm_checkpoint_tp4 \
--dtype float16 \
--tp_size 4
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_tp4 \
--output_dir ./engine_tp4 \
--gemm_plugin float16 \
--max_batch_size 16 \
--max_input_len 2048 \
--max_seq_len 4096
# Pipeline parallelism
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-70B-Instruct \
--output_dir ./tllm_checkpoint_pp2 \
--dtype float16 \
--pp_size 2
# Combined tensor + pipeline parallelism
python convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-70B-Instruct \
--output_dir ./tllm_checkpoint_tp4pp2 \
--dtype float16 \
--tp_size 4 \
--pp_size 2
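A checkpoint converted with `--tp_size 4 --pp_size 2` produces one shard per rank, so it needs 4 x 2 = 8 GPUs. The sketch below shows one way the ranks can be laid out, assuming tensor-parallel ranks are adjacent within a pipeline stage and that layers are split evenly across stages; this layout is a common convention, not a statement of TensorRT-LLM internals, and the 80-layer figure is an assumption for a 70B-class Llama model.

```python
def rank_to_coords(rank: int, tp_size: int, pp_size: int) -> tuple:
    """Map a global rank to (pp_rank, tp_rank), with TP ranks adjacent in a stage."""
    world_size = tp_size * pp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    return rank // tp_size, rank % tp_size


def layers_for_stage(num_layers: int, pp_size: int, pp_rank: int) -> range:
    """Layers owned by one pipeline stage, assuming an even split."""
    if num_layers % pp_size != 0:
        raise ValueError("this sketch assumes num_layers is divisible by pp_size")
    per_stage = num_layers // pp_size
    return range(pp_rank * per_stage, (pp_rank + 1) * per_stage)


TP, PP, NUM_LAYERS = 4, 2, 80  # NUM_LAYERS is an assumption for a 70B-class model

for rank in range(TP * PP):
    pp_rank, tp_rank = rank_to_coords(rank, TP, PP)
    layers = layers_for_stage(NUM_LAYERS, PP, pp_rank)
    print(f"rank {rank}: stage {pp_rank}, tp shard {tp_rank}, "
          f"layers {layers.start}-{layers.stop - 1}")
```

Tensor parallelism splits every layer's weights across the ranks of a stage, while pipeline parallelism assigns whole layers to stages, which is why the two degrees multiply.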
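Running inference
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner
# Load the tokenizer that matches the checkpoint used to build the engine
tokenizer = AutoTokenizer.from_pretrained("/path/to/Llama-3.1-8B-Instruct")
# Llama tokenizers may ship without a pad token; fall back to EOS in that case
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
# Load engine
runner = ModelRunner.from_dir(
    engine_dir="./engine_outputs",
    rank=0,
)
# Single request: generate() expects a list of 1-D token-ID tensors
input_ids = torch.tensor(tokenizer.encode("What is deep learning?"), dtype=torch.int32)
outputs = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    end_id=tokenizer.eos_token_id,
    pad_id=pad_id,
)
# Decode output: outputs is indexed as [batch][beam] and holds the prompt followed by the generated tokens
text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print(text)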
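The decoded text above contains the prompt as well as the completion. A minimal sketch for isolating the completion follows, assuming the runner returns the prompt tokens followed by the generated tokens, padded with `end_id`; `generated_tokens` is a hypothetical helper and the token IDs are made up for illustration.

```python
def generated_tokens(output_ids, input_len: int, end_id: int) -> list:
    """Drop the prompt and everything from the first end_id onward."""
    generated = list(output_ids[input_len:])
    if end_id in generated:
        generated = generated[:generated.index(end_id)]
    return generated


# Toy example: 3 prompt tokens, 2 generated tokens, then end_id padding.
END_ID = 2
output_ids = [1, 5, 6, 7, 8, END_ID, END_ID, END_ID]
print(generated_tokens(output_ids, input_len=3, end_id=END_ID))
```

With a real runner, `input_len` would be the length of the encoded prompt, and the resulting IDs would be passed to `tokenizer.decode`.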
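Batch inference
import torch
# Process multiple requests in one call (reuses runner and tokenizer from the previous example)
prompts = [
    "Explain neural networks:",
    "What is backpropagation?",
    "Define gradient descent:",
    "What is a transformer?",
]
# One 1-D tensor per prompt; prompts may have different lengths
batch_input_ids = [torch.tensor(tokenizer.encode(p), dtype=torch.int32) for p in prompts]
outputs = runner.generate(
    batch_input_ids=batch_input_ids,
    max_new_tokens=256,
    temperature=0.7,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # EOS doubles as padding here
)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Response: {tokenizer.decode(output[0], skip_special_tokens=True)}\n")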
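An engine accepts at most `--max_batch_size` sequences per call, so a longer prompt list has to be split before it reaches the runner. The sketch below shows the splitting step only; `chunks` is a hypothetical helper, and the batch size of 8 is taken from the first build example.

```python
def chunks(items, size: int):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


MAX_BATCH_SIZE = 8  # value passed to trtllm-build in the first build example
prompts = [f"Question {i}" for i in range(20)]

for batch in chunks(prompts, MAX_BATCH_SIZE):
    # In real use, each batch would be tokenized and passed to runner.generate().
    print(len(batch), batch[0], "...", batch[-1])
```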
KV cache configuration
# Enable paged KV cache
trtllm-build \
--checkpoint_dir ./tllm_checkpoint \
--output_dir ./engine_paged \
--gemm_plugin float16 \
--paged_kv_cache enable \
--max_batch_size 32 \
--max_input_len 2048 \
--max_seq_len 4096
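# KV cache with FP8 (reduces memory); the KV-cache dtype is chosen when the checkpoint
# is quantized, so quantize first and then build from that checkpoint
python quantize.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_kv_fp8 \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_kv_fp8 \
--output_dir ./engine_kv_fp8 \
--gemm_plugin float16 \
--paged_kv_cache enable \
--max_batch_size 64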
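The memory saved by an FP8 KV cache can be estimated from the model shape. The sketch below assumes the Llama-3.1-8B attention layout (32 layers, 8 key/value heads, head dimension 128) and computes the worst case in which every sequence in the batch reaches the maximum length; a paged cache only allocates blocks for tokens actually present, so real usage is usually lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def worst_case_gib(max_batch_size: int, max_seq_len: int, per_token: int) -> float:
    """KV cache size in GiB if every sequence fills max_seq_len."""
    return max_batch_size * max_seq_len * per_token / 2**30


# Assumption: Llama-3.1-8B shape (32 layers, 8 KV heads, head_dim 128).
fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

print(f"FP16: {fp16 // 1024} KiB/token, "
      f"{worst_case_gib(32, 4096, fp16):.1f} GiB at batch 32 x 4096 tokens")
print(f"FP8:  {fp8 // 1024} KiB/token, "
      f"{worst_case_gib(32, 4096, fp8):.1f} GiB at batch 32 x 4096 tokens")
```

Halving the bytes per element halves the cache, which is what makes the larger batch size in the FP8 build plausible on the same GPU.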
Triton server integration
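# Set up model repository structure
# (each model directory also needs a config.pbtxt; the tensorrtllm_backend repository ships templates)
mkdir -p model_repo/tensorrt_llm/1
cp -r ./engine_outputs/* model_repo/tensorrt_llm/1/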
# Launch Triton with TRT-LLM backend
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $(pwd)/model_repo:/models \
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
tritonserver --model-repository=/models
# Send request to Triton
curl -X POST http://localhost:8000/v2/models/tensorrt_llm/generate \
-H "Content-Type: application/json" \
-d '{"text_input": "What is AI?", "max_tokens": 100}'
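The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl command above: the host, port, model name, and JSON fields are copied from it, while `build_generate_request` and `generate` are hypothetical helper names. Only `generate` needs a running server.

```python
import json
import urllib.request


def build_generate_request(text: str, max_tokens: int,
                           model: str = "tensorrt_llm",
                           base_url: str = "http://localhost:8000"):
    """Build the POST request for Triton's HTTP generate endpoint."""
    body = json.dumps({"text_input": text, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text: str, max_tokens: int = 100, **kwargs) -> dict:
    """Send the request and return the decoded JSON reply (needs a running server)."""
    request = build_generate_request(text, max_tokens, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Inspect the request without contacting a server.
req = build_generate_request("What is AI?", 100)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect and test when no GPU server is available.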
Benchmarking
# Throughput benchmark
python benchmarks/benchmark.py \
--engine_dir ./engine_outputs \
--tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
--dataset_path ShareGPT_V3_unfiltered.json \
--num_requests 1000 \
--max_input_len 2048 \
--max_output_len 512
# Latency benchmark
python benchmarks/benchmark.py \
--engine_dir ./engine_outputs \
--tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
--batch_size 1 \
--input_len 128 \
--output_len 128
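Throughput and latency runs report different things: throughput is output tokens per second across all requests, while latency is usually summarized with percentiles over per-request times. The sketch below shows both calculations using the nearest-rank percentile method; the measurements are made-up placeholders, not benchmark results.

```python
import math


def throughput_tokens_per_s(total_output_tokens: int, wall_time_s: float) -> float:
    """Aggregate generation throughput over a whole run."""
    return total_output_tokens / wall_time_s


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Placeholder measurements for illustration only.
latencies_s = [0.8, 1.0, 1.2, 0.9, 3.0]
print("p50 latency:", percentile(latencies_s, 50), "s")
print("p99 latency:", percentile(latencies_s, 99), "s")
print("throughput:", throughput_tokens_per_s(512_000, 100.0), "tokens/s")
```

Reporting a high percentile alongside the median exposes tail latency that an average would hide.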
Supported models
| Model | Quantization Support |
|---|---|
| Llama 2/3/3.1 | FP16, INT8, FP8, AWQ, GPTQ |
| Mistral / Mixtral | FP16, INT8, FP8, AWQ |
| Falcon | FP16, INT8, AWQ |
| GPT-J / GPT-NeoX | FP16, INT8 |
| Phi-2/3 | FP16, INT8, FP8 |
| Qwen 1.5/2 | FP16, INT8, FP8, AWQ |
| Gemma | FP16, INT8, FP8 |
| ChatGLM | FP16, INT8 |
Build options reference
| Option | Description |
|---|---|
| --checkpoint_dir | Converted checkpoint directory |
| --output_dir | Engine output directory |
| --gemm_plugin | GEMM plugin precision (float16, bfloat16) |
| --max_batch_size | Maximum batch size |
| --max_input_len | Maximum input sequence length |
| --max_seq_len | Maximum total sequence length (input plus generated tokens) |
| --tp_size | Tensor parallelism degree (passed to convert_checkpoint.py in the examples above) |
| --pp_size | Pipeline parallelism degree (passed to convert_checkpoint.py in the examples above) |
| --use_weight_only | Enable weight-only quantization |
| --paged_kv_cache | Enable paged KV cache |
| --strongly_typed | Enable for FP8 models |