TensorRT-LLM Commands

TensorRT-LLM is NVIDIA's library for optimizing LLM inference on GPUs. It compiles models into optimized TensorRT engines with quantization, continuous (in-flight) batching, tensor parallelism, and paged attention.

Installation

# Install via pip (requires NVIDIA GPU and CUDA toolkit)
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com

# Install from NVIDIA container (recommended)
# Pull the official container
docker pull nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

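# Or build from source (needs git-lfs; the wheel build can take a long time)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
python3 ./scripts/build_wheel.py
pip install ./build/tensorrt_llm*.whl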

# Verify installation
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

Model Conversion and Engine Building

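# Convert HuggingFace checkpoint to TRT-LLM format
# (convert_checkpoint.py ships in the model's examples directory of the TensorRT-LLM repo, e.g. examples/llama)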
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint \
  --dtype float16

# Build TensorRT engine from checkpoint
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint \
  --output_dir ./engine_outputs \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096
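
The length limits in the build command interact: --max_input_len caps the prompt alone, while --max_seq_len caps the prompt plus the generated tokens. A minimal Python sketch of that check, using the values from the command above as defaults. The helper name fits_engine is hypothetical and not part of TensorRT-LLM.

```python
def fits_engine(prompt_tokens: int, max_new_tokens: int,
                max_input_len: int = 2048, max_seq_len: int = 4096) -> bool:
    """Return True if a request stays within the engine's build-time limits."""
    if prompt_tokens > max_input_len:
        return False  # the prompt alone exceeds --max_input_len
    # --max_seq_len bounds the total of prompt and generated tokens
    return prompt_tokens + max_new_tokens <= max_seq_len


if __name__ == "__main__":
    print(fits_engine(1500, 512))    # True: 2012 total tokens
    print(fits_engine(2048, 2049))   # False: 4097 total tokens
```

Running a check like this before submitting requests makes it easier to pick max_new_tokens values that the engine can actually serve.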

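# Build with quantization (INT8 weight-only)
# Weight-only quantization is applied when converting the checkpoint
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_int8 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int8

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_int8 \
  --output_dir ./engine_int8 \
  --gemm_plugin float16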

# Build with FP8 quantization (Hopper GPUs)
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_fp8 \
  --output_dir ./engine_fp8 \
  --gemm_plugin float16 \
  --strongly_typed

Quantization Methods

# AWQ quantization (4-bit weights, 16-bit activations)
# quantize.py ships in examples/quantization of the TensorRT-LLM repo
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_awq \
  --dtype float16 \
  --qformat int4_awq \
  --calib_size 512

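# GPTQ quantization
# GPTQ weights are produced by an external GPTQ tool, then loaded at conversion time;
# /path/to/gptq-weights.safetensors is a placeholder for that pre-quantized file
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_gptq \
  --dtype float16 \
  --quant_ckpt_path /path/to/gptq-weights.safetensors \
  --use_weight_only \
  --weight_only_precision int4_gptq \
  --per_group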

# SmoothQuant (INT8)
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_sq \
  --dtype float16 \
  --qformat int8_sq \
  --calib_size 512

# FP8 quantization (requires Hopper GPU)
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_fp8 \
  --dtype float16 \
  --qformat fp8 \
  --calib_size 512
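
To compare the formats above, it helps to estimate how much GPU memory the weights alone take at each bit width. The following Python sketch is a back-of-the-envelope calculation, not a TensorRT-LLM API: it assumes a round 8 billion parameters and ignores scales, zero points, and layers kept in higher precision.

```python
# Approximate bits per weight (an assumption: per-format overhead is ignored)
BITS_PER_WEIGHT = {
    "FP16": 16,
    "INT8": 8,
    "FP8": 8,
    "INT4": 4,
}


def weight_memory_gib(num_params: float, fmt: str) -> float:
    """Return the approximate weight footprint in GiB for a given format."""
    bits = BITS_PER_WEIGHT[fmt]
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    params = 8e9  # assumed round figure for an 8B-parameter model
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>5}: {weight_memory_gib(params, fmt):6.2f} GiB")
```

The KV cache and activations come on top of this figure, so the real footprint of a running engine is higher.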

Multi-GPU Engine Building

# Tensor parallelism (split model across GPUs)
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-70B-Instruct \
  --output_dir ./tllm_checkpoint_tp4 \
  --dtype float16 \
  --tp_size 4

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_tp4 \
  --output_dir ./engine_tp4 \
  --gemm_plugin float16 \
  --max_batch_size 16 \
  --max_input_len 2048 \
  --max_seq_len 4096

# Pipeline parallelism
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-70B-Instruct \
  --output_dir ./tllm_checkpoint_pp2 \
  --dtype float16 \
  --pp_size 2

# Combined tensor + pipeline parallelism
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-70B-Instruct \
  --output_dir ./tllm_checkpoint_tp4pp2 \
  --dtype float16 \
  --tp_size 4 \
  --pp_size 2
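
The number of GPUs an engine needs is the product of the two parallelism degrees, with one rank per GPU. The Python sketch below works that out for the three configurations above and gives a rough per-GPU weight footprint. The helper names are hypothetical, the 70 billion parameter count is a round assumption, and the even split across ranks is a simplification.

```python
def world_size(tp_size: int = 1, pp_size: int = 1) -> int:
    """Number of ranks (one GPU each) the converted checkpoint is split into."""
    return tp_size * pp_size


def weight_gib_per_gpu(num_params: float, bytes_per_weight: float,
                       tp_size: int = 1, pp_size: int = 1) -> float:
    """Rough per-GPU weight footprint, assuming an even split across ranks."""
    total_bytes = num_params * bytes_per_weight
    return total_bytes / world_size(tp_size, pp_size) / 2**30


if __name__ == "__main__":
    params = 70e9  # assumed round figure for a 70B-parameter model
    for tp, pp in [(4, 1), (1, 2), (4, 2)]:
        gib = weight_gib_per_gpu(params, 2, tp, pp)  # 2 bytes per FP16 weight
        print(f"tp={tp} pp={pp}: {world_size(tp, pp)} GPUs, ~{gib:.1f} GiB weights each")
```

Multi-GPU engines are typically launched with one process per rank (for example through mpirun), so the world size is also the process count to request.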

Running Inference

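import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Load the tokenizer that matches the engine's source checkpoint
tokenizer = AutoTokenizer.from_pretrained("/path/to/Llama-3.1-8B-Instruct")
# Some tokenizers define no pad token; fall back to EOS in that case
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

# Load engine
runner = ModelRunner.from_dir(
    engine_dir="./engine_outputs",
    rank=0,
)

# Single request (generate() expects a list of 1-D token-ID tensors)
input_ids = torch.tensor(tokenizer.encode("What is deep learning?"), dtype=torch.int32)
outputs = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    end_id=tokenizer.eos_token_id,
    pad_id=pad_id,
)

# Decode output: outputs is indexed [batch][beam] and includes the prompt tokens
text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print(text)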

Batch Inference

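import torch

# Process multiple requests simultaneously
# (reuses `tokenizer` and `runner` from the previous example)
prompts = [
    "Explain neural networks:",
    "What is backpropagation?",
    "Define gradient descent:",
    "What is a transformer?",
]

# generate() expects one 1-D int32 tensor of token IDs per prompt
batch_input_ids = [
    torch.tensor(tokenizer.encode(p), dtype=torch.int32) for p in prompts
]

outputs = runner.generate(
    batch_input_ids=batch_input_ids,
    max_new_tokens=256,
    temperature=0.7,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # pad with EOS in case no pad token is defined
)

# output[0] is the first (only) beam for each prompt
for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i]}")
    print(f"Response: {tokenizer.decode(output[0], skip_special_tokens=True)}\n")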

KV Cache Configuration

# Enable paged KV cache
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint \
  --output_dir ./engine_paged \
  --gemm_plugin float16 \
  --paged_kv_cache enable \
  --max_batch_size 32 \
  --max_input_len 2048 \
  --max_seq_len 4096

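# KV cache with FP8 (reduces memory)
# The KV cache precision is chosen at quantization time, not by trtllm-build
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_fp8_kv \
  --dtype float16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_size 512

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_fp8_kv \
  --output_dir ./engine_kv_fp8 \
  --gemm_plugin float16 \
  --paged_kv_cache enable \
  --max_batch_size 64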
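
The memory saved by a lower-precision KV cache can be estimated from the model shape. The Python sketch below computes the worst-case cache size when every sequence in the batch reaches the maximum length. The helper names are hypothetical, and the default shape (32 layers, 8 KV heads, head dimension 128) is an assumption taken from the public Llama-3.1-8B configuration; adjust it for other models.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(max_batch_size: int, max_seq_len: int, bytes_per_elem: int,
                 num_layers: int = 32, num_kv_heads: int = 8,
                 head_dim: int = 128) -> float:
    """Worst-case KV cache size if every sequence reaches max_seq_len."""
    tokens = max_batch_size * max_seq_len
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return tokens * per_token / 2**30


if __name__ == "__main__":
    # Batch size 32 and sequence length 4096, as in the paged KV cache build above
    print(f"FP16 KV cache: {kv_cache_gib(32, 4096, 2):.1f} GiB")  # 16.0 GiB
    print(f"FP8 KV cache:  {kv_cache_gib(32, 4096, 1):.1f} GiB")  # 8.0 GiB
```

With a paged cache, blocks are allocated as sequences grow, so actual usage is usually below this upper bound.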

Triton Server Integration

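# Set up model repository structure
# (each model directory also needs a config.pbtxt; the tensorrtllm_backend repo provides templates)
mkdir -p model_repo/tensorrt_llm/1
cp -r ./engine_outputs/* model_repo/tensorrt_llm/1/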

# Launch Triton with TRT-LLM backend
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repo:/models \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
  tritonserver --model-repository=/models

# Send request to Triton
curl -X POST http://localhost:8000/v2/models/tensorrt_llm/generate \
  -H "Content-Type: application/json" \
  -d '{"text_input": "What is AI?", "max_tokens": 100}'
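
The same request can be sent from Python. This is a minimal sketch using only the standard library; it mirrors the URL and JSON fields of the curl command above. The function names are hypothetical, and the shape of the JSON response depends on how the model repository is configured, so the sketch returns the parsed response as-is.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, text: str,
                           max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for Triton's generate endpoint."""
    payload = {"text_input": text, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, text: str,
             max_tokens: int = 100, timeout_s: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response."""
    req = build_generate_request(base_url, model, text, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (needs the Triton server from the command above to be running):
# print(generate("http://localhost:8000", "tensorrt_llm", "What is AI?"))
```

Separating request construction from sending keeps the payload easy to inspect without a live server.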

Benchmarking

# Throughput benchmark
python benchmarks/benchmark.py \
  --engine_dir ./engine_outputs \
  --tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
  --dataset_path ShareGPT_V3_unfiltered.json \
  --num_requests 1000 \
  --max_input_len 2048 \
  --max_output_len 512

# Latency benchmark
python benchmarks/benchmark.py \
  --engine_dir ./engine_outputs \
  --tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
  --batch_size 1 \
  --input_len 128 \
  --output_len 128
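
Whatever tool produces the raw timings, throughput and latency figures reduce to a few simple formulas. The Python sketch below shows one way to summarize per-request measurements; the function names and the nearest-rank percentile method are illustrative choices, not what the benchmark script itself does.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], output_tokens: list[int],
              wall_time_s: float) -> dict:
    """Summarize a benchmark run from per-request latencies and token counts."""
    return {
        "requests_per_s": len(latencies_s) / wall_time_s,
        "tokens_per_s": sum(output_tokens) / wall_time_s,
        "p50_latency_s": percentile(latencies_s, 50),
        "p99_latency_s": percentile(latencies_s, 99),
    }


if __name__ == "__main__":
    # Made-up timings for four requests, for illustration only
    stats = summarize([1.0, 2.0, 3.0, 4.0], [100, 100, 100, 100], wall_time_s=4.0)
    print(stats)
```

Throughput is measured against wall-clock time for the whole run, while latency is per request, which is why a batched run can show high throughput and high latency at once.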

Supported Models

Model              | Quantization Support
Llama 2/3/3.1      | FP16, INT8, FP8, AWQ, GPTQ
Mistral / Mixtral  | FP16, INT8, FP8, AWQ
Falcon             | FP16, INT8, AWQ
GPT-J / GPT-NeoX   | FP16, INT8
Phi-2/3            | FP16, INT8, FP8
Qwen 1.5/2         | FP16, INT8, FP8, AWQ
Gemma              | FP16, INT8, FP8
ChatGLM            | FP16, INT8

Build Options Reference

Option             | Description
--checkpoint_dir   | Converted checkpoint directory
--output_dir       | Engine output directory
--gemm_plugin      | GEMM plugin precision (float16, bfloat16)
--max_batch_size   | Maximum batch size
--max_input_len    | Maximum input sequence length
--max_seq_len      | Maximum total sequence length (input plus generated tokens)
--tp_size          | Tensor parallelism degree (set in convert_checkpoint.py)
--pp_size          | Pipeline parallelism degree (set in convert_checkpoint.py)
--use_weight_only  | Enable weight-only quantization (set in convert_checkpoint.py)
--paged_kv_cache   | Enable paged KV cache
--strongly_typed   | Enable for FP8 models