TensorRT-LLM Commands

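TensorRT-LLM is NVIDIA's library for optimizing LLM inference on GPUs. It compiles models into optimized TensorRT engines and supports quantization, KV cache management, in-flight batching, and tensor/pipeline parallelism to maximize inference throughput.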

Installation

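# Install via pip (requires NVIDIA GPU and CUDA toolkit)
# NVIDIA's package index hosts the prebuilt wheels and their dependencies
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com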

# Or pull the official NVIDIA container (recommended)
docker pull nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

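# Or build from source (the C++ runtime is compiled into a wheel first;
# a plain editable install does not build it)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
python3 ./scripts/build_wheel.py
pip install ./build/tensorrt_llm*.whl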

# Verify installation
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
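
For setup scripts, the same check can be done without importing the package. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `tensorrt_llm` is assumed to match the pip package above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution is registered under the name "tensorrt_llm"
    version = installed_version("tensorrt_llm")
    print(version if version else "tensorrt_llm is not installed")
```

Unlike the import-based check, this does not load CUDA libraries, so it also works on machines without a GPU.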

Model Conversion and Engine Build

# Convert HuggingFace checkpoint to TRT-LLM format
# (each model family has its own convert_checkpoint.py in the repo's examples directory)
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint \
  --dtype float16

# Build TensorRT engine from checkpoint
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint \
  --output_dir ./engine_outputs \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096

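# Build with quantization (INT8 weight-only)
# Weight-only quantization is applied during checkpoint conversion;
# the engine is then built from the quantized checkpoint
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_int8 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int8

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_int8 \
  --output_dir ./engine_int8 \
  --gemm_plugin float16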

# Build with FP8 quantization (Hopper GPUs)
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_fp8 \
  --output_dir ./engine_fp8 \
  --gemm_plugin float16 \
  --strongly_typed
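
The length limits chosen at build time constrain every request at run time. The sketch below shows how they interact, assuming `max_seq_len` bounds prompt plus generated tokens together; the class `EngineLimits` and the helper `max_new_tokens_allowed` are hypothetical names, not part of TensorRT-LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineLimits:
    """Build-time limits, mirroring the trtllm-build flags used above."""
    max_batch_size: int = 8
    max_input_len: int = 2048
    max_seq_len: int = 4096


def max_new_tokens_allowed(limits: EngineLimits, input_len: int) -> int:
    """Return how many tokens can be generated for a prompt of input_len tokens."""
    if input_len > limits.max_input_len:
        raise ValueError(
            f"prompt has {input_len} tokens, engine accepts at most {limits.max_input_len}"
        )
    # max_seq_len covers the prompt and the generated tokens together
    return limits.max_seq_len - input_len


limits = EngineLimits()
print(max_new_tokens_allowed(limits, 1500))  # 2596
```

A request that needs more than these limits requires rebuilding the engine with larger values, which in turn increases memory use.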

Quantization Methods

# AWQ quantization (4-bit weights, 16-bit activations)
# quantize.py ships in the repo's examples/quantization directory
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_awq \
  --dtype float16 \
  --qformat int4_awq \
  --calib_size 512

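# GPTQ quantization (4-bit)
# GPTQ weights are produced ahead of time by an external tool and imported
# by convert_checkpoint.py; the .safetensors path below is a placeholder
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_gptq \
  --dtype float16 \
  --quant_ckpt_path /path/to/llama-4bit-gptq.safetensors \
  --use_weight_only \
  --weight_only_precision int4_gptq \
  --per_group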

# SmoothQuant (INT8)
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_sq \
  --dtype float16 \
  --qformat int8_sq \
  --calib_size 512

# FP8 quantization (requires Hopper GPU)
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_fp8 \
  --dtype float16 \
  --qformat fp8 \
  --calib_size 512
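
To compare the formats above, it helps to estimate how much GPU memory the weights alone occupy. The sketch below is a rough back-of-the-envelope calculation, not a TensorRT-LLM API: it ignores quantization scales, zero-points, and layers kept in higher precision, and it assumes a model with exactly 8e9 parameters.

```python
# Approximate bytes per weight for each storage format
BYTES_PER_WEIGHT = {
    "float16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,   # SmoothQuant and INT8 weight-only
    "int4": 0.5,   # AWQ and GPTQ 4-bit
}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Estimate weight memory in GB (10**9 bytes) for a given weight format."""
    return num_params * BYTES_PER_WEIGHT[fmt] / 1e9


# Assumption: an 8B-parameter model, rounded to 8e9 parameters
for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt:>8}: {weight_memory_gb(8e9, fmt):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB in FP16 to about 8 GB in INT8 or FP8 and about 4 GB in 4-bit formats; activations and the KV cache come on top.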

Multi-GPU Engine Build

# Tensor parallelism (split model across GPUs)
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-70B-Instruct \
  --output_dir ./tllm_checkpoint_tp4 \
  --dtype float16 \
  --tp_size 4

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_tp4 \
  --output_dir ./engine_tp4 \
  --gemm_plugin float16 \
  --max_batch_size 16 \
  --max_input_len 2048 \
  --max_seq_len 4096

# Pipeline parallelism
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-70B-Instruct \
  --output_dir ./tllm_checkpoint_pp2 \
  --dtype float16 \
  --pp_size 2

# Combined tensor + pipeline parallelism
python convert_checkpoint.py \
  --model_dir /path/to/Llama-3.1-70B-Instruct \
  --output_dir ./tllm_checkpoint_tp4pp2 \
  --dtype float16 \
  --tp_size 4 \
  --pp_size 2
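
The number of GPUs an engine needs follows from the two parallelism degrees. The sketch below computes it and checks the usual divisibility constraints; the function names are hypothetical, and the layer and head counts for Llama-3.1-70B (80 layers, 64 attention heads) are assumptions taken from the public model configuration.

```python
def world_size(tp_size: int, pp_size: int) -> int:
    """Number of GPUs (ranks) needed: one per tensor-parallel x pipeline-parallel slot."""
    return tp_size * pp_size


def check_parallel_config(num_layers: int, num_heads: int,
                          tp_size: int, pp_size: int) -> None:
    """Raise ValueError if heads or layers cannot be split evenly."""
    if num_heads % tp_size != 0:
        raise ValueError(
            f"{num_heads} attention heads are not divisible by tp_size={tp_size}"
        )
    if num_layers % pp_size != 0:
        raise ValueError(
            f"{num_layers} layers are not divisible by pp_size={pp_size}"
        )


# Assumption: Llama-3.1-70B has 80 layers and 64 attention heads
check_parallel_config(num_layers=80, num_heads=64, tp_size=4, pp_size=2)
print(world_size(tp_size=4, pp_size=2))  # 8
```

With `--tp_size 4 --pp_size 2` the converted checkpoint is split for 8 ranks, so the engine has to be launched with 8 processes, one per GPU.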

Running Inference

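import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Load the tokenizer of the model the engine was built from
tokenizer = AutoTokenizer.from_pretrained("/path/to/Llama-3.1-8B-Instruct")
# Some tokenizers define no pad token; fall back to the EOS token in that case
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

# Load engine
runner = ModelRunner.from_dir(
    engine_dir="./engine_outputs",
    rank=0,
)

# Single request: generate() expects a list of 1-D token-ID tensors
input_ids = torch.tensor(tokenizer.encode("What is deep learning?"), dtype=torch.int32)
outputs = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    end_id=tokenizer.eos_token_id,
    pad_id=pad_id,
)

# Decode output: outputs is indexed [batch][beam] and includes the prompt tokens
text = tokenizer.decode(outputs[0][0], skip_special_tokens=True)
print(text)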
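
The decoded text above contains the prompt as well as the completion. The sketch below shows one way to keep only the generated tokens, assuming the output row holds the prompt IDs first, then the generated IDs, then end/pad tokens up to a common length. It works on plain lists so it runs without a GPU; `extract_generated` is a hypothetical helper.

```python
from typing import List, Sequence


def extract_generated(output_ids: Sequence[int], input_len: int, end_id: int) -> List[int]:
    """Drop the prompt prefix and everything from the first end token onward."""
    generated = list(output_ids[input_len:])
    if end_id in generated:
        generated = generated[: generated.index(end_id)]
    return generated


# Toy IDs: 3 prompt tokens, 2 generated tokens, then end token 2 repeated as padding
print(extract_generated([11, 12, 13, 40, 41, 2, 2, 2], input_len=3, end_id=2))  # [40, 41]
```

With a real runner, the row would be something like `outputs[0][0].tolist()` and `input_len` the length of the encoded prompt.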

Batch Inference

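import torch

# Process multiple requests simultaneously (reuses runner and tokenizer from above)
prompts = [
    "Explain neural networks:",
    "What is backpropagation?",
    "Define gradient descent:",
    "What is a transformer?",
]

# One 1-D token-ID tensor per prompt; lengths may differ
batch_input_ids = [
    torch.tensor(tokenizer.encode(p), dtype=torch.int32) for p in prompts
]

outputs = runner.generate(
    batch_input_ids=batch_input_ids,
    max_new_tokens=256,
    temperature=0.7,
    end_id=tokenizer.eos_token_id,
    # Fall back to the EOS token when the tokenizer defines no pad token
    pad_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Response: {tokenizer.decode(output[0], skip_special_tokens=True)}\n")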
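
A single call cannot carry more requests than the engine's `--max_batch_size`. When there are more prompts than that, one simple approach is to split them into chunks and call `generate` once per chunk. The sketch below shows only the chunking step; `chunked` is a hypothetical helper, and the batch size of 4 is chosen just to keep the example short.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


prompts = [f"prompt {i}" for i in range(10)]
for batch in chunked(prompts, batch_size=4):
    # Each batch would be tokenized and passed to runner.generate() here
    print(len(batch), batch[0])
```

This processes batches one after another; serving setups that need requests to join and leave a running batch rely on in-flight batching instead.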

KV Cache Configuration

# Enable paged KV cache
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint \
  --output_dir ./engine_paged \
  --gemm_plugin float16 \
  --paged_kv_cache enable \
  --max_batch_size 32 \
  --max_input_len 2048 \
  --max_seq_len 4096

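# KV cache with FP8 (reduces memory)
# The KV cache precision is chosen when quantizing the checkpoint, not at build time
python quantize.py \
  --model_dir /path/to/Llama-3.1-8B-Instruct \
  --output_dir ./tllm_checkpoint_fp8_kv \
  --dtype float16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_size 512

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint_fp8_kv \
  --output_dir ./engine_kv_fp8 \
  --gemm_plugin float16 \
  --paged_kv_cache enable \
  --max_batch_size 64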
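
The memory saving from an FP8 KV cache can be estimated with simple arithmetic: each token stores one key and one value vector per layer and per KV head. The sketch below is an estimate under stated assumptions, not a TensorRT-LLM API: Llama-3.1-8B is taken to have 32 layers, 8 KV heads, and a head dimension of 128, and the worst case assumes 64 sequences of 4096 tokens each.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(per_token_bytes: int, max_batch_size: int, max_seq_len: int) -> float:
    """Worst-case KV cache size in GiB when every sequence reaches max_seq_len."""
    return per_token_bytes * max_batch_size * max_seq_len / 2**30


# Assumption: Llama-3.1-8B with 32 layers, 8 KV heads, head dimension 128
fp16 = kv_cache_bytes_per_token(32, 8, 128, bytes_per_value=2)
fp8 = kv_cache_bytes_per_token(32, 8, 128, bytes_per_value=1)
print(fp16, fp8)                     # 131072 65536
print(kv_cache_gib(fp16, 64, 4096))  # 32.0
print(kv_cache_gib(fp8, 64, 4096))   # 16.0
```

Halving the bytes per value halves the cache, which is what makes larger batch sizes fit. A paged KV cache allocates blocks on demand, so actual use is usually below this worst case.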

Triton Inference Server Integration

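# Set up model repository structure
# Each model directory also needs a config.pbtxt; templates ship with the
# tensorrtllm_backend repository
mkdir -p model_repo/tensorrt_llm/1
cp -r ./engine_outputs/* model_repo/tensorrt_llm/1/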

# Launch Triton with TRT-LLM backend
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repo":/models \
  nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 \
  tritonserver --model-repository=/models

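# Send request to Triton
# Text-in/text-out requests go through the "ensemble" model from the
# tensorrtllm_backend templates; the tensorrt_llm model itself takes token IDs
curl -X POST http://localhost:8000/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -d '{"text_input": "What is AI?", "max_tokens": 100}'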
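
The same HTTP request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call; the function names are hypothetical, the model name is passed in so it can match whichever model the curl request targets, and the `text_output` response field is an assumption about the server's reply.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str,
                           prompt: str, max_tokens: int) -> urllib.request.Request:
    """Build a POST request for Triton's generate endpoint."""
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str,
             max_tokens: int = 100, timeout_s: float = 60.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.loads(response.read())
    # Assumption: the reply carries the text under "text_output"
    return body["text_output"]
```

Calling `generate("http://localhost:8000", model_name, "What is AI?")` requires a running server; `build_generate_request` can be inspected without one.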

Benchmarking

# Throughput benchmark
python benchmarks/benchmark.py \
  --engine_dir ./engine_outputs \
  --tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
  --dataset_path ShareGPT_V3_unfiltered.json \
  --num_requests 1000 \
  --max_input_len 2048 \
  --max_output_len 512

# Latency benchmark
python benchmarks/benchmark.py \
  --engine_dir ./engine_outputs \
  --tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
  --batch_size 1 \
  --input_len 128 \
  --output_len 128
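
Whatever tool produces the raw timings, the two headline numbers are computed the same way: throughput is generated tokens divided by wall-clock time, and latency is usually reported as percentiles. The sketch below shows both with hypothetical helper names; the measurements are made up for illustration only.

```python
from typing import Sequence


def throughput_tokens_per_s(total_output_tokens: int, wall_time_s: float) -> float:
    """Generated tokens per second over the whole run."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return total_output_tokens / wall_time_s


def latency_percentile(latencies_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not latencies_ms or not 0 < pct <= 100:
        raise ValueError("need at least one sample and 0 < pct <= 100")
    ordered = sorted(latencies_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


# Made-up measurements: 1000 requests x 512 output tokens finished in 64 seconds
latencies = [120.0, 95.0, 130.0, 101.0, 99.0, 250.0, 110.0, 105.0, 98.0, 102.0]
print(throughput_tokens_per_s(512_000, 64.0))  # 8000.0
print(latency_percentile(latencies, 50))       # 102.0
print(latency_percentile(latencies, 90))       # 130.0
```

Reporting a high percentile next to the median matters because a single slow request, like the 250 ms sample here, is invisible in the median.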

Supported Models

Model              | Quantization Support
Llama 2/3/3.1      | FP16, INT8, FP8, AWQ, GPTQ
Mistral / Mixtral  | FP16, INT8, FP8, AWQ
Falcon             | FP16, INT8, AWQ
GPT-J / GPT-NeoX   | FP16, INT8
Phi-2/3            | FP16, INT8, FP8
Qwen 1.5/2         | FP16, INT8, FP8, AWQ
Gemma              | FP16, INT8, FP8
ChatGLM            | FP16, INT8

Build Options Reference

Flag               | Description
--checkpoint_dir   | Converted checkpoint directory
--output_dir       | Engine output directory
--gemm_plugin      | GEMM plugin precision (float16, bfloat16)
--max_batch_size   | Maximum batch size
--max_input_len    | Maximum input sequence length
--max_seq_len      | Maximum total sequence length (input plus output)
--tp_size          | Tensor parallelism degree (set in convert_checkpoint.py)
--pp_size          | Pipeline parallelism degree (set in convert_checkpoint.py)
--use_weight_only  | Enable weight-only quantization (set in convert_checkpoint.py)
--paged_kv_cache   | Enable paged KV cache
--strongly_typed   | Enable for FP8 models
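
When builds are scripted, it can be convenient to keep the options in a dictionary and assemble the command line from it. The sketch below only builds and prints the argument list; `build_command` is a hypothetical helper, and the flag names are taken from the table above rather than validated against the installed `trtllm-build`.

```python
import shlex
from typing import Dict, List, Union


def build_command(options: Dict[str, Union[str, int]]) -> List[str]:
    """Turn {"flag": value} pairs into a trtllm-build argument list."""
    command = ["trtllm-build"]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


command = build_command({
    "checkpoint_dir": "./tllm_checkpoint",
    "output_dir": "./engine_outputs",
    "gemm_plugin": "float16",
    "max_batch_size": 8,
    "max_input_len": 2048,
    "max_seq_len": 4096,
})
print(shlex.join(command))
# To run the build: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.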