TorchTune Commands

TorchTune is a PyTorch-native library for fine-tuning large language models. It provides composable building blocks and ready-made recipes for full fine-tuning, LoRA, and QLoRA on models such as Llama 3, Mistral, Gemma, and Phi.

Installation

# Install from PyPI
pip install torchtune

# Install with the development extras (quoted so the shell does not expand the brackets)
pip install "torchtune[dev]"

# Install from source for latest features
git clone https://github.com/pytorch/torchtune.git
cd torchtune
pip install -e .

# Verify installation
tune --help

Downloading Models

# Download Llama 3.1 8B from HuggingFace
tune download meta-llama/Llama-3.1-8B-Instruct \
  --output-dir /tmp/Llama-3.1-8B-Instruct \
  --hf-token <YOUR_HF_TOKEN>

# Download Mistral 7B
tune download mistralai/Mistral-7B-Instruct-v0.3 \
  --output-dir /tmp/Mistral-7B-v0.3

# Download Gemma 2B
tune download google/gemma-2b \
  --output-dir /tmp/gemma-2b

# List built-in recipes and their configs
tune ls

Running Recipes

# Full fine-tuning on single GPU
tune run full_finetune_single_device \
  --config llama3_1/8B_full_single_device

# LoRA fine-tuning on single GPU
tune run lora_finetune_single_device \
  --config llama3_1/8B_lora_single_device

# QLoRA fine-tuning (4-bit quantization + LoRA)
tune run lora_finetune_single_device \
  --config llama3_1/8B_qlora_single_device

# Distributed full fine-tuning (multi-GPU)
tune run --nproc_per_node 4 full_finetune_distributed \
  --config llama3_1/8B_full

# Distributed LoRA fine-tuning
tune run --nproc_per_node 2 lora_finetune_distributed \
  --config llama3_1/8B_lora
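
The invocations above all follow one shape: `tune run`, an optional `--nproc_per_node N`, the recipe name, then `--config`. When launching runs from a script, that shape can be assembled programmatically. The following is a minimal sketch; `build_tune_run` is a hypothetical helper, not part of TorchTune, and it only composes the argument list without running anything.

```python
import shlex
from typing import List, Optional


def build_tune_run(
    recipe: str,
    config: str,
    nproc_per_node: Optional[int] = None,
    overrides: Optional[List[str]] = None,
) -> List[str]:
    """Assemble the argv for a `tune run` call (hypothetical helper)."""
    argv = ["tune", "run"]
    # The process count goes before the recipe name, as in the commands above.
    if nproc_per_node is not None:
        argv += ["--nproc_per_node", str(nproc_per_node)]
    argv += [recipe, "--config", config]
    # key=value overrides, if any, are appended after the config.
    argv += overrides or []
    return argv


single = build_tune_run("lora_finetune_single_device", "llama3_1/8B_lora_single_device")
multi = build_tune_run("full_finetune_distributed", "llama3_1/8B_full", nproc_per_node=4)
print(shlex.join(single))
print(shlex.join(multi))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the `tune` CLI is installed.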

Configuration Overrides

# Override dataset and training params via CLI
# (the learning rate lives under the optimizer section of the built-in configs)
tune run lora_finetune_single_device \
  --config llama3_1/8B_lora_single_device \
  epochs=3 \
  batch_size=4 \
  optimizer.lr=2e-5 \
  dataset=torchtune.datasets.alpaca_dataset \
  checkpointer.output_dir=/tmp/my_checkpoints

# Override LoRA rank and alpha
tune run lora_finetune_single_device \
  --config llama3_1/8B_lora_single_device \
  model.lora_rank=16 \
  model.lora_alpha=32

# Use a custom config file
tune run lora_finetune_single_device \
  --config ./my_custom_config.yaml
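
The dotted `key=value` overrides above address nested fields of the YAML config: `model.lora_rank=16` targets the `lora_rank` key inside the `model` section. The following is a minimal standard-library sketch of that mapping. It is an illustration only, not TorchTune's actual merge logic; `parse_value` and `apply_overrides` are hypothetical names.

```python
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Best-effort conversion of a CLI string to bool, int, or float."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw  # leave anything else as a plain string


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Merge dotted key=value overrides into a nested dict, in place."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return config


config = {"epochs": 1, "model": {"lora_rank": 8, "lora_alpha": 16}}
apply_overrides(
    config,
    ["epochs=3", "model.lora_rank=16", "checkpointer.output_dir=/tmp/my_checkpoints"],
)
print(config)
```

Note that a key which does not match an existing field is simply added as a new entry, which is why an override with the wrong path can silently have no effect on training.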

Custom Configuration File

# my_custom_config.yaml
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj', 'k_proj', 'output_proj']
  lora_rank: 16
  lora_alpha: 32

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
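  # Assumes the download layout keeps tokenizer.model under original/; adjust if yours differs
  path: /tmp/Llama-3.1-8B-Instruct/original/tokenizer.model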

dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset

seed: 42
shuffle: true
batch_size: 2
epochs: 3
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  weight_decay: 0.01

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.1-8B-Instruct
  output_dir: /tmp/fine-tuned-output
  model_type: LLAMA3
  checkpoint_files:
    - model-00001-of-00004.safetensors
    - model-00002-of-00004.safetensors
    - model-00003-of-00004.safetensors
    - model-00004-of-00004.safetensors

device: cuda
dtype: bf16
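
Each section with a `_component_` key names a Python callable by its dotted import path, and the remaining keys in that section become its keyword arguments. The following is a minimal sketch of that idea using only the standard library. It is not TorchTune's own implementation; `instantiate` here is a hypothetical stand-in, and `fractions.Fraction` replaces a real TorchTune component so the example runs without TorchTune installed.

```python
import importlib
from typing import Any, Dict


def instantiate(node: Dict[str, Any], **extra: Any) -> Any:
    """Import the callable named by `_component_` and call it with the other keys."""
    kwargs = {key: value for key, value in node.items() if key != "_component_"}
    # Split "package.module.name" into the module path and the attribute name.
    module_path, _, attr = node["_component_"].rpartition(".")
    component = getattr(importlib.import_module(module_path), attr)
    return component(**kwargs, **extra)


# Standard-library stand-in for a config section such as `optimizer:` above.
node = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
value = instantiate(node)
print(value)
```

The `extra` keyword arguments illustrate how a recipe can supply values that are not known in the YAML file, such as the model parameters an optimizer needs.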

Custom Datasets

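# Define a custom dataset from ShareGPT-style conversations
from torchtune.data import ShareGPTToMessages
from torchtune.datasets import SFTDataset

def my_dataset(tokenizer, source="json", data_files="train.json", split="train"):
    """Custom dataset built from a local JSON file in ShareGPT format."""
    return SFTDataset(
        source=source,
        data_files=data_files,
        split=split,
        # Converts {"from": ..., "value": ...} turns into TorchTune messages
        message_transform=ShareGPTToMessages(),
        # The tokenizer acts as the model transform. Set max_seq_len on the
        # tokenizer: extra keyword arguments here are assumed to be forwarded
        # to the Hugging Face dataset loader, not used for truncation.
        model_transform=tokenizer,
    )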

# Example contents of train.json (ShareGPT format)
[
  {
    "conversations": [
      {"from": "human", "value": "What is machine learning?"},
      {"from": "gpt", "value": "Machine learning is a subset of AI..."}
    ]
  }
]
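
Before training, it can help to check that a ShareGPT-style file has the expected shape. The following is a minimal standard-library sketch of the conversion from `from`/`value` turns to `role`/`content` messages. The role mapping (`human` to `user`, `gpt` to `assistant`) is an assumption about how ShareGPT data is usually interpreted, and `sharegpt_to_messages` is a hypothetical helper, not a TorchTune function.

```python
import json
from typing import Dict, List

# Assumed role mapping for ShareGPT-style records.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(sample: Dict) -> List[Dict[str, str]]:
    """Convert one ShareGPT record into a list of role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]


raw = """
[
  {
    "conversations": [
      {"from": "human", "value": "What is machine learning?"},
      {"from": "gpt", "value": "Machine learning is a subset of AI..."}
    ]
  }
]
"""
records = json.loads(raw)
messages = [sharegpt_to_messages(record) for record in records]
print(messages[0])
```

An unknown speaker label raises a `KeyError`, which makes malformed records easy to spot before a long training run starts.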

Evaluation

# Run generation to test fine-tuned model
tune run generate \
  --config generation \
  checkpointer.checkpoint_dir=/tmp/fine-tuned-output \
  prompt="Explain transformers in simple terms:"

# EleutherAI evaluation harness integration
# (the task list is quoted so the shell passes the brackets through unchanged)
tune run eleuther_eval \
  --config eleuther_evaluation \
  checkpointer.checkpoint_dir=/tmp/fine-tuned-output \
  tasks="[hellaswag,mmlu]"

Quantization

# QLoRA uses 4-bit NF4 quantization during training
tune run lora_finetune_single_device \
  --config llama3_1/8B_qlora_single_device

# Quantize model after training
tune run quantize \
  --config quantization \
  checkpointer.checkpoint_dir=/tmp/fine-tuned-output

Distributed Training

# Multi-GPU on single node
tune run --nproc_per_node 4 full_finetune_distributed \
  --config llama3_1/8B_full \
  enable_activation_checkpointing=true

# Offload FSDP state to CPU to reduce GPU memory use (slower, but fits larger models)
tune run --nproc_per_node 8 full_finetune_distributed \
  --config llama3_1/70B_full \
  fsdp_cpu_offload=true

# Set specific GPUs
CUDA_VISIBLE_DEVICES=0,1 tune run --nproc_per_node 2 \
  lora_finetune_distributed --config llama3_1/8B_lora

Listing Configs and Recipes

# List all built-in recipes and configs
tune ls

# Copy a config to modify locally
tune cp llama3_1/8B_lora_single_device ./my_lora_config.yaml

# Copy a recipe to customize
tune cp lora_finetune_single_device ./my_recipe.py

Memory Optimization

# Enable activation checkpointing in config
enable_activation_checkpointing: true

# Use gradient accumulation for effective larger batches
gradient_accumulation_steps: 4

# Enable compile for faster training
compile: true
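
Gradient accumulation trades time for memory: the optimizer steps once every `gradient_accumulation_steps` micro-batches, so the effective batch size grows while per-step memory stays that of a single micro-batch. The arithmetic can be sketched as follows; `effective_batch_size` is a hypothetical helper, and `num_devices` assumes plain data parallelism where each device processes its own micro-batch.

```python
def effective_batch_size(
    batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    """Number of samples contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


# batch_size: 2 with gradient_accumulation_steps: 4 on a single GPU
print(effective_batch_size(2, 4))  # 8

# The same settings launched with --nproc_per_node 4
print(effective_batch_size(2, 4, num_devices=4))  # 32
```

When changing the number of GPUs, adjusting `gradient_accumulation_steps` so that this product stays constant keeps runs comparable.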

Common Patterns

Task	Command
Download model	`tune download meta-llama/Llama-3.1-8B-Instruct`
List recipes and configs	`tune ls`
Full fine-tune (1 GPU)	`tune run full_finetune_single_device --config ...`
LoRA fine-tune (1 GPU)	`tune run lora_finetune_single_device --config ...`
QLoRA fine-tune	`tune run lora_finetune_single_device --config ..._qlora_...`
Multi-GPU training	`tune run --nproc_per_node N recipe --config ...`
Generate text	`tune run generate --config generation`
Evaluate model	`tune run eleuther_eval --config eleuther_evaluation`
Copy config locally	`tune cp config_name ./local_config.yaml`