TorchTune Commands

TorchTune is a PyTorch-native library for fine-tuning large language models. It provides composable building blocks and ready-made recipes for full fine-tuning, LoRA, and QLoRA on models such as Llama 3, Mistral, Gemma, and Phi.

Installation

# Install PyTorch, torchvision, and torchao first, then torchtune from PyPI
pip install torch torchvision torchao
pip install torchtune

# Install with the development extras (quoted so the shell does not expand the brackets)
pip install "torchtune[dev]"

# Install from source for latest features
git clone https://github.com/pytorch/torchtune.git
cd torchtune
pip install -e .

# Verify installation
tune --help

Downloading Models

# Download Llama 3.1 8B Instruct from the Hugging Face Hub
tune download meta-llama/Llama-3.1-8B-Instruct \
  --output-dir /tmp/Llama-3.1-8B-Instruct \
  --hf-token <YOUR_HF_TOKEN>

# Download Mistral 7B
tune download mistralai/Mistral-7B-Instruct-v0.3 \
  --output-dir /tmp/Mistral-7B-v0.3

# Download Gemma 2B
tune download google/gemma-2b \
  --output-dir /tmp/gemma-2b

# List built-in recipes and their configs (this does not list downloaded models)
tune ls

Running Recipes

# Full fine-tuning on single GPU
tune run full_finetune_single_device \
  --config llama3_1/8B_full_single_device

# LoRA fine-tuning on single GPU
tune run lora_finetune_single_device \
  --config llama3_1/8B_lora_single_device

# QLoRA fine-tuning (4-bit quantization + LoRA)
tune run lora_finetune_single_device \
  --config llama3_1/8B_qlora_single_device

# Distributed full fine-tuning (multi-GPU)
tune run --nproc_per_node 4 full_finetune_distributed \
  --config llama3_1/8B_full

# Distributed LoRA fine-tuning
tune run --nproc_per_node 2 lora_finetune_distributed \
  --config llama3_1/8B_lora

Configuration Overrides

# Override dataset and training params via CLI
tune run lora_finetune_single_device \
  --config llama3_1/8B_lora_single_device \
  epochs=3 \
  batch_size=4 \
  optimizer.lr=2e-5 \
  dataset=torchtune.datasets.alpaca_dataset \
  checkpointer.output_dir=/tmp/my_checkpoints

# Override LoRA rank and alpha
tune run lora_finetune_single_device \
  --config llama3_1/8B_lora_single_device \
  model.lora_rank=16 \
  model.lora_alpha=32

# Use a custom config file
tune run lora_finetune_single_device \
  --config ./my_custom_config.yaml
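
The `key=value` arguments above use dot notation to reach nested config fields. The following is a minimal sketch of that merge behaviour in plain Python, assuming a config that is just a nested dictionary; `apply_overrides` is a hypothetical helper written for illustration and is not how torchtune itself parses overrides.

```python
import ast
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with `dotted.key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        # Parse numbers where possible; otherwise keep the raw string.
        try:
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value
        # Walk (or create) the parent dictionaries, then set the leaf.
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Illustrative base config; the values are placeholders, not torchtune defaults.
base = {
    "epochs": 1,
    "optimizer": {"lr": 3e-4},
    "model": {"lora_rank": 8, "lora_alpha": 16},
}
merged = apply_overrides(
    base,
    [
        "epochs=3",
        "optimizer.lr=2e-5",
        "model.lora_rank=16",
        "checkpointer.output_dir=/tmp/my_checkpoints",
    ],
)
print(merged)
```

Fields that are not overridden (here `model.lora_alpha`) keep their value from the base config, and the base dictionary itself is left unchanged.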

Custom Configuration File

# my_custom_config.yaml
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj', 'k_proj', 'output_proj']
  lora_rank: 16
  lora_alpha: 32

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Llama-3.1-8B-Instruct/original/tokenizer.model

dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset

seed: 42
shuffle: true
batch_size: 2
epochs: 3
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  weight_decay: 0.01

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.1-8B-Instruct
  output_dir: /tmp/fine-tuned-output
  checkpoint_files:
    - model-00001-of-00004.safetensors
    - model-00002-of-00004.safetensors
    - model-00003-of-00004.safetensors
    - model-00004-of-00004.safetensors
  model_type: LLAMA3

device: cuda
dtype: bf16
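
To get a feel for what `lora_rank: 16` and `lora_alpha: 32` mean in the config above, the sketch below estimates the number of trainable LoRA parameters and the LoRA scaling factor. The projection shapes and layer count are assumptions about Llama 3.1 8B (hidden size 4096, 8 key/value heads of dimension 128, 32 layers), and the count covers only the four attention projections listed in `lora_attn_modules`.

```python
# Assumed (in_dim, out_dim) of each attention projection in Llama 3.1 8B.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "output_proj": (4096, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer layers


def lora_param_count(in_dim: int, out_dim: int, rank: int) -> int:
    """Parameters in the low-rank pair A (rank x in_dim) and B (out_dim x rank)."""
    return rank * (in_dim + out_dim)


def lora_scaling(rank: int, alpha: float) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return alpha / rank


rank, alpha = 16, 32
per_layer = sum(
    lora_param_count(in_dim, out_dim, rank)
    for in_dim, out_dim in PROJECTION_SHAPES.values()
)
total_trainable = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_trainable:,}")
print(f"Scaling factor alpha/rank: {lora_scaling(rank, alpha)}")
```

Under these assumptions the adapters add roughly 13.6 million trainable parameters, a small fraction of the 8 billion frozen base weights. Doubling the rank doubles that count, and keeping `lora_alpha` at twice the rank keeps the scaling factor at 2.0.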

Custom Datasets

# Define a custom dataset builder for ShareGPT-style conversations
from torchtune.data import ShareGPTToMessages
from torchtune.datasets import SFTDataset

def my_dataset(tokenizer, source="json", data_files="train.json", split="train"):
    """Build an SFT dataset from a local JSON file in ShareGPT format."""
    return SFTDataset(
        source=source,
        # Extra keyword arguments are forwarded to Hugging Face load_dataset
        data_files=data_files,
        split=split,
        # Converts {"from": ..., "value": ...} turns into torchtune Messages
        message_transform=ShareGPTToMessages(),
        # Truncation is handled by the tokenizer; set max_seq_len when building it
        model_transform=tokenizer,
    )

Example train.json in ShareGPT format:

[
  {
    "conversations": [
      {"from": "human", "value": "What is machine learning?"},
      {"from": "gpt", "value": "Machine learning is a subset of AI..."}
    ]
  }
]
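
The sketch below illustrates, in plain Python, roughly what happens to a ShareGPT record before tokenization: each `from`/`value` turn is mapped to a role and its content. The `sharegpt_to_role_content` helper and `ROLE_MAP` are hypothetical names used for illustration; the real `ShareGPTToMessages` transform returns torchtune `Message` objects and also handles details such as masking, which are omitted here.

```python
import json

# Assumed mapping from ShareGPT speaker tags to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_role_content(sample: dict) -> list[dict]:
    """Convert one ShareGPT record into a list of {"role", "content"} dicts."""
    messages = []
    for turn in sample["conversations"]:
        role = ROLE_MAP[turn["from"]]  # raises KeyError on unknown speakers
        messages.append({"role": role, "content": turn["value"]})
    return messages


raw = """
[
  {
    "conversations": [
      {"from": "human", "value": "What is machine learning?"},
      {"from": "gpt", "value": "Machine learning is a subset of AI..."}
    ]
  }
]
"""
records = json.loads(raw)
messages = sharegpt_to_role_content(records[0])
for message in messages:
    print(f'{message["role"]}: {message["content"]}')
```

Running a check like this over a dataset file before training is a cheap way to catch records with unexpected speaker tags or missing keys.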

Evaluation

# Run generation to test the fine-tuned model
# (recent torchtune releases may expect prompt.user=... instead of prompt=...)
tune run generate \
  --config generation \
  checkpointer.checkpoint_dir=/tmp/fine-tuned-output \
  prompt="Explain transformers in simple terms:"

# EleutherAI evaluation harness integration
tune run eleuther_eval \
  --config eleuther_evaluation \
  checkpointer.checkpoint_dir=/tmp/fine-tuned-output \
  tasks="[hellaswag,mmlu]"

Quantization

# QLoRA uses 4-bit NF4 quantization during training
tune run lora_finetune_single_device \
  --config llama3_1/8B_qlora_single_device

# Quantize model after training
tune run quantize \
  --config quantization \
  checkpointer.checkpoint_dir=/tmp/fine-tuned-output

Distributed Training

# Multi-GPU on single node
tune run --nproc_per_node 4 full_finetune_distributed \
  --config llama3_1/8B_full \
  enable_activation_checkpointing=true

# Offload FSDP parameters to CPU to fit larger models (slower, but uses less GPU memory)
tune run --nproc_per_node 8 full_finetune_distributed \
  --config llama3_1/70B_full \
  fsdp_cpu_offload=true

# Set specific GPUs
CUDA_VISIBLE_DEVICES=0,1 tune run --nproc_per_node 2 \
  lora_finetune_distributed --config llama3_1/8B_lora

Listing Configs and Recipes

# List all built-in configs
tune ls

# Copy a config to modify locally
tune cp llama3_1/8B_lora_single_device ./my_lora_config.yaml

# Copy a recipe to customize
tune cp lora_finetune_single_device ./my_recipe.py

Memory Optimization

# Enable activation checkpointing in config
enable_activation_checkpointing: true

# Use gradient accumulation for effective larger batches
gradient_accumulation_steps: 4

# Enable torch.compile for faster training
compile: true
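
Gradient accumulation trades time for memory: each optimizer step sees the gradients of several small batches. The arithmetic can be sketched as follows, assuming the per-device `batch_size` of 2 from the custom config and the `gradient_accumulation_steps` of 4 shown above; `effective_batch_size` is a helper name chosen for illustration.

```python
def effective_batch_size(
    batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,
) -> int:
    """Number of samples contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


# Single GPU: 2 samples per step, accumulated over 4 steps.
single_gpu = effective_batch_size(batch_size=2, gradient_accumulation_steps=4)

# Same settings on 4 GPUs with data parallelism.
four_gpus = effective_batch_size(
    batch_size=2, gradient_accumulation_steps=4, num_devices=4
)

print(f"Effective batch size on 1 GPU:  {single_gpu}")
print(f"Effective batch size on 4 GPUs: {four_gpus}")
```

When moving a run from one GPU to several, lowering `gradient_accumulation_steps` by the same factor keeps the effective batch size, and therefore the learning-rate tuning, comparable.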

Common Patterns

Task	Command
Download model	`tune download meta-llama/Llama-3.1-8B-Instruct`
List recipes and configs	`tune ls`
Full fine-tune (1 GPU)	`tune run full_finetune_single_device --config ...`
LoRA fine-tune (1 GPU)	`tune run lora_finetune_single_device --config ...`
QLoRA fine-tune	`tune run lora_finetune_single_device --config ..._qlora_...`
Multi-GPU training	`tune run --nproc_per_node N recipe --config ...`
Generate text	`tune run generate --config generation`
Evaluate model	`tune run eleuther_eval --config eleuther_evaluation`
Copy config locally	`tune cp config_name ./local_config.yaml`