TorchTune Commands
TorchTune is a PyTorch-native library for fine-tuning large language models. It provides composable building blocks and ready-made recipes for full fine-tuning, LoRA, and QLoRA across models such as Llama 3, Mistral, Gemma, and Phi.
Installation
# Install from PyPI
pip install torchtune
# Install with the development extras (quotes stop shells such as zsh from expanding the brackets)
pip install "torchtune[dev]"
# Install from source for latest features
git clone https://github.com/pytorch/torchtune.git
cd torchtune
pip install -e .
# Verify installation
tune --help
Downloading Models
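# Download Llama 3.1 8B Instruct from Hugging Face
# (--ignore-patterns here skips the consolidated .pth copy and keeps the .safetensors shards)
tune download meta-llama/Llama-3.1-8B-Instruct \
--output-dir /tmp/Llama-3.1-8B-Instruct \
--ignore-patterns "original/consolidated.00.pth" \
--hf-token <YOUR_HF_TOKEN>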
# Download Mistral 7B
tune download mistralai/Mistral-7B-Instruct-v0.3 \
--output-dir /tmp/Mistral-7B-v0.3
# Download Gemma 2B
tune download google/gemma-2b \
--output-dir /tmp/gemma-2b
# List built-in recipes and their configs (models are not listed separately)
tune ls
Running Recipes
# Full fine-tuning on single GPU
tune run full_finetune_single_device \
--config llama3_1/8B_full_single_device
# LoRA fine-tuning on single GPU
tune run lora_finetune_single_device \
--config llama3_1/8B_lora_single_device
# QLoRA fine-tuning (4-bit quantization + LoRA)
tune run lora_finetune_single_device \
--config llama3_1/8B_qlora_single_device
# Distributed full fine-tuning (multi-GPU)
tune run --nproc_per_node 4 full_finetune_distributed \
--config llama3_1/8B_full
# Distributed LoRA fine-tuning
tune run --nproc_per_node 2 lora_finetune_distributed \
--config llama3_1/8B_lora
Config Overrides
# Override dataset and training params via CLI
# (nested values use dot notation; the learning rate lives under optimizer)
tune run lora_finetune_single_device \
--config llama3_1/8B_lora_single_device \
epochs=3 \
batch_size=4 \
optimizer.lr=2e-5 \
dataset=torchtune.datasets.alpaca_dataset \
checkpointer.output_dir=/tmp/my_checkpoints
# Override LoRA rank and alpha
tune run lora_finetune_single_device \
--config llama3_1/8B_lora_single_device \
model.lora_rank=16 \
model.lora_alpha=32
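To make the effect of the rank and alpha overrides concrete, here is a minimal sketch based on the standard LoRA formulation (a low-rank update scaled by alpha / rank). The 4096 x 4096 layer size is a hypothetical example, not a value read from any torchtune config, and the helper names are made up for illustration.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in)."""
    # A has shape (rank, d_in) and B has shape (d_out, rank);
    # the frozen base weight is not counted.
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update before it is added to the base output."""
    return alpha / rank


# Hypothetical 4096 x 4096 projection with rank=16 and alpha=32.
print(lora_extra_params(4096, 4096, rank=16))  # 131072
print(lora_scaling(alpha=32, rank=16))         # 2.0
```

Doubling the rank doubles the number of added parameters, while keeping alpha at twice the rank leaves the scaling factor unchanged.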
# Use a custom config file
tune run lora_finetune_single_device \
--config ./my_custom_config.yaml
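The `key=value` overrides above can be thought of as dotted paths into the nested config. The following is a simplified sketch of that idea in plain Python; it is not torchtune's actual parser, and `apply_overrides` and `parse_value` are hypothetical helper names.

```python
import ast
from typing import Any


def parse_value(raw: str) -> Any:
    """Interpret numbers and lists; fall back to the raw string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply 'dotted.key=value' strings to a nested dict, in place."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Walk down the hierarchy, creating missing levels as needed.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return config


config = {"epochs": 1, "model": {"lora_rank": 8, "lora_alpha": 16}}
apply_overrides(config, ["epochs=3", "model.lora_rank=16", "model.lora_alpha=32"])
print(config)
# {'epochs': 3, 'model': {'lora_rank': 16, 'lora_alpha': 32}}
```

Keys without a dot replace top-level values, and each extra dot descends one level, which is why `model.lora_rank=16` only touches the `model` section.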
Custom Config File
# my_custom_config.yaml
# Partial example: a complete recipe config defines further keys (for example loss and
# metric_logger), so starting from a copy made with `tune cp` is the safer route.
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  lora_attn_modules: ['q_proj', 'v_proj', 'k_proj', 'output_proj']
  lora_rank: 16
  lora_alpha: 32
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  # The Llama 3.1 Hugging Face repos keep the tokenizer under original/
  path: /tmp/Llama-3.1-8B-Instruct/original/tokenizer.model
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
seed: 42
shuffle: true
batch_size: 2
epochs: 3
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  weight_decay: 0.01
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-3.1-8B-Instruct
  checkpoint_files:
    - model-00001-of-00004.safetensors
    - model-00002-of-00004.safetensors
    - model-00003-of-00004.safetensors
    - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: /tmp/fine-tuned-output
device: cuda
dtype: bf16
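Each `_component_` entry in the config names a Python object by its dotted import path, and the sibling keys become keyword arguments. torchtune exposes `config.instantiate` for this; the sketch below is not its implementation, only a minimal illustration of the pattern. It uses a standard-library class as a stand-in so it runs without torchtune installed.

```python
import importlib
from typing import Any


def instantiate(node: dict, **extra_kwargs: Any) -> Any:
    """Import the object named by '_component_' and call it with the remaining keys."""
    kwargs = {k: v for k, v in node.items() if k != "_component_"}
    module_path, _, attr_name = node["_component_"].rpartition(".")
    component = getattr(importlib.import_module(module_path), attr_name)
    return component(**kwargs, **extra_kwargs)


# Stand-in component from the standard library (an assumption made for illustration).
node = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
print(instantiate(node))  # 3/4
```

Read this way, the `model` section of the config is roughly a call to `lora_llama3_1_8b(lora_attn_modules=..., lora_rank=16, lora_alpha=32)`.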
Custom Datasets
# Define a custom dataset from ShareGPT-style conversations
from torchtune.data import ShareGPTToMessages
from torchtune.datasets import SFTDataset

def my_dataset(tokenizer, source="json", data_files="train.json", split="train"):
    """Build an SFT dataset from a local ShareGPT-format JSON file."""
    # Sequence length is capped by the tokenizer (its max_seq_len), not by SFTDataset;
    # extra keyword arguments here are forwarded to Hugging Face load_dataset.
    return SFTDataset(
        source=source,
        data_files=data_files,
        split=split,
        message_transform=ShareGPTToMessages(),  # ShareGPT turns -> torchtune messages
        model_transform=tokenizer,               # tokenizes the messages
    )
Example train.json in ShareGPT format:
[
  {
    "conversations": [
      {"from": "human", "value": "What is machine learning?"},
      {"from": "gpt", "value": "Machine learning is a subset of AI..."}
    ]
  }
]
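To show what a message transform has to do with a ShareGPT sample, here is a dependency-free sketch that maps the `from`/`value` turns onto role/content pairs. The role names in `ROLE_MAP` are an assumption for illustration, and torchtune's `ShareGPTToMessages` does more than this (it produces message objects rather than plain dicts).

```python
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(sample: dict) -> list[dict]:
    """Convert one ShareGPT-style sample into role/content message dicts."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]


sample = {
    "conversations": [
        {"from": "human", "value": "What is machine learning?"},
        {"from": "gpt", "value": "Machine learning is a subset of AI..."},
    ]
}
for message in sharegpt_to_messages(sample):
    print(message["role"], "->", message["content"])
# user -> What is machine learning?
# assistant -> Machine learning is a subset of AI...
```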
Evaluation
# Run generation to test fine-tuned model
tune run generate \
--config generation \
checkpointer.checkpoint_dir=/tmp/fine-tuned-output \
prompt="Explain transformers in simple terms:"
# EleutherAI evaluation harness integration
# (quote the list so the shell passes the brackets and inner quotes through unchanged)
tune run eleuther_eval \
--config eleuther_evaluation \
checkpointer.checkpoint_dir=/tmp/fine-tuned-output \
tasks='["hellaswag","mmlu"]'
Quantization
# QLoRA uses 4-bit NF4 quantization during training
tune run lora_finetune_single_device \
--config llama3_1/8B_qlora_single_device
# Quantize model after training
tune run quantize \
--config quantization \
checkpointer.checkpoint_dir=/tmp/fine-tuned-output
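A rough back-of-the-envelope calculation shows why 4-bit weights matter. The sketch below counts weight storage only, assuming a nominal 8 billion parameters; it ignores quantization constants, activations, optimizer state, and LoRA adapter weights, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Nominal 8B-parameter model (an assumption; exact counts differ per model).
print(round(weight_memory_gib(8e9, 16), 1))  # 14.9  (bf16)
print(round(weight_memory_gib(8e9, 4), 1))   # 3.7   (4-bit)
```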
Distributed Training
# Multi-GPU on single node
tune run --nproc_per_node 4 full_finetune_distributed \
--config llama3_1/8B_full \
enable_activation_checkpointing=true
# Offload FSDP parameters to CPU to cut GPU memory use (the distributed recipes already shard with FSDP)
tune run --nproc_per_node 8 full_finetune_distributed \
--config llama3_1/70B_full \
fsdp_cpu_offload=true
# Set specific GPUs
CUDA_VISIBLE_DEVICES=0,1 tune run --nproc_per_node 2 \
lora_finetune_distributed --config llama3_1/8B_lora
Listing Configs and Recipes
# List all built-in recipes and their configs
tune ls
# Copy a config to modify locally
tune cp llama3_1/8B_lora_single_device ./my_lora_config.yaml
# Copy a recipe to customize
tune cp lora_finetune_single_device ./my_recipe.py
Memory Optimization
# Enable activation checkpointing in config
enable_activation_checkpointing: true
# Use gradient accumulation for effective larger batches
gradient_accumulation_steps: 4
# Enable torch.compile for faster training
compile: true
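Gradient accumulation trades steps for memory: the optimizer updates once every `gradient_accumulation_steps` micro-batches. A small sketch of the resulting effective batch size, assuming plain data parallelism where every device processes its own micro-batch (the function name is hypothetical):

```python
def effective_batch_size(
    batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1
) -> int:
    """Samples that contribute to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


print(effective_batch_size(2, 4))                 # 8
print(effective_batch_size(2, 4, num_devices=4))  # 32
```

With `batch_size: 2` and `gradient_accumulation_steps: 4`, a single GPU steps the optimizer as if the batch held 8 samples while only ever holding 2 in memory.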
Common Patterns
| Task | Command |
|---|---|
| Download model | tune download meta-llama/Llama-3.1-8B-Instruct |
| List recipes | tune ls |
| Full fine-tune (1 GPU) | tune run full_finetune_single_device --config ... |
| LoRA fine-tune (1 GPU) | tune run lora_finetune_single_device --config ... |
| QLoRA fine-tune | tune run lora_finetune_single_device --config ..._qlora_... |
| Multi-GPU training | tune run --nproc_per_node N recipe --config ... |
| Generate text | tune run generate --config generation |
| Evaluate model | tune run eleuther_eval --config eleuther_evaluation |
| Copy config locally | tune cp config_name ./local_config.yaml |