Fine-Tuning LLMs in 2026: Axolotl vs Unsloth vs TorchTune vs TRL

Introduction

The LLM fine-tuning landscape in 2026 is simultaneously more capable and more fragmented than ever. Two years ago, fine-tuning meant LoRA on a single GPU with a YAML config and a prayer. Today, teams choose between at least five serious frameworks, each with distinct design philosophies, performance characteristics, and ecosystem integrations. The choice of framework has real consequences for training speed, memory efficiency, model quality, and the operational complexity of your training pipeline.

This guide provides a thorough, practitioner-focused comparison of the four most widely adopted fine-tuning frameworks: Axolotl, Unsloth, TorchTune, and TRL (Transformer Reinforcement Learning). We also cover LLaMA-Factory, which has established a strong position in the Asian ML community and deserves consideration. Each framework has carved out a niche, and understanding those niches is essential for making an informed choice.

The comparison draws from real-world training runs across multiple model architectures, GPU configurations, and training methods. Every benchmark and configuration example in this guide has been validated on current hardware and software versions as of early 2026.

The State of LLM Fine-Tuning in 2026

Fine-tuning in 2026 operates in a fundamentally different environment than it did in 2024. Base models are larger and more capable, which means fine-tuning often achieves excellent results with fewer examples. Quantization-aware training has matured to the point where 4-bit fine-tuned models are competitive with full-precision counterparts. Post-training alignment methods like DPO and GRPO have largely displaced RLHF for preference learning, and the tooling has caught up.

The hardware landscape has shifted too. NVIDIA's H200 and the AMD MI300X have made 80GB+ VRAM accessible in cloud environments, while the RTX 5090 with 32GB has become the go-to consumer-grade training card. Multi-GPU training via FSDP has become the standard approach, displacing DeepSpeed for many workloads due to its tighter PyTorch integration.

On the model side, the open-weight ecosystem has exploded. Llama 4, Mistral Large 2, Qwen 3, and DeepSeek-V3 all provide strong base models for fine-tuning. Each framework's model support coverage has become a key differentiator.

Framework Overview

Axolotl

Axolotl began as a community project to simplify multi-method fine-tuning and has grown into the most feature-complete framework in the ecosystem. It wraps Hugging Face Transformers and PEFT, adding a YAML-based configuration system that covers virtually every training parameter. Axolotl's strength is breadth: it supports more training methods, model architectures, and dataset formats than any other single framework.

Unsloth

Unsloth takes the opposite approach to Axolotl. Rather than wrapping the Hugging Face stack, Unsloth reimplements critical training kernels using Triton, achieving 2-5x speedups over standard implementations. It focuses relentlessly on single-GPU performance and memory efficiency, making it the framework of choice for practitioners working with limited hardware budgets.

TorchTune

TorchTune is Meta's official fine-tuning framework, built from the ground up on native PyTorch primitives. It avoids external dependencies where possible, using torch.compile, DTensor, and FSDP2 instead of third-party libraries. This gives it the tightest integration with the PyTorch ecosystem and the most predictable behavior on new PyTorch releases.

TRL

TRL, maintained by Hugging Face, is the standard library for reinforcement learning from human feedback and related post-training methods. While it supports SFT, its core strength is alignment training: DPO, GRPO, KTO, ORPO, and the full family of preference optimization methods. If your primary workload is alignment rather than supervised fine-tuning, TRL is the natural starting point.

LLaMA-Factory

LLaMA-Factory provides a web UI and CLI for fine-tuning with an emphasis on accessibility. It wraps Hugging Face Transformers and supports a wide range of methods and models. Its web interface makes it popular for teams that want to democratize fine-tuning beyond the ML engineering team.

Architecture and Design Philosophy

The architectural differences between these frameworks are not superficial. They reflect fundamentally different beliefs about what the fine-tuning developer experience should look like.

Axolotl's architecture is configuration-driven. A single YAML file specifies everything: the base model, adapter type, dataset format, training hyperparameters, and hardware settings. This makes Axolotl extremely reproducible. You can hand someone a YAML file and they can recreate your exact training run. The downside is that the configuration space is enormous, and the relationship between options is not always obvious:

base_model: meta-llama/Llama-4-Scout-17B-16E
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: /data/training/conversations.jsonl
    type: sharegpt
    conversation: chatml

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_steps: 100
optimizer: adamw_bnb_8bit

bf16: auto
tf32: true
flash_attention: true
gradient_checkpointing: true

wandb_project: llama4-finetune
wandb_run_id: scout-qlora-v1
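
To make the scheduler settings in the config above concrete, here is a minimal sketch of a linear-warmup, cosine-decay schedule using the same learning_rate (2e-4) and warmup_steps (100). The total of 1000 steps is an assumption for illustration, since the config specifies epochs rather than steps, and the function is a generic formulation rather than Axolotl's own code.

```python
import math


def cosine_lr(step, base_lr=2e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero.

    total_steps is an assumed value; in practice it follows from dataset size,
    batch size, and num_epochs.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_lr(step):.2e}")
```

The learning rate reaches its peak exactly when warmup ends and is back at half the peak midway through the decay phase, which is worth keeping in mind when comparing short runs against long ones.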

Unsloth's architecture centers on custom Triton kernels that replace the standard PyTorch implementations of attention, LoRA forward passes, and loss computation. The API is intentionally minimal:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-4-Scout-17B-16E",
    max_seq_length=8192,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

TorchTune uses a recipe-based architecture. Each training method is a self-contained Python script (a "recipe") that you can read, understand, and modify. Configuration uses YAML files, where each _component_ key names the Python object to instantiate:

model:
  _component_: torchtune.models.llama4.llama4_scout_17b_16e

tokenizer:
  _component_: torchtune.models.llama4.llama4_tokenizer
  path: /models/llama4-scout/tokenizer.model

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: /data/training/conversations.jsonl
  conversation_style: sharegpt

optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-4
  weight_decay: 0.01

batch_size: 2
epochs: 3
gradient_accumulation_steps: 4
compile: true

TRL follows the Hugging Face Trainer pattern, extending it with specialized trainers for each alignment method:

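from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Load the training data that is passed to the trainer below
dataset = load_dataset("json", data_files="/data/training/conversations.jsonl", split="train")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
)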

training_args = SFTConfig(
    output_dir="./output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    bf16=True,
    max_length=8192,  # named max_seq_length in older TRL releases
    packing=True,
    gradient_checkpointing=True,
)

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules="all-linear",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()

Training Methods Support

The range of supported training methods varies significantly across frameworks. The current state of support, method by method:

SFT (Supervised Fine-Tuning) is supported by all five frameworks. This is table stakes. The differences emerge in how efficiently each framework implements SFT, particularly regarding sample packing, sequence parallelism, and memory optimization.

DPO (Direct Preference Optimization) is fully supported by TRL, Axolotl, and LLaMA-Factory. TorchTune added DPO support in late 2025. Unsloth supports DPO through its TRL integration layer.
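
For readers new to DPO, the objective is compact enough to write out by hand. The sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. The function name, the example log-probabilities, and beta=0.1 are illustrative assumptions, not values taken from any framework's defaults.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss equals log(2)
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))
# Policy has moved toward the chosen response: the loss drops below log(2)
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Because the loss depends only on log-probability ratios against the reference model, no separate reward model or sampling loop is needed, which is a large part of why DPO is simpler to operate than PPO.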

GRPO (Group Relative Policy Optimization) emerged as the preferred alignment method for reasoning models following DeepSeek's work. TRL has the most mature GRPO implementation. Axolotl supports it through TRL delegation. TorchTune has native GRPO as of early 2026.
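
The "group relative" part of GRPO is easiest to see in code. The sketch below normalizes the rewards of several completions sampled for the same prompt into advantages. It is a simplified illustration: the use of the population standard deviation and the eps value are assumptions, and real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each completion is scored relative to its siblings, so no learned value
    function is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat the group average receive positive advantages and the rest receive negative ones, which is why GRPO pairs well with verifiable rewards such as answer checking in reasoning tasks.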

RLHF with PPO remains supported in TRL but has fallen out of favor for most use cases. The complexity and instability of PPO training loops made DPO and GRPO attractive alternatives.

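QAT (Quantization-Aware Training) is natively supported in TorchTune through PyTorch's quantization primitives. Unsloth supports QAT through its custom kernels. Axolotl and TRL lean on their bitsandbytes and GPTQ integrations, which quantize weights for QLoRA-style training or post-training compression rather than simulating quantization during training, so they cover similar use cases without being QAT in the strict sense.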

Single-GPU Performance Comparison

Single-GPU performance is where the frameworks show the most dramatic differences. We benchmarked the four primary frameworks (Axolotl, Unsloth, TorchTune, and TRL) on a Llama 3.1 8B QLoRA fine-tune using the same dataset, hyperparameters, and hardware (NVIDIA A100 80GB).

Training configuration: QLoRA r=32, sequence length 4096, batch size 2, gradient accumulation 4, 1000 steps, BF16 precision.
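
It helps to translate these settings into the amount of data the run actually sees. The arithmetic below assumes that "1000 steps" means optimizer steps and that every sequence is filled to the full 4096 tokens, so the token count is an upper bound.

```python
micro_batch_size = 2
gradient_accumulation_steps = 4
sequence_length = 4096
optimizer_steps = 1000  # assumption: steps are optimizer steps, not micro-batches
num_gpus = 1

# Sequences contributing to each weight update
sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
# Upper bound on tokens per update, reached only when sequences are fully packed
tokens_per_step = sequences_per_step * sequence_length
total_tokens = tokens_per_step * optimizer_steps

print(sequences_per_step)  # 8
print(tokens_per_step)     # 32768
print(total_tokens)        # 32768000
```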

Unsloth consistently delivers the fastest single-GPU training, typically 2-3x faster than TRL on the same configuration. The speedup comes from three sources: fused Triton kernels that combine multiple operations into single GPU kernel launches, a custom LoRA implementation that avoids materializing full-rank intermediate tensors, and an optimized gradient checkpointing implementation that reduces recomputation.

TorchTune with torch.compile enabled achieves roughly 1.5x speedup over TRL on longer training runs, though the compilation step adds several minutes of startup overhead. For short fine-tuning runs under 30 minutes, this compilation cost can negate the runtime improvement.
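
The break-even point follows from simple arithmetic. If an uncompiled run takes T minutes, the compiled run takes T / speedup plus the compile overhead, and the two are equal when T = overhead * speedup / (speedup - 1). The sketch below evaluates this for the 1.5x figure above. The overhead values are assumptions, since compile time varies with model and hardware.

```python
def break_even_minutes(compile_overhead_min, speedup):
    """Uncompiled run length at which compiling starts to pay off."""
    if speedup <= 1.0:
        raise ValueError("compilation never pays off without a speedup")
    return compile_overhead_min * speedup / (speedup - 1.0)


for overhead in (5, 10):  # assumed compile overheads, in minutes
    print(overhead, break_even_minutes(overhead, speedup=1.5))
```

At a 1.5x speedup the break-even run length is three times the compile overhead, so a 10-minute compile only pays for itself on runs that would otherwise take longer than 30 minutes.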

Axolotl's performance is essentially identical to TRL for equivalent configurations because it uses the same underlying Hugging Face training loop. Axolotl's value is in configuration convenience rather than raw speed.

Memory efficiency follows a similar pattern. Unsloth's custom kernels reduce peak memory usage by 30-50% compared to standard implementations, often allowing you to train on a single GPU where other frameworks would require gradient offloading or a larger card.

Multi-GPU Scaling

For multi-GPU training, the landscape shifts. TorchTune has the strongest multi-GPU story because it builds directly on PyTorch's FSDP2 and DTensor primitives:

tune run --nproc_per_node 8 full_finetune_distributed \
  --config llama4_scout/17B_full.yaml

TRL and Axolotl use Hugging Face Accelerate for distributed training, which wraps FSDP or DeepSpeed:

accelerate launch --num_processes 8 \
  --mixed_precision bf16 \
  --use_fsdp \
  --fsdp_sharding_strategy FULL_SHARD \
  train.py

Unsloth's multi-GPU support has historically been its weakest area. The custom Triton kernels were designed for single-GPU execution, and while multi-GPU support has improved through 2025 and 2026, it still requires more manual configuration than the alternatives.

For large-scale training runs across multiple nodes, TorchTune and the DeepSpeed integration in TRL/Axolotl are the most battle-tested options. TorchTune's advantage is that it avoids the version compatibility issues that sometimes arise between Accelerate, DeepSpeed, and Transformers.
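
A rough estimate shows why sharding matters for full fine-tuning. The sketch below assumes the common mixed-precision Adam accounting of 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and the two Adam moments) and full sharding of all three across GPUs. It ignores activations, buffers, and fragmentation, so real usage is higher.

```python
def sharded_state_gib(num_params, num_gpus, bytes_per_param=16):
    """Per-GPU memory (GiB) for weights, gradients, and optimizer state
    when all three are fully sharded across num_gpus.

    bytes_per_param=16 is an assumed accounting; activations are excluded.
    """
    return num_params * bytes_per_param / num_gpus / 2**30


params = 8e9  # an 8B-parameter model, rounded
print(f"1 GPU:  {sharded_state_gib(params, 1):.1f} GiB")
print(f"8 GPUs: {sharded_state_gib(params, 8):.1f} GiB")
```

Under these assumptions the model state alone exceeds a single 80GB card, while sharding across eight GPUs leaves most of each card free for activations.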

Configuration and Developer Experience

Developer experience extends beyond initial setup to encompass debugging, reproducibility, and the learning curve for new team members.

Axolotl's YAML configuration is simultaneously its greatest strength and weakness. A single YAML file completely specifies a training run, making reproduction trivial. However, the YAML files can grow to hundreds of lines, and the documentation for less common options is sometimes incomplete. Debugging a configuration issue often means searching GitHub issues.

Unsloth provides the most Pythonic experience. Configuration is code, which means your IDE provides autocomplete and type checking. The learning curve is gentle for anyone comfortable with PyTorch. The downside is that reproducibility requires sharing Python scripts rather than declarative configuration files.

TorchTune strikes a middle ground with its YAML configuration plus recipe architecture. The recipes are readable Python files that serve as both executable code and documentation. When something goes wrong, you can read the recipe source and understand the execution flow. This transparency is valuable for teams that need to understand and modify the training process.

TRL follows the familiar Hugging Face Trainer pattern. If your team already uses Hugging Face for inference and data processing, TRL requires the least new learning. The TrainingArguments pattern is well-documented and widely understood.

Memory Optimization Techniques

Memory efficiency determines whether you can train on the hardware you have. Each framework offers different techniques:

# Unsloth: automatic memory-efficient LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    use_gradient_checkpointing="unsloth",  # Unsloth's own mode; lower VRAM than the standard True setting
)

# TorchTune: activation checkpointing applied per transformer layer
from torchtune import modules, training
training.set_activation_checkpointing(
    model, auto_wrap_policy={modules.TransformerSelfAttentionLayer}
)

Gradient checkpointing is universally supported, but the implementations differ. Unsloth's implementation is the most memory-efficient, selectively recomputing only the cheapest operations. TorchTune's implementation is the most configurable, allowing layer-level granularity.

CPU offloading moves optimizer states, and with ZeRO Stage 3 the parameters as well, to CPU RAM, dramatically reducing GPU memory requirements at the cost of training speed. TRL and Axolotl support this through DeepSpeed ZeRO Stage 3:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}

Quantized optimizers replace standard Adam with 8-bit or 4-bit variants that use a fraction of the memory. All frameworks support bitsandbytes 8-bit Adam. Unsloth additionally offers custom quantized optimizer implementations.
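
To put numbers on "a fraction of the memory", the sketch below counts the LoRA parameters for an r=32 adapter on all linear projections of Llama 3.1 8B and compares Adam state sizes. The layer dimensions come from the public model config. The byte counts (two FP32 moments versus two 8-bit moments per parameter) are a simplification that ignores the small per-block scaling metadata 8-bit optimizers keep.

```python
# Llama 3.1 8B dimensions (from the public model config)
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32
RANK = 32

# (input dim, output dim) of each adapted projection in one decoder layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter adds A (rank x d_in) and B (d_out x rank)
lora_params = LAYERS * sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())

adam32_mib = lora_params * 8 / 2**20  # two FP32 moments: 8 bytes per parameter
adam8_mib = lora_params * 2 / 2**20   # two 8-bit moments: 2 bytes per parameter

print(lora_params)  # 83886080
print(adam32_mib)   # 640.0
print(adam8_mib)    # 160.0
```

For LoRA the saving is a few hundred megabytes. For full fine-tuning the same 4x ratio applies to all eight billion parameters, which is where quantized optimizers make the difference between fitting and not fitting.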

When to Use Which: Decision Matrix

The choice of framework should be driven by your specific requirements rather than benchmark headlines.

Choose Unsloth when you are working with a single GPU and need maximum training speed and memory efficiency. Unsloth is the clear winner for individual practitioners, small teams, and any scenario where you are training on consumer hardware or a single cloud GPU. If you are running QLoRA on an RTX 4090 or a single A100, Unsloth will get you there fastest.

Choose TorchTune when you need multi-GPU full fine-tuning with the tightest possible PyTorch integration. TorchTune is the right choice for teams that run large-scale training jobs, need reproducibility across PyTorch versions, and want to minimize external dependencies. It is also the best choice if you are fine-tuning Meta's Llama models, as it receives first-party support for new Llama architectures.

Choose TRL when your primary workload is alignment training (DPO, GRPO, KTO, ORPO). TRL has the most mature and complete implementation of preference optimization methods. It is also the natural choice if your workflow is deeply integrated with the Hugging Face ecosystem.

Choose Axolotl when you need maximum flexibility in a single tool. Axolotl supports more model architectures, training methods, and dataset formats than any other framework. It is the right choice for teams that train many different models and need a single, consistent interface.

Choose LLaMA-Factory when you need to empower non-ML-engineers to run fine-tuning jobs. Its web UI lowers the barrier to entry significantly, and its CLI is straightforward for scripted workflows.

Dataset Preparation and Format Handling

One of the most underrated aspects of choosing a fine-tuning framework is how it handles dataset preparation. Real-world training data is messy, inconsistent, and rarely in the exact format a framework expects out of the box.

Axolotl excels here with support for over a dozen dataset formats, including ShareGPT, Alpaca, OpenAI chat completions, JSONL with custom field mappings, and raw completion formats. Its dataset preprocessing pipeline handles tokenization, chat template application, and sample packing in a single pass:

datasets:
  - path: /data/sharegpt_conversations.jsonl
    type: sharegpt
    conversation: chatml
  - path: /data/alpaca_instructions.jsonl
    type: alpaca
  - path: /data/completions.jsonl
    type: completion
    field_instruction: prompt
    field_output: response

TRL uses the standard Hugging Face datasets library and expects data in a conversational format with messages arrays. Converting custom formats requires writing a preprocessing function:

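from datasets import load_dataset

# ShareGPT-style speaker tags mapped to the roles chat templates expect
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def format_conversations(example):
    messages = []
    for turn in example["conversation"]:
        messages.append({
            # Fall back to the original tag if it is already a standard role
            "role": ROLE_MAP.get(turn["from"], turn["from"]),
            "content": turn["value"]
        })
    return {"messages": messages}

# split="train" returns a Dataset rather than a DatasetDict
dataset = load_dataset("json", data_files="/data/training.jsonl", split="train")
dataset = dataset.map(format_conversations)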

Unsloth delegates dataset handling to the user, providing helper functions for common formats but expecting you to handle preprocessing yourself. This gives maximum flexibility at the cost of more boilerplate code.

TorchTune provides dataset builders for common formats and emphasizes type safety in dataset construction. Its dataset classes validate the structure of each example before training, catching format issues early rather than during a long training run.

Sample packing, where multiple short examples are concatenated into a single sequence to maximize GPU utilization, is a critical optimization for datasets with variable-length examples. Axolotl and TRL both support sample packing natively. Unsloth implements its own optimized packing algorithm. TorchTune added sample packing in mid-2025 and its implementation handles edge cases like cross-example attention masking correctly.
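
The core idea behind packing can be sketched as a bin-packing problem. The example below uses a first-fit-decreasing heuristic over tokenized example lengths. It is a simplified illustration, not the algorithm any of these frameworks actually ships, and it leaves out the attention masking that keeps packed examples from attending to each other. The lengths are made up.

```python
def pack_first_fit(lengths, max_len):
    """Group example lengths into bins whose totals do not exceed max_len."""
    bins = []  # each bin holds the lengths packed into one training sequence
    for length in sorted(lengths, reverse=True):
        if length > max_len:
            raise ValueError(f"example of {length} tokens exceeds max_len")
        for current in bins:
            if sum(current) + length <= max_len:
                current.append(length)
                break
        else:
            # No existing bin has room, so open a new one
            bins.append([length])
    return bins


lengths = [3000, 5000, 1200, 4096, 800, 2500, 7000]  # illustrative token counts
packed = pack_first_fit(lengths, max_len=8192)
print(len(packed))  # 3 packed sequences instead of 7 padded ones

utilization = sum(lengths) / (len(packed) * 8192)
print(f"{utilization:.0%} of packed tokens are real data")
```

Padding each of the seven examples to 8192 tokens would waste more than half of the compute, which is why packing matters most for datasets with many short examples.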

Evaluation and Benchmarking During Training

Evaluating model quality during training is essential for detecting overfitting, selecting the best checkpoint, and comparing runs. Each framework approaches in-training evaluation differently.

TRL provides the most seamless evaluation integration because it builds on the Hugging Face Trainer, which supports evaluation datasets, custom metrics callbacks, and automatic best-checkpoint selection:

from trl import SFTConfig

training_args = SFTConfig(
    output_dir="./output",
    eval_strategy="steps",  # named evaluation_strategy in older Transformers releases
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    per_device_eval_batch_size=4,
)

Axolotl supports evaluation through its YAML configuration with similar options for eval frequency, metric selection, and checkpoint management. It additionally supports early stopping based on evaluation metrics:

eval_steps: 100
save_steps: 100
eval_batch_size: 4
early_stopping_patience: 5
load_best_model_at_end: true
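
The patience setting above can be read as "stop after this many consecutive evaluations without improvement". The sketch below shows that logic in isolation. It is a generic illustration rather than Axolotl's implementation, and the min_delta parameter and the loss values are assumptions.

```python
def should_stop(eval_losses, patience=5, min_delta=0.0):
    """Return True once `patience` consecutive evaluations fail to improve
    on the best loss seen so far by more than min_delta."""
    best = float("inf")
    evals_without_improvement = 0
    for loss in eval_losses:
        if loss < best - min_delta:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return True
    return False


# Best loss at the second evaluation, then five evaluations with no improvement
print(should_stop([1.00, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95]))  # True
# A late improvement resets the counter
print(should_stop([1.00, 0.90, 0.91, 0.92, 0.85]))  # False
```

With eval_steps set to 100 and a patience of 5, training continues for at least 500 steps past the best checkpoint before stopping, so the checkpoint retention policy needs to keep that checkpoint around.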

TorchTune's recipe architecture means evaluation logic lives in the recipe Python file itself, giving you full control over what metrics are computed and when. You can add custom evaluation logic, run benchmark suites, or even generate sample outputs during training:

# Inside a TorchTune recipe
if step % eval_interval == 0:
    model.eval()
    eval_loss = compute_eval_loss(model, eval_dataloader)
    perplexity = torch.exp(eval_loss)
    log_metrics({"eval_loss": eval_loss, "perplexity": perplexity})
    model.train()

Unsloth does not include built-in evaluation hooks, relying on the user to implement evaluation externally or use Unsloth within a TRL SFTTrainer where TRL's evaluation infrastructure applies.

For alignment training specifically, evaluation is more nuanced than loss curves. TRL supports win-rate evaluation against a reference model during DPO training, providing a direct measure of whether the aligned model is improving in the intended direction.

Production Deployment Considerations

Fine-tuning is only valuable if the resulting model can be reliably deployed. Each framework has different export and deployment stories.

All frameworks export standard Hugging Face model formats, which can be served by vLLM, TGI, or any inference framework that supports the Hugging Face model hub format. LoRA adapters can be merged into the base model before deployment or served separately using frameworks that support dynamic adapter loading.
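
Merging an adapter is a small piece of linear algebra: the merged weight is W + (alpha / r) * B A. The sketch below works through it on a tiny 2x2 example in plain Python. The matrices are made up for illustration, and in practice the merge is done by the adapter library rather than by hand.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA adapter into the base weight: W + (alpha / r) * B @ A.

    W has shape (out, in), B has shape (out, r), A has shape (r, in).
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight
A = [[0.5, 0.25]]             # rank-1 down-projection
B = [[1.0], [2.0]]            # rank-1 up-projection
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 0.5], [2.0, 2.0]]
```

After merging, inference costs the same as the base model because the adapter no longer exists as a separate computation. Serving adapters separately keeps the base weights shared across many fine-tunes at the price of the extra low-rank matmuls.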

Unsloth provides optimized GGUF export for llama.cpp and Ollama deployment:

model.save_pretrained_gguf(
    "output_model",
    tokenizer,
    quantization_method="q4_k_m"
)

TorchTune integrates with ExecuTorch for mobile and edge deployment, which is a unique advantage if your deployment targets extend beyond cloud servers.

For production training pipelines, consider containerizing your training workflow:

docker run --gpus all -v /data:/data -v /models:/models \
  axolotl-train:latest \
  accelerate launch -m axolotl.cli.train /data/config.yaml

Version-pin your framework, PyTorch, and CUDA toolkit. Training results can vary significantly between versions, and debugging non-reproducibility across environments is one of the most time-consuming problems in ML operations.
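
One lightweight way to make version drift visible is to record the installed package versions next to every checkpoint. The sketch below uses only the Python standard library. The package list is an assumption to adapt to your stack, and it does not capture the CUDA toolkit or driver version, which need to be recorded separately.

```python
import json
import platform
from importlib import metadata


def capture_versions(packages=("torch", "transformers", "peft", "trl", "accelerate")):
    """Collect installed versions of the given packages (None if missing)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print the snapshot; in a pipeline this would be written alongside the checkpoint
print(json.dumps(capture_versions(), indent=2))
```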

Whichever framework you choose, invest in experiment tracking from day one. Weights and Biases, MLflow, or even a structured log directory will save you hours of confusion when you need to compare results across training runs. Every framework supports W&B integration, and the metadata it captures (configuration, hardware, training curves, evaluation metrics) is essential for making informed decisions about model quality and training efficiency.