Axolotl Cheat Sheet

Overview

Axolotl is a configuration-driven fine-tuning framework that wraps HuggingFace Transformers, PEFT, and TRL into a clean YAML interface. Rather than writing training scripts from scratch, you describe your entire training run in a single config.yaml and run axolotl train config.yaml.

Key features: supports full fine-tuning, LoRA, QLoRA, ReLoRA; many built-in dataset formats (alpaca, sharegpt, completion, chat_template); multi-GPU via FSDP or DeepSpeed; integrated evaluation, sample packing, flash attention, and model merging.

Installation

# pip install (CPU/GPU with CUDA 12.1)
pip install axolotl

# With flash attention (recommended for speed)
pip install "axolotl[flash-attn]"   # quotes keep shells such as zsh from expanding the brackets

# From source (latest features)
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git
cd axolotl
pip install -e ".[flash-attn,deepspeed]"

# Docker (recommended for reproducibility)
docker pull winglian/axolotl:main-latest
docker run --gpus all -it -v $(pwd):/workspace winglian/axolotl:main-latest

# Verify
axolotl --help
python -c "import axolotl; print(axolotl.__version__)"

Configuration

Axolotl is configured entirely through YAML. Below are the critical sections.

Minimal LoRA Config

# llama3-lora.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: true          # 8-bit base weights for plain LoRA
load_in_4bit: false         # for QLoRA, set this to true and use adapter: qlora
strict: false

# Dataset
datasets:
  - path: mhenrichsen/alpaca_data_cleaned
    type: alpaca             # built-in format handler

dataset_prepared_path: ./last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/llama3-lora

# Sequence
sequence_len: 4096
sample_packing: true        # pack short samples together for efficiency
pad_to_sequence_len: true

# LoRA
adapter: lora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Training
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false      # only train on assistant turns

# Mixed precision
bf16: true
fp16: false
tf32: false

# Flash attention
flash_attention: true

# Logging
logging_steps: 10
eval_steps: 100
save_steps: 100
wandb_project: my-fine-tune   # optional W&B integration
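
To sanity-check the training section above, the effective batch size is micro_batch_size * gradient_accumulation_steps * number of GPUs. The sketch below is plain Python, not an Axolotl API. The function names, the single-GPU setting, and the figure of 1,000 packed training sequences are assumptions made for illustration.

```python
import math

def effective_batch_size(micro_batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Sequences consumed per optimizer step across all GPUs."""
    return micro_batch_size * grad_accum_steps * num_gpus

def optimizer_steps(num_sequences: int, num_epochs: int, batch: int) -> int:
    """Approximate optimizer steps, assuming the last partial batch is kept."""
    return math.ceil(num_sequences / batch) * num_epochs

# Values from the config above: micro_batch_size 2, gradient_accumulation_steps 4, 3 epochs.
batch = effective_batch_size(micro_batch_size=2, grad_accum_steps=4, num_gpus=1)
steps = optimizer_steps(num_sequences=1000, num_epochs=3, batch=batch)  # 1000 is hypothetical
print(batch, steps)
```

Comparing the step count with eval_steps and save_steps shows roughly how many evaluations and checkpoints a run will produce.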

Core CLI Commands

Command	Description
`axolotl train config.yaml`	Start training run
`axolotl train config.yaml` (with `auto_resume_from_checkpoints: true` in the config)	Resume from the latest checkpoint in `output_dir`
`axolotl train config.yaml --debug`	Enable debug logging
`axolotl evaluate config.yaml`	Run evaluation only
`axolotl merge-lora config.yaml`	Merge LoRA adapter into base model
`axolotl preprocess config.yaml`	Preprocess and cache dataset only
`axolotl inference config.yaml --gradio`	Launch Gradio inference UI
`accelerate launch -m axolotl.cli.train config.yaml`	Multi-GPU launch via accelerate
`deepspeed --module axolotl.cli.train config.yaml`	DeepSpeed launcher (runs the installed module)

Dataset Formats

Type	Description	Required Fields
`alpaca`	Instruction/input/output triplets	`instruction`, `output` (+ optional `input`)
`sharegpt`	Multi-turn conversations	`conversations` list with `from`/`value`
`chat_template`	Apply tokenizer’s chat template	`messages` list with role/content
`completion`	Raw text completion	`text`
`input_output`	Template-free segments with per-segment loss masking	`segments` list with `label`/`text`
`context_qa`	Article + question + answer	`article`, `question`, `answer` (`context_qa.load_v2` uses `context` instead of `article`)
`gpteacher`	GPTeacher format	`instruction`, `input`, `response`
`explainchoice`	MCQ with explanation	`question`, `choices`, `explanation`
`json`	Custom JSON with field mapping	configurable via `field_*` params
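
A quick way to catch malformed rows before preprocessing is to check each record against the required fields in the table. This is a minimal sketch, not an Axolotl utility. The `REQUIRED_FIELDS` mapping and the `missing_fields` helper are hypothetical names, and only a few of the formats are covered.

```python
# Required top-level keys per dataset type, copied from the table above.
REQUIRED_FIELDS = {
    "alpaca": {"instruction", "output"},
    "sharegpt": {"conversations"},
    "chat_template": {"messages"},
    "completion": {"text"},
    "gpteacher": {"instruction", "input", "response"},
}

def missing_fields(record: dict, dataset_type: str) -> set:
    """Return the required keys that are absent from one dataset record."""
    return REQUIRED_FIELDS[dataset_type] - set(record)

# An alpaca row that forgot its output field.
row = {"instruction": "Translate to French.", "input": "Good morning"}
print(missing_fields(row, "alpaca"))
```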

Advanced Usage

Full Fine-Tuning Config

base_model: mistralai/Mistral-7B-v0.3
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

# No adapter = full fine-tune
adapter:
load_in_4bit: false
load_in_8bit: false

datasets:
  - path: your-org/your-dataset
    type: sharegpt
    conversation: chatml

sequence_len: 8192
sample_packing: true

# Training hyperparameters
num_epochs: 1
micro_batch_size: 1
gradient_accumulation_steps: 8
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 0.00005
weight_decay: 0.01
warmup_ratio: 0.03

# Full fine-tune needs DeepSpeed or FSDP
# See deepspeed config below
deepspeed: configs/deepspeed/zero3_bf16.json

bf16: true
flash_attention: true
output_dir: ./outputs/mistral-full-finetune
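
The `warmup_ratio` and `lr_scheduler: cosine` settings above describe a linear warmup followed by cosine decay. The sketch below approximates that schedule in plain Python to show how the learning rate changes over a run. It is not the scheduler code Transformers uses (rounding of the warmup step count may differ), and `lr_at_step` and the 1,000-step run length are assumptions.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-5, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run of 1000 optimizer steps: 30 warmup steps, then decay.
for step in (0, 15, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```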

Multi-Dataset Training

datasets:
  - path: mhenrichsen/alpaca_data_cleaned
    type: alpaca
    ds_type: json
    split: train

  - path: teknium/OpenHermes-2.5
    type: sharegpt
    conversation: chatml
    split: train[:10000]    # use first 10k examples

  - path: ./local_data/custom.jsonl
    type: completion
    ds_type: json
    data_files:
      - custom.jsonl

# Datasets are concatenated, so each contributes in proportion to its size.
dataset_exact_dedup: true  # drop exact duplicate rows from the combined data
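
Exact deduplication keeps the first copy of each identical row and drops the rest. The sketch below shows the idea with a content hash. It is an illustration, not Axolotl's implementation, and the `exact_dedup` helper is a hypothetical name.

```python
import hashlib
import json

def exact_dedup(rows):
    """Keep the first occurrence of each row; rows match only if every field is identical."""
    seen = set()
    unique = []
    for row in rows:
        # sort_keys makes the hash independent of key order within a row
        key = hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"instruction": "Add 2 and 2", "output": "4"},
    {"output": "4", "instruction": "Add 2 and 2"},   # same content, different key order
    {"instruction": "Add 2 and 3", "output": "5"},
]
print(len(exact_dedup(rows)))
```

Near-duplicates (different whitespace or casing) survive this kind of check, so it complements rather than replaces manual data cleaning.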

ShareGPT Dataset Format

sharegpt format (multi-turn conversation), one JSON object per record:
{
  "conversations": [
    {"from": "human", "value": "What is photosynthesis?"},
    {"from": "gpt", "value": "Photosynthesis is the process by which plants..."},
    {"from": "human", "value": "What are the reactants?"},
    {"from": "gpt", "value": "The main reactants are carbon dioxide, water, and light."}
  ]
}
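
If a dataset is stored in this layout but the `chat_template` type is preferred, the records can be converted to a `messages` list with role/content keys. A minimal sketch, assuming the usual `human`/`gpt`/`system` speaker names; `ROLE_MAP` and `sharegpt_to_messages` are hypothetical helpers, not part of Axolotl.

```python
# Assumed mapping from sharegpt speaker names to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(record: dict) -> dict:
    """Convert one sharegpt record into a record with a messages list."""
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
    return {"messages": messages}

record = {
    "conversations": [
        {"from": "human", "value": "What is photosynthesis?"},
        {"from": "gpt", "value": "Photosynthesis is the process by which plants..."},
    ]
}
print(sharegpt_to_messages(record))
```

An unknown speaker name raises a KeyError, which is usually preferable to silently mislabeling a turn.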

QLoRA Config (Memory-Efficient)

base_model: meta-llama/Meta-Llama-3.1-70B-Instruct

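# 4-bit quantization (Axolotl defaults to NF4 with double quantization;
# bnb_config_kwargs overrides the BitsAndBytesConfig arguments if needed,
# and the compute dtype follows bf16: true)
load_in_4bit: true
bnb_config_kwargs:
  bnb_4bit_use_double_quant: true
  bnb_4bit_quant_type: nf4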

adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true   # apply LoRA to all linear layers

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 16
gradient_checkpointing: true

optimizer: paged_adamw_32bit
learning_rate: 0.0001
lr_scheduler: linear

bf16: true
flash_attention: true
xformers_attention: false  # use flash_attention instead

output_dir: ./outputs/llama70b-qlora
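
A back-of-the-envelope estimate shows why 4-bit loading matters at this scale. The sketch below counts weight storage only. It ignores quantization constants, activations, LoRA parameters, and optimizer state, and it treats the model as a nominal 70 billion parameters, so the figures are rough lower bounds rather than measurements.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Storage for the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

NOMINAL_PARAMS = 70e9  # assumption: nominal size of a 70B model

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)
nf4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)
print(f"bf16 weights: {bf16_gb:.0f} GB, 4-bit weights: {nf4_gb:.0f} GB")
```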

DeepSpeed Config (ZeRO Stage 3)

configs/deepspeed/zero3_bf16.json:
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "gather_16bit_weights_on_model_save": true
  },
  "bf16": {"enabled": true},
  "optimizer": {
    "type": "AdamW",
    "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"}
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {"warmup_min_lr": 0, "warmup_max_lr": "auto", "warmup_num_steps": "auto"}
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

# Launch multi-GPU training with DeepSpeed (--module runs the target like python -m)
deepspeed --num_gpus=4 --module axolotl.cli.train config.yaml

# or with accelerate
accelerate launch --config_file accelerate_config.yaml \
  -m axolotl.cli.train config.yaml

Evaluation Config

# Add to your training config
eval_steps: 100
eval_strategy: steps        # named evaluation_strategy in older releases
eval_batch_size: 4
do_causal_lm_eval: true     # also compute the generation metrics listed in eval_causal_lm_metrics

# Or evaluation-only run
# axolotl evaluate config.yaml --checkpoint ./outputs/checkpoint-500

Merging LoRA Adapter

# After training, merge adapter into base model
axolotl merge-lora config.yaml --lora_model_dir ./outputs/llama3-lora

# Or programmatically (Python)
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the same base model the adapter was trained on (base_model in the config)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./outputs/llama3-lora")
merged = model.merge_and_unload()  # fold the LoRA weights into the base weights
merged.save_pretrained("./merged-model")
# Save the tokenizer next to the merged weights
AutoTokenizer.from_pretrained("./outputs/llama3-lora").save_pretrained("./merged-model")
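
Conceptually, merging a standard LoRA adapter adds the scaled low-rank product to each target weight: W_merged = W + (lora_alpha / r) * B A. The sketch below works through that arithmetic on a tiny matrix in pure Python. It illustrates the formula only (no dropout, rsLoRA, or quantized weights), and `matmul` and `merge_lora` are hypothetical helpers, not PEFT functions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 base weight
A = [[1.0, 2.0]]              # r = 1, in_features = 2
B = [[0.5], [1.0]]            # out_features = 2, r = 1
merged_weight = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged_weight)
```

With the first config in this sheet (lora_r 32, lora_alpha 16) the scale factor is 0.5; with the QLoRA config (lora_r 16, lora_alpha 32) it is 2.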

Common Workflows

Workflow 1: Quick LoRA Fine-Tune on Public Dataset

# 1. Download starter config
curl -L -o lora-8b.yaml https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/llama-3/lora-8b.yml

# 2. Edit config — change base_model, datasets, output_dir

# 3. Preprocess (optional, caches tokenized data)
axolotl preprocess lora-8b.yaml

# 4. Train
axolotl train lora-8b.yaml

# 5. Merge adapter
axolotl merge-lora lora-8b.yaml

# 6. Test inference
axolotl inference lora-8b.yaml --gradio

Workflow 2: Resume Interrupted Training

# Training saves a checkpoint every save_steps.
# To resume from the latest checkpoint in output_dir, set
# auto_resume_from_checkpoints: true in config.yaml, then rerun:
axolotl train config.yaml

# Or specify exact checkpoint:
axolotl train config.yaml --resume-from-checkpoint ./outputs/checkpoint-500
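
When choosing a checkpoint by hand, note that the directories are named checkpoint-<step>, so a plain alphabetical sort puts checkpoint-90 after checkpoint-500. The sketch below picks the highest step numerically. `latest_checkpoint` is a hypothetical helper and `./outputs` is only an example path.

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by step number, not by directory name
    return max(candidates, key=lambda item: item[0])[1]

print(latest_checkpoint("./outputs"))
```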

Workflow 3: Custom Chat Template Dataset

datasets:
  - path: ./data/my_chats.jsonl
    type: chat_template
    chat_template: llama3
    message_field_role: role     # JSON field for role
    message_field_content: content  # JSON field for content
    roles_to_train: [assistant]  # compute loss only on assistant turns
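
A matching data file holds one JSON object per line, each with a `messages` list of role/content entries. The sketch below builds such lines from in-memory conversations. The `to_jsonl` helper and the example chats are made up for illustration.

```python
import json

def to_jsonl(conversations):
    """Serialize each conversation (a list of role/content dicts) as one JSON line."""
    return "".join(
        json.dumps({"messages": messages}, ensure_ascii=False) + "\n"
        for messages in conversations
    )

chats = [
    [
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "A low-rank adapter method for fine-tuning."},
    ],
    [
        {"role": "system", "content": "Answer briefly."},
        {"role": "user", "content": "What does QLoRA add?"},
        {"role": "assistant", "content": "4-bit quantization of the base weights."},
    ],
]
jsonl_text = to_jsonl(chats)
print(jsonl_text, end="")
# To use it with the config above, write jsonl_text to ./data/my_chats.jsonl.
```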

Tips and Best Practices

Start with an example config from the examples/ directory in the Axolotl repo — they’re tested and model-specific.
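sample_packing: true concatenates short samples into full-length sequences, which cuts padding and can raise throughput substantially on instruction datasets with short examples; it helps little when most samples already approach sequence_len.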
train_on_inputs: false is critical for instruction tuning — you only want loss on the assistant’s responses.
lora_target_linear: true is a shortcut that applies LoRA to all linear layers, so you do not need to list lora_target_modules by hand.
Preprocess first with axolotl preprocess config.yaml to catch dataset issues before committing GPU time to training.
dataset_prepared_path caches tokenized data; delete this directory when changing dataset config or sequence length.
Gradient checkpointing is essential for large models — set gradient_checkpointing: true to trade compute for memory.
Use wandb_project or mlflow_experiment_name to track metrics; raw log files are hard to interpret.
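bf16: true is preferred over fp16 on GPUs that support it (Ampere and newer, such as A100 and RTX 30xx/40xx) because its wider exponent range avoids most overflow problems.
Check loss curves: training loss is noisy but should trend downward; sharp spikes often mean the learning rate is too high or the data has formatting issues.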
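
To see why sample packing helps with short examples, the sketch below packs sample lengths into rows of at most sequence_len tokens using a simple first-fit-decreasing heuristic, then compares token utilization with and without packing. This is an illustration only; Axolotl's own packing algorithm may differ, and the sample lengths are invented.

```python
def pack_first_fit(lengths, sequence_len):
    """Greedily place each sample (longest first) into the first row with room."""
    rows = []  # each row is a list of sample lengths whose sum <= sequence_len
    for length in sorted(lengths, reverse=True):
        if length > sequence_len:
            raise ValueError(f"sample of {length} tokens exceeds sequence_len")
        for row in rows:
            if sum(row) + length <= sequence_len:
                row.append(length)
                break
        else:
            rows.append([length])
    return rows

def utilization(lengths, num_rows, sequence_len):
    """Fraction of tokens in the padded batch that are real data."""
    return sum(lengths) / (num_rows * sequence_len)

lengths = [3000, 1200, 900, 800, 500, 300, 200, 100]  # hypothetical token counts
packed = pack_first_fit(lengths, sequence_len=4096)
print(len(lengths), "rows unpacked:", round(utilization(lengths, len(lengths), 4096), 3))
print(len(packed), "rows packed:", round(utilization(lengths, len(packed), 4096), 3))
```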