Axolotl Cheat Sheet
Overview
Axolotl is a configuration-driven fine-tuning framework that wraps HuggingFace Transformers, PEFT, and TRL into a clean YAML interface. Rather than writing training scripts from scratch, you describe your entire training run in a single config.yaml and run axolotl train config.yaml.
Key features: supports full fine-tuning, LoRA, QLoRA, ReLoRA; many built-in dataset formats (alpaca, sharegpt, completion, chat_template); multi-GPU via FSDP or DeepSpeed; integrated evaluation, sample packing, flash attention, and model merging.
Installation
# pip install (CPU/GPU with CUDA 12.1)
pip install axolotl
# With flash attention (recommended for speed)
pip install "axolotl[flash-attn]"
# From source (latest features)
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git
cd axolotl
pip install -e ".[flash-attn,deepspeed]"
# Docker (recommended for reproducibility)
docker pull winglian/axolotl:main-latest
docker run --gpus all -it -v $(pwd):/workspace winglian/axolotl:main-latest
# Verify
axolotl --help
python -c "import axolotl; print(axolotl.__version__)"
Configuration
Axolotl is configured entirely through YAML. Below are the critical sections.
Minimal LoRA Config
# llama3-lora.yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false # set to true together with adapter: qlora for QLoRA
strict: false
# Dataset
datasets:
  - path: mhenrichsen/alpaca_data_cleaned
    type: alpaca # built-in format handler
dataset_prepared_path: ./last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/llama3-lora
# Sequence
sequence_len: 4096
sample_packing: true # pack short samples together for efficiency
pad_to_sequence_len: true
# LoRA
adapter: lora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
# Training
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false # only train on assistant turns
# Mixed precision
bf16: true
fp16: false
tf32: false
# Flash attention
flash_attention: true
# Logging
logging_steps: 10
eval_steps: 100
save_steps: 100
wandb_project: my-fine-tune # optional W&B integration
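The batch-related keys in this config interact: each optimizer step sees roughly micro_batch_size × gradient_accumulation_steps × number of GPUs sequences. The sketch below works through that arithmetic for the values above. The function names and the single-GPU setting are illustrative assumptions, not part of Axolotl, and the token count is an upper bound that holds only when packing fills every sequence to sequence_len.

```python
def effective_batch_size(micro_batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return micro_batch_size * grad_accum_steps * num_gpus


def max_tokens_per_step(batch_size: int, sequence_len: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence fully packed)."""
    return batch_size * sequence_len


# Values taken from the config above; num_gpus=1 is an assumption.
batch = effective_batch_size(micro_batch_size=2, grad_accum_steps=4, num_gpus=1)
tokens = max_tokens_per_step(batch, sequence_len=4096)
print(batch)   # 8
print(tokens)  # 32768
```

When moving to more GPUs, lowering gradient_accumulation_steps by the same factor keeps the effective batch size, and therefore the meaning of learning_rate, roughly unchanged.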
Core CLI Commands
| Command | Description |
|---|---|
| axolotl train config.yaml | Start training run |
| axolotl train config.yaml --continue-from-checkpoint | Resume from latest checkpoint |
| axolotl train config.yaml --debug | Enable debug logging |
| axolotl evaluate config.yaml | Run evaluation only |
| axolotl merge-lora config.yaml | Merge LoRA adapter into base model |
| axolotl preprocess config.yaml | Preprocess and cache dataset only |
| axolotl inference config.yaml --gradio | Launch Gradio inference UI |
| accelerate launch -m axolotl.cli.train config.yaml | Multi-GPU launch via accelerate |
| deepspeed axolotl/cli/train.py config.yaml | DeepSpeed training |
Dataset Formats
| Type | Description | Required Fields |
|---|---|---|
| alpaca | Instruction/input/output triplets | instruction, output (+ optional input) |
| sharegpt | Multi-turn conversations | conversations list with from/value |
| chat_template | Apply tokenizer’s chat template | messages list with role/content |
| completion | Raw text completion | text |
| input_output | Simple pairs | input, output |
| context_qa | Context + question + answer | context, question, answer |
| gpteacher | GPTeacher format | instruction, input, response |
| explainchoice | MCQ with explanation | question, choices, explanation |
| json | Custom JSON with field mapping | configurable via field_* params |
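As a concrete instance of the first row, the sketch below builds one alpaca-style JSONL line and rejects records that lack a required field. The helper name to_alpaca_line and the sample record are hypothetical. The field names come from the table, and writing an empty string for a missing input is an assumption about what is convenient, not a documented Axolotl requirement.

```python
import json

# Required fields per the alpaca row of the table; "input" is optional.
REQUIRED_FIELDS = ("instruction", "output")


def to_alpaca_line(record: dict) -> str:
    """Validate a record and serialize it as one JSONL line."""
    missing = [key for key in REQUIRED_FIELDS if not record.get(key)]
    if missing:
        raise ValueError(f"missing required field(s): {missing}")
    row = {
        "instruction": record["instruction"],
        "input": record.get("input", ""),  # assumption: empty string when absent
        "output": record["output"],
    }
    return json.dumps(row, ensure_ascii=False)


line = to_alpaca_line(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
print(line)
```

Running a check like this over a local file before axolotl preprocess surfaces malformed rows without loading a tokenizer.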
Advanced Usage
Full Fine-Tuning Config
base_model: mistralai/Mistral-7B-v0.3
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
# No adapter = full fine-tune
adapter:
load_in_4bit: false
load_in_8bit: false
datasets:
  - path: your-org/your-dataset
    type: sharegpt
    conversation: chatml
sequence_len: 8192
sample_packing: true
# Training hyperparameters
num_epochs: 1
micro_batch_size: 1
gradient_accumulation_steps: 8
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 0.00005
weight_decay: 0.01
warmup_ratio: 0.03
# Full fine-tune needs DeepSpeed or FSDP
# See deepspeed config below
deepspeed: configs/deepspeed/zero3_bf16.json
bf16: true
flash_attention: true
output_dir: ./outputs/mistral-full-finetune
Multi-Dataset Training
datasets:
  - path: mhenrichsen/alpaca_data_cleaned
    type: alpaca
    ds_type: json
    split: train
  - path: teknium/OpenHermes-2.5
    type: sharegpt
    conversation: chatml
    split: train[:10000] # use first 10k examples
  - path: ./local_data/custom.jsonl
    type: completion
    ds_type: json
    data_files:
      - custom.jsonl
# No mixing ratio is set here: by default each dataset contributes in proportion to its size
dataset_exact_dedup: true # deduplicate across datasets
ShareGPT Dataset Format
// sharegpt format (multi-turn conversation)
{
  "conversations": [
    {"from": "human", "value": "What is photosynthesis?"},
    {"from": "gpt", "value": "Photosynthesis is the process by which plants..."},
    {"from": "human", "value": "What are the reactants?"},
    {"from": "gpt", "value": "The main reactants are carbon dioxide, water, and light."}
  ]
}
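If the same data is needed in the messages layout that the chat_template type expects (role/content pairs), the conversion is mechanical. The sketch below is one way to do it. The function name and the ROLE_MAP table are assumptions based on the common human/gpt/system convention, so adjust the mapping if a dataset uses other speaker labels.

```python
# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert {"conversations": [{"from", "value"}]} to {"messages": [{"role", "content"}]}."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            raise ValueError(f"unknown speaker label: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}


example = {
    "conversations": [
        {"from": "human", "value": "What is photosynthesis?"},
        {"from": "gpt", "value": "Photosynthesis is the process by which plants..."},
    ]
}
converted = sharegpt_to_messages(example)
print(converted["messages"][0])  # {'role': 'user', 'content': 'What is photosynthesis?'}
```

Raising on unknown labels, rather than passing them through, makes label typos visible before training starts.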
QLoRA Config (Memory-Efficient)
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
# 4-bit quantization
load_in_4bit: true
bnb_4bit_use_double_quant: true
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: bfloat16
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true # apply LoRA to all linear layers
sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 16
gradient_checkpointing: true
optimizer: paged_adamw_32bit
learning_rate: 0.0001
lr_scheduler: linear
bf16: true
flash_attention: true
xformers_attention: false # use flash_attention instead
output_dir: ./outputs/llama70b-qlora
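To see why 4-bit loading matters at this scale, a back-of-the-envelope estimate of weight memory helps. The sketch below counts only the base model weights and assumes a nominal 70 billion parameters, taken from the model name. It ignores quantization constants, LoRA adapter weights, optimizer state, and activations, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: nominal parameter count implied by "70B"

print(weight_memory_gb(NUM_PARAMS, 16))  # 140.0  (bf16 weights)
print(weight_memory_gb(NUM_PARAMS, 4))   # 35.0   (4-bit weights)
```

The fourfold reduction in weight memory is what brings a 70B model within reach of far fewer GPUs. gradient_checkpointing and a small micro_batch_size then keep activation memory in check.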
DeepSpeed Config (ZeRO Stage 3)
// configs/deepspeed/zero3_bf16.json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "gather_16bit_weights_on_model_save": true
  },
  "bf16": {"enabled": true},
  "optimizer": {
    "type": "AdamW",
    "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"}
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {"warmup_min_lr": 0, "warmup_max_lr": "auto", "warmup_num_steps": "auto"}
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
# Launch multi-GPU training with DeepSpeed
deepspeed --num_gpus=4 -m axolotl.cli.train config.yaml
# or with accelerate
accelerate launch --config_file accelerate_config.yaml \
-m axolotl.cli.train config.yaml
Evaluation Config
# Add to your training config
eval_steps: 100
evaluation_strategy: steps
eval_batch_size: 4
eval_causal_lm_loss: true
# Or evaluation-only run
# axolotl evaluate config.yaml --checkpoint ./outputs/checkpoint-500
Merging LoRA Adapter
# After training, merge adapter into base model
axolotl merge-lora config.yaml --lora_model_dir ./outputs/llama3-lora
# Or programmatically
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# The base model must be the same checkpoint used as base_model during training
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./outputs/llama3-lora")
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained("./outputs/llama3-lora").save_pretrained("./merged-model")
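Conceptually, merging folds the low-rank update into each adapted weight matrix: W_merged = W + (lora_alpha / lora_r) · B · A. The toy example below applies that formula to a 2×2 weight in plain Python. It illustrates the standard LoRA arithmetic with the default alpha/r scaling and is not the code that merge_and_unload() actually runs. The matrices and helper names are made up for the demonstration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out=2, in=2)
A = [[1.0, 2.0]]              # LoRA A: r x in, with r=1
B = [[0.5], [1.0]]            # LoRA B: out x r

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```

Because the update is added to the weights themselves, the merged model has the same shape and inference cost as the base model and no longer needs PEFT at load time.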
Common Workflows
Workflow 1: Quick LoRA Fine-Tune on Public Dataset
# 1. Download starter config
curl -O https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/llama-3/lora-8b.yaml
# 2. Edit config — change base_model, datasets, output_dir
# 3. Preprocess (optional, caches tokenized data)
axolotl preprocess lora-8b.yaml
# 4. Train
axolotl train lora-8b.yaml
# 5. Merge adapter
axolotl merge-lora lora-8b.yaml
# 6. Test inference
axolotl inference lora-8b.yaml --gradio
Workflow 2: Resume Interrupted Training
# Training saves checkpoints every save_steps
# Resume from latest checkpoint:
axolotl train config.yaml --continue-from-checkpoint
# Or specify exact checkpoint:
axolotl train config.yaml --resume-from-checkpoint ./outputs/checkpoint-500
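When picking a checkpoint by hand, note that directory names sort badly as strings: checkpoint-90 comes after checkpoint-500 alphabetically. The sketch below finds the highest-numbered checkpoint-N directory under an output folder. The helper name is hypothetical, and the checkpoint-<step> naming is assumed from the path shown above.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir) -> Optional[Path]:
    """Return the checkpoint-N directory with the largest N, or None if there is none."""
    candidates = []
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare by step number, not by name, so 500 beats 90.
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory that mimics an output_dir.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-90", "checkpoint-100", "checkpoint-500", "runs"):
        (Path(tmp) / name).mkdir()
    print(latest_checkpoint(tmp).name)  # checkpoint-500
```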
Workflow 3: Custom Chat Template Dataset
datasets:
  - path: ./data/my_chats.jsonl
    type: chat_template
    chat_template: llama3
    message_field_role: role # JSON field for role
    message_field_content: content # JSON field for content
    roles:
      input: [user, system]
      output: [assistant]
Tips and Best Practices
- Start with an example config from the examples/ directory in the Axolotl repo; they’re tested and model-specific.
- sample_packing: true can double throughput for instruction datasets with short sequences; disable for very long sequences.
- train_on_inputs: false is critical for instruction tuning: you only want loss on the assistant’s responses.
- lora_target_linear: true is a shortcut to apply LoRA to all linear layers; more thorough than listing modules manually.
- Preprocess first with axolotl preprocess config.yaml to catch dataset issues before committing GPU time to training.
- dataset_prepared_path caches tokenized data; delete this directory when changing dataset config or sequence length.
- Gradient checkpointing is essential for large models: set gradient_checkpointing: true to trade compute for memory.
- Use wandb_project or mlflow_experiment_name to track metrics; raw log files are hard to interpret.
- bf16: true is preferred over fp16 on Ampere+ GPUs (A100, RTX 30xx/40xx series) for stability.
- Check loss curves: loss should steadily decrease; spikes often mean learning rate is too high or data has formatting issues.