LoRA & QLoRA Cheat Sheet
Overview
LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects trainable low-rank matrices into each target layer. Instead of updating the full d×d weight matrix W of a layer, LoRA trains two small matrices, B (d×r) and A (r×d), with rank r << d. The update is ΔW = BA, which cuts the trainable parameter count per layer from d² to 2dr.
QLoRA combines LoRA with 4-bit NormalFloat (NF4) quantization of the frozen base model, double quantization of the quantization constants, and paged optimizers, which makes it possible to fine-tune a 65B-parameter model on a single 48 GB GPU.
Key concepts: rank (r) controls adapter capacity; alpha (α) scales the update by α/r, so it acts like a multiplier on the adapter's effective learning rate; target modules determine which layers get adapters; and PEFT is the HuggingFace library that implements both techniques.
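The low-rank update described above can be sketched in plain Python. This is a toy illustration, not PEFT's implementation: the helper names (`matvec`, `lora_forward`, `param_counts`) and the tiny dimensions are made up for the example, and the zero initialization of B follows the usual LoRA convention.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x) without forming the d x d update."""
    r = len(A)                       # A has r rows
    scale = alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # project down to r dims, then back up to d
    return [b + scale * d for b, d in zip(base, delta)]

def param_counts(d, r):
    """Return (full, lora) trainable-parameter counts for one d x d layer."""
    return d * d, 2 * d * r

random.seed(0)
d, r, alpha = 8, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapter is a no-op, so training starts from the base model's output.
assert lora_forward(W, A, B, x, alpha) == matvec(W, x)

full, lora = param_counts(4096, 16)
print(f"full: {full:,}  lora: {lora:,}")
```

For d = 4096 and r = 16 this is 131,072 trainable parameters per layer instead of 16,777,216.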
Installation
# Core PEFT library (LoRA)
pip install peft
# QLoRA requires bitsandbytes for quantization
pip install peft bitsandbytes transformers accelerate
# For training with TRL
pip install peft bitsandbytes transformers accelerate trl datasets
# Verify that bitsandbytes imports (python -m bitsandbytes runs a fuller CUDA diagnostic)
python -c "import bitsandbytes as bnb; print(bnb.__version__)"
# Optional: flash attention for faster training
pip install flash-attn --no-build-isolation
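After installing, it can help to confirm which of these packages are actually present in the active environment. The sketch below uses only the standard library; the package list mirrors the pip commands above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["peft", "bitsandbytes", "transformers", "accelerate", "trl", "datasets"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:14s} {ver or 'NOT INSTALLED'}")
```

This only reads package metadata, so it does not confirm that bitsandbytes can see a GPU.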
Configuration
LoRA Config Reference
| Parameter | Typical Values | Description |
|---|---|---|
| r (rank) | 4, 8, 16, 32, 64 | Adapter rank; higher = more params, more capacity |
| lora_alpha | = r or 2×r | Scaling factor; the update is multiplied by alpha/r |
| lora_dropout | 0.0–0.1 | Dropout on adapter layers (0 = fastest) |
| target_modules | see below | Which linear layers to adapt |
| bias | "none" | Whether to train bias params ("none", "all", or "lora_only") |
| task_type | CAUSAL_LM | Task type for correct behavior |
| modules_to_save | ["embed_tokens"] | Extra layers trained in full and saved with the adapter |
Common Target Modules by Architecture
| Architecture | Common Target Modules |
|---|---|
| LLaMA / Mistral / Qwen | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| GPT-2 / GPT-Neo | c_attn, c_proj, c_fc |
| Falcon | query_key_value, dense, dense_h_to_4h, dense_4h_to_h |
| Phi-3 | qkv_proj, o_proj, gate_up_proj, down_proj |
| Gemma | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
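When a model is not in the table, the candidate names can be checked against the model's own module names. The sketch below works on a plain list of dotted names; the list shown is a hypothetical excerpt of what `[n for n, _ in model.named_modules()]` returns for a LLaMA-style model, and `leaf_module_names` is an illustrative helper, not a PEFT function.

```python
def leaf_module_names(module_names, candidates):
    """Return the candidate leaf names that actually occur in module_names, sorted."""
    leaves = {name.rsplit(".", 1)[-1] for name in module_names}
    return sorted(leaves & set(candidates))

# Hypothetical excerpt of dotted module names for one decoder layer
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.input_layernorm",
    "lm_head",
]
llama_style = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
gpt2_style = ["c_attn", "c_proj", "c_fc"]

print(leaf_module_names(names, llama_style))  # all seven names are present
print(leaf_module_names(names, gpt2_style))   # empty list: wrong table row for this model
```

An empty result is a quick signal that target_modules would match nothing in the model.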
Core API Reference
| API | Description |
|---|---|
| LoraConfig(r, lora_alpha, target_modules, ...) | Define LoRA configuration |
| get_peft_model(model, lora_config) | Wrap model with LoRA adapters |
| model.print_trainable_parameters() | Show trainable vs total param counts |
| PeftModel.from_pretrained(base, adapter_path) | Load saved adapter onto base model |
| model.merge_and_unload() | Merge adapter into base, return plain model |
| model.save_pretrained(path) | Save adapter weights only |
| model.load_adapter(path, adapter_name) | Load an additional named adapter |
| model.set_adapter(name) | Switch active adapter |
| model.disable_adapter() | Context manager that temporarily disables adapters (use base model) |
| model.add_adapter(name, config) | Add named adapter |
| BitsAndBytesConfig(...) | Configure 4-bit / 8-bit quantization |
Advanced Usage
Standard LoRA Fine-Tuning
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
# Load base model in float16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
# Define LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,  # alpha = 2×r is a common heuristic
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    # modules_to_save trains full copies of these layers and stores them with the adapter.
    # Here that adds roughly 1B trainable params; it is mainly needed when new tokens are added.
    modules_to_save=["embed_tokens", "lm_head"],
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. The LoRA matrices alone come to about
# 42M params for this config; modules_to_save adds embed_tokens and lm_head in full.
# The model now has adapter layers; train as usual
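The count reported by print_trainable_parameters() can be sanity-checked by hand. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, 32 layers, 8 KV heads of dimension 128); `SHAPES` and `lora_param_count` are illustrative names, and the result covers the LoRA matrices only, not anything listed in modules_to_save.

```python
# Assumed (in_features, out_features) of each projection in one Llama 3.1 8B decoder layer.
# k_proj and v_proj are narrower because of grouped-query attention (8 KV heads x 128 dims).
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

def lora_param_count(r, target_modules, shapes=SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r), i.e. r * (in + out) params."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers

all_proj = list(SHAPES)
print(f"r=16, all projections: {lora_param_count(16, all_proj):,}")
for rank in [4, 8, 16, 32, 64]:
    count = lora_param_count(rank, ["q_proj", "v_proj"])
    print(f"r={rank:2d}, q_proj + v_proj: {count:,}")
```

Under these assumptions, r=16 on all seven projections gives 41,943,040 LoRA parameters, and doubling the rank doubles the count.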
QLoRA (4-bit + LoRA)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # double quantization saves ~0.4 bits/param extra
    bnb_4bit_quant_type="nf4",              # NormalFloat4: better than fp4 for weight distributions
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Prepare model for training (casts LayerNorm to float32, etc.)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.config.use_cache = False # disable KV cache during training
model.print_trainable_parameters()
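The "~0.4 bits/param" saving from double quantization can be reproduced with a little arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (one 32-bit constant per 64 weights, re-quantized to 8 bits with one 32-bit second-level constant per 256 first-level constants); `bits_per_param` is an illustrative helper, not a bitsandbytes API.

```python
def bits_per_param(weight_bits=4, block=64, const_bits=32, double_quant=False,
                   dq_const_bits=8, dq_block=256, dq_second_level_bits=32):
    """Storage per weight: the 4-bit code plus its share of the quantization constants."""
    if double_quant:
        # 8-bit first-level constants plus a 32-bit constant per dq_block of them
        overhead = dq_const_bits / block + dq_second_level_bits / (block * dq_block)
    else:
        overhead = const_bits / block
    return weight_bits + overhead

plain = bits_per_param()
dq = bits_per_param(double_quant=True)
print(f"NF4: {plain:.3f} bits/param, NF4 + double quant: {dq:.3f} bits/param")
print(f"saving: {plain - dq:.3f} bits/param")

# Weight memory for a 70B-parameter model under each setting
for label, bits in [("NF4", plain), ("NF4 + DQ", dq)]:
    print(f"{label:9s} {70e9 * bits / 8 / 1e9:.1f} GB")
```

The saving is small per parameter but adds up to a few GB at the 70B scale.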
Training with TRL SFTTrainer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
def format_alpaca(example):
    # Build one training string per example; the "Input" section is optional
    if example["input"]:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": prompt}
dataset = dataset.map(format_alpaca)
sft_config = SFTConfig(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",  # paged optimizer: states can spill to CPU RAM during memory spikes
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    dataset_text_field="text",
    max_seq_length=2048,  # newer TRL releases name this max_length
    packing=True,
)
# `model` is already a PeftModel from the QLoRA section, so peft_config is not passed again.
# Pass peft_config=lora_config instead when handing SFTTrainer an unwrapped base model.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,
    args=sft_config,
)
trainer.train()
trainer.save_model("./qlora-output/final")
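To see what warmup_ratio=0.03 with a cosine scheduler does to the learning rate, the schedule can be sketched by hand. This is a minimal approximation of linear warmup followed by cosine decay, assuming the warmup step count is rounded up; `lr_at_step` and the 1000-step total are hypothetical, and the real step count depends on dataset size, packing, and the effective batch size.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Effective batch size = per-device batch x gradient accumulation steps (single GPU)
effective_batch = 4 * 2
total_steps = 1000  # hypothetical; read the real value from the trainer's logs

print(f"effective batch size: {effective_batch}")
for step in [0, 15, 30, 500, 1000]:
    print(f"step {step:4d}: lr = {lr_at_step(step, total_steps):.2e}")
```

The peak of 2e-4 is reached at the end of warmup and decays to zero by the final step.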
Loading and Using Saved Adapters
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./qlora-output/final")
# Inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
inputs = tokenizer("### Instruction:\nExplain transformers.\n\n### Response:\n",
                   return_tensors="pt").to(model.device)  # follow the model's device instead of hard-coding "cuda"
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Merging Adapters into Base Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base in half precision (not 4-bit) for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="cpu",  # merge on CPU to avoid VRAM limits
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-output/final")
# Merge and unload: returns a plain transformers model
merged_model = model.merge_and_unload()
# Save merged model together with its tokenizer
merged_model.save_pretrained("./merged-llama-8b")
tokenizer.save_pretrained("./merged-llama-8b")
print("Merged model saved. Ready for vLLM, Ollama, or deployment.")
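Conceptually, merging folds the scaled update into the frozen weights, W' = W + (α/r)·BA, so the merged model needs no adapter code at inference time. The toy sketch below checks that equivalence on small random matrices; it illustrates the arithmetic only and is not how PEFT implements merge_and_unload(). All helper names are made up for the example.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B A as a new d x d matrix."""
    scale = alpha / len(A)
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]

random.seed(1)
d, r, alpha = 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen base weights
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]    # r x d, as if trained
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]    # d x r, as if trained
x = [random.gauss(0, 1) for _ in range(d)]

# Adapter path: base output plus the scaled low-rank correction
adapter_out = [b + (alpha / r) * dlt
               for b, dlt in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Merged path: a single matrix-vector product
merged_out = matvec(merge_lora(W, A, B, alpha), x)

max_diff = max(abs(p - q) for p, q in zip(adapter_out, merged_out))
print(f"max abs difference: {max_diff:.2e}")
```

The two paths agree up to floating-point rounding, which is why merging in very low precision (such as 4-bit) degrades the result.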
Multi-Adapter Inference
from transformers import AutoModelForCausalLM
from peft import PeftModel
# Load base model once
base = AutoModelForCausalLM.from_pretrained("base-model-id", device_map="auto")
# Load multiple adapters
model = PeftModel.from_pretrained(base, "./adapter-coding", adapter_name="coding")
model.load_adapter("./adapter-medical", adapter_name="medical")
model.load_adapter("./adapter-legal", adapter_name="legal")
# code_inputs, medical_inputs, and inputs are tokenized batches prepared elsewhere
# Switch between adapters at runtime
model.set_adapter("coding")
output_code = model.generate(**code_inputs)
model.set_adapter("medical")
output_med = model.generate(**medical_inputs)
# Disable adapters to use raw base model
with model.disable_adapter():
    output_base = model.generate(**inputs)
Common Workflows
Workflow 1: Rank Selection Strategy
# Test different ranks to find the sweet spot
# Assumes base_model is a loaded transformers model (see the sections above)
from peft import LoraConfig, get_peft_model
results = {}
for rank in [4, 8, 16, 32, 64]:
    config = LoraConfig(r=rank, lora_alpha=rank * 2,
                        target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
    m = get_peft_model(base_model, config)
    trainable, total = m.get_nb_trainable_parameters()  # returns (trainable, total)
    results[rank] = {"trainable_M": trainable / 1e6, "pct": 100 * trainable / total}
    # get_peft_model modifies the model in place, so strip the adapter before the next rank
    base_model = m.unload()
for rank, stats in results.items():
    print(f"r={rank:2d}: {stats['trainable_M']:.1f}M params ({stats['pct']:.2f}%)")
Workflow 2: VRAM Estimation Before Training
# Rough VRAM estimates for 7B model
# Full fp16: ~14 GB model + ~56 GB optimizer = ~70 GB total
# LoRA fp16: ~14 GB model + ~1 GB adapters + ~4 GB optimizer = ~19 GB
# QLoRA nf4: ~3.5 GB model + ~1 GB adapters + ~4 GB (paged) = ~8 GB
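def estimate_qlora_vram(params_B, r=16, hidden=4096, num_layers=32, targets_per_layer=7):
    # Covers weights, adapters, and optimizer states only; activations and CUDA overhead are extra
    model_gb = params_B * 0.5  # 4-bit = 0.5 bytes/param
    # Rough estimate: treats every target as a hidden×hidden projection, repeated in every layer
    adapter_M = 2 * r * hidden * targets_per_layer * num_layers / 1e6
    adapter_gb = adapter_M * 2 / 1e3  # fp16 adapters
    optimizer_gb = adapter_gb * 4  # Adam: two fp32 states per adapter param
    return model_gb + adapter_gb + optimizer_gb
# hidden / num_layers below are the Llama 2 shapes for each model size
print(f"7B model QLoRA estimate: {estimate_qlora_vram(7):.1f} GB")
print(f"13B model QLoRA estimate: {estimate_qlora_vram(13, hidden=5120, num_layers=40):.1f} GB")
print(f"70B model QLoRA estimate: {estimate_qlora_vram(70, hidden=8192, num_layers=80):.1f} GB")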
Tips and Best Practices
- Target all projection layers (q/k/v/o/gate/up/down_proj) for best results; skipping layers limits adaptation capacity.
- Alpha = 2×rank (lora_alpha = 2*r) is a common starting heuristic; some practitioners use alpha = r for more conservative updates.
- Rank 16 is a safe default for instruction tuning; use rank 64+ for domain adaptation or hard reasoning tasks.
- QLoRA stores the base weights in 4 bits instead of 16, roughly a 4× saving on weight memory over fp16 LoRA; prefer it whenever fitting the base model is a concern.
- Call prepare_model_for_kbit_training() before adding LoRA to a quantized model: it freezes the base weights, casts LayerNorm parameters to float32, and enables gradient checkpointing. Disable the KV cache separately with model.config.use_cache = False.
- Paged optimizers (paged_adamw_32bit) page optimizer states out to CPU RAM when GPU memory spikes, which helps avoid out-of-memory errors.
- Do not merge adapters in 4-bit; load the base model in fp16 or bfloat16 so that merge_and_unload() yields a clean model.
- Multiple adapters loaded with PeftModel.load_adapter let you serve different fine-tunes from one base model, which is critical for multi-tenant deployments.
- Saving only the adapter (a few hundred MB) is vastly more practical than saving the full model; the base model is downloaded separately.
- RSLoRA (rank-stabilized LoRA, use_rslora=True in PEFT) improves stability at high ranks (r > 32) by scaling the update by alpha/√r instead of alpha/r.
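The difference between the standard and rank-stabilized scaling is easy to see numerically. The sketch below compares the two factors for a fixed alpha; `lora_scale` is an illustrative helper whose use_rslora argument simply mirrors the PEFT option name.

```python
import math

def lora_scale(alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update B A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

alpha = 32
for r in [8, 16, 64, 256]:
    standard = lora_scale(alpha, r)
    stabilized = lora_scale(alpha, r, use_rslora=True)
    print(f"r={r:3d}: alpha/r = {standard:6.3f}   alpha/sqrt(r) = {stabilized:6.3f}")
```

With standard scaling the multiplier shrinks linearly as the rank grows, so a high-rank adapter with unchanged alpha receives a much weaker update; the square-root form shrinks more slowly.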