LoRA & QLoRA Cheat Sheet

Overview

LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each target layer. Instead of updating all d² entries of a d×d weight matrix W, LoRA trains two small matrices, B (d×r) and A (r×d), where the rank r << d. The update is ΔW = BA, which reduces the trainable parameter count per layer from d² to 2dr.
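To make the d² versus 2dr comparison concrete, here is a small sketch that counts parameters for one square layer. The helper name `lora_param_counts` and the choice d = 4096 are illustrative assumptions, not part of PEFT.

```python
def lora_param_counts(d: int, r: int) -> dict:
    """Compare full vs. LoRA trainable parameters for one square d x d layer."""
    full = d * d          # every entry of W
    lora = 2 * d * r      # one d x r factor plus one r x d factor
    return {"full": full, "lora": lora, "ratio": full / lora}

for r in (4, 16, 64):
    c = lora_param_counts(4096, r)
    print(f"r={r:2d}: full={c['full']:,} lora={c['lora']:,} ({c['ratio']:.0f}x fewer)")
```

Real layers are often rectangular (for example MLP projections), in which case the adapter size is r × (d_in + d_out) rather than 2dr.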

QLoRA combines LoRA with 4-bit NormalFloat (NF4) quantization of the base model, double quantization of the quantization constants, and paged optimizers, which together make it possible to fine-tune a 65B-parameter model on a single 48 GB GPU.
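The saving from double quantization can be worked out with simple arithmetic. The sketch below assumes the block sizes reported for QLoRA (64 weights per block, 256 first-level constants per second-level block, 8-bit quantized constants); the function name is hypothetical and the numbers are an estimate of storage overhead only.

```python
def quant_constant_overhead_bits(block_size=64, double_quant=False,
                                 second_block_size=256):
    """Extra bits per parameter spent on storing quantization constants."""
    if not double_quant:
        return 32 / block_size            # one fp32 constant per block of weights
    # 8-bit constant per block, plus one fp32 constant per group of blocks
    return 8 / block_size + 32 / (block_size * second_block_size)

plain = quant_constant_overhead_bits()
double = quant_constant_overhead_bits(double_quant=True)
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"saving:                      {plain - double:.3f} bits/param")
```

Under these assumptions the saving is a little under 0.4 bits per parameter, which on a 65B model is roughly 3 GB.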

Key concepts: rank (r) controls adapter capacity; alpha (α) scales the update, which is multiplied by α/r, so raising α at a fixed rank acts much like raising the adapter's learning rate; target modules determine which layers get adapters; and PEFT is the Hugging Face library that implements both techniques.
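How rank and alpha enter the forward pass can be sketched in plain Python. This is a toy illustration, not PEFT's implementation: the helper names and tiny dimensions are assumptions, and B starts at zero as in the original LoRA setup so that the adapted layer initially matches the base layer.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))       # low-rank path: d -> r -> d
    scale = alpha / r
    return [b + scale * dl for b, dl in zip(base, delta)]

random.seed(0)
d, r, alpha = 6, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # r x d
B = [[0.0] * r for _ in range(d)]                                 # d x r, zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapter contributes nothing, so the output equals the base layer.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```

Once training moves B away from zero, the α/r factor sets how strongly the low-rank path contributes relative to the frozen weights.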

Installation

# Core PEFT library (LoRA)
pip install peft

# QLoRA requires bitsandbytes for quantization
pip install peft bitsandbytes transformers accelerate

# For training with TRL
pip install peft bitsandbytes transformers accelerate trl datasets

# Verify that bitsandbytes imports (run `python -m bitsandbytes` for a fuller CUDA diagnostic)
python -c "import bitsandbytes as bnb; print(bnb.__version__)"

# Optional: flash attention for faster training
pip install flash-attn --no-build-isolation

Configuration

LoRA Config Reference

| Parameter | Typical Values | Description |
|---|---|---|
| r (rank) | 4, 8, 16, 32, 64 | Adapter rank; higher = more params, more capacity |
| lora_alpha | r or 2×r | Scaling factor; the adapter update is multiplied by alpha/r |
| lora_dropout | 0.0–0.1 | Dropout on adapter layers (0 = fastest) |
| target_modules | see below | Which linear layers to adapt |
| bias | "none" | Which bias params to train ("none", "all", or "lora_only") |
| task_type | CAUSAL_LM | Task type, so PEFT wraps the model correctly |
| modules_to_save | ["embed_tokens"] | Extra modules trained in full (no low-rank factorization) and saved with the adapter |

Common Target Modules by Architecture

| Architecture | Common Target Modules |
|---|---|
| LLaMA / Mistral / Qwen | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| GPT-2 | c_attn, c_proj, c_fc |
| Falcon | query_key_value, dense, dense_h_to_4h, dense_4h_to_h |
| Phi-3 | qkv_proj, o_proj, gate_up_proj, down_proj |
| Gemma | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
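For an architecture not listed here, the candidate target names can be read off the model's own module names. The sketch below is a hypothetical helper that works on plain strings; the example names are made up to resemble what `model.named_modules()` reports for a LLaMA-style model.

```python
def leaf_names(module_names):
    """Return the sorted unique last components of dotted module names."""
    return sorted({name.rsplit(".", 1)[-1] for name in module_names if name})

# Hypothetical names, as named_modules() might report them for linear layers.
example = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]
print(leaf_names(example))  # ['down_proj', 'lm_head', 'q_proj', 'v_proj']
```

With a loaded model, the input list would typically come from `[n for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]`, and the output head (here `lm_head`) is usually left out of `target_modules`. Recent PEFT releases also accept `target_modules="all-linear"` as a shortcut.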

Core API Reference

| API | Description |
|---|---|
| LoraConfig(r, lora_alpha, target_modules, ...) | Define LoRA configuration |
| get_peft_model(model, lora_config) | Wrap model with LoRA adapters |
| model.print_trainable_parameters() | Show trainable vs total param counts |
| PeftModel.from_pretrained(base, adapter_path) | Load saved adapter onto base model |
| model.merge_and_unload() | Merge adapter into base, return plain model |
| model.save_pretrained(path) | Save adapter weights only |
| model.load_adapter(path, adapter_name=name) | Load an additional named adapter |
| model.set_adapter(name) | Switch active adapter |
| model.disable_adapter() | Context manager that temporarily disables adapters (use base model) |
| model.add_adapter(name, config) | Add a new, untrained named adapter |
| BitsAndBytesConfig(...) | Configure 4-bit / 8-bit quantization |

Advanced Usage

Standard LoRA Fine-Tuning

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model in float16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Define LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,              # alpha = 2×r is a common heuristic
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
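    # Trains full (not low-rank) copies of these modules; this adds roughly 1B trainable
    # params on this model, so drop the line unless you added new tokens.
    modules_to_save=["embed_tokens", "lm_head"],
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. The r=16 adapters alone come to about
# 42M params (~0.5% of the model); modules_to_save adds the full embedding and
# output matrices on top of that.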

# The model now has adapter layers; train as usual

QLoRA (4-bit + LoRA)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,      # double quantization saves ~0.4 bits/param extra
    bnb_4bit_quant_type="nf4",           # NormalFloat4 — better than fp4 for weight distributions
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Prepare model for training (casts LayerNorm to float32, etc.)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.config.use_cache = False    # disable KV cache during training
model.print_trainable_parameters()

Training with TRL SFTTrainer

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_alpaca(example):
    if example["input"]:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": prompt}

dataset = dataset.map(format_alpaca)

sft_config = SFTConfig(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",          # paged optimizer: states spill to CPU RAM during GPU memory spikes
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    dataset_text_field="text",
    max_seq_length=2048,                # renamed to max_length in newer TRL releases
    packing=True,
)

trainer = SFTTrainer(
    model=model,                        # already wrapped by get_peft_model, so no peft_config here
    train_dataset=dataset,
    processing_class=tokenizer,
    args=sft_config,
)
trainer.train()
trainer.save_model("./qlora-output/final")

Loading and Using Saved Adapters

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "./qlora-output/final")

# Inference
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
inputs = tokenizer("### Instruction:\nExplain transformers.\n\n### Response:\n",
                   return_tensors="pt").to(model.device)  # follow wherever device_map placed the model
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merging Adapters into Base Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model unquantized (fp16 or bf16) for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="cpu",       # merge on CPU to avoid VRAM limits
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-output/final")

# Merge and unload: returns a plain transformers model
merged_model = model.merge_and_unload()

# Save merged model along with the tokenizer so the output directory is self-contained
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
merged_model.save_pretrained("./merged-llama-8b")
tokenizer.save_pretrained("./merged-llama-8b")

print("Merged model saved. Ready for vLLM, Ollama, or deployment.")
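What merging does to each adapted weight can be shown on a toy matrix: the scaled product (α/r)·BA is folded into W, so the merged layer gives the same output as the base layer plus adapter. This is a plain-Python sketch with made-up values and hypothetical helper names, not the PEFT code path.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A as a new matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [3.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]      # d x r with d=3, r=1
A = [[0.5, -1.0, 2.0]]         # r x d
x = [1.0, 2.0, 3.0]
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)
adapter_out = [h + (alpha / r) * dl
               for h, dl in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# The merged weight reproduces the adapter's output without the extra matmuls.
assert all(abs(a - b) < 1e-9 for a, b in zip(matvec(merged, x), adapter_out))
print(matvec(merged, x))  # [14.0, -1.0, 24.0]
```

Because the merged matrix has the same shape as W, the result loads like any ordinary checkpoint and adds no inference latency.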

Multi-Adapter Inference

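from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model once ("base-model-id" is a placeholder; code_inputs, medical_inputs,
# and inputs below are tokenizer outputs prepared elsewhere)
base = AutoModelForCausalLM.from_pretrained("base-model-id", device_map="auto")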

# Load multiple adapters
model = PeftModel.from_pretrained(base, "./adapter-coding", adapter_name="coding")
model.load_adapter("./adapter-medical", adapter_name="medical")
model.load_adapter("./adapter-legal", adapter_name="legal")

# Switch between adapters at runtime
model.set_adapter("coding")
output_code = model.generate(**code_inputs)

model.set_adapter("medical")
output_med = model.generate(**medical_inputs)

# Disable adapters to use raw base model
with model.disable_adapter():
    output_base = model.generate(**inputs)

Common Workflows

Workflow 1: Rank Selection Strategy

# Compare adapter sizes across ranks before committing to a training run
# (base_model is an already-loaded transformers model)
from peft import LoraConfig, get_peft_model

results = {}
for rank in [4, 8, 16, 32, 64]:
    config = LoraConfig(r=rank, lora_alpha=rank*2,
                        target_modules=["q_proj","v_proj"],
                        task_type="CAUSAL_LM")
    m = get_peft_model(base_model, config)
    trainable, total = m.get_nb_trainable_parameters()  # returns (trainable, all)
    results[rank] = {"trainable_M": trainable/1e6, "pct": 100*trainable/total}
    # get_peft_model injects LoRA layers in place; strip them before trying the next rank
    base_model = m.unload()

for rank, stats in results.items():
    print(f"r={rank:2d}: {stats['trainable_M']:.1f}M params ({stats['pct']:.2f}%)")

Workflow 2: VRAM Estimation Before Training

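# Rough VRAM budgets for a 7B model; treat them as ballpark figures, not measurements
# Full fp16:   ~14 GB model + ~56 GB optimizer = ~70 GB total
# LoRA fp16:   ~14 GB model + ~1 GB adapters + ~4 GB optimizer = ~19 GB
# QLoRA nf4:   ~3.5 GB model + ~1 GB adapters + ~4 GB (paged) = ~8 GB

def estimate_qlora_vram(params_B, r=16, hidden=4096, num_layers=32, targets_per_layer=7):
    """Lower bound in GB: 4-bit weights + fp16 adapters + Adam states.

    Gradients, activations, and CUDA overhead are not counted, so the result
    comes out below the budgets above.
    """
    model_gb = params_B * 0.5        # 4-bit = 0.5 bytes/param
    # Each adapted matrix adds about r * (d_in + d_out) params; approximated as 2 * r * hidden
    adapter_M = 2 * r * hidden * targets_per_layer * num_layers / 1e6
    adapter_gb = adapter_M * 2 / 1e3  # fp16 adapters: 2 bytes/param
    optimizer_gb = adapter_gb * 4     # Adam: two fp32 states = 8 bytes/param
    return model_gb + adapter_gb + optimizer_gb

# hidden size and layer count follow the LLaMA-2 family
print(f"7B model QLoRA estimate: {estimate_qlora_vram(7):.1f} GB")
print(f"13B model QLoRA estimate: {estimate_qlora_vram(13, hidden=5120, num_layers=40):.1f} GB")
print(f"70B model QLoRA estimate: {estimate_qlora_vram(70, hidden=8192, num_layers=80):.1f} GB")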

Tips and Best Practices

  • Target all projection layers (q/k/v/o/gate/up/down_proj) for best results — skipping layers limits adaptation capacity.
  • Alpha = 2×rank (lora_alpha = 2*r) is a common starting heuristic; some practitioners use alpha = r for more conservative updates.
  • Rank 16 is a safe default for instruction tuning; use rank 64+ for domain adaptation or hard reasoning tasks.
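  • QLoRA stores the base weights in 4 bits, roughly a 4× reduction in weight memory compared with fp16 LoRA; prefer it whenever fitting the base model is a concern.
  • Call prepare_model_for_kbit_training() before adding LoRA to a quantized model: it freezes the base weights, upcasts the remaining half-precision parameters (such as LayerNorm) to float32, and enables gradient checkpointing. Disable the KV cache separately with model.config.use_cache = False.
  • Paged optimizers (paged_adamw_32bit) page optimizer states out to CPU RAM when GPU memory spikes, which helps avoid out-of-memory errors on long sequences or large batches.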
  • Do not merge adapters in 4-bit — load the base model in fp16 or bfloat16 for merge_and_unload() to get a clean model.
  • Multiple adapters with PeftModel.load_adapter let you serve different fine-tunes from one base model — critical for multi-tenant deployments.
  • Saving only the adapter (a few hundred MB) is vastly more practical than saving the full model; the base model is downloaded separately.
  • RSLoRA (rank-stabilized LoRA, use_rslora=True in PEFT) improves stability at high ranks (r > 32) by scaling the update by α/√r instead of α/r.
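The alpha and RSLoRA tips above come down to one multiplier on the adapter update. A minimal sketch, assuming a hypothetical helper `lora_scale` that mirrors the two formulas (α/r for standard LoRA, α/√r for rank-stabilized LoRA):

```python
import math

def lora_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to B @ A: alpha / r, or alpha / sqrt(r) with rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

alpha = 32
for r in (8, 16, 64, 256):
    print(f"r={r:3d}: standard={lora_scale(alpha, r):.3f} "
          f"rslora={lora_scale(alpha, r, use_rslora=True):.3f}")
```

With alpha held fixed, the standard multiplier shrinks quickly as the rank grows, while the rank-stabilized one shrinks more slowly. This is also why alpha is usually raised together with r in the standard setup.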