ExLlamaV3 - Fast Quantized LLM Inference on Consumer GPUs Cheatsheet

ExLlamaV3 is a fast inference library for running quantized LLMs on consumer NVIDIA GPUs. Its EXL3 format provides high-quality, variable-bitrate quantization — you can target an average bits-per-weight (e.g. 3.0, 4.25, 6.0) to trade quality against VRAM precisely — letting large models fit on a single 24GB card while keeping strong throughput. It is the successor to ExLlamaV2 and is favored by the local-LLM community for squeezing big models into limited memory.

Installation

Method	Command
pip (prebuilt wheel)	`pip install exllamav3`
From source	`git clone https://github.com/turboderp-org/exllamav3 && cd exllamav3 && pip install -e .`
Requirements	NVIDIA GPU (Ampere+), CUDA, PyTorch
Verify	`python -c "import exllamav3; print('ok')"`

The EXL3 Format

Concept	Meaning
Variable bitrate	Target an average bits-per-weight (bpw), e.g. 2.5–8.0
Per-layer precision	Different layers can use different precision
Quality/size dial	Higher bpw = better quality, more VRAM
Calibration	Uses a calibration dataset during quantization

Quantizing a Model

# Convert an HF model to EXL3 at ~4.0 bits per weight
python -m exllamav3.convert \
  -i meta-llama/Llama-3.1-8B-Instruct \
  -o Llama-3.1-8B-exl3-4.0bpw \
  -b 4.0

Flag	Description
`-i, --in_dir`	Source HF model
`-o, --out_dir`	Output EXL3 directory
`-b, --bits`	Target average bits per weight
`-hb, --head_bits`	Precision for the output head
`-c, --cal_dir`	Custom calibration data

Python Inference

from exllamav3 import Model, Config, Cache, Tokenizer, Generator

config = Config.from_directory("Llama-3.1-8B-exl3-4.0bpw")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=8192)
model.load()

tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Explain quantization briefly.",
                            max_new_tokens=200)
print(output)

Object	Role
`Config`	Loads model config from an EXL3 dir
`Model`	The quantized model
`Cache`	KV cache (sizable = longer context)
`Generator`	Runs generation

Memory & Context

Lever	Effect
bpw at quant time	Lower bpw → less VRAM, some quality loss
Cache size	Larger → longer context, more VRAM
Cache quantization	Quantized KV cache to extend context
Head bits	Keep the head higher-precision for quality

Choosing a Bitrate (rough guide)

Target bpw	Typical use
2.0–2.5	Fit a very large model in tight VRAM (quality drops)
3.0–3.5	Aggressive but usable
4.0–4.5	Sweet spot for most 24GB setups
6.0+	Near-lossless, more VRAM

Ecosystem Integration

Target	Note
TabbyAPI	OpenAI-compatible server that uses ExLlamaV3
text-generation-webui	Loader support
Aphrodite Engine	Can serve EXL3-quantized models

ExLlamaV3 vs Other Approaches

Aspect	ExLlamaV3	llama.cpp (GGUF)	GPTQ/AWQ
Target	Consumer NVIDIA GPUs	CPU + GPU, cross-platform	GPU
Quantization	Variable-bitrate EXL3	k-quants	Fixed 4-bit
Precision control	Fine (any bpw)	Preset levels	Coarse
Best for	Max quality-per-VRAM on GPU	Portability, CPU	Standard 4-bit serving

ExLlamaV3 - Fast Quantized LLM Inference on Consumer GPUs Cheatsheet

ExLlamaV3 - Fast Quantized LLM Inference on Consumer GPUs Cheatsheet

Installation

The EXL3 Format

Quantizing a Model

Python Inference

Memory & Context

Choosing a Bitrate (rough guide)

Ecosystem Integration

ExLlamaV3 vs Other Approaches

Resources