ExLlamaV3 - Fast Quantized LLM Inference on Consumer GPUs Cheatsheet
ExLlamaV3 is a fast inference library for running quantized LLMs on consumer NVIDIA GPUs. Its EXL3 format provides high-quality, variable-bitrate quantization — you can target an average bits-per-weight (e.g. 3.0, 4.25, 6.0) to trade quality against VRAM precisely — letting large models fit on a single 24GB card while keeping strong throughput. It is the successor to ExLlamaV2 and is favored by the local-LLM community for squeezing big models into limited memory.
Installation
| Method | Command |
|---|
| pip (prebuilt wheel) | pip install exllamav3 |
| From source | git clone https://github.com/turboderp-org/exllamav3 && cd exllamav3 && pip install -e . |
| Requirements | NVIDIA GPU (Ampere+), CUDA, PyTorch |
| Verify | python -c "import exllamav3; print('ok')" |
| Concept | Meaning |
|---|
| Variable bitrate | Target an average bits-per-weight (bpw), e.g. 2.5–8.0 |
| Per-layer precision | Different layers can use different precision |
| Quality/size dial | Higher bpw = better quality, more VRAM |
| Calibration | Uses a calibration dataset during quantization |
Quantizing a Model
# Convert an HF model to EXL3 at ~4.0 bits per weight
python -m exllamav3.convert \
-i meta-llama/Llama-3.1-8B-Instruct \
-o Llama-3.1-8B-exl3-4.0bpw \
-b 4.0
| Flag | Description |
|---|
-i, --in_dir | Source HF model |
-o, --out_dir | Output EXL3 directory |
-b, --bits | Target average bits per weight |
-hb, --head_bits | Precision for the output head |
-c, --cal_dir | Custom calibration data |
Python Inference
from exllamav3 import Model, Config, Cache, Tokenizer, Generator
config = Config.from_directory("Llama-3.1-8B-exl3-4.0bpw")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=8192)
model.load()
tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)
output = generator.generate(prompt="Explain quantization briefly.",
max_new_tokens=200)
print(output)
| Object | Role |
|---|
Config | Loads model config from an EXL3 dir |
Model | The quantized model |
Cache | KV cache (sizable = longer context) |
Generator | Runs generation |
Memory & Context
| Lever | Effect |
|---|
| bpw at quant time | Lower bpw → less VRAM, some quality loss |
| Cache size | Larger → longer context, more VRAM |
| Cache quantization | Quantized KV cache to extend context |
| Head bits | Keep the head higher-precision for quality |
Choosing a Bitrate (rough guide)
| Target bpw | Typical use |
|---|
| 2.0–2.5 | Fit a very large model in tight VRAM (quality drops) |
| 3.0–3.5 | Aggressive but usable |
| 4.0–4.5 | Sweet spot for most 24GB setups |
| 6.0+ | Near-lossless, more VRAM |
Ecosystem Integration
| Target | Note |
|---|
| TabbyAPI | OpenAI-compatible server that uses ExLlamaV3 |
| text-generation-webui | Loader support |
| Aphrodite Engine | Can serve EXL3-quantized models |
ExLlamaV3 vs Other Approaches
| Aspect | ExLlamaV3 | llama.cpp (GGUF) | GPTQ/AWQ |
|---|
| Target | Consumer NVIDIA GPUs | CPU + GPU, cross-platform | GPU |
| Quantization | Variable-bitrate EXL3 | k-quants | Fixed 4-bit |
| Precision control | Fine (any bpw) | Preset levels | Coarse |
| Best for | Max quality-per-VRAM on GPU | Portability, CPU | Standard 4-bit serving |
Resources