Skip to content

ExLlamaV3 - Fast Quantized LLM Inference on Consumer GPUs Cheatsheet

ExLlamaV3 - Fast Quantized LLM Inference on Consumer GPUs Cheatsheet

ExLlamaV3 is a fast inference library for running quantized LLMs on consumer NVIDIA GPUs. Its EXL3 format provides high-quality, variable-bitrate quantization — you can target an average bits-per-weight (e.g. 3.0, 4.25, 6.0) to trade quality against VRAM precisely — letting large models fit on a single 24GB card while keeping strong throughput. It is the successor to ExLlamaV2 and is favored by the local-LLM community for squeezing big models into limited memory.

Installation

MethodCommand
pip (prebuilt wheel)pip install exllamav3
From sourcegit clone https://github.com/turboderp-org/exllamav3 && cd exllamav3 && pip install -e .
RequirementsNVIDIA GPU (Ampere+), CUDA, PyTorch
Verifypython -c "import exllamav3; print('ok')"

The EXL3 Format

ConceptMeaning
Variable bitrateTarget an average bits-per-weight (bpw), e.g. 2.5–8.0
Per-layer precisionDifferent layers can use different precision
Quality/size dialHigher bpw = better quality, more VRAM
CalibrationUses a calibration dataset during quantization

Quantizing a Model

# Convert an HF model to EXL3 at ~4.0 bits per weight
python -m exllamav3.convert \
  -i meta-llama/Llama-3.1-8B-Instruct \
  -o Llama-3.1-8B-exl3-4.0bpw \
  -b 4.0
FlagDescription
-i, --in_dirSource HF model
-o, --out_dirOutput EXL3 directory
-b, --bitsTarget average bits per weight
-hb, --head_bitsPrecision for the output head
-c, --cal_dirCustom calibration data

Python Inference

from exllamav3 import Model, Config, Cache, Tokenizer, Generator

config = Config.from_directory("Llama-3.1-8B-exl3-4.0bpw")
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=8192)
model.load()

tokenizer = Tokenizer.from_config(config)
generator = Generator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Explain quantization briefly.",
                            max_new_tokens=200)
print(output)
ObjectRole
ConfigLoads model config from an EXL3 dir
ModelThe quantized model
CacheKV cache (sizable = longer context)
GeneratorRuns generation

Memory & Context

LeverEffect
bpw at quant timeLower bpw → less VRAM, some quality loss
Cache sizeLarger → longer context, more VRAM
Cache quantizationQuantized KV cache to extend context
Head bitsKeep the head higher-precision for quality

Choosing a Bitrate (rough guide)

Target bpwTypical use
2.0–2.5Fit a very large model in tight VRAM (quality drops)
3.0–3.5Aggressive but usable
4.0–4.5Sweet spot for most 24GB setups
6.0+Near-lossless, more VRAM

Ecosystem Integration

TargetNote
TabbyAPIOpenAI-compatible server that uses ExLlamaV3
text-generation-webuiLoader support
Aphrodite EngineCan serve EXL3-quantized models

ExLlamaV3 vs Other Approaches

AspectExLlamaV3llama.cpp (GGUF)GPTQ/AWQ
TargetConsumer NVIDIA GPUsCPU + GPU, cross-platformGPU
QuantizationVariable-bitrate EXL3k-quantsFixed 4-bit
Precision controlFine (any bpw)Preset levelsCoarse
Best forMax quality-per-VRAM on GPUPortability, CPUStandard 4-bit serving

Resources