Aphrodite Engine - High-Throughput LLM Serving Cheatsheet
Aphrodite Engine is a high-performance LLM inference and serving engine forked from vLLM. It builds on vLLM’s PagedAttention and continuous batching, then adds two big differentiators: the broadest quantization-format coverage of any engine (GGUF, GPTQ, AWQ, ExLlamaV3, AQLM, BitNet, Marlin, and more, plus quantized KV cache) and advanced samplers (DRY, XTC, Mirostat) that matter for chat/creative use. It exposes an OpenAI-compatible API, so it drops into existing clients.
Installation
| Method | Command |
|---|
| pip | pip install aphrodite-engine |
| Run the server | aphrodite run <model> |
| Docker | docker run --gpus all -p 2242:2242 alpindale/aphrodite-openai |
| Requirements | NVIDIA GPU + CUDA (AMD/ROCm supported on some builds) |
| Verify | aphrodite --version |
Starting a Server
# Serve a model with an OpenAI-compatible API on port 2242
aphrodite run meta-llama/Llama-3.1-8B-Instruct
# Serve a quantized GGUF model
aphrodite run ./model.Q4_K_M.gguf --quantization gguf
# Multi-GPU tensor parallelism
aphrodite run big-model --tensor-parallel-size 2
| Flag | Description |
|---|
--quantization | Format: gguf, gptq, awq, exl3, fp8, … |
--tensor-parallel-size N | Split the model across N GPUs |
--max-model-len N | Context length |
--gpu-memory-utilization 0.9 | Fraction of VRAM to use |
--kv-cache-dtype fp8 | Quantize the KV cache to save memory |
--dtype | Model compute dtype (auto, half, bfloat16) |
--port 2242 | API port |
--api-keys KEY | Require an API key |
OpenAI-Compatible API
curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","prompt":"Hello","max_tokens":50}'
| Endpoint | Purpose |
|---|
/v1/completions | Text completion |
/v1/chat/completions | Chat completion |
/v1/models | List loaded models |
/v1/embeddings | Embeddings (embedding models) |
| Format | Use |
|---|
| GGUF | llama.cpp-style k-quants (community models) |
| GPTQ / AWQ | Popular 4-bit weight-only formats |
| ExLlamaV3 (exl3) | High-quality variable-bit quantization |
| AQLM / QuIP# | Extreme low-bit |
| Marlin | Fast GPTQ/AWQ kernels |
| fp8 | 8-bit float weights/activations |
The breadth is the point: Aphrodite can serve community-quantized models that vanilla vLLM cannot load.
Advanced Samplers
| Sampler | Effect |
|---|
| DRY | ”Don’t Repeat Yourself” — reduces verbatim repetition |
| XTC | Exclude Top Choices — boosts creativity |
| Mirostat | Targets a perplexity setpoint |
min_p, top_a, tfs | Additional truncation samplers |
Pass these as extra fields in the request (e.g. "dry_multiplier": 0.8).
| Option | Effect |
|---|
--tensor-parallel-size | Scale across GPUs |
--kv-cache-dtype fp8 | Fit longer context / more concurrency |
--max-num-seqs | Concurrency (batch) limit |
--enforce-eager | Disable CUDA graphs (debugging) |
--quantization | Trade quality for memory/speed |
Aphrodite vs Other Inference Engines
| Aspect | Aphrodite | vLLM | llama.cpp |
|---|
| Base | vLLM fork | Original | C/C++ |
| Quant coverage | Broadest | Growing | GGUF k-quants |
| Samplers | DRY/XTC/Mirostat | Standard | Many |
| Best for | Community-quantized models, chat | Max throughput, standard formats | CPU/edge, GGUF |
Resources