Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Aphrodite Engine is a high-performance LLM inference and serving engine forked from vLLM. It builds on vLLM’s PagedAttention and continuous batching, then adds two big differentiators: the broadest quantization-format coverage of any engine (GGUF, GPTQ, AWQ, ExLlamaV3, AQLM, BitNet, Marlin, and more, plus quantized KV cache) and advanced samplers (DRY, XTC, Mirostat) that matter for chat/creative use. It exposes an OpenAI-compatible API, so it drops into existing clients.

Installation

Method	Command
pip	`pip install aphrodite-engine`
Run the server	`aphrodite run <model>`
Docker	`docker run --gpus all -p 2242:2242 alpindale/aphrodite-openai`
Requirements	NVIDIA GPU + CUDA (AMD/ROCm supported on some builds)
Verify	`aphrodite --version`

Starting a Server

# Serve a model with an OpenAI-compatible API on port 2242
aphrodite run meta-llama/Llama-3.1-8B-Instruct

# Serve a quantized GGUF model
aphrodite run ./model.Q4_K_M.gguf --quantization gguf

# Multi-GPU tensor parallelism
aphrodite run big-model --tensor-parallel-size 2

Flag	Description
`--quantization`	Format: `gguf`, `gptq`, `awq`, `exl3`, `fp8`, …
`--tensor-parallel-size N`	Split the model across N GPUs
`--max-model-len N`	Context length
`--gpu-memory-utilization 0.9`	Fraction of VRAM to use
`--kv-cache-dtype fp8`	Quantize the KV cache to save memory
`--dtype`	Model compute dtype (`auto`, `half`, `bfloat16`)
`--port 2242`	API port
`--api-keys KEY`	Require an API key

OpenAI-Compatible API

curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","prompt":"Hello","max_tokens":50}'

Endpoint	Purpose
`/v1/completions`	Text completion
`/v1/chat/completions`	Chat completion
`/v1/models`	List loaded models
`/v1/embeddings`	Embeddings (embedding models)

Quantization Formats

Format	Use
GGUF	llama.cpp-style k-quants (community models)
GPTQ / AWQ	Popular 4-bit weight-only formats
ExLlamaV3 (exl3)	High-quality variable-bit quantization
AQLM / QuIP#	Extreme low-bit
Marlin	Fast GPTQ/AWQ kernels
fp8	8-bit float weights/activations

The breadth is the point: Aphrodite can serve community-quantized models that vanilla vLLM cannot load.

Advanced Samplers

Sampler	Effect
DRY	”Don’t Repeat Yourself” — reduces verbatim repetition
XTC	Exclude Top Choices — boosts creativity
Mirostat	Targets a perplexity setpoint
`min_p`, `top_a`, `tfs`	Additional truncation samplers

Pass these as extra fields in the request (e.g. "dry_multiplier": 0.8).

Performance Tuning

Option	Effect
`--tensor-parallel-size`	Scale across GPUs
`--kv-cache-dtype fp8`	Fit longer context / more concurrency
`--max-num-seqs`	Concurrency (batch) limit
`--enforce-eager`	Disable CUDA graphs (debugging)
`--quantization`	Trade quality for memory/speed

Aphrodite vs Other Inference Engines

Aspect	Aphrodite	vLLM	llama.cpp
Base	vLLM fork	Original	C/C++
Quant coverage	Broadest	Growing	GGUF k-quants
Samplers	DRY/XTC/Mirostat	Standard	Many
Best for	Community-quantized models, chat	Max throughput, standard formats	CPU/edge, GGUF

Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Installation

Starting a Server

OpenAI-Compatible API

Quantization Formats

Advanced Samplers

Performance Tuning

Aphrodite vs Other Inference Engines

Resources