Skip to content

Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Aphrodite Engine is a high-performance LLM inference and serving engine forked from vLLM. It builds on vLLM’s PagedAttention and continuous batching, then adds two big differentiators: the broadest quantization-format coverage of any engine (GGUF, GPTQ, AWQ, ExLlamaV3, AQLM, BitNet, Marlin, and more, plus quantized KV cache) and advanced samplers (DRY, XTC, Mirostat) that matter for chat/creative use. It exposes an OpenAI-compatible API, so it drops into existing clients.

Installation

MethodCommand
pippip install aphrodite-engine
Run the serveraphrodite run <model>
Dockerdocker run --gpus all -p 2242:2242 alpindale/aphrodite-openai
RequirementsNVIDIA GPU + CUDA (AMD/ROCm supported on some builds)
Verifyaphrodite --version

Starting a Server

# Serve a model with an OpenAI-compatible API on port 2242
aphrodite run meta-llama/Llama-3.1-8B-Instruct

# Serve a quantized GGUF model
aphrodite run ./model.Q4_K_M.gguf --quantization gguf

# Multi-GPU tensor parallelism
aphrodite run big-model --tensor-parallel-size 2
FlagDescription
--quantizationFormat: gguf, gptq, awq, exl3, fp8, …
--tensor-parallel-size NSplit the model across N GPUs
--max-model-len NContext length
--gpu-memory-utilization 0.9Fraction of VRAM to use
--kv-cache-dtype fp8Quantize the KV cache to save memory
--dtypeModel compute dtype (auto, half, bfloat16)
--port 2242API port
--api-keys KEYRequire an API key

OpenAI-Compatible API

curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","prompt":"Hello","max_tokens":50}'
EndpointPurpose
/v1/completionsText completion
/v1/chat/completionsChat completion
/v1/modelsList loaded models
/v1/embeddingsEmbeddings (embedding models)

Quantization Formats

FormatUse
GGUFllama.cpp-style k-quants (community models)
GPTQ / AWQPopular 4-bit weight-only formats
ExLlamaV3 (exl3)High-quality variable-bit quantization
AQLM / QuIP#Extreme low-bit
MarlinFast GPTQ/AWQ kernels
fp88-bit float weights/activations

The breadth is the point: Aphrodite can serve community-quantized models that vanilla vLLM cannot load.

Advanced Samplers

SamplerEffect
DRY”Don’t Repeat Yourself” — reduces verbatim repetition
XTCExclude Top Choices — boosts creativity
MirostatTargets a perplexity setpoint
min_p, top_a, tfsAdditional truncation samplers

Pass these as extra fields in the request (e.g. "dry_multiplier": 0.8).

Performance Tuning

OptionEffect
--tensor-parallel-sizeScale across GPUs
--kv-cache-dtype fp8Fit longer context / more concurrency
--max-num-seqsConcurrency (batch) limit
--enforce-eagerDisable CUDA graphs (debugging)
--quantizationTrade quality for memory/speed

Aphrodite vs Other Inference Engines

AspectAphroditevLLMllama.cpp
BasevLLM forkOriginalC/C++
Quant coverageBroadestGrowingGGUF k-quants
SamplersDRY/XTC/MirostatStandardMany
Best forCommunity-quantized models, chatMax throughput, standard formatsCPU/edge, GGUF

Resources