Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Aphrodite Engine ist ein hochperformanter LLM Inference und Serving Engine gefärkt aus vLLM. Es baut auf vLLM PagedAttention und Continuous Batching auf, dann fügt zwei Große Differenzierer hinzu: die breiteste Quantisierungs-Format Abdeckung von jedem Engine (GGUF, GPTQ, AWQ, ExLlamaV3, AQLM, BitNet, Marlin, und mehr, plus Quantisierte KV Cache) und Advanced Samplers (DRY, XTC, Mirostat), die für Chat/Creative Nutzung zählen. Es exponiert eine OpenAI-kompatible API, damit es in bestehende Clients passt.

Installation

Methode	Befehl
pip	`pip install aphrodite-engine`
Starte den Server	`aphrodite run <model>`
Docker	`docker run --gpus all -p 2242:2242 alpindale/aphrodite-openai`
Anforderungen	NVIDIA GPU + CUDA (AMD/ROCm auf einigen Builds unterstützt)
Überprüfen	`aphrodite --version`

Starte einen Server

# Serve ein Modell mit einem OpenAI-kompatiblen API auf Port 2242
aphrodite run meta-llama/Llama-3.1-8B-Instruct

# Serve ein Quantisiertes GGUF Modell
aphrodite run ./model.Q4_K_M.gguf --quantization gguf

# Multi-GPU Tensor Parallelism
aphrodite run big-model --tensor-parallel-size 2

Flag	Beschreibung
`--quantization`	Format: `gguf`, `gptq`, `awq`, `exl3`, `fp8`, …
`--tensor-parallel-size N`	Teile das Modell über N GPUs
`--max-model-len N`	Context Length
`--gpu-memory-utilization 0.9`	Fraktion des VRAM zu nutzen
`--kv-cache-dtype fp8`	Quantisiere den KV Cache zum Speicher sparen
`--dtype`	Modell Compute Dtype (`auto`, `half`, `bfloat16`)
`--port 2242`	API Port
`--api-keys KEY`	Erfordere einen API Schlüssel

OpenAI-Kompatible API

curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","prompt":"Hello","max_tokens":50}'

Endpoint	Zweck
`/v1/completions`	Text Completion
`/v1/chat/completions`	Chat Completion
`/v1/models`	Liste geladen Modelle auf
`/v1/embeddings`	Embeddings (Embedding Modelle)

Quantisierungs Formate

Format	Nutzung
GGUF	llama.cpp-Style k-Quants (Community Modelle)
GPTQ / AWQ	Beliebte 4-Bit Weight-Only Formate
ExLlamaV3 (exl3)	Hochqualität Variable-Bit Quantisierung
AQLM / QuIP#	Extreme Low-Bit
Marlin	Schnelle GPTQ/AWQ Kernels
fp8	8-Bit Float Weights/Activations

Die Breite ist der Punkt: Aphrodite kann Community-Quantisierte Modelle serve, die Vanilla vLLM nicht laden kann.

Advanced Samplers

Sampler	Effekt
DRY	”Nicht wiederhole Dich selbst” — reduziert verbatim Wiederholung
XTC	Ausschließe Top Choises — steigert Kreativität
Mirostat	Ziele auf eine Perplexität Setpoint
`min_p`, `top_a`, `tfs`	Zusätzliche Kürzung Samplers

Durchreichen diese als Extra Felder in der Anfrage (z.B. "dry_multiplier": 0.8).

Performance Tuning

Option	Effekt
`--tensor-parallel-size`	Skaliere über GPUs
`--kv-cache-dtype fp8`	Passe länger Kontext / mehr Concurrency an
`--max-num-seqs`	Concurrency (Batch) Limit
`--enforce-eager`	Deaktiviere CUDA Graphs (Debugging)
`--quantization`	Trade Qualität für Speicher/Geschwindigkeit

Aphrodite vs Andere Inference Engines

Aspekt	Aphrodite	vLLM	llama.cpp
Base	vLLM Fork	Original	C/C++
Quant Abdeckung	Breiteste	Wachsend	GGUF k-Quants
Samplers	DRY/XTC/Mirostat	Standard	Viele
Beste für	Community-Quantisierte Modelle, Chat	Max Throughput, Standard Formate	CPU/Edge, GGUF

Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Aphrodite Engine - High-Throughput LLM Serving Cheatsheet

Installation

Starte einen Server

OpenAI-Kompatible API

Quantisierungs Formate

Advanced Samplers

Performance Tuning

Aphrodite vs Andere Inference Engines

Ressourcen