llama.cpp is a high-performance LLM inference engine written in C/C++ that runs quantized models on consumer hardware. It supports the GGUF format, CPU and GPU inference (CUDA, Metal, Vulkan), a server mode with an OpenAI-compatible API, and a wide range of model architectures.
Installation
# Clone and build with CUDA support
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Build with Metal (macOS Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
# Build with Vulkan (cross-platform GPU)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
# CPU-only build
cmake -B build
cmake --build build --config Release -j$(nproc)
# Install Python bindings
pip install llama-cpp-python
# Install with CUDA support for Python
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
Model Conversion
# Convert HuggingFace model to GGUF format
python convert_hf_to_gguf.py \
/path/to/hf-model \
--outfile model-f16.gguf \
--outtype f16
# Convert with BF16
python convert_hf_to_gguf.py \
/path/to/hf-model \
--outfile model-bf16.gguf \
--outtype bf16
Quantization
# Quantize to different precision levels
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
./build/bin/llama-quantize model-f16.gguf model-q8_0.gguf Q8_0
./build/bin/llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
./build/bin/llama-quantize model-f16.gguf model-q2_k.gguf Q2_K
# Importance matrix quantization (better quality)
./build/bin/llama-imatrix \
-m model-f16.gguf \
-f calibration_data.txt \
-o imatrix.dat
./build/bin/llama-quantize \
--imatrix imatrix.dat \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
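The convert, imatrix, and quantize steps above can be chained from a script. The following is a minimal sketch that only assembles the command lines shown in this document; the helper name `build_pipeline`, the `out_prefix` naming scheme, and the `bin_dir` default are assumptions made for illustration, not part of llama.cpp.

```python
import shlex
from typing import List, Optional


def build_pipeline(
    hf_dir: str,
    out_prefix: str,
    quant_types: List[str],
    calibration_file: Optional[str] = None,
    bin_dir: str = "./build/bin",  # assumed build output directory
) -> List[List[str]]:
    """Return the convert -> (imatrix) -> quantize commands as argument lists."""
    f16 = f"{out_prefix}-f16.gguf"
    cmds = [[
        "python", "convert_hf_to_gguf.py", hf_dir,
        "--outfile", f16, "--outtype", "f16",
    ]]

    imatrix_args: List[str] = []
    if calibration_file is not None:
        # Optional importance-matrix step, reused by every quantize call below.
        imatrix = f"{out_prefix}-imatrix.dat"
        cmds.append([f"{bin_dir}/llama-imatrix",
                     "-m", f16, "-f", calibration_file, "-o", imatrix])
        imatrix_args = ["--imatrix", imatrix]

    for qtype in quant_types:
        out = f"{out_prefix}-{qtype.lower()}.gguf"
        cmds.append([f"{bin_dir}/llama-quantize", *imatrix_args, f16, out, qtype])
    return cmds


# Dry run: print the commands instead of executing them.
for cmd in build_pipeline("/path/to/hf-model", "model",
                          ["Q4_K_M", "Q5_K_M"], "calibration_data.txt"):
    print(shlex.join(cmd))
```

To execute the steps rather than print them, each list can be passed to `subprocess.run(cmd, check=True)` so the pipeline stops at the first failing step.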
Quantization Levels
| Type | Bits | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2 | ~2.7 GB | Low | Fastest |
| Q3_K_M | 3 | ~3.3 GB | Fair | Fast |
| Q4_K_M | 4 | ~4.1 GB | Good | Fast |
| Q5_K_M | 5 | ~4.8 GB | Very Good | Medium |
| Q6_K | 6 | ~5.5 GB | Excellent | Medium |
| Q8_0 | 8 | ~7.2 GB | Near FP16 | Slower |
| F16 | 16 | ~13.5 GB | Baseline | Slowest |
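The sizes in the table follow roughly from parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate, assuming about 6.74 billion parameters for a "7B" model (a value chosen here because it reproduces the F16 row); the function names are illustrative. It also shows that the effective bits per weight sit somewhat above the nominal "Bits" column, which is expected because K-quants store scales and keep some tensors at higher precision.

```python
ASSUMED_PARAMS_7B = 6.74e9  # assumption: parameter count of a typical "7B" model

# Approximate file sizes copied from the table above (GB).
TABLE_SIZES_GB = {
    "Q2_K": 2.7, "Q3_K_M": 3.3, "Q4_K_M": 4.1, "Q5_K_M": 4.8,
    "Q6_K": 5.5, "Q8_0": 7.2, "F16": 13.5,
}


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """File size in GB if every weight used exactly bits_per_weight bits."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average bits actually spent per weight for a file of the given size."""
    return size_gb * 1e9 * 8 / n_params


for name, size_gb in TABLE_SIZES_GB.items():
    bpw = effective_bits_per_weight(size_gb, ASSUMED_PARAMS_7B)
    print(f"{name:7s} ~{bpw:.2f} bits/weight")
```

The estimate covers only the weights on disk; memory use at runtime is higher because of the KV cache and compute buffers.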
Text Generation (CLI)
# Basic generation
./build/bin/llama-cli \
-m model-q4_k_m.gguf \
-p "Explain machine learning:" \
-n 256
# Interactive chat mode
./build/bin/llama-cli \
-m model-q4_k_m.gguf \
--interactive \
--color \
-n -1
# With sampling parameters
./build/bin/llama-cli \
-m model-q4_k_m.gguf \
-p "Write a poem about AI:" \
-n 200 \
--temp 0.7 \
--top-p 0.9 \
--top-k 40 \
--repeat-penalty 1.1
# GPU offloading (put 35 layers on GPU)
./build/bin/llama-cli \
-m model-q4_k_m.gguf \
-p "Hello" \
-ngl 35
# Full GPU offload
./build/bin/llama-cli \
-m model-q4_k_m.gguf \
-p "Hello" \
-ngl 999
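Passing `-ngl 999` simply requests more layers than the model has, so everything is offloaded. When the model does not fit in VRAM, a partial value has to be chosen. The following is a rough heuristic sketch, not something llama.cpp provides: it assumes all layers are about the same size and reserves a fixed amount of VRAM for the KV cache and buffers. The function name and the `reserve_gb` default are assumptions.

```python
def layers_that_fit(
    model_size_gb: float,
    n_layers: int,
    free_vram_gb: float,
    reserve_gb: float = 1.0,  # assumed headroom for KV cache and compute buffers
) -> int:
    """Estimate a value for -ngl, assuming equally sized layers."""
    if n_layers <= 0 or model_size_gb <= 0:
        raise ValueError("model_size_gb and n_layers must be positive")
    per_layer_gb = model_size_gb / n_layers
    budget_gb = max(free_vram_gb - reserve_gb, 0.0)
    # Never suggest more layers than the model actually has.
    return min(n_layers, int(budget_gb // per_layer_gb))


# Example: a ~4.1 GB Q4_K_M file with 32 layers on a GPU with 3 GB free.
ngl = layers_that_fit(4.1, 32, 3.0)
print(f"suggested: -ngl {ngl}")
```

The result is only a starting point; larger context sizes need more headroom, so the value may have to be lowered if the load fails with an out-of-memory error.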
Server Mode
# Start OpenAI-compatible API server
./build/bin/llama-server \
-m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 999
# Server with context size and parallel requests
./build/bin/llama-server \
-m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096 \
-np 4 \
-ngl 999
# Server with chat template
./build/bin/llama-server \
-m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
--chat-template llama3 \
-ngl 999
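A script that starts the server usually has to wait until the model is loaded before sending requests. The sketch below builds the same command line as above and polls the `/health` endpoint used later in this document. The helper names are illustrative, and the host defaults to `127.0.0.1` here as a cautious assumption, since `0.0.0.0` makes the server reachable from other machines.

```python
import subprocess
import time
import urllib.error
import urllib.request
from typing import List, Optional


def build_server_cmd(
    model_path: str,
    host: str = "127.0.0.1",
    port: int = 8080,
    ctx_size: int = 4096,
    parallel: int = 1,
    n_gpu_layers: int = 999,
    binary: str = "./build/bin/llama-server",
) -> List[str]:
    """Assemble the llama-server command using the flags shown above."""
    return [binary, "-m", model_path, "--host", host, "--port", str(port),
            "-c", str(ctx_size), "-np", str(parallel), "-ngl", str(n_gpu_layers)]


def wait_until_healthy(base_url: str, timeout_s: float = 60.0,
                       interval_s: float = 0.5) -> bool:
    """Poll GET /health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading the model
        time.sleep(interval_s)
    return False


def start_server(model_path: str, **kwargs) -> Optional[subprocess.Popen]:
    """Launch the server; return the process once healthy, else None."""
    cmd = build_server_cmd(model_path, **kwargs)
    proc = subprocess.Popen(cmd)
    port = kwargs.get("port", 8080)
    if wait_until_healthy(f"http://127.0.0.1:{port}"):
        return proc
    proc.terminate()
    return None
```

A call such as `start_server("model-q4_k_m.gguf", parallel=4)` would then return a process handle that can be stopped with `proc.terminate()` when the work is done.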
API Usage
# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain transformers."}
],
"temperature": 0.7,
"max_tokens": 256
}'
# Text completions
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The meaning of life is",
"temperature": 0.8,
"n_predict": 128
}'
# Embeddings (requires starting the server with --embeddings)
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "Hello world"}'
# Health check
curl http://localhost:8080/health
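The chat completion request above can also be sent from Python with only the standard library. This is a minimal sketch assuming the server from the previous section is listening on `localhost:8080`; the helper names are illustrative, and the response is assumed to follow the OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    base_url: str,
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int = 256,
    model: str = "local-model",
) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]


def chat(base_url: str, messages: List[Dict[str, str]], **kwargs) -> str:
    """Send the request and return the assistant's reply."""
    req = build_chat_request(base_url, messages, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With a running server, `chat("http://localhost:8080", [{"role": "user", "content": "Explain transformers."}])` would return the generated text as a string.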
Python Bindings
from llama_cpp import Llama
# Load model
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=-1,   # Offload all layers to GPU
    n_threads=8,       # CPU threads
    verbose=False,
)
# Text completion
output = llm(
    "Explain quantum computing:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    echo=False,
)
print(output["choices"][0]["text"])
# Chat completion
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["message"]["content"])
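For interactive use it is often nicer to print tokens as they arrive. `create_chat_completion` accepts `stream=True` and then yields chunks instead of a single response. The sketch below shows one way to consume such a stream; the helper `collect_stream` is illustrative, and it assumes each chunk carries its text under `choices[0]["delta"]["content"]`, following the OpenAI streaming shape.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def collect_stream(
    chunks: Iterable[Dict[str, Any]],
    on_token: Optional[Callable[[str], None]] = None,
) -> str:
    """Concatenate streamed text pieces, optionally reporting each one."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:  # the first and last chunks usually carry no text
            parts.append(text)
            if on_token is not None:
                on_token(text)
    return "".join(parts)


# Usage with the `llm` object created above (not executed here):
# stream = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "What is deep learning?"}],
#     max_tokens=256,
#     stream=True,
# )
# reply = collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))
```

Keeping the stream handling in a plain function makes it easy to try out with hand-written chunks before a model is loaded.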
Benchmarking
# Benchmark model performance
./build/bin/llama-bench \
-m model-q4_k_m.gguf \
-ngl 999 \
-p 512 \
-n 128
# Perplexity evaluation
./build/bin/llama-perplexity \
-m model-q4_k_m.gguf \
-f wikitext-2-raw/wiki.test.raw \
-ngl 999
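Perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model finds the test text less surprising. The following sketch shows only that definition on a list of per-token log-probabilities; it is not how `llama-perplexity` is implemented, and the function name is illustrative.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean(log p(token))) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
uniform_four = [math.log(0.25)] * 10
print(f"{perplexity(uniform_four):.3f}")
```

Comparing the perplexity of a quantized file against the F16 baseline on the same text is a common way to judge how much quality a quantization level costs.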
Common CLI Options
| Flag | Description |
|---|---|
| -m | Model file path |
| -p | Prompt text |
| -n | Max tokens to generate (-1 = no limit) |
| -c | Context size in tokens (0 = use the model's trained context; the default depends on the version) |
| -ngl | Number of layers offloaded to GPU |
| -t | Number of CPU threads |
| --temp | Temperature (0.0 = greedy) |
| --top-p | Top-p sampling |
| --top-k | Top-k sampling |
| --repeat-penalty | Repetition penalty |
| -b | Batch size for prompt processing |
| -np | Number of parallel sequences (server) |
| --interactive | Interactive chat mode |
| --color | Colorize output |
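The effect of `--temp`, `--top-k`, and `--top-p` is easier to see on a toy distribution. The following is a simplified illustration of these three ideas, not llama.cpp's sampler: the real sampler chain has more stages and may apply them in a different order. All function names are illustrative.

```python
import math
import random
from typing import Dict, List


def apply_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; 0.0 collapses to the top token."""
    if temperature <= 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs: List[float], top_k: int = 40,
                       top_p: float = 0.9) -> Dict[int, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:  # in this sketch, top_k <= 0 means "no top-k limit"
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(logits: List[float], temperature: float = 0.7, top_k: int = 40,
           top_p: float = 0.9, rng=random) -> int:
    """Draw one token id from the filtered distribution."""
    dist = filter_top_k_top_p(apply_temperature(logits, temperature), top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Lower temperatures and smaller `top_k` or `top_p` values concentrate probability on fewer tokens, which makes output more deterministic; with a temperature of 0.0 this sketch always returns the highest-scoring token.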