Ollama
Ollama is a tool for running large language models locally on your machine, offering privacy, control, and offline access to AI models such as Llama, Mistral, and CodeLlama.
Installation and Setup

| Command | Description |
| --- | --- |
| `curl -fsSL https://ollama.ai/install.sh \| sh` | Install Ollama on Linux/macOS |
| `brew install ollama` | Install via Homebrew (macOS) |
| `ollama --version` | Check installed version |
| `ollama serve` | Start Ollama server |
| `ollama ps` | List running models |
| `ollama list` | List installed models |

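Once `ollama serve` is running, you can confirm the server is reachable via the version endpoint (also listed under Monitoring and Debugging below). A minimal Python sketch, assuming the default host `127.0.0.1:11434` and the `requests` package:

```python
import requests

# Query the local Ollama server's version endpoint to confirm it is up.
# Assumes the default host/port; adjust if OLLAMA_HOST points elsewhere.
resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print("Ollama server version:", resp.json().get("version"))
```
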
Model Management

| Command | Description |
| --- | --- |
| `ollama pull llama3.1` | Download Llama 3.1 model |
| `ollama pull mistral` | Download Mistral model |
| `ollama pull codellama` | Download CodeLlama model |
| `ollama pull gemma:7b` | Download specific model size |
| `ollama show llama3.1` | Show model information |
| `ollama rm mistral` | Remove model |

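Models can also be pulled programmatically through the REST API's `/api/pull` endpoint. A hedged Python sketch; the request field names follow the same pattern as `/api/show` below and may differ between Ollama versions:

```python
import requests

# Ask the local Ollama server to download a model via the REST API.
# "stream": False requests a single final status object instead of a
# stream of progress updates; field names may vary across versions.
resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"name": "mistral", "stream": False},
    timeout=600,  # model downloads can take a while
)
resp.raise_for_status()
print(resp.json())  # typically a status object such as {"status": "success"}
```
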
Popular Models

General-Purpose Models

| Command | Description |
| --- | --- |
| `ollama pull llama3.1:8b` | Llama 3.1 8B parameters |
| `ollama pull llama3.1:70b` | Llama 3.1 70B parameters |
| `ollama pull mistral:7b` | Mistral 7B model |
| `ollama pull mixtral:8x7b` | Mixtral 8x7B mixture of experts |
| `ollama pull gemma:7b` | Google Gemma 7B |
| `ollama pull phi3:mini` | Microsoft Phi-3 Mini |

Code-Specialized Models

| Command | Description |
| --- | --- |
| `ollama pull codellama:7b` | CodeLlama 7B for coding |
| `ollama pull codellama:13b` | CodeLlama 13B for coding |
| `ollama pull codegemma:7b` | CodeGemma for code generation |
| `ollama pull deepseek-coder:6.7b` | DeepSeek Coder model |
| `ollama pull starcoder2:7b` | StarCoder2 for code |

Specialized Models

| Command | Description |
| --- | --- |
| `ollama pull llava:7b` | LLaVA multimodal model |
| `ollama pull nomic-embed-text` | Text embedding model |
| `ollama pull all-minilm` | Sentence embedding model |
| `ollama pull mxbai-embed-large` | Large embedding model |

Running Models

| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start interactive chat with Llama 3.1 |
| `ollama run mistral "Hello, how are you?"` | Single prompt to Mistral |
| `ollama run codellama "Write a Python function"` | Code generation with CodeLlama |
| `ollama run llava "Describe this image" --image photo.jpg` | Multimodal with image |

Chat Interface

| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start interactive chat |
| `/bye` | Exit chat session |
| `/clear` | Clear chat history |
| `/save chat.txt` | Save chat to file |
| `/load chat.txt` | Load chat from file |
| `/multiline` | Enable multiline input |

API Usage

REST API

| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello"}'` | Generate text via API |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}]}'` | Chat via API |
| `curl http://localhost:11434/api/tags` | List models via API |
| `curl http://localhost:11434/api/show -d '{"name":"llama3.1"}'` | Show model info via API |

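The `/api/chat` endpoint accepts the full message history, so multi-turn conversations work by resending previous messages. A minimal Python sketch, assuming the default port, `stream` disabled, the `requests` package, and that the assistant reply arrives under `message.content`:

```python
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def chat(messages, model="llama3.1"):
    # Send the accumulated conversation; Ollama returns the next assistant message.
    resp = requests.post(
        OLLAMA_CHAT_URL,
        json={"model": model, "messages": messages, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Usage: keep appending messages to preserve context across turns.
history = [{"role": "user", "content": "Give me a one-line definition of Docker."}]
reply = chat(history)
history.append({"role": "assistant", "content": reply})
history.append({"role": "user", "content": "Now compare it to a virtual machine."})
print(chat(history))
```
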
Streaming Responses

| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello","stream":true}'` | Stream response |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'` | Stream chat |

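With streaming enabled, the server sends one JSON object per line until a final object with `"done": true`. A sketch of consuming that stream from Python, assuming `requests` and the default endpoint:

```python
import json
import requests

# Stream a generation chunk by chunk; each line of the response body is a JSON object.
payload = {"model": "llama3.1", "prompt": "Explain Docker containers", "stream": True}
with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()
```
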
Model Configuration

Temperature and Parameters

| Command | Description |
| --- | --- |
| `ollama run llama3.1 --temperature 0.7` | Set temperature |
| `ollama run llama3.1 --top-p 0.9` | Set top-p sampling |
| `ollama run llama3.1 --top-k 40` | Set top-k sampling |
| `ollama run llama3.1 --repeat-penalty 1.1` | Set repeat penalty |
| `ollama run llama3.1 --seed 42` | Set random seed |

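Depending on the Ollama version, sampling settings can also be supplied per request through the REST API's `options` object, or baked into a Modelfile as shown further below. A hedged Python sketch of the per-request approach:

```python
import requests

# Pass sampling parameters in the "options" object of a generate request.
# The option names mirror the Modelfile PARAMETER keys (temperature, top_p, ...).
payload = {
    "model": "llama3.1",
    "prompt": "Summarize what a context window is.",
    "stream": False,
    "options": {
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "seed": 42,  # fixed seed for more reproducible output
    },
}
resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()
print(resp.json()["response"])
```
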
Context and Memory

| Command | Description |
| --- | --- |
| `ollama run llama3.1 --ctx-size 4096` | Set context window size |
| `ollama run llama3.1 --batch-size 512` | Set batch size |
| `ollama run llama3.1 --threads 8` | Set number of threads |

Custom Models

Creating Modelfiles

| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile` | Create custom model |
| `ollama create mymodel -f Modelfile --quantize q4_0` | Create with quantization |

Modelfile Examples
```dockerfile
# Basic Modelfile
FROM llama3.1
PARAMETER temperature 0.8
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
```

```dockerfile
# Advanced Modelfile
FROM codellama:7b
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert programmer. Always provide:
1. Clean, well-commented code
2. Explanation of the solution
3. Best practices and optimizations"""
```

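After saving either Modelfile, build the custom model with `ollama create mymodel -f Modelfile` and start chatting with `ollama run mymodel`.
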
Integration Examples
Python Integration
```python
import requests

def chat_with_ollama(prompt, model="llama3.1"):
    # Send a single non-streaming generate request and return the response text.
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["response"]

# Usage
result = chat_with_ollama("Explain quantum computing")
print(result)
```

Integration von JavaScript
```javascript async function chatWithOllama(prompt, model = "llama3.1") { const response = await fetch("http://localhost:11434/api/generate", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ model: model, prompt: prompt, stream: false }) }); const data = await response.json(); return data.response; }
// Usage chatWithOllama("Write a JavaScript function").then(console.log); ```_
Bash Integration
```bash
#!/bin/bash
# Query the local Ollama API and extract the response text with jq.
ollama_chat() {
  local prompt="$1"
  local model="${2:-llama3.1}"
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":false}" \
    | jq -r '.response'
}

# Usage
ollama_chat "Explain Docker containers"
```

Performance Optimization

| Command | Description |
| --- | --- |
| `ollama run llama3.1 --gpu-layers 32` | Use GPU acceleration |
| `ollama run llama3.1 --memory-limit 8GB` | Set memory limit |
| `ollama run llama3.1 --cpu-threads 8` | Set CPU threads |
| `ollama run llama3.1 --batch-size 1024` | Optimize batch size |

Environment Variables

| Variable | Description |
| --- | --- |
| `OLLAMA_HOST` | Set server host (default: 127.0.0.1:11434) |
| `OLLAMA_MODELS` | Set models directory |
| `OLLAMA_NUM_PARALLEL` | Number of parallel requests |
| `OLLAMA_MAX_LOADED_MODELS` | Max models in memory |
| `OLLAMA_FLASH_ATTENTION` | Enable flash attention |
| `OLLAMA_GPU_OVERHEAD` | GPU memory overhead |

Docker Usage

| Command | Description |
| --- | --- |
| `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama` | Run Ollama in Docker |
| `docker exec -it ollama ollama run llama3.1` | Run model in container |
| `docker exec -it ollama ollama pull mistral` | Pull model in container |

Docker Compose

```yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434

volumes:
  ollama:
```

Monitoring and Debugging

| Command | Description |
| --- | --- |
| `ollama logs` | View Ollama logs |
| `ollama ps` | Show running models and memory usage |
| `curl http://localhost:11434/api/version` | Check API version |
| `curl http://localhost:11434/api/tags` | List available models |

Model Quantization

| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile --quantize q4_0` | 4-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q5_0` | 5-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q8_0` | 8-bit quantization |
| `ollama create mymodel -f Modelfile --quantize f16` | 16-bit float |

Embedding Models

| Command | Description |
| --- | --- |
| `ollama pull nomic-embed-text` | Pull text embedding model |
| `curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Hello world"}'` | Generate embeddings |

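Embeddings come back as a plain vector of floats, which can be compared with cosine similarity for semantic search. A small Python sketch, assuming `nomic-embed-text` has been pulled and the server uses the default port:

```python
import math
import requests

def embed(text, model="nomic-embed-text"):
    # Request an embedding vector for the given text from the local server.
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Usage: higher scores indicate more semantically similar texts.
v1 = embed("How do I run a model locally?")
v2 = embed("Running an LLM on my own machine")
print(round(cosine_similarity(v1, v2), 3))
```
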
Troubleshooting

| Command | Description |
| --- | --- |
| `ollama --help` | Show help information |
| `ollama serve --help` | Show server options |
| `ps aux \| grep ollama` | Check if Ollama is running |
| `lsof -i :11434` | Check port usage |
| `ollama rm --all` | Remove all models |

Best Practices
- Choose model size based on available RAM (7B ≈ 4 GB, 13B ≈ 8 GB, 70B ≈ 40 GB)
- Use GPU acceleration when available for better performance
- Implement proper error handling in API integrations
- Monitor memory usage when running multiple models
- Use quantized models in resource-constrained environments
- Cache frequently used models locally
- Set context sizes appropriate to your use case
- Use streaming for long responses to improve the user experience
- Implement rate limiting for production API usage
- Update models regularly for improved performance and capabilities
Common Use Cases
Code Generation
```bash
ollama run codellama "Create a REST API in Python using FastAPI"
```

Text Analysis

```bash
ollama run llama3.1 "Analyze the sentiment of this text: 'I love this product!'"
```

Creative Writing

```bash
ollama run mistral "Write a short story about time travel"
```

Data Processing

```bash
ollama run llama3.1 "Convert this JSON to CSV format: {...}"
```