Ollama
Ollama is a tool for running large language models locally on your machine, providing privacy, control, and offline access to AI models such as Llama, Mistral, and CodeLlama.
Installation and Setup
| Command | Description |
|---|---|
| INLINE_CODE_10 | Install Ollama on Linux/macOS |
| INLINE_CODE_11 | Install via Homebrew (macOS) |
| INLINE_CODE_12 | Check installed version |
| INLINE_CODE_13 | Start Ollama server |
| INLINE_CODE_14 | List running models |
| INLINE_CODE_15 | List installed models |
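For orientation, a minimal sketch of the typical install-and-start flow is shown below; the script URL, Homebrew formula, and command names are assumptions based on the standard Ollama distribution rather than values taken from this page.
```bash
# Install Ollama on Linux/macOS via the official install script (assumed URL)
curl -fsSL https://ollama.com/install.sh | sh

# Or via Homebrew on macOS
brew install ollama

# Verify the install and start the server (listens on 127.0.0.1:11434 by default)
ollama --version
ollama serve

# In another terminal: check running and installed models
ollama ps
ollama list
```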
Model Management
| Command | Description |
|---|---|
| INLINE_CODE_16 | Download Llama 3.1 model |
| INLINE_CODE_17 | Download Mistral model |
| INLINE_CODE_18 | Download CodeLlama model |
| INLINE_CODE_19 | Download specific model size |
| INLINE_CODE_20 | Show model information |
| INLINE_CODE_21 | Remove model |
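A hedged example of the model management workflow; the model names and tags are illustrative entries from the public Ollama library, not values taken from this page.
```bash
# Download models (optionally pinning a size via the tag)
ollama pull llama3.1
ollama pull mistral
ollama pull codellama
ollama pull llama3.1:70b

# Inspect a model's parameters, template, and license, then remove it
ollama show llama3.1
ollama rm codellama
```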
Popular Models
General Purpose Models
| Command | Description |
|---|---|
| INLINE_CODE_22 | Llama 3.1 8B parameters |
| INLINE_CODE_23 | Llama 3.1 70B parameters |
| INLINE_CODE_24 | Mistral 7B model |
| INLINE_CODE_25 | Mixtral 8x7B mixture of experts |
| INLINE_CODE_26 | Google Gemma 7B |
| INLINE_CODE_27 | Microsoft Phi-3 Mini |
Code-Specialized Models
| Command | Description |
|---|---|
| INLINE_CODE_28 | CodeLlama 7B for coding |
| INLINE_CODE_29 | CodeLlama 13B for coding |
| INLINE_CODE_30 | CodeGemma for code generation |
| INLINE_CODE_31 | DeepSeek Coder model |
| INLINE_CODE_32 | StarCoder2 for code |
Specialized Models
| Command | Description |
|---|---|
| INLINE_CODE_33 | LLaVA multimodal model |
| INLINE_CODE_34 | Text embedding model |
| INLINE_CODE_35 | Sentence embedding model |
| INLINE_CODE_36 | Large embedding model |
Running Models
| Command | Description |
|---|---|
| INLINE_CODE_37 | Start interactive chat with Llama 3.1 |
| INLINE_CODE_38 | Single prompt to Mistral |
| INLINE_CODE_39 | Code generation with CodeLlama |
| INLINE_CODE_40 | Multimodal with image |
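The sketch below illustrates these invocation styles with assumed model names; the image path in the multimodal example is a placeholder.
```bash
# Interactive chat session
ollama run llama3.1

# Single prompt, non-interactive
ollama run mistral "Summarize the benefits of running LLMs locally"

# Code generation
ollama run codellama "Write a Python function that reverses a string"

# Multimodal prompt with an image (LLaVA); ./photo.jpg is a placeholder path
ollama run llava "Describe this image: ./photo.jpg"
```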
Interactive Chat Interface
| Command | Description |
|---|---|
| INLINE_CODE_41 | Start interactive chat |
| INLINE_CODE_42 | Exit chat session |
| INLINE_CODE_43 | Clear chat history |
| INLINE_CODE_44 | Save chat to file |
| INLINE_CODE_45 | Load chat from file |
| INLINE_CODE_46 | Enable multiline input |
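A sketch of an interactive session, assuming the standard REPL slash commands; check /? inside the session for the authoritative list.
```bash
# Start an interactive session, then control it with slash commands
ollama run llama3.1
# Inside the session (assumed commands):
#   /bye            exit the chat
#   /clear          clear the session context
#   /save mychat    save the session under a name
#   /load mychat    restore a saved session
#   """             wrap a multiline prompt in triple quotes
```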
API Usage
REST API
| Command | Description |
|---|---|
| INLINE_CODE_47 | Generate text via API |
| INLINE_CODE_48 | Chat via API |
| INLINE_CODE_49 | List models via API |
| INLINE_CODE_50 | Show model info via API |
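The curl calls below are a minimal sketch assuming the public Ollama REST API endpoints (/api/generate, /api/chat, /api/tags, /api/show); prompts and model names are illustrative.
```bash
# Generate a completion (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat with a message history
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

# List installed models and show details for one of them
curl http://localhost:11434/api/tags
curl http://localhost:11434/api/show -d '{"name": "llama3.1"}'
```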
Streaming Responses
| Command | Description |
|---|---|
| INLINE_CODE_51 | Stream response |
| INLINE_CODE_52 | Stream chat |
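Streaming is the API default when "stream": false is omitted; each response line is a JSON chunk. A hedged sketch:
```bash
# Stream a generation; each output line is a JSON object with a partial response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Tell me a short story"
}'

# Stream a chat and print only the incremental content (one chunk per line)
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Tell me a short story"}]
}' | jq -r '.message.content'
```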
Model Configuration
Temperature and Parameters
| Command | Description |
|---|---|
| INLINE_CODE_53 | Set temperature |
| INLINE_CODE_54 | Set top-p sampling |
| INLINE_CODE_55 | Set top-k sampling |
| INLINE_CODE_56 | Set repeat penalty |
| INLINE_CODE_57 | Set random seed |
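Sampling parameters can be supplied per request through the API's options object (or with /set parameter inside an interactive session); the values below are illustrative.
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about autumn",
  "stream": false,
  "options": {
    "temperature": 0.8,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1,
    "seed": 42
  }
}'
```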
Context and Memory
| Command | Description |
|---|---|
| INLINE_CODE_58 | Set context window size |
| INLINE_CODE_59 | Set batch size |
| INLINE_CODE_60 | Set number of threads |
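Context and resource settings follow the same pattern; num_ctx, num_batch, and num_thread are the option names used by the upstream API, and the values are illustrative.
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize the following long document ...",
  "stream": false,
  "options": {
    "num_ctx": 8192,
    "num_batch": 512,
    "num_thread": 8
  }
}'
```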
Custom Models
Creating Modelfiles
| Command | Description |
|---|---|
| INLINE_CODE_61 | Create custom model |
| INLINE_CODE_62 | Create with quantization |
Modelfile Examples
```dockerfile
# Basic Modelfile
FROM llama3.1
PARAMETER temperature 0.8
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
```
```dockerfile
# Advanced Modelfile
FROM codellama:7b
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert programmer. Always provide:
1. Clean, well-commented code
2. Explanation of the solution
3. Best practices and optimizations"""
```
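Once a Modelfile like the one above is saved, a custom model can be built and used as sketched below; the model name code-assistant is an example.
```bash
# Build a custom model from the Modelfile in the current directory
ollama create code-assistant -f ./Modelfile

# Run it like any other model
ollama run code-assistant "Refactor this function for readability"
ollama list
```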
Integration Examples
Python Integration
```python
import requests
import json

def chat_with_ollama(prompt, model="llama3.1"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["response"]

# Usage
result = chat_with_ollama("Explain quantum computing")
print(result)
```
JavaScript Integration
```javascript
async function chatWithOllama(prompt, model = "llama3.1") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model,
      prompt: prompt,
      stream: false
    })
  });
  const data = await response.json();
  return data.response;
}

// Usage
chatWithOllama("Write a JavaScript function").then(console.log);
```
Bash Integration
```bash
#!/bin/bash

ollama_chat() {
    local prompt="$1"
    local model="${2:-llama3.1}"   # default model if none is given
    curl -s http://localhost:11434/api/generate \
        -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":false}" \
        | jq -r '.response'
}

# Usage
ollama_chat "Explain Docker containers"
```
Performance Optimization
| Command | Description |
|---|---|
| INLINE_CODE_63 | Use GPU acceleration |
| INLINE_CODE_64 | Set memory limit |
| INLINE_CODE_65 | Set CPU threads |
| INLINE_CODE_66 | Optimize batch size |
Environment Variables
| Variable | Description |
|---|---|
| INLINE_CODE_67 | Set server host (default: 127.0.0.1:11434) |
| INLINE_CODE_68 | Set models directory |
| INLINE_CODE_69 | Number of parallel requests |
| INLINE_CODE_70 | Max models in memory |
| INLINE_CODE_71 | Enable flash attention |
| INLINE_CODE_72 | GPU memory overhead |
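A hedged example of configuring the server through environment variables before starting it; the variable names follow the upstream documentation and the values are illustrative.
```bash
export OLLAMA_HOST=0.0.0.0:11434          # listen on all interfaces
export OLLAMA_MODELS=/data/ollama/models  # custom models directory
export OLLAMA_NUM_PARALLEL=4              # parallel requests
export OLLAMA_MAX_LOADED_MODELS=2         # models kept in memory
export OLLAMA_FLASH_ATTENTION=1           # enable flash attention where supported
ollama serve
```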
Docker Usage
| Command | Description |
|---|---|
| INLINE_CODE_73 | Run Ollama in Docker |
| INLINE_CODE_74 | Run model in container |
| INLINE_CODE_75 | Pull model in container |
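A sketch of the container workflow, assuming the official ollama/ollama image; add --gpus=all (with the NVIDIA container toolkit installed) for GPU support.
```bash
# Start the server with a persistent volume for downloaded models
docker run -d --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

# Pull and run a model inside the container
docker exec -it ollama ollama pull llama3.1
docker exec -it ollama ollama run llama3.1
```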
Docker Compose
```yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434

volumes:
  ollama:
```
Monitoring and Debugging
| Command | Description |
|---|---|
| INLINE_CODE_76 | View Ollama logs |
| INLINE_CODE_77 | Show running models and memory usage |
| INLINE_CODE_78 | Check API version |
| INLINE_CODE_79 | List available models |
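Typical checks, assuming a systemd-based Linux install for the log example; on other platforms the log location differs.
```bash
# Running models and their memory usage
ollama ps

# API health and version
curl http://localhost:11434/api/version
curl http://localhost:11434/api/tags

# Follow server logs (systemd service installs)
journalctl -u ollama -f
```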
Model Quantization
| Command | Description |
|---|---|
| INLINE_CODE_80 | 4-bit quantization |
| INLINE_CODE_81 | 5-bit quantization |
| INLINE_CODE_82 | 8-bit quantization |
| INLINE_CODE_83 | 16-bit float |
Embedding Models
| Command | Description |
|---|---|
| INLINE_CODE_84 | Pull text embedding model |
| INLINE_CODE_85 | Generate embeddings |
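A minimal embeddings sketch, assuming the nomic-embed-text model and the /api/embeddings endpoint from the public API.
```bash
ollama pull nomic-embed-text

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The sky is blue because of Rayleigh scattering"
}'
```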
Troubleshooting
| Command | Description |
|---|---|
| INLINE_CODE_86 | Show help information |
| INLINE_CODE_87 | Show server options |
| INLINE_CODE_88 | Check if Ollama is running |
| INLINE_CODE_89 | Check port usage |
| INLINE_CODE_90 | Remove all models |
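Some quick diagnostics, assuming the standard CLI and default port; the root endpoint responds with a short status message when the server is up.
```bash
# Built-in help
ollama --help
ollama serve --help

# Is the server reachable on the default port?
curl -s http://localhost:11434/

# What is bound to port 11434?
lsof -i :11434

# Reclaim disk space by removing models you no longer need
ollama rm mistral
```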
Best Practices
- Choose model size based on available RAM (7B ≈ 4GB, 13B ≈ 8GB, 70B ≈ 40GB)
- Use GPU acceleration when available for better performance
- Implement proper error handling in API integrations
- Monitor memory usage when running multiple models
- Use quantized models in resource-constrained environments
- Cache frequently used models locally
- Set context sizes appropriate to your use case
- Use streaming for long responses to improve the user experience
- Implement rate limiting for production API usage
- Update models regularly for improved performance and capabilities
Common Use Cases
Code Generation
```bash
ollama run codellama "Create a REST API in Python using FastAPI"
```
Text Analysis
```bash
ollama run llama3.1 "Analyze the sentiment of this text: 'I love this product!'"
```
Creative Writing
```bash
ollama run mistral "Write a short story about time travel"
```
Data Processing
```bash
ollama run llama3.1 "Convert this JSON to CSV format: {...}"
```