Ollama
Ollama is a tool for running large language models locally on your machine, offering privacy, control, and offline access to AI models such as Llama, Mistral, and CodeLlama.
Installation and Setup

| Command | Description |
| --- | --- |
| `curl -fsSL https://ollama.ai/install.sh \| sh` | Install Ollama on Linux/macOS |
| `brew install ollama` | Install via Homebrew (macOS) |
| `ollama --version` | Check installed version |
| `ollama serve` | Start Ollama server |
| `ollama ps` | List running models |
| `ollama list` | List installed models |

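Once `ollama serve` is running, you can confirm the server is reachable via the version endpoint (also listed under Monitoring and Debugging below). A minimal Python sketch, assuming the default host `127.0.0.1:11434` and the `requests` package:

```python
import requests

# Query the local Ollama server's version endpoint to confirm it is up.
# Assumes the default host/port; adjust if OLLAMA_HOST points elsewhere.
resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print("Ollama server version:", resp.json().get("version"))
```
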
Model Management

| Command | Description |
| --- | --- |
| `ollama pull llama3.1` | Download Llama 3.1 model |
| `ollama pull mistral` | Download Mistral model |
| `ollama pull codellama` | Download CodeLlama model |
| `ollama pull gemma:7b` | Download specific model size |
| `ollama show llama3.1` | Show model information |
| `ollama rm mistral` | Remove model |

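Models can also be pulled programmatically through the REST API's `/api/pull` endpoint. A hedged Python sketch; the request field names follow the same pattern as `/api/show` below and may differ between Ollama versions:

```python
import requests

# Ask the local Ollama server to download a model via the REST API.
# "stream": False requests a single final status object instead of a
# stream of progress updates; field names may vary across versions.
resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"name": "mistral", "stream": False},
    timeout=600,  # model downloads can take a while
)
resp.raise_for_status()
print(resp.json())  # typically a status object such as {"status": "success"}
```
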
Popular Models

General-Purpose Models

| Command | Description |
| --- | --- |
| `ollama pull llama3.1:8b` | Llama 3.1 8B parameters |
| `ollama pull llama3.1:70b` | Llama 3.1 70B parameters |
| `ollama pull mistral:7b` | Mistral 7B model |
| `ollama pull mixtral:8x7b` | Mixtral 8x7B mixture of experts |
| `ollama pull gemma:7b` | Google Gemma 7B |
| `ollama pull phi3:mini` | Microsoft Phi-3 Mini |

Code-Specialized Models

| Command | Description |
| --- | --- |
| `ollama pull codellama:7b` | CodeLlama 7B for coding |
| `ollama pull codellama:13b` | CodeLlama 13B for coding |
| `ollama pull codegemma:7b` | CodeGemma for code generation |
| `ollama pull deepseek-coder:6.7b` | DeepSeek Coder model |
| `ollama pull starcoder2:7b` | StarCoder2 for code |

Specialized Models

| Command | Description |
| --- | --- |
| `ollama pull llava:7b` | LLaVA multimodal model |
| `ollama pull nomic-embed-text` | Text embedding model |
| `ollama pull all-minilm` | Sentence embedding model |
| `ollama pull mxbai-embed-large` | Large embedding model |

Running Models

| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start interactive chat with Llama 3.1 |
| `ollama run mistral "Hello, how are you?"` | Single prompt to Mistral |
| `ollama run codellama "Write a Python function"` | Code generation with CodeLlama |
| `ollama run llava "Describe this image" --image photo.jpg` | Multimodal with image |

Chat Interface

| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start interactive chat |
| `/bye` | Exit chat session |
| `/clear` | Clear chat history |
| `/save chat.txt` | Save chat to file |
| `/load chat.txt` | Load chat from file |
| `/multiline` | Enable multiline input |

API Usage

REST API

| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello"}'` | Generate text via API |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}]}'` | Chat via API |
| `curl http://localhost:11434/api/tags` | List models via API |
| `curl http://localhost:11434/api/show -d '{"name":"llama3.1"}'` | Show model info via API |

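The `/api/chat` endpoint accepts the full message history, so multi-turn conversations work by resending previous messages. A minimal Python sketch, assuming the default port, `stream` disabled, the `requests` package, and that the assistant reply arrives under `message.content`:

```python
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def chat(messages, model="llama3.1"):
    # Send the accumulated conversation; Ollama returns the next assistant message.
    resp = requests.post(
        OLLAMA_CHAT_URL,
        json={"model": model, "messages": messages, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Usage: keep appending messages to preserve context across turns.
history = [{"role": "user", "content": "Give me a one-line definition of Docker."}]
reply = chat(history)
history.append({"role": "assistant", "content": reply})
history.append({"role": "user", "content": "Now compare it to a virtual machine."})
print(chat(history))
```
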
Streaming Responses

| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello","stream":true}'` | Stream response |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'` | Stream chat |

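With streaming enabled, the server sends one JSON object per line until a final object with `"done": true`. A sketch of consuming that stream from Python, assuming `requests` and the default endpoint:

```python
import json
import requests

# Stream a generation chunk by chunk; each line of the response body is a JSON object.
payload = {"model": "llama3.1", "prompt": "Explain Docker containers", "stream": True}
with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()
```
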
Model Configuration

Temperature and Parameters

| Command | Description |
| --- | --- |
| `ollama run llama3.1 --temperature 0.7` | Set temperature |
| `ollama run llama3.1 --top-p 0.9` | Set top-p sampling |
| `ollama run llama3.1 --top-k 40` | Set top-k sampling |
| `ollama run llama3.1 --repeat-penalty 1.1` | Set repeat penalty |
| `ollama run llama3.1 --seed 42` | Set random seed |

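Depending on the Ollama version, sampling settings can also be supplied per request through the REST API's `options` object, or baked into a Modelfile as shown further below. A hedged Python sketch of the per-request approach:

```python
import requests

# Pass sampling parameters in the "options" object of a generate request.
# The option names mirror the Modelfile PARAMETER keys (temperature, top_p, ...).
payload = {
    "model": "llama3.1",
    "prompt": "Summarize what a context window is.",
    "stream": False,
    "options": {
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40,
        "repeat_penalty": 1.1,
        "seed": 42,  # fixed seed for more reproducible output
    },
}
resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()
print(resp.json()["response"])
```
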
Context and Memory

| Command | Description |
| --- | --- |
| `ollama run llama3.1 --ctx-size 4096` | Set context window size |
| `ollama run llama3.1 --batch-size 512` | Set batch size |
| `ollama run llama3.1 --threads 8` | Set number of threads |

Custom Models

Creating Modelfiles

| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile` | Create custom model |
| `ollama create mymodel -f Modelfile --quantize q4_0` | Create with quantization |

Modelfile Examples
```dockerfile
# Basic Modelfile
FROM llama3.1
PARAMETER temperature 0.8
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
```

```dockerfile
# Advanced Modelfile
FROM codellama:7b
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert programmer. Always provide:
1. Clean, well-commented code
2. Explanation of the solution
3. Best practices and optimizations"""
```

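After saving either Modelfile, build the custom model with `ollama create mymodel -f Modelfile` and start chatting with `ollama run mymodel`.
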
Integration Examples
Python Integration
```python
import requests

def chat_with_ollama(prompt, model="llama3.1"):
    # Send a single non-streaming generate request and return the response text.
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["response"]

# Usage
result = chat_with_ollama("Explain quantum computing")
print(result)
```

Integration von JavaScript
```javascript async function chatWithOllama(prompt, model = "llama3.1") { const response = await fetch("http://localhost:11434/api/generate", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ model: model, prompt: prompt, stream: false }) }); const data = await response.json(); return data.response; }
// Usage chatWithOllama("Write a JavaScript function").then(console.log); ```_
Bash Integration
```bash
#!/bin/bash
# Query the local Ollama API and extract the response text with jq.
ollama_chat() {
  local prompt="$1"
  local model="${2:-llama3.1}"
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":false}" \
    | jq -r '.response'
}

# Usage
ollama_chat "Explain Docker containers"
```

Performance Optimization

| Command | Description |
| --- | --- |
| `ollama run llama3.1 --gpu-layers 32` | Use GPU acceleration |
| `ollama run llama3.1 --memory-limit 8GB` | Set memory limit |
| `ollama run llama3.1 --cpu-threads 8` | Set CPU threads |
| `ollama run llama3.1 --batch-size 1024` | Optimize batch size |

Environment Variables

| Variable | Description |
| --- | --- |
| `OLLAMA_HOST` | Set server host (default: 127.0.0.1:11434) |
| `OLLAMA_MODELS` | Set models directory |
| `OLLAMA_NUM_PARALLEL` | Number of parallel requests |
| `OLLAMA_MAX_LOADED_MODELS` | Max models in memory |
| `OLLAMA_FLASH_ATTENTION` | Enable flash attention |
| `OLLAMA_GPU_OVERHEAD` | GPU memory overhead |

Docker Usage

| Command | Description |
| --- | --- |
| `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama` | Run Ollama in Docker |
| `docker exec -it ollama ollama run llama3.1` | Run model in container |
| `docker exec -it ollama ollama pull mistral` | Pull model in container |

Docker Compose

```yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434

volumes:
  ollama:
```

Monitoring and Debugging

| Command | Description |
| --- | --- |
| `ollama logs` | View Ollama logs |
| `ollama ps` | Show running models and memory usage |
| `curl http://localhost:11434/api/version` | Check API version |
| `curl http://localhost:11434/api/tags` | List available models |

Model Quantization

| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile --quantize q4_0` | 4-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q5_0` | 5-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q8_0` | 8-bit quantization |
| `ollama create mymodel -f Modelfile --quantize f16` | 16-bit float |

Embedding Models

| Command | Description |
| --- | --- |
| `ollama pull nomic-embed-text` | Pull text embedding model |
| `curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Hello world"}'` | Generate embeddings |

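Embeddings come back as a plain vector of floats, which can be compared with cosine similarity for semantic search. A small Python sketch, assuming `nomic-embed-text` has been pulled and the server uses the default port:

```python
import math
import requests

def embed(text, model="nomic-embed-text"):
    # Request an embedding vector for the given text from the local server.
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Usage: higher scores indicate more semantically similar texts.
v1 = embed("How do I run a model locally?")
v2 = embed("Running an LLM on my own machine")
print(round(cosine_similarity(v1, v2), 3))
```
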
Troubleshooting

| Command | Description |
| --- | --- |
| `ollama --help` | Show help information |
| `ollama serve --help` | Show server options |
| `ps aux \| grep ollama` | Check if Ollama is running |
| `lsof -i :11434` | Check port usage |
| `ollama rm --all` | Remove all models |

Best Practices
- Choose model size based on available RAM (7B ≈ 4 GB, 13B ≈ 8 GB, 70B ≈ 40 GB)
- Use GPU acceleration when available for better performance
- Implement proper error handling in API integrations
- Monitor memory usage when running multiple models
- Use quantized models in resource-constrained environments
- Cache frequently used models locally
- Set context sizes appropriate to your use case
- Use streaming for long responses to improve the user experience
- Implement rate limiting for production API usage
- Update models regularly for improved performance and capabilities
Common Use Cases
Code Generation
```bash
ollama run codellama "Create a REST API in Python using FastAPI"
```

Text Analysis

```bash
ollama run llama3.1 "Analyze the sentiment of this text: 'I love this product!'"
```

Creative Writing

```bash
ollama run mistral "Write a short story about time travel"
```

Data Processing

```bash
ollama run llama3.1 "Convert this JSON to CSV format: {...}"
```