Ollama is a tool for running large language models locally on your machine, providing privacy, control, and offline access to AI models like Llama, Mistral, and CodeLlama.
## Installation & Setup

| Command | Description |
|---|---|
| `curl -fsSL https://ollama.com/install.sh \| sh` | Install via official script (Linux) |
| `brew install ollama` | Install via Homebrew (macOS) |
| `ollama --version` | Check installed version |
| `ollama serve` | Start Ollama server |
| `ollama ps` | List running models |
| `ollama list` | List installed models |
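A minimal end-to-end quick start, assuming a Linux or macOS shell (model size is roughly 4.7 GB for the default Llama 3.1 8B tag):

```bash
# Install, start the server, and send a first prompt
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &            # skip if the installer already registered a background service
ollama pull llama3.1      # download the default Llama 3.1 tag
ollama run llama3.1 "Say hello in one sentence"
```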
## Model Management

| Command | Description |
|---|---|
| `ollama pull llama3.1` | Download Llama 3.1 model |
| `ollama pull mistral` | Download Mistral model |
| `ollama pull codellama` | Download CodeLlama model |
| `ollama pull gemma:7b` | Download a specific model size (tag) |
| `ollama show llama3.1` | Show model information |
| `ollama rm mistral` | Remove model |
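To refresh every installed model in one pass, a small loop over `ollama list` works; a sketch, assuming the model name is the first column of `ollama list` output:

```bash
# Re-pull each installed model to pick up updated weights
ollama list | tail -n +2 | awk '{print $1}' | while read -r model; do
  ollama pull "$model"
done
```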
## Popular Models

### General Purpose Models

| Command | Description |
|---|---|
| `ollama pull llama3.1:8b` | Llama 3.1, 8B parameters |
| `ollama pull llama3.1:70b` | Llama 3.1, 70B parameters |
| `ollama pull mistral:7b` | Mistral 7B |
| `ollama pull mixtral:8x7b` | Mixtral 8x7B mixture of experts |
| `ollama pull gemma:7b` | Google Gemma 7B |
| `ollama pull phi3:mini` | Microsoft Phi-3 Mini |
### Code-Specialized Models

| Command | Description |
|---|---|
| `ollama pull codellama:7b` | CodeLlama 7B for coding |
| `ollama pull codellama:13b` | CodeLlama 13B for coding |
| `ollama pull codegemma:7b` | CodeGemma for code generation |
| `ollama pull deepseek-coder:6.7b` | DeepSeek Coder model |
| `ollama pull starcoder2:7b` | StarCoder2 for code |
### Specialized Models

| Command | Description |
|---|---|
| `ollama pull llava:7b` | LLaVA multimodal (vision + text) model |
| `ollama pull nomic-embed-text` | Text embedding model |
| `ollama pull all-minilm` | Sentence embedding model |
| `ollama pull mxbai-embed-large` | Large embedding model |
## Running Models

| Command | Description |
|---|---|
| `ollama run llama3.1` | Start interactive chat with Llama 3.1 |
| `ollama run mistral "Hello, how are you?"` | Single prompt to Mistral |
| `ollama run codellama "Write a Python function"` | Code generation with CodeLlama |
| `ollama run llava "Describe this image: ./photo.jpg"` | Multimodal prompt (image path goes in the prompt; there is no `--image` flag) |
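The CLI also composes with the shell and supports a few per-invocation flags; a sketch, assuming a recent Ollama release:

```bash
# Embed file contents in the prompt, inspect performance, or request structured output
ollama run llama3.1 "Summarize this file: $(cat report.txt)"
ollama run llama3.1 --verbose "Hello"                  # print timing/token stats after the reply
ollama run llama3.1 --format json "List three colors"  # constrain the reply to valid JSON
```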
## Chat Interface

| Command | Description |
|---|---|
| `ollama run llama3.1` | Start interactive chat |
| `/bye` | Exit chat session |
| `/clear` | Clear session context |
| `/save mymodel` | Save the current session as a model |
| `/load mymodel` | Load a model or saved session |
| `"""` | Begin/end multiline input |
## API Usage

### REST API

| Command | Description |
|---|---|
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello"}'` | Generate text via API |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}]}'` | Chat via API |
| `curl http://localhost:11434/api/tags` | List models via API |
| `curl http://localhost:11434/api/show -d '{"name":"llama3.1"}'` | Show model info via API |
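Sampling parameters go in the request's `options` object rather than CLI flags; a sketch using `jq` to extract the reply:

```bash
# Generate with per-request options (option names match Modelfile PARAMETER keys)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {"temperature": 0.7, "num_ctx": 4096}
}' | jq -r '.response'
```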
### Streaming Responses

| Command | Description |
|---|---|
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello","stream":true}'` | Stream response |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'` | Stream chat |
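Streamed replies arrive as newline-delimited JSON, one object per fragment ending with a final `"done": true` object; a sketch that reassembles the text with `jq`:

```bash
# -N disables curl buffering; jq -j prints each fragment without adding newlines
curl -sN http://localhost:11434/api/generate \
  -d '{"model":"llama3.1","prompt":"Tell a short joke"}' \
  | jq -rj '.response'
echo
```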
## Model Configuration

### Temperature and Parameters

`ollama run` has no per-request sampling flags; set parameters inside an interactive session with `/set parameter`, in a Modelfile, or via the API `options` field.

| Command | Description |
|---|---|
| `/set parameter temperature 0.7` | Set temperature (inside `ollama run`) |
| `/set parameter top_p 0.9` | Set top-p sampling |
| `/set parameter top_k 40` | Set top-k sampling |
| `/set parameter repeat_penalty 1.1` | Set repeat penalty |
| `/set parameter seed 42` | Set random seed |

### Context and Memory

| Setting | Description |
|---|---|
| `/set parameter num_ctx 4096` | Set context window size (inside `ollama run`) |
| `"options": {"num_batch": 512}` | Set batch size (API request options) |
| `"options": {"num_thread": 8}` | Set CPU thread count (API request options) |
## Custom Models

### Creating Modelfiles

| Command | Description |
|---|---|
| `ollama create mymodel -f Modelfile` | Create custom model |
| `ollama create mymodel -f Modelfile --quantize q4_0` | Create with quantization |
### Modelfile Examples

```
# Basic Modelfile
FROM llama3.1
PARAMETER temperature 0.8
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
```

```
# Advanced Modelfile
FROM codellama:7b
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert programmer. Always provide:
1. Clean, well-commented code
2. Explanation of the solution
3. Best practices and optimizations"""
```
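Building and running a model from the advanced Modelfile above might look like this (the name `codehelper` is illustrative):

```bash
# Create the custom model from the Modelfile in the current directory, then query it
ollama create codehelper -f ./Modelfile
ollama run codehelper "Refactor this function for readability."
ollama show codehelper   # confirm the SYSTEM prompt and parameters took effect
```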
## Integration Examples

### Python Integration

```python
import requests

def chat_with_ollama(prompt, model="llama3.1"):
    """Send a single prompt to the local Ollama server and return the response text."""
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False,
    }
    response = requests.post(url, json=data, timeout=120)
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["response"]

# Usage
result = chat_with_ollama("Explain quantum computing")
print(result)
```
### JavaScript Integration

```javascript
async function chatWithOllama(prompt, model = "llama3.1") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model,
      prompt: prompt,
      stream: false
    })
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.response;
}

// Usage
chatWithOllama("Write a JavaScript function").then(console.log);
```
### Bash Integration

```bash
#!/bin/bash
# Query the local Ollama API; jq builds the JSON safely (quotes in the
# prompt would otherwise break naive string interpolation) and parses the reply.
ollama_chat() {
    local prompt="$1"
    local model="${2:-llama3.1}"
    curl -s http://localhost:11434/api/generate \
        -d "$(jq -n --arg m "$model" --arg p "$prompt" \
              '{model: $m, prompt: $p, stream: false}')" \
        | jq -r '.response'
}

# Usage
ollama_chat "Explain Docker containers"
```
## Performance Tuning

As with sampling, there are no `ollama run` performance flags; tune via session parameters, API `options`, or server environment variables.

| Setting | Description |
|---|---|
| `/set parameter num_gpu 32` | Number of layers to offload to the GPU (inside `ollama run`) |
| `"options": {"num_thread": 8}` | Set CPU threads (API request options) |
| `"options": {"num_batch": 1024}` | Set batch size (API request options) |
| `"keep_alive": "5m"` | How long a model stays loaded after a request (API field) |
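Load behavior can also be controlled per request: an empty-prompt call with `keep_alive` preloads a model, and newer releases add an `ollama stop` command to unload one. A sketch:

```bash
# Preload llama3.1 and keep it resident for 30 minutes
curl -s http://localhost:11434/api/generate -d '{"model":"llama3.1","keep_alive":"30m"}'
ollama ps              # confirm it is loaded and whether it sits in GPU or CPU memory
ollama stop llama3.1   # unload immediately (newer Ollama releases)
```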
## Environment Variables

| Variable | Description |
|---|---|
| `OLLAMA_HOST` | Server bind address (default: 127.0.0.1:11434) |
| `OLLAMA_MODELS` | Models directory |
| `OLLAMA_NUM_PARALLEL` | Number of parallel requests |
| `OLLAMA_MAX_LOADED_MODELS` | Max models kept in memory |
| `OLLAMA_FLASH_ATTENTION` | Enable flash attention |
| `OLLAMA_GPU_OVERHEAD` | Reserved GPU memory overhead |
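These variables are read by the server process, so set them when launching `ollama serve` (systemd installs use `systemctl edit ollama` instead). For example:

```bash
# Listen on all interfaces and allow more concurrency (values illustrative)
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```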
## Docker Usage

| Command | Description |
|---|---|
| `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama` | Run Ollama in Docker |
| `docker exec -it ollama ollama run llama3.1` | Run model in container |
| `docker exec -it ollama ollama pull mistral` | Pull model in container |
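For NVIDIA GPU acceleration, add `--gpus=all` (requires the NVIDIA Container Toolkit on the host):

```bash
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```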
### Docker Compose

```yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
volumes:
  ollama:
```
## Monitoring & Debugging

| Command | Description |
|---|---|
| `journalctl -u ollama` | View server logs (Linux systemd; on macOS see `~/.ollama/logs/server.log`) |
| `ollama ps` | Show running models and memory usage |
| `curl http://localhost:11434/api/version` | Check API version |
| `curl http://localhost:11434/api/tags` | List available models |
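A minimal liveness check for scripts or container healthchecks, assuming the default port:

```bash
# Exit non-zero (and print a message) if the API does not answer
curl -sf --max-time 5 http://localhost:11434/api/version >/dev/null \
  || { echo "Ollama server is not responding" >&2; exit 1; }
```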
## Model Quantization

`--quantize` requires an unquantized base model (e.g. an f16 GGUF file or an `-fp16` tag); f16 itself is a source format, not a quantization target.

| Command | Description |
|---|---|
| `ollama create mymodel -f Modelfile --quantize q4_0` | 4-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q5_0` | 5-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q8_0` | 8-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q4_K_M` | 4-bit K-quant (good quality/size trade-off) |
## Embedding Models

| Command | Description |
|---|---|
| `ollama pull nomic-embed-text` | Pull text embedding model |
| `curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Hello world"}'` | Generate embeddings |
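A quick sanity check on the embedding output; `nomic-embed-text` should return 768-dimensional vectors:

```bash
# Generate an embedding and print its dimensionality
curl -s http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"Hello world"}' \
  | jq '.embedding | length'   # expect 768 for nomic-embed-text
```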
## Troubleshooting

| Command | Description |
|---|---|
| `ollama --help` | Show help information |
| `ollama serve --help` | Show server options |
| `ps aux \| grep ollama` | Check if Ollama is running |
| `lsof -i :11434` | Check port usage |
| `ollama rm <model>` | Remove a model (there is no `--all` flag; see the sketch below) |
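Common recovery steps on a Linux systemd install (adjust for Homebrew or Docker setups), plus a remove-all sketch since `ollama rm` only accepts explicit names:

```bash
sudo systemctl restart ollama   # restart the service
journalctl -u ollama -f         # follow the server log for errors
# Remove every installed model (assumes the name is the first column of `ollama list`)
ollama list | tail -n +2 | awk '{print $1}' | xargs -n1 ollama rm
```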
## Best Practices

- Choose model size based on available RAM (4-bit quantized: 7B ≈ 4GB, 13B ≈ 8GB, 70B ≈ 40GB)
- Use GPU acceleration when available for better performance
- Implement proper error handling in API integrations (see the sketch after this list)
- Monitor memory usage when running multiple models
- Use quantized models for resource-constrained environments
- Cache frequently used models locally
- Set appropriate context sizes for your use case
- Use streaming for long responses to improve user experience
- Implement rate limiting for production API usage
- Update models regularly (`ollama pull <model>`) for improved performance and capabilities
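A hedged sketch of the error-handling bullet above: wrap API calls so timeouts and non-2xx responses fail loudly instead of producing empty output (the function name `ollama_generate` is illustrative):

```bash
ollama_generate() {
  local prompt="$1" out
  # -f makes curl fail on HTTP errors; --max-time bounds slow generations
  out=$(curl -sf --max-time 120 http://localhost:11434/api/generate \
        -d "$(jq -n --arg p "$prompt" '{model: "llama3.1", prompt: $p, stream: false}')") \
    || { echo "ollama request failed" >&2; return 1; }
  jq -r '.response' <<<"$out"
}
```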
## Common Use Cases

### Code Generation

```bash
ollama run codellama "Create a REST API in Python using FastAPI"
```

### Text Analysis

```bash
ollama run llama3.1 "Analyze the sentiment of this text: 'I love this product!'"
```

### Creative Writing

```bash
ollama run mistral "Write a short story about time travel"
```

### Data Processing

```bash
ollama run llama3.1 "Convert this JSON to CSV format: {...}"
```