Ollama
Ollama is a tool for running large language models locally on your machine, providing privacy, control, and offline access to AI models like Llama, Mistral, and CodeLlama.
Installation & Setup
| Command | Description |
|---------|-------------|
| `curl -fsSL https://ollama.ai/install.sh \| sh` | Install Ollama on Linux via the official install script |
| `brew install ollama` | Install via Homebrew (macOS) |
| `ollama --version` | Check the installed version |
| `ollama serve` | Start the Ollama server |
| `ollama ps` | List running models |
| `ollama list` | List installed models |
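Once `ollama serve` is running, it is worth confirming the server is reachable before anything else. A minimal sketch using only the Python standard library, assuming the default local address:

```python
import json
import urllib.request

# Query the version endpoint to confirm the local server is up.
# Assumes the default address http://localhost:11434.
try:
    with urllib.request.urlopen("http://localhost:11434/api/version", timeout=5) as resp:
        info = json.load(resp)
    print("Ollama is running, version:", info.get("version"))
except OSError as exc:
    print("Ollama server not reachable:", exc)
```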
Model Management
| Command | Description |
|---------|-------------|
| `ollama pull llama3.1` | Download the Llama 3.1 model |
| `ollama pull mistral` | Download the Mistral model |
| `ollama pull codellama` | Download the CodeLlama model |
| `ollama pull gemma:7b` | Download a specific model size (tag) |
| `ollama show llama3.1` | Show model information |
| `ollama rm mistral` | Remove a model |
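The same management operations are available over the REST API (`/api/tags` to list, `/api/pull` to download), which is convenient for scripting. A hedged sketch using the `requests` package; the `ensure_model` helper name is ours:

```python
import requests

BASE = "http://localhost:11434"

def ensure_model(name):
    """Pull a model only if it is not already installed."""
    # Names from /api/tags include the tag, e.g. "mistral:7b" or "llama3.1:latest".
    installed = {m["name"] for m in requests.get(f"{BASE}/api/tags", timeout=10).json()["models"]}
    if name in installed:
        return
    # "stream": False makes /api/pull return a single final status object.
    status = requests.post(f"{BASE}/api/pull", json={"name": name, "stream": False}, timeout=None).json()
    print(name, "->", status.get("status"))

ensure_model("mistral:7b")
```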
Popular Models
General Purpose Models
| Command | Description |
|---------|-------------|
| `ollama pull llama3.1:8b` | Llama 3.1, 8B parameters |
| `ollama pull llama3.1:70b` | Llama 3.1, 70B parameters |
| `ollama pull mistral:7b` | Mistral 7B |
| `ollama pull mixtral:8x7b` | Mixtral 8x7B mixture of experts |
| `ollama pull gemma:7b` | Google Gemma 7B |
| `ollama pull phi3:mini` | Microsoft Phi-3 Mini |
Code-Specialized Models
| Command | Description |
|---------|-------------|
| `ollama pull codellama:7b` | CodeLlama 7B for coding |
| `ollama pull codellama:13b` | CodeLlama 13B for coding |
| `ollama pull codegemma:7b` | CodeGemma for code generation |
| `ollama pull deepseek-coder:6.7b` | DeepSeek Coder model |
| `ollama pull starcoder2:7b` | StarCoder2 for code |
Specialized Models
| Command | Description |
|---------|-------------|
| `ollama pull llava:7b` | LLaVA multimodal (vision) model |
| `ollama pull nomic-embed-text` | Text embedding model |
| `ollama pull all-minilm` | Sentence embedding model |
| `ollama pull mxbai-embed-large` | Large embedding model |
Running Models
| Command | Description |
|---------|-------------|
| `ollama run llama3.1` | Start an interactive chat with Llama 3.1 |
| `ollama run mistral "Hello, how are you?"` | Send a single prompt to Mistral |
| `ollama run codellama "Write a Python function"` | Code generation with CodeLlama |
| `ollama run llava "Describe this image: ./photo.jpg"` | Multimodal prompt; for vision models, include the image path inside the prompt |
Chat Interface
| Command | Description |
|---------|-------------|
| `ollama run llama3.1` | Start an interactive chat |
| `/bye` | Exit the chat session |
| `/clear` | Clear the session context |
| `/save mychat` | Save the current session as a model named `mychat` |
| `/load mychat` | Load a saved session or another model |
| `/set parameter temperature 0.7` | Change a model parameter for the session |
| `"""` | Wrap a message in triple quotes for multiline input |
| `/?` | List available commands |
API Usage
REST API
| Command | Description |
|---------|-------------|
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello"}'` | Generate text via the API |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}]}'` | Chat via the API |
| `curl http://localhost:11434/api/tags` | List models via the API |
| `curl http://localhost:11434/api/show -d '{"name":"llama3.1"}'` | Show model info via the API |
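Unlike `/api/generate`, the `/api/chat` endpoint accepts the full message history, so multi-turn context is preserved by resending prior messages. A minimal sketch with `requests`, assuming a local `llama3.1` model:

```python
import requests

messages = [{"role": "user", "content": "Name one use of Docker."}]

# First turn: send the history, get back an assistant message.
reply = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.1", "messages": messages, "stream": False},
    timeout=120,
).json()["message"]
messages.append(reply)  # keep the assistant's answer in the history

# Follow-up turn that relies on the earlier context.
messages.append({"role": "user", "content": "Give a second example."})
reply = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.1", "messages": messages, "stream": False},
    timeout=120,
).json()["message"]
print(reply["content"])
```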
Streaming Responses
Streaming is the default for the REST API; setting `"stream": true` just makes it explicit. Each line of the response is a separate JSON object.

| Command | Description |
|---------|-------------|
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello","stream":true}'` | Stream a generate response |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'` | Stream a chat response |
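A streaming client reads the response line by line and concatenates the `response` fragments until `done` is true. A sketch with `requests`, assuming a local `llama3.1` model:

```python
import json
import requests

payload = {"model": "llama3.1", "prompt": "Explain DNS in two sentences."}  # stream defaults to true

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # token fragment
        if chunk.get("done"):
            break
print()
```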
Model Configuration
Temperature and Parameters
`ollama run` has no dedicated sampling flags; set these parameters inside an interactive session with `/set parameter`, in a Modelfile with `PARAMETER`, or per request via the API `options` field.

| Command | Description |
|---------|-------------|
| `/set parameter temperature 0.7` | Set temperature |
| `/set parameter top_p 0.9` | Set top-p (nucleus) sampling |
| `/set parameter top_k 40` | Set top-k sampling |
| `/set parameter repeat_penalty 1.1` | Set repeat penalty |
| `/set parameter seed 42` | Set random seed |
Context and Memory
| Setting | Description |
|---------|-------------|
| `/set parameter num_ctx 4096` | Set context window size for the current session |
| `PARAMETER num_ctx 4096` | Bake the context size into a custom model (Modelfile) |
| `"options": {"num_batch": 512}` | Batch size, passed as an API request option |
| `"options": {"num_thread": 8}` | Number of CPU threads, passed as an API request option or Modelfile `PARAMETER` |
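The parameters from both tables above can be passed per request through the API `options` field, which avoids editing a Modelfile. A sketch, assuming a local `llama3.1` model:

```python
import requests

payload = {
    "model": "llama3.1",
    "prompt": "Summarize the HTTP request/response cycle.",
    "stream": False,
    "options": {
        "temperature": 0.7,   # sampling temperature
        "top_p": 0.9,         # nucleus sampling
        "seed": 42,           # fixed seed for repeatable sampling
        "num_ctx": 4096,      # context window size
    },
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
print(resp.json()["response"])
```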
Custom Models
Creating Modelfiles
| Command | Description |
|---------|-------------|
| `ollama create mymodel -f Modelfile` | Create a custom model from a Modelfile |
| `ollama create mymodel -f Modelfile --quantize q4_0` | Create with quantization |
Modelfile Examples
```
# Basic Modelfile
FROM llama3.1
PARAMETER temperature 0.8
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
```

```
# Advanced Modelfile
FROM codellama:7b
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert programmer. Always provide:
1. Clean, well-commented code
2. Explanation of the solution
3. Best practices and optimizations"""
```
Integration Examples
Python Integration
```python
import requests

def chat_with_ollama(prompt, model="llama3.1"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["response"]

# Usage
result = chat_with_ollama("Explain quantum computing")
print(result)
```
JavaScript Integration
```javascript
async function chatWithOllama(prompt, model = "llama3.1") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model,
      prompt: prompt,
      stream: false
    })
  });
  const data = await response.json();
  return data.response;
}

// Usage
chatWithOllama("Write a JavaScript function").then(console.log);
```
Bash Integration
```bash
#!/bin/bash
ollama_chat() {
  local prompt="$1"
  local model="${2:-llama3.1}"
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$model\",\"prompt\":\"$prompt\",\"stream\":false}" \
    | jq -r '.response'
}

# Usage
ollama_chat "Explain Docker containers"
```
Performance Tuning
GPU offload, thread count, and batch size are model parameters rather than `ollama run` flags; set them with `/set parameter` in a session, a Modelfile `PARAMETER` line, or the API `options` field. There is no per-run memory-limit flag; resident memory is governed by the environment variables below (for example `OLLAMA_MAX_LOADED_MODELS`).

| Setting | Description |
|---------|-------------|
| `/set parameter num_gpu 32` | Number of layers to offload to the GPU |
| `"options": {"num_thread": 8}` | CPU threads, passed as an API request option or Modelfile `PARAMETER` |
| `"options": {"num_batch": 1024}` | Batch size, passed as an API request option |
Environment Variables
| Variable | Description |
|----------|-------------|
| `OLLAMA_HOST` | Server host and port (default: 127.0.0.1:11434) |
| `OLLAMA_MODELS` | Directory where models are stored |
| `OLLAMA_NUM_PARALLEL` | Number of parallel requests |
| `OLLAMA_MAX_LOADED_MODELS` | Maximum number of models kept in memory |
| `OLLAMA_FLASH_ATTENTION` | Enable flash attention |
| `OLLAMA_GPU_OVERHEAD` | Reserved GPU memory overhead |
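Clients can honor the same `OLLAMA_HOST` convention so the same code works against a local or remote server without edits. A small sketch; the `base_url` helper is ours:

```python
import os
import requests

def base_url():
    # Mirror the server's OLLAMA_HOST convention; fall back to the default address.
    host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    return host if host.startswith("http") else f"http://{host}"

print(requests.get(f"{base_url()}/api/version", timeout=5).json())
```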
Docker Usage
| Command | Description |
|---------|-------------|
| `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama` | Run Ollama in Docker |
| `docker exec -it ollama ollama run llama3.1` | Run a model inside the container |
| `docker exec -it ollama ollama pull mistral` | Pull a model inside the container |
Docker Compose
```yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
volumes:
  ollama:
```
Monitoring & Debugging
| Command | Description |
|---------|-------------|
| `journalctl -u ollama -f` | Follow server logs on systemd installs (on macOS, see `~/.ollama/logs/server.log`) |
| `ollama ps` | Show running models and memory usage |
| `curl http://localhost:11434/api/version` | Check API version |
| `curl http://localhost:11434/api/tags` | List available models |
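For scripted monitoring, recent Ollama builds also expose `GET /api/ps`, which reports loaded models and their memory use; treat the field names below as a sketch if your version differs:

```python
import requests

# List currently loaded models with their total size and VRAM footprint.
resp = requests.get("http://localhost:11434/api/ps", timeout=5).json()
for m in resp.get("models", []):
    size_gb = m.get("size", 0) / 1e9
    vram_gb = m.get("size_vram", 0) / 1e9
    print(f"{m.get('name')}: {size_gb:.1f} GB total, {vram_gb:.1f} GB in VRAM, expires {m.get('expires_at')}")
```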
Model Quantization
`ollama create --quantize` expects a non-quantized (f16/f32) base model referenced in the Modelfile.

| Command | Description |
|---------|-------------|
| `ollama create mymodel -f Modelfile --quantize q4_0` | 4-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q5_0` | 5-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q8_0` | 8-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q4_K_M` | 4-bit K-quant (good quality/size trade-off) |
Embedding Models
| Command | Description |
|---------|-------------|
| `ollama pull nomic-embed-text` | Pull a text embedding model |
| `curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Hello world"}'` | Generate embeddings |
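Embeddings come back as a plain vector of floats, so similarity search is just a normalized dot product. A sketch using `requests` and the `nomic-embed-text` model; the `embed` and `cosine` helpers are ours:

```python
import math
import requests

def embed(text, model="nomic-embed-text"):
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    return resp.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = ["Ollama runs models locally.", "Bananas are rich in potassium."]
query = embed("How do I run an LLM on my laptop?")
for doc in docs:
    print(f"{cosine(query, embed(doc)):.3f}  {doc}")
```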
Troubleshooting
| Command | Description |
|---------|-------------|
| `ollama --help` | Show help information |
| `ollama serve --help` | Show server options |
| `ps aux \| grep ollama` | Check whether Ollama is running |
| `lsof -i :11434` | Check what is using the default port |
| `ollama list \| awk 'NR>1 {print $1}' \| xargs -n1 ollama rm` | Remove all installed models (there is no built-in `--all` flag) |
Best Practices
- Choose model size based on available RAM (7B ≈ 4 GB, 13B ≈ 8 GB, 70B ≈ 40 GB for 4-bit quantized weights)
- Use GPU acceleration when available for better performance
- Implement proper error handling and timeouts in API integrations (see the sketch after this list)
- Monitor memory usage when running multiple models
- Use quantized models in resource-constrained environments
- Cache frequently used models locally
- Set a context size appropriate to your use case
- Use streaming for long responses to improve perceived latency
- Implement rate limiting for production API usage
- Update models regularly to pick up performance and capability improvements
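As a concrete example of the error-handling point above, a hedged sketch that adds a timeout and a simple retry around `/api/generate`; the retry policy is illustrative, not an Ollama requirement:

```python
import time
import requests

def generate(prompt, model="llama3.1", retries=3, timeout=120):
    for attempt in range(1, retries + 1):
        try:
            resp = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=timeout,
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying

print(generate("Explain least privilege in one sentence."))
```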
Common Use Cases
Code Generation
ollama run codellama "Create a REST API in Python using FastAPI"
Text Analysis
ollama run llama3.1 "Analyze the sentiment of this text: 'I love this product!'"
Creative Writing
ollama run mistral "Write a short story about time travel"
Data Processing
ollama run llama3.1 "Convert this JSON to CSV format: {...}"