
Ollama

Ollama is a tool for running large language models locally on your machine, providing privacy, control, and offline access to AI models like Llama, Mistral, and CodeLlama.

| Command | Description |
| --- | --- |
| `curl -fsSL https://ollama.ai/install.sh \| sh` | Install Ollama on Linux (on macOS use the app download or Homebrew) |
| `brew install ollama` | Install via Homebrew (macOS) |
| `ollama --version` | Check the installed version |
| `ollama serve` | Start the Ollama server |
| `ollama ps` | List running models |
| `ollama list` | List installed models |
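Once `ollama serve` is running, the server answers HTTP requests on port 11434. A minimal sketch in Python with the `requests` package, assuming the default localhost address, that checks the server and lists installed models (the API equivalent of `ollama list`):

import requests

BASE_URL = "http://localhost:11434"

# Check that the Ollama server is reachable and report its version
version = requests.get(f"{BASE_URL}/api/version", timeout=5).json()
print("Ollama version:", version["version"])

# List installed models (equivalent to `ollama list`)
tags = requests.get(f"{BASE_URL}/api/tags", timeout=5).json()
for model in tags["models"]:
    print(model["name"], model.get("size", ""))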
| Command | Description |
| --- | --- |
| `ollama pull llama3.1` | Download the Llama 3.1 model |
| `ollama pull mistral` | Download the Mistral model |
| `ollama pull codellama` | Download the CodeLlama model |
| `ollama pull gemma:7b` | Download a specific model size |
| `ollama show llama3.1` | Show model information |
| `ollama rm mistral` | Remove a model |
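Model management is also exposed over the HTTP API. A hedged sketch of pulling and removing a model programmatically, assuming the default local server: `/api/pull` streams progress as newline-delimited JSON, and `/api/delete` takes the model name.

import json
import requests

BASE_URL = "http://localhost:11434"

# Pull a model; the server streams progress as newline-delimited JSON
with requests.post(f"{BASE_URL}/api/pull",
                   json={"name": "mistral"}, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("status"))

# Remove a model (equivalent to `ollama rm mistral`)
requests.delete(f"{BASE_URL}/api/delete", json={"name": "mistral"})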
| Command | Description |
| --- | --- |
| `ollama pull llama3.1:8b` | Llama 3.1, 8B parameters |
| `ollama pull llama3.1:70b` | Llama 3.1, 70B parameters |
| `ollama pull mistral:7b` | Mistral 7B |
| `ollama pull mixtral:8x7b` | Mixtral 8x7B mixture of experts |
| `ollama pull gemma:7b` | Google Gemma 7B |
| `ollama pull phi3:mini` | Microsoft Phi-3 Mini |
| Command | Description |
| --- | --- |
| `ollama pull codellama:7b` | CodeLlama 7B for coding |
| `ollama pull codellama:13b` | CodeLlama 13B for coding |
| `ollama pull codegemma:7b` | CodeGemma for code generation |
| `ollama pull deepseek-coder:6.7b` | DeepSeek Coder model |
| `ollama pull starcoder2:7b` | StarCoder2 for code |
| Command | Description |
| --- | --- |
| `ollama pull llava:7b` | LLaVA multimodal model |
| `ollama pull nomic-embed-text` | Text embedding model |
| `ollama pull all-minilm` | Sentence embedding model |
| `ollama pull mxbai-embed-large` | Large embedding model |
| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start an interactive chat with Llama 3.1 |
| `ollama run mistral "Hello, how are you?"` | Send a single prompt to Mistral |
| `ollama run codellama "Write a Python function"` | Code generation with CodeLlama |
| `ollama run llava "Describe this image: ./photo.jpg"` | Multimodal prompt (include the image path inside the prompt) |
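Over the API, multimodal models such as LLaVA take images as base64-encoded strings in an `images` array. A sketch assuming a local `photo.jpg` and the `llava` model already pulled:

import base64
import requests

# Read and base64-encode the image for the "images" field
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "Describe this image",
        "images": [image_b64],
        "stream": False,
    },
)
print(response.json()["response"])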
| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start an interactive chat |
| `/bye` | Exit the chat session |
| `/clear` | Clear the session context |
| `/save mychat` | Save the current session as a model |
| `/load mychat` | Load a model or saved session |
| `"""` | Begin and end a multiline message |
| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello"}'` | Generate text via the API |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}]}'` | Chat via the API |
| `curl http://localhost:11434/api/tags` | List models via the API |
| `curl http://localhost:11434/api/show -d '{"name":"llama3.1"}'` | Show model info via the API |
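The `/api/chat` endpoint keeps no server-side history; the client resends the full `messages` list on every turn. A minimal multi-turn sketch in Python, assuming `llama3.1` is installed:

import requests

URL = "http://localhost:11434/api/chat"
messages = [{"role": "user", "content": "Give me one fun fact about the moon."}]

# First turn
reply = requests.post(URL, json={"model": "llama3.1", "messages": messages,
                                 "stream": False}).json()["message"]
messages.append(reply)  # keep the assistant's answer in the history

# Follow-up turn that relies on the earlier context
messages.append({"role": "user", "content": "Explain it in one sentence."})
reply = requests.post(URL, json={"model": "llama3.1", "messages": messages,
                                 "stream": False}).json()["message"]
print(reply["content"])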
| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello","stream":true}'` | Stream a generate response |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'` | Stream a chat response |
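With `"stream": true` (the default for these endpoints), the server returns newline-delimited JSON chunks rather than one final object. A sketch that prints tokens from `/api/generate` as they arrive:

import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Write a haiku about rivers", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break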
Sampling parameters are not `ollama run` flags; set them inside an interactive session with `/set parameter`, in a Modelfile with PARAMETER, or in the API `options` field.

| Command | Description |
| --- | --- |
| `/set parameter temperature 0.7` | Set the temperature |
| `/set parameter top_p 0.9` | Set top-p sampling |
| `/set parameter top_k 40` | Set top-k sampling |
| `/set parameter repeat_penalty 1.1` | Set the repeat penalty |
| `/set parameter seed 42` | Set the random seed |
| Command | Description |
| --- | --- |
| `/set parameter num_ctx 4096` | Set the context window size |
| `/set parameter num_batch 512` | Set the batch size |
| `/set parameter num_thread 8` | Set the number of CPU threads |
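Over the API, the same parameters go in the `options` field of the request body (names match the Modelfile PARAMETER names). A sketch combining sampling and context settings:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Summarize the benefits of local LLMs in three bullet points.",
        "stream": False,
        "options": {
            "temperature": 0.7,   # sampling temperature
            "top_p": 0.9,         # nucleus sampling
            "repeat_penalty": 1.1,
            "seed": 42,           # reproducible output
            "num_ctx": 4096,      # context window size
            "num_thread": 8,      # CPU threads
        },
    },
)
print(response.json()["response"])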
| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile` | Create a custom model from a Modelfile |
| `ollama create mymodel -f Modelfile --quantize q4_0` | Create a custom model with quantization |
# Basic Modelfile
FROM llama3.1
PARAMETER temperature 0.8
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
# Advanced Modelfile
FROM codellama:7b
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert programmer. Always provide:
1. Clean, well-commented code
2. Explanation of the solution
3. Best practices and optimizations"""
import requests
import json

def chat_with_ollama(prompt, model="llama3.1"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["response"]

# Usage
result = chat_with_ollama("Explain quantum computing")
print(result)
async function chatWithOllama(prompt, model = "llama3.1") {
    const response = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            model: model,
            prompt: prompt,
            stream: false
        })
    });
    const data = await response.json();
    return data.response;
}

// Usage
chatWithOllama("Write a JavaScript function").then(console.log);
#!/bin/bash
ollama_chat() {
    local prompt="$1"
    local model="${2:-llama3.1}"
    # Build the JSON body with jq so quotes in the prompt are escaped safely
    curl -s http://localhost:11434/api/generate \
        -d "$(jq -n --arg model "$model" --arg prompt "$prompt" \
              '{model: $model, prompt: $prompt, stream: false}')" \
        | jq -r '.response'
}

# Usage
ollama_chat "Explain Docker containers"
`ollama run` has no GPU or memory flags; hardware usage is tuned through model parameters (in a Modelfile, with `/set parameter` in a session, or in the API `options` field) and the environment variables below. Memory use is determined mainly by model size and quantization.

| Parameter | Description |
| --- | --- |
| `num_gpu 32` | Number of model layers to offload to the GPU |
| `num_thread 8` | Number of CPU threads to use |
| `num_batch 1024` | Batch size for prompt processing |
| Variable | Description |
| --- | --- |
| `OLLAMA_HOST` | Set the server host (default: 127.0.0.1:11434) |
| `OLLAMA_MODELS` | Set the models directory |
| `OLLAMA_NUM_PARALLEL` | Number of parallel requests |
| `OLLAMA_MAX_LOADED_MODELS` | Maximum models kept in memory |
| `OLLAMA_FLASH_ATTENTION` | Enable flash attention |
| `OLLAMA_GPU_OVERHEAD` | GPU memory overhead to reserve |
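Client code can honor the same OLLAMA_HOST convention instead of hard-coding the address. A small sketch that reads the variable and falls back to the default:

import os
import requests

# Respect OLLAMA_HOST if set (e.g. "0.0.0.0:11434" or a remote host), else use the default
host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
if not host.startswith("http"):
    host = f"http://{host}"

print(requests.get(f"{host}/api/version", timeout=5).json())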
| Command | Description |
| --- | --- |
| `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama` | Run Ollama in Docker |
| `docker exec -it ollama ollama run llama3.1` | Run a model inside the container |
| `docker exec -it ollama ollama pull mistral` | Pull a model inside the container |
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
volumes:
  ollama:
| Command | Description |
| --- | --- |
| `journalctl -u ollama` | View server logs on Linux (there is no `ollama logs` subcommand; on macOS check ~/.ollama/logs/server.log) |
| `ollama ps` | Show running models and memory usage |
| `curl http://localhost:11434/api/version` | Check the API version |
| `curl http://localhost:11434/api/tags` | List available models |
| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile --quantize q4_0` | 4-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q5_0` | 5-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q8_0` | 8-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q4_K_M` | 4-bit K-quant (good size/quality trade-off) |

Quantizing requires a non-quantized (for example f16) base model.
| Command | Description |
| --- | --- |
| `ollama pull nomic-embed-text` | Pull a text embedding model |
| `curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Hello world"}'` | Generate embeddings |
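A sketch that requests embeddings for two sentences and compares them with cosine similarity, assuming `nomic-embed-text` is already pulled (the `/api/embeddings` endpoint takes a `prompt` field and returns an `embedding` array):

import math
import requests

def embed(text, model="nomic-embed-text"):
    # Returns the embedding vector (list of floats) for the given text
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    return resp.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

v1 = embed("Ollama runs models locally")
v2 = embed("Local LLM inference with Ollama")
print(f"similarity: {cosine(v1, v2):.3f}")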
| Command | Description |
| --- | --- |
| `ollama --help` | Show help information |
| `ollama serve --help` | Show server options |
| `ps aux \| grep ollama` | Check whether Ollama is running |
| `lsof -i :11434` | Check what is using the port |
| `ollama list \| awk 'NR>1 {print $1}' \| xargs -n1 ollama rm` | Remove all installed models (there is no `--all` flag) |
  • Choose model size based on available RAM (7B ≈ 4GB, 13B ≈ 8GB, 70B ≈ 40GB)
  • Use GPU acceleration when available for better performance
  • Implement proper error handling in API integrations (see the sketch after this list)
  • Monitor memory usage when running multiple models
  • Use quantized models for resource-constrained environments
  • Cache frequently used models locally
  • Set appropriate context sizes for your use case
  • Use streaming for long responses to improve user experience
  • Implement rate limiting for production API usage
  • Regular model updates for improved performance and capabilities
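For the error-handling point above, a hedged sketch of a wrapper that adds a timeout, surfaces HTTP errors, and retries once on connection failure (the retry policy is illustrative, not an Ollama requirement):

import time
import requests

def generate(prompt, model="llama3.1", retries=1, timeout=120):
    """Call /api/generate with basic error handling and a simple retry."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    for attempt in range(retries + 1):
        try:
            resp = requests.post("http://localhost:11434/api/generate",
                                 json=payload, timeout=timeout)
            resp.raise_for_status()  # surface 4xx/5xx errors
            return resp.json()["response"]
        except requests.ConnectionError:
            if attempt == retries:
                raise
            time.sleep(2)  # wait before retrying, e.g. while the server finishes starting

print(generate("Explain Docker containers in two sentences."))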
ollama run codellama "Create a REST API in Python using FastAPI"
ollama run llama3.1 "Analyze the sentiment of this text: 'I love this product!'"
ollama run mistral "Write a short story about time travel"
ollama run llama3.1 "Convert this JSON to CSV format: {...}"