
Ollama

Ollama is a tool for running large language models locally on your machine, providing privacy, control, and offline access to AI models like Llama, Mistral, and CodeLlama.

Installation & Setup

| Command | Description |
| --- | --- |
| `curl -fsSL https://ollama.ai/install.sh \| sh` | Install Ollama via the official install script |
| `brew install ollama` | Install via Homebrew (macOS) |
| `ollama --version` | Check installed version |
| `ollama serve` | Start the Ollama server |
| `ollama ps` | List running models |
| `ollama list` | List installed models |
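
Put together, a typical first session looks like the sketch below; it assumes a Linux or macOS shell and the default port 11434.

# Install, start the server, then download and chat with a model
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve &            # skip this if Ollama already runs as a background service
ollama pull llama3.1      # download the model weights
ollama run llama3.1       # open an interactive chat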

Model Management

| Command | Description |
| --- | --- |
| `ollama pull llama3.1` | Download Llama 3.1 model |
| `ollama pull mistral` | Download Mistral model |
| `ollama pull codellama` | Download CodeLlama model |
| `ollama pull gemma:7b` | Download specific model size |
| `ollama show llama3.1` | Show model information |
| `ollama rm mistral` | Remove model |

General Purpose Models

| Command | Description |
| --- | --- |
| `ollama pull llama3.1:8b` | Llama 3.1 8B parameters |
| `ollama pull llama3.1:70b` | Llama 3.1 70B parameters |
| `ollama pull mistral:7b` | Mistral 7B model |
| `ollama pull mixtral:8x7b` | Mixtral 8x7B mixture of experts |
| `ollama pull gemma:7b` | Google Gemma 7B |
| `ollama pull phi3:mini` | Microsoft Phi-3 Mini |

Code-Specialized Models

| Command | Description |
| --- | --- |
| `ollama pull codellama:7b` | CodeLlama 7B for coding |
| `ollama pull codellama:13b` | CodeLlama 13B for coding |
| `ollama pull codegemma:7b` | CodeGemma for code generation |
| `ollama pull deepseek-coder:6.7b` | DeepSeek Coder model |
| `ollama pull starcoder2:7b` | StarCoder2 for code |

Specialized Models

| Command | Description |
| --- | --- |
| `ollama pull llava:7b` | LLaVA multimodal model |
| `ollama pull nomic-embed-text` | Text embedding model |
| `ollama pull all-minilm` | Sentence embedding model |
| `ollama pull mxbai-embed-large` | Large embedding model |

Running Models

| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start interactive chat with Llama 3.1 |
| `ollama run mistral "Hello, how are you?"` | Single prompt to Mistral |
| `ollama run codellama "Write a Python function"` | Code generation with CodeLlama |
| `ollama run llava "Describe this image: ./photo.jpg"` | Multimodal prompt; pass the image path inside the prompt (there is no `--image` flag) |
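
`ollama run` also accepts piped standard input, which makes one-shot, scriptable prompts easy. A small sketch; the file name is a placeholder, and exact stdin handling can vary slightly between versions.

# Pipe a prompt or a file into a model non-interactively
echo "Why is the sky blue?" | ollama run llama3.1
cat app.py | ollama run codellama "Review this code for bugs"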

Chat Interface

| Command | Description |
| --- | --- |
| `ollama run llama3.1` | Start interactive chat |
| `/bye` | Exit chat session |
| `/clear` | Clear session context |
| `/save mysession` | Save the current session as a new model named `mysession` |
| `/load mysession` | Load a model or previously saved session |
| `"""` | Begin and end a multiline message (there is no `/multiline` command) |
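
Sampling parameters can also be adjusted inside this REPL with `/set parameter`; the values below are only examples.

ollama run llama3.1
>>> /set parameter temperature 0.3
>>> /set parameter num_ctx 8192
>>> /show parameters
>>> /bye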

API Usage

REST API

| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello"}'` | Generate text via API |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}]}'` | Chat via API |
| `curl http://localhost:11434/api/tags` | List models via API |
| `curl http://localhost:11434/api/show -d '{"name":"llama3.1"}'` | Show model info via API |
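
With `"stream": false` these endpoints return a single JSON object: the generated text is in the `response` field for `/api/generate` and in `message.content` for `/api/chat`. A quick sketch using jq to pull out just the text (assumes curl and jq are installed):

# Extract only the generated text from each endpoint
curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3.1","prompt":"Hello","stream":false}' | jq -r '.response'
curl -s http://localhost:11434/api/chat \
    -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":false}' \
    | jq -r '.message.content'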

Streaming Responses

| Command | Description |
| --- | --- |
| `curl http://localhost:11434/api/generate -d '{"model":"llama3.1","prompt":"Hello","stream":true}'` | Stream response |
| `curl http://localhost:11434/api/chat -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hello"}],"stream":true}'` | Stream chat |
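
Streaming (the default) returns one JSON object per line; each chunk carries a fragment of text and the final object has `done: true`. A minimal consumer with curl and jq:

# Print streamed tokens as they arrive
curl -sN http://localhost:11434/api/generate \
    -d '{"model":"llama3.1","prompt":"Tell me a joke"}' \
    | jq --unbuffered -rj '.response // empty'
echo   # final newline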

Model Configuration

Temperature and Parameters

Note: `ollama run` has no sampling flags; set parameters per session with `/set parameter` in the chat REPL, in a Modelfile, or through the API `options` field (see the sketch below).

| Command | Description |
| --- | --- |
| `/set parameter temperature 0.7` | Set temperature |
| `/set parameter top_p 0.9` | Set top-p sampling |
| `/set parameter top_k 40` | Set top-k sampling |
| `/set parameter repeat_penalty 1.1` | Set repeat penalty |
| `/set parameter seed 42` | Set random seed |
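
The same parameters can be passed per request through the REST API's `options` object; the values here are illustrative.

# Pass sampling parameters per request via the API
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about the sea",
  "stream": false,
  "options": {"temperature": 0.7, "top_p": 0.9, "top_k": 40, "repeat_penalty": 1.1, "seed": 42}
}' | jq -r '.response'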

Context and Memory

| Command | Description |
| --- | --- |
| `/set parameter num_ctx 4096` | Set context window size in tokens (REPL) |
| `"options": {"num_batch": 512}` | Set prompt batch size (API request option) |
| `"options": {"num_thread": 8}` | Set number of CPU threads (API request option) |

Custom Models

Creating Modelfiles

| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile` | Create custom model |
| `ollama create mymodel -f Modelfile --quantize q4_0` | Create with quantization |

Modelfile Examples

# Basic Modelfile
FROM llama3.1
PARAMETER temperature 0.8
PARAMETER top_p 0.9
SYSTEM "You are a helpful coding assistant."
# Advanced Modelfile
FROM codellama:7b
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
SYSTEM """You are an expert programmer. Always provide:
1. Clean, well-commented code
2. Explanation of the solution
3. Best practices and optimizations"""
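
To use a Modelfile, build a named model from it and then run that name; `coding-assistant` below is just an example name.

# Build a custom model from a Modelfile and chat with it
ollama create coding-assistant -f Modelfile
ollama show coding-assistant    # confirm the baked-in parameters and system prompt
ollama run coding-assistant "Refactor this function for readability"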

Integration Examples

Python Integration

import requests
import json

def chat_with_ollama(prompt, model="llama3.1"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data, timeout=120)
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["response"]

# Usage
result = chat_with_ollama("Explain quantum computing")
print(result)

JavaScript Integration

async function chatWithOllama(prompt, model = "llama3.1") {
    const response = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            model: model,
            prompt: prompt,
            stream: false
        })
    });
    const data = await response.json();
    return data.response;
}

// Usage
chatWithOllama("Write a JavaScript function").then(console.log);

Bash Integration

#!/bin/bash
ollama_chat() {
    local prompt="$1"
    local model="${2:-llama3.1}"
    # Build the JSON body with jq so quotes in the prompt cannot break it
    curl -s http://localhost:11434/api/generate \
        -d "$(jq -n --arg model "$model" --arg prompt "$prompt" \
              '{model: $model, prompt: $prompt, stream: false}')" \
        | jq -r '.response'
}

# Usage
ollama_chat "Explain Docker containers"

Performance Optimization

Note: `ollama run` has no GPU, memory, or thread flags; tune performance through API request options, the `--keepalive` flag, and the environment variables below.

| Command | Description |
| --- | --- |
| `"options": {"num_gpu": 32}` | Number of model layers to offload to the GPU (API request option) |
| `"options": {"num_thread": 8}` | Number of CPU threads (API request option) |
| `"options": {"num_batch": 1024}` | Prompt processing batch size (API request option) |
| `ollama run llama3.1 --keepalive 10m` | Keep the model loaded in memory for 10 minutes after the last request |

Environment Variables

| Variable | Description |
| --- | --- |
| `OLLAMA_HOST` | Set server host (default: 127.0.0.1:11434) |
| `OLLAMA_MODELS` | Set models directory |
| `OLLAMA_NUM_PARALLEL` | Number of parallel requests |
| `OLLAMA_MAX_LOADED_MODELS` | Max models in memory |
| `OLLAMA_FLASH_ATTENTION` | Enable flash attention |
| `OLLAMA_GPU_OVERHEAD` | GPU memory overhead |
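
These variables apply to the server process, so set them where `ollama serve` is launched (shell, systemd unit, or launchd plist); for example:

# Expose the server on all interfaces and allow more concurrency
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve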

Docker Usage

| Command | Description |
| --- | --- |
| `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama` | Run Ollama in Docker |
| `docker exec -it ollama ollama run llama3.1` | Run model in container |
| `docker exec -it ollama ollama pull mistral` | Pull model in container |
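
For NVIDIA GPU acceleration inside Docker, the host needs the NVIDIA Container Toolkit; the run command then adds `--gpus=all`:

# Run Ollama in Docker with GPU access (requires nvidia-container-toolkit on the host)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama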

Docker Compose

version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
volumes:
  ollama:

Monitoring & Debugging

| Command | Description |
| --- | --- |
| `journalctl -u ollama -f` | Follow server logs on a Linux systemd install (on macOS, see `~/.ollama/logs/server.log`) |
| `ollama ps` | Show running models and memory usage |
| `curl http://localhost:11434/api/version` | Check API version |
| `curl http://localhost:11434/api/tags` | List available models |
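
For a quick liveness check, the server's root endpoint replies with a plain-text status string:

# Prints "Ollama is running" when the server is reachable
curl -s http://localhost:11434/
curl -s http://localhost:11434/api/version    # JSON object with the server version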

Model Quantization

| Command | Description |
| --- | --- |
| `ollama create mymodel -f Modelfile --quantize q4_0` | 4-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q5_0` | 5-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q8_0` | 8-bit quantization |
| `ollama create mymodel -f Modelfile --quantize q4_K_M` | 4-bit K-quant (good quality/size trade-off) |

Quantizing with `--quantize` requires the `FROM` model in the Modelfile to be an unquantized (FP16) model.

Embedding Models

| Command | Description |
| --- | --- |
| `ollama pull nomic-embed-text` | Pull text embedding model |
| `curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"Hello world"}'` | Generate embeddings |
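
The response is a JSON object with an `embedding` array of floats; a quick sanity check of the vector and its dimensionality with jq:

# Generate an embedding and print the vector length
curl -s http://localhost:11434/api/embeddings \
    -d '{"model":"nomic-embed-text","prompt":"Hello world"}' \
    | jq '.embedding | length'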

Troubleshooting

| Command | Description |
| --- | --- |
| `ollama --help` | Show help information |
| `ollama serve --help` | Show server options |
| `ps aux \| grep ollama` | Check whether the Ollama process is running |
| `lsof -i :11434` | Check port usage |
| `ollama rm llama3.1 mistral` | Remove models to free disk space (`ollama rm` accepts multiple names; there is no `--all` flag) |

Best Practices

  • Choose model size based on available memory: a 4-bit 7B model needs roughly 4-5 GB, 13B about 8 GB, and 70B about 40 GB, plus headroom for context and the OS
  • Use GPU acceleration when available for better performance
  • Implement proper error handling in API integrations (see the sketch after this list)
  • Monitor memory usage when running multiple models
  • Use quantized models for resource-constrained environments
  • Cache frequently used models locally
  • Set appropriate context sizes for your use case
  • Use streaming for long responses to improve user experience
  • Implement rate limiting for production API usage
  • Update models regularly (re-run `ollama pull <model>`) for improved performance and capabilities
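
As a starting point for the error-handling bullet above, here is a minimal bash sketch; it assumes curl and jq are available, and the function name is illustrative.

#!/bin/bash
# Query Ollama and fail loudly on transport or HTTP errors
ollama_safe() {
    local prompt="$1"
    local model="${2:-llama3.1}"
    local out
    if ! out=$(curl -sf --max-time 120 http://localhost:11434/api/generate \
            -d "$(jq -n --arg m "$model" --arg p "$prompt" \
                  '{model: $m, prompt: $p, stream: false}')"); then
        echo "error: request to the Ollama server failed" >&2
        return 1
    fi
    echo "$out" | jq -r '.response // .error'
}

# Usage
ollama_safe "Summarize the benefits of local LLMs" mistral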

Common Use Cases

Code Generation

ollama run codellama "Create a REST API in Python using FastAPI"

Text Analysis

ollama run llama3.1 "Analyze the sentiment of this text: 'I love this product!'"

Creative Writing

ollama run mistral "Write a short story about time travel"

Data Processing

ollama run llama3.1 "Convert this JSON to CSV format: {...}"