
LLM Inference at the Edge: Running Large Language Models Locally with llama.cpp and MLX


The economics and privacy landscape of LLM inference have shifted dramatically. Running a capable language model on your own hardware is no longer a research curiosity. It is a practical option for development, production workloads, and privacy-sensitive applications. Models that rival GPT-3.5 in capability can run on a laptop. Models that approach GPT-4 quality can run on a single workstation with consumer GPUs.

This guide covers the full stack of local LLM inference: the GGUF model format, llama.cpp for cross-platform CPU and GPU execution, Apple MLX for native Apple Silicon performance, Ollama for simplified model management, vLLM and SGLang for production serving, quantization strategies that trade accuracy for memory, and practical application architectures. By the end, you will have the knowledge to choose the right tool for your specific hardware, workload, and quality requirements.

The Case for Local LLM Inference

Cloud API inference is convenient but comes with real costs that compound over time.

Financial costs. A moderate workload of 10 million tokens per day costs roughly $30-75/day with commercial APIs. The same workload on a local machine with an RTX 4090 costs only the electricity after the initial hardware purchase. At scale, the payback period on hardware is measured in weeks.

Latency. Cloud APIs add network round-trip time, queuing delays, and rate limiting. Local inference starts generating tokens immediately. For interactive applications like code completion or chat interfaces, the difference between 200ms time-to-first-token (local) and 800ms (cloud API) is noticeable.

Privacy and compliance. Regulated industries (healthcare, finance, legal, government) often cannot send data to third-party APIs. Local inference keeps all data on-premises: no data processing agreements with a model provider, no third-party audit trail to manage, and no risk of your data ending up in a provider's training set.

Availability. Local models work offline, during API outages, and without internet connectivity. For edge deployments, field operations, or air-gapped environments, this is not optional.

Customization. Local deployment gives you full control over the model, including fine-tuning, custom system prompts without provider restrictions, and the ability to run experimental or niche models not available through commercial APIs.
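
The payback claim in the financial-cost paragraph above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the hardware price, power draw, and electricity rate are assumed placeholder figures, and only the $30-75/day API range comes from the text. The helper name payback_days is hypothetical.

```python
def payback_days(hardware_cost_usd, api_cost_per_day_usd,
                 power_watts, electricity_usd_per_kwh, hours_per_day=24.0):
    """Days until buying local hardware is cheaper than paying a cloud API."""
    # Daily electricity cost of keeping the machine under load
    energy_cost_per_day = power_watts / 1000 * hours_per_day * electricity_usd_per_kwh
    daily_saving = api_cost_per_day_usd - energy_cost_per_day
    if daily_saving <= 0:
        return float("inf")  # local hardware never pays off at these rates
    return hardware_cost_usd / daily_saving


# Assumed figures: $2,500 workstation, 500 W under load, $0.15 per kWh
for api_cost in (30, 75):
    days = payback_days(2500, api_cost, power_watts=500, electricity_usd_per_kwh=0.15)
    print(f"API at ${api_cost}/day -> payback in about {days:.0f} days")
```

With these assumptions the payback lands between roughly one and three months. Swap in your own prices to see how sensitive the result is to API cost.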

Understanding GGUF: The Universal Model Format

GGUF (GPT-Generated Unified Format) has become the standard format for local LLM deployment. Developed by the llama.cpp project, it replaced the older GGML format and provides a self-contained file that includes model weights, tokenizer data, and metadata.

Key properties of GGUF files:

Header:
  - Magic number and version
  - Model architecture (llama, mistral, phi, etc.)
  - Hyperparameters (layers, heads, embedding dimension)
  - Tokenizer type and vocabulary
  - Quantization method per tensor

Tensor data:
  - Weights stored in specified quantization format
  - Memory-mapped for efficient loading
  - Single file contains everything needed for inference

GGUF files are available on Hugging Face with standardized naming:

# Naming convention: model-name-size-quantization.gguf
# Examples:
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf     # 4-bit, medium quality
Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf     # 5-bit, good quality
Meta-Llama-3.1-8B-Instruct-Q8_0.gguf       # 8-bit, near-lossless
Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf    # Large model, 4-bit

The self-contained nature of GGUF means deployment is copying a single file. No Python environments, no dependency installation, no tokenizer configuration. This simplicity is a major advantage for production deployment and distribution.
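
To make the header layout described above concrete, here is a minimal sketch that reads the fixed-size start of a GGUF file. It assumes GGUF version 2 or 3 in little-endian byte order (a 4-byte magic, a 32-bit version, then 64-bit tensor and metadata counts) and does not parse the metadata key-value pairs that follow. The function name read_gguf_header is hypothetical.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size prefix of a GGUF file (versions 2 and 3)."""
    with open(path, "rb") as f:
        data = f.read(24)
    if len(data) < 24 or data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack("<IQQ", data[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage (path is an example; point it at any GGUF file you have downloaded):
# print(read_gguf_header("models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
```

A quick check like this is handy in deployment scripts to reject truncated or mislabeled downloads before handing the file to an inference engine.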

llama.cpp Deep Dive

llama.cpp is the foundational project for local LLM inference. Written in C/C++ with minimal dependencies, it runs on virtually any hardware: x86 CPUs, ARM CPUs, NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Apple Silicon (Metal), and even Vulkan for broader GPU support.

Building from Source

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# CPU-only build
cmake -B build
cmake --build build --config Release -j$(nproc)

# CUDA build (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Metal build (Apple Silicon) - enabled by default on macOS
# macOS has no nproc; sysctl reports the core count instead
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# ROCm build (AMD GPUs)
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j$(nproc)

# Vulkan build (cross-platform GPU)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

Quantization Levels

Quantization reduces model precision to decrease memory requirements and increase inference speed. llama.cpp supports multiple quantization levels:

Quantization   Bits/Weight   Memory (7B)   Memory (70B)   Quality Impact
F16            16.0          14.0 GB       140 GB         Baseline
Q8_0           8.5           7.5 GB        75 GB          Negligible
Q6_K           6.6           5.8 GB        58 GB          Minimal
Q5_K_M         5.7           5.0 GB        50 GB          Very small
Q4_K_M         4.8           4.2 GB        42 GB          Small
Q4_0           4.5           3.9 GB        39 GB          Moderate
Q3_K_M         3.9           3.4 GB        34 GB          Noticeable
Q2_K           3.4           2.9 GB        29 GB          Significant
IQ2_XS         2.3           2.1 GB        21 GB          Large

Q4_K_M is the sweet spot for most users. It provides good quality with reasonable memory usage. Q5_K_M is worth the extra memory if you have it. Below Q3_K_M, quality degrades noticeably for complex reasoning tasks.
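
One way to apply the table above is to pick the highest-quality level whose weights fit a given memory budget. The sketch below uses the bits-per-weight values from the table. The 1.2x headroom factor for KV cache and buffers is an assumption, and best_quant_for_budget is a hypothetical helper, not part of llama.cpp.

```python
# (name, bits per weight) from the table above, highest quality first
QUANT_LEVELS = [
    ("F16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8), ("Q4_0", 4.5), ("Q3_K_M", 3.9), ("Q2_K", 3.4),
    ("IQ2_XS", 2.3),
]


def best_quant_for_budget(params_billions, budget_gb, headroom=1.2):
    """Return the highest-quality level that fits the budget, or None."""
    for name, bits in QUANT_LEVELS:
        weights_gb = params_billions * bits / 8
        # headroom leaves room for KV cache and runtime buffers (assumed factor)
        if weights_gb * headroom <= budget_gb:
            return name
    return None


print(best_quant_for_budget(7, budget_gb=8))    # 7B model on an 8 GB device
print(best_quant_for_budget(70, budget_gb=48))  # 70B model across 48 GB
```

Returning None rather than the smallest level makes it explicit when a model simply does not fit and a smaller model is the better choice.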

Running Inference

# Basic interactive chat
./build/bin/llama-cli \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --chat-template llama3 \
  -c 8192 \
  -ngl 99 \
  --temp 0.7

# Explanation of key flags:
# -m          Model file path
# -c 8192     Context length (tokens)
# -ngl 99     GPU layers (99 = offload everything to GPU)
# --temp 0.7  Sampling temperature

# Batch processing from a prompt file
./build/bin/llama-cli \
  -m models/model.gguf \
  -f prompt.txt \
  -n 512 \
  --no-display-prompt

# Quantize a model yourself
./build/bin/llama-quantize \
  models/original-f16.gguf \
  models/quantized-q4km.gguf \
  Q4_K_M

GPU Offloading

GPU offloading moves transformer layers from CPU RAM to GPU VRAM for faster computation. The -ngl flag controls how many layers to offload:

# Full GPU offloading (fastest, requires enough VRAM)
./build/bin/llama-cli -m model.gguf -ngl 99

# Partial offloading (split between GPU and CPU)
# Useful when model does not fit entirely in VRAM
./build/bin/llama-cli -m model.gguf -ngl 20

# Multi-GPU (split across GPUs)
./build/bin/llama-cli -m model.gguf -ngl 99 \
  --split-mode layer \
  --tensor-split 0.5,0.5

The performance difference between CPU and GPU inference is dramatic. A 7B Q4_K_M model on a modern CPU generates roughly 15-25 tokens/second. The same model fully offloaded to an RTX 4090 generates 100-150 tokens/second.
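
When a model does not fit in VRAM, a rough starting value for -ngl can be estimated by dividing the usable VRAM by the average size of one layer. The sketch below treats all layers as equally sized and reserves a fixed amount for the KV cache and scratch buffers. Both are simplifying assumptions, and layers_that_fit is a hypothetical helper. Treat the result as a first guess to tune from.

```python
def layers_that_fit(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough -ngl estimate: how many equally sized layers fit in VRAM."""
    per_layer_gb = model_size_gb / n_layers
    # Keep some VRAM back for KV cache and compute buffers (assumed amount)
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 70B Q4_K_M (about 42 GB, 80 transformer layers) on a 24 GB GPU
print(layers_that_fit(42, n_layers=80, vram_gb=24))

# 8B Q4_K_M (about 4.2 GB, 32 layers) fits entirely
print(layers_that_fit(4.2, n_layers=32, vram_gb=24))
```

If llama.cpp reports an out-of-memory error at the estimated value, lower -ngl by a few layers or reduce the context length.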

llama.cpp Server: OpenAI-Compatible API

llama.cpp includes a server that exposes an OpenAI-compatible API, making it a drop-in replacement for api.openai.com in existing applications:

# Start the server
./build/bin/llama-server \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 8192 \
  -ngl 99 \
  --parallel 4 \
  --cont-batching

# Key server flags:
# --parallel 4      Handle 4 concurrent requests
# --cont-batching   Enable continuous batching for throughput
# --metrics         Enable Prometheus metrics endpoint
# --api-key KEY     Require API key for requests

Client usage is identical to the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # Or your configured key
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the PagedAttention algorithm."}
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

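# Embeddings endpoint (the server must be started with embeddings enabled,
# e.g. the --embeddings flag, and works best with a dedicated embedding model)
embedding = client.embeddings.create(
    model="local-model",
    input="This is a test sentence for embedding."
)
print(f"Embedding dimension: {len(embedding.data[0].embedding)}")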

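The server batches concurrent requests automatically (see --parallel and --cont-batching above), so throughput-oriented workloads only need to send ordinary requests in parallel. A single request looks like this:

# One chat completion request; send several concurrently to exercise batching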
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize this text..."}],
    "temperature": 0.3,
    "max_tokens": 256
  }'
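
To benefit from the server's parallel slots, the client has to issue requests concurrently. The following is a minimal sketch using only the standard library. The helper names chat_once and chat_many are hypothetical, and the base URL assumes the server started in the example above.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def chat_once(prompt, base_url="http://localhost:8080/v1"):
    """Send one chat completion request and return the reply text."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 256,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


def chat_many(prompts, send=chat_once, max_workers=4):
    """Run prompts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


# Usage (requires a running llama-server):
# summaries = chat_many(["Summarize document A...", "Summarize document B..."])
```

Matching max_workers to the server's --parallel value keeps every slot busy without queuing extra requests on the server side.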

The Apple MLX Framework

Apple's MLX framework is designed specifically for Apple Silicon, exploiting the unified memory architecture where CPU and GPU share the same physical memory. This eliminates the memory copy overhead that plagues discrete GPU setups.

Setup and Basic Usage

# Install MLX and the LM toolkit
pip install mlx mlx-lm

# Download and run a model
mlx_lm.generate \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt "Explain how unified memory benefits LLM inference" \
  --max-tokens 512

# Start an OpenAI-compatible server
mlx_lm.server \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --port 8080

MLX Python API

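from mlx_lm import load, generate, stream_generate
from mlx_lm.sample_utils import make_sampler

# Load model
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Generate text. Recent mlx-lm releases take sampling settings through a
# sampler object; older releases accepted temp=0.7 directly.
response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to parse JSON logs",
    max_tokens=512,
    sampler=make_sampler(temp=0.7)
)
print(response)

# Streaming generation uses stream_generate; generate() has no stream flag
for chunk in stream_generate(
    model, tokenizer,
    prompt="Explain containerization",
    max_tokens=256
):
    # Recent releases yield response objects with a .text field
    print(chunk.text, end="", flush=True)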

Quantizing Models for MLX

# Convert a Hugging Face model to MLX format with quantization
mlx_lm.convert \
  --hf-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --mlx-path ./mlx-llama-3.1-8b-4bit \
  --quantize \
  --q-bits 4 \
  --q-group-size 64

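# Fine-tune with LoRA (--train starts training; newer mlx-lm releases
# name the layer-count flag --num-layers instead of --lora-layers)
mlx_lm.lora \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --train \
  --data ./training_data \
  --batch-size 4 \
  --lora-layers 16 \
  --iters 1000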

MLX's unified memory advantage becomes dramatic with larger models. On a MacBook Pro with 64GB unified memory, you can run a 70B parameter model at 4-bit quantization (approximately 42GB) that would require a dedicated GPU with 48GB+ VRAM on a traditional setup. The M-series chips access this memory at high bandwidth (300-400 GB/s on the M3 Max, depending on the chip configuration), enabling competitive inference speeds.
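
Memory bandwidth matters because token generation is usually bandwidth-bound: producing each token requires reading essentially all model weights once. That gives a simple upper bound on generation speed, sketched below. The bandwidth figures are Apple's published peak numbers, real throughput will be lower, and the function name is hypothetical.

```python
def max_decode_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed when each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# 70B model at 4-bit (about 42 GB of weights)
for chip, bandwidth in [("M3 Max (400 GB/s)", 400), ("M2 Ultra (800 GB/s)", 800)]:
    bound = max_decode_tokens_per_second(bandwidth, 42)
    print(f"{chip}: at most ~{bound:.1f} tok/s")
```

The bound ignores KV cache reads and compute time, so it is best used to compare hardware options rather than to predict exact speeds.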

MLX vs llama.cpp Benchmarks on Apple Silicon

Both frameworks run well on Apple Silicon, but with different performance characteristics:

Model: Llama 3.1 8B Instruct Q4_K_M
Hardware: MacBook Pro M3 Max, 64GB RAM

                      llama.cpp (Metal)    MLX
Prompt processing:    1,847 tok/s          2,103 tok/s
Token generation:     62.3 tok/s           58.7 tok/s
Time to first token:  145 ms               112 ms
Memory usage:         4.8 GB               5.1 GB

Model: Llama 3.1 70B Instruct Q4_K_M
Hardware: Mac Studio M2 Ultra, 192GB RAM

                      llama.cpp (Metal)    MLX
Prompt processing:    487 tok/s            612 tok/s
Token generation:     18.2 tok/s           16.8 tok/s
Time to first token:  890 ms               720 ms
Memory usage:         42 GB                44 GB

MLX tends to win on prompt processing speed and time-to-first-token due to its optimized Metal compute kernels. llama.cpp often edges ahead on sustained token generation. The differences are small enough that the choice should be driven by ecosystem fit rather than raw performance. If you are building a Python application on macOS, MLX is the natural choice. If you need cross-platform support, llama.cpp is the safer bet.

Ollama: Docker for LLMs

Ollama wraps llama.cpp in a Docker-like experience for model management. It handles downloading, quantization selection, GPU detection, and API serving with minimal configuration.

# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b

# List locally downloaded models
ollama list

# Model details
ollama show llama3.1:8b

# Run with timing statistics printed after each response
ollama run llama3.1:8b --verbose

# Serve API (starts automatically on install)
# API available at http://localhost:11434

Modelfile: Custom Model Configuration

# Modelfile for a custom assistant
FROM llama3.1:8b

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

SYSTEM """You are a senior DevOps engineer. You provide concise,
accurate answers about infrastructure, containers, CI/CD pipelines,
and cloud architecture. Always include relevant commands and
configuration examples."""

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""

# Build and run custom model
ollama create devops-assistant -f Modelfile
ollama run devops-assistant

Ollama API

import requests
import json

# Chat completion
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "user", "content": "Explain Kubernetes pod scheduling"}
        ],
        "stream": False
    }
)
print(response.json()["message"]["content"])

# Streaming
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "List Docker best practices"}],
        "stream": True
    },
    stream=True
)
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data["message"]["content"], end="", flush=True)

# Embeddings
response = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "llama3.1:8b",
        "input": "Kubernetes is a container orchestration platform"
    }
)
embedding = response.json()["embeddings"][0]
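
The streaming loop above works because Ollama sends one JSON object per line, each carrying a fragment of the reply, with the last object marked as done. The sketch below isolates that parsing step so it can be reused and tested without a server. The helper name collect_stream and the sample lines are hand-written for illustration.

```python
import json


def collect_stream(lines):
    """Join the content fragments from Ollama /api/chat streaming lines."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        data = json.loads(line)
        parts.append(data.get("message", {}).get("content", ""))
        if data.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)


# Hand-written sample in the same shape as the streaming response
sample = [
    b'{"message": {"role": "assistant", "content": "Use "}, "done": false}',
    b'{"message": {"role": "assistant", "content": "small images."}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))
```

With the requests library, the same function can be fed directly from response.iter_lines().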

vLLM for Production Self-Hosting

When you need production-grade serving with high throughput, vLLM is the standard choice. Its PagedAttention algorithm manages GPU memory like a virtual memory system, dramatically improving throughput for concurrent requests.

# Install
pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --max-num-batched-tokens 32768

# vLLM Python API for offline batch processing
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    max_model_len=8192,
    gpu_memory_utilization=0.9
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    top_p=0.9
)

# Batch process multiple prompts efficiently
prompts = [
    "Explain microservices architecture",
    "What is event-driven design?",
    "Describe the CAP theorem"
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Output: {output.outputs[0].text[:100]}...")
    print()

vLLM's continuous batching means it does not wait for an entire batch to complete before starting new requests. As individual requests finish, their GPU memory is immediately recycled for new requests, maintaining high utilization under variable load.
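
The benefit of continuous batching can be illustrated with a toy simulation. This is not vLLM's scheduler. It assumes one decode step per generated token and a step time that does not depend on how full the batch is. Both function names are hypothetical.

```python
import heapq


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request at once."""
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for length in lengths:
        free_at = heapq.heappop(slots)  # earliest slot to free up
        done_at = free_at + length
        finish = max(finish, done_at)
        heapq.heappush(slots, done_at)
    return finish


# Output lengths (tokens) for eight requests, mixing short and long replies
lengths = [10, 200, 15, 180, 20, 25, 30, 190]
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

For this mix, static batching needs 595 steps while continuous batching finishes in 420, because short requests no longer wait for the longest request in their batch.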

SGLang: Structured Generation and Constrained Decoding

SGLang (Structured Generation Language) specializes in constrained output, ensuring LLM responses conform to specified formats. This is critical for building reliable applications that parse LLM output programmatically.

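# SGLang structured generation
# Assumes an SGLang server is already running and has been set as the backend, e.g.
# sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
import sglang as sgl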

@sgl.function
def extract_entity(s, text):
    s += sgl.user(f"Extract entities from: {text}")
    s += sgl.assistant(
        sgl.gen("result", max_tokens=256,
                regex=r'\{"name": "[^"]+", "type": "(person|org|location)"\}')
    )

# JSON mode with schema enforcement
@sgl.function
def analyze_code(s, code):
    s += sgl.system("You are a code analyzer.")
    s += sgl.user(f"Analyze this code:\n```\n{code}\n```")
    s += sgl.assistant(
        sgl.gen("analysis",
                max_tokens=512,
                json_schema={
                    "type": "object",
                    "properties": {
                        "language": {"type": "string"},
                        "complexity": {"type": "string", "enum": ["low", "medium", "high"]},
                        "issues": {
                            "type": "array",
                            "items": {"type": "string"}
                        },
                        "suggestions": {
                            "type": "array",
                            "items": {"type": "string"}
                        }
                    },
                    "required": ["language", "complexity", "issues", "suggestions"]
                })
    )

SGLang's RadixAttention engine caches common prompt prefixes across requests, reducing redundant computation when many requests share the same system prompt or few-shot examples.
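
The idea behind prefix caching can be shown with a toy trie over token IDs. This is a simplified illustration, not SGLang's implementation, which manages real KV cache memory in a radix tree with eviction. The class name PrefixCache and the token IDs are made up for the example.

```python
class PrefixCache:
    """Toy prefix cache: a trie keyed by token IDs."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        still_matching = True
        for token in tokens:
            if still_matching and token in node:
                matched += 1  # this token's KV entries could be reused
            else:
                still_matching = False
                node.setdefault(token, {})  # new branch: must be computed
            node = node[token]
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 7, 42, 9]           # shared system prompt tokens
first = cache.match_and_insert(system_prompt + [55, 56])
second = cache.match_and_insert(system_prompt + [77, 78, 79])
print(first, second)
```

The second request reuses all five system-prompt tokens and only has to compute its own three, which is where the savings come from when many requests share a long prefix.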

Quantization Strategies: GPTQ vs AWQ vs GGUF vs EXL2

Different quantization methods optimize for different hardware and use cases:

Method   Best For            GPU Required   Quality   Speed
GGUF     CPU + GPU hybrid    Optional       Good      Moderate
GPTQ     NVIDIA GPU only     Yes            Good      Fast
AWQ      NVIDIA GPU only     Yes            Better    Fast
EXL2     NVIDIA GPU only     Yes            Best      Fastest

# GPTQ quantization (requires GPU)
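pip install auto-gptq
python -c "
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
# GPTQ calibrates on sample text; use a few hundred representative
# samples for a real run instead of this single placeholder sentence
examples = [tokenizer('Local inference keeps data on your own hardware.')]
model.quantize(examples)
model.save_quantized('./llama-3.1-8b-gptq-4bit')
"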

# AWQ quantization
pip install autoawq
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4})
model.save_quantized('./llama-3.1-8b-awq-4bit')
"

For most users, the decision tree is straightforward:

  • Apple Silicon: Use MLX 4-bit or GGUF with llama.cpp Metal
  • NVIDIA GPU with enough VRAM: AWQ or EXL2 for best quality and speed
  • NVIDIA GPU with limited VRAM: GGUF with partial GPU offloading
  • CPU only: GGUF with Q4_K_M quantization
  • Cross-platform distribution: GGUF (works everywhere)
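
The decision tree above can be written down as a small function, which is convenient in provisioning scripts. The platform labels and the rule that "enough VRAM" means the quantized model fits entirely are assumptions made for this sketch. The function name recommend_format is hypothetical.

```python
def recommend_format(platform, vram_gb=0.0, model_size_gb=0.0, portable=False):
    """Map the bullets above to a format recommendation.

    platform is one of "apple_silicon", "nvidia", or "cpu" (labels assumed here).
    """
    if portable:
        return "GGUF"  # works everywhere
    if platform == "apple_silicon":
        return "MLX 4-bit or GGUF with llama.cpp Metal"
    if platform == "nvidia":
        # Assumption: "enough VRAM" means the quantized model fits entirely
        if vram_gb >= model_size_gb:
            return "AWQ or EXL2"
        return "GGUF with partial GPU offloading"
    if platform == "cpu":
        return "GGUF with Q4_K_M quantization"
    raise ValueError(f"unknown platform: {platform}")


print(recommend_format("nvidia", vram_gb=24, model_size_gb=42))
print(recommend_format("apple_silicon"))
```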

Memory Requirements Calculator

A practical formula for estimating memory requirements:

def estimate_memory_gb(params_billions, bits_per_weight, context_length=4096):
    """Estimate total memory needed for LLM inference."""
    # Model weights
    weight_memory = params_billions * bits_per_weight / 8  # GB

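    # KV cache (approximate)
    # Assumes: 2 (K+V) * num_layers * hidden_dim * 2 bytes * context_length
    # Simplified: ~0.5 GB per billion params per 4096 context tokens.
    # This is a conservative upper bound; models with grouped-query
    # attention (such as Llama 3.1) need considerably less.
    kv_cache = params_billions * 0.5 * (context_length / 4096)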

    # Overhead (activations, buffers)
    overhead = weight_memory * 0.1

    total = weight_memory + kv_cache + overhead
    return round(total, 1)

# Examples
models = [
    ("7B Q4_K_M", 7, 4.8),
    ("7B Q8_0", 7, 8.5),
    ("13B Q4_K_M", 13, 4.8),
    ("70B Q4_K_M", 70, 4.8),
    ("70B Q8_0", 70, 8.5),
]

for name, params, bits in models:
    mem = estimate_memory_gb(params, bits)
    print(f"{name:15s} -> {mem:6.1f} GB")

# Output:
# 7B Q4_K_M       ->    8.1 GB
# 7B Q8_0         ->   11.7 GB
# 13B Q4_K_M      ->   15.1 GB
# 70B Q4_K_M      ->   81.2 GB
# 70B Q8_0        ->  116.8 GB

Rule of thumb: budget approximately 1.1x the model weight size for weights plus runtime overhead, then add the KV cache on top. The KV cache grows linearly with context length, so long contexts (32K+) need significant extra headroom.
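
When the model architecture is known, the KV cache can be computed exactly instead of estimated. The sketch below applies the standard formula (keys and values, per layer, per KV head, per cached token) and assumes 16-bit cache entries. The Llama 3.1 8B figures (32 layers, 8 KV heads, head dimension 128) come from that model's published configuration. The function name kv_cache_gib is hypothetical.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_length, bytes_per_value=2):
    """KV cache size in GiB: keys and values for every layer and cached token."""
    total_bytes = (
        2                    # one tensor for keys, one for values
        * n_layers
        * n_kv_heads
        * head_dim
        * bytes_per_value    # 2 bytes per value for a 16-bit cache
        * context_length
    )
    return total_bytes / 1024**3


# Llama 3.1 8B uses grouped-query attention with 8 KV heads
for context in (8192, 32768, 131072):
    print(f"{context:>7} tokens -> {kv_cache_gib(32, 8, 128, context):.1f} GiB")
```

For Llama 3.1 8B this works out to 1 GiB at an 8,192-token context, which shows how much grouped-query attention reduces the cache compared with the conservative estimate in the calculator above.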

Building a Local AI App: RAG with Ollama and ChromaDB

A practical example of building a local RAG (Retrieval-Augmented Generation) application:

import chromadb
import requests
import json
from pathlib import Path

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

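def get_embedding(text):
    """Get embeddings from local Ollama."""
    # A dedicated embedding model (for example nomic-embed-text) usually
    # retrieves better than a chat model; llama3.1:8b keeps this example to one model.
    response = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": "llama3.1:8b", "input": text}
    )
    response.raise_for_status()
    return response.json()["embeddings"][0]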

def index_documents(docs_dir):
    """Index documents into ChromaDB."""
    for filepath in Path(docs_dir).glob("**/*.md"):
        content = filepath.read_text()
        # Chunk the document
        chunks = chunk_text(content, max_tokens=512, overlap=50)
        for i, chunk in enumerate(chunks):
            doc_id = f"{filepath.stem}_{i}"
            embedding = get_embedding(chunk)
            collection.add(
                documents=[chunk],
                embeddings=[embedding],
                ids=[doc_id],
                metadatas=[{"source": str(filepath)}]
            )
    print(f"Indexed {collection.count()} chunks")

def chunk_text(text, max_tokens=512, overlap=50):
    """Simple paragraph chunking. Whitespace-separated words approximate tokens,
    and any overlap > 0 carries the last paragraph into the next chunk."""
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = []
    current_length = 0

    for para in paragraphs:
        para_length = len(para.split())
        if current_length + para_length > max_tokens and current_chunk:
            chunks.append("\n\n".join(current_chunk))
            # Keep last paragraph for overlap
            current_chunk = current_chunk[-1:] if overlap > 0 else []
            current_length = len(current_chunk[0].split()) if current_chunk else 0
        current_chunk.append(para)
        current_length += para_length

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks

def query_rag(question):
    """Query the RAG system."""
    # Retrieve relevant chunks
    query_embedding = get_embedding(question)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=5
    )

    # Build context
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with local LLM
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.1:8b",
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "Answer the question using only the provided context. "
                        "If the context does not contain enough information, "
                        "say so. Cite your sources."
                    )
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {question}"
                }
            ],
            "stream": False
        }
    )
    return response.json()["message"]["content"]

# Usage
index_documents("./docs")
answer = query_rag("How do I configure Kubernetes pod autoscaling?")
print(answer)

This entire pipeline runs locally. No data leaves the machine, no API keys are needed, and the system works offline. For a production deployment, you would add a web frontend, persistent caching, and document update handling, but the core RAG architecture remains the same.

Privacy and Compliance Benefits

Local LLM deployment provides concrete compliance benefits:

Data sovereignty. Prompts and documents never leave your infrastructure. This makes it much easier to meet data residency and privacy requirements (GDPR, CCPA, PIPEDA), because no third-party processor is involved.

Audit simplicity. There is no third-party vendor to audit. The model runs on your hardware, processes your data, and produces output that stays on your systems.

No training data risk. Cloud API providers may use your data to improve their models (check terms carefully). Local models have zero risk of your proprietary data ending up in someone else's training set.

Air-gapped deployment. For classified environments or high-security facilities, local models can run completely disconnected from any network.

# Air-gapped deployment workflow
# 1. On connected machine: download model
ollama pull llama3.1:8b
# Model stored in ~/.ollama/models/

# 2. Transfer to air-gapped machine (USB, approved media)
# -C keeps the archive paths relative to the home directory
tar -czf ollama-models.tar.gz -C ~ .ollama/models

# 3. On air-gapped machine (with Ollama already installed): restore and run
tar -xzf ollama-models.tar.gz -C ~
ollama serve &
ollama run llama3.1:8b
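
Because the archive travels on removable media, it is worth verifying it on the air-gapped side before restoring. A minimal sketch: compute a SHA-256 digest on the connected machine, carry the digest alongside the archive, and compare after transfer. The helper name sha256_of_file is hypothetical, and the archive name matches the workflow above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte archives do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: run on both machines and compare the printed digests
# print(sha256_of_file("ollama-models.tar.gz"))
```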

Deciding When to Stay Local vs Use Cloud APIs

The decision is not binary. Most organizations will use a hybrid approach:

Stay local when:

  • Processing sensitive, regulated, or proprietary data
  • Latency-sensitive interactive applications
  • High-volume, predictable workloads (cheaper long-term)
  • Offline or air-gapped environments
  • You need full control over the model and its behavior

Use cloud APIs when:

  • You need frontier model capabilities (GPT-4 class, Claude Opus class)
  • Workloads are bursty and unpredictable
  • You lack the hardware for large models
  • Time-to-deployment matters more than cost
  • You need capabilities beyond text (vision, audio, tool use) that local models cannot match

The hybrid pattern: Use local models for data processing, summarization, classification, and embedding generation. Use cloud APIs for complex reasoning, creative tasks, and quality-critical outputs. This captures most of the cost savings and privacy benefits while maintaining access to frontier capabilities when they matter.
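
The hybrid pattern can be sketched as a small routing function. The task labels below are illustrative names derived from the categories in the text, and the rule that sensitive data always stays local is an assumption in line with the criteria listed earlier. The function name route is hypothetical.

```python
LOCAL_TASKS = {"data_processing", "summarization", "classification", "embedding"}
CLOUD_TASKS = {"complex_reasoning", "creative", "quality_critical"}


def route(task, sensitive=False):
    """Decide where a request should run under the hybrid pattern."""
    if sensitive:
        return "local"   # regulated or proprietary data never leaves the machine
    if task in LOCAL_TASKS:
        return "local"
    if task in CLOUD_TASKS:
        return "cloud"
    return "local"       # assumed default: keep unknown tasks local


print(route("summarization"))
print(route("complex_reasoning"))
print(route("complex_reasoning", sensitive=True))
```

Keeping the routing rule in one place makes it easy to audit which categories of data are ever allowed to reach a cloud API.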

The local inference ecosystem has reached a maturity level where it is a genuine production option, not a compromise. The tools are stable, the models are capable, and the hardware requirements are within reach of any development team. Whether you start with Ollama for simplicity, llama.cpp for control, or vLLM for scale, the path from experiment to production deployment is well-trodden and well-documented.