DeepSeek

DeepSeek publishes open-weight language models and serves them through a public, OpenAI-compatible API. It is best known for DeepSeek-Coder (code generation), DeepSeek-R1 (chain-of-thought reasoning), and competitive pricing. Model weights are published on Hugging Face and can be self-hosted.

API: https://api.deepseek.com
Docs: https://platform.deepseek.com/docs
Models on Hub: https://huggingface.co/deepseek-ai
GitHub: https://github.com/deepseek-ai

Installation

Python SDK (OpenAI-compatible)

# DeepSeek uses the OpenAI SDK with a custom base_url
pip install openai

# Or use httpx for direct REST calls
pip install httpx

# For local deployment via Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull deepseek-r1:7b
ollama pull deepseek-coder-v2:16b

vLLM Self-Hosted Inference

# Install vLLM (requires CUDA 11.8+)
pip install vllm

# Serve DeepSeek model
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --tensor-parallel-size 1

# OpenAI-compatible endpoint now at http://localhost:8000/v1

Direct with Transformers

pip install transformers accelerate torch bitsandbytes

Configuration

API Client Setup

from openai import OpenAI

# DeepSeek API (cloud)
client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxx",  # From platform.deepseek.com
    base_url="https://api.deepseek.com",
)

# Self-hosted via Ollama
client_local = OpenAI(
    api_key="ollama",                   # Any non-empty string
    base_url="http://localhost:11434/v1",
)

# Self-hosted via vLLM
client_vllm = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

Environment Variables

export DEEPSEEK_API_KEY="sk-xxxxxxxxxxxxxxxxxxxx"
export DEEPSEEK_BASE_URL="https://api.deepseek.com"
# Optional: point at a local endpoint instead (this replaces the value above)
# export DEEPSEEK_BASE_URL="http://localhost:11434/v1"
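
The exported variables can be read when the client is constructed, which keeps keys out of source files. The following is a minimal sketch: `deepseek_client_kwargs` and `make_client` are hypothetical helper names, and the fallback URL is simply the cloud endpoint listed above.

```python
import os


def deepseek_client_kwargs(env=None) -> dict:
    """Build OpenAI-client keyword arguments from the DEEPSEEK_* variables."""
    env = os.environ if env is None else env
    api_key = env.get("DEEPSEEK_API_KEY")
    if not api_key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set")
    return {
        "api_key": api_key,
        # Fall back to the cloud endpoint when no override is exported.
        "base_url": env.get("DEEPSEEK_BASE_URL", "https://api.deepseek.com"),
    }


def make_client():
    """Create the client lazily; the import only runs when a client is needed."""
    from openai import OpenAI  # requires `pip install openai`

    return OpenAI(**deepseek_client_kwargs())
```

Calling `make_client()` after exporting the variables should give a client equivalent to the hard-coded ones in API Client Setup.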

Core API / Commands

Available Models

Model	Context	Strengths	Pricing (USD per 1M tokens, input / output)
`deepseek-chat`	64K	General chat, instruction following	$0.27 / $1.10
`deepseek-reasoner`	64K	Chain-of-thought, math, logic	$0.55 / $2.19
`deepseek-coder`	16K	Code generation, completion	$0.14 / $0.28
`deepseek-ai/DeepSeek-R1` (HF)	128K	Full open-weight reasoning model	Self-hosted
`deepseek-ai/DeepSeek-V3` (HF)	128K	Full open-weight general model	Self-hosted
`deepseek-r1:7b` (Ollama)	32K	Lightweight local reasoning	Free (local)
`deepseek-coder-v2:16b` (Ollama)	32K	Local code model	Free (local)
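
Per-request cost follows directly from the per-million-token prices. The sketch below copies the three hosted prices from the table above; `estimate_cost` is a hypothetical helper, and the figures should be treated as a snapshot, not a current price list.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES_PER_MILLION = {
    "deepseek-chat": (0.27, 1.10),
    "deepseek-reasoner": (0.55, 2.19),
    "deepseek-coder": (0.14, 0.28),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# 2,000 prompt tokens and 500 completion tokens on deepseek-chat
print(f"${estimate_cost('deepseek-chat', 2_000, 500):.6f}")  # prints $0.001090
```

With the OpenAI SDK, the token counts can be taken from `response.usage.prompt_tokens` and `response.usage.completion_tokens`.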

API Endpoints

Endpoint	Method	Description
`/chat/completions`	POST	Chat completions (main endpoint)
`/models`	GET	List available models
`/beta/completions`	POST	FIM (fill-in-the-middle) completions; beta, called as `/completions` on `base_url="https://api.deepseek.com/beta"`
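
For direct REST calls without the SDK, the chat endpoint takes a bearer token and a JSON body. This is a minimal sketch under that assumption; `build_chat_request` and `send` are hypothetical helpers, and only `send` touches the network (via httpx).

```python
import json


def build_chat_request(api_key: str, messages: list[dict],
                       model: str = "deepseek-chat",
                       base_url: str = "https://api.deepseek.com") -> dict:
    """Assemble URL, headers, and JSON body for POST /chat/completions."""
    return {
        "url": f"{base_url.rstrip('/')}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"model": model, "messages": messages}),
    }


def send(request: dict) -> dict:
    """Send the prepared request with httpx and return the decoded JSON."""
    import httpx  # requires `pip install httpx`

    resp = httpx.post(request["url"], headers=request["headers"],
                      content=request["body"], timeout=60.0)
    resp.raise_for_status()
    return resp.json()
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any tokens are spent.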

Advanced Usage

Chat Completions

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.deepseek.com",
)

# Basic chat
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Write a Python quicksort."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
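
When the full text is needed after streaming (for logging or post-processing), the deltas can be accumulated instead of printed. A minimal sketch: `collect_stream` is a hypothetical helper, and the fake chunks stand in for SDK objects so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            parts.append(delta.content)
    return "".join(parts)


def _fake_chunk(text):
    """Stand-in for an SDK chunk object, used only for this offline demo."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [_fake_chunk("def "), _fake_chunk(None),
        SimpleNamespace(choices=[]), _fake_chunk("quicksort")]
print(collect_stream(demo))  # prints: def quicksort
```

Passing the `stream` object from the example above to `collect_stream` should return the same text that the loop prints.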

DeepSeek-R1 Reasoning (with thinking tokens)

# DeepSeek-R1 returns reasoning_content separately
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {"role": "user", "content": "What is 17 * 23? Show your reasoning."},
    ],
)

# Access chain-of-thought reasoning
thinking = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

print("THINKING:\n", thinking)
print("\nANSWER:\n", answer)

# Multi-turn: carry the answer forward, not the reasoning
messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, find x"}]
resp = client.chat.completions.create(model="deepseek-reasoner", messages=messages)

# Append only the final answer; reasoning_content is NOT passed back
# in later turns (by design)
messages.append({
    "role": "assistant",
    "content": resp.choices[0].message.content,
})
messages.append({"role": "user", "content": "Now solve 5x - 3 = 27"})

# Second turn sees the earlier question and answer, but not the earlier reasoning
resp = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
print(resp.choices[0].message.content)
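
If conversation history is stored with the reasoning attached (for example for logging), it needs to be dropped before the history is sent again. A minimal sketch, assuming messages are plain dicts; `strip_reasoning` is a hypothetical helper.

```python
def strip_reasoning(messages: list[dict]) -> list[dict]:
    """Return a copy of the history without any reasoning_content fields."""
    return [
        {key: value for key, value in message.items() if key != "reasoning_content"}
        for message in messages
    ]


history = [
    {"role": "user", "content": "Solve: if 3x + 7 = 22, find x"},
    {"role": "assistant", "content": "x = 5", "reasoning_content": "3x = 15, so x = 5."},
    {"role": "user", "content": "Now solve 5x - 3 = 27"},
]
clean = strip_reasoning(history)
print(clean[1])  # {'role': 'assistant', 'content': 'x = 5'}
```

The original `history` list is left untouched, so the stored reasoning remains available for inspection.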

Fill-in-the-Middle (FIM) for Code Completion

# FIM on the hosted API is a beta feature: point the client at the /beta base URL
# and pass the code before the gap as `prompt` and the code after it as `suffix`.
fim_client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.deepseek.com/beta",
)

response = fim_client.completions.create(
    model="deepseek-chat",
    prompt="def quicksort(arr):\n    ",
    suffix="\n    return arr",
    max_tokens=256,
)
print(response.choices[0].text)

# Pythonic FIM helper
def fim_complete(prefix: str, suffix: str, model: str = "deepseek-chat") -> str:
    response = fim_client.completions.create(
        model=model,
        prompt=prefix,
        suffix=suffix,
        max_tokens=512,
        temperature=0.0,
    )
    return response.choices[0].text

# Self-hosted DeepSeek-Coder base models instead take the raw token format
# <｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜> as a single prompt.
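
Editor integrations usually start from a cursor position, not from ready-made prefix and suffix strings. The sketch below shows one way to derive both; `split_at_cursor` is a hypothetical helper using 0-based line and column numbers.

```python
def split_at_cursor(source: str, line: int, column: int) -> tuple[str, str]:
    """Split source at a 0-based (line, column) cursor into (prefix, suffix)."""
    lines = source.splitlines(keepends=True)
    if not 0 <= line < len(lines):
        raise ValueError(f"line {line} is outside the source")
    # Clamp the column to the visible line length (excluding the newline).
    visible = len(lines[line].rstrip("\r\n"))
    offset = sum(len(text) for text in lines[:line]) + min(column, visible)
    return source[:offset], source[offset:]


source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
prefix, suffix = split_at_cursor(source, line=1, column=11)
print(repr(prefix))  # 'def add(a, b):\n    return '
```

The resulting pair can be passed straight to `fim_complete(prefix, suffix)`.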

Function Calling

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

import json  # in a real module, keep this with the other imports at the top

# Check whether the model requested one or more tool calls
message = response.choices[0].message
for tool_call in message.tool_calls or []:
    args = json.loads(tool_call.function.arguments)
    print(f"Function: {tool_call.function.name}, Args: {args}")
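
Detecting the tool call is only half of the exchange: the function has to be run and its result returned in a `tool` message. The sketch below follows the OpenAI-compatible message shape; `TOOL_REGISTRY`, `run_tool_call`, and the stub `get_weather` body are assumptions for illustration.

```python
import json
from types import SimpleNamespace


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Placeholder implementation; a real one would query a weather service."""
    return {"city": city, "unit": unit, "temperature": 21}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and build the `tool` message to send back."""
    func = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(func(**args)),
    }


# Offline demo with a stand-in for the SDK's tool-call object.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
print(run_tool_call(fake_call))
```

In a live exchange, append the assistant message that contains `tool_calls` to `messages`, then each tool message, and call `client.chat.completions.create` again so the model can phrase the final answer.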

Local Deployment with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit quantization via bitsandbytes: reduces VRAM from ~14GB to ~5GB
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Format prompt with DeepSeek chat template
messages = [{"role": "user", "content": "Explain the Riemann hypothesis."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.6,
        do_sample=True,
    )
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
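
Unlike the hosted `deepseek-reasoner` model, locally run R1 distills return the reasoning inline, wrapped in `<think>` tags, so it has to be separated from the answer by hand. A minimal sketch, assuming the tags survive decoding; `split_think` is a hypothetical helper, and the second branch covers chat templates that already emit the opening tag as part of the prompt.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer)."""
    match = THINK_BLOCK.search(text)
    if match is not None:
        return match.group(1).strip(), text[match.end():].strip()
    if "</think>" in text:
        # Opening tag was part of the prompt; only the closing tag was generated.
        reasoning, _, answer = text.partition("</think>")
        return reasoning.strip(), answer.strip()
    return "", text.strip()


reasoning, answer = split_think("<think>3x = 15, so x = 5.</think>\n\nx = 5")
print(answer)  # prints: x = 5
```

Applied to the decoded `response` string above, this yields the same two parts that the hosted API exposes as `reasoning_content` and `content`.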

Common Workflows

Code Review Assistant

def review_code(code: str, language: str = "python") -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "You are an expert code reviewer. Be concise, specific, and actionable.",
            },
            {
                "role": "user",
                "content": f"Review this {language} code:\n\n```{language}\n{code}\n```",
            },
        ],
        temperature=0.3,
        max_tokens=1024,
    )
    return response.choices[0].message.content

Ollama Quick Start

# Pull and run interactively
ollama run deepseek-r1:7b

# Serve as API
ollama serve &

# Chat via curl
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "messages": [{"role": "user", "content": "Explain recursion"}],
  "stream": false
}'

# List local models
ollama list

# Remove model
ollama rm deepseek-r1:7b

Batch Processing with Bounded Concurrency

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.deepseek.com",
)

async def process_batch(prompts: list[str], concurrency: int = 5) -> list[str]:
    semaphore = asyncio.Semaphore(concurrency)

    async def process_one(prompt: str) -> str:
        async with semaphore:
            resp = await async_client.chat.completions.create(
                model="deepseek-chat",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256,
            )
            return resp.choices[0].message.content

    return await asyncio.gather(*[process_one(p) for p in prompts])

results = asyncio.run(process_batch(["Summarize AI.", "Explain ML.", "Define LLM."]))

Tips and Best Practices

Topic	Recommendation
Model choice	Use `deepseek-reasoner` for math/logic; `deepseek-chat` for general tasks; coder models for code
R1 reasoning	Leave enough `max_tokens` headroom for `reasoning_content` to finish; a truncated chain of thought tends to degrade the final answer
Temperature	Use 0.0–0.3 for code/factual tasks; 0.7–1.0 for creative writing
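FIM	Use FIM for code completion; on the hosted API it is a beta feature that needs `base_url="https://api.deepseek.com/beta"`, with the text before the gap in `prompt` and the text after it in `suffix`
Context length	Keep prompts well under the limit; a 64K context window doesn’t mean 64K of output, which is capped separately by `max_tokens`
Cost control	Context caching on the DeepSeek API is automatic; keep shared prefixes (system prompt, few-shot examples) identical at the start of each request so repeated input is billed at the cache-hit rate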
Local vs cloud	Use 7B distilled models locally for dev/testing; cloud API for production
vLLM serving	Add `--enable-prefix-caching` flag for repeated system prompts
Multi-turn reasoning	Do not pass `reasoning_content` back to the model in subsequent turns
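Rate limits	Bound in-flight requests with a semaphore on an async client for batches; retry with backoff on rate-limit or overload errors
Quantization	4-bit quantization cuts weight memory to roughly a quarter of bf16 (about 14GB to ~5GB total for a 7B model), usually with a modest quality drop; evaluate on your own tasks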
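
The quantization figures can be sanity-checked with back-of-the-envelope arithmetic on the weights alone. This sketch ignores activations, the KV cache, and quantization metadata, which is presumably why the practical 4-bit figure quoted in this document (~5GB) sits above the raw number; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_7b = 7e9
print(weight_memory_gb(params_7b, 16))  # prints 14.0 (bf16)
print(weight_memory_gb(params_7b, 4))   # prints 3.5 (4-bit)
```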