DeepSeek provides state-of-the-art open-weight models with a public API. It is known for DeepSeek-Coder (code generation), DeepSeek-R1 (chain-of-thought reasoning), and competitive pricing. The model weights are openly released and can be self-hosted.
API: https://api.deepseek.com
Docs: https://platform.deepseek.com/docs
Models on Hub: https://huggingface.co/deepseek-ai
GitHub: https://github.com/deepseek-ai
Installation
Python SDK (OpenAI-compatible)
# DeepSeek uses the OpenAI SDK with a custom base_url
pip install openai
# Or use httpx for direct REST calls
pip install httpx
# For local deployment via Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull deepseek-r1:7b
ollama pull deepseek-coder-v2:16b
vLLM Self-Hosted Inference
# Install vLLM (requires CUDA 11.8+)
pip install vllm
# Serve DeepSeek model
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--dtype bfloat16 \
--max-model-len 16384 \
--tensor-parallel-size 1
# OpenAI-compatible endpoint now at http://localhost:8000/v1
# For local inference with Hugging Face Transformers
pip install transformers accelerate torch bitsandbytes
Configuration
API Client Setup
from openai import OpenAI
# DeepSeek API (cloud)
client = OpenAI(
api_key="sk-xxxxxxxxxxxxxxxxxxxx", # From platform.deepseek.com
base_url="https://api.deepseek.com",
)
# Self-hosted via Ollama
client_local = OpenAI(
api_key="ollama", # Any non-empty string
base_url="http://localhost:11434/v1",
)
# Self-hosted via vLLM
client_vllm = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
)
Environment Variables
export DEEPSEEK_API_KEY="sk-xxxxxxxxxxxxxxxxxxxx"
export DEEPSEEK_BASE_URL="https://api.deepseek.com"
# Optional: point at a local endpoint instead (this overrides the value above)
# export DEEPSEEK_BASE_URL="http://localhost:11434/v1"
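The variables above can be read into client settings at startup. A minimal sketch, assuming the variable names from this section; `load_deepseek_config` and `DEFAULT_BASE_URL` are hypothetical names, and the returned keys match the `api_key` and `base_url` arguments used in the client setup:

```python
import os

DEFAULT_BASE_URL = "https://api.deepseek.com"  # cloud endpoint from this document


def load_deepseek_config(env=None):
    """Return keyword arguments for OpenAI(...) built from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("DEEPSEEK_API_KEY")
    if not api_key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set")
    return {
        "api_key": api_key,
        # Fall back to the cloud endpoint when no override is exported
        "base_url": env.get("DEEPSEEK_BASE_URL", DEFAULT_BASE_URL),
    }


# Demo with an explicit mapping instead of the real environment
config = load_deepseek_config({"DEEPSEEK_API_KEY": "sk-example"})
print(config["base_url"])
```

With the `openai` package installed, the result can be passed on as `OpenAI(**load_deepseek_config())`, which keeps the key out of source files.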
Core API / Commands
Available Models
| Model | Context | Strengths | Pricing (Input / Output per 1M tokens) |
|---|---|---|---|
| deepseek-chat | 64K | General chat, instruction following | $0.27 / $1.10 |
| deepseek-reasoner | 64K | Chain-of-thought, math, logic | $0.55 / $2.19 |
| deepseek-coder | 16K | Code generation, completion | $0.14 / $0.28 |
| deepseek-ai/DeepSeek-R1 (HF) | 128K | Full open-weight reasoning model | Self-hosted |
| deepseek-ai/DeepSeek-V3 (HF) | 128K | Full open-weight general model | Self-hosted |
| deepseek-r1:7b (Ollama) | 32K | Lightweight local reasoning | Free (local) |
| deepseek-coder-v2:16b (Ollama) | 32K | Local code model | Free (local) |
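The per-token prices translate into a request cost with simple arithmetic. A rough sketch using the figures from the table as listed here (they may change, so treat them as placeholders); `PRICES` and `estimate_cost` are hypothetical names:

```python
# USD per 1M tokens as (input, output), copied from the table above
PRICES = {
    "deepseek-chat": (0.27, 1.10),
    "deepseek-reasoner": (0.55, 2.19),
    "deepseek-coder": (0.14, 0.28),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one request from its token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# 2,000 prompt tokens and 500 completion tokens on deepseek-chat
cost = estimate_cost("deepseek-chat", 2_000, 500)
print(f"${cost:.5f}")
```

Actual token counts are reported in the `usage` field of each API response, which can be fed into the same function after the call.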
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /chat/completions | POST | Chat completions (main endpoint) |
| /completions | POST | Text completion (legacy/FIM) |
| /models | GET | List available models |
| /beta/completions | POST | FIM (fill-in-the-middle) completions |
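For direct REST calls without the SDK, the chat endpoint takes a JSON body and a bearer token. The sketch below only builds the request with the standard library and does not send it; `build_chat_request` is a hypothetical helper:

```python
import json
import urllib.request


def build_chat_request(api_key, prompt, base_url="https://api.deepseek.com"):
    """Build (but do not send) a POST request for /chat/completions."""
    body = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("sk-example", "Hello")
print(req.get_method(), req.full_url)
# Sending it would be urllib.request.urlopen(req); that needs a valid key and network access
```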
Advanced Usage
Chat Completions
from openai import OpenAI
client = OpenAI(
api_key="sk-xxxxxxxxxxxxxxxxxxxx",
base_url="https://api.deepseek.com",
)
# Basic chat
response = client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain gradient descent in 3 sentences."},
],
temperature=0.7,
max_tokens=512,
)
print(response.choices[0].message.content)
# Streaming response
stream = client.chat.completions.create(
model="deepseek-chat",
messages=[{"role": "user", "content": "Write a Python quicksort."}],
stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
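When the full text is needed after streaming (for logging or post-processing), the deltas can be accumulated instead of only printed. A small sketch, assuming chunks shaped like the SDK objects used above; `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks stand in for a live stream so the example runs offline:

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # content may be None or empty on some chunks
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


full_text = collect_stream([fake_chunk("def "), fake_chunk(None), fake_chunk("quicksort")])
print(full_text)
```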
DeepSeek-R1 Reasoning (with thinking tokens)
# DeepSeek-R1 returns reasoning_content separately
response = client.chat.completions.create(
model="deepseek-reasoner",
messages=[
{"role": "user", "content": "What is 17 * 23? Show your reasoning."},
],
)
# Access chain-of-thought reasoning
thinking = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content
print("THINKING:\n", thinking)
print("\nANSWER:\n", answer)
# Multi-turn: carry the final answer forward, not the reasoning
messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, find x"}]
resp = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
# Append only the answer so the next turn has the conversation context
messages.append({
    "role": "assistant",
    "content": resp.choices[0].message.content,
    # Note: reasoning_content is NOT passed back in multi-turn (by design)
})
messages.append({"role": "user", "content": "Now solve 5x - 3 = 27"})
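The rule of dropping `reasoning_content` between turns can be wrapped in a small helper so it is applied consistently. A sketch with a stand-in message object in place of a real API response; `append_assistant_turn` is a hypothetical name:

```python
from types import SimpleNamespace


def append_assistant_turn(messages, message):
    """Append the assistant's final answer only, leaving reasoning_content out."""
    messages.append({"role": "assistant", "content": message.content})
    return messages


# Stand-in for resp.choices[0].message (illustration only)
reply = SimpleNamespace(content="x = 5", reasoning_content="3x = 15, so x = 5")

history = [{"role": "user", "content": "Solve: if 3x + 7 = 22, find x"}]
append_assistant_turn(history, reply)
history.append({"role": "user", "content": "Now solve 5x - 3 = 27"})
print(history)
```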
Fill-in-the-Middle (FIM) for Code Completion
# On the hosted API, FIM is a beta feature: use the /beta base URL and pass the
# code before the gap as `prompt` and the code after it as `suffix`.
# The <|fim▁begin|>, <|fim▁hole|>, <|fim▁end|> tokens are only needed when
# prompting a self-hosted base model directly.
fim_client = OpenAI(
    api_key="sk-xxxxxxxxxxxxxxxxxxxx",
    base_url="https://api.deepseek.com/beta",
)
response = fim_client.completions.create(
    model="deepseek-chat",
    prompt="def quicksort(arr):\n    ",
    suffix="\n    return arr",
    max_tokens=256,
)
print(response.choices[0].text)
# Pythonic FIM helper
def fim_complete(prefix: str, suffix: str, model: str = "deepseek-chat") -> str:
    response = fim_client.completions.create(
        model=model,
        prompt=prefix,
        suffix=suffix,
        max_tokens=512,
        temperature=0.0,
    )
    return response.choices[0].text
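A FIM request needs the code before and after the gap as two separate strings. One way to get them from an editor buffer is to split at a cursor marker. This is a sketch only; `split_at_cursor` and the `<CURSOR>` marker are hypothetical and not part of any DeepSeek API:

```python
def split_at_cursor(source, marker="<CURSOR>"):
    """Split source code into (prefix, suffix) around the first cursor marker."""
    prefix, found, suffix = source.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in source")
    return prefix, suffix


buffer = "def quicksort(arr):\n    <CURSOR>\n    return arr"
prefix, suffix = split_at_cursor(buffer)
print(repr(prefix))
print(repr(suffix))
```

The two strings can then be handed to a helper such as `fim_complete(prefix, suffix)`.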
Function Calling
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["city"],
},
},
}
]
response = client.chat.completions.create(
model="deepseek-chat",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=tools,
tool_choice="auto",
)
import json

# Check whether the model requested a tool call
if response.choices[0].finish_reason == "tool_calls":
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    print(f"Function: {tool_call.function.name}, Args: {args}")
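After the model requests a tool, the application runs the function and sends the result back as a `tool` message. A minimal sketch of that dispatch step, assuming the OpenAI-compatible message format; `TOOL_REGISTRY`, `run_tool_call`, and the stubbed `get_weather` body (with made-up data) are hypothetical:

```python
import json
from types import SimpleNamespace


def get_weather(city, unit="celsius"):
    # Stub with placeholder data; a real implementation would query a weather service
    return {"city": city, "unit": unit, "temperature": 21}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one requested tool call and build the follow-up `tool` message."""
    func = TOOL_REGISTRY[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    result = func(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


# Stand-in for response.choices[0].message.tool_calls[0] (illustration only)
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
tool_message = run_tool_call(fake_call)
print(tool_message)
```

In a real exchange, the assistant message containing the tool call and this `tool` message are both appended to `messages` before calling the chat endpoint again.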
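Local Inference with Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # 4-bit weights via bitsandbytes: reduces VRAM from 14GB to ~5GB
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)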
# Format prompt with DeepSeek chat template
messages = [{"role": "user", "content": "Explain the Riemann hypothesis."}]
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.6,
        do_sample=True,
    )
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
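Locally run R1-style models return reasoning and answer in one string rather than in separate fields. The sketch below separates the two, assuming the reasoning ends with a `</think>` tag (the opening `<think>` may or may not be present, depending on the chat template); `split_reasoning` is a hypothetical helper:

```python
def split_reasoning(text):
    """Split decoded model output into (reasoning, answer) at the </think> tag."""
    reasoning, found, answer = text.partition("</think>")
    if not found:
        # No reasoning block: treat the whole output as the answer
        return "", text.strip()
    # Drop the opening tag if the model emitted one
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


sample = "<think>\n17 * 23 = 17 * 20 + 17 * 3 = 340 + 51\n</think>\n\n391"
thinking, answer = split_reasoning(sample)
print("THINKING:", thinking)
print("ANSWER:", answer)
```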
Common Workflows
Code Review Assistant
def review_code(code: str, language: str = "python") -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "You are an expert code reviewer. Be concise, specific, and actionable.",
            },
            {
                "role": "user",
                "content": f"Review this {language} code:\n\n```{language}\n{code}\n```",
            },
        ],
        temperature=0.3,
        max_tokens=1024,
    )
    return response.choices[0].message.content
Ollama Quick Start
# Pull and run interactively
ollama run deepseek-r1:7b
# Serve as API
ollama serve &
# Chat via curl
curl http://localhost:11434/api/chat -d '{
"model": "deepseek-r1:7b",
"messages": [{"role": "user", "content": "Explain recursion"}],
"stream": false
}'
# List local models
ollama list
# Remove model
ollama rm deepseek-r1:7b
Batch Processing with Rate Limiting
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI(
api_key="sk-xxxxxxxxxxxxxxxxxxxx",
base_url="https://api.deepseek.com",
)
async def process_batch(prompts: list[str], concurrency: int = 5) -> list[str]:
    # The semaphore caps the number of in-flight requests
    semaphore = asyncio.Semaphore(concurrency)

    async def process_one(prompt: str) -> str:
        async with semaphore:
            resp = await async_client.chat.completions.create(
                model="deepseek-chat",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256,
            )
            return resp.choices[0].message.content

    return await asyncio.gather(*[process_one(p) for p in prompts])
results = asyncio.run(process_batch(["Summarize AI.", "Explain ML.", "Define LLM."]))
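A semaphore limits concurrency but does not recover from a request that is rejected for exceeding a rate limit. One common complement is retrying with exponential backoff. A sketch, with a simulated flaky call in place of the API; `with_retries` is a hypothetical helper, and real code would catch a narrower exception (for example the SDK's `openai.RateLimitError`) instead of `Exception`:

```python
import asyncio


async def with_retries(make_call, attempts=3, base_delay=0.5):
    """Await make_call(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try
            await asyncio.sleep(base_delay * 2 ** attempt)


# Simulated call that fails twice before succeeding (illustration only)
calls = {"count": 0}


async def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = asyncio.run(with_retries(flaky, base_delay=0.01))
print(result, calls["count"])
```

Inside a batch worker, the API call would be wrapped as `await with_retries(lambda: async_client.chat.completions.create(...))`.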
Tips and Best Practices
| Topic | Recommendation |
|---|---|
| Model choice | Use deepseek-reasoner for math/logic; deepseek-chat for general tasks; coder models for code |
| R1 reasoning | Don’t truncate reasoning_content; the final answer builds on the full chain of thought, and reading it helps judge how reliable the answer is |
| Temperature | Use 0.0–0.3 for code/factual tasks; 0.7–1.0 for creative writing |
| FIM | Use FIM for code completion; on the hosted API it is a beta feature that takes the code before the gap as the prompt and the code after it as the suffix |
| Context length | Keep prompts well under the limit; a 64K context window doesn’t mean 64K of output |
| Cost control | Keep system prompts identical across requests so the DeepSeek API’s automatic context caching can reuse the shared prefix |
| Local vs cloud | Use 7B distilled models locally for dev/testing; cloud API for production |
| vLLM serving | Add --enable-prefix-caching flag for repeated system prompts |
| Multi-turn reasoning | Do not pass reasoning_content back to the model in subsequent turns |
| Rate limits | API has token-per-minute limits; use async clients with semaphores for batches |
| Quantization | 4-bit quantization cuts weight memory to roughly a quarter of bf16 (about 14GB down to ~5GB in practice for a 7B model) with <5% quality drop for most tasks |
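The quantization numbers follow from a back-of-the-envelope calculation: weight memory is parameter count times bits per parameter. The sketch below covers weights only; activations, the KV cache, and quantization metadata add overhead on top, which is why a 4-bit 7B model needs closer to ~5GB than 3.5GB in practice. `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(7e9, 16)  # 7B parameters at 16 bits each
int4_gb = weight_memory_gb(7e9, 4)   # the same model at 4 bits each
print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```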