ChromaDB Cheat Sheet
Overview
ChromaDB is an open-source embedding database built specifically for AI applications. It provides a simple API for storing document embeddings alongside metadata, then retrieving them via semantic similarity search. ChromaDB supports both in-memory (ephemeral) and persistent storage modes, a Docker-based HTTP server for production, and first-class integrations with LangChain, LlamaIndex, and OpenAI.
Collections are ChromaDB's primary abstraction, analogous to tables in a relational database. Each collection stores documents, their vector embeddings, associated metadata, and unique IDs. The database generates embeddings automatically when an embedding function is configured, or accepts pre-computed vectors directly. Metadata filtering via a MongoDB-like where clause lets you combine semantic search with structured constraints, for example restricting a query to documents from one source or category.
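To make the where-clause semantics concrete, here is a minimal pure-Python sketch of how such a filter could be evaluated against a single metadata dictionary. It is an illustration only, not ChromaDB's implementation: the `matches` helper is hypothetical, covers only a subset of operators, and treats a missing key as a non-match.

```python
import operator

# Comparison operators of the MongoDB-like syntax mapped to Python functions.
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: dict, where: dict) -> bool:
    """Return True if `metadata` satisfies `where` (hypothetical helper)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2023}}
            if key not in metadata:
                return False  # assumption of this sketch: missing keys never match
            value = metadata[key]
            for op, operand in condition.items():
                if op == "$in":
                    ok = value in operand
                elif op == "$nin":
                    ok = value not in operand
                else:
                    ok = _COMPARATORS[op](value, operand)
                if not ok:
                    return False
        else:
            # Shorthand form, e.g. {"category": "qdrant"} means equality
            if metadata.get(key) != condition:
                return False
    return True


doc_meta = {"category": "pgvector", "year": 2024}
print(matches(doc_meta, {"category": "pgvector"}))       # True
print(matches(doc_meta, {"year": {"$gte": 2025}}))       # False
print(matches(doc_meta, {"$or": [{"category": "qdrant"},
                                 {"year": {"$lt": 2025}}]}))  # True
```

All conditions at the top level of a filter are combined with AND in this sketch; nesting under `$or` is the way to express alternatives.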
ChromaDB is well-suited for prototyping RAG pipelines locally and scaling to production via the HTTP server mode. Its Python and JavaScript clients share a nearly identical API, making it straightforward to switch between embedded and server modes without rewriting application code.
Installation
Python Client
# Install ChromaDB
pip install chromadb
# Install alongside the packages the embedding functions depend on
pip install chromadb openai # OpenAI embeddings
pip install chromadb sentence-transformers # Local embeddings
# Install LangChain integration
pip install langchain-chroma
# Install LlamaIndex integration
pip install llama-index-vector-stores-chroma
Docker Server Mode
# Pull and run ChromaDB server
docker pull chromadb/chroma
docker run -d \
-p 8000:8000 \
-v chroma-data:/chroma/chroma \
--name chromadb \
chromadb/chroma
# Run with basic authentication enabled (variable names as of Chroma 0.5.x;
# the credentials value is an htpasswd-style entry, i.e. user:bcrypt-hash)
docker run -d \
-p 8000:8000 \
-v chroma-data:/chroma/chroma \
-e CHROMA_SERVER_AUTHN_CREDENTIALS='admin:<bcrypt-hash>' \
-e CHROMA_SERVER_AUTHN_PROVIDER="chromadb.auth.basic_authn.BasicAuthenticationServerProvider" \
chromadb/chroma
# Docker Compose
cat > docker-compose.yml << 'EOF'
version: "3.9"
services:
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma-data:/chroma/chroma
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE
volumes:
  chroma-data:
EOF
docker compose up -d
JavaScript Client
npm install chromadb
npm install chromadb-default-embed # Default embedding function
Configuration
Client Initialization
import chromadb
from chromadb.config import Settings
# Ephemeral (in-memory) client — data lost on exit
client = chromadb.EphemeralClient()
# Persistent client — data saved to disk
client = chromadb.PersistentClient(path="./chroma_db")
# HTTP client — connect to Docker server
client = chromadb.HttpClient(host="localhost", port=8000)
# HTTP client with authentication
client = chromadb.HttpClient(
host="localhost",
port=8000,
settings=Settings(
chroma_client_auth_provider="chromadb.auth.basic_authn.BasicAuthClientProvider",
chroma_client_auth_credentials="admin:password"
)
)
# Async HTTP client (the await must run inside an async function or a running event loop)
client = await chromadb.AsyncHttpClient(host="localhost", port=8000)
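Because the three synchronous clients expose the same collection API, the choice between them can be pushed into configuration. The sketch below shows one way to do that; the environment variable names `CHROMA_HOST`, `CHROMA_PORT`, and `CHROMA_PATH` are invented for this example and are not read by ChromaDB itself.

```python
import os


def client_config_from_env(env=None) -> dict:
    """Pick a client mode from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    if env.get("CHROMA_HOST"):
        return {
            "mode": "http",
            "host": env["CHROMA_HOST"],
            "port": int(env.get("CHROMA_PORT", "8000")),
        }
    if env.get("CHROMA_PATH"):
        return {"mode": "persistent", "path": env["CHROMA_PATH"]}
    return {"mode": "ephemeral"}


def make_client(config: dict):
    """Build the matching client.

    chromadb is imported lazily so the config logic above can be
    exercised without the package installed.
    """
    import chromadb

    if config["mode"] == "http":
        return chromadb.HttpClient(host=config["host"], port=config["port"])
    if config["mode"] == "persistent":
        return chromadb.PersistentClient(path=config["path"])
    return chromadb.EphemeralClient()


# Passing a plain dict instead of os.environ keeps the example deterministic.
print(client_config_from_env({"CHROMA_HOST": "localhost"}))
print(client_config_from_env({"CHROMA_PATH": "./chroma_db"}))
print(client_config_from_env({}))
```

Application code then calls `make_client(client_config_from_env())` once and uses the returned client the same way in every mode.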
Embedding Functions
from chromadb.utils import embedding_functions
# Default embedding function (all-MiniLM-L6-v2, runs locally)
default_ef = embedding_functions.DefaultEmbeddingFunction()
# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="sk-...",
model_name="text-embedding-3-small"
)
# Sentence Transformers (local)
sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-mpnet-base-v2"
)
# Cohere embeddings
cohere_ef = embedding_functions.CohereEmbeddingFunction(
api_key="...",
model_name="embed-english-v3.0"
)
# HuggingFace Inference API
hf_ef = embedding_functions.HuggingFaceEmbeddingFunction(
api_key="hf_...",
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
Core Commands/API
| Method | Description |
|---|---|
| client.create_collection(name) | Create a new collection |
| client.get_collection(name) | Get an existing collection |
| client.get_or_create_collection(name) | Get or create collection |
| client.delete_collection(name) | Delete a collection permanently |
| client.list_collections() | List all collections |
| client.reset() | Delete all data (requires allow_reset=True in Settings; use with caution) |
| collection.add(documents, ids) | Add documents with auto-embedding |
| collection.add(embeddings, ids) | Add pre-computed embeddings |
| collection.upsert(documents, ids) | Add or update documents |
| collection.update(ids, documents) | Update existing documents |
| collection.delete(ids) | Delete documents by ID |
| collection.get(ids) | Retrieve documents by ID |
| collection.query(query_texts, n_results) | Semantic similarity search |
| collection.query(query_embeddings, n_results) | Search with pre-computed vector |
| collection.peek(limit) | Preview first N documents |
| collection.count() | Count documents in collection |
| collection.modify(name, metadata) | Rename or update collection metadata |
Advanced Usage
Adding Documents with Metadata
import chromadb
from chromadb.utils import embedding_functions
client = chromadb.PersistentClient(path="./chroma_db")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="sk-...",
model_name="text-embedding-3-small"
)
collection = client.get_or_create_collection(
name="knowledge_base",
embedding_function=openai_ef,
metadata={"hnsw:space": "cosine"} # cosine | l2 | ip
)
# Add documents — embeddings generated automatically
collection.add(
documents=[
"ChromaDB is an open-source vector database.",
"Qdrant is built in Rust for high performance.",
"pgvector adds vector search to PostgreSQL."
],
metadatas=[
{"source": "docs", "category": "chromadb", "version": "0.5"},
{"source": "docs", "category": "qdrant", "version": "1.9"},
{"source": "docs", "category": "pgvector", "version": "0.7"}
],
ids=["doc1", "doc2", "doc3"]
)
# Add pre-computed embeddings
import numpy as np
embeddings = np.random.rand(3, 1536).tolist()
collection.add(
embeddings=embeddings,
documents=["text1", "text2", "text3"],
ids=["e1", "e2", "e3"]
)
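The `hnsw:space` setting used above decides how the distances in query results are computed. The pure-Python sketch below follows the formulas as described in ChromaDB's documentation (squared L2, one minus inner product, one minus cosine similarity); treat it as an illustration of the three metrics, not as the library's code. The function names are made up for this example.

```python
import math


def squared_l2(a, b):
    """'l2' space: sum of squared differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """'ip' space: one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """'cosine' space: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # orthogonal to a

print(cosine_distance(a, b), squared_l2(a, b), inner_product_distance(a, b))
# 0.0 1.0 -1.0
print(cosine_distance(a, c), squared_l2(a, c), inner_product_distance(a, c))
# 1.0 2.0 1.0
```

Cosine ignores vector length, which is why `a` and `b` are at distance 0 there but not under L2. Inner-product distance can go negative for unnormalized vectors.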
Querying with Metadata Filters
# Basic semantic search
results = collection.query(
query_texts=["vector database performance"],
n_results=5
)
# Access results
for doc, meta, dist, id_ in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
    results["ids"][0]
):
    print(f"[{dist:.4f}] {id_}: {doc[:60]} | {meta}")
# Filter by metadata — where clause operators
results = collection.query(
query_texts=["fast similarity search"],
n_results=3,
where={"category": "qdrant"}, # exact match
)
results = collection.query(
query_texts=["SQL integration"],
n_results=3,
where={"category": {"$ne": "qdrant"}}, # comparison; $gt/$gte/$lt/$lte need numeric values
)
results = collection.query(
query_texts=["open-source database"],
n_results=5,
where={"$or": [ # boolean logic
{"category": "chromadb"},
{"category": "pgvector"}
]},
where_document={"$contains": "PostgreSQL"} # document text filter
)
# Include specific fields in results
results = collection.query(
query_texts=["embeddings"],
n_results=3,
include=["documents", "metadatas", "distances", "embeddings"]
)
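Query results come back as parallel nested lists, with one inner list per query text, which is why the loop above indexes `[0]`. A small helper can reshape them into one list of hit dictionaries per query. The sketch below is a hypothetical convenience function, and the sample dictionary (including its distances) is hand-written to mimic the result shape rather than real output.

```python
# Maps the plural result keys to singular per-hit keys.
_FIELD_NAMES = {
    "ids": "id",
    "documents": "document",
    "metadatas": "metadata",
    "distances": "distance",
}


def flatten_results(results: dict) -> list[list[dict]]:
    """Convert Chroma-style query results into a list of hits per query."""
    # Fields left out of `include` are None in the result, so skip them.
    present = [k for k in _FIELD_NAMES if results.get(k) is not None]
    per_query = []
    for q, ids in enumerate(results["ids"]):
        hits = [
            {_FIELD_NAMES[k]: results[k][q][i] for k in present}
            for i in range(len(ids))
        ]
        per_query.append(hits)
    return per_query


# Hand-written sample in the shape of a single-query result.
sample = {
    "ids": [["doc2", "doc1"]],
    "documents": [["Qdrant is built in Rust for high performance.",
                   "ChromaDB is an open-source vector database."]],
    "metadatas": [[{"category": "qdrant"}, {"category": "chromadb"}]],
    "distances": [[0.21, 0.35]],
    "embeddings": None,
}

for hit in flatten_results(sample)[0]:
    print(hit["id"], hit["distance"], hit["metadata"]["category"])
```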
Batch Operations and Upsert
import uuid
def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
# Batch ingest a document
document_text = "..." * 1000
chunks = chunk_text(document_text)
collection.upsert(
documents=chunks,
metadatas=[{"chunk_index": i, "source": "report.pdf"} for i in range(len(chunks))],
ids=[str(uuid.uuid4()) for _ in chunks]
)
# Update metadata on existing docs
collection.update(
ids=["doc1", "doc2"],
metadatas=[
{"source": "docs", "category": "chromadb", "reviewed": True},
{"source": "docs", "category": "qdrant", "reviewed": True}
]
)
# Delete by metadata filter
collection.delete(where={"category": "outdated"})
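One caveat with the batch ingest above: `uuid.uuid4()` produces fresh IDs on every run, so re-ingesting the same file with upsert adds duplicates instead of overwriting. A possible alternative is to derive IDs from the content and to send large ingests in fixed-size batches. The helpers below are hypothetical, and the batch size is an arbitrary value chosen for the example, not a ChromaDB limit.

```python
import hashlib


def stable_id(source: str, chunk_index: int, text: str) -> str:
    """Deterministic ID: the same source, position, and text give the same ID."""
    payload = f"{source}\x00{chunk_index}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]


def batched(items: list, batch_size: int):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = ["first chunk", "second chunk", "third chunk"]
ids = [stable_id("report.pdf", i, c) for i, c in enumerate(chunks)]

for id_batch, doc_batch in zip(batched(ids, 2), batched(chunks, 2)):
    # In a real ingest this is where the write would happen, e.g.
    # collection.upsert(ids=id_batch, documents=doc_batch)
    print(len(id_batch), len(doc_batch))
```

With IDs built this way, running the ingest twice leaves the collection unchanged, and editing a chunk's text yields a new ID for that chunk only.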
LangChain Integration
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create or load a Chroma vector store
vectorstore = Chroma(
collection_name="langchain_docs",
embedding_function=embeddings,
persist_directory="./chroma_langchain"
)
# Add documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("document.pdf")
pages = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.split_documents(pages)
vectorstore.add_documents(docs)
# Build RAG chain
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5, "filter": {"source": "document.pdf"}}
)
llm = ChatOpenAI(model="gpt-4o-mini")
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
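result = chain.invoke({"query": "What is the main topic?"})
answer = result["result"] # RetrievalQA returns a dict; the answer text is under "result"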
LlamaIndex Integration
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
chroma_client = chromadb.PersistentClient(path="./chroma_llama")
chroma_collection = chroma_client.get_or_create_collection("llama_docs")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Build index from documents
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
# Query the index
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the key findings.")
print(response)
# Load existing index (no re-embedding)
index = VectorStoreIndex.from_vector_store(vector_store)
Common Workflows
RAG Pipeline from Scratch
import chromadb
from chromadb.utils import embedding_functions
import openai
client = chromadb.PersistentClient(path="./rag_db")
ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="sk-...", model_name="text-embedding-3-small"
)
collection = client.get_or_create_collection("rag", embedding_function=ef)
def ingest_documents(docs: list[dict]):
    """docs: [{"text": ..., "source": ..., "id": ...}]"""
    collection.upsert(
        documents=[d["text"] for d in docs],
        metadatas=[{"source": d["source"]} for d in docs],
        ids=[d["id"] for d in docs]
    )

def retrieve(query: str, k: int = 5, source_filter: str = None) -> list[str]:
    where = {"source": source_filter} if source_filter else None
    results = collection.query(
        query_texts=[query], n_results=k, where=where
    )
    return results["documents"][0]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content
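The pipeline above joins every retrieved chunk into the prompt as-is. If chunks are long, it can help to cap the context size and number the chunks so the model can refer to them. The sketch below is one possible way to do that; `build_context` is a hypothetical helper, and the character budget is a rough stand-in for a token limit, with a default chosen arbitrarily.

```python
def build_context(chunks: list[str], max_chars: int = 4000) -> str:
    """Number the chunks and join them, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for n, chunk in enumerate(chunks, start=1):
        entry = f"[{n}] {chunk.strip()}"
        # Every entry after the first also costs the "\n\n" separator.
        cost = len(entry) + (2 if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(entry)
        used += cost
    return "\n\n".join(parts)


retrieved = [
    "ChromaDB is an open-source vector database.",
    "pgvector adds vector search to PostgreSQL.",
]
print(build_context(retrieved, max_chars=200))
```

Because results arrive ordered by distance, truncating from the end drops the least similar chunks first.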
Multi-Modal Collection (Images)
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
image_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()
collection = client.get_or_create_collection(
name="images",
embedding_function=image_ef,
data_loader=image_loader
)
# Add images by URI
collection.add(
uris=["./img1.jpg", "./img2.png"],
ids=["img1", "img2"],
metadatas=[{"label": "cat"}, {"label": "dog"}]
)
# Query by image URI or text
results = collection.query(
query_uris=["./query.jpg"],
n_results=3,
include=["uris", "distances", "metadatas"]
)
Backup and Restore
# Backup persistent storage
tar -czf chroma_backup_$(date +%Y%m%d).tar.gz ./chroma_db/
# Docker volume backup
docker run --rm \
-v chroma-data:/data \
-v $(pwd):/backup \
alpine tar czf /backup/chroma_backup.tar.gz -C /data .
# Restore (the archive already contains the chroma_db/ directory, so extract into its parent)
tar -xzf chroma_backup_20240101.tar.gz -C .
Tips and Best Practices
| Tip | Details |
|---|---|
| Choose cosine distance | Set metadata={"hnsw:space": "cosine"} at collection creation; cannot be changed later |
| Use upsert over add | Prevents duplicate ID errors during incremental ingestion |
| Pre-compute embeddings | Generate embeddings in batch with openai.embeddings.create(model=..., input=[...]) then pass them via the embeddings argument |
| Filter before query | Narrow the search space with where to improve precision and speed |
| Tune HNSW parameters | hnsw:M (connectivity, default 16) and hnsw:construction_ef (build accuracy, default 100) control index quality; set them in the collection metadata at creation |
| Monitor collection size | Use collection.count() to track growth; plan for scaling to server mode |
| Avoid tiny chunks | Chunks under ~100 tokens lose context; 300-800 tokens is typically optimal |
| Use IDs for deduplication | Stable, deterministic IDs (e.g., hash of content) allow safe re-ingestion |
| Enable telemetry opt-out | Set ANONYMIZED_TELEMETRY=FALSE in production environments |
| Test with peek() | Use collection.peek(5) to verify data shape before running full queries |