ChromaDB Cheat Sheet

Overview

ChromaDB is an open-source embedding database built specifically for AI applications. It provides a simple API for storing document embeddings alongside metadata, then retrieving them via semantic similarity search. ChromaDB supports both in-memory (ephemeral) and persistent storage modes, a Docker-based HTTP server for production, and first-class integrations with LangChain, LlamaIndex, and OpenAI.

Collections are ChromaDB’s primary abstraction — analogous to tables in a relational database. Each collection stores documents, their vector embeddings, associated metadata, and unique IDs. The database handles embedding generation automatically when an embedding function is configured, or accepts pre-computed vectors directly. Metadata filtering via a MongoDB-like where clause lets you combine semantic search with structured constraints, enabling powerful hybrid retrieval strategies.
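To make the filter semantics concrete, here is a minimal pure-Python sketch of how a MongoDB-style `where` clause can be evaluated against one metadata dictionary. The `matches` helper is hypothetical and written only for illustration; it is not ChromaDB's implementation, and details such as how missing fields are treated are assumptions of this sketch.

```python
from typing import Any

# Comparison operators, modeled on the MongoDB-style syntax used in `where` clauses.
_OPS = {
    "$eq": lambda a, b: a == b,
    "$ne": lambda a, b: a != b,
    "$gt": lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt": lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
    "$in": lambda a, b: a in b,
    "$nin": lambda a, b: a not in b,
}


def matches(where: dict[str, Any], metadata: dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies `where` (illustrative only)."""
    for key, cond in where.items():
        if key == "$and":
            if not all(matches(sub, metadata) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(sub, metadata) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # {"field": {"$op": value}} form; a missing field never matches here
            if key not in metadata:
                return False
            for op, operand in cond.items():
                if not _OPS[op](metadata[key], operand):
                    return False
        else:
            # {"field": value} is shorthand for {"field": {"$eq": value}}
            if metadata.get(key) != cond:
                return False
    return True


if __name__ == "__main__":
    meta = {"source": "docs", "category": "pgvector", "year": 2024}
    print(matches({"$or": [{"category": "chromadb"}, {"category": "pgvector"}]}, meta))
    print(matches({"year": {"$gte": 2025}}, meta))
```

In a real collection the database applies the filter alongside the vector search; the sketch only shows which records a given clause would admit.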

ChromaDB is well-suited for prototyping RAG pipelines locally and scaling to production via the HTTP server mode. The ephemeral, persistent, and HTTP clients expose the same collection API, so application code can move from embedded to server mode by changing only the client constructor. The Python and JavaScript clients also share a nearly identical API.

Installation

Python Client

# Install ChromaDB
pip install chromadb

# Install alongside an embedding provider's SDK
pip install chromadb openai                  # OpenAI embeddings
pip install chromadb sentence-transformers   # Local embeddings

# Install LangChain integration
pip install langchain-chroma

# Install LlamaIndex integration
pip install llama-index-vector-stores-chroma

Docker Server Mode

# Pull and run ChromaDB server (recent 1.x images persist to /data; older 0.x images use /chroma/chroma, as mounted here)
docker pull chromadb/chroma
docker run -d \
  -p 8000:8000 \
  -v chroma-data:/chroma/chroma \
  --name chromadb \
  chromadb/chroma

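# Run with basic authentication enabled (0.5.x servers; the AUTHN variables
# replaced the older AUTH names, and credentials are expected in htpasswd format)
docker run -d \
  -p 8000:8000 \
  -v chroma-data:/chroma/chroma \
  -e CHROMA_SERVER_AUTHN_CREDENTIALS="admin:<bcrypt-hashed-password>" \
  -e CHROMA_SERVER_AUTHN_PROVIDER="chromadb.auth.basic_authn.BasicAuthenticationServerProvider" \
  chromadb/chroma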

# Docker Compose
cat > docker-compose.yml << 'EOF'
version: "3.9"
services:
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma-data:/chroma/chroma
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE
volumes:
  chroma-data:
EOF
docker compose up -d

JavaScript Client

npm install chromadb
npm install chromadb-default-embed   # Default embedding function

Configuration

Client Initialization

import chromadb
from chromadb.config import Settings

# Ephemeral (in-memory) client — data lost on exit
client = chromadb.EphemeralClient()

# Persistent client — data saved to disk
client = chromadb.PersistentClient(path="./chroma_db")

# HTTP client — connect to Docker server
client = chromadb.HttpClient(host="localhost", port=8000)

# HTTP client with authentication
client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.basic_authn.BasicAuthClientProvider",
        chroma_client_auth_credentials="admin:password"
    )
)

# Async HTTP client (the await must run inside an async function, e.g. one started with asyncio.run)
client = await chromadb.AsyncHttpClient(host="localhost", port=8000)
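Because the three client constructors return objects with the same collection API, the choice between them can be pushed into configuration. The following is a minimal sketch of that idea; the environment variable names `CHROMA_HOST`, `CHROMA_PORT`, and `CHROMA_PATH` and both helper functions are hypothetical names chosen for this example, not settings that ChromaDB itself reads.

```python
import os


def client_kwargs(env: dict[str, str]) -> tuple[str, dict]:
    """Pick a client type and its keyword arguments from environment-style settings.

    CHROMA_HOST, CHROMA_PORT and CHROMA_PATH are hypothetical names used
    only in this sketch.
    """
    host = env.get("CHROMA_HOST")
    if host:
        return "http", {"host": host, "port": int(env.get("CHROMA_PORT", "8000"))}
    path = env.get("CHROMA_PATH")
    if path:
        return "persistent", {"path": path}
    return "ephemeral", {}


def make_client(env=None):
    """Build the matching ChromaDB client; chromadb is imported lazily."""
    import chromadb  # requires `pip install chromadb`

    kind, kwargs = client_kwargs(dict(os.environ) if env is None else env)
    factories = {
        "http": chromadb.HttpClient,
        "persistent": chromadb.PersistentClient,
        "ephemeral": chromadb.EphemeralClient,
    }
    return factories[kind](**kwargs)


if __name__ == "__main__":
    # Only the pure selection logic runs here, so no server is needed.
    print(client_kwargs({"CHROMA_PATH": "./chroma_db"}))
```

Keeping the selection logic separate from client construction means it can be unit-tested without a running server.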

Embedding Functions

from chromadb.utils import embedding_functions

# Default embedding function (all-MiniLM-L6-v2, runs locally)
default_ef = embedding_functions.DefaultEmbeddingFunction()

# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-small"
)

# Sentence Transformers (local)
sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

# Cohere embeddings
cohere_ef = embedding_functions.CohereEmbeddingFunction(
    api_key="...",
    model_name="embed-english-v3.0"
)

# HuggingFace Inference API
hf_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="hf_...",
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

Core Commands/API

Method | Description
client.create_collection(name) | Create a new collection
client.get_collection(name) | Get an existing collection
client.get_or_create_collection(name) | Get or create collection
client.delete_collection(name) | Delete a collection permanently
client.list_collections() | List all collections
client.reset() | Delete all data (use with caution; requires allow_reset=True in the client settings)
collection.add(documents, ids) | Add documents with auto-embedding
collection.add(embeddings, ids) | Add pre-computed embeddings
collection.upsert(documents, ids) | Add or update documents
collection.update(ids, documents) | Update existing documents
collection.delete(ids) | Delete documents by ID
collection.get(ids) | Retrieve documents by ID
collection.query(query_texts, n_results) | Semantic similarity search
collection.query(query_embeddings, n_results) | Search with pre-computed vector
collection.peek(limit) | Preview first N documents
collection.count() | Count documents in collection
collection.modify(name, metadata) | Rename or update collection metadata

Advanced Usage

Adding Documents with Metadata

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-small"
)

collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}   # cosine | l2 | ip
)

# Add documents — embeddings generated automatically
collection.add(
    documents=[
        "ChromaDB is an open-source vector database.",
        "Qdrant is built in Rust for high performance.",
        "pgvector adds vector search to PostgreSQL."
    ],
    metadatas=[
        {"source": "docs", "category": "chromadb", "version": "0.5"},
        {"source": "docs", "category": "qdrant",   "version": "1.9"},
        {"source": "docs", "category": "pgvector",  "version": "0.7"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Add pre-computed embeddings
import numpy as np
embeddings = np.random.rand(3, 1536).tolist()
collection.add(
    embeddings=embeddings,
    documents=["text1", "text2", "text3"],
    ids=["e1", "e2", "e3"]
)
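The `hnsw:space` value chosen at collection creation decides how the `distances` returned by queries are computed. As a rough guide, the sketch below reproduces the three metrics in plain Python, following the definitions as I understand them from Chroma's documentation (squared L2, one minus inner product, one minus cosine similarity). The function names are illustrative, not part of the library.

```python
import math


def l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance ("l2")."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip(a: list[float], b: list[float]) -> float:
    """Inner-product distance ("ip"): 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine distance ("cosine"): 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    u, v = [2.0, 0.0], [1.0, 0.0]
    # Same direction, different length: cosine ignores magnitude, l2 and ip do not.
    print(l2(u, v), ip(u, v), cosine(u, v))
```

Cosine distance ignores vector length, which is why it is a common choice for text embeddings that are not guaranteed to be normalized.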

Querying with Metadata Filters

# Basic semantic search
results = collection.query(
    query_texts=["vector database performance"],
    n_results=5
)

# Access results
for doc, meta, dist, id_ in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
    results["ids"][0]
):
    print(f"[{dist:.4f}] {id_}: {doc[:60]}  | {meta}")

# Filter by metadata — where clause operators
results = collection.query(
    query_texts=["fast similarity search"],
    n_results=3,
    where={"category": "qdrant"},           # exact match
)

results = collection.query(
    query_texts=["SQL integration"],
    n_results=3,
    where={"version": {"$ne": "0.5"}},      # comparison ($gt/$gte/$lt/$lte only accept int or float values)
)

results = collection.query(
    query_texts=["open-source database"],
    n_results=5,
    where={"$or": [                          # boolean logic
        {"category": "chromadb"},
        {"category": "pgvector"}
    ]},
    where_document={"$contains": "PostgreSQL"}  # document text filter
)

# Include specific fields in results
results = collection.query(
    query_texts=["embeddings"],
    n_results=3,
    include=["documents", "metadatas", "distances", "embeddings"]
)
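Query results are column-oriented: each field holds one inner list per query text, which is why the loop above indexes `[0]`. A small helper can reshape this into one record per hit. This is a sketch that assumes the result layout shown above; `flatten_results` is a hypothetical name, and the sample dictionary only imitates what a query might return.

```python
def flatten_results(results: dict) -> list[list[dict]]:
    """Turn column-oriented query results into one list of hit records per query."""
    # Only reshape the fields that were actually included in the response.
    fields = [
        name
        for name in ("documents", "metadatas", "distances")
        if results.get(name) is not None
    ]
    flattened = []
    for q, ids in enumerate(results["ids"]):
        hits = []
        for i, id_ in enumerate(ids):
            hit = {"id": id_}
            for name in fields:
                hit[name[:-1]] = results[name][q][i]  # "documents" -> "document"
            hits.append(hit)
        flattened.append(hits)
    return flattened


if __name__ == "__main__":
    # Hand-written sample shaped like a single-query response.
    sample = {
        "ids": [["doc2", "doc1"]],
        "documents": [["Qdrant is built in Rust.", "ChromaDB is open-source."]],
        "metadatas": [[{"category": "qdrant"}, {"category": "chromadb"}]],
        "distances": [[0.12, 0.34]],
        "embeddings": None,
    }
    for hit in flatten_results(sample)[0]:
        print(hit["id"], hit["distance"], hit["document"])
```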

Batch Operations and Upsert

import uuid

def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Batch ingest a document
document_text = "..." * 1000
chunks = chunk_text(document_text)

collection.upsert(
    documents=chunks,
    metadatas=[{"chunk_index": i, "source": "report.pdf"} for i in range(len(chunks))],
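    # Deterministic IDs: re-running the ingest overwrites chunks instead of duplicating them
    ids=[str(uuid.uuid5(uuid.NAMESPACE_URL, f"report.pdf#{i}")) for i in range(len(chunks))]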
)

# Update metadata on existing docs
collection.update(
    ids=["doc1", "doc2"],
    metadatas=[
        {"source": "docs", "category": "chromadb", "reviewed": True},
        {"source": "docs", "category": "qdrant",   "reviewed": True}
    ]
)

# Delete by metadata filter
collection.delete(where={"category": "outdated"})
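For large ingests it can help to split the upsert into fixed-size batches so that a single call does not carry the whole corpus. The helper below is a minimal sketch: `upsert_in_batches` and its default `batch_size` of 100 are assumptions made for illustration, and the only collection method it relies on is `upsert(documents=..., metadatas=..., ids=...)` as used above.

```python
def upsert_in_batches(collection, documents, metadatas, ids, batch_size=100):
    """Upsert parallel lists in slices of `batch_size`; returns the number of calls made."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas and ids must have the same length")

    calls = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.upsert(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
        calls += 1
    return calls


if __name__ == "__main__":
    class RecordingCollection:
        """Stand-in for a real collection so the sketch runs without a database."""

        def __init__(self):
            self.batches = []

        def upsert(self, documents, metadatas, ids):
            self.batches.append(list(ids))

    fake = RecordingCollection()
    texts = [f"chunk {i}" for i in range(5)]
    metas = [{"chunk_index": i} for i in range(5)]
    keys = [f"report.pdf#{i}" for i in range(5)]
    print(upsert_in_batches(fake, texts, metas, keys, batch_size=2), fake.batches)
```

With a real collection, pass the object returned by `client.get_or_create_collection(...)` in place of the stand-in.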

LangChain Integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create or load a Chroma vector store
vectorstore = Chroma(
    collection_name="langchain_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain"
)

# Add documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("document.pdf")
pages = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.split_documents(pages)
vectorstore.add_documents(docs)

# Build RAG chain
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5, "filter": {"source": "document.pdf"}}
)

llm = ChatOpenAI(model="gpt-4o-mini")
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = chain.invoke({"query": "What is the main topic?"})
answer = result["result"]   # invoke returns a dict; the answer text is under "result"

LlamaIndex Integration

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

chroma_client = chromadb.PersistentClient(path="./chroma_llama")
chroma_collection = chroma_client.get_or_create_collection("llama_docs")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build index from documents
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Query the index
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the key findings.")
print(response)

# Load existing index (no re-embedding)
index = VectorStoreIndex.from_vector_store(vector_store)

Common Workflows

RAG Pipeline from Scratch

import chromadb
from chromadb.utils import embedding_functions
import openai

client = chromadb.PersistentClient(path="./rag_db")
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...", model_name="text-embedding-3-small"
)
collection = client.get_or_create_collection("rag", embedding_function=ef)

def ingest_documents(docs: list[dict]):
    """docs: [{"text": ..., "source": ..., "id": ...}]"""
    collection.upsert(
        documents=[d["text"] for d in docs],
        metadatas=[{"source": d["source"]} for d in docs],
        ids=[d["id"] for d in docs]
    )

def retrieve(query: str, k: int = 5, source_filter: str = None) -> list[str]:
    where = {"source": source_filter} if source_filter else None
    results = collection.query(
        query_texts=[query], n_results=k, where=where
    )
    return results["documents"][0]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user",   "content": query}
        ]
    )
    return response.choices[0].message.content
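The `answer` function above concatenates every retrieved chunk into the system prompt. One possible refinement, sketched here under the assumption that a simple character budget is an acceptable stand-in for token counting, is to keep chunks in rank order until the budget is used up. The `build_messages` helper and its `max_chars` default are hypothetical.

```python
def build_messages(query: str, chunks: list[str], max_chars: int = 4000) -> list[dict]:
    """Assemble chat messages, keeping retrieved chunks in rank order within a budget."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += len(chunk)
    context = "\n\n".join(kept)
    return [
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": query},
    ]


if __name__ == "__main__":
    demo = build_messages("What is pgvector?", ["pgvector adds vector search.", "x" * 5000])
    print(demo[0]["content"])
```

The returned list has the same shape as the `messages` argument used in `answer`, so it could be passed there directly.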

Multi-Modal Collection (Images)

from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

image_ef = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()

collection = client.get_or_create_collection(
    name="images",
    embedding_function=image_ef,
    data_loader=image_loader
)

# Add images by URI
collection.add(
    uris=["./img1.jpg", "./img2.png"],
    ids=["img1", "img2"],
    metadatas=[{"label": "cat"}, {"label": "dog"}]
)

# Query by image URI or text
results = collection.query(
    query_uris=["./query.jpg"],
    n_results=3,
    include=["uris", "distances", "metadatas"]
)

Backup and Restore

# Backup persistent storage
tar -czf chroma_backup_$(date +%Y%m%d).tar.gz ./chroma_db/

# Docker volume backup
docker run --rm \
  -v chroma-data:/data \
  -v $(pwd):/backup \
  alpine tar czf /backup/chroma_backup.tar.gz -C /data .

# Restore (the archive already contains the chroma_db/ directory, so extract from its parent directory)
tar -xzf chroma_backup_20240101.tar.gz
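The same backup and restore cycle can be scripted in Python with the standard library. This is a minimal sketch, assuming the persistent directory is not being written to while the archive is created; `backup_dir` and `restore_dir` are hypothetical helper names. Storing paths relative to the data directory (`arcname="."`) lets the archive be restored into any target directory.

```python
import tarfile
from datetime import date
from pathlib import Path


def backup_dir(src: str, dest_dir: str = ".") -> Path:
    """Archive the contents of `src` into a dated .tar.gz inside `dest_dir`."""
    src_path = Path(src)
    if not src_path.is_dir():
        raise FileNotFoundError(f"not a directory: {src}")
    archive = Path(dest_dir) / f"chroma_backup_{date.today():%Y%m%d}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores member paths relative to the data directory
        tar.add(src_path, arcname=".")
    return archive


def restore_dir(archive: str, dest: str) -> None:
    """Extract an archive produced by backup_dir into `dest`."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)  # only use with archives you created yourself
```

A typical call would be `backup_dir("./chroma_db", "./backups")` followed later by `restore_dir(path_to_archive, "./chroma_db")`.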

Tips and Best Practices

Tip | Details
Choose cosine distance | Set metadata={"hnsw:space": "cosine"} at collection creation; cannot be changed later
Use upsert over add | Prevents duplicate ID errors during incremental ingestion
Pre-compute embeddings | Generate embeddings in batch with openai.embeddings.create(input=[...]) then pass directly
Filter before query | Narrow the search space with where to improve precision and speed
Tune HNSW parameters | hnsw:M (connectivity, default 16) and hnsw:ef_construction (build accuracy, default 100) control index quality
Monitor collection size | Use collection.count() to track growth; plan for scaling to server mode
Avoid tiny chunks | Chunks under ~100 tokens lose context; 300-800 tokens is typically optimal
Use IDs for deduplication | Stable, deterministic IDs (e.g., hash of content) allow safe re-ingestion
Enable telemetry opt-out | Set ANONYMIZED_TELEMETRY=FALSE in production environments
Test with peek() | Use collection.peek(5) to verify data shape before running full queries
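The deduplication tip can be made concrete with a content hash. The sketch below derives a stable ID from the source name and whitespace-normalized text, so re-ingesting unchanged content with `upsert` targets the same record. The `content_id` helper, the normalization rule, and the 32-character truncation are illustrative choices, not a ChromaDB convention.

```python
import hashlib


def content_id(text: str, source: str = "") -> str:
    """Derive a deterministic ID from a document's source and normalized text."""
    normalized = " ".join(text.split())  # collapse runs of whitespace
    digest = hashlib.sha256(f"{source}\n{normalized}".encode("utf-8")).hexdigest()
    return digest[:32]  # shortened for readability; keep the full digest if preferred


if __name__ == "__main__":
    docs = ["ChromaDB is an open-source vector database.", "ChromaDB  is an open-source vector database."]
    # Both strings normalize to the same text, so they share one ID.
    print({content_id(d, source="docs") for d in docs})
```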