Cognita Cheat Sheet

Overview

Cognita is an open-source modular RAG framework developed by TrueFoundry that organizes RAG pipelines into distinct, swappable components: data loaders, parsers, embedders, vector stores, retrievers, and query engines. It provides both a no-code UI and a Python API for building production-grade retrieval-augmented generation systems with full control over each pipeline stage.

The framework emphasizes production readiness with features like incremental indexing, configurable chunking strategies, metadata filtering, hybrid search, and reranking. Cognita supports multiple LLM providers and vector databases out of the box and can be deployed locally or on Kubernetes via TrueFoundry.

Installation

Local Setup

git clone https://github.com/truefoundry/cognita.git
cd cognita

# Install dependencies
pip install -r requirements.txt

# Copy and edit configuration
cp .env.example .env
# Edit .env with your API keys and settings

# Start the backend
python -m uvicorn backend.server:app --host 0.0.0.0 --port 8000

# Start the frontend (new terminal)
cd frontend
npm install
npm run dev

Docker Compose

git clone https://github.com/truefoundry/cognita.git
cd cognita

cp .env.example .env
# Configure .env

docker compose up -d
# API at http://localhost:8000
# UI at http://localhost:3000

Environment Variables

# .env configuration
OPENAI_API_KEY=sk-...
EMBEDDING_MODEL=text-embedding-3-small
LLM_MODEL=gpt-4o
VECTOR_DB=qdrant
QDRANT_URL=http://localhost:6333
METADATA_STORE_TYPE=sqlite
METADATA_STORE_CONFIG='{"url": "sqlite:///cognita.db"}'
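
Before starting the backend, it can help to check that the settings above are present and well-formed. The following is a minimal sketch, not Cognita's own loader: the `load_settings` helper and the choice of which keys count as required are assumptions for illustration; only the variable names come from the `.env` example.

```python
import json
import os

# Variable names from the .env example; treating them as required is an
# assumption of this sketch, not documented Cognita behavior.
REQUIRED_KEYS = ["OPENAI_API_KEY", "EMBEDDING_MODEL", "LLM_MODEL", "VECTOR_DB"]


def load_settings(env=None):
    """Return a settings dict, raising ValueError on missing or malformed values."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")

    settings = {key: env[key] for key in REQUIRED_KEYS}
    # METADATA_STORE_CONFIG holds a JSON object, so decode it up front
    raw_config = env.get("METADATA_STORE_CONFIG", "{}")
    try:
        settings["METADATA_STORE_CONFIG"] = json.loads(raw_config)
    except json.JSONDecodeError as exc:
        raise ValueError(f"METADATA_STORE_CONFIG is not valid JSON: {exc}") from exc
    return settings


# Values copied from the example above (after the shell quotes are stripped)
example_env = {
    "OPENAI_API_KEY": "sk-...",
    "EMBEDDING_MODEL": "text-embedding-3-small",
    "LLM_MODEL": "gpt-4o",
    "VECTOR_DB": "qdrant",
    "METADATA_STORE_CONFIG": '{"url": "sqlite:///cognita.db"}',
}
settings = load_settings(example_env)
print(settings["METADATA_STORE_CONFIG"]["url"])
```

Calling `load_settings()` with no argument reads from the process environment instead of the example dict.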

Core Components

Pipeline Architecture

Component	Role	Options
Data Source	Where documents come from	Local files, S3, GCS, web URLs
Parser	Extract text from documents	PDF, DOCX, HTML, Markdown, Unstructured
Chunker	Split text into segments	Recursive, Token-based, Markdown-aware
Embedder	Generate vector embeddings	OpenAI, HuggingFace, Cohere
Vector Store	Store and index embeddings	Qdrant, Chroma, Weaviate, Milvus
Retriever	Fetch relevant chunks	Similarity, MMR, hybrid search
Reranker	Re-score retrieved chunks	Cohere, Cross-encoder
LLM	Generate final answer	OpenAI, Anthropic, Ollama

REST API

# Create a collection (knowledge base)
curl -X POST http://localhost:8000/v1/collections \
  -H "Content-Type: application/json" \
  -d '{
    "name": "product-docs",
    "description": "Product documentation",
    "embedder_config": {
      "provider": "openai",
      "config": {"model": "text-embedding-3-small"}
    }
  }'

# Add a data source
curl -X POST http://localhost:8000/v1/data_sources \
  -H "Content-Type: application/json" \
  -d '{
    "type": "local",
    "uri": "/data/documents/",
    "collection_name": "product-docs"
  }'

# Trigger indexing
curl -X POST http://localhost:8000/v1/collections/product-docs/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "data_source_id": "ds-123",
    "parser_config": {
      "chunk_size": 1000,
      "chunk_overlap": 200
    }
  }'

# Query the collection
curl -X POST http://localhost:8000/v1/collections/product-docs/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure authentication?",
    "top_k": 5,
    "llm_config": {
      "provider": "openai",
      "config": {"model": "gpt-4o", "temperature": 0.1}
    }
  }'
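
The query call above can also be issued from Python. This is a rough sketch using only the standard library; the endpoint path and payload fields are copied from the curl example, while the `build_query_request` and `send` helpers are hypothetical names introduced here and are not part of Cognita.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # API address used in the curl examples


def build_query_request(collection, query, top_k=5, model="gpt-4o"):
    """Build (but do not send) the POST request for the query endpoint."""
    payload = {
        "query": query,
        "top_k": top_k,
        "llm_config": {
            "provider": "openai",
            "config": {"model": model, "temperature": 0.1},
        },
    }
    return urllib.request.Request(
        # Percent-encode the collection name so it is safe inside the path
        url=f"{BASE_URL}/v1/collections/{quote(collection, safe='')}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_query_request("product-docs", "How do I configure authentication?")
print(request.get_method(), request.full_url)
# With a backend running: answer = send(request)
```

Separating request construction from sending keeps the payload easy to inspect or log before anything touches the network.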

Configuration

Collection Configuration

# config/collection.yaml
collection:
  name: "tech-docs"
  embedder:
    provider: openai
    config:
      model: text-embedding-3-small
      dimensions: 1536

  chunker:
    type: recursive
    config:
      chunk_size: 1000
      chunk_overlap: 200
      separators: ["\n\n", "\n", " ", ""]

  vector_store:
    provider: qdrant
    config:
      url: http://localhost:6333
      collection_name: tech-docs
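
To make the `chunk_size`, `chunk_overlap`, and `separators` settings concrete, here is a simplified sketch of recursive splitting. It is not Cognita's implementation: the function names are hypothetical, pieces are rejoined with a single space (so the original separators are not preserved), and overlap is approximated by repeating whole trailing pieces.

```python
def split_recursive(text, chunk_size, separators):
    """Break text into pieces of at most chunk_size chars, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # No separator left: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(separators[0]):
        pieces.extend(split_recursive(part, chunk_size, separators[1:]))
    return pieces


def merge_with_overlap(pieces, chunk_size, chunk_overlap):
    """Greedily pack pieces into chunks, repeating trailing pieces as overlap."""
    chunks, current = [], []
    for piece in pieces:
        if current and len(" ".join(current + [piece])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing pieces that fit the overlap budget and
            # still leave room for the incoming piece
            tail = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > chunk_overlap:
                    break
                if len(" ".join([prev] + tail + [piece])) > chunk_size:
                    break
                tail.insert(0, prev)
            current = tail
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_text(text, chunk_size, chunk_overlap, separators=("\n\n", "\n", " ", "")):
    """Split on the separators in order, then pack the pieces into chunks."""
    pieces = split_recursive(text, chunk_size, list(separators))
    return merge_with_overlap(pieces, chunk_size, chunk_overlap)


# Tiny sizes so the overlap is visible; the config above uses 1000 and 200
for chunk in chunk_text("alpha beta gamma delta epsilon zeta", 16, 6):
    print(repr(chunk))
```

Each chunk stays within `chunk_size`, and consecutive chunks share a trailing piece whenever one fits inside `chunk_overlap`.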

Parser Configuration

# Supported parsers and their configs
parsers = {
    "pdf": {
        "type": "unstructured",
        "config": {
            "strategy": "hi_res",
            "extract_tables": True,
            "extract_images": False
        }
    },
    "markdown": {
        "type": "markdown",
        "config": {
            "preserve_headers": True,
            "code_block_handling": "keep"
        }
    },
    "html": {
        "type": "html",
        "config": {
            "remove_scripts": True,
            "extract_links": True
        }
    }
}

Retriever Configuration

retriever_config = {
    "type": "hybrid",
    "config": {
        "dense_weight": 0.7,
        "sparse_weight": 0.3,
        "top_k": 10,
        "reranker": {
            "provider": "cohere",
            "model": "rerank-english-v3.0",
            "top_n": 5
        },
        "metadata_filter": {
            "source": "api-docs",
            "version": {"$gte": "2.0"}
        }
    }
}
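
The weights and filter above can be read as follows. This sketch assumes that hybrid search combines dense and sparse scores with a weighted sum and that `$gte` means "greater than or equal"; the source does not specify how Cognita fuses scores or evaluates filters, so `fuse_scores` and `matches` are hypothetical helpers. Scores are assumed to be already normalized to a common range.

```python
def fuse_scores(dense, sparse, dense_weight=0.7, sparse_weight=0.3, top_k=10):
    """Combine per-document dense and sparse scores with a weighted sum."""
    fused = {
        doc_id: dense_weight * dense.get(doc_id, 0.0)
        + sparse_weight * sparse.get(doc_id, 0.0)
        for doc_id in set(dense) | set(sparse)
    }
    # Highest score first; ties broken by document id for stable output
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


def matches(metadata, metadata_filter):
    """Return True if metadata satisfies every condition in the filter."""
    for field, condition in metadata_filter.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            # Only the $gte operator from the example is handled here
            threshold = condition.get("$gte")
            if threshold is not None and (value is None or value < threshold):
                return False
        elif value != condition:
            return False
    return True


dense = {"a": 0.9, "b": 0.4, "c": 0.1}
sparse = {"b": 1.0, "c": 0.5, "d": 0.8}
for doc_id, score in fuse_scores(dense, sparse, top_k=3):
    print(doc_id, round(score, 2))

metadata_filter = {"source": "api-docs", "version": {"$gte": "2.0"}}
print(matches({"source": "api-docs", "version": "2.1"}, metadata_filter))
```

Because `version` is a string in the example, this sketch compares it lexicographically, so "10.0" would sort before "2.0"; storing versions as numbers avoids that pitfall.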

Advanced Usage

Custom Components

import fitz  # PyMuPDF

from cognita.base import BaseParser, BaseChunker

# Custom parser
class CustomPDFParser(BaseParser):
    def __init__(self, config):
        self.config = config

    def parse(self, file_path: str) -> list[dict]:
        # Extract text page by page; the context manager closes the file
        pages = []
        with fitz.open(file_path) as doc:
            for page in doc:
                pages.append({
                    "text": page.get_text(),
                    "metadata": {"page": page.number + 1}  # 1-based page number
                })
        return pages

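# Custom chunker
class SemanticChunker(BaseChunker):
    def chunk(self, documents: list[dict]) -> list[dict]:
        # Assumes the base class stores the config dict as self.config
        chunks = []
        for doc in documents:
            # Placeholder for semantic splitting: group sentences until
            # the chunk exceeds chunk_size characters
            sentences = doc["text"].split(". ")
            current_chunk = []
            for sentence in sentences:
                current_chunk.append(sentence)
                if len(". ".join(current_chunk)) > self.config["chunk_size"]:
                    chunks.append({
                        "text": ". ".join(current_chunk),
                        "metadata": doc["metadata"]
                    })
                    current_chunk = []
            # Flush trailing sentences so the end of the document is not dropped
            if current_chunk:
                chunks.append({
                    "text": ". ".join(current_chunk),
                    "metadata": doc["metadata"]
                })
        return chunks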

Incremental Indexing

# Index only new/modified documents
curl -X POST http://localhost:8000/v1/collections/product-docs/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "data_source_id": "ds-123",
    "incremental": true,
    "parser_config": {
      "chunk_size": 1000,
      "chunk_overlap": 200
    }
  }'

# Delete specific documents from index
curl -X DELETE http://localhost:8000/v1/collections/product-docs/documents \
  -H "Content-Type: application/json" \
  -d '{
    "document_ids": ["doc-1", "doc-2"]
  }'
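
The source does not say how Cognita decides which documents are new or modified. One common approach, sketched here as an assumption, is to store a content hash per file and compare it on the next run; `plan_incremental_update` is a hypothetical helper, not a Cognita function.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint file contents so changes can be detected without a diff."""
    return hashlib.sha256(data).hexdigest()


def plan_incremental_update(previous, current_files):
    """Compare hashes stored from the last run with the current file contents.

    previous: dict mapping path -> hash recorded at the last indexing run
    current_files: dict mapping path -> current file contents as bytes
    Returns (paths to index, paths to delete, new hash state).
    """
    new_state = {path: content_hash(data) for path, data in current_files.items()}
    # New files have no previous hash; modified files have a different one
    to_index = sorted(p for p, h in new_state.items() if previous.get(p) != h)
    # Files that disappeared from the source should leave the index too
    to_delete = sorted(p for p in previous if p not in new_state)
    return to_index, to_delete, new_state


previous = {
    "a.md": content_hash(b"old text"),
    "b.md": content_hash(b"unchanged"),
    "c.md": content_hash(b"removed later"),
}
current_files = {"a.md": b"new text", "b.md": b"unchanged", "d.md": b"brand new"}

to_index, to_delete, new_state = plan_incremental_update(previous, current_files)
print("index:", to_index)
print("delete:", to_delete)
```

Persisting `new_state` after a successful run gives the next run its baseline.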

Multi-Collection Querying

# Query across multiple collections
curl -X POST http://localhost:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "deployment best practices",
    "collections": ["product-docs", "engineering-wiki"],
    "top_k": 5,
    "merge_strategy": "interleave"
  }'
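
The source does not define the `interleave` merge strategy. A plausible reading, sketched here as an assumption, is a round-robin merge: take the top hit from each collection, then the second from each, and so on, skipping duplicates. The `interleave` function and the `id` field on each hit are hypothetical.

```python
from itertools import zip_longest


def interleave(result_lists, top_k=5):
    """Round-robin merge of ranked result lists, dropping duplicate ids."""
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so longer lists are not cut off
    for rank_group in zip_longest(*result_lists):
        for hit in rank_group:
            if hit is None or hit["id"] in seen:
                continue
            seen.add(hit["id"])
            merged.append(hit)
    return merged[:top_k]


product_docs = [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}]
engineering_wiki = [{"id": "w1"}, {"id": "p2"}, {"id": "w3"}]

merged = interleave([product_docs, engineering_wiki], top_k=5)
print([hit["id"] for hit in merged])
```

Round-robin merging avoids comparing raw scores across collections, which may not be on the same scale when the collections use different embedders.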

Troubleshooting

Issue	Solution
Qdrant connection refused	Start Qdrant: `docker run -p 6333:6333 qdrant/qdrant`
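Embedding dimension mismatch	Ensure the embedder's output dimension matches the vector store collection (1536 for text-embedding-3-small); re-create the collection after switching embedding models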
Indexing fails on large PDFs	Increase parser timeout, use chunked upload
Out of memory during indexing	Reduce batch size, process files incrementally
Empty retrieval results	Check collection has indexed docs, lower score threshold
Slow query response	Reduce top_k, optimize chunk size; a reranker improves relevance but adds latency, so rerank fewer candidates or disable it
API key errors	Verify `.env` file is loaded, check key format
Frontend not loading	Check CORS settings, verify API URL in frontend config

# Check API health
curl http://localhost:8000/health

# List collections
curl http://localhost:8000/v1/collections

# View collection stats
curl http://localhost:8000/v1/collections/product-docs/stats

# View logs
docker compose logs -f backend