
Cognita Cheat Sheet

Overview

Cognita is an open-source modular RAG framework developed by TrueFoundry that organizes RAG pipelines into distinct, swappable components: data loaders, parsers, embedders, vector stores, retrievers, and query engines. It provides both a no-code UI and a Python API for building production-grade retrieval-augmented generation systems with full control over each pipeline stage.

The framework emphasizes production readiness with features like incremental indexing, configurable chunking strategies, metadata filtering, hybrid search, and reranking. Cognita supports multiple LLM providers and vector databases out of the box and can be deployed locally or on Kubernetes via TrueFoundry.

Installation

Local Setup

git clone https://github.com/truefoundry/cognita.git
cd cognita

# Install dependencies
pip install -r requirements.txt

# Copy and edit configuration
cp .env.example .env
# Edit .env with your API keys and settings

# Start the backend
python -m uvicorn backend.server:app --host 0.0.0.0 --port 8000

# Start the frontend (new terminal)
cd frontend
npm install
npm run dev

Docker Compose

git clone https://github.com/truefoundry/cognita.git
cd cognita

cp .env.example .env
# Configure .env

docker compose up -d
# API at http://localhost:8000
# UI at http://localhost:3000

Environment Variables

# .env configuration
OPENAI_API_KEY=sk-...
EMBEDDING_MODEL=text-embedding-3-small
LLM_MODEL=gpt-4o
VECTOR_DB=qdrant
QDRANT_URL=http://localhost:6333
METADATA_STORE_TYPE=sqlite
METADATA_STORE_CONFIG='{"url": "sqlite:///cognita.db"}'
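
A misconfigured .env is a common source of startup errors, so it can help to check the file before launching the backend. The following is a minimal sketch, not part of Cognita: the helper names `parse_env` and `validate`, and the choice of which keys count as required, are assumptions made for illustration.

```python
import json

# Assumption: these are the keys this sheet's example treats as essential.
REQUIRED_KEYS = ("OPENAI_API_KEY", "EMBEDDING_MODEL", "LLM_MODEL", "VECTOR_DB")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes (as in METADATA_STORE_CONFIG).
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


def validate(settings: dict[str, str]) -> dict:
    """Raise on missing keys and return the decoded metadata store config."""
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError(f"missing settings: {', '.join(missing)}")
    # METADATA_STORE_CONFIG holds a JSON object encoded as a string.
    return json.loads(settings.get("METADATA_STORE_CONFIG", "{}"))


example = """
# .env configuration
OPENAI_API_KEY=sk-test
EMBEDDING_MODEL=text-embedding-3-small
LLM_MODEL=gpt-4o
VECTOR_DB=qdrant
QDRANT_URL=http://localhost:6333
METADATA_STORE_TYPE=sqlite
METADATA_STORE_CONFIG='{"url": "sqlite:///cognita.db"}'
"""
settings = parse_env(example)
store_config = validate(settings)
print(store_config["url"])
```

In practice the text would come from reading the .env file rather than an inline string.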

Core Components

Pipeline Architecture

Component | Role | Options
Data Source | Where documents come from | Local files, S3, GCS, web URLs
Parser | Extract text from documents | PDF, DOCX, HTML, Markdown, Unstructured
Chunker | Split text into segments | Recursive, Token-based, Markdown-aware
Embedder | Generate vector embeddings | OpenAI, HuggingFace, Cohere
Vector Store | Store and index embeddings | Qdrant, Chroma, Weaviate, Milvus
Retriever | Fetch relevant chunks | Similarity, MMR, hybrid search
Reranker | Re-score retrieved chunks | Cohere, Cross-encoder
LLM | Generate final answer | OpenAI, Anthropic, Ollama
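
To make the "swappable component" idea concrete, here is a toy sketch of how the stages in the table chain together. It is not Cognita's code: the `Pipeline` class, `toy_embed`, and the three-word vocabulary are hypothetical stand-ins, and an in-memory list plays the role of the vector store.

```python
from dataclasses import dataclass, field
from typing import Callable

VOCAB = ("auth", "deploy", "index")  # hypothetical vocabulary for the toy embedder


def toy_embed(text: str) -> list[float]:
    """Count vocabulary words; a stand-in for a real embedding model."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Pipeline:
    """Each stage is a plain callable, so any one can be replaced independently."""
    parse: Callable[[str], str]
    chunk: Callable[[str], list[str]]
    embed: Callable[[str], list[float]]
    index: list[tuple[list[float], str]] = field(default_factory=list)

    def ingest(self, raw: str) -> int:
        """Parse, chunk, embed, and store a document; return the chunk count."""
        chunks = self.chunk(self.parse(raw))
        self.index.extend((self.embed(c), c) for c in chunks)
        return len(chunks)

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        """Return the top_k stored chunks by dot-product similarity."""
        q = self.embed(query)
        ranked = sorted(self.index, key=lambda item: -dot(q, item[0]))
        return [text for _, text in ranked[:top_k]]


pipeline = Pipeline(
    parse=str.strip,
    chunk=lambda text: [p for p in text.split("\n\n") if p],
    embed=toy_embed,
)
count = pipeline.ingest("Auth uses API keys.\n\nDeploy with Docker.\n")
print(count, pipeline.retrieve("how to deploy"))
```

Swapping the chunker or embedder means passing a different callable; the rest of the pipeline is unchanged.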

REST API

# Create a collection (knowledge base)
curl -X POST http://localhost:8000/v1/collections \
  -H "Content-Type: application/json" \
  -d '{
    "name": "product-docs",
    "description": "Product documentation",
    "embedder_config": {
      "provider": "openai",
      "config": {"model": "text-embedding-3-small"}
    }
  }'

# Add a data source
curl -X POST http://localhost:8000/v1/data_sources \
  -H "Content-Type: application/json" \
  -d '{
    "type": "local",
    "uri": "/data/documents/",
    "collection_name": "product-docs"
  }'

# Trigger indexing
curl -X POST http://localhost:8000/v1/collections/product-docs/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "data_source_id": "ds-123",
    "parser_config": {
      "chunk_size": 1000,
      "chunk_overlap": 200
    }
  }'

# Query the collection
curl -X POST http://localhost:8000/v1/collections/product-docs/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure authentication?",
    "top_k": 5,
    "llm_config": {
      "provider": "openai",
      "config": {"model": "gpt-4o", "temperature": 0.1}
    }
  }'
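
The same calls can be scripted from Python. This sketch uses only the standard library and mirrors the endpoint paths and payloads shown in the curl examples above; the helper names `build_post` and `send` are illustrative, and the response shape is not assumed beyond being JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given API path."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


query_request = build_post(
    "/v1/collections/product-docs/query",
    {
        "query": "How do I configure authentication?",
        "top_k": 5,
        "llm_config": {
            "provider": "openai",
            "config": {"model": "gpt-4o", "temperature": 0.1},
        },
    },
)

# Requires a running backend:
# answer = send(query_request)
```

Separating request construction from sending keeps the payloads easy to inspect and unit-test without a live server.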

Configuration

Collection Configuration

# config/collection.yaml
collection:
  name: "tech-docs"
  embedder:
    provider: openai
    config:
      model: text-embedding-3-small
      dimensions: 1536

  chunker:
    type: recursive
    config:
      chunk_size: 1000
      chunk_overlap: 200
      separators: ["\n\n", "\n", " ", ""]

  vector_store:
    provider: qdrant
    config:
      url: http://localhost:6333
      collection_name: tech-docs
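
The chunker settings interact in a way that is easier to see in code. Below is a simplified sketch of recursive splitting with overlap, written for illustration only; it is not Cognita's implementation, and the function names are hypothetical. Pieces are rejoined with a single space, so the original separators are normalized.

```python
from typing import Sequence


def recursive_split(text: str, chunk_size: int, separators: Sequence[str]) -> list[str]:
    """Break text into pieces of at most chunk_size characters.

    Separators are tried in order, so paragraph breaks are preferred
    over line breaks, and line breaks over spaces.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces: list[str] = []
    for part in text.split(separators[0]):
        pieces.extend(recursive_split(part, chunk_size, separators[1:]))
    return pieces


def merge_pieces(pieces: list[str], chunk_size: int, chunk_overlap: int) -> list[str]:
    """Greedily pack pieces into chunks, repeating trailing pieces as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        if current and len(" ".join(current + [piece])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing pieces that fit within the overlap budget.
            kept: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + kept)) > chunk_overlap:
                    break
                kept.insert(0, prev)
            # Drop overlap that would push the next chunk past chunk_size.
            while kept and len(" ".join(kept + [piece])) > chunk_size:
                kept.pop(0)
            current = kept
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200,
               separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return merge_pieces(recursive_split(text, chunk_size, separators),
                        chunk_size, chunk_overlap)


chunks = chunk_text("alpha beta\n\ngamma delta epsilon\n\nzeta",
                    chunk_size=14, chunk_overlap=5)
print(chunks)  # ['alpha beta', 'gamma delta', 'delta epsilon', 'zeta']
```

Note how "delta" appears in two consecutive chunks: that repetition is what `chunk_overlap` buys, at the cost of indexing some text twice.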

Parser Configuration

# Supported parsers and their configs
parsers = {
    "pdf": {
        "type": "unstructured",
        "config": {
            "strategy": "hi_res",
            "extract_tables": True,
            "extract_images": False
        }
    },
    "markdown": {
        "type": "markdown",
        "config": {
            "preserve_headers": True,
            "code_block_handling": "keep"
        }
    },
    "html": {
        "type": "html",
        "config": {
            "remove_scripts": True,
            "extract_links": True
        }
    }
}
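
One way to use a mapping like this is to pick the parser entry from a file's extension. The sketch below is illustrative: `EXTENSION_TO_PARSER` and `parser_for` are hypothetical names, and the dictionary is a trimmed copy of the configuration above.

```python
from pathlib import Path

# Trimmed copy of the parser configs shown above.
PARSERS = {
    "pdf": {"type": "unstructured", "config": {"strategy": "hi_res"}},
    "markdown": {"type": "markdown", "config": {"preserve_headers": True}},
    "html": {"type": "html", "config": {"remove_scripts": True}},
}

# Assumption: which file extensions map to which parser key.
EXTENSION_TO_PARSER = {
    ".pdf": "pdf",
    ".md": "markdown",
    ".markdown": "markdown",
    ".html": "html",
    ".htm": "html",
}


def parser_for(path: str) -> dict:
    """Return the parser config for a file, based on its (case-insensitive) suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSERS[EXTENSION_TO_PARSER[suffix]]
    except KeyError:
        raise ValueError(f"no parser configured for {path!r}") from None


print(parser_for("docs/Guide.MD")["type"])
```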

Retriever Configuration

retriever_config = {
    "type": "hybrid",
    "config": {
        "dense_weight": 0.7,
        "sparse_weight": 0.3,
        "top_k": 10,
        "reranker": {
            "provider": "cohere",
            "model": "rerank-english-v3.0",
            "top_n": 5
        },
        "metadata_filter": {
            "source": "api-docs",
            "version": {"$gte": "2.0"}
        }
    }
}
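
The `dense_weight` and `sparse_weight` settings suggest a weighted combination of the two result lists. One common way to do this is to min-max normalize each score set and take a weighted sum, sketched below. This is an assumption about how such weights are typically applied, not a description of Cognita's internals; the function names and sample scores are made up.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (value - low) / (high - low) for doc, value in scores.items()}


def hybrid_rank(dense: dict[str, float], sparse: dict[str, float],
                dense_weight: float = 0.7, sparse_weight: float = 0.3,
                top_k: int = 10) -> list[tuple[str, float]]:
    """Fuse two score sets by weighted sum; a missing score counts as 0."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc: dense_weight * d.get(doc, 0.0) + sparse_weight * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    # Sort by score descending, then by id for a deterministic order.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))[:top_k]


dense_scores = {"a": 0.9, "b": 0.5, "c": 0.1}      # e.g. cosine similarities
sparse_scores = {"b": 12.0, "c": 3.0, "d": 6.0}    # e.g. BM25 scores
ranked = hybrid_rank(dense_scores, sparse_scores)
print([doc for doc, _ in ranked])
```

A reranker with `top_n: 5` would then re-score only the head of this fused list.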

Advanced Usage

Custom Components

import fitz  # PyMuPDF

from cognita.base import BaseParser, BaseChunker, BaseRetriever

# Custom parser: emits one document per PDF page
class CustomPDFParser(BaseParser):
    def __init__(self, config):
        self.config = config

    def parse(self, file_path: str) -> list[dict]:
        pages = []
        # The context manager closes the file even if extraction fails
        with fitz.open(file_path) as doc:
            for page in doc:
                pages.append({
                    "text": page.get_text(),
                    "metadata": {"page": page.number + 1}
                })
        return pages

# Custom chunker: groups whole sentences until chunk_size is exceeded
# (a simple stand-in for true semantic splitting)
class SemanticChunker(BaseChunker):
    def __init__(self, config):
        self.config = config

    def chunk(self, documents: list[dict]) -> list[dict]:
        chunks = []
        for doc in documents:
            sentences = doc["text"].split(". ")
            current_chunk = []
            for sentence in sentences:
                current_chunk.append(sentence)
                if len(". ".join(current_chunk)) > self.config["chunk_size"]:
                    chunks.append({
                        "text": ". ".join(current_chunk),
                        "metadata": doc["metadata"]
                    })
                    current_chunk = []
            # Flush leftover sentences so the tail of each document is not dropped
            if current_chunk:
                chunks.append({
                    "text": ". ".join(current_chunk),
                    "metadata": doc["metadata"]
                })
        return chunks

Incremental Indexing

# Index only new/modified documents
curl -X POST http://localhost:8000/v1/collections/product-docs/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "data_source_id": "ds-123",
    "incremental": true,
    "parser_config": {
      "chunk_size": 1000,
      "chunk_overlap": 200
    }
  }'

# Delete specific documents from index
curl -X DELETE http://localhost:8000/v1/collections/product-docs/documents \
  -H "Content-Type: application/json" \
  -d '{
    "document_ids": ["doc-1", "doc-2"]
  }'
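
Incremental indexing depends on knowing which documents changed since the last run. A common approach is to compare content hashes, sketched below. This illustrates the idea only: how Cognita tracks changes is not specified here, and `fingerprint` and `plan_incremental` are hypothetical names.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to detect modified documents."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental(previous: dict[str, str], current_files: dict[str, bytes]
                     ) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare stored hashes with current files.

    Returns (paths to index, paths to delete, new hash state).
    """
    new_state = {path: fingerprint(data) for path, data in current_files.items()}
    # New or modified: hash is absent from, or differs from, the previous state.
    to_index = sorted(p for p, h in new_state.items() if previous.get(p) != h)
    # Removed at the source: present before, missing now.
    to_delete = sorted(p for p in previous if p not in new_state)
    return to_index, to_delete, new_state


first_run = {"a.md": b"one", "b.md": b"two"}
_, _, state = plan_incremental({}, first_run)

second_run = {"b.md": b"two, revised", "c.md": b"three"}
to_index, to_delete, state = plan_incremental(state, second_run)
print(to_index, to_delete)  # ['b.md', 'c.md'] ['a.md']
```

Persisting the returned state between runs (for example in the metadata store) is what makes the next run incremental.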

Multi-Collection Querying

# Query across multiple collections
curl -X POST http://localhost:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "deployment best practices",
    "collections": ["product-docs", "engineering-wiki"],
    "top_k": 5,
    "merge_strategy": "interleave"
  }'
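
An "interleave" merge strategy is usually a round-robin over per-collection result lists: take the best hit from each collection, then the second best, and so on. The sketch below shows that reading of the term; the exact behavior of `merge_strategy` in Cognita is an assumption, and the hit dictionaries with an `id` field are made up for the example.

```python
from itertools import zip_longest


def interleave(results_by_collection: dict[str, list[dict]], top_k: int = 5) -> list[dict]:
    """Round-robin merge of ranked result lists, skipping duplicate ids."""
    merged: list[dict] = []
    seen: set[str] = set()
    # zip_longest yields one "tier" per rank, padding shorter lists with None.
    for tier in zip_longest(*results_by_collection.values()):
        for hit in tier:
            if hit is None or hit["id"] in seen:
                continue
            seen.add(hit["id"])
            merged.append(hit)
    return merged[:top_k]


results = {
    "product-docs": [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}],
    "engineering-wiki": [{"id": "w1"}, {"id": "w2"}],
}
merged = interleave(results, top_k=4)
print([hit["id"] for hit in merged])  # ['p1', 'w1', 'p2', 'w2']
```

Interleaving avoids comparing raw scores across collections, which may not be on the same scale when collections use different embedders.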

Troubleshooting

Issue | Solution
Qdrant connection refused | Start Qdrant: docker run -p 6333:6333 qdrant/qdrant
Embedding dimension mismatch | Ensure embedder model matches vector store config
Indexing fails on large PDFs | Increase parser timeout, use chunked upload
Out of memory during indexing | Reduce batch size, process files incrementally
Empty retrieval results | Check collection has indexed docs, lower score threshold
Slow query response | Add reranker, reduce top_k, optimize chunk size
API key errors | Verify .env file is loaded, check key format
Frontend not loading | Check CORS settings, verify API URL in frontend config

# Check API health
curl http://localhost:8000/health

# List collections
curl http://localhost:8000/v1/collections

# View collection stats
curl http://localhost:8000/v1/collections/product-docs/stats

# View logs
docker compose logs -f backend