fastRAG Cheat Sheet

Overview

fastRAG is a research framework from Intel Labs for efficient retrieval-augmented generation (RAG), built on top of Haystack. It speeds up RAG pipelines with late-interaction retrieval (ColBERT with the PLAID engine), quantized language models, knowledge graph integration, and Intel hardware acceleration via OpenVINO and IPEX (Intel Extension for PyTorch). It is aimed at researchers and engineers building high-performance RAG systems.

fastRAG provides optimized components that can be combined into Haystack pipelines, including efficient retrievers, rerankers, and generators that use Intel's optimization stack. It supports running quantized models on CPUs, which narrows the gap to GPU inference and makes RAG deployable on commodity hardware without dedicated GPU infrastructure.

Installation

pip Install

pip install fastrag

# With Intel optimizations (quote the extras so shells such as zsh do not expand the brackets)
pip install "fastrag[intel]"

# With all optional dependencies
pip install "fastrag[all]"

# For development
git clone https://github.com/IntelLabs/fastRAG.git
cd fastRAG
pip install -e ".[all]"

Dependencies

# Install Haystack (required)
pip install haystack-ai

# Intel optimization stack (optional but recommended)
pip install "optimum[openvino]"
pip install intel-extension-for-pytorch

# ColBERT support
pip install colbert-ai

# Knowledge graph support
pip install neo4j

Core Components

Pipeline Components

Component	Type	Description
`PLAIDDocumentStore`	Store	ColBERT-based efficient document store
`ColBERTRetriever`	Retriever	Late-interaction retrieval with PLAID
`QuantizedBiEncoderRetriever`	Retriever	Quantized dense retriever
`OpenVINOGenerator`	Generator	LLM inference via OpenVINO
`IPEXGenerator`	Generator	LLM inference via Intel IPEX
`ReRanker`	Ranker	Cross-encoder reranking
`KGCreator`	Knowledge Graph	Extract entities and relations
`KGRetriever`	Knowledge Graph	Retrieve from knowledge graphs

Basic RAG Pipeline

from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from fastrag.retrievers import ColBERTRetriever
from fastrag.generators import OpenVINOGenerator

# Create document store
# (in-memory here for brevity; the PLAID/ColBERT section pairs ColBERTRetriever with a PLAIDDocumentStore)
document_store = InMemoryDocumentStore()

# Index documents
documents = [
    Document(content="RAG combines retrieval with generation..."),
    Document(content="Vector databases store embeddings..."),
]
document_store.write_documents(documents)

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("retriever", ColBERTRetriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(
    template="""
    Given the context, answer the question.
    Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
    Question: {{ query }}
    Answer:
    """
))
pipeline.add_component("generator", OpenVINOGenerator(model="Intel/neural-chat-7b-v3-1"))

pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

# Run query
result = pipeline.run({
    "retriever": {"query": "What is RAG?"},
    "prompt_builder": {"query": "What is RAG?"}
})
print(result["generator"]["replies"][0])

PLAID/ColBERT Retrieval

Indexing with ColBERT

from fastrag.stores import PLAIDDocumentStore
from fastrag.retrievers import ColBERTRetriever

# Create PLAID document store
store = PLAIDDocumentStore(
    collection_path="./plaid_index",
    checkpoint="colbert-ir/colbertv2.0",
    nbits=2,  # Compression level (1, 2, or 4 bits)
    gpus=0    # CPU-only
)

# Index documents (corpus_texts is a list of strings that you supply)
from haystack import Document
docs = [Document(content=text) for text in corpus_texts]
store.write_documents(docs)

# Create retriever
retriever = ColBERTRetriever(
    document_store=store,
    top_k=10
)

# Search
results = retriever.run(query="efficient retrieval methods")
for doc in results["documents"]:
    print(f"Score: {doc.score:.4f} | {doc.content[:100]}")
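The late-interaction scoring behind ColBERT can be illustrated without any library. Each query token embedding keeps its best-matching document token embedding, and those maxima are summed (often called MaxSim). The sketch below is only an illustration of that rule: the function name `maxsim_score` and the toy 2-dimensional vectors are assumptions made for this example, not part of fastRAG, and real ColBERT models work with normalized, higher-dimensional embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values chosen for easy arithmetic)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8]]                          # only a partial match for both

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

Because every query token is matched independently, a document scores well only if it covers all parts of the query, which is the property that separates late interaction from single-vector retrieval.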

ColBERT Configuration

Parameter	Description	Default
`nbits`	Residual compression bits (1, 2, 4)	2
`kmeans_niters`	K-means iterations for clustering	4
`gpus`	Number of GPUs (0 for CPU)	0
`ranks`	Number of parallel processes	1
`doc_maxlen`	Maximum document token length	220
`query_maxlen`	Maximum query token length	32
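To see what `nbits` trades off, a rough size estimate helps. The sketch below assumes the ColBERTv2 storage layout, in which each token embedding is kept as a 4-byte centroid ID plus a residual quantized to `nbits` per dimension, and it assumes 128-dimensional embeddings. The helper names and the corpus figures are hypothetical, and the estimate ignores the centroids themselves and other index metadata.

```python
def bytes_per_token(nbits: int, dim: int = 128, centroid_id_bytes: int = 4) -> int:
    """Approximate storage for one token embedding: quantized residual plus centroid ID."""
    if nbits not in (1, 2, 4):
        raise ValueError("nbits must be 1, 2, or 4")
    return dim * nbits // 8 + centroid_id_bytes


def estimate_index_bytes(num_docs: int, avg_tokens_per_doc: int, nbits: int) -> int:
    """Rough index size, ignoring centroids and bookkeeping structures."""
    return num_docs * avg_tokens_per_doc * bytes_per_token(nbits)


# Hypothetical corpus: one million passages of about 100 tokens each
for nbits in (1, 2, 4):
    size_gb = estimate_index_bytes(1_000_000, 100, nbits) / 1e9
    print(f"nbits={nbits}: {bytes_per_token(nbits)} bytes/token, about {size_gb:.1f} GB")
```

Under these assumptions, halving `nbits` nearly halves the index, at the cost of coarser residuals and therefore some retrieval quality.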

Intel Optimized Inference

OpenVINO Generator

from fastrag.generators import OpenVINOGenerator

# Use pre-optimized model
generator = OpenVINOGenerator(
    model="Intel/neural-chat-7b-v3-1",
    device="CPU",
    max_new_tokens=256,
    temperature=0.7
)

# Convert and optimize custom model
from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,
    compile=False
)
model.half()
model.save_pretrained("./llama3-openvino")

IPEX Generator

from fastrag.generators import IPEXGenerator

generator = IPEXGenerator(
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="bfloat16",
    max_new_tokens=256
)

result = generator.run(prompt="Explain vector databases in simple terms.")

Quantized Models

from fastrag.retrievers import QuantizedBiEncoderRetriever

# Use quantized bi-encoder for retrieval
retriever = QuantizedBiEncoderRetriever(
    model="Intel/bge-small-en-v1.5-rag-int8-static",
    document_store=document_store,
    top_k=10
)

# INT8 quantization for custom models
from transformers import AutoModel
from optimum.intel import INCQuantizer
from neural_compressor.config import PostTrainingQuantConfig

# INCQuantizer wraps an already-loaded model, not a model ID string
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
quantizer = INCQuantizer.from_pretrained(model)
quantization_config = PostTrainingQuantConfig(approach="static")
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_data,  # representative tokenized samples that you supply
    save_directory="./quantized-model"
)
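What "static" INT8 quantization means can be shown with a few lines of plain Python. The scale is fixed ahead of time from calibration data, then reused for every later input; values outside the calibrated range are clipped. This is a minimal sketch of a symmetric per-tensor scheme with hypothetical helper names, not the algorithm Intel Neural Compressor actually runs, which is more elaborate.

```python
from typing import List


def calibrate_scale(calibration_values: List[float]) -> float:
    """Pick a fixed scale so the largest calibration magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Round to the nearest step and clip to the symmetric INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Calibration happens once, offline (toy values)
scale = calibrate_scale([-63.5, 0.0, 12.0, 40.0])

# At inference time the same scale is reused; 100.0 lies outside the
# calibrated range and is clipped to the largest representable value
codes = quantize([10.2, -63.5, 100.0], scale)
restored = dequantize(codes, scale)
print(codes, restored)
```

The clipping of `100.0` shows why the calibration set should be representative: anything outside the range seen during calibration loses information.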

Configuration

Pipeline YAML Configuration

# pipeline.yaml
components:
  document_store:
    type: PLAIDDocumentStore
    params:
      collection_path: ./plaid_index
      checkpoint: colbert-ir/colbertv2.0
      nbits: 2

  retriever:
    type: ColBERTRetriever
    params:
      document_store: document_store
      top_k: 10

  reranker:
    type: ReRanker
    params:
      model: cross-encoder/ms-marco-MiniLM-L-12-v2
      top_k: 5

  prompt_builder:
    type: PromptBuilder
    params:
      template: |
        Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
        Question: {{ query }}

  generator:
    type: OpenVINOGenerator
    params:
      model: Intel/neural-chat-7b-v3-1
      max_new_tokens: 256

pipelines:
  rag:
    - retriever -> reranker
    - reranker -> prompt_builder
    - prompt_builder -> generator

Advanced Usage

Knowledge Graph RAG

from fastrag.kg import KGCreator, KGRetriever
from neo4j import GraphDatabase

# Create knowledge graph from documents
kg_creator = KGCreator(
    llm_model="gpt-4o",
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="password"
)

# Extract entities and relations
kg_creator.run(documents=documents)

# Retrieve from knowledge graph
kg_retriever = KGRetriever(
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="password",
    top_k=5
)

results = kg_retriever.run(query="What technologies does Company X use?")
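Conceptually, knowledge graph retrieval returns facts whose entities appear in the question. The sketch below shows that idea on an in-memory list of (subject, relation, object) triples with naive substring matching. Everything in it is an assumption for illustration: the function `retrieve_triples`, the triples, and the matching rule are hypothetical, and a Neo4j-backed retriever would issue graph queries instead.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def retrieve_triples(query: str, triples: List[Triple], top_k: int = 5) -> List[Triple]:
    """Return up to top_k triples whose subject or object is mentioned in the query."""
    lowered = query.lower()
    hits = [t for t in triples if t[0].lower() in lowered or t[2].lower() in lowered]
    return hits[:top_k]


# Hypothetical extracted facts
triples = [
    ("Company X", "USES", "Kubernetes"),
    ("Company X", "USES", "PostgreSQL"),
    ("Company Y", "USES", "Redis"),
]

facts = retrieve_triples("What technologies does Company X use?", triples)

# Facts are usually rendered as text before they reach the prompt
context = "\n".join(f"{s} {r} {o}" for s, r, o in facts)
print(context)
```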

Hybrid Retrieval

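from haystack import Pipeline
from haystack.components.joiners import DocumentJoiner
from fastrag.retrievers import ColBERTRetriever, QuantizedBiEncoderRetriever
# ReRanker also needs importing; this sheet does not show its module path

# store: the PLAIDDocumentStore built in the ColBERT section
# dense_store: a second document store holding the dense embeddings
pipeline = Pipeline()
pipeline.add_component("colbert_retriever", ColBERTRetriever(document_store=store, top_k=20))
pipeline.add_component("dense_retriever", QuantizedBiEncoderRetriever(
    model="Intel/bge-small-en-v1.5-rag-int8-static",
    document_store=dense_store,
    top_k=20
))
pipeline.add_component("joiner", DocumentJoiner(join_mode="reciprocal_rank_fusion"))
pipeline.add_component("reranker", ReRanker(model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_k=5))

# Both retrievers feed the joiner, which fuses the two rankings before reranking
pipeline.connect("colbert_retriever.documents", "joiner.documents")
pipeline.connect("dense_retriever.documents", "joiner.documents")
pipeline.connect("joiner.documents", "reranker.documents")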
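The `reciprocal_rank_fusion` join mode merges rankings by position rather than by raw score, which matters here because ColBERT and bi-encoder scores are on different scales. The sketch below implements the textbook formula, where each document receives 1 / (k + rank) from every list it appears in, with k = 60 as in the original RRF paper. The function and the document IDs are hypothetical, and Haystack's `DocumentJoiner` may use a different constant or scaling.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists; documents ranked high in several lists rise to the top."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a deterministic order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of the two retrievers (best first)
colbert_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([colbert_ranking, dense_ranking])
print(fused)
```

Documents found by both retrievers (`d1`, `d3`) outrank those found by only one, even when a single retriever placed the latter higher.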

Benchmarking

from fastrag.benchmarks import RAGBenchmark

benchmark = RAGBenchmark(
    pipeline=pipeline,
    dataset="squad",
    metrics=["exact_match", "f1", "latency", "throughput"]
)

results = benchmark.run(num_queries=100)
print(f"F1 Score: {results['f1']:.3f}")
print(f"Avg Latency: {results['latency_ms']:.1f}ms")
print(f"Throughput: {results['queries_per_sec']:.1f} q/s")
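For reference, the `exact_match` and `f1` metric names are conventionally computed SQuAD-style: both strings are normalized (lowercased, punctuation and articles removed), exact match compares the normalized strings, and F1 measures token overlap. The sketch below follows that convention with hypothetical helper names; it is not taken from fastRAG, whose benchmark code may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "Eiffel Tower")
f1 = f1_score("The Eiffel Tower in Paris", "Eiffel Tower")
print(f"EM: {em:.1f} | F1: {f1:.3f}")
```

F1 gives partial credit to an answer that contains the reference plus extra words, which is why it is usually reported alongside exact match for generated answers.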

Troubleshooting

Issue	Solution
OpenVINO model loading fails	Install `optimum[openvino]`, check model format
PLAID index creation slow	Lower `kmeans_niters`, raise `ranks`, or index on a GPU; a smaller `nbits` mainly shrinks the index
Out of memory on CPU	Use quantized models (INT8), reduce batch size
ColBERT checkpoint not found	Download `colbert-ir/colbertv2.0` from Hugging Face
IPEX not detected	Install `intel-extension-for-pytorch` matching your PyTorch version
Neo4j connection error	Start Neo4j: `docker run -p 7687:7687 neo4j`
Haystack version conflict	Ensure `haystack-ai>=2.0` is installed
Slow first inference	OpenVINO compiles on first run; subsequent calls are faster

# Verify Intel optimizations
python -c "import openvino; print(openvino.__version__)"
python -c "import intel_extension_for_pytorch as ipex; print(ipex.__version__)"

# Check available devices
python -c "from openvino import Core; print(Core().available_devices)"
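The same checks can be run in one pass from Python without importing the (sometimes slow to load) packages. The sketch below uses the standard library's `importlib.util.find_spec` to report which optional modules are importable; the module list mirrors the packages installed earlier in this sheet, and the helper name `check_modules` is an assumption for this example.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Import names of the packages from the Installation section
OPTIONAL_MODULES = (
    "haystack",                     # haystack-ai
    "openvino",                     # OpenVINO runtime
    "optimum",                      # optimum[openvino]
    "intel_extension_for_pytorch",  # IPEX
    "colbert",                      # colbert-ai
    "neo4j",                        # Neo4j driver
)


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: find_spec(name) is not None for name in names}


for name, available in check_modules(OPTIONAL_MODULES).items():
    status = "ok" if available else "MISSING"
    print(f"{name:30s} {status}")
```

A module reported as found can still fail at import time, for example when IPEX does not match the installed PyTorch version, so the one-line version checks above remain useful.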