Overview
fastRAG is a retrieval-augmented generation (RAG) framework developed by Intel Labs and built on top of Haystack. It focuses on making RAG pipelines fast and efficient through late-interaction retrieval (ColBERT with the PLAID engine), quantized language models, knowledge graph integration, and Intel hardware acceleration via OpenVINO and IPEX (Intel Extension for PyTorch). The framework targets researchers and engineers building high-performance RAG systems.
fastRAG provides optimized components that can be combined into Haystack pipelines, including retrievers, rerankers, and generators that use Intel’s optimization stack. It supports running quantized models on CPUs, which is intended to make RAG deployable on commodity hardware without dedicated GPU infrastructure; actual throughput relative to a GPU depends on the model, the quantization scheme, and the CPU.
Installation
pip Install
pip install fastrag
# With Intel optimizations
pip install "fastrag[intel]"
# With all optional dependencies
pip install "fastrag[all]"
# For development
git clone https://github.com/IntelLabs/fastRAG.git
cd fastRAG
pip install -e ".[all]"
Dependencies
# Install Haystack (required)
pip install haystack-ai
# Intel optimization stack (optional but recommended)
pip install "optimum[openvino]"
pip install intel-extension-for-pytorch
# ColBERT support
pip install colbert-ai
# Knowledge graph support
pip install neo4j
Core Components
Pipeline Components
| Component | Type | Description |
|---|---|---|
| PLAIDDocumentStore | Store | ColBERT-based efficient document store |
| ColBERTRetriever | Retriever | Late-interaction retrieval with PLAID |
| QuantizedBiEncoderRetriever | Retriever | Quantized dense retriever |
| OpenVINOGenerator | Generator | LLM inference via OpenVINO |
| IPEXGenerator | Generator | LLM inference via Intel IPEX |
| ReRanker | Ranker | Cross-encoder reranking |
| KGCreator | Knowledge Graph | Extract entities and relations |
| KGRetriever | Knowledge Graph | Retrieve from knowledge graphs |
Basic RAG Pipeline
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from fastrag.retrievers import ColBERTRetriever
from fastrag.generators import OpenVINOGenerator
# Create document store
document_store = InMemoryDocumentStore()
# Index documents
from haystack import Document
documents = [
    Document(content="RAG combines retrieval with generation..."),
    Document(content="Vector databases store embeddings..."),
]
document_store.write_documents(documents)
# Build pipeline
pipeline = Pipeline()
pipeline.add_component("retriever", ColBERTRetriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(
template="""
Given the context, answer the question.
Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
Question: {{ query }}
Answer:
"""
))
pipeline.add_component("generator", OpenVINOGenerator(model="Intel/neural-chat-7b-v3-1"))
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")
# Run query
result = pipeline.run({
    "retriever": {"query": "What is RAG?"},
    "prompt_builder": {"query": "What is RAG?"}
})
print(result["generator"]["replies"][0])
PLAID/ColBERT Retrieval
Indexing with ColBERT
from fastrag.stores import PLAIDDocumentStore
from fastrag.retrievers import ColBERTRetriever
# Create PLAID document store
store = PLAIDDocumentStore(
    collection_path="./plaid_index",
    checkpoint="colbert-ir/colbertv2.0",
    nbits=2,  # Compression level (1, 2, or 4 bits)
    gpus=0    # CPU-only
)
# Index documents
from haystack import Document
# corpus_texts is a list of strings to index, defined elsewhere
docs = [Document(content=text) for text in corpus_texts]
store.write_documents(docs)
# Create retriever
retriever = ColBERTRetriever(
    document_store=store,
    top_k=10
)
# Search
results = retriever.run(query="efficient retrieval methods")
for doc in results["documents"]:
    print(f"Score: {doc.score:.4f} | {doc.content[:100]}")
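To make "late interaction" concrete, the following is a minimal pure-Python sketch of the MaxSim scoring rule used by ColBERT-style models: each query token embedding is matched against its most similar document token embedding, and those maxima are summed. The vectors are toy values, and `dot` and `maxsim_score` are illustrative helpers, not part of fastRAG or ColBERT.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match for both query tokens
doc_b = [[1.0, 0.0], [0.5, 0.5]]              # has a strong match for the first token only

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because every query token is scored independently, a document is rewarded for covering all parts of the query rather than for matching a single pooled vector. A production engine such as PLAID avoids this exhaustive comparison through centroid-based pruning, which this sketch does not attempt to model.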
ColBERT Configuration
| Parameter | Description | Default |
|---|---|---|
| nbits | Residual compression bits (1, 2, 4) | 2 |
| kmeans_niters | K-means iterations for clustering | 4 |
| gpus | Number of GPUs (0 for CPU) | 0 |
| ranks | Number of parallel processes | 1 |
| doc_maxlen | Maximum document token length | 220 |
| query_maxlen | Maximum query token length | 32 |
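As a rough guide to what `nbits` costs in storage, the sketch below estimates the size of the compressed token embeddings. It assumes 128-dimensional embeddings and a 4-byte centroid ID per token, as described for ColBERTv2 residual compression; it ignores the centroid table and index metadata, so real on-disk sizes will differ. The corpus size and average document length are hypothetical.

```python
def bytes_per_token(nbits: int, dim: int = 128, centroid_id_bytes: int = 4) -> int:
    """Approximate storage per token: centroid ID plus an nbits-per-dimension residual."""
    return centroid_id_bytes + (nbits * dim) // 8


def estimate_index_bytes(num_docs: int, avg_tokens_per_doc: int, nbits: int) -> int:
    """Approximate total bytes for the compressed token embeddings of a corpus."""
    return num_docs * avg_tokens_per_doc * bytes_per_token(nbits)


# Hypothetical corpus: 1M passages averaging 100 tokens each.
for nbits in (1, 2, 4):
    size_gb = estimate_index_bytes(1_000_000, 100, nbits) / 1e9
    print(f"nbits={nbits}: {bytes_per_token(nbits)} bytes/token, ~{size_gb:.1f} GB")
```

Under these assumptions, lower `nbits` values shrink the index roughly in proportion to the residual size, at the cost of coarser embeddings.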
Intel Optimized Inference
OpenVINO Generator
from fastrag.generators import OpenVINOGenerator
# Use pre-optimized model
generator = OpenVINOGenerator(
    model="Intel/neural-chat-7b-v3-1",
    device="CPU",
    max_new_tokens=256,
    temperature=0.7
)
# Convert and optimize custom model
from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,
    compile=False
)
model.half()
model.save_pretrained("./llama3-openvino")
IPEX Generator
from fastrag.generators import IPEXGenerator
generator = IPEXGenerator(
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="bfloat16",
    max_new_tokens=256
)
result = generator.run(prompt="Explain vector databases in simple terms.")
Quantized Models
from fastrag.retrievers import QuantizedBiEncoderRetriever
# Use quantized bi-encoder for retrieval
retriever = QuantizedBiEncoderRetriever(
    model="Intel/bge-small-en-v1.5-rag-int8-static",
    document_store=document_store,
    top_k=10
)
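# INT8 quantization for custom models
from optimum.intel import INCQuantizer
from neural_compressor.config import PostTrainingQuantConfig
from transformers import AutoModel
# INCQuantizer.from_pretrained expects a loaded model object, not a model ID string
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
quantizer = INCQuantizer.from_pretrained(model)
quantization_config = PostTrainingQuantConfig(approach="static")
# calibration_data: a tokenized dataset of representative inputs, prepared beforehand
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_data,
    save_directory="./quantized-model"
)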
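For intuition about what INT8 quantization does to a tensor, here is a minimal pure-Python sketch of symmetric per-tensor quantization: floats are divided by a scale, rounded to integers in [-127, 127], and multiplied back at inference time. This is a simplified illustration, not what Neural Compressor does internally; static post-training quantization additionally uses calibration data to choose activation scales. The helper names and weights are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 1.00]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why tensors with a few large outliers (and therefore a large scale) tend to lose more precision.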
Configuration
Pipeline YAML Configuration
# pipeline.yaml
components:
  document_store:
    type: PLAIDDocumentStore
    params:
      collection_path: ./plaid_index
      checkpoint: colbert-ir/colbertv2.0
      nbits: 2
  retriever:
    type: ColBERTRetriever
    params:
      document_store: document_store
      top_k: 10
  reranker:
    type: ReRanker
    params:
      model: cross-encoder/ms-marco-MiniLM-L-12-v2
      top_k: 5
  prompt_builder:
    type: PromptBuilder
    params:
      template: |
        Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
        Question: {{ query }}
  generator:
    type: OpenVINOGenerator
    params:
      model: Intel/neural-chat-7b-v3-1
      max_new_tokens: 256
pipelines:
  rag:
    - retriever -> reranker
    - reranker -> prompt_builder
    - prompt_builder -> generator
Advanced Usage
Knowledge Graph RAG
from fastrag.kg import KGCreator, KGRetriever
from neo4j import GraphDatabase
# Create knowledge graph from documents
kg_creator = KGCreator(
    llm_model="gpt-4o",
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="password"
)
# Extract entities and relations
kg_creator.run(documents=documents)
# Retrieve from knowledge graph
kg_retriever = KGRetriever(
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="password",
    top_k=5
)
results = kg_retriever.run(query="What technologies does Company X use?")
Hybrid Retrieval
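from haystack import Pipeline
from haystack.components.joiners import DocumentJoiner
from fastrag.retrievers import ColBERTRetriever, QuantizedBiEncoderRetriever
# ReRanker must also be imported from fastrag; its module path is not shown in this document.
# `store` is the PLAIDDocumentStore created earlier; `dense_store` is a separate store with dense embeddings.
pipeline = Pipeline()
# ColBERT is a late-interaction (multi-vector) retriever rather than a sparse one
pipeline.add_component("colbert_retriever", ColBERTRetriever(document_store=store, top_k=20))
pipeline.add_component("dense_retriever", QuantizedBiEncoderRetriever(
    model="Intel/bge-small-en-v1.5-rag-int8-static",
    document_store=dense_store,
    top_k=20
))
pipeline.add_component("joiner", DocumentJoiner(join_mode="reciprocal_rank_fusion"))
pipeline.add_component("reranker", ReRanker(model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_k=5))
pipeline.connect("colbert_retriever.documents", "joiner.documents")
pipeline.connect("dense_retriever.documents", "joiner.documents")
pipeline.connect("joiner.documents", "reranker.documents")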
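The `reciprocal_rank_fusion` join mode merges ranked lists by position rather than by raw score, which is useful when the two retrievers produce scores on different scales. Below is a minimal sketch of the standard formulation, where each document receives the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 follows the original RRF paper; Haystack's `DocumentJoiner` may use a different constant or normalization, so treat this as an illustration of the idea only. The document IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of two retrievers for the same query.
colbert_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d4", "d1"]

fused = reciprocal_rank_fusion([colbert_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists (here `d2` and `d1`) rise above documents that only one retriever found, which is the behavior the downstream reranker benefits from.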
Benchmarking
from fastrag.benchmarks import RAGBenchmark
benchmark = RAGBenchmark(
    pipeline=pipeline,
    dataset="squad",
    metrics=["exact_match", "f1", "latency", "throughput"]
)
results = benchmark.run(num_queries=100)
print(f"F1 Score: {results['f1']:.3f}")
print(f"Avg Latency: {results['latency_ms']:.1f}ms")
print(f"Throughput: {results['queries_per_sec']:.1f} q/s")
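For reference, the `exact_match` and `f1` metrics are commonly computed SQuAD-style: answers are normalized (lowercased, punctuation and articles removed), then compared either as whole strings or as bags of tokens. The sketch below shows that convention; how `RAGBenchmark` computes these metrics internally is not shown in this document, so the functions here are illustrative assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("the tall Eiffel Tower", "Eiffel Tower in Paris")
print(f"EM: {em:.1f}  F1: {f1:.3f}")
```

Token-level F1 gives partial credit for answers that overlap with the reference, which makes it more forgiving than exact match for generated text.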
Troubleshooting
| Issue | Solution |
|---|---|
| OpenVINO model loading fails | Install optimum[openvino], check model format |
| PLAID index creation slow | Reduce nbits, lower kmeans_niters |
| Out of memory on CPU | Use quantized models (INT8), reduce batch size |
| ColBERT checkpoint not found | Download: colbert-ir/colbertv2.0 from HuggingFace |
| IPEX not detected | Install intel-extension-for-pytorch matching your PyTorch version |
| Neo4j connection error | Start Neo4j: docker run -p 7687:7687 neo4j |
| Haystack version conflict | Ensure haystack-ai>=2.0 is installed |
| Slow first inference | OpenVINO compiles on first run; subsequent calls are faster |
# Verify Intel optimizations
python -c "import openvino; print(openvino.__version__)"
python -c "import intel_extension_for_pytorch as ipex; print(ipex.__version__)"
# Check available devices
python -c "from openvino import Core; print(Core().available_devices)"
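The same checks can be done in one pass from Python without importing the heavy packages. The sketch below uses the standard library's `importlib.util.find_spec` to report which optional modules are installed; the module-to-command mapping is assembled from the install commands in the Dependencies section, and `check_modules` is a hypothetical helper name.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Import name -> install command, taken from the Dependencies section.
OPTIONAL_MODULES = {
    "openvino": 'pip install "optimum[openvino]"',
    "intel_extension_for_pytorch": "pip install intel-extension-for-pytorch",
    "colbert": "pip install colbert-ai",
    "neo4j": "pip install neo4j",
}


def check_modules(modules: Iterable[str]) -> Dict[str, bool]:
    """Return {module_name: True/False} depending on whether it can be located."""
    return {name: find_spec(name) is not None for name in modules}


status = check_modules(OPTIONAL_MODULES)
for name, available in status.items():
    hint = "" if available else f"  (try: {OPTIONAL_MODULES[name]})"
    print(f"{name}: {'found' if available else 'missing'}{hint}")
```

A module reported as found can still fail at import time, for example when the installed IPEX build does not match the PyTorch version, so this is a first-pass check rather than a full verification.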