fastRAG Cheat Sheet

Overview

fastRAG is a retrieval-augmented generation (RAG) framework developed by Intel Labs and built on top of Haystack. It optimizes RAG pipelines for speed and efficiency through late-interaction retrieval (ColBERT with the PLAID engine), quantized language models, knowledge graph integration, and Intel hardware acceleration via OpenVINO and IPEX (Intel Extension for PyTorch). The framework is designed for researchers and engineers building high-performance RAG systems.

fastRAG provides optimized components that can be combined into Haystack pipelines, including efficient retrievers, rerankers, and generators that leverage Intel’s optimization stack. It supports running quantized models on CPUs, so RAG pipelines can be deployed on commodity hardware without dedicated GPU infrastructure.

Installation

pip Install

pip install fastrag

# With Intel optimizations (quotes keep shells such as zsh from expanding the brackets)
pip install "fastrag[intel]"

# With all optional dependencies
pip install "fastrag[all]"

# For development
git clone https://github.com/IntelLabs/fastRAG.git
cd fastRAG
pip install -e ".[all]"

Dependencies

# Install Haystack (required)
pip install haystack-ai

# Intel optimization stack (optional but recommended)
pip install "optimum[openvino]"
pip install intel-extension-for-pytorch

# ColBERT support
pip install colbert-ai

# Knowledge graph support
pip install neo4j

Core Components

Pipeline Components

| Component | Type | Description |
| --- | --- | --- |
| PLAIDDocumentStore | Store | ColBERT-based efficient document store |
| ColBERTRetriever | Retriever | Late-interaction retrieval with PLAID |
| QuantizedBiEncoderRetriever | Retriever | Quantized dense retriever |
| OpenVINOGenerator | Generator | LLM inference via OpenVINO |
| IPEXGenerator | Generator | LLM inference via Intel IPEX |
| ReRanker | Ranker | Cross-encoder reranking |
| KGCreator | Knowledge Graph | Extract entities and relations |
| KGRetriever | Knowledge Graph | Retrieve from knowledge graphs |

Basic RAG Pipeline

from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from fastrag.retrievers import ColBERTRetriever
from fastrag.generators import OpenVINOGenerator

# Create document store
document_store = InMemoryDocumentStore()

# Index documents
documents = [
    Document(content="RAG combines retrieval with generation..."),
    Document(content="Vector databases store embeddings..."),
]
document_store.write_documents(documents)

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("retriever", ColBERTRetriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(
    template="""
    Given the context, answer the question.
    Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
    Question: {{ query }}
    Answer:
    """
))
pipeline.add_component("generator", OpenVINOGenerator(model="Intel/neural-chat-7b-v3-1"))

pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

# Run query
result = pipeline.run({
    "retriever": {"query": "What is RAG?"},
    "prompt_builder": {"query": "What is RAG?"}
})
print(result["generator"]["replies"][0])
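
To make the data flow concrete, the sketch below mimics in plain Python what the prompt template in the pipeline above renders to, ignoring the template's indentation. It does not use Haystack or fastRAG: `render_prompt` is a hypothetical helper, and documents are plain strings here rather than `Document` objects.

```python
def render_prompt(documents: list[str], query: str) -> str:
    """Plain-Python stand-in for the Jinja template (hypothetical helper)."""
    # The template's for-loop emits each document's content back to back,
    # with no separator between documents.
    context = "".join(documents)
    return (
        "Given the context, answer the question.\n"
        f"Context: {context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


docs = [
    "RAG combines retrieval with generation...",
    "Vector databases store embeddings...",
]
prompt = render_prompt(docs, "What is RAG?")
print(prompt)
```

Because the loop adds no separator, adjacent documents run together in the rendered context. Putting a newline inside the loop body of the template is a common tweak if the generator needs clearer document boundaries.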

PLAID/ColBERT Retrieval

Indexing with ColBERT

from fastrag.stores import PLAIDDocumentStore
from fastrag.retrievers import ColBERTRetriever

# Create PLAID document store
store = PLAIDDocumentStore(
    collection_path="./plaid_index",
    checkpoint="colbert-ir/colbertv2.0",
    nbits=2,  # Compression level (1, 2, or 4 bits)
    gpus=0    # CPU-only
)

# Index documents (corpus_texts is your own list of strings; it is not defined here)
from haystack import Document
docs = [Document(content=text) for text in corpus_texts]
store.write_documents(docs)

# Create retriever
retriever = ColBERTRetriever(
    document_store=store,
    top_k=10
)

# Search
results = retriever.run(query="efficient retrieval methods")
for doc in results["documents"]:
    print(f"Score: {doc.score:.4f} | {doc.content[:100]}")
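
The scores printed above come from late interaction: each query token is matched against its most similar document token, and those maxima are summed (often called MaxSim). The following is a minimal pure-Python sketch of that scoring rule on made-up 2-dimensional token embeddings. The helper names are hypothetical and this is not fastRAG's or ColBERT's actual code, which works on batched tensors and a compressed index.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def maxsim_score(query_vecs, doc_vecs) -> float:
    """Sum over query tokens of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy "token embeddings", invented purely for illustration.
query = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
doc_a = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
doc_b = [normalize(v) for v in ([1.0, 1.0], [-1.0, 0.0])]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(f"doc_a: {score_a:.4f}  doc_b: {score_b:.4f}")
```

Here `doc_a` contains an exact match for both query tokens, so it scores higher than `doc_b`, which only partially aligns with each of them.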

ColBERT Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| nbits | Residual compression bits (1, 2, 4) | 2 |
| kmeans_niters | K-means iterations for clustering | 4 |
| gpus | Number of GPUs (0 for CPU) | 0 |
| ranks | Number of parallel processes | 1 |
| doc_maxlen | Maximum document token length | 220 |
| query_maxlen | Maximum query token length | 32 |
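
To see what `nbits` controls, the sketch below shows the general idea of residual compression: each token embedding is stored as the ID of its nearest centroid plus a residual whose every dimension is quantized to `2**nbits` buckets. It is a simplified illustration under stated assumptions (fixed uniform buckets over [-0.5, 0.5], hand-picked centroids, hypothetical function names). A real index learns centroids with k-means and typically derives bucket boundaries from the data.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared distance to vec."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, centroids[i])),
    )


def quantize_residual(residual, nbits, lo=-0.5, hi=0.5):
    """Map each residual value to one of 2**nbits uniform buckets in [lo, hi)."""
    width = (hi - lo) / 2 ** nbits
    codes = []
    for r in residual:
        clipped = min(max(r, lo), hi - 1e-12)  # keep the value inside the range
        codes.append(int((clipped - lo) / width))
    return codes


def dequantize(codes, nbits, lo=-0.5, hi=0.5):
    """Reconstruct each value as the midpoint of its bucket."""
    width = (hi - lo) / 2 ** nbits
    return [lo + (c + 0.5) * width for c in codes]


centroids = [[0.0, 0.0], [1.0, 1.0]]  # made-up centroids
vec = [0.9, 1.2]                      # made-up token embedding

cid = nearest_centroid(vec, centroids)
residual = [v - c for v, c in zip(vec, centroids[cid])]
codes = quantize_residual(residual, nbits=2)
approx = [c + r for c, r in zip(centroids[cid], dequantize(codes, nbits=2))]
print(cid, codes, approx)
```

With `nbits=2` every dimension costs two bits plus the shared centroid ID. Raising `nbits` shrinks the reconstruction error at the cost of a larger index.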

Intel Optimized Inference

OpenVINO Generator

from fastrag.generators import OpenVINOGenerator

# Use pre-optimized model
generator = OpenVINOGenerator(
    model="Intel/neural-chat-7b-v3-1",
    device="CPU",
    max_new_tokens=256,
    temperature=0.7
)

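# Convert and optimize a custom model
from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,    # convert the PyTorch checkpoint to OpenVINO IR
    compile=False   # defer compilation until after the precision change
)
model.half()        # cast weights to FP16
model.save_pretrained("./llama3-openvino")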

IPEX Generator

from fastrag.generators import IPEXGenerator

generator = IPEXGenerator(
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="bfloat16",
    max_new_tokens=256
)

result = generator.run(prompt="Explain vector databases in simple terms.")

Quantized Models

from fastrag.retrievers import QuantizedBiEncoderRetriever

# Use quantized bi-encoder for retrieval
retriever = QuantizedBiEncoderRetriever(
    model="Intel/bge-small-en-v1.5-rag-int8-static",
    document_store=document_store,
    top_k=10
)

# INT8 quantization for custom models
from transformers import AutoModel
from optimum.intel import INCQuantizer
from neural_compressor.config import PostTrainingQuantConfig

# INCQuantizer.from_pretrained expects a loaded model object, not a model ID
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
quantizer = INCQuantizer.from_pretrained(model)
quantization_config = PostTrainingQuantConfig(approach="static")
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_data,  # representative inputs; not defined here
    save_directory="./quantized-model"
)
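
As a rough intuition for what "static" post-training quantization means, the sketch below fixes a scale from calibration values ahead of time and then reuses it for every later input. It is a simplified symmetric INT8 scheme in pure Python with hypothetical helper names, not what Neural Compressor does internally (which works per tensor or per channel and may use asymmetric ranges).

```python
def calibrate_scale(calibration_values, qmax=127):
    """Fix the scale once from the largest magnitude seen during calibration."""
    return max(abs(v) for v in calibration_values) / qmax


def quantize(values, scale, qmax=127):
    """Round to the nearest integer step and clip to the INT8 range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    return [c * scale for c in codes]


calibration = [-1.0, -0.25, 0.0, 0.4, 1.0]  # made-up calibration activations
scale = calibrate_scale(calibration)

# 2.0 lies outside the calibrated range, so it saturates at the INT8 maximum.
codes = quantize([0.3, -0.7, 2.0], scale)
restored = dequantize(codes, scale)
print(codes, [round(x, 4) for x in restored])
```

The saturated value illustrates why the calibration set should resemble real inputs: anything outside the observed range is clipped.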

Configuration

Pipeline YAML Configuration

# pipeline.yaml
components:
  document_store:
    type: PLAIDDocumentStore
    params:
      collection_path: ./plaid_index
      checkpoint: colbert-ir/colbertv2.0
      nbits: 2

  retriever:
    type: ColBERTRetriever
    params:
      document_store: document_store
      top_k: 10

  reranker:
    type: ReRanker
    params:
      model: cross-encoder/ms-marco-MiniLM-L-12-v2
      top_k: 5

  prompt_builder:
    type: PromptBuilder
    params:
      template: |
        Context: {% for doc in documents %}{{ doc.content }}{% endfor %}
        Question: {{ query }}

  generator:
    type: OpenVINOGenerator
    params:
      model: Intel/neural-chat-7b-v3-1
      max_new_tokens: 256

pipelines:
  rag:
    - retriever -> reranker
    - reranker -> prompt_builder
    - prompt_builder -> generator

Advanced Usage

Knowledge Graph RAG

from fastrag.kg import KGCreator, KGRetriever
from neo4j import GraphDatabase

# Create knowledge graph from documents
kg_creator = KGCreator(
    llm_model="gpt-4o",
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="password"
)

# Extract entities and relations
kg_creator.run(documents=documents)

# Retrieve from knowledge graph
kg_retriever = KGRetriever(
    neo4j_uri="bolt://localhost:7687",
    neo4j_user="neo4j",
    neo4j_password="password",
    top_k=5
)

results = kg_retriever.run(query="What technologies does Company X use?")
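
For intuition only, the sketch below shows one very simple way a graph lookup can feed a RAG prompt: facts are stored as (subject, relation, object) triples, and triples whose entities appear in the query are returned as context. The triples, the matching rule, and `retrieve_triples` are all made up for illustration. Nothing here reflects how `KGRetriever` or Neo4j actually resolve a query.

```python
# Hypothetical toy graph; in practice these would live in a graph database.
triples = [
    ("Company X", "USES", "Kubernetes"),
    ("Company X", "USES", "PostgreSQL"),
    ("Company Y", "USES", "Kafka"),
]


def retrieve_triples(query, triples, top_k=5):
    """Return up to top_k triples whose subject or object is named in the query."""
    q = query.lower()
    hits = [t for t in triples if t[0].lower() in q or t[2].lower() in q]
    return hits[:top_k]


kg_hits = retrieve_triples("What technologies does Company X use?", triples)

# Flatten the matching facts into text that a prompt template could consume.
context = "\n".join(f"{s} {r} {o}" for s, r, o in kg_hits)
print(context)
```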

Hybrid Retrieval

from haystack import Pipeline
from haystack.components.joiners import DocumentJoiner
from fastrag.retrievers import ColBERTRetriever, QuantizedBiEncoderRetriever

# Assumes store (PLAID index) and dense_store already exist and ReRanker is imported
pipeline = Pipeline()
pipeline.add_component("colbert_retriever", ColBERTRetriever(document_store=store, top_k=20))
pipeline.add_component("dense_retriever", QuantizedBiEncoderRetriever(
    model="Intel/bge-small-en-v1.5-rag-int8-static",
    document_store=dense_store,
    top_k=20
))
pipeline.add_component("joiner", DocumentJoiner(join_mode="reciprocal_rank_fusion"))
pipeline.add_component("reranker", ReRanker(model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_k=5))

# Both retrievers feed the joiner; the fused list is then reranked
pipeline.connect("colbert_retriever.documents", "joiner.documents")
pipeline.connect("dense_retriever.documents", "joiner.documents")
pipeline.connect("joiner.documents", "reranker.documents")
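
The `reciprocal_rank_fusion` join mode merges ranked lists by rank position rather than by raw score, which helps when the two retrievers score on different scales. The sketch below implements the textbook formula, score(d) = sum over lists of 1 / (k + rank), with the conventional k = 60. It is an illustration with made-up document IDs; Haystack's `DocumentJoiner` may use a different constant or scaling.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


colbert_ranking = ["d1", "d2", "d3"]  # made-up results from one retriever
dense_ranking = ["d3", "d1", "d4"]    # made-up results from the other

fused = reciprocal_rank_fusion([colbert_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists (`d1`, `d3`) end up ahead of documents that only one retriever found.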

Benchmarking

from fastrag.benchmarks import RAGBenchmark

benchmark = RAGBenchmark(
    pipeline=pipeline,
    dataset="squad",
    metrics=["exact_match", "f1", "latency", "throughput"]
)

results = benchmark.run(num_queries=100)
print(f"F1 Score: {results['f1']:.3f}")
print(f"Avg Latency: {results['latency_ms']:.1f}ms")
print(f"Throughput: {results['queries_per_sec']:.1f} q/s")
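
For reference, the sketch below shows how the four metrics named above are commonly computed: SQuAD-style exact match and token-level F1 after answer normalization, plus average latency and throughput from wall-clock timing. The helper names are hypothetical and the `answer` callable stands in for a real pipeline. This is not the code behind `RAGBenchmark`.

```python
import re
import string
import time
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def timed(answer, queries):
    """Return outputs, average latency in ms, and throughput in queries/sec."""
    start = time.perf_counter()
    outputs = [answer(q) for q in queries]
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return outputs, 1000 * elapsed / len(queries), len(queries) / elapsed


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("Eiffel Tower in Paris", "the Eiffel Tower")
outputs, latency_ms, qps = timed(lambda q: q.upper(), ["a", "b", "c"])
print(f"EM={em:.1f}  F1={f1:.3f}  latency={latency_ms:.4f}ms  throughput={qps:.0f} q/s")
```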

Troubleshooting

| Issue | Solution |
| --- | --- |
| OpenVINO model loading fails | Install optimum[openvino], check model format |
| PLAID index creation slow | Reduce nbits, lower kmeans_niters |
| Out of memory on CPU | Use quantized models (INT8), reduce batch size |
| ColBERT checkpoint not found | Download: colbert-ir/colbertv2.0 from HuggingFace |
| IPEX not detected | Install intel-extension-for-pytorch matching your PyTorch version |
| Neo4j connection error | Start Neo4j: docker run -p 7687:7687 neo4j |
| Haystack version conflict | Ensure haystack-ai>=2.0 is installed |
| Slow first inference | OpenVINO compiles on first run; subsequent calls are faster |

# Verify Intel optimizations
python -c "import openvino; print(openvino.__version__)"
python -c "import intel_extension_for_pytorch as ipex; print(ipex.__version__)"

# Check available devices
python -c "from openvino import Core; print(Core().available_devices)"
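
The same checks can be run from a single Python script that reports which optional packages are importable without actually importing them. This is a small stdlib-only sketch. `check_modules` is a hypothetical helper, and the module names assume the usual import names (`haystack` for haystack-ai, `colbert` for colbert-ai).

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool} telling whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_modules(
    ["openvino", "intel_extension_for_pytorch", "haystack", "colbert", "neo4j"]
)
for name, found in status.items():
    print(f"{name:32s} {'found' if found else 'MISSING'}")
```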