RAGAS Cheat Sheet

Overview

RAGAS (RAG Assessment) is a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines without needing hand-labeled ground truth for every metric. It uses LLMs as judges to compute component-wise scores that isolate whether problems lie in retrieval (context precision/recall) or generation (faithfulness/answer relevancy).

Key metrics: Faithfulness measures how well the answer is grounded in the retrieved context (a low score signals hallucination). Answer Relevancy measures whether the answer addresses the question. Context Precision measures whether relevant contexts are ranked above irrelevant ones. Context Recall measures whether all the information needed to answer was retrieved.
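To make these scores concrete, the sketch below reproduces the arithmetic behind three of them. It is a simplified illustration under stated assumptions, not RAGAS's implementation: the function names are hypothetical, and the boolean verdicts that an LLM judge would normally produce are supplied by hand.

```python
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Fraction of answer claims judged to be supported by the retrieved context."""
    if not claim_supported:
        return 0.0  # convention chosen for this sketch only
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(reference_claim_attributed: Sequence[bool]) -> float:
    """Fraction of reference claims that can be attributed to the retrieved context."""
    if not reference_claim_attributed:
        return 0.0  # convention chosen for this sketch only
    return sum(reference_claim_attributed) / len(reference_claim_attributed)


def context_precision_score(chunk_relevant: Sequence[bool]) -> float:
    """Mean of precision@k, taken at every rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical judge verdicts for a single sample
print(faithfulness_score([True, True, False, True]))   # 3 of 4 claims supported
print(context_recall_score([True, False]))             # 1 of 2 reference claims covered
print(context_precision_score([True, False, True]))    # relevant chunks at ranks 1 and 3
print(context_precision_score([False, True, True]))    # same relevant count, ranked lower
```

The last two calls show why ranking matters: both retrievals contain two relevant chunks, but the one that places a relevant chunk first scores higher.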

RAGAS also provides synthetic test set generation, CI/CD integration, and custom metric authoring.

Installation

# Core RAGAS
pip install ragas

# With specific LLM providers
pip install ragas openai          # OpenAI evaluator
pip install ragas langchain-anthropic  # Anthropic evaluator (used via LangChain below)
pip install ragas langchain-openai

# With LlamaIndex integration
pip install ragas llama-index

# Development / latest
pip install git+https://github.com/explodinggradients/ragas.git

Configuration

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Configure evaluator LLM and embeddings (OpenAI).
# The Anthropic and Azure blocks below are alternatives; keep only one evaluator_llm.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# --- Anthropic evaluator ---
from langchain_anthropic import ChatAnthropic
evaluator_llm = LangchainLLMWrapper(
    ChatAnthropic(model="claude-3-5-sonnet-20241022")
)

# --- Azure OpenAI ---
from langchain_openai import AzureChatOpenAI
evaluator_llm = LangchainLLMWrapper(
    AzureChatOpenAI(
        azure_deployment="gpt-4o",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    )
)

Core Metrics Reference

| Metric | Class | Needs Reference | Range | Measures |
|---|---|---|---|---|
| Faithfulness | Faithfulness | No | 0–1 | Groundedness of the answer in the retrieved context (low = hallucination) |
| Answer Relevancy | AnswerRelevancy | No | 0–1 | Answer addresses question |
| Context Precision | ContextPrecision | Yes (reference) | 0–1 | Relevant context ranked higher |
| Context Recall | ContextRecall | Yes (reference) | 0–1 | All needed context retrieved |
| Context Entity Recall | ContextEntityRecall | Yes | 0–1 | Entity coverage in context |
| Answer Correctness | AnswerCorrectness | Yes | 0–1 | Factual accuracy vs. reference |
| Answer Similarity | AnswerSimilarity | Yes | 0–1 | Semantic similarity to reference |
| Context Relevance (v0.1) | ContextRelevance | No | 0–1 | Context relevance to question |
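The "Needs Reference" column determines which metrics can run on a given sample. The sketch below checks this before an evaluation starts. The `REQUIRED_FIELDS` mapping and the `runnable_metrics` helper are hypothetical; the field sets are an assumption inferred from the table and from the sample fields used in this cheat sheet, not something read from the library.

```python
# Assumed field requirements per metric (illustrative, not queried from RAGAS)
REQUIRED_FIELDS = {
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "answer_relevancy": {"user_input", "response"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
}


def runnable_metrics(sample: dict) -> list:
    """Return the metric names whose required fields are all present and non-empty."""
    present = {key for key, value in sample.items() if value not in (None, "", [])}
    return sorted(
        name for name, needed in REQUIRED_FIELDS.items() if needed <= present
    )


# A production trace has no reference answer, so only reference-free metrics apply
production_sample = {
    "user_input": "What is the boiling point of water?",
    "retrieved_contexts": ["Water boils at 100°C at standard atmospheric pressure."],
    "response": "Water boils at 100°C at sea level.",
}
print(runnable_metrics(production_sample))

# Adding a reference unlocks the context metrics as well
golden_sample = {**production_sample, "reference": "100°C (212°F) at 1 atm."}
print(runnable_metrics(golden_sample))
```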

Core API Reference

| API | Description |
|---|---|
| EvaluationDataset | Container for evaluation samples |
| SingleTurnSample | One QA pair with context |
| evaluate(dataset, metrics) | Run evaluation |
| TestsetGenerator | Generate synthetic test sets |
| KnowledgeGraph | Source document graph for testset gen |
| Faithfulness() | Instantiate faithfulness metric |
| AnswerRelevancy() | Instantiate answer relevancy metric |
| ContextPrecision() | Instantiate context precision metric |
| ContextRecall() | Instantiate context recall metric |
| result.to_pandas() | Convert results to DataFrame |
| result["faithfulness"] | Access scores for one metric (may be a per-sample list in recent versions; average via result.to_pandas() if so) |

Advanced Usage

Basic Evaluation

from ragas import evaluate, EvaluationDataset
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision, ContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# Prepare your pipeline outputs
samples = [
    {
        "user_input": "What is the boiling point of water?",
        "retrieved_contexts": [
            "Water boils at 100°C (212°F) at standard atmospheric pressure.",
            "The boiling point varies with altitude and pressure."
        ],
        "response": "Water boils at 100°C at sea level.",
        "reference": "The boiling point of water is 100°C (212°F) at 1 atm.",
    },
    {
        "user_input": "Who invented the telephone?",
        "retrieved_contexts": [
            "Alexander Graham Bell is credited with inventing the telephone in 1876.",
        ],
        "response": "Alexander Graham Bell invented the telephone.",
        "reference": "Alexander Graham Bell invented the telephone in 1876.",
    },
]

dataset = EvaluationDataset.from_list(samples)

# Set up evaluator
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(llm=llm),
        AnswerRelevancy(llm=llm, embeddings=embeddings),
        ContextPrecision(llm=llm),
        ContextRecall(llm=llm),
    ],
)

print(result)
# Example output (values are illustrative): {'faithfulness': 0.95, 'answer_relevancy': 0.87, ...}

# Detailed results as DataFrame
df = result.to_pandas()
print(df.head())

Synthetic Test Set Generation

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter  # current home of the splitter classes

# Load your knowledge base documents
loader = DirectoryLoader("./docs/", glob="**/*.md")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(documents)

# Configure the generator LLM and embeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Generate synthetic testset
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=embeddings,
)

testset = generator.generate_with_langchain_docs(
    documents=docs,
    testset_size=50,
)

# Export to DataFrame
testset_df = testset.to_pandas()
testset_df.to_csv("synthetic_testset.csv", index=False)
print(testset_df.head())
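One practical caveat when a test set goes through CSV: list-valued columns are written as their string form and come back as plain strings. The sketch below restores them using only the standard library. The helper `parse_list_column` is hypothetical, and the column name `reference_contexts` is an assumption about the exported test set; adjust it to whatever list columns your file actually contains.

```python
import ast
import csv
import io


def parse_list_column(rows, column):
    """Turn a stringified list such as "['a', 'b']" back into a Python list."""
    parsed = []
    for row in rows:
        row = dict(row)
        value = row.get(column, "")
        # literal_eval only accepts Python literals, so it is safe on untrusted text
        row[column] = ast.literal_eval(value) if value else []
        parsed.append(row)
    return parsed


# Simulate a CSV export in which a list was written as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_input", "reference_contexts"])
writer.writeheader()
writer.writerow({
    "user_input": "Who invented the telephone?",
    "reference_contexts": str(["Bell patented the telephone in 1876."]),
})

buffer.seek(0)
rows = parse_list_column(csv.DictReader(buffer), "reference_contexts")
print(rows[0]["reference_contexts"])
```

Storing the golden set as JSON or JSON Lines avoids the issue entirely, since lists survive the round trip unchanged.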

Custom Metrics

from ragas.metrics.base import MetricWithLLM, SingleTurnMetric
from ragas.prompt import PydanticPrompt
from pydantic import BaseModel, Field
from dataclasses import dataclass, field

class ToneInput(BaseModel):
    question: str
    answer: str

class ToneOutput(BaseModel):
    is_professional: bool = Field(description="Is the tone professional?")
    score: float = Field(description="Professionalism score 0.0-1.0")

class TonePrompt(PydanticPrompt[ToneInput, ToneOutput]):
    instruction: str = (
        "Evaluate if the answer uses a professional tone. "
        "Return a score from 0 (very informal) to 1 (very professional)."
    )
    input_model = ToneInput
    output_model = ToneOutput
    examples = []

@dataclass
class ProfessionalismMetric(MetricWithLLM, SingleTurnMetric):
    name: str = "professionalism"
    tone_prompt: TonePrompt = field(default_factory=TonePrompt)

    async def _single_turn_ascore(self, sample, callbacks=None):
        prompt_input = ToneInput(
            question=sample.user_input,
            answer=sample.response,
        )
        result = await self.tone_prompt.generate(
            data=prompt_input, llm=self.llm, callbacks=callbacks
        )
        return result.score

# Use the custom metric
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
professionalism = ProfessionalismMetric(llm=llm)

# `dataset` is the EvaluationDataset built in the Basic Evaluation example
result = evaluate(dataset=dataset, metrics=[professionalism])

Evaluation with LlamaIndex

from ragas import EvaluationDataset
from ragas.integrations.llama_index import evaluate as llama_evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy

# Build a LlamaIndex query engine over your documents
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

questions = ["What is photosynthesis?", "How does DNA replication work?"]

# The integration runs the query engine itself, so only the questions are passed.
# Assumes a ragas version in which they are supplied as an EvaluationDataset;
# check the evaluate() signature of your installed version.
dataset = EvaluationDataset.from_list([{"user_input": q} for q in questions])

# Evaluate
result = llama_evaluate(
    query_engine=query_engine,
    metrics=[Faithfulness(), AnswerRelevancy()],
    dataset=dataset,
)
print(result)

Common Workflows

Workflow 1: CI/CD Pipeline Integration

# evaluate_rag.py — run in CI after deploying new RAG version
import sys
from ragas import evaluate, EvaluationDataset
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision

FAITHFULNESS_THRESHOLD = 0.85
RELEVANCY_THRESHOLD = 0.80

# Load your golden test set
import pandas as pd
df = pd.read_csv("tests/golden_testset.csv")
dataset = EvaluationDataset.from_list(df.to_dict(orient="records"))

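# Run your RAG pipeline on the test set questions.
# my_rag_pipeline is a placeholder for your own function; it is expected to
# return a dict with "answer" (str) and "contexts" (list of str).
samples_with_outputs = []
for sample in dataset:
    output = my_rag_pipeline(sample.user_input)
    samples_with_outputs.append({
        **sample.model_dump(),  # pydantic v2 replacement for the deprecated .dict()
        "response": output["answer"],
        "retrieved_contexts": output["contexts"],
    })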

eval_dataset = EvaluationDataset.from_list(samples_with_outputs)
result = evaluate(eval_dataset, metrics=[Faithfulness(), AnswerRelevancy(), ContextPrecision()])

# Average the per-sample scores via the DataFrame; result[metric] can be a
# per-sample list in recent ragas versions, which cannot be compared to a float.
scores = result.to_pandas()[["faithfulness", "answer_relevancy"]].mean()

print(f"Faithfulness: {scores['faithfulness']:.3f}")
print(f"Answer Relevancy: {scores['answer_relevancy']:.3f}")

# Fail CI if below threshold
if scores["faithfulness"] < FAITHFULNESS_THRESHOLD:
    print(f"FAIL: Faithfulness {scores['faithfulness']:.3f} < {FAITHFULNESS_THRESHOLD}")
    sys.exit(1)

if scores["answer_relevancy"] < RELEVANCY_THRESHOLD:
    print(f"FAIL: Answer Relevancy {scores['answer_relevancy']:.3f} < {RELEVANCY_THRESHOLD}")
    sys.exit(1)

print("All metrics passed!")
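The script above exits at the first failing metric, so a CI log shows only one problem per run. A variation is to collect every failure before exiting. The sketch below isolates that gating logic in plain Python; `failed_checks` is a hypothetical helper, and the score values are made up for illustration.

```python
def failed_checks(scores: dict, thresholds: dict) -> list:
    """Return one message per metric that is missing, NaN, or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value != value:  # value != value is True only for NaN
            failures.append(f"{metric}: no score")
        elif value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum}")
    return failures


# Hypothetical aggregate scores from one evaluation run
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80}
scores = {"faithfulness": 0.90, "answer_relevancy": 0.70}

failures = failed_checks(scores, thresholds)
for message in failures:
    print(f"FAIL: {message}")
exit_code = 1 if failures else 0  # pass this to sys.exit() in a real CI script
```

Treating a missing or NaN score as a failure is a deliberate choice here: a metric that silently failed to compute should not let the build pass.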

Workflow 2: Compare Two RAG Configurations

from ragas import evaluate, EvaluationDataset
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextRecall
import pandas as pd

def evaluate_config(pipeline_fn, testset, label):
    outputs = [pipeline_fn(s["user_input"]) for s in testset]
    samples = [
        {**s, "response": o["answer"], "retrieved_contexts": o["contexts"]}
        for s, o in zip(testset, outputs)
    ]
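    result = evaluate(EvaluationDataset.from_list(samples),
                      metrics=[Faithfulness(), AnswerRelevancy(), ContextRecall()])
    # Average per-sample scores via the DataFrame; this works whether
    # result[metric] is an aggregate or a per-sample list.
    means = result.to_pandas()[["faithfulness", "answer_relevancy", "context_recall"]].mean()
    return {
        "config": label,
        "faithfulness": means["faithfulness"],
        "answer_relevancy": means["answer_relevancy"],
        "context_recall": means["context_recall"],
    }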

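# load_golden_testset, rag_v1 and rag_v2 are placeholders for your own code:
# the loader returns a list of dicts with "user_input" and "reference", and each
# pipeline returns a dict with "answer" and "contexts".
testset = load_golden_testset()

results = pd.DataFrame([
    evaluate_config(rag_v1, testset, "v1-top3"),
    evaluate_config(rag_v2, testset, "v2-top5-reranked"),
])

# DataFrame.to_markdown requires the optional `tabulate` package
print(results.to_markdown(index=False))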
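Comparing only the two averages can hide samples that moved in opposite directions. Since both configurations run on the same questions, a paired per-sample comparison is a useful complement. The sketch below is one hedged way to do it; `compare_scores`, its `tolerance` parameter, and the score lists are hypothetical.

```python
from statistics import mean


def compare_scores(baseline, candidate, tolerance=0.02):
    """Compare per-sample scores of two configurations on the same questions."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same samples in the same order")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    mean_delta = mean(deltas)
    if mean_delta > tolerance:
        verdict = "candidate better"
    elif mean_delta < -tolerance:
        verdict = "candidate worse"
    else:
        verdict = "no clear change"
    return {
        "mean_delta": mean_delta,
        # Count samples that moved by more than the tolerance in each direction
        "improved": sum(d > tolerance for d in deltas),
        "regressed": sum(d < -tolerance for d in deltas),
        "verdict": verdict,
    }


# Hypothetical per-sample faithfulness scores for two configurations
v1_scores = [0.80, 0.90, 0.70]
v2_scores = [0.90, 0.90, 0.60]
summary = compare_scores(v1_scores, v2_scores)
print(summary)
```

In this made-up example the averages are equal, yet one sample improved and one regressed, which is exactly the kind of change an aggregate table would not reveal. The tolerance is arbitrary; LLM-judged scores are noisy, so small differences are better confirmed over repeated runs.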

Tips and Best Practices

  • Faithfulness and Answer Relevancy are reference-free; run them on any sample of production traffic without pre-labeling.
  • Context Precision and Context Recall require a reference answer; TestsetGenerator can produce these so you can build a golden set automatically.
  • Use a strong evaluator LLM (GPT-4o or Claude 3.5 Sonnet) for reliable metric scores; evaluator quality directly affects result accuracy.
  • Keep evaluator LLM separate from your RAG pipeline LLM to avoid self-evaluation bias.
  • Sample production traffic (e.g., 100 traces/day) rather than evaluating everything, because every RAGAS evaluation costs evaluator-LLM tokens.
  • A low faithfulness score typically means your LLM is making claims beyond the retrieved context; try a tighter system prompt or a lower temperature.
  • A low context recall score means your retriever isn’t fetching all the necessary information; increase top_k or improve chunking.
  • Export to DataFrame with result.to_pandas() for per-sample analysis; aggregate scores hide outliers.
  • Pin your testset version in version control so metric trends are comparable across deployments.
  • Custom metrics using MetricWithLLM let you evaluate domain-specific quality signals (toxicity, compliance, tone) alongside built-in metrics.
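The point about aggregates hiding outliers can be illustrated without any library. In the sketch below, `rows` stands in for the per-sample records you would get from result.to_pandas() (for example via `.to_dict(orient="records")`); the helper `summarize` and all values are hypothetical.

```python
from statistics import mean


def summarize(rows, metric, worst_n=2):
    """Return the mean score and the lowest-scoring samples for one metric."""
    scored = [row for row in rows if row.get(metric) is not None]
    worst = sorted(scored, key=lambda row: row[metric])[:worst_n]
    return {"mean": mean(row[metric] for row in scored), "worst": worst}


# Hypothetical per-sample results
rows = [
    {"user_input": "Q1", "faithfulness": 1.0},
    {"user_input": "Q2", "faithfulness": 1.0},
    {"user_input": "Q3", "faithfulness": 0.9},
    {"user_input": "Q4", "faithfulness": 0.3},
]

report = summarize(rows, "faithfulness", worst_n=1)
print(f"mean faithfulness: {report['mean']:.2f}")
for row in report["worst"]:
    print(f"inspect: {row['user_input']} scored {row['faithfulness']}")
```

Here the mean looks acceptable while one sample is badly ungrounded; reviewing the lowest-scoring samples is usually the fastest way to find the retrieval or prompting issue behind them.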