RAGAS Cheat Sheet
Overview
RAGAS (RAG Assessment) is a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines without needing hand-labeled ground truth for every metric. It uses LLMs as judges to compute component-wise scores that isolate whether problems lie in retrieval (context precision/recall) or generation (faithfulness/answer relevancy).
Key metrics: Faithfulness measures whether the answer is grounded in the retrieved context (low scores indicate hallucination). Answer Relevancy measures whether the answer addresses the question. Context Precision measures whether relevant contexts are ranked above irrelevant ones. Context Recall measures whether all the information needed to answer was retrieved.
RAGAS also provides synthetic test set generation, CI/CD integration, and custom metric authoring.
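As a rough illustration of the Faithfulness score described above: the judge LLM splits the answer into claims and checks each one against the retrieved context, and the score is the fraction of claims that are supported. The sketch below reproduces only that final arithmetic on hand-written verdicts; the function name and the handling of an answer with no claims are assumptions, not RAGAS internals.

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of answer claims judged as supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0  # assumption: an answer with no claims scores 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hypothetical verdicts for an answer split into four claims,
# three of which the judge found supported by the context.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```

A score of 1.0 therefore means every claim in the answer could be traced back to the context, not that the answer is correct in an absolute sense.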
Installation
# Core RAGAS
pip install ragas
# With specific LLM providers
pip install ragas openai # OpenAI evaluator
pip install ragas langchain-anthropic # Anthropic evaluator (via LangChain)
pip install ragas langchain-openai
# With LlamaIndex integration
pip install ragas llama-index
# Development / latest
pip install git+https://github.com/explodinggradients/ragas.git
Configuration
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Configure evaluator LLM and embeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
# --- Anthropic evaluator ---
from langchain_anthropic import ChatAnthropic
evaluator_llm = LangchainLLMWrapper(
    ChatAnthropic(model="claude-3-5-sonnet-20241022")
)
# --- Azure OpenAI ---
from langchain_openai import AzureChatOpenAI
evaluator_llm = LangchainLLMWrapper(
    AzureChatOpenAI(
        azure_deployment="gpt-4o",
        azure_endpoint="https://your-resource.openai.azure.com/",
        api_version="2024-02-01",
    )
)
Core Metrics Reference
| Metric | Class | Needs Reference | Range | Measures |
|---|---|---|---|---|
| Faithfulness | Faithfulness | No | 0–1 | Answer grounded in retrieved context (higher = fewer hallucinations) |
| Answer Relevancy | AnswerRelevancy | No | 0–1 | Answer addresses question |
| Context Precision | ContextPrecision | Yes (reference) | 0–1 | Relevant context ranked higher |
| Context Recall | ContextRecall | Yes (reference) | 0–1 | All needed context retrieved |
| Context Entity Recall | ContextEntityRecall | Yes | 0–1 | Entity coverage in context |
| Answer Correctness | AnswerCorrectness | Yes | 0–1 | Factual accuracy vs. reference |
| Answer Similarity | AnswerSimilarity | Yes | 0–1 | Semantic similarity to reference |
| Context Relevance (v0.1) | ContextRelevance | No | 0–1 | Context relevance to question |
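To make "relevant context ranked higher" concrete, here is a minimal sketch of the rank-weighted average that Context Precision is based on: precision@k is computed at every rank that holds a relevant chunk, and the results are averaged over the number of relevant chunks. The relevance labels are hand-written here (in RAGAS they come from the judge LLM), and the function name is an assumption.

```python
def context_precision(relevance: list[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    `relevance` lists the retrieved chunks in rank order:
    1 = relevant, 0 = not relevant.
    """
    relevant_total = sum(relevance)
    if relevant_total == 0:
        return 0.0  # nothing relevant was retrieved
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / relevant_total


print(context_precision([1, 1, 0]))  # 1.0: both relevant chunks ranked first
print(context_precision([1, 0, 1]))  # lower: a relevant chunk sits at rank 3
print(context_precision([0, 1, 1]))  # lower still: an irrelevant chunk ranks first
```

The same set of chunks scores differently depending on order, which is why a reranker can raise Context Precision without changing what is retrieved.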
Core API Reference
| API | Description |
|---|---|
| EvaluationDataset | Container for evaluation samples |
| SingleTurnSample | One QA pair with context |
| evaluate(dataset, metrics) | Run evaluation |
| TestsetGenerator | Generate synthetic test sets |
| KnowledgeGraph | Source document graph for testset gen |
| Faithfulness() | Instantiate faithfulness metric |
| AnswerRelevancy() | Instantiate answer relevancy metric |
| ContextPrecision() | Instantiate context precision metric |
| ContextRecall() | Instantiate context recall metric |
| result.to_pandas() | Convert per-sample results to DataFrame |
| result["faithfulness"] | Access a metric's scores by name (an aggregate in older releases; may be a per-sample list in newer ones) |
Advanced Usage
Basic Evaluation
from ragas import evaluate, EvaluationDataset
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision, ContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
# Prepare your pipeline outputs
samples = [
    {
        "user_input": "What is the boiling point of water?",
        "retrieved_contexts": [
            "Water boils at 100°C (212°F) at standard atmospheric pressure.",
            "The boiling point varies with altitude and pressure.",
        ],
        "response": "Water boils at 100°C at sea level.",
        "reference": "The boiling point of water is 100°C (212°F) at 1 atm.",
    },
    {
        "user_input": "Who invented the telephone?",
        "retrieved_contexts": [
            "Alexander Graham Bell is credited with inventing the telephone in 1876.",
        ],
        "response": "Alexander Graham Bell invented the telephone.",
        "reference": "Alexander Graham Bell invented the telephone in 1876.",
    },
]
dataset = EvaluationDataset.from_list(samples)
# Set up evaluator
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(llm=llm),
        AnswerRelevancy(llm=llm, embeddings=embeddings),
        ContextPrecision(llm=llm),
        ContextRecall(llm=llm),
    ],
)
print(result)
# Example output (values are illustrative): {'faithfulness': 0.95, 'answer_relevancy': 0.87, ...}
# Detailed results as DataFrame
df = result.to_pandas()
print(df.head())
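Before spending tokens on an evaluation run, it can help to check that every sample carries the fields the chosen metrics read. The sketch below is plain Python and does not call RAGAS; the `REQUIRED_FIELDS` mapping is an assumption inferred from the metrics table and the sample layout above, so adjust it to match your installed version.

```python
# Assumed field requirements per metric (inferred from this cheat sheet,
# not read from the library).
REQUIRED_FIELDS = {
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "answer_relevancy": {"user_input", "response"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
}


def missing_fields(samples: list[dict], metric_names: list[str]) -> dict[int, list[str]]:
    """Map sample index -> sorted list of fields that are absent or empty."""
    needed = set().union(*(REQUIRED_FIELDS[m] for m in metric_names))
    problems = {}
    for i, sample in enumerate(samples):
        # Treat None, empty strings and empty lists as missing.
        present = {k for k, v in sample.items() if v not in (None, "", [])}
        missing = needed - present
        if missing:
            problems[i] = sorted(missing)
    return problems


samples = [
    {"user_input": "Q1", "response": "A1", "retrieved_contexts": ["c"], "reference": "R1"},
    {"user_input": "Q2", "response": "A2", "retrieved_contexts": []},
]
print(missing_fields(samples, ["faithfulness", "context_recall"]))
# {1: ['reference', 'retrieved_contexts']}
```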
Synthetic Test Set Generation
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load your knowledge base documents
loader = DirectoryLoader("./docs/", glob="**/*.md")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(documents)
# Configure generator LLMs
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
# Not used below: the TestsetGenerator call here takes a single llm.
critic_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
# Generate synthetic testset
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=embeddings,
)
testset = generator.generate_with_langchain_docs(
    documents=docs,
    testset_size=50,
)
# Export to DataFrame
testset_df = testset.to_pandas()
testset_df.to_csv("synthetic_testset.csv", index=False)
print(testset_df.head())
Custom Metrics
from dataclasses import dataclass, field
from pydantic import BaseModel, Field
from ragas import evaluate
from ragas.metrics.base import MetricWithLLM, SingleTurnMetric
from ragas.prompt import PydanticPrompt

# Input and output schemas for the judge prompt
class ToneInput(BaseModel):
    question: str
    answer: str

class ToneOutput(BaseModel):
    is_professional: bool = Field(description="Is the tone professional?")
    score: float = Field(description="Professionalism score 0.0-1.0")

class TonePrompt(PydanticPrompt[ToneInput, ToneOutput]):
    instruction: str = (
        "Evaluate if the answer uses a professional tone. "
        "Return a score from 0 (very informal) to 1 (very professional)."
    )
    input_model = ToneInput
    output_model = ToneOutput
    examples = []

@dataclass
class ProfessionalismMetric(MetricWithLLM, SingleTurnMetric):
    name: str = "professionalism"
    tone_prompt: TonePrompt = field(default_factory=TonePrompt)

    async def _single_turn_ascore(self, sample, callbacks=None):
        # Ask the judge LLM to rate the response and return its score
        prompt_input = ToneInput(
            question=sample.user_input,
            answer=sample.response,
        )
        result = await self.tone_prompt.generate(
            data=prompt_input, llm=self.llm, callbacks=callbacks
        )
        return result.score

# Use the custom metric (dataset is the EvaluationDataset built in Basic Evaluation)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
professionalism = ProfessionalismMetric(llm=llm)
result = evaluate(dataset=dataset, metrics=[professionalism])
Evaluation with LlamaIndex
from ragas import EvaluationDataset
from ragas.integrations.llama_index import evaluate as llama_evaluate
from ragas.metrics import Faithfulness, AnswerRelevancy
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Build a LlamaIndex query engine over your documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
questions = ["What is photosynthesis?", "How does DNA replication work?"]
# The integration runs the query engine itself, so only the questions are passed.
# Assumption: your installed RAGAS version accepts the questions via `dataset=`;
# check the evaluate() signature of the integration before relying on this.
dataset = EvaluationDataset.from_list([{"user_input": q} for q in questions])
# Evaluate
result = llama_evaluate(
    query_engine=query_engine,
    metrics=[Faithfulness(), AnswerRelevancy()],
    dataset=dataset,
)
print(result)
Common Workflows
Workflow 1: CI/CD Pipeline Integration
# evaluate_rag.py: run in CI after deploying a new RAG version
import sys
import pandas as pd
from ragas import evaluate, EvaluationDataset
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision
FAITHFULNESS_THRESHOLD = 0.85
RELEVANCY_THRESHOLD = 0.80
# Load your golden test set
df = pd.read_csv("tests/golden_testset.csv")
dataset = EvaluationDataset.from_list(df.to_dict(orient="records"))
# Run your RAG pipeline on the test set questions.
# my_rag_pipeline is your own function; it is assumed to return
# {"answer": str, "contexts": list[str]}.
samples_with_outputs = []
for sample in dataset:
    output = my_rag_pipeline(sample.user_input)
    samples_with_outputs.append({
        **sample.model_dump(),
        "response": output["answer"],
        "retrieved_contexts": output["contexts"],
    })
eval_dataset = EvaluationDataset.from_list(samples_with_outputs)
result = evaluate(eval_dataset, metrics=[Faithfulness(), AnswerRelevancy(), ContextPrecision()])
# Aggregate from per-sample scores: depending on the RAGAS version,
# result["faithfulness"] may be a per-sample list rather than a mean.
scores = result.to_pandas()
faithfulness = scores["faithfulness"].mean()
answer_relevancy = scores["answer_relevancy"].mean()
print(f"Faithfulness: {faithfulness:.3f}")
print(f"Answer Relevancy: {answer_relevancy:.3f}")
# Fail CI if below threshold
if faithfulness < FAITHFULNESS_THRESHOLD:
    print(f"FAIL: Faithfulness {faithfulness:.3f} < {FAITHFULNESS_THRESHOLD}")
    sys.exit(1)
if answer_relevancy < RELEVANCY_THRESHOLD:
    print(f"FAIL: Answer Relevancy {answer_relevancy:.3f} < {RELEVANCY_THRESHOLD}")
    sys.exit(1)
print("All metrics passed!")
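The threshold checks above exit at the first failing metric, so a CI log shows only one problem per run. A possible variant, sketched below in plain Python, collects every failure first and also treats a missing or NaN score as a failure. The `check_thresholds` helper and the score values are hypothetical; in practice the scores would come from the evaluation result.

```python
import math


def check_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing, NaN, or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or math.isnan(value):
            failures.append(f"{metric}: no score reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum}")
    return failures


# Hypothetical aggregate scores for illustration only.
scores = {"faithfulness": 0.91, "answer_relevancy": 0.74}
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80}

failures = check_thresholds(scores, thresholds)
for line in failures:
    print("FAIL:", line)
exit_code = 1 if failures else 0  # pass this to sys.exit() in a CI script
```

Treating NaN as a failure matters because a comparison such as `nan < 0.85` is false, so a metric that could not be computed would otherwise pass silently.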
Workflow 2: Compare Two RAG Configurations
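from ragas import evaluate, EvaluationDataset
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextRecall
import pandas as pd

def evaluate_config(pipeline_fn, testset, label):
    """Run one pipeline over the testset and return its mean metric scores."""
    outputs = [pipeline_fn(s["user_input"]) for s in testset]
    samples = [
        {**s, "response": o["answer"], "retrieved_contexts": o["contexts"]}
        for s, o in zip(testset, outputs)
    ]
    result = evaluate(
        EvaluationDataset.from_list(samples),
        metrics=[Faithfulness(), AnswerRelevancy(), ContextRecall()],
    )
    # Average the per-sample scores: depending on the RAGAS version,
    # result["faithfulness"] may be a per-sample list rather than a mean.
    scores = result.to_pandas()
    return {
        "config": label,
        "faithfulness": scores["faithfulness"].mean(),
        "answer_relevancy": scores["answer_relevancy"].mean(),
        "context_recall": scores["context_recall"].mean(),
    }

# load_golden_testset, rag_v1 and rag_v2 are your own functions (not part of RAGAS).
# The testset is a list of dicts with "user_input" and "reference".
testset = load_golden_testset()
results = pd.DataFrame([
    evaluate_config(rag_v1, testset, "v1-top3"),
    evaluate_config(rag_v2, testset, "v2-top5-reranked"),
])
print(results.to_markdown(index=False))  # to_markdown requires the tabulate package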
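Once both configurations have a row of scores, the useful signal is usually the per-metric difference rather than the raw table. The sketch below works on plain dicts shaped like the rows returned by `evaluate_config`; the helper names, the score values, and the 0.01 tolerance are all hypothetical.

```python
def metric_deltas(baseline: dict, candidate: dict) -> dict[str, float]:
    """Candidate minus baseline for every numeric metric both rows share."""
    deltas = {}
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        # Skip non-numeric entries such as the "config" label.
        if isinstance(base_value, (int, float)) and isinstance(cand_value, (int, float)):
            deltas[name] = round(cand_value - base_value, 4)
    return deltas


def regressions(deltas: dict[str, float], tolerance: float = 0.01) -> list[str]:
    """Metrics that dropped by more than `tolerance`."""
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Hypothetical scores for illustration only.
v1 = {"config": "v1-top3", "faithfulness": 0.90, "context_recall": 0.70}
v2 = {"config": "v2-top5-reranked", "faithfulness": 0.88, "context_recall": 0.82}

deltas = metric_deltas(v1, v2)
print(deltas)
print("Regressed:", regressions(deltas))
```

A small tolerance is worth keeping because LLM-judged scores vary slightly between runs even when nothing in the pipeline has changed.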
Tips and Best Practices
- Faithfulness + Answer Relevancy are reference-free; run them on any production traffic sample without pre-labeling.
- Context Precision and Recall require a `reference` answer; generate these with `TestsetGenerator` to build your golden set automatically.
- Use a strong evaluator LLM (GPT-4o or Claude 3.5 Sonnet) for reliable metric scores; evaluator quality directly affects result accuracy.
- Keep evaluator LLM separate from your RAG pipeline LLM to avoid self-evaluation bias.
- Sample production traffic (e.g., 100 traces/day) rather than evaluating everything — RAGAS evaluations cost tokens.
- A low faithfulness score typically means your LLM is hallucinating beyond the context; try tighter system prompts or a lower temperature.
- Low context recall means your retriever isn’t fetching all necessary information; increase `top_k` or improve chunking.
- Export to DataFrame with `result.to_pandas()` for per-sample analysis; aggregate scores hide outliers.
- Pin your testset version in version control so metric trends are comparable across deployments.
- Custom metrics using `MetricWithLLM` let you evaluate domain-specific quality signals (toxicity, compliance, tone) alongside built-in metrics.
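To illustrate the point above that aggregate scores hide outliers, the sketch below compares the mean with the lowest-scoring samples. It uses a hand-written list of per-sample scores; in practice the list would be one metric column of the per-sample DataFrame. The `worst_samples` helper is hypothetical.

```python
from statistics import mean


def worst_samples(scores: list, n: int = 3) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the n lowest-scoring samples."""
    indexed = [(i, s) for i, s in enumerate(scores) if s is not None]
    return sorted(indexed, key=lambda pair: pair[1])[:n]


# Hypothetical per-sample faithfulness scores.
per_sample = [0.95, 1.0, 0.9, 0.2, 1.0, 0.97]

print(round(mean(per_sample), 3))      # 0.837: looks acceptable in aggregate
print(worst_samples(per_sample, n=2))  # [(3, 0.2), (2, 0.9)]: sample 3 needs a look
```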