Overview
DeepEval is an open-source evaluation framework for LLM applications that provides unit-testing-style evaluation with 14+ research-backed metrics. It covers RAG evaluation (faithfulness, relevancy, contextual recall), hallucination detection, toxicity checking, bias evaluation, and custom metric creation. DeepEval integrates with pytest for CI/CD pipelines and offers a hosted platform (Confident AI) for tracking evaluations over time.
The framework enables developers to write LLM evaluation tests that run alongside regular test suites, catching regressions in AI quality before deployment. It supports both reference-based metrics (requiring ground truth) and reference-free metrics (using LLM-as-judge), making it applicable across development stages from prototyping to production monitoring.
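The difference between the two metric families can be illustrated without any LLM calls. The sketch below is not DeepEval code: `token_f1`, `judge_score`, and the stand-in `length_judge` are hypothetical names chosen for illustration. It contrasts a reference-based score (which needs an expected answer) with a reference-free score (which only needs a judge).

```python
from typing import Callable


def token_f1(actual: str, expected: str) -> float:
    """Reference-based: token-overlap F1 between the output and a ground-truth answer."""
    a, e = set(actual.lower().split()), set(expected.lower().split())
    overlap = len(a & e)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(e)
    return 2 * precision * recall / (precision + recall)


def judge_score(question: str, actual: str, judge: Callable[[str], float]) -> float:
    """Reference-free: a judge (normally an LLM) grades the answer with no ground truth."""
    prompt = (
        "Rate from 0 to 1 how well this answers the question.\n"
        f"Q: {question}\nA: {actual}"
    )
    # Clamp to [0, 1] in case the judge returns something out of range.
    return max(0.0, min(1.0, judge(prompt)))


def length_judge(prompt: str) -> float:
    """Hypothetical offline stand-in; a real judge would call an LLM."""
    return 1.0 if len(prompt) > 80 else 0.2


ref_based = token_f1(
    "RAG combines retrieval with generation",
    "RAG combines document retrieval with generation",
)
ref_free = judge_score("What is RAG?", "RAG combines retrieval with generation", length_judge)
print(f"reference-based: {ref_based:.3f}, reference-free: {ref_free:.3f}")
```

Reference-based scoring is cheap and deterministic but requires curated expected outputs; reference-free scoring works on live traffic but inherits the judge's cost and variance.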
Installation
pip install deepeval
# Login to Confident AI (optional, for tracking)
deepeval login
# Verify installation
deepeval --version
Core Concepts
Test Case Structure
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    # What the user asked and what your application answered
    input="What is RAG?",
    actual_output="RAG combines retrieval with generation to ground LLM responses in factual data.",
    # Ideal answer, needed by reference-based metrics
    expected_output="Retrieval-augmented generation combines document retrieval with LLM generation.",
    # Chunks your retriever actually returned at run time
    retrieval_context=[
        "RAG is a technique that retrieves relevant documents and feeds them to an LLM for generation.",
        "Retrieval-augmented generation grounds LLM outputs in factual data from external sources.",
    ],
    # Ground-truth context, used by HallucinationMetric
    context=["RAG was introduced by Facebook AI Research in 2020."],
)
Running Tests with pytest
# test_llm.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric


def test_answer_relevancy():
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What are the benefits of RAG?",
        actual_output="RAG reduces hallucinations and provides up-to-date information.",
        retrieval_context=["RAG helps ground LLM outputs in factual, retrieved data."],
    )
    assert_test(test_case, [metric])


def test_faithfulness():
    metric = FaithfulnessMetric(threshold=0.8)
    test_case = LLMTestCase(
        input="Explain vector databases",
        actual_output="Vector databases store embeddings for similarity search.",
        retrieval_context=["Vector databases are designed to store and query embedding vectors efficiently."],
    )
    assert_test(test_case, [metric])
# Run tests
deepeval test run test_llm.py
# Run with verbose output
deepeval test run test_llm.py -v
# Run specific test
deepeval test run test_llm.py::test_answer_relevancy
Evaluation Metrics
RAG Metrics
| Metric | Measures | Requires |
|---|---|---|
| FaithfulnessMetric | Is output grounded in retrieved context? | retrieval_context |
| AnswerRelevancyMetric | Is output relevant to the input? | input, actual_output |
| ContextualPrecisionMetric | Are relevant chunks ranked higher? | retrieval_context, expected_output |
| ContextualRecallMetric | Are all relevant chunks retrieved? | retrieval_context, expected_output |
| ContextualRelevancyMetric | Are retrieved chunks relevant to input? | retrieval_context |
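Once an LLM judge has produced per-claim or per-chunk verdicts, the RAG scores reduce to simple ratios. The sketch below assumes those verdicts are already available as booleans and follows the definitions as I understand them from DeepEval's documentation (faithfulness as supported claims over total claims, contextual precision as a rank-weighted precision). The function names and the handling of empty inputs are assumptions, not the library's code.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of claims in the output that the retrieved context supports."""
    if not claim_supported:
        # Assumption: an output with no claims has nothing to contradict.
        return 1.0
    return sum(claim_supported) / len(claim_supported)


def contextual_precision(chunk_relevant: List[bool]) -> float:
    """Rank-weighted precision: relevant chunks near the top score higher."""
    relevant_seen = 0
    total = 0.0
    for rank, is_relevant in enumerate(chunk_relevant, start=1):
        if is_relevant:
            relevant_seen += 1
            # Precision at this rank, counted only at relevant positions.
            total += relevant_seen / rank
    return total / relevant_seen if relevant_seen else 0.0


# Same two relevant chunks, different ranking order.
good_ranking = contextual_precision([True, True, False])
poor_ranking = contextual_precision([False, True, True])
print(f"good: {good_ranking:.3f}, poor: {poor_ranking:.3f}")
print(f"faithfulness: {faithfulness_score([True, True, False, True]):.2f}")
```

The two rankings contain the same chunks, so a plain relevant/total ratio could not tell them apart; the rank weighting is what lets the metric penalize a retriever that buries relevant chunks.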
General LLM Metrics
| Metric | Measures | Requires |
|---|---|---|
| HallucinationMetric | Does output contradict context? | context |
| ToxicityMetric | Is output toxic or harmful? | actual_output |
| BiasMetric | Does output show bias? | actual_output |
| GEval | Custom LLM-as-judge evaluation | Configurable |
| SummarizationMetric | Quality of summarization | input, actual_output |
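The "Requires" columns above can be turned into a pre-flight check that catches unpopulated fields before any LLM call is made. This is a hypothetical helper, not part of DeepEval: it works on a plain dict instead of an `LLMTestCase`, and the mapping is copied from the two tables (GEval is omitted because its requirements are configurable).

```python
from typing import Any, Dict, List

# Fields each metric needs, copied from the "Requires" columns above.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "FaithfulnessMetric": ["retrieval_context"],
    "AnswerRelevancyMetric": ["input", "actual_output"],
    "ContextualPrecisionMetric": ["retrieval_context", "expected_output"],
    "ContextualRecallMetric": ["retrieval_context", "expected_output"],
    "ContextualRelevancyMetric": ["retrieval_context"],
    "HallucinationMetric": ["context"],
    "ToxicityMetric": ["actual_output"],
    "BiasMetric": ["actual_output"],
    "SummarizationMetric": ["input", "actual_output"],
}


def missing_fields(metric_name: str, case: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in a dict-shaped test case."""
    return [f for f in REQUIRED_FIELDS.get(metric_name, []) if not case.get(f)]


case = {
    "input": "What is RAG?",
    "actual_output": "RAG combines retrieval with generation.",
    "retrieval_context": ["RAG retrieves documents and feeds them to an LLM."],
}
for name in ("FaithfulnessMetric", "ContextualRecallMetric", "HallucinationMetric"):
    print(name, "missing:", missing_fields(name, case))
```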
Using Metrics
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
    HallucinationMetric,
    ToxicityMetric,
    BiasMetric,
)

# Faithfulness (is answer grounded in context?)
faithfulness = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True,
)
# Answer relevancy
relevancy = AnswerRelevancyMetric(threshold=0.7)
# Contextual recall (did we retrieve all needed info?)
recall = ContextualRecallMetric(threshold=0.8)
# Hallucination detection
hallucination = HallucinationMetric(threshold=0.5)
# Run metric
faithfulness.measure(test_case)
print(f"Score: {faithfulness.score}")
print(f"Reason: {faithfulness.reason}")
print(f"Passed: {faithfulness.is_successful()}")
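One detail that is easy to miss when choosing thresholds: for most metrics a higher score is better and the threshold is a minimum, but for HallucinationMetric (and, per DeepEval's documentation as I understand it, BiasMetric and ToxicityMetric) a lower score is better and the threshold acts as a maximum. The helper below is a hypothetical sketch of that rule, not the library's implementation.

```python
# Metrics where the score measures something undesirable, so lower is better.
LOWER_IS_BETTER = {"HallucinationMetric", "BiasMetric", "ToxicityMetric"}


def passes(metric_name: str, score: float, threshold: float) -> bool:
    """Apply the threshold as a maximum for lower-is-better metrics, else as a minimum."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold
    return score >= threshold


print(passes("FaithfulnessMetric", 0.82, 0.7))   # high faithfulness clears a 0.7 minimum
print(passes("HallucinationMetric", 0.82, 0.5))  # high hallucination exceeds a 0.5 maximum
```

This is why `HallucinationMetric(threshold=0.5)` above is not a looser check than `FaithfulnessMetric(threshold=0.7)`; the two thresholds point in opposite directions.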
Custom Metrics with GEval
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom metric using LLM-as-judge
technical_accuracy = GEval(
    name="Technical Accuracy",
    criteria="Evaluate if the response contains technically accurate information about the topic.",
    # Which test case fields the judge is shown
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    # Explicit steps make the judge's scoring more consistent between runs
    evaluation_steps=[
        "Check if technical terms are used correctly",
        "Verify code examples are syntactically valid",
        "Ensure explanations are factually accurate",
        "Check for common misconceptions",
    ],
    threshold=0.7,
    model="gpt-4o",
)

# Use in test
test_case = LLMTestCase(
    input="Explain how HNSW indexing works",
    actual_output="HNSW builds a multi-layer graph where each layer...",
)
technical_accuracy.measure(test_case)
print(f"Technical Accuracy: {technical_accuracy.score}")
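For background on where a fractional GEval score can come from: in the G-Eval approach, the final score is the probability-weighted average of the candidate ratings the judge could emit, not just the single most likely rating. The sketch below shows that arithmetic only. The function name, the 0 to 10 rating scale, and the normalization to 0 to 1 are assumptions for illustration; whether a given DeepEval version applies this weighting depends on the judge model exposing token probabilities.

```python
from typing import Dict


def weighted_score(rating_probs: Dict[int, float], max_rating: int = 10) -> float:
    """Probability-weighted mean rating, normalized to the 0-1 range."""
    total_p = sum(rating_probs.values())
    if total_p <= 0:
        raise ValueError("rating_probs must contain positive probability mass")
    # Renormalize in case the listed ratings do not cover all probability mass.
    expected = sum(rating * p for rating, p in rating_probs.items()) / total_p
    return expected / max_rating


# Hypothetical judge distribution: mostly 8, some mass on 7 and 9.
score = weighted_score({7: 0.2, 8: 0.5, 9: 0.3})
print(f"weighted score: {score:.2f}")
```

Compared with taking the top rating alone, the weighted form produces finer-grained scores, which matters when the threshold sits between two integer ratings.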
Batch Evaluation
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Create dataset
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(
            input="What is RAG?",
            actual_output="RAG is retrieval-augmented generation.",
            retrieval_context=["RAG combines retrieval with LLM generation."],
        ),
        LLMTestCase(
            input="How do embeddings work?",
            actual_output="Embeddings map text to dense vectors.",
            retrieval_context=["Embeddings are numerical representations of text."],
        ),
    ]
)

# Evaluate all at once
metrics = [FaithfulnessMetric(threshold=0.7), AnswerRelevancyMetric(threshold=0.7)]
results = evaluate(test_cases=dataset.test_cases, metrics=metrics)

# View results
for result in results.test_results:
    print(f"Input: {result.input}")
    for metric_result in result.metrics_data:
        status = "PASS" if metric_result.success else "FAIL"
        print(f"  {metric_result.name}: {metric_result.score:.3f} ({status})")
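For larger datasets, a per-metric summary is usually more useful than a per-case printout. The sketch below aggregates pass rate and mean score from rows shaped as `(metric name, score, passed)` tuples. That row shape and the `summarize` helper are assumptions for illustration; in practice the rows would be built from the `name`, `score`, and `success` fields printed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, float, bool]  # (metric name, score, passed)


def summarize(rows: Iterable[Row]) -> Dict[str, Dict[str, float]]:
    """Compute pass rate and mean score for each metric name."""
    counts: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)
    for name, score, ok in rows:
        counts[name] += 1
        passed[name] += int(ok)
        totals[name] += score
    return {
        name: {
            "pass_rate": passed[name] / counts[name],
            "mean_score": totals[name] / counts[name],
        }
        for name in counts
    }


# Hypothetical rows standing in for real evaluation output.
rows = [
    ("Faithfulness", 0.9, True),
    ("Faithfulness", 0.5, False),
    ("Answer Relevancy", 0.8, True),
]
summary = summarize(rows)
for name, stats in summary.items():
    print(f"{name}: pass rate {stats['pass_rate']:.0%}, mean score {stats['mean_score']:.2f}")
```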
Configuration
Model Configuration
from deepeval.metrics import FaithfulnessMetric
from deepeval.models import DeepEvalBaseLLM


# Use custom model
class CustomModel(DeepEvalBaseLLM):
    def __init__(self):
        self.model_name = "custom-llm"

    def load_model(self):
        # Load your model
        pass

    def generate(self, prompt: str) -> str:
        # Generate response
        return "model response"

    async def a_generate(self, prompt: str) -> str:
        # Async variant; here it simply reuses the synchronous path
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_name


# Use with metrics
metric = FaithfulnessMetric(model=CustomModel())
Environment Variables
export OPENAI_API_KEY=sk-...
export CONFIDENT_API_KEY=your-confident-ai-key
export DEEPEVAL_RESULTS_FOLDER=./eval_results
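A missing key typically surfaces only when the first metric calls the judge model, which can be several minutes into a run. A small pre-flight check makes the failure immediate. The `missing_vars` helper below is hypothetical, and treating only `OPENAI_API_KEY` as required assumes the default OpenAI judge is in use.

```python
import os
from typing import List, Mapping, Sequence


def missing_vars(names: Sequence[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variable names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


# Assumption: metrics use the default OpenAI judge, so only this key is required.
missing = missing_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Environment looks complete.")
```

Passing `env` explicitly, as in `missing_vars(["OPENAI_API_KEY"], {"OPENAI_API_KEY": "sk-test"})`, keeps the check easy to unit test.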
Advanced Usage
Conversational Evaluation
from deepeval.test_case import ConversationalTestCase, LLMTestCase

convo_test = ConversationalTestCase(
    turns=[
        LLMTestCase(input="What is Python?", actual_output="Python is a programming language."),
        LLMTestCase(input="What frameworks does it have?", actual_output="Popular frameworks include Django, Flask, and FastAPI."),
        LLMTestCase(input="Which is best for APIs?", actual_output="FastAPI is excellent for building APIs due to its speed and type hints."),
    ]
)
Red Teaming
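# Note: the red-teaming interface has changed between DeepEval releases, and
# newer releases move it to the separate DeepTeam package. Treat the calls
# below as illustrative and check the documentation for your installed version.
from deepeval.red_teaming import RedTeamer

red_teamer = RedTeamer(model="gpt-4o")

# Generate adversarial inputs
attacks = red_teamer.generate(
    purpose="Customer support chatbot",
    system_prompt="You are a helpful customer service agent.",
    attacks_per_category=5,
    categories=["jailbreaking", "prompt_injection", "toxicity"],
)

# Evaluate vulnerability
results = red_teamer.evaluate(attacks)
print(f"Vulnerability score: {results.vulnerability_score}")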
CI/CD Integration
# .github/workflows/llm-eval.yml
name: LLM Evaluation
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install deepeval
      - run: deepeval test run tests/test_llm.py --verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Troubleshooting
| Issue | Solution |
|---|---|
| OpenAI rate limit errors | Reduce parallel evaluations and retry with backoff; confirm OPENAI_API_KEY is set to a key with sufficient quota |
| Metric score always 0 or 1 | Check test case fields are populated correctly |
| Faithfulness requires retrieval_context | Add retrieval_context to LLMTestCase |
| Slow evaluation | Use gpt-4o-mini for metrics, batch test cases |
| Import errors | Ensure deepeval is latest version: pip install -U deepeval |
| Custom model not working | Implement both generate and a_generate methods |
| Results not showing on Confident AI | Run deepeval login and check API key |
| pytest not discovering tests | Name files test_*.py, functions test_* |
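For the rate-limit row in particular, a generic retry wrapper with exponential backoff is a common mitigation. The sketch below is plain Python, not a DeepEval feature: `with_backoff` is a hypothetical helper, and catching bare `Exception` is a simplification that should be narrowed to your client's rate-limit error in real use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on failure with delays of base_delay * 2**attempt plus jitter."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return call()
        except Exception:  # simplification: narrow this to the rate-limit error type
            if attempt == retries - 1:
                raise  # out of attempts, surface the last error
            # Small random jitter avoids many workers retrying in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")


# Offline demo: a hypothetical call that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


print(with_backoff(flaky_call, sleep=lambda seconds: None))
```

A metric call could be wrapped as `with_backoff(lambda: metric.measure(test_case))`, assuming the underlying client raises on rate limiting instead of retrying internally.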
# Debug metric computation
python -c "
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
m = FaithfulnessMetric(threshold=0.5, include_reason=True)
tc = LLMTestCase(input='test', actual_output='test output', retrieval_context=['context'])
m.measure(tc)
print(f'Score: {m.score}, Reason: {m.reason}')
"