DeepEval Cheat Sheet

Overview

DeepEval is an open-source evaluation framework for LLM applications that provides unit-testing-style evaluation with 14+ research-backed metrics. It covers RAG evaluation (faithfulness, relevancy, contextual recall), hallucination detection, toxicity checking, bias evaluation, and custom metric creation. DeepEval integrates with pytest for CI/CD pipelines and offers a hosted platform (Confident AI) for tracking evaluations over time.

The framework enables developers to write LLM evaluation tests that run alongside regular test suites, catching regressions in AI quality before deployment. It supports both reference-based metrics (requiring ground truth) and reference-free metrics (using LLM-as-judge), making it applicable across development stages from prototyping to production monitoring.
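
The difference between the two metric families can be sketched in plain Python. This is an illustration only, not DeepEval's implementation: `token_f1`, `judge_score`, and `fake_judge` are hypothetical names, and the judge is a stub so the example runs offline.

```python
from typing import Callable


def token_f1(actual: str, expected: str) -> float:
    """Reference-based: token-overlap F1 between the output and a ground-truth answer."""
    actual_tokens = set(actual.lower().split())
    expected_tokens = set(expected.lower().split())
    overlap = len(actual_tokens & expected_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual_tokens)
    recall = overlap / len(expected_tokens)
    return 2 * precision * recall / (precision + recall)


def judge_score(question: str, actual: str, judge: Callable[[str], float]) -> float:
    """Reference-free: ask a judge for a 0-1 score; no ground truth is needed."""
    prompt = f"Question: {question}\nAnswer: {actual}\nRate relevance from 0 to 1."
    # Clamp the judge's answer into the 0-1 range.
    return max(0.0, min(1.0, judge(prompt)))


def fake_judge(prompt: str) -> float:
    """Hypothetical stand-in for an LLM judge call."""
    return 0.9


reference_based = token_f1(
    "RAG combines retrieval with generation",
    "RAG combines retrieval and generation",
)
reference_free = judge_score(
    "What is RAG?", "RAG combines retrieval with generation", fake_judge
)
print(f"reference-based: {reference_based:.2f}, reference-free: {reference_free:.2f}")
```

The reference-based score needs an expected answer to compare against, while the reference-free score only needs the question and the output, which is why the latter remains usable in production where no ground truth exists.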

Installation

pip install deepeval

# Log in to Confident AI (optional, for tracking)
deepeval login

# Verify installation
deepeval --version

Core Concepts

Test Case Structure

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is RAG?",
    actual_output="RAG combines retrieval with generation to ground LLM responses in factual data.",
    expected_output="Retrieval-augmented generation combines document retrieval with LLM generation.",
    retrieval_context=[
        "RAG is a technique that retrieves relevant documents and feeds them to an LLM for generation.",
        "Retrieval-augmented generation grounds LLM outputs in factual data from external sources."
    ],
    context=["RAG was introduced by Facebook AI Research in 2020."]
)

Running Tests with pytest

# test_llm.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_answer_relevancy():
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What are the benefits of RAG?",
        actual_output="RAG reduces hallucinations and provides up-to-date information.",
        retrieval_context=["RAG helps ground LLM outputs in factual, retrieved data."]
    )
    assert_test(test_case, [metric])

def test_faithfulness():
    metric = FaithfulnessMetric(threshold=0.8)
    test_case = LLMTestCase(
        input="Explain vector databases",
        actual_output="Vector databases store embeddings for similarity search.",
        retrieval_context=["Vector databases are designed to store and query embedding vectors efficiently."]
    )
    assert_test(test_case, [metric])

# Run tests (shell commands, not part of test_llm.py)
deepeval test run test_llm.py

# Run with verbose output
deepeval test run test_llm.py -v

# Run specific test
deepeval test run test_llm.py::test_answer_relevancy

Evaluation Metrics

RAG Metrics

Metric	Measures	Requires
`FaithfulnessMetric`	Is output grounded in retrieved context?	`retrieval_context`
`AnswerRelevancyMetric`	Is output relevant to the input?	`input`, `actual_output`
`ContextualPrecisionMetric`	Are relevant chunks ranked higher?	`retrieval_context`, `expected_output`
`ContextualRecallMetric`	Are all relevant chunks retrieved?	`retrieval_context`, `expected_output`
`ContextualRelevancyMetric`	Are retrieved chunks relevant to input?	`retrieval_context`
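
As a rough sketch of the arithmetic behind three of these scores, assuming the commonly documented definitions (supported claims over total claims, attributable statements over total statements, and rank-weighted precision over retrieved chunks): the LLM-judge step that produces the per-item verdicts is omitted, the function names are hypothetical, and the empty-input behavior is an assumption of this sketch.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the output that the retrieved context supports."""
    if not claim_supported:
        return 1.0  # assumption in this sketch: no claims means nothing unfaithful
    return sum(claim_supported) / len(claim_supported)


def contextual_recall_score(statement_attributable: list[bool]) -> float:
    """Fraction of expected-output statements attributable to a retrieved chunk."""
    if not statement_attributable:
        return 0.0  # assumption in this sketch
    return sum(statement_attributable) / len(statement_attributable)


def contextual_precision_score(chunk_relevant: list[bool]) -> float:
    """Rank-weighted precision: relevant chunks that appear earlier score higher."""
    relevant_seen = 0
    total = 0.0
    for rank, is_relevant in enumerate(chunk_relevant, start=1):
        if is_relevant:
            relevant_seen += 1
            total += relevant_seen / rank  # precision at this rank
    return total / relevant_seen if relevant_seen else 0.0


# Verdicts would normally come from an LLM judge; they are hard-coded here.
print(faithfulness_score([True, True, False, True]))
print(contextual_precision_score([True, False, True]))
print(contextual_precision_score([False, True, True]))
```

The last two calls use the same set of relevant chunks in a different order; the second ranking scores lower because an irrelevant chunk comes first.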

General LLM Metrics

Metric	Measures	Requires
`HallucinationMetric`	Does output contradict context?	`context`
`ToxicityMetric`	Is output toxic or harmful?	`actual_output`
`BiasMetric`	Does output show bias?	`actual_output`
`GEval`	Custom LLM-as-judge evaluation	Configurable
`SummarizationMetric`	Quality of summarization	`input`, `actual_output`

Using Metrics

from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
    HallucinationMetric,
    ToxicityMetric,
    BiasMetric,
)

# Faithfulness (is answer grounded in context?)
faithfulness = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True
)

# Answer relevancy
relevancy = AnswerRelevancyMetric(threshold=0.7)

# Contextual recall (did we retrieve all needed info?)
recall = ContextualRecallMetric(threshold=0.8)

# Hallucination detection (lower is better: threshold is the maximum passing score)
hallucination = HallucinationMetric(threshold=0.5)

# Run metric
faithfulness.measure(test_case)
print(f"Score: {faithfulness.score}")
print(f"Reason: {faithfulness.reason}")
print(f"Passed: {faithfulness.is_successful()}")
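
The pass/fail decision behind `is_successful()` depends on the direction of the metric. A minimal sketch, assuming that hallucination, toxicity, and bias scores treat the threshold as a maximum while the other metrics treat it as a minimum; `passes` and `LOWER_IS_BETTER` are hypothetical names, not part of the library.

```python
# Metrics where a lower score is better (assumption stated in the text above).
LOWER_IS_BETTER = {"Hallucination", "Toxicity", "Bias"}


def passes(metric_name: str, score: float, threshold: float) -> bool:
    """Return True when the score satisfies the threshold for this metric's direction."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold  # threshold acts as a ceiling
    return score >= threshold  # threshold acts as a floor


print(passes("Faithfulness", 0.82, 0.7))
print(passes("Hallucination", 0.6, 0.5))
print(passes("Hallucination", 0.2, 0.5))
```

Keeping the direction in mind avoids a common mistake: raising a hallucination threshold makes the test more lenient, not stricter.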

Custom Metrics with GEval

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom metric using LLM-as-judge
technical_accuracy = GEval(
    name="Technical Accuracy",
    criteria="Evaluate if the response contains technically accurate information about the topic.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    evaluation_steps=[
        "Check if technical terms are used correctly",
        "Verify code examples are syntactically valid",
        "Ensure explanations are factually accurate",
        "Check for common misconceptions"
    ],
    threshold=0.7,
    model="gpt-4o"
)

# Use in test
test_case = LLMTestCase(
    input="Explain how HNSW indexing works",
    actual_output="HNSW builds a multi-layer graph where each layer..."
)
technical_accuracy.measure(test_case)
print(f"Technical Accuracy: {technical_accuracy.score}")

Batch Evaluation

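from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase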

# Create dataset
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(
            input="What is RAG?",
            actual_output="RAG is retrieval-augmented generation.",
            retrieval_context=["RAG combines retrieval with LLM generation."]
        ),
        LLMTestCase(
            input="How do embeddings work?",
            actual_output="Embeddings map text to dense vectors.",
            retrieval_context=["Embeddings are numerical representations of text."]
        ),
    ]
)

# Evaluate all at once
metrics = [FaithfulnessMetric(threshold=0.7), AnswerRelevancyMetric(threshold=0.7)]
results = evaluate(dataset, metrics)

# View results
for result in results.test_results:
    print(f"Input: {result.input}")
    for metric_result in result.metrics_data:
        print(f"  {metric_result.name}: {metric_result.score:.3f} ({'PASS' if metric_result.success else 'FAIL'})")
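
Per-metric pass rates are often more useful than a per-test-case listing. The sketch below aggregates them from records that mirror the `name`, `score`, and `success` fields read in the loop above; `MetricRecord` is a hypothetical stand-in so the example runs without calling an LLM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MetricRecord:
    """Hypothetical stand-in with the same fields as the metric results above."""
    name: str
    score: float
    success: bool


def pass_rates(records: list[MetricRecord]) -> dict[str, float]:
    """Return the fraction of passing results for each metric name."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.name] += 1
        passed[record.name] += int(record.success)
    return {name: passed[name] / totals[name] for name in totals}


records = [
    MetricRecord("Faithfulness", 0.91, True),
    MetricRecord("Faithfulness", 0.55, False),
    MetricRecord("Answer Relevancy", 0.88, True),
    MetricRecord("Answer Relevancy", 0.79, True),
]
rates = pass_rates(records)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} passed")
```

With real results, the records would be collected by flattening the metric results of every test result before calling `pass_rates`.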

Configuration

Model Configuration

from deepeval.metrics import FaithfulnessMetric
from deepeval.models import DeepEvalBaseLLM

# Use custom model
class CustomModel(DeepEvalBaseLLM):
    def __init__(self):
        self.model_name = "custom-llm"

    def load_model(self):
        # Load your model
        pass

    def generate(self, prompt: str) -> str:
        # Generate response
        return "model response"

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_name

# Use with metrics
metric = FaithfulnessMetric(model=CustomModel())
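
An `a_generate` that simply calls `generate` blocks the event loop while the model runs. One possible alternative, sketched here with a plain class that has the same method names (in real use it would subclass `DeepEvalBaseLLM`), hands the blocking call to a worker thread with `asyncio.to_thread` (Python 3.9+). `ThreadedModel` and its stubbed `generate` are hypothetical.

```python
import asyncio
import time


class ThreadedModel:
    """Stub model whose async method does not block the event loop."""

    def __init__(self, model_name: str = "custom-llm"):
        self.model_name = model_name

    def load_model(self):
        return None  # nothing to load in this stub

    def generate(self, prompt: str) -> str:
        time.sleep(0.05)  # stands in for a blocking model call
        return f"response to: {prompt}"

    async def a_generate(self, prompt: str) -> str:
        # Run the blocking call in a worker thread so other coroutines keep running.
        return await asyncio.to_thread(self.generate, prompt)

    def get_model_name(self) -> str:
        return self.model_name


async def main() -> list[str]:
    model = ThreadedModel()
    # The three calls overlap instead of running one after another.
    return await asyncio.gather(*(model.a_generate(p) for p in ["a", "b", "c"]))


responses = asyncio.run(main())
print(responses)
```

If the underlying client already offers a native async API, calling that directly from `a_generate` is usually the simpler choice.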

Environment Variables

export OPENAI_API_KEY=sk-...
export CONFIDENT_API_KEY=your-confident-ai-key
export DEEPEVAL_RESULTS_FOLDER=./eval_results
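
A missing key usually surfaces only when the first metric calls the judge model. A small pre-flight check, sketched below, can fail faster; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the list of required variables is an assumption that depends on which judge model is used.

```python
import os
from typing import Mapping, Sequence

# Assumption: the default OpenAI-backed judge is used, so only this key is required.
REQUIRED_VARS = ("OPENAI_API_KEY",)


def missing_env_vars(
    required: Sequence[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


# Explicit mappings are passed here so the demo does not depend on the real environment.
demo_missing = missing_env_vars(env={"OPENAI_API_KEY": ""})
demo_ok = missing_env_vars(env={"OPENAI_API_KEY": "sk-test"})
print(demo_missing, demo_ok)
```

Calling `missing_env_vars()` with no arguments at the top of a test session checks the real environment, and a non-empty result can be turned into an early, readable failure.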

Advanced Usage

Conversational Evaluation

from deepeval.test_case import ConversationalTestCase, LLMTestCase

convo_test = ConversationalTestCase(
    turns=[
        LLMTestCase(input="What is Python?", actual_output="Python is a programming language."),
        LLMTestCase(input="What frameworks does it have?", actual_output="Popular frameworks include Django, Flask, and FastAPI."),
        LLMTestCase(input="Which is best for APIs?", actual_output="FastAPI is excellent for building APIs due to its speed and type hints."),
    ]
)

Red Teaming

from deepeval.red_teaming import RedTeamer

red_teamer = RedTeamer(model="gpt-4o")

# Generate adversarial inputs
attacks = red_teamer.generate(
    purpose="Customer support chatbot",
    system_prompt="You are a helpful customer service agent.",
    attacks_per_category=5,
    categories=["jailbreaking", "prompt_injection", "toxicity"]
)

# Evaluate vulnerability
results = red_teamer.evaluate(attacks)
print(f"Vulnerability score: {results.vulnerability_score}")

CI/CD Integration

# .github/workflows/llm-eval.yml
name: LLM Evaluation
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install deepeval
      - run: deepeval test run tests/test_llm.py --verbose
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Troubleshooting

Issue	Solution
OpenAI rate limit errors	Reduce parallel evaluations; check that the `OPENAI_API_KEY` account has sufficient quota
Metric score always 0 or 1	Check test case fields are populated correctly
Faithfulness requires retrieval_context	Add `retrieval_context` to `LLMTestCase`
Slow evaluation	Use `gpt-4o-mini` for metrics, batch test cases
Import errors	Upgrade `deepeval` to the latest version: `pip install -U deepeval`
Custom model not working	Implement both `generate` and `a_generate` methods
Results not showing on Confident AI	Run `deepeval login` and check API key
pytest not discovering tests	Name files `test_*.py`, functions `test_*`
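
For the rate-limit row above, one common mitigation is to retry with exponential backoff and jitter. The sketch below is generic and not a DeepEval feature: `RateLimitError` is a hypothetical stand-in for whatever exception the model provider raises, and `flaky_measure` simulates a call that succeeds on the third attempt.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5, base_delay: float = 0.01) -> T:
    """Retry `call` on RateLimitError, doubling the delay after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Exponential delay plus a little jitter to avoid synchronized retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_measure() -> float:
    """Simulated metric call that is rate limited twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("slow down")
    return 0.85


score = with_backoff(flaky_measure)
print(score, attempts["count"])
```

In practice `base_delay` would be on the order of seconds; it is kept tiny here so the example finishes quickly.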

# Debug metric computation
python -c "
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
m = FaithfulnessMetric(threshold=0.5, include_reason=True)
tc = LLMTestCase(input='test', actual_output='test output', retrieval_context=['context'])
m.measure(tc)
print(f'Score: {m.score}, Reason: {m.reason}')
"