Promptfoo Cheat Sheet

Overview

Promptfoo is a developer-first tool for systematically testing LLM applications. You define test cases with expected assertions in a YAML config, then promptfoo eval runs them across one or more providers (OpenAI, Anthropic, Ollama, custom HTTP endpoints, etc.) and produces a scored report.

Key features: Provider-agnostic test runner, built-in assertion types (contains, regex, LLM-graded, JSON schema), model comparison (run the same tests against multiple models simultaneously), red-teaming (automated adversarial probe generation), and a web UI for browsing results.

Installation

# Global install (recommended for CLI use)
npm install -g promptfoo

# or run without installing
npx promptfoo@latest

# Verify installation
promptfoo --version

# Python package (for library use)
pip install promptfoo

Configuration

# promptfooconfig.yaml — minimal example
description: "My prompt evaluation"

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-haiku-20241022

prompts:
  - "Answer the question concisely: {{question}}"

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: icontains   # case-insensitive
        value: "paris"

# Environment variables for providers
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export AZURE_OPENAI_API_KEY=...
export AZURE_OPENAI_ENDPOINT=...

Core CLI Commands

Command	Description
`promptfoo init`	Scaffold a starter config in current directory
`promptfoo eval`	Run evaluation from `promptfooconfig.yaml`
`promptfoo eval -c myconfig.yaml`	Run with specific config file
`promptfoo eval --no-cache`	Disable caching of LLM responses
`promptfoo eval --output results.json`	Write results to JSON file
`promptfoo eval --output results.csv`	Write results to CSV
`promptfoo view`	Open web UI to browse latest results
`promptfoo view -y`	Open web UI, auto-accept browser launch
`promptfoo share`	Share results to promptfoo.app
`promptfoo redteam init`	Scaffold a red-team config
`promptfoo redteam run`	Run automated adversarial probes
`promptfoo redteam report`	View red-team results in UI
`promptfoo cache clear`	Clear the LLM response cache
`promptfoo list providers`	List built-in provider IDs

Assertion Types

Type	Example Value	Description
`equals`	`"Paris"`	Exact string match
`contains`	`"Paris"`	Output contains substring
`icontains`	`"paris"`	Case-insensitive contains
`not-contains`	`"error"`	Output does NOT contain
`regex`	`"\\d{4}"`	Matches regular expression
`starts-with`	`"The answer"`	Output starts with value
`javascript`	`"output.length < 200"`	Custom JS expression
`python`	`"len(output) < 200"`	Custom Python expression
`json-schema`	`{type: object, ...}`	Validates JSON structure
`is-json`	—	Output is valid JSON
`llm-rubric`	`"Answer is factually correct"`	LLM judge grades output
`model-graded-closedqa`	`"Says Paris is capital"`	Closed-QA LLM grader
`similar`	`"Paris is the capital"`	Semantic similarity check
`cost`	`0.01`	API cost under threshold
`latency`	`3000`	Latency under N ms

Advanced Usage

Multi-Provider Comparison Config

# compare_models.yaml
description: "Compare GPT-4o vs Claude 3.5 vs Gemini"

providers:
  - id: openai:gpt-4o
    config:
      temperature: 0.1
  - id: anthropic:messages:claude-3-5-sonnet-20241022
    config:
      temperature: 0.1
  - id: vertex:gemini-1.5-pro

prompts:
  - file://prompts/system_prompt.txt
  - |
    You are a helpful assistant.
    User question: {{question}}
    Answer:

tests:
  - vars:
      question: "Explain transformer architecture in 3 sentences"
    assert:
      - type: llm-rubric
        value: "Explains self-attention, encoder/decoder structure, and positional encoding"
      - type: latency
        threshold: 5000

  - vars:
      question: "Write a Python function to reverse a linked list"
    assert:
      - type: contains
        value: "def "
      - type: javascript
        value: "output.includes('next') || output.includes('prev')"
      - type: llm-rubric
        value: "Provides correct and complete Python implementation"

Custom Provider (HTTP Endpoint)

providers:
  - id: https
    config:
      url: "https://my-api.example.com/v1/chat"
      method: POST
      headers:
        Authorization: "Bearer {{env.MY_API_KEY}}"
        Content-Type: application/json
      body:
        model: "my-custom-model"
        messages:
          - role: user
            content: "{{prompt}}"
      responseParser: "json.choices[0].message.content"

JavaScript Assertions

tests:
  - vars:
      question: "List 5 planets"
    assert:
      # Custom JS in YAML
      - type: javascript
        value: |
          const planets = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune'];
          const found = planets.filter(p => output.includes(p));
          return { pass: found.length >= 5, score: found.length / 5,
                   reason: `Found ${found.length}/5 planets` };

      # External JS file
      - type: javascript
        value: file://assertions/validate_planets.js

// assertions/validate_planets.js
module.exports = (output, context) => {
  const planets = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune'];
  const found = planets.filter(p => output.includes(p));
  return {
    pass: found.length >= 5,
    score: found.length / 8,
    reason: `Mentioned ${found.length} out of 8 planets`,
  };
};

Python Assertions

tests:
  - assert:
      - type: python
        value: file://assertions/check_json.py

# assertions/check_json.py
import json

def get_assert(output: str, context: dict) -> dict:
    try:
        data = json.loads(output)
        has_name = "name" in data
        has_score = isinstance(data.get("score"), (int, float))
        return {
            "pass": has_name and has_score,
            "score": 1.0 if (has_name and has_score) else 0.0,
            "reason": f"JSON valid: name={has_name}, score={has_score}",
        }
    except json.JSONDecodeError:
        return {"pass": False, "score": 0.0, "reason": "Invalid JSON output"}

Red-Teaming Config

# redteam.yaml
description: "Red team my chatbot"

targets:
  - id: openai:gpt-4o-mini
    config:
      systemPrompt: "You are a helpful customer service agent for ACME Corp."

redteam:
  numTests: 50         # probes per plugin
  plugins:
    - jailbreak        # jailbreak attempts
    - harmful:hate     # hate speech
    - harmful:violence # violence
    - pii              # PII extraction
    - prompt-injection # injections via user input
    - hijacking        # goal hijacking
    - politics         # political bias
  strategies:
    - jailbreak:composite
    - multilingual     # attacks in other languages

# Run red team
promptfoo redteam run -c redteam.yaml

# View results
promptfoo redteam report

CI/CD Integration

# .github/workflows/llm-eval.yml
name: LLM Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install -g promptfoo
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          promptfoo eval -c promptfooconfig.yaml \
            --output results.json \
            --no-cache \
            --exit-on-failure    # non-zero exit if any test fails
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-results
          path: results.json

# Exit codes
# 0 = all tests passed
# 1 = one or more tests failed
# Use --exit-on-failure to enforce CI gate
promptfoo eval --exit-on-failure

Programmatic API (Node.js)

import promptfoo from 'promptfoo';

const results = await promptfoo.evaluate({
  providers: ['openai:gpt-4o-mini'],
  prompts: ['Translate to French: {{text}}'],
  tests: [
    {
      vars: { text: 'Hello, world!' },
      assert: [{ type: 'icontains', value: 'Bonjour' }],
    },
  ],
});

console.log(`Pass rate: ${results.stats.successes}/${results.stats.totalTests}`);
for (const result of results.results) {
  console.log(result.vars, result.success, result.response?.output);
}

Common Workflows

Workflow 1: Regression Testing a Prompt Change

# 1. Baseline eval
promptfoo eval -c config_v1.yaml --output baseline.json

# 2. Update your prompt in config_v2.yaml

# 3. Eval new version
promptfoo eval -c config_v2.yaml --output v2.json

# 4. View comparison in web UI
promptfoo view

Workflow 2: Find the Best Model for Your Use Case

# model_selection.yaml
providers:
  - openai:gpt-4o-mini       # cheapest
  - openai:gpt-4o             # most capable
  - anthropic:messages:claude-3-5-haiku-20241022
  - ollama:llama3.2           # local / free

prompts:
  - file://prompt.txt

tests:
  - file://tests/golden_set.yaml

defaultTest:
  assert:
    - type: cost
      threshold: 0.005   # < $0.005 per call
    - type: latency
      threshold: 2000    # < 2 seconds

promptfoo eval -c model_selection.yaml --output comparison.json
promptfoo view   # compare side-by-side in UI

Tips and Best Practices

Use promptfoo init to get a working starter config with examples in seconds.
Cache responses (default) during iteration; use --no-cache only in CI or when testing prompt changes.
Combine assertion types: use contains for fast checks and llm-rubric for nuanced quality — the cheap assertions gate expensive LLM-graded ones.
llm-rubric needs a capable grader model — set defaultTest.options.provider to GPT-4o or Claude 3.5 Sonnet for reliable grades.
Test edge cases explicitly: add tests for empty inputs, very long inputs, non-English text, and adversarial inputs.
--exit-on-failure in CI turns promptfoo into a quality gate — block merges when pass rate drops.
Version your test YAML in git alongside your prompt files so prompt and test evolution are linked.
Red-teaming is separate from evaluation — run promptfoo redteam periodically (not on every PR) as it’s expensive.
Use file:// references in YAML to keep prompts and assertions in separate files for better maintainability.
Share results with promptfoo share to get a public URL for async review with stakeholders.