Promptfoo Cheat Sheet
Overview
Promptfoo is a developer-first tool for systematically testing LLM applications. You define test cases with expected assertions in a YAML config, then promptfoo eval runs them across one or more providers (OpenAI, Anthropic, Ollama, custom HTTP endpoints, etc.) and produces a scored report.
Key features: Provider-agnostic test runner, built-in assertion types (contains, regex, LLM-graded, JSON schema), model comparison (run the same tests against multiple models simultaneously), red-teaming (automated adversarial probe generation), and a web UI for browsing results.
Installation
# Global install (recommended for CLI use)
npm install -g promptfoo
# or run without installing
npx promptfoo@latest
# Verify installation
promptfoo --version
# Python package (for library use)
pip install promptfoo
Configuration
# promptfooconfig.yaml — minimal example
description: "My prompt evaluation"
providers:
- openai:gpt-4o-mini
- anthropic:messages:claude-3-5-haiku-20241022
prompts:
- "Answer the question concisely: {{question}}"
tests:
- vars:
question: "What is the capital of France?"
assert:
- type: contains
value: "Paris"
- type: icontains # case-insensitive
value: "paris"
# Environment variables for providers
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export AZURE_OPENAI_API_KEY=...
export AZURE_OPENAI_ENDPOINT=...
Core CLI Commands
| Command | Description |
|---|---|
promptfoo init | Scaffold a starter config in current directory |
promptfoo eval | Run evaluation from promptfooconfig.yaml |
promptfoo eval -c myconfig.yaml | Run with specific config file |
promptfoo eval --no-cache | Disable caching of LLM responses |
promptfoo eval --output results.json | Write results to JSON file |
promptfoo eval --output results.csv | Write results to CSV |
promptfoo view | Open web UI to browse latest results |
promptfoo view -y | Open web UI, auto-accept browser launch |
promptfoo share | Share results to promptfoo.app |
promptfoo redteam init | Scaffold a red-team config |
promptfoo redteam run | Run automated adversarial probes |
promptfoo redteam report | View red-team results in UI |
promptfoo cache clear | Clear the LLM response cache |
promptfoo list providers | List built-in provider IDs |
Assertion Types
| Type | Example Value | Description |
|---|---|---|
equals | "Paris" | Exact string match |
contains | "Paris" | Output contains substring |
icontains | "paris" | Case-insensitive contains |
not-contains | "error" | Output does NOT contain |
regex | "\\d{4}" | Matches regular expression |
starts-with | "The answer" | Output starts with value |
javascript | "output.length < 200" | Custom JS expression |
python | "len(output) < 200" | Custom Python expression |
json-schema | {type: object, ...} | Validates JSON structure |
is-json | — | Output is valid JSON |
llm-rubric | "Answer is factually correct" | LLM judge grades output |
model-graded-closedqa | "Says Paris is capital" | Closed-QA LLM grader |
similar | "Paris is the capital" | Semantic similarity check |
cost | 0.01 | API cost under threshold |
latency | 3000 | Latency under N ms |
Advanced Usage
Multi-Provider Comparison Config
# compare_models.yaml
description: "Compare GPT-4o vs Claude 3.5 vs Gemini"
providers:
- id: openai:gpt-4o
config:
temperature: 0.1
- id: anthropic:messages:claude-3-5-sonnet-20241022
config:
temperature: 0.1
- id: vertex:gemini-1.5-pro
prompts:
- file://prompts/system_prompt.txt
- |
You are a helpful assistant.
User question: {{question}}
Answer:
tests:
- vars:
question: "Explain transformer architecture in 3 sentences"
assert:
- type: llm-rubric
value: "Explains self-attention, encoder/decoder structure, and positional encoding"
- type: latency
threshold: 5000
- vars:
question: "Write a Python function to reverse a linked list"
assert:
- type: contains
value: "def "
- type: javascript
value: "output.includes('next') || output.includes('prev')"
- type: llm-rubric
value: "Provides correct and complete Python implementation"
Custom Provider (HTTP Endpoint)
providers:
- id: https
config:
url: "https://my-api.example.com/v1/chat"
method: POST
headers:
Authorization: "Bearer {{env.MY_API_KEY}}"
Content-Type: application/json
body:
model: "my-custom-model"
messages:
- role: user
content: "{{prompt}}"
responseParser: "json.choices[0].message.content"
JavaScript Assertions
tests:
- vars:
question: "List 5 planets"
assert:
# Custom JS in YAML
- type: javascript
value: |
const planets = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune'];
const found = planets.filter(p => output.includes(p));
return { pass: found.length >= 5, score: found.length / 5,
reason: `Found ${found.length}/5 planets` };
# External JS file
- type: javascript
value: file://assertions/validate_planets.js
// assertions/validate_planets.js
module.exports = (output, context) => {
const planets = ['Mercury','Venus','Earth','Mars','Jupiter','Saturn','Uranus','Neptune'];
const found = planets.filter(p => output.includes(p));
return {
pass: found.length >= 5,
score: found.length / 8,
reason: `Mentioned ${found.length} out of 8 planets`,
};
};
Python Assertions
tests:
- assert:
- type: python
value: file://assertions/check_json.py
# assertions/check_json.py
import json
def get_assert(output: str, context: dict) -> dict:
try:
data = json.loads(output)
has_name = "name" in data
has_score = isinstance(data.get("score"), (int, float))
return {
"pass": has_name and has_score,
"score": 1.0 if (has_name and has_score) else 0.0,
"reason": f"JSON valid: name={has_name}, score={has_score}",
}
except json.JSONDecodeError:
return {"pass": False, "score": 0.0, "reason": "Invalid JSON output"}
Red-Teaming Config
# redteam.yaml
description: "Red team my chatbot"
targets:
- id: openai:gpt-4o-mini
config:
systemPrompt: "You are a helpful customer service agent for ACME Corp."
redteam:
numTests: 50 # probes per plugin
plugins:
- jailbreak # jailbreak attempts
- harmful:hate # hate speech
- harmful:violence # violence
- pii # PII extraction
- prompt-injection # injections via user input
- hijacking # goal hijacking
- politics # political bias
strategies:
- jailbreak:composite
- multilingual # attacks in other languages
# Run red team
promptfoo redteam run -c redteam.yaml
# View results
promptfoo redteam report
CI/CD Integration
# .github/workflows/llm-eval.yml
name: LLM Evaluation
on: [push, pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
- run: npm install -g promptfoo
- name: Run evaluation
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
promptfoo eval -c promptfooconfig.yaml \
--output results.json \
--no-cache \
--exit-on-failure # non-zero exit if any test fails
- uses: actions/upload-artifact@v4
if: always()
with:
name: eval-results
path: results.json
# Exit codes
# 0 = all tests passed
# 1 = one or more tests failed
# Use --exit-on-failure to enforce CI gate
promptfoo eval --exit-on-failure
Programmatic API (Node.js)
import promptfoo from 'promptfoo';
const results = await promptfoo.evaluate({
providers: ['openai:gpt-4o-mini'],
prompts: ['Translate to French: {{text}}'],
tests: [
{
vars: { text: 'Hello, world!' },
assert: [{ type: 'icontains', value: 'Bonjour' }],
},
],
});
console.log(`Pass rate: ${results.stats.successes}/${results.stats.totalTests}`);
for (const result of results.results) {
console.log(result.vars, result.success, result.response?.output);
}
Common Workflows
Workflow 1: Regression Testing a Prompt Change
# 1. Baseline eval
promptfoo eval -c config_v1.yaml --output baseline.json
# 2. Update your prompt in config_v2.yaml
# 3. Eval new version
promptfoo eval -c config_v2.yaml --output v2.json
# 4. View comparison in web UI
promptfoo view
Workflow 2: Find the Best Model for Your Use Case
# model_selection.yaml
providers:
- openai:gpt-4o-mini # cheapest
- openai:gpt-4o # most capable
- anthropic:messages:claude-3-5-haiku-20241022
- ollama:llama3.2 # local / free
prompts:
- file://prompt.txt
tests:
- file://tests/golden_set.yaml
defaultTest:
assert:
- type: cost
threshold: 0.005 # < $0.005 per call
- type: latency
threshold: 2000 # < 2 seconds
promptfoo eval -c model_selection.yaml --output comparison.json
promptfoo view # compare side-by-side in UI
Tips and Best Practices
- Use
promptfoo initto get a working starter config with examples in seconds. - Cache responses (default) during iteration; use
--no-cacheonly in CI or when testing prompt changes. - Combine assertion types: use
containsfor fast checks andllm-rubricfor nuanced quality — the cheap assertions gate expensive LLM-graded ones. llm-rubricneeds a capable grader model — setdefaultTest.options.providerto GPT-4o or Claude 3.5 Sonnet for reliable grades.- Test edge cases explicitly: add tests for empty inputs, very long inputs, non-English text, and adversarial inputs.
--exit-on-failurein CI turns promptfoo into a quality gate — block merges when pass rate drops.- Version your test YAML in git alongside your prompt files so prompt and test evolution are linked.
- Red-teaming is separate from evaluation — run
promptfoo redteamperiodically (not on every PR) as it’s expensive. - Use
file://references in YAML to keep prompts and assertions in separate files for better maintainability. - Share results with
promptfoo shareto get a public URL for async review with stakeholders.