Langfuse Cheat Sheet
Overview
Langfuse is an open-source LLM engineering platform that provides full observability into your AI application. It captures every LLM call, chain step, and agent action as structured traces, then lets you score outputs, manage prompts centrally, build evaluation datasets, and track token costs — all with a clean web UI and a flexible SDK.
Key concepts: Traces represent a single end-to-end request. Spans are sub-steps within a trace (retrieval, LLM calls, tool use). Generations are LLM-specific spans with model/token metadata. Scores are human or automated quality signals attached to traces or generations.
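The relationship between these concepts can be sketched with plain dataclasses. This is an illustrative model only, not Langfuse's actual classes: the class and field names (`observations`, `input_tokens`, `total_tokens`, and so on) are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Score:
    """A human or automated quality signal (hypothetical shape)."""
    name: str
    value: float


@dataclass
class Span:
    """A sub-step within a trace, e.g. retrieval or tool use."""
    name: str
    input: Any = None
    output: Any = None


@dataclass
class Generation(Span):
    """An LLM-specific span that also carries model and token metadata."""
    model: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class Trace:
    """One end-to-end request, holding its sub-steps and scores."""
    name: str
    observations: list = field(default_factory=list)
    scores: list = field(default_factory=list)

    def total_tokens(self) -> int:
        # Only generations carry token counts; plain spans are skipped.
        return sum(
            o.input_tokens + o.output_tokens
            for o in self.observations
            if isinstance(o, Generation)
        )


trace = Trace(name="rag-pipeline")
trace.observations.append(Span(name="vector-retrieval", input="q", output=["p1"]))
trace.observations.append(
    Generation(name="answer-generation", model="gpt-4o-mini",
               input_tokens=120, output_tokens=30)
)
trace.scores.append(Score(name="helpfulness", value=1.0))
```

Modeling a generation as a specialised span reflects why token and cost reporting only applies to LLM calls, while every step can still carry input and output.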
Langfuse works with any LLM stack via direct SDK, decorators, or auto-instrumentation for LangChain, LlamaIndex, OpenAI, and more.
Installation
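# Python SDK (this sheet uses the v2-style API, e.g. langfuse.trace() and langfuse.decorators; pin with "langfuse<3" if a newer major version lacks them)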
pip install langfuse
# With OpenAI integration
pip install langfuse openai
# With LangChain integration
pip install langfuse langchain langchain-openai
# With LlamaIndex integration
pip install langfuse llama-index llama-index-callbacks-langfuse
# Node.js / TypeScript SDK
npm install langfuse
# or
yarn add langfuse
Configuration
import os
from langfuse import Langfuse
# Set environment variables (recommended)
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # or self-hosted URL
# Create client
langfuse = Langfuse()
# Explicit configuration
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",
    debug=False,
    flush_at=15,         # batch size before auto-flush
    flush_interval=0.5,  # seconds between auto-flushes
)
# Self-hosted via Docker Compose
# LANGFUSE_HOST=http://localhost:3000
# Self-hosted quickstart
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
# Access at http://localhost:3000
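Because the client reads its credentials from the environment, a small pre-flight check can catch a missing key before any tracing code runs. The sketch below is a hypothetical helper, not part of the SDK: the function names are made up, and falling back to the cloud URL when `LANGFUSE_HOST` is unset is an assumption taken from the configuration shown above.

```python
import os

# Variables this cheat sheet sets before calling Langfuse()
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_vars(env=None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def resolve_host(env=None, default="https://cloud.langfuse.com") -> str:
    """Pick the self-hosted URL if configured, else the assumed cloud default."""
    env = os.environ if env is None else env
    return env.get("LANGFUSE_HOST") or default


if __name__ == "__main__":
    missing = missing_langfuse_vars()
    if missing:
        print("Missing configuration:", ", ".join(missing))
    else:
        print("Using host:", resolve_host())
```

Once the variables are present, `langfuse.auth_check()` from the API table is the way to confirm the keys are actually valid.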
Core API Reference
| API | Description |
|---|---|
| `langfuse.trace(name, ...)` | Start a new root trace |
| `trace.span(name, ...)` | Add a child span to the trace |
| `trace.generation(name, ...)` | Add an LLM generation span |
| `trace.score(name, value, ...)` | Attach a score to the trace |
| `span.update(...)` | Update span metadata after creation |
| `span.end(output=...)` | End a span and set output |
| `generation.update(usage=...)` | Update token usage |
| `langfuse.score(trace_id, name, value)` | Score any trace by ID |
| `langfuse.get_prompt(name)` | Fetch a managed prompt |
| `langfuse.create_dataset(name)` | Create an evaluation dataset |
| `langfuse.flush()` | Force-flush pending events |
| `@observe()` | Decorator for auto-tracing functions |
| `langfuse.auth_check()` | Verify credentials are valid |
Advanced Usage
Manual Tracing
from langfuse import Langfuse
import openai
langfuse = Langfuse()
client = openai.OpenAI()
def retrieve_passages(query: str) -> list[str]:
    # Placeholder retriever; replace with your own vector store lookup
    return ["passage 1...", "passage 2..."]

def rag_pipeline(user_question: str, session_id: str):
    # Create root trace
    trace = langfuse.trace(
        name="rag-pipeline",
        input={"question": user_question},
        user_id="user-123",
        session_id=session_id,
        tags=["production", "v2"],
        metadata={"environment": "prod"},
    )
    # Retrieval span
    retrieval_span = trace.span(
        name="vector-retrieval",
        input={"query": user_question},
    )
    passages = retrieve_passages(user_question)
    retrieval_span.end(
        output={"passages": passages, "count": len(passages)},
    )
    # Build the messages once so the logged input matches what is sent to the model
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Context: {passages}\n\nQuestion: {user_question}"},
    ]
    # LLM generation span
    generation = trace.generation(
        name="answer-generation",
        model="gpt-4o-mini",
        input=messages,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    answer = response.choices[0].message.content
    generation.end(
        output=answer,
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
        },
        model="gpt-4o-mini",
    )
    # Update trace with final output
    trace.update(output={"answer": answer})
    # Auto-score using heuristic
    langfuse.score(
        trace_id=trace.id,
        name="response-length",
        value=len(answer.split()),
        data_type="NUMERIC",
    )
    langfuse.flush()
    return answer, trace.id
Decorator-based Tracing (@observe)
from langfuse.decorators import observe, langfuse_context
import openai
client = openai.OpenAI()
@observe(name="document-retriever")
def retrieve_docs(query: str) -> list[str]:
    # This function is automatically traced as a span
    langfuse_context.update_current_observation(
        metadata={"retriever": "faiss", "top_k": 5}
    )
    return ["doc1...", "doc2..."]

# as_type="generation" marks this observation as an LLM call so usage and model are recorded
@observe(name="llm-call", as_type="generation")
def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Report token usage to Langfuse
    langfuse_context.update_current_observation(
        usage={"input": response.usage.prompt_tokens,
               "output": response.usage.completion_tokens},
        model="gpt-4o-mini",
    )
    return response.choices[0].message.content

@observe()
def rag_pipeline(question: str) -> str:
    # Outer trace automatically created; inner functions become nested observations
    langfuse_context.update_current_trace(
        name="rag-pipeline",
        user_id="user-42",
        tags=["rag"],
    )
    docs = retrieve_docs(question)
    prompt = f"Context: {docs}\n\nAnswer: {question}"
    answer = call_llm(prompt)
    return answer

result = rag_pipeline("What is quantum computing?")
OpenAI Drop-in Integration
# Replace `import openai` with langfuse-wrapped version
from langfuse.openai import openai
# All calls are automatically traced — no other changes needed
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    # Optional Langfuse metadata as extra kwargs:
    name="my-generation",
    session_id="sess-abc",
    user_id="user-123",
    tags=["chatbot"],
    metadata={"feature": "onboarding"},
)
LangChain Integration
from langfuse.callback import CallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
# One-line integration via callback handler
handler = CallbackHandler(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    session_id="sess-123",
    user_id="user-42",
    tags=["langchain"],
)
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Tell me about {topic}")
chain = prompt | llm
# Pass handler in config
result = chain.invoke(
    {"topic": "black holes"},
    config={"callbacks": [handler]},
)
handler.flush()
Prompt Management
from langfuse import Langfuse
langfuse = Langfuse()
# Fetch a managed prompt (version is cached locally)
prompt = langfuse.get_prompt("my-rag-prompt")
# Compile with variables
compiled = prompt.compile(topic="quantum computing", language="English")
# Use the prompt text
print(compiled)
# Create a prompt via SDK
langfuse.create_prompt(
    name="summarizer",
    prompt="Summarize the following in {{language}}: {{text}}",
    labels=["production"],
    config={"model": "gpt-4o-mini", "temperature": 0.3},
)
# Fetch by label
prod_prompt = langfuse.get_prompt("summarizer", label="production")
staging_prompt = langfuse.get_prompt("summarizer", label="staging")
# Fetch a specific version number
v1_prompt = langfuse.get_prompt("summarizer", version=1)
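To make the `{{variable}}` placeholder syntax concrete, here is a minimal stand-alone sketch of what compiling a template involves. It is not Langfuse's implementation: `compile_template` and `template_variables` are hypothetical helpers, and leaving unknown placeholders untouched is an assumption made for this example.

```python
import re

# Matches {{name}} with optional whitespace inside the braces
_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template: str, **variables) -> str:
    """Substitute {{name}} placeholders; unknown names are left as-is."""
    def substitute(match):
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)
    return _VAR.sub(substitute, template)


def template_variables(template: str) -> list:
    """List the distinct placeholder names a template expects."""
    return sorted(set(_VAR.findall(template)))


template = "Summarize the following in {{language}}: {{text}}"
compiled = compile_template(template, language="English", text="Hello")
```

A helper like `template_variables` can be handy in tests to check that the calling code supplies every variable a prompt version expects before that version is promoted to `production`.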
Datasets and Evaluation
from langfuse import Langfuse
langfuse = Langfuse()
# Create dataset
dataset = langfuse.create_dataset(name="qa-evaluation-set")
# Add items
langfuse.create_dataset_item(
    dataset_name="qa-evaluation-set",
    input={"question": "What is the capital of France?"},
    expected_output={"answer": "Paris"},
    metadata={"source": "trivia"},
)
# Run experiment on dataset
def run_my_pipeline(input_data):
    # my_rag_pipeline is your own @observe()-decorated application function
    return {"answer": my_rag_pipeline(input_data["question"])}
def exact_match(output, expected):
    return 1.0 if output["answer"].lower() == expected["answer"].lower() else 0.0
items = langfuse.get_dataset("qa-evaluation-set").items
run_name = "experiment-v2-2024"
for item in items:
    # Run the pipeline inside item.observe so its trace is linked to the dataset item
    with item.observe(run_name=run_name) as trace_id:
        output = run_my_pipeline(item.input)
        score = exact_match(output, item.expected_output)
        langfuse.score(
            trace_id=trace_id,
            name="exact-match",
            value=score,
        )
langfuse.flush()
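Per-item scores are most useful once they are rolled up per run. The following sketch shows one way to do that locally with the same exact-match idea; it uses no Langfuse calls, and `summarize_run` and the sample `results` are hypothetical, made up for illustration.

```python
from statistics import mean


def exact_match(output: dict, expected: dict) -> float:
    """1.0 when the answers match ignoring case and surrounding whitespace."""
    same = output["answer"].strip().lower() == expected["answer"].strip().lower()
    return 1.0 if same else 0.0


def summarize_run(results: list) -> dict:
    """Aggregate (output, expected) pairs into a run-level summary."""
    scores = [exact_match(output, expected) for output, expected in results]
    return {"n": len(scores), "accuracy": mean(scores) if scores else 0.0}


# Made-up outputs paired with expected answers, for illustration only
results = [
    ({"answer": "Paris"}, {"answer": "paris"}),
    ({"answer": "Lyon"}, {"answer": "Paris"}),
]
summary = summarize_run(results)
print(summary)
```

Keeping the evaluator a pure function makes it easy to unit-test separately from the tracing code that reports its value.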
Common Workflows
Workflow 1: Score Traces for Human Evaluation
# Fetch recent traces
traces = langfuse.fetch_traces(limit=50, tags=["production"]).data
for trace in traces:
    # Automated scoring (e.g., response length heuristic)
    if trace.output and "answer" in trace.output:
        word_count = len(trace.output["answer"].split())
        langfuse.score(
            trace_id=trace.id,
            name="word-count",
            value=word_count,
            data_type="NUMERIC",
        )
langfuse.flush()
Workflow 2: Cost Tracking by Feature
from langfuse.decorators import observe, langfuse_context

@observe()
def feature_a(prompt: str) -> str:
    langfuse_context.update_current_trace(tags=["feature-a"])
    # Any traced LLM call works here, e.g. call_llm from the decorator example
    result = call_llm(prompt)
    return result

@observe()
def feature_b(prompt: str) -> str:
    langfuse_context.update_current_trace(tags=["feature-b"])
    result = call_llm(prompt)
    return result
# In Langfuse UI: filter by tag to see cost per feature
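The same per-tag breakdown can be approximated offline once trace records have been exported. The sketch below works on plain dictionaries; the `tags` and `total_cost` field names and the sample values are assumptions for illustration, so adjust them to whatever your export actually contains.

```python
from collections import defaultdict


def cost_by_tag(traces: list) -> dict:
    """Sum the cost of each trace under every tag it carries."""
    totals = defaultdict(float)
    for trace in traces:
        for tag in trace.get("tags", []):
            totals[tag] += trace.get("total_cost", 0.0)
    return dict(totals)


# Hypothetical exported records, not real data
sample = [
    {"tags": ["feature-a"], "total_cost": 0.25},
    {"tags": ["feature-b"], "total_cost": 0.5},
    {"tags": ["feature-a", "production"], "total_cost": 0.25},
]
print(cost_by_tag(sample))
```

Note that a trace with several tags is counted once under each tag, so the per-tag totals can add up to more than the overall spend.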
Tips and Best Practices
- Use `@observe` decorators for the cleanest integration: they handle span nesting automatically and require minimal code changes.
- Always call `langfuse.flush()` in scripts and short-lived processes; the SDK batches events asynchronously, and unflushed data is lost on exit.
- Tag traces meaningfully (environment, feature, model version); tags are the primary filter in the UI and enable cost segmentation.
- Set `session_id` to group a multi-turn conversation into one session for coherent UX analysis.
- Use `user_id` to track per-user usage, cost, and quality trends.
- Prompt management with version labels (`production`, `staging`) enables safe prompt rollouts without code deploys.
- Dataset experiments should use unique `run_name` values so you can compare runs side-by-side in the UI.
- Self-host with Docker Compose for data-residency requirements; the schema is identical to cloud.
- LlamaIndex integration uses `LlamaIndexCallbackHandler`: one line in your `Settings.callback_manager`.
- Score data type matters: use `NUMERIC` for continuous metrics, `BOOLEAN` for pass/fail, `CATEGORICAL` for labels.
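As a small illustration of the last tip, a helper can choose the `data_type` string from the Python type of the score value. `infer_score_data_type` is a hypothetical convenience function, not an SDK call, and how each type must be encoded in the `value` field should be checked against the Langfuse scores documentation.

```python
def infer_score_data_type(value) -> str:
    """Map a Python value to one of the three score data types named above."""
    # bool is a subclass of int, so it has to be checked first
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, (int, float)):
        return "NUMERIC"
    if isinstance(value, str):
        return "CATEGORICAL"
    raise TypeError(f"Unsupported score value type: {type(value).__name__}")


for example in (0.87, True, "relevant"):
    print(repr(example), infer_score_data_type(example))
```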