Langfuse Cheat Sheet

Overview

Langfuse is an open-source LLM engineering platform for observing AI applications. It captures LLM calls, chain steps, and agent actions as structured traces, and lets you score outputs, manage prompts centrally, build evaluation datasets, and track token costs through a web UI and SDKs.

Key concepts: Traces represent a single end-to-end request. Spans are sub-steps within a trace (retrieval, LLM calls, tool use). Generations are LLM-specific spans with model/token metadata. Scores are human or automated quality signals attached to traces or generations.
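
The hierarchy described above can be pictured with a small data model. This is only an illustrative sketch using plain dataclasses; the class and field names (ToyTrace, ToySpan, ToyGeneration, ToyScore) are assumptions made for this example and are not the Langfuse schema or SDK types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyScore:
    """A quality signal (human or automated)."""
    name: str
    value: float


@dataclass
class ToySpan:
    """A sub-step within a trace (retrieval, tool use, ...)."""
    name: str
    children: list["ToySpan"] = field(default_factory=list)
    scores: list[ToyScore] = field(default_factory=list)


@dataclass
class ToyGeneration(ToySpan):
    """An LLM-specific span that also carries model and token metadata."""
    model: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class ToyTrace:
    """One end-to-end request; owns the top-level spans."""
    name: str
    spans: list[ToySpan] = field(default_factory=list)
    scores: list[ToyScore] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Walk the span tree and sum token counts of generations only
        def walk(span: ToySpan) -> int:
            own = 0
            if isinstance(span, ToyGeneration):
                own = span.input_tokens + span.output_tokens
            return own + sum(walk(child) for child in span.children)

        return sum(walk(span) for span in self.spans)


trace = ToyTrace(name="rag-pipeline")
trace.spans.append(ToySpan(name="vector-retrieval"))
trace.spans.append(
    ToyGeneration(name="answer-generation", model="gpt-4o-mini",
                  input_tokens=120, output_tokens=30)
)
trace.scores.append(ToyScore(name="response-length", value=30))
```

Modelling a generation as a specialised span mirrors the description above: every generation is a span, but only generations contribute token counts.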

Langfuse can be used through direct SDK calls, decorators, or integrations for LangChain, LlamaIndex, OpenAI, and more.

Installation

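# Python SDK
# The examples in this sheet use the v2-style API (langfuse.trace, langfuse.decorators).
# If a newer major version no longer provides these imports, pin the older one: pip install "langfuse<3"
pip install langfuse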

# With OpenAI integration
pip install langfuse openai

# With LangChain integration
pip install langfuse langchain langchain-openai

# With LlamaIndex integration
pip install langfuse llama-index llama-index-callbacks-langfuse

# Node.js / TypeScript SDK
npm install langfuse
# or
yarn add langfuse

Configuration

import os
from langfuse import Langfuse

# Set environment variables (recommended)
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"   # or self-hosted URL

# Create client
langfuse = Langfuse()

# Explicit configuration
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",
    debug=False,
    flush_at=15,        # batch size before auto-flush
    flush_interval=0.5, # seconds between auto-flushes
)

# Self-hosted via Docker Compose
# LANGFUSE_HOST=http://localhost:3000

# Self-hosted quickstart
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
# Access at http://localhost:3000
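
Before creating the client, it can help to check that the environment variables have the expected shape. The helper below is a hypothetical sketch (check_langfuse_env is not part of the SDK); it only checks the key prefixes shown above and that the host includes a scheme. It does not contact the server, so langfuse.auth_check() is still the way to verify that the credentials are valid.

```python
import os


def check_langfuse_env(env=None):
    """Return a list of problems; an empty list means the settings look plausible."""
    env = os.environ if env is None else env
    problems = []

    public_key = env.get("LANGFUSE_PUBLIC_KEY", "")
    secret_key = env.get("LANGFUSE_SECRET_KEY", "")
    host = env.get("LANGFUSE_HOST", "")

    if not public_key.startswith("pk-lf-"):
        problems.append("LANGFUSE_PUBLIC_KEY is missing or does not start with 'pk-lf-'")
    if not secret_key.startswith("sk-lf-"):
        problems.append("LANGFUSE_SECRET_KEY is missing or does not start with 'sk-lf-'")
    # The host is optional here, but when set it should be a full URL
    if host and not host.startswith(("http://", "https://")):
        problems.append("LANGFUSE_HOST should include the scheme, e.g. http://localhost:3000")

    return problems
```

Calling check_langfuse_env() with no argument inspects os.environ; passing a plain dict makes the helper easy to test.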

Core API Reference

API	Description
`langfuse.trace(name, ...)`	Start a new root trace
`trace.span(name, ...)`	Add a child span to the trace
`trace.generation(name, ...)`	Add an LLM generation span
`trace.score(name, value, ...)`	Attach a score to the trace
`span.update(...)`	Update span metadata after creation
`span.end(output=...)`	End a span and set output
`generation.update(usage=...)`	Update token usage
`langfuse.score(trace_id, name, value)`	Score any trace by ID
`langfuse.get_prompt(name)`	Fetch a managed prompt
`langfuse.create_dataset(name)`	Create an evaluation dataset
`langfuse.flush()`	Force-flush pending events
`@observe()`	Decorator for auto-tracing functions
`langfuse.auth_check()`	Verify credentials are valid

Advanced Usage

Manual Tracing

from langfuse import Langfuse
import openai

langfuse = Langfuse()
client = openai.OpenAI()

def rag_pipeline(user_question: str, session_id: str):
    # Create root trace
    trace = langfuse.trace(
        name="rag-pipeline",
        input={"question": user_question},
        user_id="user-123",
        session_id=session_id,
        tags=["production", "v2"],
        metadata={"environment": "prod"},
    )

    # Retrieval span
    retrieval_span = trace.span(
        name="vector-retrieval",
        input={"query": user_question},
    )
    passages = retrieve_passages(user_question)  # your own retrieval function (not defined in this sheet)
    retrieval_span.end(
        output={"passages": passages, "count": len(passages)},
    )

    # LLM generation span
    # Build the messages once so the same list is logged and sent to the model
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Context: {passages}\n\nQuestion: {user_question}"},
    ]
    generation = trace.generation(
        name="answer-generation",
        model="gpt-4o-mini",
        input=messages,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    answer = response.choices[0].message.content

    generation.end(
        output=answer,
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
        },
        model="gpt-4o-mini",
    )

    # Update trace with final output
    trace.update(output={"answer": answer})

    # Auto-score using heuristic
    langfuse.score(
        trace_id=trace.id,
        name="response-length",
        value=len(answer.split()),
        data_type="NUMERIC",
    )

    langfuse.flush()
    return answer, trace.id
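
The pipeline above calls retrieve_passages, which this sheet does not define. To run the example end to end, a stand-in along these lines may be enough. It is a hypothetical sketch: the in-memory corpus and the word-overlap ranking are assumptions for illustration, not vector search and not part of Langfuse.

```python
import re

# Hypothetical in-memory corpus; replace with a real vector store in practice
_CORPUS = [
    "Langfuse traces represent a single end-to-end request.",
    "Spans are sub-steps within a trace, such as retrieval or tool use.",
    "Paris is the capital of France.",
]


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words of the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_passages(query: str, top_k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query and drop non-matching ones."""
    query_tokens = _tokens(query)
    scored = [(len(query_tokens & _tokens(passage)), passage) for passage in _CORPUS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for overlap, passage in scored[:top_k] if overlap > 0]
```

Because the function returns a plain list of strings, its result can be logged directly as the output of the retrieval span.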

Decorator-based Tracing (`@observe`)

from langfuse.decorators import observe, langfuse_context
import openai

client = openai.OpenAI()

@observe(name="document-retriever")
def retrieve_docs(query: str) -> list[str]:
    # This function is automatically traced as a span
    langfuse_context.update_current_observation(
        metadata={"retriever": "faiss", "top_k": 5}
    )
    return ["doc1...", "doc2..."]

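# as_type="generation" marks this observation as a generation so usage and model are recorded
@observe(name="llm-call", as_type="generation")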
def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Report token usage to Langfuse
    langfuse_context.update_current_observation(
        usage={"input": response.usage.prompt_tokens,
               "output": response.usage.completion_tokens},
        model="gpt-4o-mini",
    )
    return response.choices[0].message.content

@observe()
def rag_pipeline(question: str) -> str:
    # Outer trace automatically created; inner functions become spans
    langfuse_context.update_current_trace(
        name="rag-pipeline",
        user_id="user-42",
        tags=["rag"],
    )
    docs = retrieve_docs(question)
    prompt = f"Context: {docs}\n\nAnswer: {question}"
    answer = call_llm(prompt)
    return answer

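result = rag_pipeline("What is quantum computing?")

# In scripts and other short-lived processes, flush queued events before exit
langfuse_context.flush()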

OpenAI Drop-in Integration

# Replace `import openai` with langfuse-wrapped version
from langfuse.openai import openai

# All calls are automatically traced — no other changes needed
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
    # Optional Langfuse metadata as extra kwargs:
    name="my-generation",
    session_id="sess-abc",
    user_id="user-123",
    tags=["chatbot"],
    metadata={"feature": "onboarding"},
)

LangChain Integration

from langfuse.callback import CallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# One-line integration via callback handler
handler = CallbackHandler(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    session_id="sess-123",
    user_id="user-42",
    tags=["langchain"],
)

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Tell me about {topic}")
chain = prompt | llm

# Pass handler in config
result = chain.invoke(
    {"topic": "black holes"},
    config={"callbacks": [handler]},
)
handler.flush()

Prompt Management

from langfuse import Langfuse

langfuse = Langfuse()

# Fetch a managed prompt (the fetched prompt is cached client-side)
prompt = langfuse.get_prompt("my-rag-prompt")

# Compile with variables
compiled = prompt.compile(topic="quantum computing", language="English")

# Use the prompt text
print(compiled)

# Create a prompt via SDK
langfuse.create_prompt(
    name="summarizer",
    prompt="Summarize the following in {{language}}: {{text}}",
    labels=["production"],
    config={"model": "gpt-4o-mini", "temperature": 0.3},
)

# Fetch by label (to pin a specific version instead, pass version=<int>)
prod_prompt = langfuse.get_prompt("summarizer", label="production")
staging_prompt = langfuse.get_prompt("summarizer", label="staging")
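
The {{variable}} placeholders used above can be understood with a few lines of plain Python. This is a sketch of the idea, not the SDK's implementation: compile_template is a hypothetical helper, and leaving unknown placeholders untouched is a choice made for this example.

```python
import re

# Matches {{name}} with optional whitespace inside the braces
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template: str, **variables) -> str:
    """Replace {{name}} placeholders; names without a value are left as-is."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)

    return _PLACEHOLDER.sub(substitute, template)


text = compile_template(
    "Summarize the following in {{language}}: {{text}}",
    language="English",
    text="Langfuse traces LLM calls.",
)
```

Leaving missing variables visible in the output makes an incomplete compile call easy to spot when reviewing traces.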

Datasets and Evaluation

from langfuse import Langfuse

langfuse = Langfuse()

# Create dataset
dataset = langfuse.create_dataset(name="qa-evaluation-set")

# Add items
langfuse.create_dataset_item(
    dataset_name="qa-evaluation-set",
    input={"question": "What is the capital of France?"},
    expected_output={"answer": "Paris"},
    metadata={"source": "trivia"},
)

# Run experiment on dataset
# my_rag_pipeline is your own application function (not defined in this sheet)
def run_my_pipeline(input_data):
    return {"answer": my_rag_pipeline(input_data["question"])}

def exact_match(output, expected):
    return 1.0 if output["answer"].lower() == expected["answer"].lower() else 0.0

items = langfuse.get_dataset("qa-evaluation-set").items
run_name = "experiment-v2-2024"

for item in items:
    # Run the pipeline inside item.observe so its trace is linked to this dataset run
    # (assumes the pipeline itself is instrumented, e.g. with @observe)
    with item.observe(run_name=run_name) as trace_id:
        output = run_my_pipeline(item.input)
        score = exact_match(output, item.expected_output)
        langfuse.score(
            trace_id=trace_id,
            name="exact-match",
            value=score,
        )

langfuse.flush()
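
Strict string equality can be brittle when outputs differ only in case, punctuation, or spacing. The sketch below shows a more forgiving evaluator and a per-run summary. The helper names (normalize, normalized_match, summarize_run) are hypothetical and use only the standard library; they are not Langfuse functions.

```python
import string
from statistics import mean


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def normalized_match(output: dict, expected: dict) -> float:
    """1.0 when the answers agree after normalization, else 0.0."""
    return 1.0 if normalize(output["answer"]) == normalize(expected["answer"]) else 0.0


def summarize_run(scores: list[float]) -> dict:
    """Aggregate per-item scores into a small run summary."""
    return {"n": len(scores), "accuracy": mean(scores) if scores else 0.0}
```

normalized_match has the same signature as exact_match above, so it can be swapped into the experiment loop without other changes.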

Common Workflows

Workflow 1: Score Existing Traces Programmatically

# Fetch recent traces
traces = langfuse.fetch_traces(limit=50, tags=["production"]).data

for trace in traces:
    # Automated scoring (e.g., response length heuristic)
    if trace.output and "answer" in trace.output:
        word_count = len(trace.output["answer"].split())
        langfuse.score(
            trace_id=trace.id,
            name="word-count",
            value=word_count,
            data_type="NUMERIC",
        )

langfuse.flush()

Workflow 2: Cost Tracking by Feature

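from langfuse.decorators import observe, langfuse_context

@observe()
def feature_a(prompt: str) -> str:
    langfuse_context.update_current_trace(tags=["feature-a"])
    # Instrumented LLM calls made inside this function are attached to the trace
    result = call_llm(prompt)  # call_llm as defined in the decorator example above
    return result

@observe()
def feature_b(prompt: str) -> str:
    langfuse_context.update_current_trace(tags=["feature-b"])
    result = call_llm(prompt)
    return result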

# In Langfuse UI: filter by tag to see cost per feature
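
The same per-tag grouping can be reproduced offline once trace data has been exported. The sketch below assumes each exported record is a dict with a tags list and a cost_usd number; both field names are assumptions for this example and do not describe the Langfuse API response format.

```python
from collections import defaultdict


def cost_by_tag(traces: list[dict]) -> dict[str, float]:
    """Sum cost per tag; a trace with several tags counts toward each of them."""
    totals = defaultdict(float)
    for trace in traces:
        for tag in trace.get("tags", []):
            totals[tag] += trace.get("cost_usd", 0.0)
    return dict(totals)


# Hypothetical exported records
sample = [
    {"tags": ["feature-a"], "cost_usd": 0.25},
    {"tags": ["feature-b"], "cost_usd": 0.5},
    {"tags": ["feature-a", "production"], "cost_usd": 0.25},
]
totals = cost_by_tag(sample)
```

Because a trace is counted once per tag, the per-tag totals can add up to more than the overall spend when traces carry multiple tags.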

Tips and Best Practices

Use @observe decorators for the cleanest integration — they handle span nesting automatically and require minimal code changes.
Always call langfuse.flush() in scripts and short-lived processes; the SDK sends events asynchronously in batches, so anything still queued when the process exits is lost.
Tag traces meaningfully (environment, feature, model version) — tags are the primary filter in the UI and enable cost segmentation.
Set session_id to group a multi-turn conversation into one session for coherent UX analysis.
Use user_id to track per-user usage, cost, and quality trends.
Prompt management with version labels (production, staging) enables safe prompt rollouts without code deploys.
Dataset experiments should use unique run_name values so you can compare runs side-by-side in the UI.
Self-host with Docker Compose for data-residency requirements; the schema is identical to cloud.
LlamaIndex integration uses LlamaIndexCallbackHandler: register it once on Settings.callback_manager.
Score data type matters: use NUMERIC for continuous metrics, BOOLEAN for pass/fail, CATEGORICAL for labels.
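
For the data type tip above, a small helper can pick the type from the Python value being scored. infer_score_type is a hypothetical name and the mapping is a sketch of the rule of thumb in the tip, not SDK behaviour.

```python
def infer_score_type(value) -> str:
    """Map a Python value to NUMERIC, BOOLEAN, or CATEGORICAL."""
    # bool must be checked before int: in Python, bool is a subclass of int
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, (int, float)):
        return "NUMERIC"
    if isinstance(value, str):
        return "CATEGORICAL"
    raise TypeError(f"Unsupported score value type: {type(value).__name__}")
```

The returned string can be passed as data_type when scoring, as in the langfuse.score calls shown earlier in this sheet.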