Arize Phoenix - LLM Observability & Evaluation Cheatsheet

Arize Phoenix is an open-source observability and evaluation platform for LLM applications and agents. It is OpenTelemetry-native: instrumentation emits standard OTLP traces that Phoenix collects, so you can inspect every prompt, model call, tool call, and retrieval step and then run evaluations over them. It runs locally (in a notebook, a container, or a self-hosted server) and integrates with common frameworks through OpenInference instrumentors.

Installation

Method	Command
pip	`pip install arize-phoenix`
Run a local server	`phoenix serve` (UI at `http://localhost:6006`)
Docker	`docker run -p 6006:6006 arizephoenix/phoenix:latest`
In a notebook	`import phoenix as px; px.launch_app()`
Instrumentation extras	`pip install openinference-instrumentation-openai` (and others)

Tracing Quickstart

from phoenix.otel import register

# Configure the tracer provider and point it at a Phoenix collector
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Auto-instrument a library (example: OpenAI)
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# ...now your LLM calls show up as traces in Phoenix

Concept	Meaning
Span	A single operation (LLM call, tool call, retriever step)
Trace	A tree of spans for one request
Project	A named collection of traces
OpenInference	The semantic conventions Phoenix uses on top of OTLP
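
To make the span/trace relationship concrete, here is a minimal standard-library sketch of a trace as a tree of spans. The `Span` dataclass, `build_trace`, and `render` are hypothetical names for illustration only; they are not OpenTelemetry or Phoenix types, and real spans carry far more fields (timestamps, attributes, status).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One operation in a request (illustrative model, not the OTel SDK type)."""
    span_id: str
    name: str
    kind: str  # e.g. "CHAIN", "RETRIEVER", "LLM"
    parent_id: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def build_trace(spans: List[Span]) -> Span:
    """Link spans by parent_id and return the root span of the trace."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return an indented outline of the span tree, one line per span."""
    lines = [f"{'  ' * depth}{span.name} [{span.kind}]"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One request: a chain that first retrieves documents, then calls the model
spans = [
    Span("1", "answer_question", "CHAIN"),
    Span("2", "retrieve", "RETRIEVER", parent_id="1"),
    Span("3", "chat_completion", "LLM", parent_id="1"),
]
outline = render(build_trace(spans))
print("\n".join(outline))
```

The Phoenix UI shows roughly this kind of nesting for each trace, with the span kind deciding how each node is displayed.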

Framework Instrumentation

Library	Instrumentor package
OpenAI	`openinference-instrumentation-openai`
LangChain	`openinference-instrumentation-langchain`
LlamaIndex	`openinference-instrumentation-llama-index`
Anthropic	`openinference-instrumentation-anthropic`
DSPy	`openinference-instrumentation-dspy`
Auto (many)	enable via `register(auto_instrument=True)`

Manual Spans

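from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `query` and `retriever` are placeholders for your own input and retrieval function
with tracer.start_as_current_span("retrieve") as span:
    span.set_attribute("openinference.span.kind", "RETRIEVER")
    span.set_attribute("input.value", query)
    docs = retriever(query)
    # OpenInference flattens retrieved documents under retrieval.documents.<i>
    for i, doc in enumerate(docs):
        span.set_attribute(f"retrieval.documents.{i}.document.content", str(doc))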

Evaluation

Phoenix can score traces (most often with LLM-as-judge evals) and attach the results to them.

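from phoenix.evals import llm_classify, OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE

# spans_df is a pandas DataFrame with the columns the template references:
# "input", "reference", and "output"
results = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o"),  # reads OPENAI_API_KEY from the environment
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["factual", "hallucinated"],  # the only labels the judge may return
)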

Eval	Measures
Hallucination	Is the answer grounded in the provided context?
QA correctness	Is the answer correct against a reference?
Relevance	Are the retrieved documents relevant to the query?
Toxicity	Unsafe or toxic content
Custom	Your own LLM-as-judge or code-based template
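
The `rails` argument restricts the judge to a fixed label set. The sketch below illustrates that idea in plain Python; it is not Phoenix's implementation, and `snap_to_rails`, `label_rates`, and the `NOT_PARSABLE` sentinel are hypothetical names chosen for this example. It assumes a reply counts only when it mentions exactly one allowed label.

```python
from collections import Counter
from typing import Dict, List

NOT_PARSABLE = "NOT_PARSABLE"  # hypothetical sentinel for unusable judge output


def snap_to_rails(raw_output: str, rails: List[str]) -> str:
    """Map a judge's free-text reply onto exactly one allowed label."""
    text = raw_output.strip().lower()
    matches = [rail for rail in rails if rail.lower() in text]
    # Zero matches or several matches are both ambiguous, so reject them
    return matches[0] if len(matches) == 1 else NOT_PARSABLE


def label_rates(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of rows that received each label."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


rails = ["factual", "hallucinated"]
raw_replies = ["factual", "Hallucinated.", "I am not sure", "factual or hallucinated"]
labels = [snap_to_rails(reply, rails) for reply in raw_replies]
rates = label_rates(labels)
print(labels)
print(rates)
```

Tracking the share of unparsable replies alongside the hallucination rate is a cheap way to notice when a judge prompt or model stops following the label format.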

Datasets & Experiments

Capability	Use
Datasets	Curate example inputs/outputs from traces
Experiments	Run a task over a dataset and score the results
`px.Client().upload_dataset(...)`	Create a dataset programmatically
Compare runs	Track quality across prompt/model changes
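
The dataset/experiment workflow can be pictured as a loop: run a task on every example, score each output, and compare the aggregate across variants. The following is a standard-library sketch of that loop under stated assumptions, not the Phoenix experiments API; `run_experiment`, `exact_match`, and the two stub tasks are hypothetical stand-ins for real prompt or model variants.

```python
from statistics import mean
from typing import Callable, Dict, List

Example = Dict[str, str]  # {"input": ..., "expected": ...}


def exact_match(output: str, expected: str) -> float:
    """Code-based evaluator: 1.0 when the output equals the reference."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    dataset: List[Example],
    task: Callable[[str], str],
    evaluator: Callable[[str, str], float],
) -> Dict[str, object]:
    """Run the task on every example and score each output."""
    rows = []
    for example in dataset:
        output = task(example["input"])
        score = evaluator(output, example["expected"])
        rows.append({"input": example["input"], "output": output, "score": score})
    scores = [row["score"] for row in rows]
    return {"rows": rows, "mean_score": mean(scores) if scores else 0.0}


# Stub tasks standing in for two prompt/model variants
def baseline_task(question: str) -> str:
    return {"capital of France?": "Paris"}.get(question, "unknown")


def candidate_task(question: str) -> str:
    answers = {"capital of France?": "Paris", "capital of Japan?": "Tokyo"}
    return answers.get(question, "unknown")


dataset = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "capital of Japan?", "expected": "Tokyo"},
]
baseline = run_experiment(dataset, baseline_task, exact_match)
candidate = run_experiment(dataset, candidate_task, exact_match)
print(baseline["mean_score"], candidate["mean_score"])
```

Keeping the dataset and evaluator fixed while only the task changes is what makes the two mean scores comparable.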

Retrieval Debugging

Feature	Helps with
Span attributes for retrieved docs	See exactly what RAG retrieved
Embedding/UMAP views	Spot clusters and drift in embeddings
Eval on retrieval relevance	Quantify retrieval quality per query
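
Once each retrieved document has a relevance label (for example from a relevance eval), per-query retrieval quality reduces to standard ranking metrics. This is a small sketch of two common ones, precision@k and reciprocal rank; the function names and sample labels are illustrative assumptions, not Phoenix output.

```python
from typing import Dict, List


def precision_at_k(relevance: List[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance: List[bool]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


# Relevance labels per query, in retrieval order (illustrative data)
relevance_by_query: Dict[str, List[bool]] = {
    "q1": [True, False, True],
    "q2": [False, False, True],
}
scores = {
    query: {
        "precision@2": precision_at_k(labels, 2),
        "reciprocal_rank": reciprocal_rank(labels),
    }
    for query, labels in relevance_by_query.items()
}
print(scores)
```

A query like `q2`, where the only relevant document is ranked third, is the kind of case worth opening in the trace view to see what was retrieved ahead of it.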

Phoenix vs. Other LLM Observability Tools

Aspect	Arize Phoenix	Langfuse	MLflow
Standard	OpenTelemetry/OTLP	OTel + SDK	MLflow tracing
Focus	Tracing + eval + retrieval debugging	Tracing + prompts + eval	Full ML platform + LLM
Self-host	Yes	Yes	Yes
Best for	OTel-native eval & RAG debugging	Product analytics + prompts	Teams already on MLflow
