Arize Phoenix - LLM Observability & Evaluation Cheatsheet

Arize Phoenix is an open-source observability and evaluation platform for LLM applications and agents. It is OpenTelemetry-native: instrumentation emits standard OTLP traces that Phoenix collects, so you can see every prompt, model call, tool invocation, and retrieval step, then run evaluations over them. It runs locally (notebook, container, or self-hosted server) and integrates with common frameworks through OpenInference instrumentors.

Installation

Method | Command
pip | pip install arize-phoenix
Run a local server | phoenix serve (UI at http://localhost:6006)
Docker | docker run -p 6006:6006 arizephoenix/phoenix:latest
In a notebook | import phoenix as px; px.launch_app()
Instrumentation extras | pip install openinference-instrumentation-openai (and others)

Tracing Quickstart

from phoenix.otel import register

# Configure the tracer provider and point it at a Phoenix collector
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Auto-instrument a library (example: OpenAI)
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# ...now your LLM calls show up as traces in Phoenix

Concept | Meaning
Span | One operation (LLM call, tool call, retriever step)
Trace | A tree of spans for one request
Project | A named collection of traces
OpenInference | The semantic conventions Phoenix uses over OTLP
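
To make the span/trace relationship concrete, here is a minimal standard-library sketch of how flat spans link into one trace tree through parent IDs. The `Span` class, `build_trace`, and `render` are hypothetical names for illustration only; this is not Phoenix's or OpenTelemetry's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record (hypothetical, not the OpenTelemetry class)."""
    span_id: str
    name: str
    parent_id: Optional[str] = None  # None marks the root span of a trace
    children: list["Span"] = field(default_factory=list)


def build_trace(spans: list[Span]) -> Span:
    """Link spans into a tree via parent_id and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one line per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One request: a root span with a retrieval step and an LLM call that uses a tool
spans = [
    Span("1", "chat_request"),
    Span("2", "retrieve", parent_id="1"),
    Span("3", "llm_call", parent_id="1"),
    Span("4", "tool_call", parent_id="3"),
]
trace_root = build_trace(spans)
print("\n".join(render(trace_root)))
```

In a real deployment the instrumentors create and link spans for you; the sketch only shows why a trace reads as a tree in the UI.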

Framework Instrumentation

Library | Instrumentor package
OpenAI | openinference-instrumentation-openai
LangChain | openinference-instrumentation-langchain
LlamaIndex | openinference-instrumentation-llama-index
Anthropic | openinference-instrumentation-anthropic
DSPy | openinference-instrumentation-dspy
Auto (many) | enable via register(auto_instrument=True)

Manual Spans

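from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `query` and `retriever` are placeholders for your own input and retrieval function
with tracer.start_as_current_span("retrieve") as span:
    span.set_attribute("openinference.span.kind", "RETRIEVER")
    span.set_attribute("input.value", query)
    docs = retriever(query)
    # OpenInference flattens retrieved documents into indexed attribute keys
    for i, doc in enumerate(docs):
        span.set_attribute(f"retrieval.documents.{i}.document.content", str(doc))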

Evaluation

Phoenix can score traces with evaluations, most commonly LLM-as-judge evals, and attach the results to the traced spans.

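# phoenix.evals is provided by the arize-phoenix-evals package
from phoenix.evals import llm_classify, OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE

# spans_df: a pandas DataFrame you supply, with the columns the template
# references (for the hallucination template: input, reference, output)
results = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o"),  # reads OPENAI_API_KEY from the environment
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["factual", "hallucinated"],  # the only labels the judge may return
)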

Eval | Measures
Hallucination | Is the answer grounded in the context?
QA correctness | Is the answer correct vs reference?
Relevance | Are retrieved docs relevant to the query?
Toxicity | Unsafe/toxic content
Custom | Your own LLM-as-judge or code template
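
The `rails` argument constrains a judge's free-form response to a fixed label set. The sketch below illustrates that idea in plain Python; it is a simplified assumption about the behavior, not Phoenix's actual parsing logic. The `snap_to_rails` helper and the "NOT_PARSABLE" fallback label used here are illustrative choices.

```python
from collections import Counter


def snap_to_rails(raw_output: str, rails: list[str],
                  unparsable: str = "NOT_PARSABLE") -> str:
    """Map a free-form judge response onto exactly one allowed label."""
    text = raw_output.strip().lower()
    matches = [rail for rail in rails if rail.lower() in text]
    # Zero matches or several matches are rejected rather than guessed
    return matches[0] if len(matches) == 1 else unparsable


rails = ["factual", "hallucinated"]
responses = [
    "factual",
    " Hallucinated.",
    "I think it is factual",
    "unsure",
    "factual or hallucinated",
]
labels = [snap_to_rails(r, rails) for r in responses]
counts = Counter(labels)

# Share of responses that were labeled as hallucinations
hallucination_rate = counts["hallucinated"] / len(labels)
print(labels, hallucination_rate)
```

Substring matching is deliberately naive: label sets where one label contains another (for example "relevant" inside "irrelevant") would match twice and fall through to the fallback, so a real parser needs stricter matching.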

Datasets & Experiments

Capability | Use
Datasets | Curate example inputs/outputs from traces
Experiments | Run a task over a dataset and score it
px.Client().upload_dataset(...) | Create a dataset programmatically
Compare runs | Track quality across prompt/model changes
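
The experiment workflow above (run a task over a dataset, score each output, compare runs) can be sketched without any Phoenix API. Everything in this block is hypothetical scaffolding: `run_experiment`, `exact_match`, the two toy tasks, and the dataset shape are assumptions chosen for illustration.

```python
from statistics import mean
from typing import Callable


def run_experiment(dataset: list[dict],
                   task: Callable[[str], str],
                   evaluator: Callable[[str, str], float]) -> list[dict]:
    """Run the task on every example and record a score per example."""
    rows = []
    for example in dataset:
        output = task(example["input"])
        rows.append({
            "input": example["input"],
            "output": output,
            "score": evaluator(output, example["expected"]),
        })
    return rows


def exact_match(output: str, expected: str) -> float:
    """Code-based evaluator: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Toy dataset; in practice examples would be curated from traces
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


# Two stand-in "app versions" to compare (lookup tables instead of LLM calls)
def task_v1(question: str) -> str:
    return {"2+2": "4"}.get(question, "unknown")


def task_v2(question: str) -> str:
    return {"2+2": "4", "capital of France": "paris"}.get(question, "unknown")


scores = {
    name: mean(row["score"] for row in run_experiment(dataset, task, exact_match))
    for name, task in [("v1", task_v1), ("v2", task_v2)]
}
print(scores)
```

Keeping the dataset and evaluator fixed while only the task changes is what makes the two mean scores comparable across prompt or model changes.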

Retrieval Debugging

Feature | Helps with
Span attributes for retrieved docs | See exactly what RAG fetched
Embedding/UMAP views | Spot clusters and drift in embeddings
Eval on retrieval relevance | Quantify retrieval quality per query
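
Once each retrieved document has a relevance label, per-query retrieval quality reduces to standard ranking metrics. The following is a minimal sketch that assumes relevance labels are already available as booleans in rank order; the function names and sample queries are illustrative and not part of Phoenix.

```python
def precision_at_k(relevance: list[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    If fewer than k documents were retrieved, the denominator is the
    number actually retrieved.
    """
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


def reciprocal_rank(relevance: list[bool]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


# Relevance labels per query, in the order the retriever returned the documents
per_query = {
    "how do I reset my password?": [True, False, True],
    "what is the refund policy?": [False, False, True],
    "who founded the company?": [False, False, False],
}

report = {
    query: {
        "precision@2": precision_at_k(labels, 2),
        "reciprocal_rank": reciprocal_rank(labels),
    }
    for query, labels in per_query.items()
}
for query, metrics in report.items():
    print(query, metrics)
```

Queries with a low reciprocal rank are good candidates for inspecting the retrieved-document span attributes directly.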

Phoenix vs Other LLM Observability

Aspect | Arize Phoenix | Langfuse | MLflow
Standard | OpenTelemetry/OTLP | OTel + SDK | MLflow tracing
Focus | Tracing + eval + retrieval debug | Tracing + prompts + eval | Full ML platform + LLM
Self-host | Yes | Yes | Yes
Best for | OTel-native eval & RAG debugging | Product analytics + prompts | Teams already on MLflow

Resources