Arize Phoenix - LLM Observability & Evaluation Cheatsheet

Arize Phoenix is an open-source observability and evaluation platform for LLM applications and agents. It is OpenTelemetry-native: instrumentation emits standard OTLP traces that Phoenix collects, so you can see every prompt, model call, tool invocation, and retrieval step, then run evaluations over them. It runs locally (notebook, container, or self-hosted server) and integrates with common frameworks through OpenInference instrumentors.

Installation

Method	Command
pip	`pip install arize-phoenix`
Run a local server	`phoenix serve` (UI at `http://localhost:6006`)
Docker	`docker run -p 6006:6006 arizephoenix/phoenix:latest`
In a notebook	`import phoenix as px; px.launch_app()`
Instrumentation extras	`pip install openinference-instrumentation-openai` (one package per library; see Framework Instrumentation)

Tracing Quickstart

from phoenix.otel import register

# Configure the tracer provider and point it at a Phoenix collector
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Auto-instrument a library (example: OpenAI)
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# ...now your LLM calls show up as traces in Phoenix

Concept	Meaning
Span	One operation (LLM call, tool call, retriever step)
Trace	A tree of spans for one request
Project	A named collection of traces
OpenInference	The semantic conventions Phoenix uses over OTLP
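
To make the span/trace relationship concrete, here is a minimal standard-library sketch that rebuilds one trace's tree from flat span records. The records and their field names (`trace_id`, `span_id`, `parent_id`, `name`) are hypothetical stand-ins for illustration, not Phoenix's export schema.

```python
from collections import defaultdict

# Hypothetical flat span records; field names are illustrative only.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "handle_request"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "retrieve"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "llm_call"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "name": "tool_call"},
    {"trace_id": "t2", "span_id": "e", "parent_id": None, "name": "handle_request"},
]


def render_trace(spans, trace_id):
    """Return indented lines for one trace, parents listed before children."""
    # Group the spans of this trace by the id of their parent span.
    children = defaultdict(list)
    for s in spans:
        if s["trace_id"] == trace_id:
            children[s["parent_id"]].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in children[parent_id]:
            lines.append("  " * depth + s["name"])
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


tree_t1 = render_trace(spans, "t1")
print("\n".join(tree_t1))
```

Spans that share a `trace_id` belong to one request; the parent pointer is what turns the flat list into a tree.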

Framework Instrumentation

Library	Instrumentor package
OpenAI	`openinference-instrumentation-openai`
LangChain	`openinference-instrumentation-langchain`
LlamaIndex	`openinference-instrumentation-llama-index`
Anthropic	`openinference-instrumentation-anthropic`
DSPy	`openinference-instrumentation-dspy`
Auto (all installed instrumentors)	enable via `register(auto_instrument=True)`

Manual Spans

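from opentelemetry import trace

# Uses the global tracer provider, so call register(...) before this runs
tracer = trace.get_tracer(__name__)

# `query` and `retriever` are placeholders for your own input and retrieval function
with tracer.start_as_current_span("retrieve") as span:
    span.set_attribute("openinference.span.kind", "RETRIEVER")
    span.set_attribute("input.value", query)
    docs = retriever(query)
    # OpenInference uses retrieval.documents.* for per-document fields, so the
    # count goes under a custom attribute name (chosen here for illustration)
    span.set_attribute("retrieval.document_count", len(docs))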

Evaluation

Phoenix can score traced spans, most commonly with LLM-as-judge evals, and attach the results to the corresponding spans.

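from phoenix.evals import llm_classify, OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE

# spans_df: a pandas DataFrame with the columns the template fills in
# (input, reference, output); OpenAIModel reads OPENAI_API_KEY from the environment
results = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["factual", "hallucinated"],  # allowed output labels
)
# results is a DataFrame with one row per input row and a `label` column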

Eval	Measures
Hallucination	Is the answer grounded in the context?
QA correctness	Is the answer correct vs reference?
Relevance	Are retrieved docs relevant to the query?
Toxicity	Unsafe/toxic content
Custom	Your own LLM-as-judge or code template
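
Once an eval has produced one label per row, a common next step is to turn the labels into rates. The following standard-library sketch assumes a hypothetical list of labels and treats anything outside the rails (represented here as `None`, which is an assumption) as unparsed.

```python
from collections import Counter

# Hypothetical labels, one per evaluated row; None marks a row the judge
# could not map to a rail (an assumption made for this sketch).
labels = ["factual", "hallucinated", "factual", "factual", None, "hallucinated"]


def label_rates(labels, rails):
    """Return (rate per rail among valid labels, number of unparsed rows)."""
    valid = [label for label in labels if label in rails]
    counts = Counter(valid)
    total = len(valid)
    rates = {rail: (counts[rail] / total if total else 0.0) for rail in rails}
    return rates, len(labels) - total


rates, unparsed = label_rates(labels, ["factual", "hallucinated"])
print(rates, "unparsed:", unparsed)
```

Reporting the unparsed count separately keeps judge failures from being silently counted as either label.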

Datasets & Experiments

Capability	Use
Datasets	Curate example inputs/outputs from traces
Experiments	Run a task over a dataset and score it
`px.Client().upload_dataset(...)`	Create a dataset programmatically
Compare runs	Track quality across prompt/model changes
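
The experiment idea in the table (run a task over a dataset, score it, compare runs) can be sketched in plain Python. This is not the Phoenix experiments API. The dataset, the two task versions, and the `exact_match` evaluator are all hypothetical, chosen only to show the shape of the loop.

```python
from statistics import mean

# Hypothetical dataset of inputs with expected outputs.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]

# Two hypothetical task versions, standing in for two prompt or model variants.
ANSWERS_V1 = {"2 + 2": "4", "capital of France": "paris"}
ANSWERS_V2 = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}


def task_v1(text):
    return ANSWERS_V1.get(text, "")


def task_v2(text):
    return ANSWERS_V2.get(text, "")


def exact_match(output, expected):
    """Case-insensitive exact match, scored as 1.0 or 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(dataset, task, evaluator):
    """Run the task on every example and record its output and score."""
    rows = []
    for example in dataset:
        output = task(example["input"])
        score = evaluator(output, example["expected"])
        rows.append({"input": example["input"], "output": output, "score": score})
    return rows


runs = {
    name: run_experiment(dataset, task, exact_match)
    for name, task in {"v1": task_v1, "v2": task_v2}.items()
}
summary = {name: mean(row["score"] for row in rows) for name, rows in runs.items()}
print(summary)
```

Holding the dataset and evaluator fixed while only the task changes is what makes the two mean scores comparable.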

Retrieval Debugging

Feature	Helps with
Span attributes for retrieved docs	See exactly what RAG fetched
Embedding/UMAP views	Spot clusters and drift in embeddings
Eval on retrieval relevance	Quantify retrieval quality per query
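
One way to quantify retrieval quality per query is to turn per-document relevance labels into ranking metrics. The sketch below computes precision@k and reciprocal rank from hypothetical labels, listed in retrieval order. The queries and labels are invented for illustration.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical relevance labels per query, in the order documents were retrieved.
per_query = {
    "how do I start phoenix?": [True, False, True],
    "what is a span?": [False, True, False],
    "pricing": [False, False, False],
}

report = {
    query: {"p@2": precision_at_k(rel, 2), "rr": reciprocal_rank(rel)}
    for query, rel in per_query.items()
}
mean_rr = sum(row["rr"] for row in report.values()) / len(report)

for query, row in report.items():
    print(query, row)
print("mean reciprocal rank:", mean_rr)
```

Queries whose reciprocal rank is 0.0 retrieved nothing relevant, which usually makes them the first ones worth inspecting in the trace view.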

Phoenix vs Other LLM Observability

Aspect	Arize Phoenix	Langfuse	MLflow
Standard	OpenTelemetry/OTLP	OTel + SDK	MLflow tracing
Focus	Tracing + eval + retrieval debug	Tracing + prompts + eval	Full ML platform + LLM
Self-host	Yes	Yes	Yes
Best for	OTel-native eval & RAG debugging	Product analytics + prompts	Teams already on MLflow
