Arize Phoenix - LLM Observability & Evaluation Cheatsheet
Arize Phoenix is an open-source observability and evaluation platform for LLM applications and agents. It is OpenTelemetry-native: instrumentation emits standard OTLP traces that Phoenix collects, so you can see every prompt, model call, tool invocation, and retrieval step, then run evaluations over them. It runs locally (notebook, container, or self-hosted server) and integrates with common frameworks through OpenInference instrumentors.
Installation
| Method | Command |
|---|---|
| pip | pip install arize-phoenix |
| Run a local server | phoenix serve (UI at http://localhost:6006) |
| Docker | docker run -p 6006:6006 arizephoenix/phoenix:latest |
| In a notebook | import phoenix as px; px.launch_app() |
| Instrumentation extras | pip install openinference-instrumentation-openai (and others) |
Tracing Quickstart
from phoenix.otel import register
# Configure the tracer provider and point it at a Phoenix collector
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)
# Auto-instrument a library (example: OpenAI)
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# ...now your LLM calls show up as traces in Phoenix
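The endpoint passed to register above is the server's base URL plus the OTLP HTTP path /v1/traces. A minimal sketch of deriving it from configuration follows. The helper name traces_endpoint is hypothetical, and reading the base URL from a PHOENIX_COLLECTOR_ENDPOINT environment variable is an assumption to verify against your Phoenix version.

```python
import os

DEFAULT_BASE_URL = "http://localhost:6006"  # local server address from the install table
TRACES_PATH = "/v1/traces"                  # OTLP-over-HTTP path used in the quickstart


def traces_endpoint(base_url=None):
    """Return the full OTLP HTTP traces URL for a Phoenix collector.

    Hypothetical helper: falls back to the PHOENIX_COLLECTOR_ENDPOINT
    environment variable (an assumed name), then to the local default.
    """
    base = base_url or os.environ.get("PHOENIX_COLLECTOR_ENDPOINT") or DEFAULT_BASE_URL
    base = base.rstrip("/")
    # Avoid doubling the path if the caller already included it
    if base.endswith(TRACES_PATH):
        return base
    return base + TRACES_PATH
```

The result can be passed as the endpoint argument, so the same code works against a local server and a self-hosted one.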
| Concept | Meaning |
|---|---|
| Span | One operation (LLM call, tool call, retriever step) |
| Trace | A tree of spans for one request |
| Project | A named collection of traces |
| OpenInference | The semantic conventions Phoenix uses over OTLP |
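To make the span/trace relationship concrete, here is a small sketch that links flat spans into a tree through their parent IDs. The Span class and helper names are hypothetical and only illustrate the concept; they are not Phoenix's or OpenTelemetry's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    parent_id: Optional[str] = None  # None marks the root span of the trace
    children: list = field(default_factory=list)


def build_trace_tree(spans):
    """Link flat spans into a tree and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    return root


def render(span, depth=0):
    """Return one indented line per span, depth-first."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One request: a retrieval step, then an LLM call that triggers a tool call
spans = [
    Span("1", "handle_request"),
    Span("2", "retrieve", parent_id="1"),
    Span("3", "llm_call", parent_id="1"),
    Span("4", "tool_call", parent_id="3"),
]
tree = build_trace_tree(spans)
print("\n".join(render(tree)))
```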
Framework Instrumentation
| Library | Instrumentor package |
|---|---|
| OpenAI | openinference-instrumentation-openai |
| LangChain | openinference-instrumentation-langchain |
| LlamaIndex | openinference-instrumentation-llama-index |
| Anthropic | openinference-instrumentation-anthropic |
| DSPy | openinference-instrumentation-dspy |
| Auto (many) | enable via register(auto_instrument=True) |
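Before instrumenting, it can help to check which of the packages in the table above are importable. The sketch below uses only the standard library. The module paths other than the OpenAI one (which appears in the quickstart) are assumed to mirror the package names, and the helper names are hypothetical.

```python
from importlib.util import find_spec

# Pip package -> importable module (paths assumed to mirror the package names)
INSTRUMENTORS = {
    "openinference-instrumentation-openai": "openinference.instrumentation.openai",
    "openinference-instrumentation-langchain": "openinference.instrumentation.langchain",
    "openinference-instrumentation-llama-index": "openinference.instrumentation.llama_index",
    "openinference-instrumentation-anthropic": "openinference.instrumentation.anthropic",
    "openinference-instrumentation-dspy": "openinference.instrumentation.dspy",
}


def is_importable(module_name):
    """Return True if the module can be located without importing it fully."""
    try:
        return find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `openinference`) is not installed
        return False


def missing_instrumentors():
    """List the pip packages whose modules cannot be found."""
    return [pkg for pkg, mod in INSTRUMENTORS.items() if not is_importable(mod)]


missing = missing_instrumentors()
if missing:
    print("pip install " + " ".join(missing))
```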
Manual Spans
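from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `query` and `retriever` are placeholders for your own input and retrieval function
with tracer.start_as_current_span("retrieve") as span:
    span.set_attribute("input.value", query)
    docs = retriever(query)
    # Custom attribute holding the count; OpenInference reserves the
    # `retrieval.documents` prefix for the documents themselves
    span.set_attribute("retrieval.num_documents", len(docs))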
Evaluation
Phoenix can score traces, most commonly with LLM-as-judge evals, and attach the results to them.
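from phoenix.evals import llm_classify, OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE

# `spans_df` is a placeholder DataFrame; it needs the columns the template
# references (for the hallucination template: input, reference, output)
results = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["factual", "hallucinated"],
)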
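The rails argument in the call above restricts the judge's answer to a fixed label set. The sketch below illustrates that idea in plain Python. It is not Phoenix's implementation; the function name snap_to_rails and the "NOT_PARSABLE" fallback label are assumptions for illustration.

```python
import re

RAILS = ["factual", "hallucinated"]


def snap_to_rails(raw_output, rails, fallback="NOT_PARSABLE"):
    """Map a free-text judge response onto exactly one allowed label."""
    text = raw_output.strip().lower()
    # Whole-word matching keeps a rail like "relevant" from matching "irrelevant"
    matches = [
        rail for rail in rails
        if re.search(r"\b" + re.escape(rail.lower()) + r"\b", text)
    ]
    # Accept only an unambiguous match; anything else is flagged for review
    return matches[0] if len(matches) == 1 else fallback


raw_responses = ["factual", "Hallucinated.", "The answer is factual", "unsure"]
labels = [snap_to_rails(r, RAILS) for r in raw_responses]
hallucination_rate = labels.count("hallucinated") / len(labels)
```

Keeping unparsable responses as their own label, rather than guessing, makes it easy to see how often the judge strays from the rails.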
| Eval | Measures |
|---|---|
| Hallucination | Is the answer grounded in the context? |
| QA correctness | Is the answer correct vs reference? |
| Relevance | Are retrieved docs relevant to the query? |
| Toxicity | Unsafe/toxic content |
| Custom | Your own LLM-as-judge or code template |
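A custom eval does not have to call an LLM. As a minimal sketch of a code-based check, the function below scores QA correctness by normalized exact match against a reference. The function names and the label/score result shape are hypothetical, and exact match is a deliberately strict stand-in for a judged correctness eval.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_eval(output, reference):
    """Return a label and a 0/1 score for one answer/reference pair."""
    correct = normalize(output) == normalize(reference)
    return {"label": "correct" if correct else "incorrect", "score": float(correct)}


result = exact_match_eval("Paris.", " paris")
```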
Datasets & Experiments
| Capability | Use |
|---|---|
| Datasets | Curate example inputs/outputs from traces |
| Experiments | Run a task over a dataset and score it |
| px.Client().upload_dataset(...) | Create a dataset programmatically |
| Compare runs | Track quality across prompt/model changes |
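Conceptually, an experiment runs a task over every dataset example and scores each output, and comparing runs means comparing those aggregate scores. The sketch below shows that loop in plain Python. All names are hypothetical, the two tasks are hard-coded stand-ins for prompt or model variants, and none of this uses the Phoenix experiments API.

```python
from statistics import mean


def run_experiment(dataset, task, evaluator):
    """Apply `task` to each example input and return the mean evaluator score."""
    scores = []
    for example in dataset:
        output = task(example["input"])
        scores.append(evaluator(output, example["expected"]))
    return mean(scores)


def exact(output, expected):
    """Score 1.0 for an exact string match, else 0.0."""
    return 1.0 if output == expected else 0.0


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
    {"input": "10-7", "expected": "3"},
]


def task_v1(question):
    # Stand-in for a weak variant: always answers "4"
    return "4"


def task_v2(question):
    # Stand-in for a better variant: knows two of the three answers
    return {"2+2": "4", "3*3": "9"}.get(question, "?")


results = {
    "v1": run_experiment(dataset, task_v1, exact),
    "v2": run_experiment(dataset, task_v2, exact),
}
best = max(results, key=results.get)
```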
Retrieval Debugging
| Feature | Helps with |
|---|---|
| Span attributes for retrieved docs | See exactly what RAG fetched |
| Embedding/UMAP views | Spot clusters and drift in embeddings |
| Eval on retrieval relevance | Quantify retrieval quality per query |
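One common way to quantify retrieval quality per query is precision@k over per-document relevance labels, such as those a relevance eval produces. A minimal sketch with made-up queries and labels follows; the function name is hypothetical.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved documents judged relevant.

    `relevance` is a list of booleans in rank order. Dividing by k
    (not by the number returned) penalizes queries that retrieve
    fewer than k documents.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


# Relevance labels per query, in the order the retriever ranked the documents
per_query = {
    "how do I reset my password?": [True, True, False],
    "pricing for enterprise plan": [False, False, True],
}
scores = {query: precision_at_k(labels, k=2) for query, labels in per_query.items()}
```

A low score for a single query, as with the second one here, points at the specific retrieval to inspect in the trace.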
Phoenix vs Other LLM Observability
| Aspect | Arize Phoenix | Langfuse | MLflow |
|---|---|---|---|
| Standard | OpenTelemetry/OTLP | OTel + SDK | MLflow tracing |
| Focus | Tracing + eval + retrieval debug | Tracing + prompts + eval | Full ML platform + LLM |
| Self-host | Yes | Yes | Yes |
| Best for | OTel-native eval & RAG debugging | Product analytics + prompts | Teams already on MLflow |
Resources