Arize Phoenix - LLM Observability and Evaluation Cheat Sheet

Arize Phoenix is an open-source observability and evaluation platform for LLM applications and agents. It is OpenTelemetry-native: instrumentation emits standard OTLP traces that Phoenix collects, so you can inspect every prompt, model call, tool invocation, and retrieval step, then run evaluations on them. It runs locally (in a notebook, a container, or a self-hosted server) and integrates with common frameworks through OpenInference instrumentors.

Installation

Method	Command
pip	`pip install arize-phoenix`
Run a local server	`phoenix serve` (UI at `http://localhost:6006`)
Docker	`docker run -p 6006:6006 arizephoenix/phoenix:latest`
In a notebook	`import phoenix as px; px.launch_app()`
Instrumentation extras	`pip install openinference-instrumentation-openai` (and others)

Tracing quickstart

from phoenix.otel import register

# Configure the tracer provider and point it at a Phoenix collector
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Auto-instrument a library (example: OpenAI)
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# ...now your LLM calls show up as traces in Phoenix

Concept	Meaning
Span	A single operation (LLM call, tool call, retrieval step)
Trace	A tree of spans for one request
Project	A named collection of traces
OpenInference	The semantic conventions Phoenix uses on top of OTLP
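
To make the span/trace relationship concrete, here is a minimal sketch in plain Python. It is not Phoenix's internal data model: the `Span` dataclass and the `build_tree` helper are hypothetical names used only for illustration, and real spans are created by OpenTelemetry.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str


def build_tree(spans):
    """Map each parent span id to the names of its child spans."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span.name)
    return dict(children)


# One request: an agent span that retrieves documents, then calls the LLM
trace_spans = [
    Span("a", None, "agent"),
    Span("b", "a", "retrieve"),
    Span("c", "a", "llm"),
]
tree = build_tree(trace_spans)
```

The root span has no parent, and every other span points at the operation that triggered it, which is what lets a trace viewer render one request as a nested tree.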

Framework instrumentation

Library	Instrumentor package
OpenAI	`openinference-instrumentation-openai`
LangChain	`openinference-instrumentation-langchain`
LlamaIndex	`openinference-instrumentation-llama-index`
Anthropic	`openinference-instrumentation-anthropic`
DSPy	`openinference-instrumentation-dspy`
Auto (many)	enable via `register(auto_instrument=True)`

Manual spans

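from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `query` and `retriever` come from your own application
with tracer.start_as_current_span("retrieve") as span:
    span.set_attribute("openinference.span.kind", "RETRIEVER")
    span.set_attribute("input.value", query)
    docs = retriever(query)
    # OpenInference records retrieved documents as indexed attributes
    for i, doc in enumerate(docs):
        span.set_attribute(f"retrieval.documents.{i}.document.content", str(doc))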
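
To record per-document details on a retrieval span, the documents can be flattened into indexed attributes. The sketch below is plain Python and only builds the attribute dictionary. The `retrieval_attributes` helper and the shape of the input dicts (`content`, optional `score`) are assumptions for illustration. The key layout follows the OpenInference retrieval conventions as commonly documented, so check it against the current spec before relying on it.

```python
def retrieval_attributes(docs):
    """Flatten retrieved documents into indexed span attributes."""
    attrs = {}
    for i, doc in enumerate(docs):
        prefix = f"retrieval.documents.{i}.document"
        attrs[f"{prefix}.content"] = doc["content"]
        # The score is optional; only record it when the retriever returns one
        if "score" in doc:
            attrs[f"{prefix}.score"] = float(doc["score"])
    return attrs


docs = [
    {"content": "Phoenix collects OTLP traces.", "score": 0.92},
    {"content": "Spans form a tree."},
]
attrs = retrieval_attributes(docs)
```

Inside a `with tracer.start_as_current_span(...)` block, the resulting dictionary can be attached in one call with `span.set_attributes(attrs)`.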

Evaluation

Phoenix can score traces (typically with LLM-as-judge evaluations) and attach the results to them.

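from phoenix.evals import llm_classify, OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE

# spans_df: a pandas DataFrame with the `input`, `reference`, and `output`
# columns that the hallucination template fills in
results = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["factual", "hallucinated"],  # the only labels the judge may return
)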

Eval	Measures
Hallucination	Is the answer grounded in the context?
QA correctness	Is the answer correct with respect to the reference?
Relevance	Are the retrieved documents relevant to the query?
Toxicity	Unsafe or toxic content
Custom	Your own LLM-as-judge template or custom code-based evaluator
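
As an illustration of the "Custom" row, a code-based evaluator can be an ordinary function that compares an output with a reference. The sketch below is plain Python and does not use any Phoenix API. The name `exact_match_eval` and the `label`/`score` result shape are assumptions chosen for the example.

```python
import re


def _normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def exact_match_eval(output, reference):
    """Return a label and a numeric score for one output/reference pair."""
    matched = _normalize(output) == _normalize(reference)
    return {
        "label": "correct" if matched else "incorrect",
        "score": 1.0 if matched else 0.0,
    }


result = exact_match_eval("Paris.", "paris")
```

A deterministic check like this is cheap and reproducible, so it can complement an LLM judge for cases where the expected answer is short and unambiguous.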

Datasets and experiments

Capability	Use
Datasets	Curate input/output examples from traces
Experiments	Run a task over a dataset and score the results
`px.Client().upload_dataset(...)`	Create a dataset programmatically
Compare runs	Track quality across prompt and model changes
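
The idea behind an experiment (run a task over every example, score each output, compare runs) can be sketched in plain Python. This is not the Phoenix experiments API: `run_experiment`, `exact`, and the toy dataset are hypothetical, and the two lambdas stand in for two prompt or model variants.

```python
def run_experiment(dataset, task, evaluator):
    """Run `task` on every example and return the mean evaluator score."""
    scores = []
    for example in dataset:
        output = task(example["input"])
        scores.append(evaluator(output, example["expected"]))
    return sum(scores) / len(scores)


def exact(output, expected):
    """Score 1.0 for an exact string match, 0.0 otherwise."""
    return 1.0 if output == expected else 0.0


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]

# Variant A always answers "4"; variant B actually adds the operands
baseline = run_experiment(dataset, lambda q: "4", exact)
candidate = run_experiment(
    dataset, lambda q: str(sum(int(n) for n in q.split("+"))), exact
)
```

Holding the dataset and evaluator fixed while changing only the task is what makes the two scores comparable.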

Retrieval debugging

Feature	Helps with
Span attributes for retrieved docs	See exactly what the RAG pipeline retrieved
Embedding/UMAP views	Spot clusters and drift in embeddings
Eval on retrieval relevance	Quantify retrieval quality per query
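
Once each retrieved document has a relevant/irrelevant label, per-query retrieval quality can be summarized with a metric such as precision@k. The sketch below is plain Python under the assumption that labels are booleans in rank order. `precision_at_k` and the sample data are illustrative, not part of Phoenix.

```python
def precision_at_k(relevance_labels, k):
    """Fraction of the top-k retrieved documents that were judged relevant."""
    top_k = relevance_labels[:k]
    if not top_k:
        return 0.0
    return sum(top_k) / len(top_k)


# Relevance labels per query, ordered by retrieval rank
labels_by_query = {
    "what is a span?": [True, True, False],
    "how do I self-host?": [False, True, False],
}
scores = {
    query: precision_at_k(labels, k=2)
    for query, labels in labels_by_query.items()
}
```

Queries with a low score are good candidates for inspecting the retrieved documents on the corresponding spans.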

Arize Phoenix vs other LLM observability tools

Aspect	Arize Phoenix	Langfuse	MLflow
Standard	OpenTelemetry/OTLP	OTel + SDK	MLflow tracing
Focus	Tracing + eval + retrieval debug	Tracing + prompts + eval	Full ML platform + LLM
Self-host	Yes	Yes	Yes
Best for	OTel-native eval & RAG debugging	Product analytics + prompts	Teams already on MLflow