Monitoring & Observability for GenAI Systems¶
Traditional monitoring tells you if the service is alive. GenAI observability tells you if the service is alive, useful, safe, and worth the money.
★ TL;DR¶
- What: The tracing, metrics, logs, and feedback loops used to understand AI behavior in production.
- Why: GenAI systems can be "up" while still being wrong, unsafe, or too expensive.
- Key point: You need both system telemetry and quality telemetry.
★ Overview¶
Definition¶
Monitoring tracks known signals such as latency and error rates. Observability adds enough telemetry to investigate unknown failures, regressions, and user-quality breakdowns.
Scope¶
This note focuses on production telemetry for LLM apps, RAG systems, and agents. For offline quality methodology, see LLM Evaluation Deep Dive.
Significance¶
- AI failures are often semantic, not just infrastructural.
- The same request can succeed technically and fail product-wise.
- Observability is what turns prompt iteration into engineering instead of guesswork.
Prerequisites¶
★ Deep Dive¶
The Four Telemetry Layers¶
| Layer | What You Track | Example Signals |
|---|---|---|
| Infrastructure | Service health | CPU, GPU, memory, request rate, errors |
| Runtime | Model-call behavior | TTFT, tokens/sec, retries, tool latency |
| Quality | Output usefulness | groundedness, rubric score, retrieval relevance |
| Business | User outcome | resolution rate, conversion, retention, escalations |
Why GenAI Needs Traces¶
A single user request can include:
- retrieval
- reranking
- prompt assembly
- one or more model calls
- tool execution
- post-processing and validation
Without traces, teams see only "the answer was bad" and cannot tell which of these stages failed.
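A minimal, vendor-neutral sketch of stage-level tracing: time each step under one request id so a bad answer can be pinned to retrieval, prompt assembly, or the model call. The pipeline and stage names here are hypothetical placeholders, not a specific SDK's API.
import time
import uuid
from contextlib import contextmanager

@contextmanager
def stage(trace: dict, name: str):
    """Record wall-clock duration and success for one pipeline stage."""
    start = time.monotonic()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise
    finally:
        trace["stages"].append(
            {"name": name, "ms": round((time.monotonic() - start) * 1000), "ok": ok}
        )

def run_rag_request(question: str) -> dict:
    trace = {"request_id": str(uuid.uuid4()), "stages": []}
    with stage(trace, "retrieval"):
        docs = ["kb_41", "kb_77"]                          # stand-in for a vector search
    with stage(trace, "prompt_assembly"):
        prompt = f"Context: {docs}\nQuestion: {question}"
    with stage(trace, "model_call"):
        answer = f"(model answer to: {prompt[:40]}...)"    # stand-in for the LLM call
    trace["answer"] = answer
    return trace

print(run_rag_request("How do I reset my password?"))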
What To Capture Per Request¶
| Signal | Reason |
|---|---|
| Input metadata | tenant, route, model, prompt version |
| Retrieval context | documents returned, scores, chunk ids |
| Model metadata | latency, token usage, finish reason |
| Tool events | tool selected, tool latency, tool result summary |
| Validation outcome | schema pass/fail, policy pass/fail |
| User outcome | thumbs up/down, escalation, retry |
Production Metrics That Matter¶
| Metric | Why It Matters |
|---|---|
| P95 latency | Better user experience indicator than average latency |
| Cost per successful task | More meaningful than cost per request |
| Groundedness rate | Useful for knowledge-heavy assistants |
| Tool success rate | Critical for agents and workflow systems |
| Fallback frequency | Reveals overload or low-confidence issues |
| Human escalation rate | Strong proxy for trust and failure severity |
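The cost metric deserves a concrete sketch: cost per request averages over failed attempts, while cost per successful task charges them to the work that actually landed. The prices and success flags below are illustrative assumptions, not real rates.
# Hypothetical per-request records: token usage plus whether the task actually succeeded.
requests = [
    {"prompt_tokens": 1200, "completion_tokens": 300, "succeeded": True},
    {"prompt_tokens": 900,  "completion_tokens": 250, "succeeded": False},  # user retried
    {"prompt_tokens": 1500, "completion_tokens": 400, "succeeded": True},
]

# Assumed illustrative prices per 1M tokens; substitute your provider's real rates.
PRICE_PER_M_PROMPT, PRICE_PER_M_COMPLETION = 0.15, 0.60

def request_cost(r: dict) -> float:
    return (r["prompt_tokens"] * PRICE_PER_M_PROMPT
            + r["completion_tokens"] * PRICE_PER_M_COMPLETION) / 1_000_000

total_cost = sum(request_cost(r) for r in requests)
successes = sum(r["succeeded"] for r in requests)
print(f"cost per request:         ${total_cost / len(requests):.6f}")
print(f"cost per successful task: ${total_cost / max(successes, 1):.6f}")  # failures still cost money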
Alerts You Actually Want¶
Alert on:
- latency spikes beyond target budget
- error or timeout bursts
- unusual cost jumps
- sudden drops in evaluation or feedback scores
- retrieval failure surges
- guardrail trigger spikes
Do not alert on every token count wiggle.
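A sketch of that last point: fire only when a metric stays over budget for several consecutive windows, not on every wiggle. The threshold and window count are assumptions, and in most stacks this logic lives in Prometheus or Grafana alert rules rather than application code.
from collections import deque

class SustainedBreachAlert:
    """Fire only when a metric stays above its budget for N consecutive windows."""
    def __init__(self, threshold: float, windows_required: int = 3):
        self.threshold = threshold
        self.recent = deque(maxlen=windows_required)

    def observe(self, value: float) -> bool:
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

p95_alert = SustainedBreachAlert(threshold=4.0)        # assumed latency budget in seconds
for window_p95 in [3.1, 4.8, 5.2, 5.0]:                # one p95 value per 5-minute window
    if p95_alert.observe(window_p95):
        print(f"ALERT: p95 latency {window_p95}s over budget for 3 consecutive windows")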
Tooling Landscape¶
The specific platform mix changes frequently, but the stable categories are:
- tracing and session inspection
- prompt and dataset evaluation
- infrastructure/APM metrics
- feedback collection
Example categories and platform names last verified: 2026-04.
Observability Platform Comparison (April 2026)¶
| Platform | Primary Strength | Ideal For | Key Advantage | Deployment |
|---|---|---|---|---|
| Langfuse | Tracing + prompt management | Self-hosted privacy, open-source teams | Full control, generous free tier, OTel-compatible | Self-hosted / Cloud |
| Arize Phoenix | ML + LLM observability | Teams already using Arize for ML monitoring | Unified ML+LLM observability, notebook-friendly | Open-source |
| Braintrust | Eval-first CI/CD integration | Teams shipping fast with eval gates | Prompt playground, dataset management, scoring API | Cloud |
| LangSmith | LangChain-native tracing | LangChain/LangGraph users | Deep integration with LangChain ecosystem | Cloud |
| Latitude | Issue lifecycle management | Teams focused on failure triage workflows | Issue → root cause → fix lifecycle tracking | Cloud |
Platform decision guide:
- Using LangChain/LangGraph? → LangSmith (deepest integration)
- Need self-hosted / data residency? → Langfuse (open-source, self-hosted)
- Already using Arize for ML? → Phoenix (unified ML + LLM)
- Eval-gated CI/CD is the priority? → Braintrust (eval → deploy pipeline)
- Starting from scratch? → Langfuse (best free tier, OTel-native)
Example Trace Schema¶
{
  "request_id": "req_123",
  "route": "support-assistant",
  "model": "gpt-4o-mini",
  "prompt_version": "support-v7",
  "retrieval": {
    "top_k": 5,
    "doc_ids": ["kb_41", "kb_77"]
  },
  "usage": {
    "prompt_tokens": 1210,
    "completion_tokens": 182
  },
  "quality": {
    "grounded": true,
    "feedback": "upvote"
  }
}
Practical Workflow¶
- Define the product outcome you care about.
- Map the request path and log the critical stages.
- Add dashboards for infra and quality separately.
- Review low-scoring traces weekly.
- Feed findings back into eval sets and CI/CD (a sketch follows below).
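A sketch of steps 4 and 5, assuming a hypothetical weekly trace export and a JSONL eval set consumed by CI; the field names and file path are illustrative.
import json
from pathlib import Path

EVAL_SET = Path("eval_set.jsonl")  # dataset later replayed by offline evals and CI gates

def harvest_low_scoring_traces(traces: list[dict], score_threshold: float = 0.5) -> int:
    """Append poorly scored traces so this week's failures become next week's test cases."""
    added = 0
    with EVAL_SET.open("a", encoding="utf-8") as out:
        for t in traces:
            if t.get("quality_score", 1.0) < score_threshold:
                out.write(json.dumps({"input": t["input"],
                                      "source_trace": t["request_id"]}) + "\n")
                added += 1
    return added

# Hypothetical weekly export from the tracing platform
weekly_traces = [
    {"request_id": "req_123", "input": "How do I cancel my plan?", "quality_score": 0.2},
    {"request_id": "req_124", "input": "What is your refund policy?", "quality_score": 0.9},
]
print(harvest_low_scoring_traces(weekly_traces), "traces added to the eval set")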
◆ Quick Reference¶
| If You Need To Diagnose... | Inspect First |
|---|---|
| Slow answers | Trace timings across retrieval, model, and tools |
| Expensive answers | token usage, routing policy, retries, prompt size |
| Bad facts | retrieval payload, citations, groundedness checks |
| Broken agents | trajectory trace and tool-call outcomes |
| User dissatisfaction | feedback-linked traces and failure clusters |
○ Gotchas & Common Mistakes¶
- Logging raw prompts and documents can create privacy and security problems; see the redaction sketch after this list.
- Dashboards without trace drill-down rarely solve semantic failures.
- Teams often track cost per request but ignore cost per successful task.
- Manual spot checks are not enough once traffic grows.
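A small sketch of the first gotcha's mitigation: emit each trace as a structured JSON log line and drop raw retrieved text before it is persisted. The field names follow the example schema above; `raw_context` is a hypothetical field holding document text.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm_trace")

def emit_trace(record: dict) -> None:
    """Log one JSON line per request, stripping raw retrieved text before it leaves the process."""
    safe = {k: v for k, v in record.items() if k != "raw_context"}
    logger.info(json.dumps(safe, ensure_ascii=False))

emit_trace({
    "request_id": "req_123",
    "route": "support-assistant",
    "model": "gpt-4o-mini",
    "prompt_version": "support-v7",
    "retrieval": {"top_k": 5, "doc_ids": ["kb_41", "kb_77"]},
    "raw_context": "full document text that should not end up in logs",
    "usage": {"prompt_tokens": 1210, "completion_tokens": 182},
    "quality": {"grounded": True, "feedback": "upvote"},
})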
○ Interview Angles¶
- Q: Why is observability harder for LLM systems than for normal APIs?
- A: Because correctness is not binary. The system can return a 200 response and still be wrong, unsafe, or unhelpful. You need traceable context, output quality signals, and user feedback, not just uptime metrics.
- Q: What is the minimum telemetry for a production RAG system?
- A: Request id, model and prompt version, retrieval documents and scores, token usage, latency, validation status, and user feedback. That gives you enough context to debug both system and semantic failures.
★ Code & Implementation¶
LLM Metrics with Prometheus¶
# pip install openai>=1.60 prometheus_client>=0.20
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, prometheus_client>=0.20
import time
from openai import OpenAI
from prometheus_client import Counter, Histogram, start_http_server
REQUEST_COUNT = Counter("llm_requests_total", "LLM API calls", ["model", "status"])
LATENCY_HIST = Histogram("llm_latency_seconds", "LLM latency", ["model"],
                         buckets=[0.1, 0.5, 1, 2, 5, 10, 30])
TOKEN_COUNTER = Counter("llm_tokens_total", "LLM tokens", ["model", "type"])
client = OpenAI()
start_http_server(9090) # Prometheus scrapes :9090/metrics
def monitored_call(messages: list[dict], model: str = "gpt-4o-mini") -> str:
    start = time.monotonic()
    try:
        resp = client.chat.completions.create(model=model, messages=messages, max_tokens=200)
        REQUEST_COUNT.labels(model=model, status="success").inc()
        TOKEN_COUNTER.labels(model=model, type="prompt").inc(resp.usage.prompt_tokens)
        TOKEN_COUNTER.labels(model=model, type="completion").inc(resp.usage.completion_tokens)
        return resp.choices[0].message.content
    except Exception:
        REQUEST_COUNT.labels(model=model, status="error").inc()
        raise
    finally:
        LATENCY_HIST.labels(model=model).observe(time.monotonic() - start)
print(monitored_call([{"role": "user", "content": "What is observability?"}]))
# Grafana dashboard: connect to Prometheus → visualize p50/p95 latency + error rate
LLM Tracing with Langfuse¶
# pip install langfuse>=2.0 openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: langfuse>=2.0, openai>=1.60
# Set env: LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST (if self-hosted)
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI
client = OpenAI()
@observe() # Automatically creates a trace with latency, token usage, and cost
def answer_question(user_question: str, model: str = "gpt-4o-mini") -> str:
    """Traced LLM call — appears in Langfuse dashboard with full metadata."""
    # Tag the trace for filtering in the dashboard
    langfuse_context.update_current_observation(
        metadata={"prompt_version": "support-v7", "route": "qa-pipeline"},
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_question}],
        max_tokens=200,
    )
    return response.choices[0].message.content
# Usage — each call creates a trace visible in Langfuse UI
result = answer_question("What are the key metrics for LLM observability?")
print(result)
# Dashboard shows: latency, token usage, cost, model, prompt version per trace
# Filter by: model, route, user, score, time range
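To close the feedback loop on the same trace, the v2 decorator SDK also exposes score helpers. A minimal sketch continuing the example above; the keyword check stands in for a real judge or user feedback, and you should verify score_current_trace against the current Langfuse docs.
@observe()
def answer_and_score(user_question: str) -> str:
    answer = answer_question(user_question)              # nested span under this trace
    grounded = "latency" in answer.lower()               # placeholder check, not a real judge
    langfuse_context.score_current_trace(name="groundedness", value=1.0 if grounded else 0.0)
    return answer

print(answer_and_score("Which metrics matter most for LLM observability?"))
# The score shows up on the trace in the Langfuse UI and can be filtered and charted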
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | LLMOps & Production Deployment, Agent Evaluation & Observability, LLM Evaluation Deep Dive |
| Leads to | CI/CD for ML and LLM Systems, Cost Optimization for GenAI Systems |
| Compare with | Traditional APM and log-only monitoring |
| Cross-domain | SRE, analytics engineering, experimentation |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Alert fatigue | Team ignores alerts because too many are non-actionable | Thresholds too sensitive, no severity tiers | Tiered alerting (P0-P3), alert correlation, runbook links |
| Metric cardinality explosion | Monitoring system slows or crashes | Unbounded label values (per-user metrics) | Bounded label sets, metric aggregation, pre-aggregated dashboards |
| LLM quality blind spots | Quality degrades but no alert fires | Only tracking latency/throughput, not output quality | LLM-as-judge sampling, drift detection, user feedback loops |
| Log volume cost | Logging costs exceed inference costs | Logging full prompts/completions at volume | Log sampling (1-5%), structured logging, retention policies |
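A sketch of the sampling mitigations in the table: judge only a small slice of traffic so quality drift stays visible without judging (or logging) every request. The 2% rate and the judge_response stub are assumptions; a real judge would be an LLM call scored against a rubric.
import random

SAMPLE_RATE = 0.02  # judge roughly 2% of traffic; tune to budget and traffic volume

def judge_response(question: str, answer: str) -> float:
    """Stand-in for an LLM-as-judge call returning a 0-1 quality score."""
    return 1.0 if answer else 0.0

def maybe_score(question: str, answer: str) -> float | None:
    """Score a sampled slice of responses so quality drift is visible at bounded cost."""
    if random.random() >= SAMPLE_RATE:
        return None               # unsampled: no judge call, no extra log volume
    return judge_response(question, answer)

scores = [maybe_score("What is p95 latency?", "The 95th percentile of request latency.")
          for _ in range(1000)]
print(f"judged {sum(s is not None for s in scores)} of {len(scores)} responses")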
◆ Hands-On Exercises¶
Exercise 1: Build an LLM Monitoring Dashboard¶
Goal: Create a monitoring setup that tracks latency, cost, and quality.
Time: 45 minutes
Steps:
1. Instrument a FastAPI LLM endpoint with OpenTelemetry.
2. Track p50/p95/p99 latency, token usage, and error rate.
3. Add a quality score metric (LLM-as-judge on 1% of responses).
4. Visualize in a dashboard (Grafana or matplotlib).
Expected Output: Dashboard with 4 panels: latency, throughput, cost, quality.
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | LangSmith Documentation | Production LLM observability platform |
| 🔧 Hands-on | Arize Phoenix | Open-source LLM observability and evaluation |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 9 | Monitoring patterns specific to AI systems |
| 🎥 Video | Shreya Shankar — "Rethinking ML Monitoring" | Data quality monitoring for ML systems |
★ Sources¶
- Langfuse documentation
- LangSmith documentation
- Arize Phoenix documentation
- Agent Evaluation & Observability