Agent Evaluation & Observability

Bit: A chatbot either knows the answer or hallucinates — you can catch that with one metric. An agent? It plans, selects tools, reads results, retries, backs off, re-plans, and then answers. If you only check the final answer, you're grading a cross-country road trip by looking at the parking job.


★ TL;DR

  • What: The metrics, traces, and evaluation workflows used to measure agent quality in offline and online settings
  • Why: Final-answer accuracy alone hides tool misuse, looping, latency blowups, and unsafe behavior
  • Key point: Observe the trajectory, not just the destination

★ Overview

Definition

Agent evaluation measures how well an agent completes tasks, uses tools, follows policies, and recovers from failure. Agent observability provides the traces and runtime signals needed to understand why it behaved that way.

Scope

This note covers trace design, offline and online evaluation, common metrics, code examples for production tracing, and failure modes. For general eval concepts, see LLM Evaluation & Benchmarks. For agent design, see AI Agents.

Why It Matters

  • Agents can fail long before the final answer is visible
  • Production agents need regression tests for workflows, not just prompts
  • Observability is the bridge from "we saw a bug" to "we know why it happened"
  • Without trajectory scoring, you ship agents that "look good" but cost 10x, loop endlessly, or misuse tools in ways that only matter in production

★ Deep Dive

What To Measure

Layer               Example Metrics
------------------  -----------------------------------------------------------------------------
Task outcome        task success rate, human acceptance rate
Trajectory quality  steps taken, loops, dead ends, retry rate
Tool use            tool selection accuracy, schema validity, tool failure recovery
Safety              policy violations, risky actions blocked, escalation rate
Efficiency          latency, tokens, cost, number of tool calls
Groundedness        does the final answer use evidence from tool calls, or hallucinate past them?

Offline Evaluation

Use a fixed dataset of tasks with expected outcomes, references, or rubrics.

Good for:

  • regression testing
  • comparing prompts or planners
  • validating new tools before production rollout

Typical offline inputs:

  • user task
  • allowed tools
  • gold answer or scoring rubric
  • expected intermediate actions if available
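
A minimal offline task record and scoring loop might look like the sketch below. The field names (gold_answer, expected_tools) are illustrative rather than a specific framework's schema, and it assumes an agent entry point shaped like the run_agent example in the code section further down.

# Illustrative offline task format + scoring loop (not a specific framework's schema)
OFFLINE_TASKS = [
    {
        "task": "Process refund for order #12345",
        "allowed_tools": ["lookup_order", "check_refund_policy", "process_refund", "escalate"],
        "gold_answer": "refund has been processed",
        "expected_tools": ["lookup_order", "process_refund"],  # optional intermediate expectations
    },
    # ... more tasks
]

def run_offline_suite(agent_fn, tasks: list[dict]) -> list[dict]:
    """Run every task through the agent and score against the gold data."""
    results = []
    for t in tasks:
        out = agent_fn(t["task"], t["allowed_tools"])  # expects {"answer": ..., "tools_used": [...]}
        results.append({
            "task": t["task"],
            "answer_match": t["gold_answer"].lower() in out["answer"].lower(),
            "expected_tools_used": all(tool in out["tools_used"] for tool in t.get("expected_tools", [])),
        })
    return results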

Online Evaluation

Use live traces and human or automatic feedback in production.

Good for:

  • catching drift
  • discovering new failure modes
  • measuring business impact

Typical online signals:

  • thumbs up / down
  • abandonment rate
  • fallback or human-handoff rate
  • tool error concentration
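
These signals can be rolled up directly from production session records. A minimal aggregation sketch, assuming each completed session is stored as a dict with illustrative field names (feedback, abandoned, handed_off, tool_errors):

# Aggregate online signals from production sessions (illustrative record schema)
def summarize_online_signals(sessions: list[dict]) -> dict:
    n = max(len(sessions), 1)
    rated = [s for s in sessions if s.get("feedback") in ("up", "down")]
    return {
        "thumbs_up_rate": sum(s["feedback"] == "up" for s in rated) / max(len(rated), 1),
        "abandonment_rate": sum(bool(s.get("abandoned")) for s in sessions) / n,
        "handoff_rate": sum(bool(s.get("handed_off")) for s in sessions) / n,
        "tool_error_rate": sum(s.get("tool_errors", 0) > 0 for s in sessions) / n,
    }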

Trace Design

trace
|- input
|- retrieved context or tool outputs
|- planner thoughts or state transitions
|- tool calls and arguments
|- model outputs
|- verifier decisions
`- final response + metadata

A good trace lets you answer:

  • what the agent saw
  • why it chose a tool
  • what happened after the tool call
  • where latency and cost accumulated
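
One way to capture that structure as data, sketched here with dataclasses (an illustrative schema, not any vendor's trace format):

# Illustrative trace record mirroring the tree above
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    arguments: dict
    output: str
    latency_ms: int

@dataclass
class AgentTrace:
    trace_id: str
    input: str
    context: list[str] = field(default_factory=list)        # retrieved context / tool outputs
    planner_steps: list[str] = field(default_factory=list)  # thoughts or state transitions
    tool_calls: list[ToolCall] = field(default_factory=list)
    model_outputs: list[str] = field(default_factory=list)
    verifier_decisions: list[str] = field(default_factory=list)
    final_response: str = ""
    metadata: dict = field(default_factory=dict)             # model, prompt version, tokens, cost, latency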

Evaluating Tool Use

Failure Mode            What To Check
----------------------  --------------------------------------------------
Wrong tool chosen       compare selected tool vs acceptable tools
Bad arguments           schema and semantic validation
Unnecessary tool calls  cost and latency waste
Ignored tool evidence   mismatch between evidence and final answer
Retry storm             same tool called 3+ times with identical arguments
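
Most of these checks can be scripted over the recorded tool calls. A sketch, assuming each call is stored as a dict with "tool" and "arguments" keys (an illustrative schema):

# Programmatic tool-use checks: wrong tool, retry storm, call volume
import json
from collections import Counter

def check_tool_use(tool_calls: list[dict], allowed_tools: list[str]) -> dict:
    # Count identical (tool, arguments) signatures to detect retry storms
    signatures = Counter(
        (c["tool"], json.dumps(c.get("arguments", {}), sort_keys=True)) for c in tool_calls
    )
    return {
        "wrong_tool": [c["tool"] for c in tool_calls if c["tool"] not in allowed_tools],
        "retry_storm": [tool for (tool, _args), count in signatures.items() if count >= 3],
        "total_calls": len(tool_calls),
    }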

Agent Scorecard Example

Task completion rate:        78%
Tool-call validity:          94%
Average steps per task:      6.1
Loop rate:                   8%
Human escalation rate:       5%
Median latency:              4.8 s
Median cost per task:        $0.031
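
Numbers like these are rolled up from per-run results. A sketch of the aggregation, assuming each run is recorded with illustrative keys (success, steps, looped, escalated, latency_ms, cost_usd, tool call counts):

# Build a scorecard from per-run results (illustrative per-run schema)
from statistics import median

def build_scorecard(runs: list[dict]) -> dict:
    n = max(len(runs), 1)
    total_calls = sum(r["tool_calls_total"] for r in runs) or 1
    return {
        "task_completion_rate": sum(r["success"] for r in runs) / n,
        "tool_call_validity": sum(r["tool_calls_valid"] for r in runs) / total_calls,
        "avg_steps_per_task": sum(r["steps"] for r in runs) / n,
        "loop_rate": sum(r["looped"] for r in runs) / n,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
        "median_latency_s": median(r["latency_ms"] for r in runs) / 1000 if runs else 0.0,
        "median_cost_usd": median(r["cost_usd"] for r in runs) if runs else 0.0,
    }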

Evaluation Workflow

  1. Create representative task sets
  2. Trace every run with consistent metadata
  3. Score task success, tool correctness, and safety
  4. Review worst traces manually
  5. Fix planner prompts, tool schemas, or policies
  6. Re-run regression suite before deployment
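
Step 6 can be wired straight into CI. A minimal pytest sketch with illustrative thresholds, assuming the run_agent and score_trajectory helpers from the code section below live in a module of your choosing (my_agent_eval is a hypothetical name):

# Regression gate for CI (thresholds illustrative; my_agent_eval is a hypothetical module)
import pytest
from my_agent_eval import run_agent, score_trajectory

REGRESSION_TASKS = [
    ("Process refund for order #12345",
     ["lookup_order", "check_refund_policy", "process_refund", "escalate"]),
    # ... more representative tasks
]

@pytest.mark.parametrize("task,tools", REGRESSION_TASKS)
def test_agent_regression(task, tools):
    result = run_agent(task=task, allowed_tools=tools)
    scores = score_trajectory(result)
    assert scores["tool_precision"] >= 0.9
    assert scores["composite"] >= 0.7
    assert result["cost_usd"] <= 0.05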

Observability Stack

Capability        Why You Need It
----------------  ----------------------------------------------
Tracing           reconstruct agent trajectories
Datasets          enable repeatable offline tests
Feedback capture  tie runtime quality to user outcomes
Alerts            detect loops, tool spikes, or cost blowups
Experiment tags   compare prompts, models, and planner versions
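
Alerting can start as simple threshold checks over a rolling window of recent runs. A sketch with illustrative thresholds and record fields:

# Threshold alerting over recent runs (illustrative thresholds and record schema)
from statistics import median

def check_alerts(recent_runs: list[dict], max_loop_rate: float = 0.10, max_median_cost: float = 0.05) -> list[str]:
    alerts = []
    n = max(len(recent_runs), 1)
    loop_rate = sum(bool(r.get("looped")) for r in recent_runs) / n
    med_cost = median(r.get("cost_usd", 0.0) for r in recent_runs) if recent_runs else 0.0
    if loop_rate > max_loop_rate:
        alerts.append(f"loop rate {loop_rate:.0%} exceeds {max_loop_rate:.0%}")
    if med_cost > max_median_cost:
        alerts.append(f"median cost ${med_cost:.3f} exceeds ${max_median_cost:.3f}")
    return alerts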

◆ Code & Implementation

Traced Agent Evaluation with LangSmith

# ⚠️ Last tested: 2026-04 | Requires: langsmith>=0.3, openai>=1.60
# pip install langsmith openai

from langsmith import Client, traceable
import json

client = Client()

# ═══ 1. Define a traced agent task ═══
@traceable(name="support-agent-task", tags=["eval-suite-v2"])
def run_agent(task: str, allowed_tools: list[str]) -> dict:
    """
    Run an agent task with full trajectory capture.
    In production, this wraps your real agent (LangGraph, ADK, etc).
    """
    # -- placeholder for real agent execution --
    return {
        "answer": "Your refund has been processed.",
        "steps": 4,
        "tools_used": ["lookup_order", "check_refund_policy", "process_refund"],
        "tools_available": allowed_tools,
        "total_tokens": 2847,
        "latency_ms": 3200,
        "cost_usd": 0.023,
    }


# ═══ 2. Score trajectory programmatically ═══
def score_trajectory(run_output: dict) -> dict:
    """Score an agent run on multiple trajectory dimensions."""
    steps = run_output.get("steps", 0)
    tools_used = run_output.get("tools_used", [])
    tools_available = run_output.get("tools_available", [])

    # Tool selection precision: did agent only use relevant tools?
    valid_tools = [t for t in tools_used if t in tools_available]
    tool_precision = len(valid_tools) / max(len(tools_used), 1)

    # Step efficiency: fewer steps = better (baseline: 5 steps)
    step_efficiency = min(1.0, 5 / max(steps, 1))

    # Cost efficiency: under $0.05 is good
    cost = run_output.get("cost_usd", 0)
    cost_score = min(1.0, 0.05 / max(cost, 0.001))

    return {
        "tool_precision": round(tool_precision, 3),
        "step_efficiency": round(step_efficiency, 3),
        "cost_score": round(cost_score, 3),
        "composite": round(
            0.4 * tool_precision + 0.3 * step_efficiency + 0.3 * cost_score, 3
        ),
    }


# ═══ 3. Run evaluation ═══
result = run_agent(
    task="Process refund for order #12345",
    allowed_tools=["lookup_order", "check_refund_policy", "process_refund", "escalate"],
)

scores = score_trajectory(result)
print(json.dumps(scores, indent=2))
# Output:
# {
#   "tool_precision": 1.0,
#   "step_efficiency": 1.0,
#   "cost_score": 1.0,
#   "composite": 1.0
# }

LLM-as-Judge for Agent Trajectories

# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60
# Use an LLM to grade trajectory quality when rubrics are insufficient

import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an agent trajectory evaluator.

Given an agent's task and its execution trace, score on these dimensions (1-5):
1. TASK_COMPLETION: Did it solve the user's problem?
2. TOOL_EFFICIENCY: Were tool calls necessary and correct?
3. SAFETY: Did it respect policies and avoid risky actions?
4. GROUNDEDNESS: Did the final answer use evidence from tool calls?

Return JSON: {"task_completion": N, "tool_efficiency": N, "safety": N, "groundedness": N, "reasoning": "..."}
"""

def judge_trajectory(task: str, trace: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-5.4-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Task: {task}\n\nTrace:\n{trace}"},
        ],
    )
    return json.loads(response.choices[0].message.content)


# Usage:
# scores = judge_trajectory("Refund order #12345", trace_string)

◆ Quick Reference

If You See                                 Investigate
-----------------------------------------  ----------------------------------------------------
High final-answer quality but high cost    unnecessary tools or too many steps
Frequent loops                             planner prompt, stop rules, or missing tool feedback
Safe final answers but poor user trust     missing citations or weak transparency
High tool validity but poor results        wrong planner or wrong retrieval context
Scores diverge between offline and online  eval set is not representative of production traffic

○ Gotchas & Common Mistakes

  • ⚠️ Final-answer grading alone will miss most agent failures
  • ⚠️ Online feedback without trace context is hard to act on
  • ⚠️ Judge-model scores should be grounded in evidence, not only style
  • ⚠️ Good observability requires stable metadata schemas — version them or traces become incomparable across releases
  • ⚠️ Non-determinism is a feature of LLMs but a bug in evaluation — always pin temperature=0 and use seeds for offline eval suites
  • ⚠️ Don't confuse observability with logging — observability means you can reconstruct the full trajectory from stored data; logging means you printed some text

○ Interview Angles

  • Q: How would you evaluate an agent beyond task success?
  • A: I would score the trajectory: tool selection, tool arguments, retry behavior, safety, latency, cost, and whether the final answer actually used the evidence produced during execution. I'd build a composite score weighing task completion (40%), tool precision (30%), and cost efficiency (30%), and track regression across prompt/model changes.

  • Q: Why is observability mandatory for agents?

  • A: Because agent failures are sequential. Without traces, you only see the bad final answer, not the exact step where the planner, tool call, or verifier went wrong. A refund agent that loops 8 times and then gives the right answer looks "correct" in a success-only metric but costs 4x and takes 30 seconds.

  • Q: How do you handle non-determinism in agent evaluation?

  • A: Three approaches: (1) Pin temperature=0 and use seed parameters for reproducible runs, (2) Run each eval task 3-5 times and report median + variance, (3) Use majority-vote scoring where a task "passes" only if 3/5 runs succeed. For production monitoring, track distributions, not point estimates.
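
A sketch of that repeated-run approach, assuming each run returns a boolean pass flag and a cost (an illustrative shape, not a specific harness's API):

# Run each eval task k times; report pass rate, majority verdict, and median cost
from statistics import median

def stability_report(run_fn, task: str, k: int = 5) -> dict:
    runs = [run_fn(task) for _ in range(k)]  # each run returns {"passed": bool, "cost_usd": float}
    passes = sum(r["passed"] for r in runs)
    return {
        "pass_rate": passes / k,
        "majority_pass": passes > k // 2,  # e.g. at least 3/5 runs must succeed
        "median_cost_usd": median(r["cost_usd"] for r in runs),
    }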

  • Q: When should you use LLM-as-Judge vs programmatic scoring?

  • A: Programmatic scoring (tool precision, step count, cost) is faster, cheaper, and more reproducible — use it for everything you can formalize. LLM-as-Judge fills the gap for subjective quality: tone, helpfulness, groundedness. In practice, use both: programmatic scores gate the CI pipeline, LLM-Judge scores provide qualitative insight for manual review of borderline cases.

★ Connections

Relationship  Topics
------------  -----------------------------------------------------------------------
Builds on     AI Agents, LLM Evaluation & Benchmarks, LLMOps & Production Deployment
Leads to      Multi-Agent Architectures, LLM Evaluation & Benchmarks
Compare with  Static prompt evaluation, benchmark-only evaluation
Cross-domain  APM, distributed tracing, experiment management

◆ Hands-On Exercises

Exercise 1: Build an Agent Evaluation Harness

Goal: Create a reproducible evaluation framework for agent workflows
Time: 45 minutes
Steps:
  1. Define 10 agent tasks with expected outcomes and allowed tools
  2. Implement the score_trajectory function from the code example above
  3. Run your agent on each task 3 times (test determinism)
  4. Score on: task completion, cost, latency, tool precision, and step efficiency
  5. Create a regression test suite that can run in CI with pytest
Expected Output: Agent evaluation report with pass/fail per task, composite score, and variance analysis

Exercise 2: Build an LLM-as-Judge Pipeline

Goal: Use a judge model to grade agent trajectories on qualitative dimensions
Time: 30 minutes
Steps:
  1. Capture 5 agent traces (mix of good and bad runs)
  2. Implement the judge_trajectory function from the code example above
  3. Run the judge on all 5 traces
  4. Compare judge scores with your manual assessment — calibrate the prompt
Expected Output: JSON scores for each trace with reasoning, plus a calibration analysis


◆ Production Failure Modes

Non-deterministic results
  Symptoms:   same agent task produces different outcomes on each run
  Root cause: LLM sampling randomness, tool output variance
  Mitigation: temperature=0, seed pinning, majority vote across runs

Eval metric gaming
  Symptoms:   agent scores high on eval but fails on real tasks
  Root cause: eval tasks too narrow, no edge cases
  Mitigation: diverse eval suite, production failure replay, held-out test sets

Cost-blind evaluation
  Symptoms:   agent solves the task but at 10x expected cost
  Root cause: no cost tracking in eval harness
  Mitigation: include cost/token metrics in eval scoring

Trace schema drift
  Symptoms:   old traces can't be compared to new traces after a deploy
  Root cause: metadata fields changed without versioning
  Mitigation: version all trace schemas, maintain a backward compatibility layer

Observer effect
  Symptoms:   agent behaves differently in eval vs production
  Root cause: eval harness constrains tool access, timing, or context
  Mitigation: use a production-identical environment for eval, including real tool endpoints

Judge model bias
  Symptoms:   LLM-as-Judge systematically over-scores verbose answers
  Root cause: judge model has its own biases, prefers length or formality
  Mitigation: calibrate the judge with human annotations, use structured rubrics with evidence grounding

◆ Resources

Type         Resource                                      Why
-----------  --------------------------------------------  --------------------------------------------------------------
📄 Paper     Jimenez et al., "SWE-bench" (2023)            Benchmark for evaluating coding agents on real GitHub issues
📄 Paper     Zhuge et al., "Agent-as-a-Judge" (2024)       Using agentic systems as evaluators for other agents
🔧 Hands-on  LangSmith Evaluations                         Production agent evaluation and tracing
🔧 Hands-on  Langfuse                                      Open-source LLM observability and evaluation
🔧 Hands-on  Braintrust                                    Eval-focused observability for LLM applications
📘 Book      "AI Engineering" by Chip Huyen (2025), Ch 7   Agent evaluation patterns and metrics

★ Sources

  • LangSmith evaluation concepts documentation
  • Langfuse open-source observability documentation
  • RAGAS documentation
  • DeepEval documentation
  • OpenAI evaluation guidance for application testing
  • Anthropic, "Building effective agents" (2024)