Part VII - Reasoning, safety, and evaluation

Evaluation and monitoring - traces, metrics, regressions

Traditional software tests answer pass/fail on a deterministic function. An agent is a probabilistic control loop: the same prompt can take a different route, call a different tool, and still land on a correct-sounding answer for the wrong reasons. So we stop grading only the final string and start grading the trajectory - the whole sequence of decisions, tool calls, and observations that produced it - both offline before we ship and continuously in production after.

Book source

Chapter 19 - Evaluation and Monitoring (评估与监控); PDF outline pages 210-219.

The plan

Five moves. (1) Establish why deterministic testing breaks on a probabilistic agent, and what "score the trajectory" actually means. (2) Build the metric stack from the ground up: response accuracy, then latency, then token/cost tracking, then LLM-as-a-Judge for subjective quality - with the book's legal-survey rubric worked out. (3) Define trajectory evaluation precisely - exact-match, in-order, any-order, precision/recall over tool calls against an ideal path. (4) Split offline regression suites (test files vs evalsets, Google ADK's two surfaces) from online monitoring (drift, anomalies, A/B tests) and wire failures back into the learning loop. (5) Climb to the book's "contractor" thesis: replacing informal prompts with verifiable contracts so an agent becomes accountable. We close on the rule that governs the whole track: a behavior you cannot measure is a behavior you cannot trust.

Linear position

Prerequisite: Lesson 20 (Guardrails and safety) gave us layered constraints and, crucially, the instrumented trace - every input check, tool permission, output filter, and decision is now logged. Without that record there is nothing to score.
New capability: Offline evals, online monitoring, trajectory scoring, LLM-as-a-Judge, and regression gates that block a deploy when behavior degrades.

1 · Why deterministic testing breaks, and what "score the trajectory" means

A unit test for add(2, 3) asserts the output equals 5. It is deterministic: same input, same output, forever. An agent is not that. Give a coding agent the task "fix the failing test in auth.py" twice and it may, on run one, read the file, run the test, edit one line, re-run, and report; on run two it may grep the whole repo, edit three files, never re-run the test, and still produce a final message that sounds finished. Both runs might even end with the test passing. The final answer alone cannot distinguish the disciplined run from the lucky one.

This is the central claim of the chapter: a good final response can hide a bad route, an unsafe tool call, excessive cost, or a lucky guess. The book's example: a customer-service agent's ideal trajectory is determine intent → call the database tool → review the result → generate the report. An agent that skips the database call and hallucinates a plausible answer can score 1.0 on final-answer accuracy while being completely untrustworthy. So evaluation must measure three things the chapter names repeatedly - effectiveness (did it achieve the goal), efficiency (at what cost and latency), and compliance (did it stay within policy) - and it must measure them over the trajectory, not just the endpoint.

Mental model. Grading only the final answer is like grading a math exam on the boxed number alone. A student who guessed and a student who derived it both get the box right - until the next, slightly different problem, where only the one who actually knew the method survives. Trajectory evaluation grades the work shown, which is the only thing that predicts whether the agent generalizes to the next, unseen task.

TASK ─▶ [route] ─▶ [tool: read_file] ─▶ [observe] ─▶ [tool: run_tests] ─▶ [edit] ─▶ [tool: run_tests] ─▶ FINAL
           │               │                                 │               │
       was this      was this the                     did it actually   minimal diff?
       the right     ideal tool, in                   verify, or just   evidence cited?
       decision?     the right order?                 claim success?    cost / latency?
final-answer eval looks HERE ──────────────────────────────────────────────────────────────────────────────▲
only trajectory eval scores EVERY box above ◀──────────────────────────────────────────────────────────────

2 · The metric stack, from accuracy up to LLM-as-a-Judge

The book builds evaluation from the simplest possible metric and shows why each rung is insufficient on its own, forcing the next. We follow the same ladder.

Rung 1 - exact-match response accuracy (and why it is almost useless)

The chapter's first code example is deliberately naive: strip whitespace, lowercase, and compare for strict equality.

def evaluate_response_accuracy(agent_output: str, expected_output: str) -> float:
    """Strict match only; real use needs richer metrics."""
    return 1.0 if agent_output.strip().lower() == expected_output.strip().lower() else 0.0

# The book's own example shows the trap:
agent_response = "The capital of France is Paris."
ground_truth   = "Paris is the capital of France."
score = evaluate_response_accuracy(agent_response, ground_truth)
# score == 0.0  -- semantically identical, lexically different.

The book is blunt that these two sentences are equivalent and yet score 0.0. Exact match cannot see meaning. So real evaluation reaches for richer techniques the chapter lists explicitly: string similarity (Levenshtein, Jaccard), keyword analysis, semantic similarity (cosine distance between embedding vectors), LLM-as-a-Judge (rung 4 below), and RAG-specific metrics such as faithfulness and relevance. Exact match survives only for narrow, closed-form outputs.
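One rung up from strict equality is word-overlap similarity. The sketch below is a minimal Jaccard scorer over lowercase word sets, applied to the same two sentences; the function names and the regex tokenizer are illustrative assumptions, not the book's code.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over word sets; 1.0 means identical vocabularies."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


agent_response = "The capital of France is Paris."
ground_truth = "Paris is the capital of France."

# Both sentences use the same six words, so the overlap is complete.
reordered_score = jaccard_similarity(agent_response, ground_truth)

# A sentence sharing only "the" scores low.
unrelated_score = jaccard_similarity(agent_response, "Berlin hosts the German parliament.")
```

Word overlap ignores order, so it recovers the reordered sentence that exact match rejected. It is still blind to negation and to paraphrases with different vocabulary, which is what pushes real suites toward embeddings or a judge model.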

Rung 2 - latency monitoring

In interactive settings, a correct answer delivered too slowly is a failure. The chapter insists on recording per-request latency to durable storage - structured JSON logs, time-series databases (InfluxDB, Prometheus), data warehouses (Snowflake, BigQuery, PostgreSQL), or observability platforms (Datadog, Splunk, Grafana Cloud). The key engineering point: track the distribution, not the mean. A p50 of 1.2s with a p95 of 9s is a different product than a flat 1.5s everywhere, even though their averages can match.
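To make "track the distribution, not the mean" concrete, here is a small sketch using the nearest-rank percentile definition. The sample values are invented for illustration (they are not measurements), and the helper names are assumptions.

```python
def percentile_nearest_rank(samples_ms: list[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer form of ceil(pct * n / 100)
    return ordered[rank - 1]


def summarize(samples_ms: list[int]) -> dict[str, float]:
    """Mean plus the two percentiles most dashboards start with."""
    return {
        "mean_ms": sum(samples_ms) / len(samples_ms),
        "p50_ms": percentile_nearest_rank(samples_ms, 50),
        "p95_ms": percentile_nearest_rank(samples_ms, 95),
    }


# Illustrative samples only: identical mean, very different tail.
flat = [1500] * 20
long_tail = [900] * 18 + [6900] * 2

flat_summary = summarize(flat)
tail_summary = summarize(long_tail)
```

Both lists average 1500 ms, yet the second has a p95 of 6900 ms: one request in ten waits almost seven seconds, which the mean alone would never show.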

Rung 3 - token and cost tracking

LLM-driven agents bill by input and output tokens, so token usage is the direct lever on cost. The chapter gives a small monitor class that accumulates input and output tokens per interaction (using word-count as a placeholder; production uses the provider's real tokenizer).
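A monitor of that shape might look like the sketch below. It is not the book's class: the names are hypothetical, and the whitespace word count is only a stand-in for the provider's tokenizer.

```python
from dataclasses import dataclass


def placeholder_token_count(text: str) -> int:
    """Whitespace word count as a stand-in; real billing needs the provider's tokenizer."""
    return len(text.split())


@dataclass
class TokenUsageMonitor:
    """Accumulates input and output token counts across interactions."""

    input_tokens: int = 0
    output_tokens: int = 0
    interactions: int = 0

    def record(self, prompt: str, response: str) -> None:
        """Add one prompt/response pair to the running totals."""
        self.input_tokens += placeholder_token_count(prompt)
        self.output_tokens += placeholder_token_count(response)
        self.interactions += 1

    def totals(self) -> dict[str, int]:
        """Return the totals in a form that is easy to log as structured JSON."""
        return {
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "interactions": self.interactions,
        }


monitor = TokenUsageMonitor()
monitor.record("What is the capital of France?", "The capital of France is Paris.")
monitor.record("Explain why.", "Because it is the seat of government.")
usage = monitor.totals()
```

Keeping input and output counts separate matters because the two are usually priced differently.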

Worked number - cost of one trajectory

Take the coding agent's disciplined run. Suppose four model turns with these token counts (using a representative API price of $3 per million input tokens and $15 per million output tokens):

turn 1 (route): 1,200 in / 90 out
turn 2 (read file → reason): 8,400 in / 260 out
turn 3 (edit): 9,100 in / 520 out
turn 4 (verify + report): 9,600 in / 380 out

Input total = 28,300 tokens; output total = 1,250 tokens.
Cost = 28,300 / 10⁶ × $3 + 1,250 / 10⁶ × $15 = $0.0849 + $0.0188 ≈ $0.104 per task.
Now take the "lucky" run. Grepping the whole repo dumped a 40,000-token blob into context on turn 2 alone, so the input total balloons to ~68,000 tokens and the cost rises to ~$0.22 - more than double, for a worse trajectory. Final-answer accuracy would have rated both runs identically; only the cost metric exposes the waste. This is why the chapter demands cost and quality be reported together, never quality alone.
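The arithmetic above can be reproduced in a few lines. This sketch hard-codes the representative prices from the worked example; the function name and the list-of-pairs layout are assumptions made for illustration.

```python
INPUT_USD_PER_MILLION = 3    # representative price from the worked example
OUTPUT_USD_PER_MILLION = 15  # representative price from the worked example


def trajectory_cost_usd(turns: list[tuple[int, int]]) -> float:
    """Cost of one trajectory; turns holds (input_tokens, output_tokens) per model turn."""
    input_total = sum(tokens_in for tokens_in, _ in turns)
    output_total = sum(tokens_out for _, tokens_out in turns)
    # Multiply first and divide once, so the intermediate value stays an exact integer.
    micro_usd = input_total * INPUT_USD_PER_MILLION + output_total * OUTPUT_USD_PER_MILLION
    return micro_usd / 1_000_000


disciplined = [(1_200, 90), (8_400, 260), (9_100, 520), (9_600, 380)]

# Same run, except turn 2 carries an extra 40,000-token grep dump.
# Following the text's simplification, the blob is counted once only.
lucky = [(1_200, 90), (48_400, 260), (9_100, 520), (9_600, 380)]

disciplined_cost = trajectory_cost_usd(disciplined)  # 0.10365
lucky_cost = trajectory_cost_usd(lucky)              # 0.22365
cost_ratio = lucky_cost / disciplined_cost
```

In a real agent loop the blob would typically be re-sent as context on every later turn, so the gap between the two runs is, if anything, understated here.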

Rung 4 - LLM-as-a-Judge for subjective quality

Some qualities - helpfulness, neutrality, clarity - have no string to match against. The chapter's solution is LLM-as-a-Judge: give a second model a precise rubric and have it score outputs at scale. Its worked example is a legal-survey reviewer. A rubric instructs the judge to score a survey question 1-5 on five criteria - clarity and precision, neutrality and lack of bias, relevance and focus, completeness, and audience fit - and to emit structured JSON with an overall_score, a rationale, detailed_feedback, concerns, and a recommended_action.

LEGAL_SURVEY_RUBRIC = """
You are a legal-survey methodology expert and a rigorous reviewer.
Score the question 1-5 on each criterion, with reasons and specific feedback:
  1. Clarity & precision   2. Neutrality / no bias   3. Relevance & focus
  4. Completeness          5. Audience fit
Return JSON: overall_score, rationale, detailed_feedback, concerns, recommended_action.
"""

# The judge runs at low temperature (~0.2) for consistency, with JSON response mode.
# Book uses Gemini via google.generativeai; the pattern is model-independent.
judgment = judge.judge_survey_question(question)   # -> dict or None if blocked / unparseable

The example is sharp because the book runs three questions through it: a well-formed neutral one ("Do you agree current Swiss IP law adequately protects AI-generated content meeting the originality standard?"), a biased one ("Do you think overly strict privacy laws like the FADP are stifling Swiss innovation?" - a loaded leading question), and a vague one ("What do you think about legal tech?"). A good judge gives the first a high score, flags the second for bias, and flags the third for being unanswerable. That demonstrates the judge catching exactly the failures a keyword matcher cannot.
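A judge is only useful if its output can be trusted structurally before it is trusted semantically. The sketch below shows one way to validate the judge's raw text against the five fields the rubric asks for, returning None when the reply is unusable. The function name and validation rules are assumptions, and the sample strings are hand-written stand-ins rather than real model output.

```python
import json
from typing import Optional

REQUIRED_KEYS = {
    "overall_score",
    "rationale",
    "detailed_feedback",
    "concerns",
    "recommended_action",
}


def parse_judgment(raw_text: str) -> Optional[dict]:
    """Return the judge's verdict as a dict, or None if it cannot be used."""
    try:
        verdict = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # the judge replied with prose or truncated JSON
    if not isinstance(verdict, dict) or not REQUIRED_KEYS.issubset(verdict):
        return None  # a rubric field is missing
    score = verdict["overall_score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None  # score must be numeric
    if not 1 <= score <= 5:
        return None  # outside the rubric's 1-5 scale
    return verdict


# Hand-written stand-in for a judge reply to the biased privacy-law question.
biased_reply = json.dumps({
    "overall_score": 2,
    "rationale": "Leading wording presupposes the laws are overly strict.",
    "detailed_feedback": "Rephrase neutrally and separate the two claims.",
    "concerns": ["leading question", "loaded adjective"],
    "recommended_action": "revise",
})

parsed = parse_judgment(biased_reply)
rejected = parse_judgment("This question looks fine to me.")
```

Counting how often the parser returns None is itself a useful metric: a judge that frequently breaks format needs a tighter prompt before its scores mean anything.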

The chapter is honest about the trade-offs, and presents them as a comparison table. We reproduce it because the interview question "how do you evaluate a subjective output?" is really "do you understand these trade-offs?"

Method	Strengths	Weaknesses
Human evaluation	Catches subtle behavior; the ground truth for nuance	Hard to scale, expensive, subjective / inconsistent across raters
LLM-as-a-Judge	Consistent, efficient, scalable; understands semantics	May miss intermediate steps; bounded by the judge model's own ability and biases
Automated metrics	Scalable, fast, objective, cheap	Cannot fully cover agent capability; blind to meaning (the exact-match trap)

The practical synthesis the chapter implies: use automated metrics as a cheap always-on gate, LLM-as-a-Judge for subjective quality at scale, and reserve scarce human review for sampling and for calibrating the judge itself.
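Calibrating the judge can start as simply as comparing its scores with human scores on the same sampled items. The sketch below computes four agreement statistics on 1-5 scores; the function name, the choice of statistics, and the sample scores are illustrative assumptions, not something the chapter prescribes.

```python
def calibration_report(judge_scores: list[int], human_scores: list[int]) -> dict[str, float]:
    """Compare judge and human scores given for the same items, in the same order."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    n = len(judge_scores)
    diffs = [judge - human for judge, human in zip(judge_scores, human_scores)]
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one_point": sum(abs(d) <= 1 for d in diffs) / n,
        "mean_abs_error": sum(abs(d) for d in diffs) / n,
        "mean_bias": sum(diffs) / n,  # positive means the judge is more lenient than humans
    }


# Invented scores for eight sampled items, for illustration only.
judge = [5, 2, 1, 4, 3, 4, 2, 5]
human = [5, 1, 1, 4, 2, 4, 3, 3]

report = calibration_report(judge, human)
```

A positive mean_bias with a low exact_agreement suggests the rubric is too permissive; large disagreements on individual items are the ones worth sending back for a second human look.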

3 · Trajectory evaluation, precisely

Now the heart of the chapter. Trajectory evaluation compares the agent's actual sequence of steps against an ideal path, scoring decision quality, the reasoning process, and the final result. For the tool-call sequence specifically, the book names a family of comparison methods, each answering a different question about how strict you want to be:

Match method	Passes when	Use it when
exact match	actual tool sequence equals the ideal sequence, identically	the path is fixed and any deviation is a bug
in-order match	the ideal tools all appear, in the right relative order (extra steps allowed)	order matters but harmless detours are fine
any-order match	the ideal tools all appear, order irrelevant	independent steps that can happen in any sequence
precision / recall	graded - precision = right calls / all calls made; recall = right calls / all ideal calls	partial credit on long trajectories; the most common real choice
single-tool use	a specific required tool was invoked at least once	checking a mandatory step (e.g. "did it actually run the tests?")

Worked number - scoring a coding-agent trajectory

Ideal path for "fix the failing test": [read_file, run_tests, edit_file, run_tests, report] - 5 calls.
Actual path the agent took: [grep_repo, read_file, edit_file, report] - 4 calls.
exact match: fail (sequences differ).
in-order match: fail - run_tests never appears, so the ideal sub-sequence is not preserved.
single-tool use (run_tests required): fail - the verify step was skipped. This is the killer flag: the agent edited code and never confirmed the fix.
precision/recall on tool calls (counted with multiplicity, so the two ideal run_tests calls count separately): the right calls present are {read_file, edit_file, report} = 3. Precision = 3/4 = 0.75 (one stray grep_repo); recall = 3/5 = 0.60 (missed both run_tests).
The recall of 0.60 and the failed single-tool check together tell the precise story final-answer accuracy hid: the agent made a plausible edit but never verified it.
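The match methods and the worked numbers above fit in a short sketch. These are plain-Python interpretations of the method names, not a framework's implementation: in-order is treated as a subsequence test, and precision/recall count calls with multiplicity so that a tool required twice must appear twice.

```python
from collections import Counter


def exact_match(actual: list[str], ideal: list[str]) -> bool:
    """Pass only when the two sequences are identical."""
    return actual == ideal


def in_order_match(actual: list[str], ideal: list[str]) -> bool:
    """Pass when ideal is a subsequence of actual (extra steps are allowed)."""
    remaining = iter(actual)
    # `step in remaining` consumes the iterator up to the match, which preserves order.
    return all(step in remaining for step in ideal)


def any_order_match(actual: list[str], ideal: list[str]) -> bool:
    """Pass when every ideal call appears often enough, in any order."""
    return not (Counter(ideal) - Counter(actual))


def single_tool_use(actual: list[str], tool: str) -> bool:
    """Pass when the required tool was invoked at least once."""
    return tool in actual


def precision_recall(actual: list[str], ideal: list[str]) -> tuple[float, float]:
    """Graded match over tool calls, counted with multiplicity."""
    matched = sum((Counter(actual) & Counter(ideal)).values())
    precision = matched / len(actual) if actual else 0.0
    recall = matched / len(ideal) if ideal else 0.0
    return precision, recall


ideal_path = ["read_file", "run_tests", "edit_file", "run_tests", "report"]
actual_path = ["grep_repo", "read_file", "edit_file", "report"]

precision, recall = precision_recall(actual_path, ideal_path)
verified = single_tool_use(actual_path, "run_tests")
```

On these paths all three boolean match methods fail, precision is 0.75, recall is 0.60, and verified is False, matching the hand calculation.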

Multi-agent systems raise the bar again. The chapter lists what you must additionally measure when the system is "a team": Do the agents cooperate effectively (does the flight-booking agent correctly hand the dates and destination to the hotel-booking agent)? Do they make and follow a sensible plan (book the flight before the hotel; flag it if the hotel agent jumps ahead or an agent gets stuck)? Was the right agent chosen for the task (a weather query should go to the weather agent, not the general-knowledge one)? And does adding a new agent help or hurt overall (does a new restaurant-booking agent improve throughput or just create conflicts)? These are collaboration metrics that have no single-agent analogue.
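Of these, "was the right agent chosen" is the easiest to quantify once traces carry a label for the expected agent. The sketch below computes routing accuracy and tallies misroutes; the agent names and the labelled pairs are made up for illustration.

```python
from collections import Counter


def routing_accuracy(decisions: list[tuple[str, str]]) -> float:
    """Fraction of (expected_agent, chosen_agent) pairs where the router picked correctly."""
    if not decisions:
        raise ValueError("need at least one routing decision")
    return sum(expected == chosen for expected, chosen in decisions) / len(decisions)


def misroutes(decisions: list[tuple[str, str]]) -> Counter:
    """Count each (expected, chosen) confusion so the worst pairing stands out."""
    return Counter((expected, chosen) for expected, chosen in decisions if expected != chosen)


# Hypothetical labelled routing decisions taken from traces.
labelled = [
    ("weather", "weather"),
    ("weather", "general_knowledge"),  # the misroute the chapter warns about
    ("flights", "flights"),
    ("hotels", "hotels"),
]

accuracy = routing_accuracy(labelled)
confusions = misroutes(labelled)
```

The confusion tally matters more than the headline number: one recurring misroute usually points at an ambiguous agent description rather than at random error.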

4 · Offline suites vs online monitoring

The chapter cleanly separates two regimes that beginners conflate. Offline evaluation runs a fixed battery before you ship - this is your regression gate. Online monitoring watches the same signals in live traffic after you ship. You need both: offline catches what you anticipated; online catches what you did not.

Offline: test files and evalsets (the Google ADK shape)

The book describes Google's ADK (Agent Development Kit) as a concrete framework that organizes offline evaluation into two artifact types - and this distinction is worth borrowing regardless of framework:

Test file

A single JSON session - one multi-turn conversation - for quick unit-test-style checks during development. It records the user requests, the expected tool-use trajectory, the agent's intermediate responses, and the final reply. Group them in folders; a test_config.json defines the eval criteria. Optimized for fast execution and simple sessions.

Evalset file

A collection of multiple evals for complex, multi-turn integration testing. Each eval is a full session with several turns, tool calls, and reference answers. The book's example: the user first asks "what can you do?", then "roll a ten-sided die twice and tell me whether 9 is prime" - defining both the expected tool calls and final reply.

ADK exposes three execution surfaces, mapping to three stages of the development lifecycle: the Web UI (adk web) for interactive evaluation and dataset generation; pytest integration (via AgentEvaluator.evaluate) for CI/CD pipelines; and the command line (adk eval) for automated runs and routine build verification, where you can point at an evalset and even select specific evals by comma-separated names. The pattern to internalize: an eval artifact is a frozen, replayable specification of expected behavior, and it runs in the same CI that gates your code.
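The "frozen, replayable specification" idea is independent of any framework. The sketch below is deliberately not ADK's file schema or API: it uses a hypothetical EvalCase dataclass, a deterministic stub in place of a real agent, and a keyword check in place of a proper response metric, just to show the shape of an offline gate.

```python
from dataclasses import dataclass
from typing import Callable

# An agent here is any callable: message -> (tool calls made, final reply).
AgentFn = Callable[[str], tuple[list[str], str]]


@dataclass(frozen=True)
class EvalCase:
    """One frozen expectation: which tools should run and what the reply must mention."""

    name: str
    user_message: str
    expected_tools: tuple[str, ...]
    reference_keywords: tuple[str, ...]


def run_case(agent: AgentFn, case: EvalCase) -> bool:
    """Pass when the tool trajectory matches exactly and the reply covers the keywords."""
    tools, reply = agent(case.user_message)
    tools_ok = tuple(tools) == case.expected_tools
    reply_ok = all(keyword.lower() in reply.lower() for keyword in case.reference_keywords)
    return tools_ok and reply_ok


def pass_rate(agent: AgentFn, cases: list[EvalCase]) -> float:
    """Fraction of cases in the evalset that the agent passes."""
    return sum(run_case(agent, case) for case in cases) / len(cases)


def stub_agent(message: str) -> tuple[list[str], str]:
    """Deterministic stand-in for a real agent so the sketch runs offline."""
    if "die" in message:
        return ["roll_die", "roll_die", "check_prime"], "Rolled 4 and 7. 9 is not prime."
    return [], "I can roll dice and check whether numbers are prime."


EVALSET = [
    EvalCase("capabilities", "What can you do?", (), ("roll", "prime")),
    EvalCase(
        "roll_and_check",
        "Roll a ten-sided die twice and tell me whether 9 is prime",
        ("roll_die", "roll_die", "check_prime"),
        ("not prime",),
    ),
]

TARGET_PASS_RATE = 1.0
rate = pass_rate(stub_agent, EVALSET)
deploy_allowed = rate >= TARGET_PASS_RATE
```

Because the cases are immutable data, the same list can be replayed against every candidate build, and a failing case names exactly which expectation regressed.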

Online: drift, anomalies, and A/B tests

Once deployed, performance can silently decay. The chapter names the failure shapes to watch: concept drift (output relevance or accuracy falls as the input distribution or environment shifts), anomalous behavior (unexpected actions that may signal a bug, an attack, or unwanted emergent behavior), and the need for A/B testing (run two versions or strategies in parallel and compare - the book's example is a logistics agent trying two planning algorithms). Compliance and safety auditing runs continuously too, auto-generating reports that track compliance KPIs and firing alerts when a policy boundary is crossed.
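A first drift alarm does not need statistics beyond a rolling average. The sketch below compares the pass rate over the most recent scored traces with the offline baseline and alerts when it falls by more than a tolerance. The class name, window size, and tolerance are assumptions; a production monitor would likely add a significance test and per-segment breakdowns.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when the rolling pass rate falls well below the offline baseline."""

    def __init__(self, baseline_pass_rate: float, window: int = 50, tolerance: float = 0.10) -> None:
        self.threshold = baseline_pass_rate - tolerance
        self._recent = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(self, passed: bool) -> bool:
        """Record one scored production trace; return True if the drift alert should fire."""
        self._recent.append(passed)
        if len(self._recent) < self._recent.maxlen:
            return False  # stay quiet until the window is full
        rolling_rate = sum(self._recent) / len(self._recent)
        return rolling_rate < self.threshold


# Ten healthy traces, then two failures in a row (invented outcomes).
monitor = DriftMonitor(baseline_pass_rate=0.95, window=10, tolerance=0.10)
outcomes = [True] * 10 + [False] * 2
alerts = [monitor.record(outcome) for outcome in outcomes]
```

With these settings one failure in ten is tolerated (0.90 is above the 0.85 threshold) and the second consecutive failure fires the alert.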

Step 01 - Record full traces. Every route decision, tool call, observation, intermediate response, cost, and latency - durably stored so any trace is replayable. (You cannot evaluate what you did not log; see Lesson 20.)

Step 02 - Score the trace. Goal contract met? Routes sensible? Tool use within policy? Evidence/citations verified? Cost and latency within budget? Apply automated metrics + LLM-as-a-Judge + sampled human review.

Step 03 - Gate on the offline suite. Run the evalset before any change; block the deploy if the pass rate drops below target. A regression is a behavior that used to pass and now fails.

Step 04 - Monitor production, then close the loop. Detect drift and anomalies, label root causes, and convert each new production failure into a frozen test trace - which feeds Lesson 11's learning and adaptation pipeline.

Step 04 is the engine of the whole track: every production failure becomes a permanent eval. The first time the coding agent corrupts a file because a tool returned malformed JSON, you capture that exact trace, add it to the evalset, and now no future version can regress on it silently.
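Freezing a failure mostly means copying the recorded inputs and tool outputs into a stable, deduplicated record. The sketch below assumes a hypothetical trace dict with user_message, tool_io, and observed_failure keys; the field names and the hash-based id are illustrative choices, not a standard schema.

```python
import hashlib
import json


def freeze_failure(trace: dict, root_cause: str) -> dict:
    """Turn a captured production trace into a replayable regression case."""
    canonical = json.dumps(trace, sort_keys=True)  # stable serialization, so equal traces get equal ids
    case_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return {
        "id": case_id,
        "root_cause": root_cause,
        "user_message": trace["user_message"],
        "recorded_tool_io": trace["tool_io"],            # replay these instead of calling live tools
        "must_not_repeat": trace["observed_failure"],
    }


def add_to_evalset(evalset: list[dict], case: dict) -> bool:
    """Append the case unless an identical trace was already frozen; report whether it was added."""
    if any(existing["id"] == case["id"] for existing in evalset):
        return False
    evalset.append(case)
    return True


# Hypothetical incident: a tool returned truncated JSON and the agent wrote a corrupt file.
incident_trace = {
    "user_message": "fix the failing test in auth.py",
    "tool_io": [{"tool": "read_file", "args": {"path": "auth.py"}, "output": '{"content": '}],
    "observed_failure": "wrote a truncated file after malformed JSON from read_file",
}

evalset: list[dict] = []
case = freeze_failure(incident_trace, root_cause="malformed tool output not validated")
first_add = add_to_evalset(evalset, case)
second_add = add_to_evalset(evalset, freeze_failure(incident_trace, root_cause="duplicate report"))
```

Storing the recorded tool outputs is what makes the case replayable: the regression run feeds the agent the same malformed JSON without needing the original environment.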

5 · From agents to accountable "contractors"

The chapter closes with its most ambitious idea, drawn from the Agent Companion work (Gulli et al.): to make agents reliable in high-stakes settings, upgrade them from probabilistic, error-prone assistants into deterministic, accountable "contractors." A traditional agent runs on a short instruction - fine for a demo, fragile in production because ambiguity causes silent failure. A contractor agent is bound by a formal contract that is the single source of truth, and the book lays out four pillars:

Pillar	What it adds	Coding-agent example
1 · Formal contract	Specifies deliverables, scope, data sources, cost, and deadline - far beyond a prompt. Results are objectively verifiable.	"Generate a 20-page PDF analyzing Q1 2025 European sales, with 5 visualizations, a Q1 2024 comparison, and a supply-chain risk assessment" - or for code, an exact spec of the change and its acceptance tests.
2 · Dynamic negotiation	The contract is a conversation, not a static command. The agent analyzes terms and pushes back on infeasibility before starting.	"The specified test database is unavailable - provide credentials or approve the public dataset, which may reduce data granularity." Ambiguity is resolved up front.
3 · Quality-focused iterative execution	Correctness first. The agent self-verifies against the contract before submitting.	Generates several candidate solutions, compiles and runs the contract's unit tests, scores them on performance / security / readability, and submits only the version that passes all tests.
4 · Hierarchical decomposition	A main contractor splits a complex task into sub-contracts with their own deliverables, assignable to specialist agents.	"Build an e-commerce app" → sub-contracts for UI/UX, an auth module, a product database, and payment-gateway integration - each independently verifiable.

Notice how directly this connects back to evaluation: the contract is the eval spec. Pillar 3's "run the contract's unit tests and submit only the version that passes all of them" is trajectory evaluation moved inside the agent's own execution loop. The book's arc is that evaluation is not a stage bolted on at the end - it is the mechanism that turns an unpredictable tool into an accountable system.
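Pillar 3 can be sketched as "generate candidates, run the contract's checks, submit only a candidate that passes all of them". The example below is a toy under stated assumptions: the Contract dataclass, the slug-formatting task, and both candidate functions are hypothetical, and the ranking on performance, security, and readability that the pillar also mentions is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Candidate = Callable[[str], str]
Check = Callable[[Candidate], bool]


@dataclass(frozen=True)
class Contract:
    """A deliverable plus the named acceptance checks that define 'done'."""

    deliverable: str
    acceptance_checks: dict[str, Check]


def failed_checks(contract: Contract, candidate: Candidate) -> list[str]:
    """Names of the acceptance checks this candidate does not satisfy."""
    return [name for name, check in contract.acceptance_checks.items() if not check(candidate)]


def select_submission(contract: Contract, candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the first candidate that passes every check, or None if none qualifies."""
    for candidate in candidates:
        if not failed_checks(contract, candidate):
            return candidate
    return None


def naive_slug(title: str) -> str:
    """First attempt: forgets to trim surrounding whitespace."""
    return title.lower().replace(" ", "-")


def careful_slug(title: str) -> str:
    """Second attempt: splits on whitespace, so padding and double spaces collapse."""
    return "-".join(title.lower().split())


contract = Contract(
    deliverable="slug(title): lowercase, hyphen-separated, no leading or trailing hyphens",
    acceptance_checks={
        "basic": lambda fn: fn("Hello World") == "hello-world",
        "trims_padding": lambda fn: fn("  Trim me ") == "trim-me",
    },
)

chosen = select_submission(contract, [naive_slug, careful_slug])
naive_failures = failed_checks(contract, naive_slug)
```

Returning None when nothing passes is the accountable behavior: the agent reports that the contract was not met instead of submitting its best guess.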

Running example - evaluating the coding agent

Pulling it together for our running coding/research assistant. A final-answer eval would ask only "did the prose claim success?" A trajectory eval asks the questions that actually predict reliability: did the agent read the relevant files (not grep the whole repo)? did it make a minimal diff? did it run the correct tests, and re-run them after editing? did it handle a tool failure gracefully? did it report evidence (the actual test output) rather than a confident summary? The implementation is a per-trace scorecard fed into a gated suite:

trace_eval = {
  "goal_met":           check_goal(trace),            # contract / acceptance tests pass
  "route_ok":           judge_routes(trace),          # LLM-as-a-Judge on decision quality
  "tool_use_ok":        check_tool_policy(trace),     # within permissions (lesson 20)
  "ran_tests":          single_tool_use(trace, "run_tests"),   # the killer check from §3
  "trajectory_recall":  recall_vs_ideal(trace, ideal_path),    # graded path match
  "evidence_ok":        verify_citations(trace),      # claims backed by real tool output
  "cost":               sum_cost(trace),              # the $0.10 vs $0.22 from §2
  "latency_p95_bucket": latency_bucket(trace),        # per-trace latency band; percentiles are aggregated across traces
}

# Offline gate: a change ships only if the frozen evalset still passes at target rate.
release_gate.require(evalset.pass_rate >= TARGET)   # block only when the rate drops below target
# Online: every new prod failure is captured as a trace and appended to the evalset.
evalset.append(capture_failure_trace(incident))

Checkpoint exercise

Try it

Define five metrics for your agent: two quality (e.g. trajectory recall, LLM-judge helpfulness), two operational (p95 latency, cost per task), one safety (tool-policy compliance). For each, decide whether it is an offline regression gate or an online production monitor - and write the ideal tool path for one representative task so you can compute trajectory recall against it.

Failure modes

A handful of cherry-picked demos treated as an eval suite - no coverage of the failure cases that actually occur.
Grading only the final answer, so a lucky guess and a sound derivation score identically (the §1 trap).
Reporting quality with no cost or latency alongside it - hiding the run that doubled spend for a worse trajectory.
Exact-match metrics on open-ended outputs (the reordered "Paris is the capital of France." scores 0.0 against its reference).
No replayable traces because tool outputs were never stored - production failures cannot be reproduced or frozen into evals.
An LLM judge with a vague rubric, so its scores are inconsistent and uncalibrated against human review.
Drift never monitored, so accuracy decays silently after the input distribution shifts.

Implementation checklist

What is the trace schema, and is every trace replayable (tool I/O persisted)?
Which metrics define success - effectiveness, efficiency, and compliance, not just one?
What is the ideal trajectory per task, and which match method (in-order, recall, single-tool) scores it?
Is there an LLM-judge rubric, and is it periodically calibrated against sampled human review?
Which regressions block deployment, and at what pass-rate threshold?
How are human reviews sampled, and how are costs attributed per task / per tool?
What detects drift and anomalies online, and how does each failure become a new eval?

Where this points next

We can now measure whether an agent did its job well, cheaply, and safely - across the whole trajectory, offline and online. But measurement assumes the agent already chose what to do. A genuinely autonomous agent faces many useful actions at once and limited budget: which task should it attempt next? Evaluation tells you a route was good or bad after the fact; the next lesson asks how the agent decides before the fact. Lesson 22, Prioritization - choosing the next best action, builds the scoring and ranking that keeps an autonomous agent focused, using exactly the cost, value, and risk signals our metrics here produce.

Takeaway

Agents are probabilistic control loops, so deterministic pass/fail testing is not enough: you must score the trajectory - routes, tool calls, observations, evidence, cost, and latency - not just the final answer, because a good answer can hide a bad path or a lucky guess. Build the metric stack deliberately: exact match is nearly useless on open-ended output, so reach for semantic similarity, latency distributions, token/cost tracking, and LLM-as-a-Judge with a precise rubric (the book's legal-survey example). Score tool sequences against an ideal path with exact / in-order / any-order / precision-recall / single-tool methods. Keep offline regression suites (test files and evalsets, e.g. Google ADK's Web UI / pytest / CLI surfaces) as deploy gates, and online monitoring (drift, anomalies, A/B) that turns every production failure into a new frozen eval feeding the learning loop. The endpoint of the chapter is the "contractor" model: replace informal prompts with verifiable contracts so agents become accountable. The governing rule: a behavior you cannot measure is a behavior you cannot trust.

Interview prompts

Why can't you evaluate an agent the way you unit-test a function? (§1 - the agent is probabilistic; the same input can take different routes and still produce a correct-sounding answer, so a good final answer can mask a bad route, an unsafe tool call, excess cost, or a lucky guess. Score the trajectory, measuring effectiveness, efficiency, and compliance.)
What's wrong with exact-match accuracy, and what replaces it? (§2 - it scores semantically identical answers as 0.0, e.g. "The capital of France is Paris" vs "Paris is the capital of France". Replace with semantic similarity over embeddings, LLM-as-a-Judge, and RAG metrics like faithfulness/relevance.)
How do you evaluate a subjective quality like "helpfulness"? (§2 - LLM-as-a-Judge: a second model scores against a precise rubric at low temperature, emitting structured JSON. The book's legal-survey rubric scores 1-5 on clarity, neutrality, relevance, completeness, audience fit, and flags biased or vague questions a keyword matcher would miss. Calibrate it against sampled human review.)
An agent's final answer is correct but you suspect a bad trajectory. How do you prove it? (§3 - compare the actual tool sequence to the ideal path: single-tool-use to check a mandatory step like run_tests, and precision/recall for partial credit. A low recall plus a failed single-tool check shows the agent edited code but never verified it.)
Why report cost alongside quality rather than quality alone? (§2 worked example - two runs with identical final-answer accuracy can differ 2x in cost ($0.10 vs $0.22) because one dumped the whole repo into context; quality-only reporting hides the waste.)
Distinguish offline evaluation from online monitoring. (§4 - offline runs a frozen evalset as a regression gate before deploy (e.g. ADK test files for unit-style checks, evalsets for integration); online watches live traffic for drift, anomalies, and A/B comparisons. Both are needed, and every production failure becomes a new offline eval.)
What is the "contractor" model and how does it relate to evaluation? (§5 - upgrading agents into accountable systems via formal contracts (deliverables, scope, data, cost, deadline), dynamic negotiation, quality-focused self-verifying execution, and hierarchical sub-contracts. The contract is the eval spec; self-verification is trajectory evaluation inside the agent's own loop.)