Part V - Reliability and control

Goal setting and monitoring - objective, progress, stop

Up to now an agent could act — call tools, remember, adapt. But acting is not the same as making progress toward something. A loop with no target either stops too early (declares a victory it cannot prove) or never stops at all (burns budget chasing a goal it already met, or one it can never reach). This lesson gives the loop a direction and a way to tell whether it is getting closer.

Book source

Chapter 11 - Goal Setting and Monitoring (目标设定与监控). The chapter pairs a goal-setting pattern (give the agent a concrete, measurable target) with a monitoring pattern (track progress and decide success), and walks a LangChain + OpenAI code-generation agent that loops generate → evaluate → judge → refine until an LLM judge returns True or a 5-iteration cap is hit.

The plan

Five moves. (1) Separate the two jobs the chapter fuses — a goal contract (where are we going, and how will we know we arrived) and a monitor (where are we now, and are we still moving). (2) Make the contract concrete with SMART criteria and a hard budget, so "done" and "give up" are both definable. (3) Build the book's running example — the iterative coder that self-reviews against a checklist and stops on a True/False verdict — and read its real failure: a judge that can hallucinate success. (4) Do the budget arithmetic: how an iteration cap, token cost, and confidence threshold interact, with a widget to feel the trade-off. (5) Name the four legitimate stop conditions and what each one reports. We close on why monitoring is the precondition for recovery (lesson 14).

Linear position

Prerequisite: lesson 12 (MCP - model context protocol as integration contract) — the agent can now reach many tools and large context through a standard contract, which is exactly what makes an ungoverned loop dangerous: more reach means more ways to wander.
New capability: a measurable goal contract, structured progress telemetry, and explicit stop conditions that turn an open-ended run into a managed, reportable process.

1 · Two jobs that are usually conflated: the goal and the monitor

The chapter's title is two patterns wearing one name, and keeping them apart is the whole insight. Goal setting answers "where are we going, and how will we recognise arrival?" It is a mostly static contract written once, before autonomous work begins. Monitoring answers "where are we now, and are we still moving toward the goal?" It is a running process that compares live state to that contract on every step and emits a verdict: continue, replan, recover, or escalate.

The book's travel analogy makes the split intuitive. Planning a trip, you first fix the destination (goal state), note your starting point (initial state), weigh options under constraints (budget, routes, transport), then execute steps — book, pack, travel, arrive. The destination and constraints are the contract. Checking "am I closer to the airport, on time, within budget?" at each step is the monitor. Skip the contract and the agent is merely "responsive" — it answers whatever is in front of it — but it can never judge whether its own behaviour is effective. That is the book's core complaint about goal-less agents: without a goal there is no yardstick for progress, so autonomy degenerates into reactive task execution that falls apart in dynamic, real-world settings.

GOAL CONTRACT (written once, before the loop)
+---------------------------------------------------+
| objective          constraints       budget       |
| success_evidence   done_condition    escalate_at  |
+---------------------------------------------------+
                         | read every step
                         v
+--------------+    observe    +----------------+
|  AGENT LOOP  | ------------> |    MONITOR     |
|  act / tool  |               |  progress?     |
|              | <------------ |  stalled?      |
+--------------+    verdict    |  over budget?  |
   ^       |                   |  confident?    |
   |       |  continue         +-------+--------+
   |       v  replan / recover         |
   +--------------- STOP <-------------+
     success . impossible . budget . human-needed

Notice what monitoring is made of. It is not a vibe; it is a set of observable signals the chapter enumerates: the agent's own actions, the environment state, and tool outputs. A monitor watches completed subtasks, tool errors, budget spent, evidence gathered, open risks, and a confidence estimate in the final answer. Good monitoring forces the agent to explain not just what it did but why it believes the task is complete — which is precisely the thing a goal-less loop can never produce.
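The signals listed above can be carried in one structured record per step. The following is a minimal sketch, not the chapter's code: the class name ProgressSnapshot, its field names, and the justification() helper are all assumptions chosen to mirror the list in the paragraph.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressSnapshot:
    """One monitor reading. Every field name here is illustrative."""
    completed_subtasks: list[str] = field(default_factory=list)
    tool_errors: int = 0
    budget_spent_usd: float = 0.0
    evidence: list[str] = field(default_factory=list)
    open_risks: list[str] = field(default_factory=list)
    confidence: float = 0.0  # estimate in [0, 1] for the final answer

    def justification(self) -> str:
        """Explain why the agent believes the task is (or is not) complete."""
        if not self.evidence:
            return "not complete: no evidence gathered yet"
        risks = ", ".join(self.open_risks) or "none"
        return (
            f"evidence: {', '.join(self.evidence)}; "
            f"open risks: {risks}; confidence: {self.confidence:.2f}"
        )


snap = ProgressSnapshot(
    completed_subtasks=["write failing test", "patch tokenizer"],
    evidence=["targeted test passes"],
    open_risks=["full suite not yet run"],
    confidence=0.6,
)
print(snap.justification())
```

Keeping the "why" next to the "what" means a completion claim with an empty evidence list is visible as such, rather than hidden inside free text.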

2 · The goal contract: SMART criteria plus a hard budget

The chapter leans on the SMART framework (Specific, Measurable, Achievable, Relevant, Time-bound) for what makes a goal usable by a monitor. The operative word is Measurable: a goal the monitor cannot evaluate is decoration. "Make the parser better" has no monitor; "make the parser accept trailing commas, proven by a targeted test passing" has one. Time-bound is equally load-bearing for agents — without it there is no principled moment to stop, which is the chapter's named risk of the basic example: an unbounded loop.

So a goal contract has six fields, and the last three are what separate a toy from something you would let run unattended:

Field	Answers	Coding-agent example
objective	What behaviour changes?	"make the JSON parser accept trailing commas"
constraints	What must never break?	"preserve the public API; no new dependencies"
success_evidence	What proves done? (the Measurable part)	"the targeted test passes; diff summarised; full suite still green"
done_condition	How is the verdict computed?	"all success_evidence items observed"
budget	The Time-bound ceiling	"max 5 iterations, 20 tool calls, 30 min, $0.40"
escalate_at	When to hand to a human?	"fix needs an out-of-scope schema migration"

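from types import SimpleNamespace

goal = {
  "objective": "make the JSON parser accept trailing commas",
  "constraints": ["preserve public API", "no new dependencies"],
  "success_evidence": ["targeted test passes", "full suite green", "diff summarized"],
  "done_condition": "all success_evidence observed",
  "budget": {"iterations": 5, "tool_calls": 20, "minutes": 30, "usd": 0.40},
  "escalate_at": "fix requires out-of-scope schema migration"
}

# illustrative helpers: the state fields (evidence, spent, stale_iterations) are assumptions
def over_budget(state, budget):
    return any(state.spent.get(name, 0) >= cap for name, cap in budget.items())

def monitor(state, goal):
    # the monitor turns live state into a verdict against that contract
    met = [e for e in goal["success_evidence"] if e in state.evidence]
    missing = [e for e in goal["success_evidence"] if e not in state.evidence]
    return SimpleNamespace(met=met, missing=missing, cost=state.spent.get("usd", 0.0),
                           stalled=state.stale_iterations >= 2, confidence=state.confidence)

state = SimpleNamespace(evidence={"targeted test passes"}, spent={"iterations": 2, "usd": 0.03},
                        stale_iterations=0, confidence=0.5, status="running")
progress = monitor(state, goal)
if not progress.missing:               # every success_evidence item observed, in any order
    state.status = "done"              # stop: success
elif over_budget(state, goal["budget"]):
    state.status = "abandoned"         # stop: budget
elif progress.stalled:
    state.status = "replan"            # no movement -> change approach, do not repeat
else:
    state.status = "continue"          # progress, within budget -> keep going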

The crucial design rule, and a named failure mode: the goal contract is immutable during a run. An agent that quietly rewrites its own objective mid-flight ("this is too hard, I'll just make the test less strict") will always report success — it moves the goalpost to wherever the ball landed. Changing the goal is a legitimate operation, but it is a replan event the monitor must surface to a human, not a side effect the loop performs on itself.
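One way to make goalpost-moving detectable is to fingerprint the contract before the run and re-check it on every step. This is a sketch under assumptions: the helper names fingerprint and goal_drifted are hypothetical, and hashing a JSON dump is just one convenient choice that works when the contract is JSON-serialisable.

```python
import hashlib
import json
from types import MappingProxyType


def fingerprint(goal) -> str:
    """Stable hash of the contract's content (keys sorted so ordering does not matter)."""
    canonical = json.dumps(dict(goal), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def goal_drifted(goal, start_fingerprint: str) -> bool:
    """True if the contract no longer matches the fingerprint taken before the run."""
    return fingerprint(goal) != start_fingerprint


raw_goal = {
    "objective": "make the JSON parser accept trailing commas",
    "success_evidence": ["targeted test passes", "full suite green"],
}
contract = MappingProxyType(raw_goal)   # read-only view handed to the loop
start = fingerprint(contract)           # taken once, before autonomous work begins

try:
    contract["objective"] = "make the test less strict"
    write_blocked = False
except TypeError:
    write_blocked = True                # top-level writes through the view are refused

# A nested list is still mutable, so the monitor re-checks the fingerprint every step.
contract["success_evidence"].remove("full suite green")
print(write_blocked, goal_drifted(contract, start))  # prints: True True
```

The read-only view blocks the obvious rewrite, and the fingerprint catches the subtle one. When goal_drifted returns True, the monitor would surface a replan event to a human instead of letting the loop continue against the weakened goal.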

3 · The running example: the self-reviewing coder, and why it can lie

The chapter's worked code (LangChain + OpenAI, GPT-4o, by Mahtab Syed) is a clean instance of the pattern. You hand the agent a use case (e.g. "find the BinaryGap of a positive integer") and a quality checklist that is literally the success criteria — "simple to understand," "functionally correct," "handles edge cases," "positive-integer input only," "prints a few examples." The agent does not emit code and stop. It enters a loop:

01 Generate: produce a Python draft from the use case (and any prior code + feedback).

02 Evaluate: a second LLM call acting as a code reviewer critiques the draft against the goals.

03 Judge: a third call reads the critique and returns exactly one word, True or False: are the goals met?

04 Stop or refine: on True, save an annotated .py file and exit. On False, feed the critique back as feedback and loop, up to max_iterations = 5.

This maps one-to-one onto §1: the checklist is the goal contract, the reviewer-plus-judge calls are the monitor, and max_iterations is the budget. The True/False verdict is deliberately reduced to a single token so the loop has an unambiguous stop signal — no parsing a paragraph to guess whether the agent thinks it is finished.
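The control flow of the four steps can be sketched without any LLM library by passing the three calls in as plain functions. This is not the chapter's LangChain code: refine_until_done, parse_verdict, and the fake_* stand-ins are hypothetical names, and the stubs only imitate a model that improves a little each round.

```python
from typing import Callable


def parse_verdict(raw: str) -> bool:
    """Accept only the single word 'True'; anything else counts as not done."""
    return raw.strip() == "True"


def refine_until_done(
    use_case: str,
    goals: list[str],
    generate: Callable[[str, str, str], str],   # (use_case, prior_code, feedback) -> code
    evaluate: Callable[[str, list[str]], str],  # (code, goals) -> critique
    judge: Callable[[str, list[str]], str],     # (critique, goals) -> "True" or "False"
    max_iterations: int = 5,
) -> tuple[str, int, bool]:
    """Return (last draft, iterations used, goals_met)."""
    code, feedback = "", ""
    for iteration in range(1, max_iterations + 1):
        code = generate(use_case, code, feedback)
        feedback = evaluate(code, goals)
        if parse_verdict(judge(feedback, goals)):
            return code, iteration, True      # stop: success
    return code, max_iterations, False        # stop: budget exhausted


# Stand-ins for the three LLM calls (assumptions, for demonstration only).
def fake_generate(use_case: str, prior_code: str, feedback: str) -> str:
    return prior_code + "#"                   # each round extends the draft

def fake_evaluate(code: str, goals: list[str]) -> str:
    return "no issues" if len(code) >= 3 else "edge cases missing"

def fake_judge(critique: str, goals: list[str]) -> str:
    return "True" if critique == "no issues" else "False"


result = refine_until_done(
    "BinaryGap", ["handles edge cases"], fake_generate, fake_evaluate, fake_judge
)
print(result)  # ('###', 3, True)
```

Parsing the verdict strictly means a chatty or malformed judge reply is treated as False, so ambiguity costs an iteration rather than producing a false stop.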

The book is refreshingly honest that this is illustrative, not production-grade, and the warnings are the real lesson:

The judge can hallucinate success

The fatal weakness is that the same model both writes the code and judges it. An LLM that misunderstood the goal will also misunderstand whether the goal was met, and may confidently emit True for code that is wrong — the monitor inherits the generator's blind spots. The book's two fixes are exactly the right instincts: (1) separate the roles into different agents — the author describes a Gemini-built team with distinct members (coding assistant, code reviewer, documenter, test writer, prompt refiner), so the reviewer's judgement is independent of the author's; and (2) keep a human in the loop for the final run-and-test, because the LLM never actually executed the code. We build role separation properly in the multi-agent lessons and the human gate in lesson 15.
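Because the LLM never executes the code, one source of evidence that does not share the generator's blind spots is simply running the draft against known cases. The sketch below is an assumption-laden illustration, not the book's method: run_candidate is a hypothetical helper, the two candidate sources are hand-written stand-ins for model output, and exec on untrusted code should only ever happen inside a sandbox.

```python
def run_candidate(source: str, func_name: str, cases: list[tuple[int, int]]) -> list[str]:
    """Execute generated source and return failure messages (empty list = all cases passed)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # untrusted code: sandbox this in any real system
    except Exception as exc:
        return [f"source failed to load: {exc!r}"]
    func = namespace.get(func_name)
    if not callable(func):
        return [f"{func_name} is not defined"]
    failures = []
    for arg, expected in cases:
        try:
            got = func(arg)
        except Exception as exc:
            failures.append(f"{func_name}({arg}) raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{func_name}({arg}) = {got}, expected {expected}")
    return failures


# BinaryGap: longest run of zeros enclosed by ones in the binary form of n.
CASES = [(9, 2), (529, 4), (20, 1), (15, 0), (32, 0)]

GOOD = '''
def binary_gap(n):
    bits = bin(n)[2:].strip("0")          # drop trailing zeros: they are not enclosed
    return max((len(run) for run in bits.split("1")), default=0)
'''

BAD = '''
def binary_gap(n):
    return max(len(run) for run in bin(n)[2:].split("1"))   # counts trailing zeros too
'''

print(run_candidate(GOOD, "binary_gap", CASES))
print(run_candidate(BAD, "binary_gap", CASES))
```

The flawed draft looks plausible and passes the easy cases, which is the situation where a same-model judge is most likely to say True. Executed evidence complements role separation and the human gate; it does not replace them.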

The second named risk is the unbounded loop. The max_iterations = 5 cap is the only thing standing between this agent and an infinite generate-critique cycle when the judge oscillates. That cap is not a detail — it is the Time-bound axis of SMART made executable, and it forces us to reason about budgets numerically.

4 · Budget arithmetic: iterations, tokens, cost, and a confidence floor

A stop condition you cannot price is a stop condition you cannot trust. Let us put illustrative numbers on the coding agent. Each iteration makes three LLM calls (generate, evaluate, judge), and the judge call is tiny because it only reads the critique and returns one word. Suppose:

generate ≈ 1,200 tokens · evaluate ≈ 1,500 tokens · judge ≈ 200 tokens ⟹ ≈ 2,900 tokens / iteration.

Worked cost. At an illustrative blended rate of $5 per million tokens, one iteration costs 2,900 × $5 / 1,000,000 ≈ $0.0145. The full 5-iteration budget therefore caps spend at 5 × $0.0145 ≈ $0.073 per task — comfortably under the $0.40 ceiling in our contract, so for this task the iteration cap binds first, not the dollar cap. Scale to a batch of 10,000 code tasks and the worst case is 10,000 × $0.073 ≈ $725; if the typical task converges in 2 iterations the realistic bill is closer to 10,000 × 2 × $0.0145 ≈ $290. That gap — $725 worst case vs $290 typical — is exactly why you monitor actual iterations, not just the cap.
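The same arithmetic as a few lines of Python, so the numbers can be re-run when the rate or token counts change. The constants are the illustrative figures from the text; the function name cost_usd is an assumption.

```python
TOKENS_PER_ITERATION = 1_200 + 1_500 + 200   # generate + evaluate + judge
USD_PER_MILLION_TOKENS = 5.0                  # illustrative blended rate from the text
ITERATION_CAP = 5
DOLLAR_CAP = 0.40


def cost_usd(iterations: float, tasks: int = 1) -> float:
    """Spend for `tasks` runs that each use `iterations` iterations."""
    return tasks * iterations * TOKENS_PER_ITERATION * USD_PER_MILLION_TOKENS / 1_000_000


per_iteration = cost_usd(1)
per_task_cap = cost_usd(ITERATION_CAP)
# How many whole iterations the dollar cap alone would allow.
iterations_affordable = int(DOLLAR_CAP // per_iteration)

print(f"per iteration:        ${per_iteration:.4f}")
print(f"per task at the cap:  ${per_task_cap:.4f}")
print(f"dollar cap allows:    {iterations_affordable} iterations")
print(f"batch worst case:     ${cost_usd(ITERATION_CAP, tasks=10_000):.2f}")
print(f"batch typical (2 it): ${cost_usd(2, tasks=10_000):.2f}")
```

The dollar cap would allow 27 iterations while the iteration cap allows 5, which is the concrete sense in which the iteration cap binds first for this task.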

The convergence vs. cap trade-off. Say each iteration has an independent probability p of fixing the remaining issues and earning a genuine True. The chance the task is still unfinished after the cap of N iterations is (1−p)^N. With p = 0.5: after 1 iteration 50% still fail; after 3, 0.5³ = 12.5%; after 5, 0.5⁵ ≈ 3.1%. Raising the cap from 3 to 5 buys you ~9 fewer failed-out-of-budget tasks per 100 — but only if p is genuinely positive. If the judge is stuck oscillating (true p ≈ 0), extra iterations buy nothing but cost, which is why a smart monitor also watches for stagnation (no new evidence met across two iterations) and abandons early instead of grinding to the cap.
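A short sketch of that reasoning, under the same simplifying assumption that every iteration succeeds independently with probability p. The function names and the shape of met_history (one set of satisfied evidence items per finished iteration) are hypothetical.

```python
def p_unfinished(p: float, cap: int) -> float:
    """Chance the task is still unfinished after `cap` independent tries."""
    return (1 - p) ** cap


def expected_iterations(p: float, cap: int) -> float:
    """Expected iterations paid for: iteration k+1 runs only if the first k all failed."""
    return sum((1 - p) ** k for k in range(cap))


def is_stagnant(met_history: list[set[str]], window: int = 2) -> bool:
    """True if the last `window` iterations added no evidence beyond what was already met."""
    if len(met_history) <= window:
        return False
    return met_history[-1] <= met_history[-1 - window]


print(p_unfinished(0.5, 3), p_unfinished(0.5, 5))   # 0.125 0.03125
print(expected_iterations(0.5, 5))                  # 1.9375

moving = [{"test passes"}, {"test passes", "suite green"}, {"test passes", "suite green"}]
stuck = [{"test passes"}, {"test passes"}, {"test passes"}]
print(is_stagnant(moving), is_stagnant(stuck))      # False True
```

With p = 0.5 and a cap of 5 the expected number of iterations paid for is just under 2. The stagnation check is the early exit for the p ≈ 0 case, where the probability model no longer describes the run.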

Add one more knob the chapter implies but does not formalise: a confidence floor. Rather than accept any True, require the judge's confidence to clear a threshold (say 0.8) before declaring success; below it, treat the verdict as False and spend another iteration. Raising the floor trades cost for safety against a hallucinated True. The widget below lets you feel all three forces — per-iteration success probability, the iteration cap, and the confidence floor — against the dollar budget at once.
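As a sketch, the floor is a single extra condition on the stop signal. Note the assumption: the chapter's judge returns only True or False, so the confidence value here would have to come from somewhere else (for example a judge prompted to also report a score), and accept_verdict is a hypothetical name.

```python
def accept_verdict(verdict: bool, confidence: float, floor: float = 0.8) -> bool:
    """Treat a True below the confidence floor as False, so the loop spends another iteration."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    return verdict and confidence >= floor


print(accept_verdict(True, 0.92))    # True: confident success
print(accept_verdict(True, 0.60))    # False: a True below the floor does not stop the loop
print(accept_verdict(False, 0.99))   # False: a confident False is still False
```

A self-reported score is only as trustworthy as the judge producing it, so the floor reduces the risk of a hallucinated True rather than removing it.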

Stop-condition budget: tune the cap, the convergence rate, and the confidence floor

Each bar is one iteration's expected outcome across many runs: the chance the agent has genuinely succeeded by then (green), is still working (grey), or has hallucinated a false True the confidence floor failed to catch (red). The KPIs roll up the whole run: expected iterations actually spent, expected dollar cost, the share of tasks that finish within the cap, and the residual false-success rate.

[Interactive widget. Controls: success probability per iteration p (default 0.50), iteration cap N (default 5), confidence floor (default 0.80). Readouts: expected iterations, expected cost, share finishing within the cap, false-success rate.]

The core JS:

// per iteration: p = chance the fix is genuinely correct this round.
// N = iteration cap; conf = confidence floor (0..1).
// toy model: a correct fix is accepted with probability conf, and a wrong fix
// leaks past the floor as a false True with probability (1 - conf) * 0.5;
// the 0.5 is the widget's illustrative leak factor, not a measured value.
const COST_PER_ITER = 0.0145;              // 2,900 tokens @ $5 / 1e6
function simulate(p, N, conf) {
  const pAccept = p * conf;                // genuine success that clears the floor
  const pFalse = (1 - p) * (1 - conf) * 0.5; // wrong code judged True anyway
  let alive = 1, expIters = 0, finished = 0, falseSucc = 0;
  for (let k = 1; k <= N; k++) {
    expIters += alive;                     // we pay for this iteration while alive
    finished += alive * pAccept;
    falseSucc += alive * pFalse;
    alive *= 1 - pAccept - pFalse;         // survivors loop again
  }
  return { expIters, expCost: expIters * COST_PER_ITER, finished, falseSucc };
}

5 · Stop conditions: the four ways a run is allowed to end

A managed loop never just "finishes." It exits through exactly one of four named gates, and each gate owes the caller a different report. Treating them as distinct is what makes an agent accountable.

success

All success_evidence observed and confidence ≥ floor. Report: the evidence, the diff/summary, residual risks. The coding agent saves the annotated .py file here.

impossible

The monitor proves a constraint cannot be satisfied (e.g. the fix would break the public API). Report: which constraint blocked it, and the closest partial result.

budget exhausted

Iteration / token / time / dollar cap hit before success. Report: best attempt so far, what was still missing, and the cost spent.

human-needed

An escalate_at trigger fired (out-of-scope migration, low confidence, ambiguous goal). Report: the decision the human must make and the context to make it.
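The four gates and what each owes the caller can be enforced in code, so a run cannot end without the report its gate requires. A minimal sketch: StopReason, StopReport, and the report field names are assumptions derived from the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class StopReason(Enum):
    SUCCESS = "success"
    IMPOSSIBLE = "impossible"
    BUDGET_EXHAUSTED = "budget exhausted"
    HUMAN_NEEDED = "human-needed"


# What each gate must report, following the four descriptions above.
REQUIRED_REPORT_FIELDS = {
    StopReason.SUCCESS: {"evidence", "summary", "residual_risks"},
    StopReason.IMPOSSIBLE: {"blocking_constraint", "partial_result"},
    StopReason.BUDGET_EXHAUSTED: {"best_attempt", "missing", "cost_spent"},
    StopReason.HUMAN_NEEDED: {"decision_needed", "context"},
}


@dataclass(frozen=True)
class StopReport:
    reason: StopReason
    details: dict

    def __post_init__(self) -> None:
        # Refuse to construct a report that omits what this gate owes the caller.
        missing = REQUIRED_REPORT_FIELDS[self.reason] - set(self.details)
        if missing:
            raise ValueError(f"{self.reason.value} report is missing: {sorted(missing)}")


report = StopReport(
    StopReason.BUDGET_EXHAUSTED,
    {"best_attempt": "draft 5", "missing": ["full suite green"], "cost_spent": 0.0725},
)
print(report.reason.value, sorted(report.details))

try:
    StopReport(StopReason.SUCCESS, {"evidence": ["targeted test passes"]})
except ValueError as exc:
    print(exc)
```

Making the report a constructor precondition turns "each gate owes the caller a different report" from a convention into something the loop cannot skip.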

The book frames the same idea through its application scenarios, and they are worth carrying as mental models. A customer-support agent whose goal is "resolve the billing issue" monitors the conversation and database, confirms the change and the customer's reaction (success), or escalates to a human when unresolved (human-needed). A project-management assistant with the goal "hit milestone X by date Y" monitors task status and proactively warns on detected slippage. An autonomous vehicle with "deliver the passenger from A to B safely" monitors environment, self-state, and route progress in real time. In every case the goal is measurable, the monitor is continuous, and the stop is explicit — the same three pieces as the coding agent, just at different stakes.

Framework note — where the contract lives

The chapter ends on an implementation detail worth knowing for interviews: in Google's ADK, the goal is typically passed as part of the agent's instructions, and monitoring is realised through state management and tool interactions — i.e. the contract is prompt-level and the telemetry is state-level. LangChain/LangGraph express the same split as a graph whose nodes are the generate/evaluate/judge steps and whose edges encode the stop conditions; the loop's recursion limit is LangGraph's blunt version of the iteration cap.

Where this points next

Monitoring told us that something is wrong — a tool errored, evidence stalled, a constraint was violated, the budget ran low. It did not tell us what to do about it. The honest verdict from §3 — the judge can be wrong, the code was never executed, the loop can oscillate — is the agenda for lesson 14, Exception handling and recovery: treating errors not as crashes but as expected observations the loop can recover from, with retries, backoff, fallbacks, and clean escalation. Goals and monitors define when to act; recovery defines how. And the "human-needed" stop condition we named here is the doorway to lesson 15, Human-in-the-loop, where that escalation becomes a first-class control point.

Failure modes

Goalpost-moving — the loop silently rewrites its own objective mid-run, so it always "succeeds."
Hallucinated success — the judge declares done without real evidence, especially when generator and judge are the same model.
Overrun — continuing after the success criteria are already met, burning budget on nothing.
Unbounded loop — no iteration/time cap, so an oscillating judge runs forever.
Stagnation blindness — grinding to the cap while making zero new progress, instead of replanning early.

Implementation checklist

What concrete evidence proves success (measurable, not vibes)?
Which constraints can never be violated?
What structured progress fields does the monitor track each step?
What are the hard budget caps — iterations, tokens, time, dollars?
Is there a confidence floor before a True is accepted?
Is the judge independent of the generator?
What is reported on each of the four stop conditions?

Takeaway

Goal setting and monitoring are two patterns: a written, immutable goal contract (objective, constraints, measurable success_evidence, done_condition, budget, escalate_at) and a running monitor that compares live state to it and emits continue / replan / recover / escalate. SMART criteria make the goal evaluable; a hard budget makes stopping definable. The book's iterative coder — generate → evaluate → judge (True/False) → refine, capped at 5 iterations — is the canonical instance, and its honest weakness (the same model writes and judges, so it can hallucinate success) is why production systems separate the reviewer from the author and keep a human in the loop. A managed run ends through exactly one of four gates — success, impossible, budget, human-needed — each owing the caller a specific report.

Interview prompts

What are the two distinct patterns inside "goal setting and monitoring," and why keep them separate? (§1 — the goal contract is a static "where are we going / how do we know we arrived"; the monitor is a running "where are we now / are we still moving." Separating them lets the monitor evaluate progress against a fixed yardstick instead of a drifting one.)
Why must a goal be measurable, and what fails if it isn't? (§2 — the monitor computes a verdict by comparing state to success_evidence; an unmeasurable goal gives the monitor nothing to check, so the agent can never prove completion and autonomy collapses into reactive task execution.)
In the book's coding agent, what plays the role of goal, monitor, and budget? (§3 — the quality checklist is the goal contract, the reviewer + True/False judge calls are the monitor, and max_iterations = 5 is the budget.)
What is the central weakness of using one LLM to both generate and judge, and how do you fix it? (§3 — the judge inherits the generator's misunderstandings and can confidently return True for wrong code; fix by separating reviewer from author into independent agents and keeping a human for the final run-and-test.)
You cap a self-refining agent at 5 iterations. If each iteration fixes the task with probability 0.5, what is the chance it is still unfinished when the cap is hit, and when should you stop earlier? (§4: (1−0.5)⁵ ≈ 3.1% of tasks run out of budget unfinished; stop earlier on stagnation, meaning no new evidence met across two iterations, since extra iterations only help when p is genuinely positive.)
Name the four ways a managed agent run is allowed to end, and what each must report. (§5 — success (evidence + diff + residual risk), impossible (which constraint blocked it + partial result), budget exhausted (best attempt + what was missing + cost), human-needed (the decision and context for escalation).)
How does Google's ADK express the goal and the monitor, respectively? (§5 — the goal is passed via the agent's instructions; monitoring is realised through state management and tool interactions.)