Part V - Reliability and control
Goal setting and monitoring - objective, progress, stop
Up to now an agent could act — call tools, remember, adapt. But acting is not the same as making progress toward something. A loop with no target either stops too early (declares a victory it cannot prove) or never stops at all (burns budget chasing a goal it already met, or one it can never reach). This lesson gives the loop a direction and a way to tell whether it is getting closer.
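The lesson proceeds in five steps. (1) Separate the two jobs that are usually conflated: the goal and the monitor. (2) Write the goal contract: SMART criteria plus a hard budget. (3) Walk through the book's running example, a self-reviewing coder that loops until its judge returns True or a 5-iteration cap is hit, with each review reduced to a True/False verdict, and read its real failure: a judge that can hallucinate success. (4) Do the budget arithmetic: how an iteration cap, token cost, and confidence threshold interact. (5) Name the four legitimate stop conditions and what each one reports. We close on why monitoring is the precondition for recovery (lesson 14).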
New capability: a measurable goal contract, structured progress telemetry, and explicit stop conditions that turn an open-ended run into a managed, reportable process.
1 · Two jobs that are usually conflated: the goal and the monitor
The chapter's title is two patterns wearing one name, and keeping them apart is the whole insight. Goal setting answers "where are we going, and how will we recognise arrival?" It is a mostly static contract written once, before autonomous work begins. Monitoring answers "where are we now, and are we still moving toward the goal?" It is a running process that compares live state to that contract on every step and emits a verdict: continue, replan, recover, or escalate.
The book's travel analogy makes the split intuitive. Planning a trip, you first fix the destination (goal state), note your starting point (initial state), weigh options under constraints (budget, routes, transport), then execute steps — book, pack, travel, arrive. The destination and constraints are the contract. Checking "am I closer to the airport, on time, within budget?" at each step is the monitor. Skip the contract and the agent is merely "responsive" — it answers whatever is in front of it — but it can never judge whether its own behaviour is effective. That is the book's core complaint about goal-less agents: without a goal there is no yardstick for progress, so autonomy degenerates into reactive task execution that falls apart in dynamic, real-world settings.
Notice what monitoring is made of. It is not a vibe; it is a set of observable signals the chapter enumerates: the agent's own actions, the environment state, and tool outputs. A monitor watches completed subtasks, tool errors, budget spent, evidence gathered, open risks, and a confidence estimate in the final answer. Good monitoring forces the agent to explain not just what it did but why it believes the task is complete — which is precisely the thing a goal-less loop can never produce.
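The signals listed above can be collected into one structured record per step. The following is a minimal sketch, not the book's code: `ProgressSnapshot` and all of its field names are hypothetical, chosen only to mirror the list of things a monitor watches.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProgressSnapshot:
    """One monitoring record per step; every field name here is illustrative."""
    completed_subtasks: list[str] = field(default_factory=list)
    tool_errors: list[str] = field(default_factory=list)
    budget_spent_usd: float = 0.0
    evidence: list[str] = field(default_factory=list)
    open_risks: list[str] = field(default_factory=list)
    confidence: float = 0.0  # estimate in [0, 1] for the final answer

    def explain(self) -> str:
        """State the basis for the agent's belief, not just what it did."""
        evidence = ", ".join(self.evidence) or "none"
        return (
            f"subtasks done: {len(self.completed_subtasks)}; "
            f"tool errors: {len(self.tool_errors)}; "
            f"evidence: {evidence}; "
            f"confidence: {self.confidence:.2f}"
        )


snap = ProgressSnapshot(
    completed_subtasks=["reproduce bug"],
    evidence=["targeted test passes"],
    budget_spent_usd=0.03,
    confidence=0.7,
)
print(snap.explain())
```

Keeping the record structured, rather than a free-text status line, is what lets a later step compare it mechanically against the goal.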
2 · The goal contract: SMART criteria plus a hard budget
The chapter leans on the SMART framework (Specific, Measurable, Achievable, Relevant, Time-bound) for what makes a goal usable by a monitor. The operative word is Measurable: a goal the monitor cannot evaluate is decoration. "Make the parser better" has no monitor; "make the parser accept trailing commas, proven by a targeted test passing" has one. Time-bound is equally load-bearing for agents — without it there is no principled moment to stop, which is the chapter's named risk of the basic example: an unbounded loop.
So a goal contract has six fields, and the last three are what separate a toy from something you would let run unattended:
| Field | Answers | Coding-agent example |
|---|---|---|
| objective | What behaviour changes? | "make the JSON parser accept trailing commas" |
| constraints | What must never break? | "preserve the public API; no new dependencies" |
| success_evidence | What proves done? (the Measurable part) | "the targeted test passes; diff summarised; full suite still green" |
| done_condition | How is the verdict computed? | "all success_evidence items observed" |
| budget | The Time-bound ceiling | "max 5 iterations, 20 tool calls, 30 min, $0.40" |
| escalate_at | When to hand to a human? | "fix needs an out-of-scope schema migration" |
goal = {
    "objective": "make the JSON parser accept trailing commas",
    "constraints": ["preserve public API", "no new dependencies"],
    "success_evidence": ["targeted test passes", "full suite green", "diff summarized"],
    "done_condition": "all success_evidence observed",
    "budget": {"iterations": 5, "tool_calls": 20, "minutes": 30, "usd": 0.40},
    "escalate_at": "fix requires out-of-scope schema migration",
}

# The monitor turns live state into a verdict against that contract.
# monitor() and over_budget() are helpers assumed to be defined elsewhere.
progress = monitor(state, goal)  # -> {"met": [...], "missing": [...], "cost": ..., "stalled": ..., "confidence": ...}
if set(progress["met"]) >= set(goal["success_evidence"]):
    state.status = "done"        # stop: success (comparison ignores ordering)
elif over_budget(state, goal["budget"]):
    state.status = "abandoned"   # stop: budget
elif progress["stalled"]:
    state.status = "replan"      # no movement -> change approach, do not repeat
else:
    state.status = "continue"    # still progressing and still within budget
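The listing above leaves `monitor` and `over_budget` undefined. One way they might look is sketched below, under stated assumptions: state is a plain dict, `met_history` is hypothetical bookkeeping holding the met-evidence count from earlier checks, "stalled" means that count has not grown across the last two iterations, and the confidence field is omitted for brevity.

```python
def monitor(state: dict, goal: dict) -> dict:
    """Summarise progress by comparing observed evidence with the contract."""
    observed = set(state["evidence"])
    met = [item for item in goal["success_evidence"] if item in observed]
    missing = [item for item in goal["success_evidence"] if item not in observed]
    # met_history holds the met-count from earlier checks (assumed bookkeeping).
    history = state["met_history"] + [len(met)]
    # Stalled: the count has not grown across the last two iterations.
    stalled = len(history) >= 3 and history[-1] <= history[-3]
    return {"met": met, "missing": missing, "cost": state["usd"], "stalled": stalled}


def over_budget(state: dict, budget: dict) -> bool:
    """True once any single cap (iterations, tool_calls, minutes, usd) is reached."""
    return any(state[name] >= cap for name, cap in budget.items())


def decide(state: dict, goal: dict) -> str:
    """Map progress to one verdict; success is checked before budget."""
    progress = monitor(state, goal)
    if not progress["missing"]:
        return "done"
    if over_budget(state, goal["budget"]):
        return "abandoned"
    if progress["stalled"]:
        return "replan"
    return "continue"


goal = {
    "success_evidence": ["targeted test passes", "full suite green", "diff summarized"],
    "budget": {"iterations": 5, "tool_calls": 20, "minutes": 30, "usd": 0.40},
}
state = {
    "evidence": ["targeted test passes"],
    "met_history": [1, 1],
    "iterations": 3, "tool_calls": 9, "minutes": 12, "usd": 0.04,
}
print(decide(state, goal))  # replan
```

Checking success before budget is a deliberate choice in this sketch: a run that meets its evidence on the final allowed iteration should report success, not abandonment.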
The crucial design rule, and a named failure mode: the goal contract is immutable during a run. An agent that quietly rewrites its own objective mid-flight ("this is too hard, I'll just make the test less strict") will always report success — it moves the goalpost to wherever the ball landed. Changing the goal is a legitimate operation, but it is a replan event the monitor must surface to a human, not a side effect the loop performs on itself.
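The immutability rule can be enforced by the data structure itself instead of by convention. A minimal sketch, assuming a hypothetical `GoalContract` class and a hypothetical `approved_replan` helper (neither comes from the book): the contract is a frozen dataclass with tuple fields, so the loop cannot edit it in place, and any change has to go through a function that names a human approver.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class GoalContract:
    """Frozen: attribute assignment raises, and tuples cannot be appended to."""
    objective: str
    constraints: tuple
    success_evidence: tuple
    max_iterations: int = 5


def approved_replan(old: GoalContract, approved_by: str, **changes) -> GoalContract:
    """Return a new contract; the old one is left untouched for the audit trail."""
    if not approved_by:
        raise PermissionError("a goal change must be approved by a human")
    return replace(old, **changes)


goal = GoalContract(
    objective="make the JSON parser accept trailing commas",
    constraints=("preserve public API", "no new dependencies"),
    success_evidence=("targeted test passes", "full suite green"),
)

try:
    goal.success_evidence = ()  # the loop tries to loosen its own target
except FrozenInstanceError:
    print("blocked: the contract is immutable during a run")

# A legitimate change is an explicit, attributed event that yields a new object.
relaxed = approved_replan(goal, approved_by="reviewer@example.com", max_iterations=8)
print(goal.max_iterations, relaxed.max_iterations)
```

Tuples are used for the list-like fields because a list inside a frozen dataclass could still be mutated in place.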
3 · The running example: the self-reviewing coder, and why it can lie
The chapter's worked code (LangChain + OpenAI, GPT-4o, by Mahtab Syed) is a clean instance of the pattern. You hand the agent a use case (e.g. "find the BinaryGap of a positive integer") and a quality checklist that is literally the success criteria — "simple to understand," "functionally correct," "handles edge cases," "positive-integer input only," "prints a few examples." The agent does not emit code and stop. It enters a loop:
- Generate: write code for the use case, using any feedback from the previous pass.
- Evaluate: critique that code against the checklist.
- Judge: reduce the critique to a single True or False: are the goals met?
- Branch: on True, save an annotated .py file and exit. On False, feed the critique back as feedback and loop, up to max_iterations = 5.

This maps one-to-one onto §1: the checklist is the goal contract, the reviewer-plus-judge calls are the monitor, and max_iterations is the budget. The True/False verdict is deliberately reduced to a single token so the loop has an unambiguous stop signal; there is no paragraph to parse to guess whether the agent thinks it is finished.
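The generate, critique, judge loop described in this section can be sketched without any LLM library. This is an illustration of the control flow only, not the book's LangChain code: `refine_loop`, `parse_verdict`, and the three stand-in callables are hypothetical names, and the stubs return canned strings so the example runs offline.

```python
from typing import Callable


def parse_verdict(raw: str) -> bool:
    """Accept only an exact 'True'; anything else fails closed as False."""
    return raw.strip() == "True"


def refine_loop(
    generate: Callable[[str, str], str],
    critique: Callable[[str], str],
    judge: Callable[[str], str],
    use_case: str,
    max_iterations: int = 5,
) -> tuple:
    """Return (last code, goals met?, iterations used)."""
    feedback, code = "", ""
    for iteration in range(1, max_iterations + 1):
        code = generate(use_case, feedback)   # call 1: write or revise
        feedback = critique(code)             # call 2: review against goals
        if parse_verdict(judge(feedback)):    # call 3: single-token verdict
            return code, True, iteration
    return code, False, max_iterations        # cap hit without a True


# Stand-in callables so the sketch runs without a model.
def fake_generate(use_case: str, feedback: str) -> str:
    return "v2" if feedback else "v1"


def fake_critique(code: str) -> str:
    return "ok" if code == "v2" else "missing edge cases"


def fake_judge(critique_text: str) -> str:
    return "True" if critique_text == "ok" else "False"


print(refine_loop(fake_generate, fake_critique, fake_judge, "BinaryGap"))
# ('v2', True, 2)
```

Parsing the verdict strictly means a rambling or malformed judge reply costs another iteration instead of being mistaken for success.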
The book is refreshingly honest that this is illustrative, not production-grade, and the warnings are the real lesson:
The first named risk is hallucinated success. The same model writes the code and judges it, so it can confidently return True for code that is wrong: the monitor inherits the generator's blind spots. The book's two fixes are exactly the right instincts: (1) separate the roles into different agents (the author describes a Gemini-built team with distinct members: coding assistant, code reviewer, documenter, test writer, prompt refiner), so the reviewer's judgement is independent of the author's; and (2) keep a human in the loop for the final run-and-test, because the LLM never actually executed the code. We build role separation properly in the multi-agent lessons and the human gate in lesson 15.
The second named risk is the unbounded loop. The max_iterations = 5 cap is the only thing standing between this agent and an infinite generate-critique cycle when the judge oscillates. That cap is not a detail — it is the Time-bound axis of SMART made executable, and it forces us to reason about budgets numerically.
4 · Budget arithmetic: iterations, tokens, cost, and a confidence floor
A stop condition you cannot price is a stop condition you cannot trust. Let us put real numbers on the coding agent. Each iteration makes three LLM calls: generate, evaluate (critique), and judge. The judge call is tiny. Suppose the three calls together consume about 2,900 tokens per iteration.
Worked cost. At an illustrative blended rate of $5 per million tokens, one iteration costs 2,900 × $5 / 1,000,000 = $0.0145. The full 5-iteration budget therefore caps spend at 5 × $0.0145 = $0.0725 (about $0.073) per task, comfortably under the $0.40 ceiling in our contract, so for this task the iteration cap binds first, not the dollar cap. Scale to a batch of 10,000 code tasks and the worst case is 10,000 × $0.0725 = $725; if the typical task converges in 2 iterations, the realistic bill is closer to 10,000 × 2 × $0.0145 = $290. That gap, $725 worst case versus $290 typical, is exactly why you monitor actual iterations, not just the cap.
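The worked cost can be reproduced in a few lines, which also makes it easy to swap in your own rates. The constants below are the illustrative figures from the text; the variable names are ours.

```python
TOKENS_PER_ITERATION = 2_900       # illustrative figure from the text
USD_PER_MILLION_TOKENS = 5.0       # illustrative blended rate
ITERATION_CAP = 5
DOLLAR_CAP = 0.40
BATCH_SIZE = 10_000

cost_per_iteration = TOKENS_PER_ITERATION * USD_PER_MILLION_TOKENS / 1_000_000
worst_case_per_task = ITERATION_CAP * cost_per_iteration

# How many whole iterations the dollar cap alone would allow.
iterations_affordable = int(DOLLAR_CAP // cost_per_iteration)
binding_cap = "iterations" if ITERATION_CAP <= iterations_affordable else "dollars"

batch_worst = BATCH_SIZE * worst_case_per_task
batch_typical = BATCH_SIZE * 2 * cost_per_iteration  # assumes 2 iterations per task

print(f"per iteration:        ${cost_per_iteration:.4f}")
print(f"worst case per task:  ${worst_case_per_task:.4f}")
print(f"dollar cap allows:    {iterations_affordable} iterations")
print(f"binding cap:          {binding_cap}")
print(f"batch worst case:     ${batch_worst:,.0f}")
print(f"batch typical:        ${batch_typical:,.0f}")
```

Computing which cap binds first is the useful part: if token prices or prompt sizes change enough, the dollar cap could become the binding one without anyone editing the iteration cap.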
The convergence vs. cap trade-off. Say each iteration has an independent probability p of fixing the remaining issues and earning a genuine True. The chance the task is still unfinished after the cap of N iterations is (1−p)ᴺ. With p = 0.5: after 1 iteration 50% still fail; after 3, 0.5³ = 12.5%; after 5, 0.5⁵ ≈ 3.1%. Raising the cap from 3 to 5 buys you ~9 fewer failed-out-of-budget tasks per 100, but only if p is genuinely positive. If the judge is stuck oscillating (true p ≈ 0), extra iterations buy nothing but cost, which is why a smart monitor also watches for stagnation (no new evidence met across two iterations) and abandons early instead of grinding to the cap.
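The same independence assumption yields two quantities worth computing together: the chance of running out of budget, and the mean number of iterations actually used when the loop stops at the first success or at the cap. The second formula, (1 − (1−p)ᴺ) / p, is not in the chapter; it is the standard mean of a geometric trial truncated at N, added here as a sketch with function names of our own.

```python
def p_unfinished(p: float, cap: int) -> float:
    """Probability that no iteration within the cap earns a genuine True."""
    return (1 - p) ** cap


def expected_iterations(p: float, cap: int) -> float:
    """Mean iterations used when stopping at first success or at the cap."""
    if p == 0:
        return float(cap)  # never succeeds, so every run grinds to the cap
    return (1 - (1 - p) ** cap) / p


for cap in (1, 3, 5):
    print(
        f"cap={cap}  unfinished={p_unfinished(0.5, cap):.3f}  "
        f"mean iterations={expected_iterations(0.5, cap):.4f}"
    )
```

At p = 0 the mean equals the cap, which is the arithmetic behind the stagnation check: when nothing is improving, every extra allowed iteration is spent in full.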
Add one more knob the chapter implies but does not formalise: a confidence floor. Rather than accept any True, require the judge's confidence to clear a threshold (say 0.8) before declaring success; below it, treat the verdict as False and spend another iteration. Raising the floor trades cost for safety against a hallucinated True. Three forces therefore pull against the dollar budget at once: the per-iteration success probability, the iteration cap, and the confidence floor.
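A confidence floor is a one-line gate in front of the stop decision. The sketch below assumes the judge reports a confidence in [0, 1] alongside its verdict; how that number is obtained (self-reported, calibrated, or from a second model) is left open, and `accept_verdict` is a hypothetical name.

```python
def accept_verdict(verdict: bool, confidence: float, floor: float = 0.8) -> bool:
    """Treat a True whose confidence is below the floor as False."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return verdict and confidence >= floor


print(accept_verdict(True, 0.92))   # True
print(accept_verdict(True, 0.65))   # False: below the floor, spend another iteration
print(accept_verdict(False, 0.99))  # False: a confident False is still False
```

The floor only filters weak positives; it never upgrades a False, so it cannot make the loop stop earlier than the judge alone would.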
5 · Stop conditions: the four ways a run is allowed to end
A managed loop never just "finishes." It exits through exactly one of four named gates, and each gate owes the caller a different report. Treating them as distinct is what makes an agent accountable.
- Success: all success_evidence observed and confidence ≥ floor. Report: the evidence, the diff/summary, residual risks. The coding agent saves the annotated .py file here.
- Impossible: the goal cannot be met without breaking a constraint. Report: which constraint blocked it and the partial result.
- Budget exhausted: a budget cap was hit before the evidence was complete. Report: the best attempt, what was missing, and the cost.
- Human-needed: an escalate_at trigger fired (out-of-scope migration, low confidence, ambiguous goal). Report: the decision the human must make and the context to make it.

The book frames the same idea through its application scenarios, and they are worth carrying as mental models. A customer-support agent whose goal is "resolve the billing issue" monitors the conversation and database, confirms the change and the customer's reaction (success), or escalates to a human when unresolved (human-needed). A project-management assistant with the goal "hit milestone X by date Y" monitors task status and proactively warns on detected slippage. An autonomous vehicle with "deliver the passenger from A to B safely" monitors environment, self-state, and route progress in real time. In every case the goal is measurable, the monitor is continuous, and the stop is explicit: the same three pieces as the coding agent, just at different stakes.
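The four gates, and the report each one owes, can be encoded so that a run cannot exit without saying what it is obliged to say. A minimal sketch: the `Stop` enum, the `REQUIRED_REPORT` table, and the report field names are hypothetical, chosen to match the report contents named in this lesson.

```python
from enum import Enum


class Stop(Enum):
    SUCCESS = "success"
    IMPOSSIBLE = "impossible"
    BUDGET = "budget"
    HUMAN_NEEDED = "human-needed"


# Fields each gate must supply; names are illustrative.
REQUIRED_REPORT = {
    Stop.SUCCESS: ("evidence", "diff_summary", "residual_risks"),
    Stop.IMPOSSIBLE: ("blocking_constraint", "partial_result"),
    Stop.BUDGET: ("best_attempt", "missing_evidence", "cost"),
    Stop.HUMAN_NEEDED: ("decision_needed", "context"),
}


def build_report(stop: Stop, **fields) -> dict:
    """Refuse to build an exit report that omits what the gate owes."""
    missing = [name for name in REQUIRED_REPORT[stop] if name not in fields]
    if missing:
        raise ValueError(f"{stop.value} report is missing: {missing}")
    return {"stop": stop.value, **fields}


report = build_report(
    Stop.BUDGET,
    best_attempt="parser_v3.py",
    missing_evidence=["full suite green"],
    cost=0.0725,
)
print(report["stop"])  # budget
```

Using an enum instead of free-text status strings means a caller can branch on exactly four cases and a typo cannot create a fifth.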
Where this points next
Monitoring told us that something is wrong — a tool errored, evidence stalled, a constraint was violated, the budget ran low. It did not tell us what to do about it. The honest verdict from §3 — the judge can be wrong, the code was never executed, the loop can oscillate — is the agenda for lesson 14, Exception handling and recovery: treating errors not as crashes but as expected observations the loop can recover from, with retries, backoff, fallbacks, and clean escalation. Goals and monitors define when to act; recovery defines how. And the "human-needed" stop condition we named here is the doorway to lesson 15, Human-in-the-loop, where that escalation becomes a first-class control point.
Failure modes
- Goalpost-moving — the loop silently rewrites its own objective mid-run, so it always "succeeds."
- Hallucinated success — the judge declares done without real evidence, especially when generator and judge are the same model.
- Overrun — continuing after the success criteria are already met, burning budget on nothing.
- Unbounded loop — no iteration/time cap, so an oscillating judge runs forever.
- Stagnation blindness — grinding to the cap while making zero new progress, instead of replanning early.
Implementation checklist
- What concrete evidence proves success (measurable, not vibes)?
- Which constraints can never be violated?
- What structured progress fields does the monitor track each step?
- What are the hard budget caps — iterations, tokens, time, dollars?
- Is there a confidence floor before a True is accepted?
- Is the judge independent of the generator?
- What is reported on each of the four stop conditions?
The book's coding agent, which loops generate → critique → judge (True/False) → refine, capped at 5 iterations, is the canonical instance, and its honest weakness (the same model writes and judges, so it can hallucinate success) is why production systems separate the reviewer from the author and keep a human in the loop. A managed run ends through exactly one of four gates (success, impossible, budget, human-needed), each owing the caller a specific report.
Interview prompts
- What are the two distinct patterns inside "goal setting and monitoring," and why keep them separate? (§1 — the goal contract is a static "where are we going / how do we know we arrived"; the monitor is a running "where are we now / are we still moving." Separating them lets the monitor evaluate progress against a fixed yardstick instead of a drifting one.)
- Why must a goal be measurable, and what fails if it isn't? (§2 — the monitor computes a verdict by comparing state to success_evidence; an unmeasurable goal gives the monitor nothing to check, so the agent can never prove completion and autonomy collapses into reactive task execution.)
- In the book's coding agent, what plays the role of goal, monitor, and budget? (§3 — the quality checklist is the goal contract, the reviewer + True/False judge calls are the monitor, and max_iterations = 5 is the budget.)
- What is the central weakness of using one LLM to both generate and judge, and how do you fix it? (§3 — the judge inherits the generator's misunderstandings and can confidently return True for wrong code; fix by separating reviewer from author into independent agents and keeping a human for the final run-and-test.)
- You cap a self-refining agent at 5 iterations. If each iteration fixes the task with probability 0.5, what is the chance it fails within budget, and when should you stop earlier? (§4 — (1−0.5)⁵ ≈ 3.1% fail within budget; stop earlier on stagnation — no new evidence met across two iterations — since extra iterations only help when p is genuinely positive.)
- Name the four ways a managed agent run is allowed to end, and what each must report. (§5 — success (evidence + diff + residual risk), impossible (which constraint blocked it + partial result), budget exhausted (best attempt + what was missing + cost), human-needed (the decision and context for escalation).)
- How does Google's ADK express the goal and the monitor, respectively? (§5 — the goal is passed via the agent's instructions; monitoring is realised through state management and tool interactions.)