Part V - Reliability and control
Exception handling and recovery - controlled failure
A demo agent assumes every tool call works. A deployed agent assumes none of them do. This lesson is about the difference: turning errors from "the run crashed" into expected observations that route to a bounded, recorded recovery path — or to a clean, honest stop.
SequentialAgent that chains a primary handler, a state-driven fallback handler, and a response agent.
New capability: A failure taxonomy, a detection layer, a bounded handling-and-recovery policy (retry / repair / fallback / degrade / escalate / rollback / stop), and clean visible failure when recovery would only hide uncertainty.
1 · The core shift: failure is an observation, not a crash
The book opens with a human analogy: just as a person adapts to an unexpected obstacle, an agent needs robust machinery to detect a problem, start a recovery process, or at minimum fail in a controlled way. That last clause is the whole point. In a notebook demo, the implicit error policy is "let the exception propagate and the cell turns red." In a deployed agent that runs for minutes across dozens of tool calls, an uncaught exception throws away all the work done so far and leaves external state (a half-written file, a partially-applied transaction) in an unknown condition.
So the reframe is: an error is just another observation flowing back into the control loop. Recall the loop from the track's mental model — the agent decides, acts, observes, and updates state. A tool returning HTTP 500 is an observation exactly like a tool returning useful data; it simply routes to a different branch. The skill we are building is not "avoid errors" (impossible in a dynamic real-world environment) but "have a named, bounded, recorded response ready for each class of error."
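A minimal sketch of this reframe, assuming a hypothetical `Observation` record and `call_tool` wrapper (neither name comes from the book): the wrapper catches whatever the tool raises and hands the loop a typed observation, so the caller branches on data instead of unwinding the run.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Observation:
    """What flows back into the control loop after every tool call."""
    ok: bool
    value: Any = None
    error_type: Optional[str] = None  # e.g. "timeout", "permission_denied"
    detail: str = ""


def call_tool(tool: Callable[..., Any], *args: Any, **kwargs: Any) -> Observation:
    """Run a tool and convert any exception into an Observation."""
    try:
        return Observation(ok=True, value=tool(*args, **kwargs))
    except TimeoutError as exc:
        return Observation(ok=False, error_type="timeout", detail=str(exc))
    except PermissionError as exc:
        return Observation(ok=False, error_type="permission_denied", detail=str(exc))
    except Exception as exc:  # anything unclassified still becomes data
        return Observation(ok=False, error_type="unknown", detail=repr(exc))


def flaky_docs_lookup(query: str) -> str:
    # Hypothetical tool that always times out, for illustration only.
    raise TimeoutError(f"docs lookup for {query!r} exceeded 30 s")


obs = call_tool(flaky_docs_lookup, "SequentialAgent")
# obs.ok is False and obs.error_type is "timeout"; nothing propagated.
```

The broad `except Exception` is deliberate in this sketch: an unclassified error is still returned as an observation with `error_type="unknown"`, which a policy can route to its visible-failure branch.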
The book stresses both proactive and reactive strategies. Proactive: monitoring tool outputs and API responses, even using a separate agent or monitoring system to catch anomalies before they cascade. Reactive: the log / retry / fallback / degrade / notify / recover machinery we build below. It also notes that recovery frequently composes with reflection (lesson 06): after a call fails, a reflection pass can analyze why it failed and retry with an improved approach, such as a rewritten prompt, rather than blindly repeating the same call.
2 · Detection — how the agent notices
You cannot recover from a failure you never saw. The book lists concrete detection signals, and a real agent layers several because each catches a different class:
| Signal | What it catches | Example |
|---|---|---|
| Status / error codes | Hard service failures | API returns 404 (not found) or 500 (server error) |
| Timeouts | Hangs and abnormal latency | A tool that normally answers in 800 ms takes > 30 s |
| Schema / format validation | Malformed or invalid output | Model returns prose where typed JSON was required (see lesson 02's contract) |
| Semantic / content checks | Plausible-looking but wrong output | A "summary" that is empty, or a citation to a nonexistent file |
| No-progress detection | Loops that burn budget without advancing | 3 consecutive actions, no change in the monitored progress metric |
| External monitor / second agent | Anomalies the actor misses | A watcher agent flags an unsafe action or an out-of-distribution response |
The cheap signals (codes, timeouts, schema) are necessary but not sufficient. The dangerous failures are the ones that look like success: an API returns 200 OK with an empty body, or the model confidently fabricates. That is why semantic checks and progress monitoring matter — they catch the failures that do not raise an exception.
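The layering can be sketched as a single function that runs the cheap checks first and the semantic check last. Everything here is illustrative: `ToolResponse`, the failure labels, the 30-second threshold, and the requirement of a non-empty `summary` field are assumptions chosen to mirror the table, not an API from the book.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolResponse:
    status: int
    body: str
    elapsed_s: float


def detect_failure(resp: ToolResponse, timeout_s: float = 30.0) -> Optional[str]:
    """Return a failure label, or None if every check passes."""
    # 1. Status codes: hard service failures.
    if resp.status == 404:
        return "not_found"
    if resp.status >= 500:
        return "service_unavailable"
    # 2. Latency: the call finished, but abnormally slowly.
    if resp.elapsed_s > timeout_s:
        return "timeout"
    # 3. Schema: typed JSON with a "summary" field was required.
    try:
        payload = json.loads(resp.body)
    except json.JSONDecodeError:
        return "invalid_schema"
    if not isinstance(payload, dict) or "summary" not in payload:
        return "invalid_schema"
    # 4. Semantic: well-formed but useless output still counts as failure.
    if not str(payload["summary"]).strip():
        return "empty_content"
    return None


def no_progress(metric_history: List[float], window: int = 3) -> bool:
    """True if the last `window` actions left the progress metric unchanged."""
    if len(metric_history) < window + 1:
        return False
    recent = metric_history[-(window + 1):]
    return all(m == recent[0] for m in recent)
```

A `200` response whose body is `{"summary": ""}` passes the first three checks and is caught only by the fourth, which is the success-shaped failure described above.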
3 · The handling menu — what to do once you have noticed
The book names a specific ladder of responses, ordered roughly from cheapest/most-local to most-disruptive. The art of recovery is matching the class of failure to the cheapest adequate response — not reaching for escalation when a retry would do, and not retrying forever when the failure is permanent.
insufficient funds or market closed error must not be retried — repeating an invalid trade wastes money and may trip rate limits or fraud controls. Retry is only for failures that might succeed on a second identical attempt. permission_denied, 404, and insufficient funds are permanent for the current inputs; retrying them just amplifies cost.
4 · Recovery — getting back to a stable state
Handling stops the bleeding; recovery restores a known-good state and reduces the chance of recurrence. The book lists four mechanisms:
- State rollback — undo the most recent changes or transaction so the agent is not left in a half-applied state. (If you appended to a file and then a follow-up step failed, roll the file back.)
- Diagnosis — investigate the root cause deeply enough to prevent a repeat, rather than papering over the symptom.
- Self-correction / re-planning — adjust the plan, logic, or parameters to avoid the same error next time. This is where recovery fuses with reflection: analyze the failure, then act differently.
- Escalation — for complex or severe problems, hand off to a human operator or a higher-level system.
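The file-append case from the rollback bullet can be sketched with a context manager. This is one possible approach under simple assumptions (a single local file small enough to snapshot in memory); `file_rollback` and `append_then_verify` are hypothetical names, not the book's.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_rollback(path: str):
    """Restore `path` to its prior contents if the wrapped block raises."""
    existed = os.path.exists(path)
    snapshot = b""
    if existed:
        with open(path, "rb") as fh:
            snapshot = fh.read()
    try:
        yield
    except Exception:
        if existed:
            with open(path, "wb") as fh:
                fh.write(snapshot)
        elif os.path.exists(path):
            os.remove(path)  # the file was created inside the block
        raise  # rolling back never hides the failure


def append_then_verify(path: str, line: str, verify) -> None:
    with file_rollback(path):
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(line)
        verify()  # hypothetical follow-up step that may fail


def failing_verify() -> None:
    raise RuntimeError("follow-up step failed")


with tempfile.TemporaryDirectory() as tmp:
    notes = os.path.join(tmp, "notes.txt")
    with open(notes, "w", encoding="utf-8") as fh:
        fh.write("original\n")
    try:
        append_then_verify(notes, "half-applied\n", failing_verify)
    except RuntimeError:
        pass  # the caller still sees the failure and can log or escalate it
    with open(notes, encoding="utf-8") as fh:
        restored = fh.read()
```

The exception is re-raised after the restore on purpose: rollback returns the file to a known-good state, while the failure itself stays visible to whatever policy handles it next.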
The non-negotiable rule the book implies and we make explicit: silent recovery is dangerous. The trace must show what failed and which policy was applied. An agent that quietly swallows three failures and returns a confident answer has not recovered — it has hidden uncertainty, which is strictly worse than a visible failure because nobody downstream knows to distrust the result.
5 · A real policy, with a retry budget and cost arithmetic
"Retry on failure" is not a policy until it has a budget and a backoff schedule. Unbounded retries are how an agent turns a transient blip into a runaway bill. Make it concrete.
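Suppose the coding/research assistant calls a flaky documentation API. Each call costs roughly: 2,000 input + 500 output tokens of model context to interpret the result, plus the API itself. At a model price of $3 / 1M input and $15 / 1M output tokens, one attempt costs:
2,000 × $3/1M + 500 × $15/1M = $0.0060 + $0.0075 = $0.0135, or about 1.35¢ per attempt (model tokens only; any fee charged by the API itself is extra).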
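Now compare two policies on a failure that turns out to be permanent (the API endpoint was deprecated, every call returns 404):
| Policy | Attempts made | Model cost |
|---|---|---|
| Bounded retry (3-attempt budget) | 3 | 3 × 1.35¢ = 4.05¢ |
| Classify the 404 as permanent first | 1 | 1.35¢ |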
The lesson in the numbers: classification beats budgeting, and budgeting beats nothing. A bounded retry caps a permanent failure at 4 cents, but classifying the 404 as permanent skips retries entirely and costs one attempt. The budget is the safety net for the failures you didn't classify correctly. Backoff (exponential: 0.5s, 1s, 2s, capped) protects the upstream service from a thundering retry storm and gives a genuinely transient problem time to clear.
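The arithmetic can be checked with a few lines of Python. The token counts and prices are the ones stated above; the function names and the 8-second backoff cap are assumptions for illustration (the text says only that the schedule is capped), and any fee charged by the API itself is left out.

```python
INPUT_PRICE_PER_TOKEN = 3 / 1_000_000    # $3 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000  # $15 per 1M output tokens


def attempt_cost_cents(input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    """Model cost of interpreting one tool result, in cents."""
    dollars = (input_tokens * INPUT_PRICE_PER_TOKEN
               + output_tokens * OUTPUT_PRICE_PER_TOKEN)
    return dollars * 100


def permanent_failure_cost_cents(attempts: int) -> float:
    """Total spend when every one of `attempts` calls fails the same way."""
    return attempts * attempt_cost_cents()


def backoff_schedule(retries: int, base_s: float = 0.5, cap_s: float = 8.0) -> list:
    """Exponential delays in seconds; the cap value is an assumed default."""
    return [min(cap_s, base_s * 2 ** i) for i in range(retries)]


budgeted = permanent_failure_cost_cents(attempts=3)    # bounded retry
classified = permanent_failure_cost_cents(attempts=1)  # 404 classified as permanent
delays = backoff_schedule(3)                           # [0.5, 1.0, 2.0]
```

With these inputs, `classified` comes to about 1.35 cents and `budgeted` to about 4.05 cents. An unbounded policy has no final figure, because `attempts` keeps growing until something outside the agent stops it.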
Here is the policy as code — the same shape the original sketch had, now with the detection signals, the budget, and the rollback wired in:
```python
# Sketch only: classify, log, backoff, retry and the other helpers are
# assumed to be defined elsewhere. Requires Python 3.10+ for match/case.
def handle(observation, state):
    failure = classify(observation, goal=state.goal)  # uses §2 signals + progress
    log(failure, state)                               # always record evidence
    match failure.type:
        case "timeout" | "rate_limited" if state.attempts < state.budget:
            backoff(0.5 * 2 ** state.attempts)        # 0.5s, 1s, 2s ...
            return retry(state)
        case "invalid_schema":
            return reprompt_with_repair(failure)      # reflection-style self-correction
        case "permission_denied" | "unsafe_action":
            return escalate_to_human(failure)         # never work around (lesson 15)
        case "service_unavailable" if has_fallback(failure.tool):
            return use_fallback_tool(failure)         # the ADK primary→fallback move
        case "test_failed":
            return route_to_debugging(failure)        # running-example self-correction
        case "no_progress" if state.steps_since_gain > 3:
            return stop_with_partial(state)           # graceful degradation, not a loop
        case _:
            rollback(state)                           # undo half-applied changes
            return fail_visibly(failure)              # honest stop, full trace
```
Notice what each branch is doing: timeout is transient and idempotent, so it retries under budget; invalid_schema is the model's fault, so it self-corrects; permission_denied is permanent and also a safety boundary, so it escalates rather than retries or works around; service_unavailable takes the fallback path; the default branch rolls back and fails loudly. No branch hides a failure.
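To make the retry branch concrete, here is a small runnable sketch of a bounded retry that records every attempt and every policy decision. `ToolFailure`, `Outcome`, and `run_with_retry` are hypothetical names, and the set of transient failure kinds is an assumption taken from the first branch of the policy above. The `sleep` parameter is injectable so the backoff can be exercised without actually waiting.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List

TRANSIENT = {"timeout", "rate_limited"}  # the only kinds worth retrying here


class ToolFailure(Exception):
    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


@dataclass
class Outcome:
    ok: bool
    value: Any
    trace: List[str]  # what failed and which policy fired, in order


def run_with_retry(call: Callable[[], Any], budget: int = 3, base_s: float = 0.5,
                   sleep: Callable[[float], None] = time.sleep) -> Outcome:
    trace: List[str] = []
    for attempt in range(budget):
        try:
            value = call()
            trace.append(f"attempt {attempt + 1}: ok")
            return Outcome(True, value, trace)
        except ToolFailure as failure:
            trace.append(f"attempt {attempt + 1}: {failure.kind}")
            if failure.kind not in TRANSIENT:
                trace.append("policy: permanent failure, no retry")
                return Outcome(False, None, trace)
            if attempt + 1 == budget:
                trace.append("policy: retry budget exhausted")
                return Outcome(False, None, trace)
            delay = base_s * 2 ** attempt
            trace.append(f"policy: retry after {delay}s backoff")
            sleep(delay)
    return Outcome(False, None, trace)  # only reached when budget < 1


calls = {"n": 0}


def flaky() -> str:
    # Hypothetical tool: times out twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ToolFailure("timeout")
    return "docs page"


recorded_delays: List[float] = []
outcome = run_with_retry(flaky, sleep=recorded_delays.append)
```

Even on the successful run, `outcome.trace` still lists the two timeouts and the backoff applied to each, so the recovery is never silent.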
6 · The book's ADK fallback pattern
The chapter's worked code makes the fallback branch concrete using Google ADK's SequentialAgent, which runs its sub-agents in a fixed order and passes shared state between them. Three sub-agents form a layered location lookup:
- primary_handler calls get_precise_location_info with the user's address. On failure it sets state["primary_location_failed"] = True.
- fallback_handler checks that flag and, if it is set, calls get_general_area_info (graceful degradation: less precise, still useful). If the primary succeeded, it does nothing.
- response_agent reads state["location_result"] and presents it. If it is empty or missing, it apologizes that location could not be determined, which is a clean, honest failure rather than a fabrication.
The architectural insight: recovery here is not a tangle of try/except inside one agent. It is expressed as structure: a sequence where downstream agents inspect shared state and branch on whether upstream succeeded. The failure becomes an observation written to state, exactly the reframe from section 1. That is also why the response agent can degrade gracefully: it treats "no result" as a first-class case with a defined behavior. Other frameworks express the same idea differently (LangGraph routes a failure edge to a recovery node, CrewAI hands a failed task to a fallback crew member), but the contract is identical: failure is data that selects the next node.
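The state-driven shape can be illustrated without any framework. The sketch below is plain Python, not the ADK API: the three steps are ordinary functions sharing a dict, `run_sequence` stands in for the sequential runner, and both tool bodies are hypothetical stubs (the primary always fails so the fallback path is exercised).

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def get_precise_location_info(address: str) -> str:
    # Stub for the primary tool; it always fails in this illustration.
    raise ConnectionError("precise lookup unavailable")


def get_general_area_info(address: str) -> str:
    # Stub for the coarser fallback tool.
    return f"general area for {address.split(',')[-1].strip()}"


def primary_handler(state: State) -> None:
    try:
        state["location_result"] = get_precise_location_info(state["address"])
    except Exception as exc:
        state["primary_location_failed"] = True  # failure written to state
        state.setdefault("trace", []).append(f"primary failed: {exc}")


def fallback_handler(state: State) -> None:
    if not state.get("primary_location_failed"):
        return  # primary succeeded, nothing to do
    trace = state.setdefault("trace", [])
    try:
        state["location_result"] = get_general_area_info(state["address"])
        trace.append("fallback used: get_general_area_info")
    except Exception as exc:
        trace.append(f"fallback failed: {exc}")


def response_agent(state: State) -> None:
    result = state.get("location_result")
    if result:
        state["response"] = f"Location: {result}"
    else:
        state["response"] = "Sorry, the location could not be determined."


def run_sequence(steps: List[Callable[[State], None]], state: State) -> State:
    for step in steps:  # fixed order, shared state
        step(state)
    return state


final = run_sequence(
    [primary_handler, fallback_handler, response_agent],
    {"address": "1 Main St, Springfield"},
)
```

No step raises to its caller; each one reads what the previous step left in `state` and decides from that, and `final["trace"]` keeps the record of the primary failure next to the degraded answer.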
7 · Running example: the coding/research assistant
Thread the policy through the track's running agent. The assistant is asked to add a feature and verify it with the test suite.
- Test command fails (pytest exits non-zero). Detection: non-zero exit code. The agent stores the command, the exit code, and a failure excerpt (the assertion that broke), then routes to debugging: self-correction, not a blind retry. Re-running an identical failing test would just reproduce the same exit code at full cost.
- Permission denied writing outside the workspace. Detection: permission_denied. The agent does not hunt for a sudo workaround or an alternate path; it escalates to the human. This is both a recovery rule and a safety boundary (lesson 20).
- Docs API times out. Detection: timeout. Transient + idempotent ⇒ retry with backoff, capped at the 3-attempt budget. If all three fail, fall back to a cached local copy (degradation) and annotate the answer that the docs may be stale.
- Corrupt input file in a batch. The book's data-processing example: skip the bad file, log it, continue the rest, and report the skipped files at the end — never abort the whole batch over one bad record.
In every case the trace records what failed and what policy fired, so evaluation (lesson 21) can later ask "how often did the docs API time out?" and "did debugging actually fix the test?" The failure is preserved, not erased.
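The corrupt-file case from the list above reduces to a loop that skips, records, and continues. This is a sketch under simple assumptions: files are passed as in-memory strings, "corrupt" means "does not parse as JSON", and `process_batch` is a hypothetical name.

```python
import json
from typing import Dict, List, Tuple


def process_batch(files: Dict[str, str]) -> Tuple[List[dict], List[Tuple[str, str]]]:
    """Parse each file's JSON text; skip and record any that are corrupt."""
    parsed: List[dict] = []
    skipped: List[Tuple[str, str]] = []
    for name, text in files.items():
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError as exc:
            skipped.append((name, exc.msg))  # keep the evidence, keep going
    return parsed, skipped


batch = {
    "a.json": '{"id": 1}',
    "b.json": '{"id": ',  # truncated, hence corrupt
    "c.json": '{"id": 3}',
}
records, skipped_files = process_batch(batch)
```

Returning `skipped_files` alongside `records` is what keeps the degradation honest: the caller can report exactly which inputs were left out and why.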
Failure modes & implementation checklist
Failure modes
- Blind retry loops that re-attempt permanent failures (404, permission, insufficient funds) and amplify cost without any chance of success.
- Treating permission denial as a prompt problem — rewording the request or finding a workaround instead of escalating; this also breaches the safety boundary.
- Silent recovery — swallowing failures and returning a confident answer that hides uncertainty, leaving no trace for debugging or evaluation.
- Success-shaped failures — a 200 OK with an empty body, or a fabricated citation, that passes status checks but fails semantically.
- Unbounded budget — retries with no cap and no backoff, turning a transient blip into a runaway bill and a thundering herd on the upstream service.
- No rollback — leaving external state half-applied after a mid-sequence failure, so the next run starts from an unknown condition.
Implementation checklist
- What detection signals are wired up (codes, timeouts, schema, semantic, no-progress)?
- What are the failure types, and which are transient vs permanent?
- Which failures are retryable (transient and idempotent only)?
- What is the retry budget and the backoff schedule?
- Which failures need a fallback tool/model, and which need escalation?
- Is there a rollback for any action with external side effects?
- Does every recovery write to the trace (no silent paths)?
- What does the final failure report contain — enough evidence to debug?
Checkpoint exercise
Where this points next
Several branches in our policy ended at the same place: escalate_to_human. Permission denials, unsafe actions, and irreducible ambiguity are precisely the cases where the agent should not recover on its own — recovering would mean guessing past a boundary that exists for a reason. That hand-off is not an afterthought; it is a designed pattern with its own contract: when to ask, what to show the human, how to resume after approval or correction, and where accountability lives. Lesson 15, Human-in-the-loop, builds exactly that boundary — approval, correction, and escalation — turning "notify a human" from a vague fallback into a first-class control surface.
Interview prompts
- Why should an agent treat an error as an observation rather than an exception? (§1 — an uncaught exception discards all in-flight work and leaves external state unknown; routing the error back through the loop lets the agent choose a bounded response and keep partial value.)
- Which failures are safe to retry, and which are not? (§3 — retry only transient and idempotent failures (timeout, rate limit, flaky network); never permanent ones like 404, permission_denied, or insufficient funds — the book's trading-bot example — because retrying just amplifies cost.)
- An agent calls a flaky API at 1.35¢ per attempt and the failure is permanent. Compare unbounded retry, a 3-try budget, and classifying first. (§5 — unbounded → runaway; 3-try budget → 4.05¢ then stop; classifying the 404 as permanent → 1 attempt = 1.35¢. Classification beats budgeting, budgeting beats nothing.)
- Why is silent recovery worse than visible failure? (§4 — it hides uncertainty; downstream consumers and evaluation get a confident answer with no signal to distrust it, so the trace must record what failed and which policy fired.)
- How does the book's ADK example express fallback without try/except? (§6 — a SequentialAgent: primary_handler sets a state flag on failure, fallback_handler branches on that flag to a coarser tool, response_agent degrades gracefully if no result exists. Failure becomes shared state that selects the next node.)
- Where does recovery fuse with reflection? (§4 — self-correction: after a call fails, a reflection pass analyzes the cause and retries with an improved approach, e.g. a rewritten prompt or an adjusted plan, instead of repeating the same call.)
- Why does this pattern come after goal setting and monitoring? (§1 — classifying a failure and asking whether the goal is still reachable both require a goal and a progress signal; a no-progress loop is only detectable because monitoring supplies that signal.)