
Part V - Reliability and control

Exception handling and recovery - controlled failure

A demo agent assumes every tool call works. A deployed agent assumes none of them do. This lesson is about the difference: turning errors from "the run crashed" into expected observations that route to a bounded, recorded recovery path — or to a clean, honest stop.

Book source
Chapter 12 - Exception Handling and Recovery (异常处理与恢复); PDF outline pages 141-146. The chapter's worked code is a Google ADK SequentialAgent that chains a primary handler, a state-driven fallback handler, and a response agent.

The plan
Five moves.
  1. Frame the core shift: a robust agent treats failure as a normal observation, not an exception that ends the run.
  2. Build the detection layer: how an agent notices something went wrong (bad tool output, HTTP 404/500, timeouts, schema violations, no-progress loops).
  3. Lay out the handling menu the book names: log, retry, fallback, graceful degradation, notify.
  4. Add the recovery layer: rollback, diagnosis, self-correction (where it fuses with reflection), and escalation.
  5. Wire it into a real policy with a retry budget, worked with actual cost arithmetic, then thread it through the running coding/research assistant and the book's ADK primary/fallback example.

Linear position
Prerequisite: Lesson 13, Goal setting and monitoring — recovery only makes sense once the agent has a measurable objective and a notion of progress, because "did this fail?" and "is the goal still reachable?" are both questions about the goal.
New capability: A failure taxonomy, a detection layer, a bounded handling-and-recovery policy (retry / repair / fallback / degrade / escalate / rollback / stop), and clean visible failure when recovery would only hide uncertainty.

1 · The core shift: failure is an observation, not a crash

The book opens with a human analogy: just as a person adapts to an unexpected obstacle, an agent needs robust machinery to detect a problem, start a recovery process, or at minimum fail in a controlled way. That last clause is the whole point. In a notebook demo, the implicit error policy is "let the exception propagate and the cell turns red." In a deployed agent that runs for minutes across dozens of tool calls, an uncaught exception throws away all the work done so far and leaves external state (a half-written file, a partially-applied transaction) in an unknown condition.

So the reframe is: an error is just another observation flowing back into the control loop. Recall the loop from the track's mental model — the agent decides, acts, observes, and updates state. A tool returning HTTP 500 is an observation exactly like a tool returning useful data; it simply routes to a different branch. The skill we are building is not "avoid errors" (impossible in a dynamic real-world environment) but "have a named, bounded, recorded response ready for each class of error."
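
One way to picture the reframe is a thin wrapper that turns exceptions into ordinary return values. This is a minimal sketch, not the book's code: the `Observation` type, the `run_tool` helper, the error labels, and the `flaky_docs_lookup` tool are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Observation:
    """What the control loop sees after an action, success or not."""
    ok: bool
    data: Any = None
    error_type: Optional[str] = None   # hypothetical labels, e.g. "timeout"
    detail: str = ""


def run_tool(tool: Callable[..., Any], *args: Any, **kwargs: Any) -> Observation:
    """Call a tool and convert any exception into an Observation."""
    try:
        return Observation(ok=True, data=tool(*args, **kwargs))
    except TimeoutError as exc:
        return Observation(ok=False, error_type="timeout", detail=str(exc))
    except FileNotFoundError as exc:
        return Observation(ok=False, error_type="not_found", detail=str(exc))
    except Exception as exc:  # unclassified failures still become data
        return Observation(ok=False, error_type="unknown", detail=repr(exc))


def flaky_docs_lookup(query: str) -> str:
    """Hypothetical tool that always times out, for demonstration."""
    raise TimeoutError(f"docs lookup for {query!r} timed out")


obs = run_tool(flaky_docs_lookup, "retry policy")
print(obs.ok, obs.error_type)
```

The loop that consumes `obs` never sees a raised exception; it branches on `obs.error_type` the same way it would branch on useful data.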

The book stresses both proactive and reactive strategies. Proactive: monitoring tool outputs and API responses, even using a separate agent or monitoring system to catch anomalies before they cascade. Reactive: the log / retry / fallback / degrade / notify / recover machinery we build below. It also notes that recovery frequently composes with reflection (lesson 06): after a failure occurs, a reflection pass can analyze why it happened and retry with an improved approach (for example, a rewritten prompt) rather than blindly repeating the same call.

Why this sits after monitoring, not before
You cannot classify a failure without a goal to measure against. "The search returned 3 results" is a success if the goal was "find any relevant paper" and a failure if the goal was "find the 2024 benchmark numbers." A no-progress loop — three actions, zero movement toward the objective — is only detectable because lesson 13 gave the agent a progress signal. Recovery is monitoring's natural sequel.
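
A no-progress check can be sketched in a few lines, assuming the monitoring layer exposes the progress metric as a list of numbers where higher is better. The function name, the `window` default of three actions, and the metric shape are assumptions for illustration.

```python
def no_progress(progress_history: list[float], window: int = 3) -> bool:
    """True if the last `window` actions produced no gain in the progress metric."""
    if len(progress_history) < window + 1:
        return False  # not enough observations to judge
    baseline = progress_history[-(window + 1)]   # value before the window began
    recent = progress_history[-window:]
    return all(value <= baseline for value in recent)


print(no_progress([0.2, 0.4, 0.4, 0.4, 0.4]))  # three actions, zero movement
print(no_progress([0.2, 0.4, 0.4, 0.5]))       # the metric moved within the window
```

Without a goal-derived metric there is nothing to put in `progress_history`, which is why this check only becomes possible after lesson 13.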

2 · Detection — how the agent notices

You cannot recover from a failure you never saw. The book lists concrete detection signals, and a real agent layers several because each catches a different class:

| Signal | What it catches | Example |
|---|---|---|
| Status / error codes | Hard service failures | API returns 404 (not found) or 500 (server error) |
| Timeouts | Hangs and abnormal latency | A tool that normally answers in 800 ms takes > 30 s |
| Schema / format validation | Malformed or invalid output | Model returns prose where typed JSON was required (see lesson 02's contract) |
| Semantic / content checks | Plausible-looking but wrong output | A "summary" that is empty, or a citation to a nonexistent file |
| No-progress detection | Loops that burn budget without advancing | 3 consecutive actions, no change in the monitored progress metric |
| External monitor / second agent | Anomalies the actor misses | A watcher agent flags an unsafe action or an out-of-distribution response |

The cheap signals (codes, timeouts, schema) are necessary but not sufficient. The dangerous failures are the ones that look like success: an API returns 200 OK with an empty body, or the model confidently fabricates. That is why semantic checks and progress monitoring matter — they catch the failures that do not raise an exception.
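
The layering can be sketched as one function that runs the cheap checks first and the semantic check last. This is an illustrative sketch: the failure labels, the 30 s default, and the `required_keys` contract are assumptions, and a real semantic check would be richer than "field is non-empty".

```python
import json
from typing import Optional


def detect_failure(status: int, body: str, elapsed_s: float,
                   timeout_s: float = 30.0,
                   required_keys: tuple[str, ...] = ("summary",)) -> Optional[str]:
    """Return a failure label, or None if every layer passes."""
    # Layer 1: latency and status codes (cheap, catches hard failures).
    if elapsed_s > timeout_s:
        return "timeout"
    if status == 429:
        return "rate_limited"
    if status == 404:
        return "not_found"
    if status >= 500:
        return "service_unavailable"
    if status >= 400:
        return "client_error"
    # Layer 2: success-shaped failure, e.g. 200 OK with an empty body.
    if not body.strip():
        return "empty_content"
    # Layer 3: schema validation against the expected JSON contract.
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "invalid_schema"
    if not isinstance(payload, dict) or any(k not in payload for k in required_keys):
        return "invalid_schema"
    # Layer 4: minimal semantic check, required fields must carry content.
    if any(not str(payload[k]).strip() for k in required_keys):
        return "empty_content"
    return None


print(detect_failure(200, "", 0.2))
print(detect_failure(200, '{"summary": "ok"}', 0.8))
```

Note that the first call has a 200 status and still comes back as a failure, which is the case status codes alone would miss.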

3 · The handling menu — what to do once you have noticed

The book names a specific ladder of responses, ordered roughly from cheapest/most-local to most-disruptive. The art of recovery is matching the class of failure to the cheapest adequate response — not reaching for escalation when a retry would do, and not retrying forever when the failure is permanent.

  • Log: record the failure with enough evidence (command, inputs, error code, response excerpt) to debug and to feed evaluation later. Always do this, regardless of what else you choose.
  • Retry: re-attempt, often with backoff and slightly adjusted parameters. Only valid for transient failures (timeout, rate limit, flaky network) and only for idempotent actions.
  • Fallback: switch to an alternative tool, model, or method that achieves a similar result. The book's ADK example is exactly this: precise-location tool fails → general-area tool.
  • Graceful degradation: when full recovery is impossible, preserve partial function and still deliver some value rather than nothing.
  • Notify: alert a human operator or another agent so a person can intervene or collaborate. This is the bridge to lesson 15.

The "blind retry" foot-gun
Retrying a non-transient failure is the most common recovery bug. The book's trading-bot example is pointed: an insufficient funds or market closed error must not be retried — repeating an invalid trade wastes money and may trip rate limits or fraud controls. Retry is only for failures that might succeed on a second identical attempt. permission_denied, 404, and insufficient funds are permanent for the current inputs; retrying them just amplifies cost.
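
The rule "transient and idempotent only, under a budget" fits in one predicate. The sketch below assumes failures have already been labeled; the label strings and the `should_retry` name are illustrative, and unknown failure types are deliberately treated as not retryable.

```python
# Hypothetical failure labels; a real agent would derive these from its detection layer.
TRANSIENT = {"timeout", "rate_limited", "network_error"}
PERMANENT = {"not_found", "permission_denied", "insufficient_funds", "market_closed"}


def should_retry(failure_type: str, idempotent: bool, attempts: int, budget: int) -> bool:
    """Retry only transient failures of idempotent actions, and only under budget."""
    if failure_type in PERMANENT:
        return False          # a second identical attempt cannot succeed
    if failure_type not in TRANSIENT:
        return False          # unknown class: do not guess, route elsewhere
    return idempotent and attempts < budget


print(should_retry("timeout", idempotent=True, attempts=0, budget=3))
print(should_retry("insufficient_funds", idempotent=True, attempts=0, budget=3))
```

Defaulting unknown classes to "no retry" is a conservative design choice: a missed retry costs one failed step, while a blind retry on a non-idempotent action can duplicate a side effect.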

4 · Recovery — getting back to a stable state

Handling stops the bleeding; recovery restores a known-good state and reduces the chance of recurrence. The book lists four mechanisms: rollback, diagnosis, self-correction (where it fuses with reflection), and escalation.

The non-negotiable rule the book implies and we make explicit: silent recovery is dangerous. The trace must show what failed and which policy was applied. An agent that quietly swallows three failures and returns a confident answer has not recovered — it has hidden uncertainty, which is strictly worse than a visible failure because nobody downstream knows to distrust the result.
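
One way to make recovery non-silent is to have every handler write to a trace and to attach that trace to the final answer. The `Trace` class, its field names, and `final_answer` are hypothetical; this is a sketch of the idea rather than any framework's API.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Append-only record of what failed and which policy was applied."""
    events: list[dict] = field(default_factory=list)

    def record(self, step: str, failure: str, policy: str, evidence: str = "") -> None:
        self.events.append(
            {"step": step, "failure": failure, "policy": policy, "evidence": evidence}
        )


def final_answer(text: str, trace: Trace) -> dict:
    """Attach recovery history so downstream readers know how much to trust the result."""
    return {
        "answer": text,
        "recovered_failures": len(trace.events),
        "caveats": [f'{e["step"]}: {e["failure"]} -> {e["policy"]}' for e in trace.events],
    }


trace = Trace()
trace.record("docs_api", "timeout", "retry", evidence="no response after 30 s")
trace.record("docs_api", "service_unavailable", "fallback")
print(final_answer("Feature added; docs taken from the fallback source.", trace))
```

The answer still arrives, but it carries two caveats, so a reader can see that it depended on a retry and a fallback.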

5 · A real policy, with a retry budget and cost arithmetic

"Retry on failure" is not a policy until it has a budget and a backoff schedule. Unbounded retries are how an agent turns a transient blip into a runaway bill. Make it concrete.

Suppose the coding/research assistant calls a flaky documentation API. Each call costs roughly: 2,000 input + 500 output tokens of model context to interpret the result, plus the API itself. At a model price of $3 / 1M input and $15 / 1M output tokens, one attempt costs:

cost_attempt = (2000 × $3 + 500 × $15) / 1,000,000 = $0.006 + $0.0075 = $0.0135 = 1.35¢

Now compare two policies on a failure that turns out to be permanent (the API endpoint was deprecated, every call returns 404):

| Policy | Arithmetic | Result |
|---|---|---|
| Blind retry, no cap | ∞ · 1.35¢ | runaway |
| Bounded: 3 tries | 3 · 1.35¢ | 4.05¢, then stop |
| Classified first | 404 ⇒ 0 retries | 1.35¢ |
| Backoff total wait | 0.5 + 1 + 2 | 3.5 s |

The lesson in the numbers: classification beats budgeting, and budgeting beats nothing. A bounded retry caps a permanent failure at 4 cents, but classifying the 404 as permanent skips retries entirely and costs one attempt. The budget is the safety net for the failures you didn't classify correctly. Backoff (exponential: 0.5s, 1s, 2s, capped) protects the upstream service from a thundering retry storm and gives a genuinely transient problem time to clear.
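
The arithmetic above can be reproduced in a few lines, which makes it easy to rerun with your own token counts and prices. Token counts and prices come from the text; the function names and the 8 s backoff cap are assumptions, since the text only says the schedule is capped.

```python
INPUT_PRICE_PER_M = 3.00    # dollars per 1M input tokens (from the text)
OUTPUT_PRICE_PER_M = 15.00  # dollars per 1M output tokens (from the text)


def attempt_cost(input_tokens: int = 2000, output_tokens: int = 500) -> float:
    """Model cost in dollars for interpreting one tool result."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def backoff_schedule(retries: int, base_s: float = 0.5, cap_s: float = 8.0) -> list[float]:
    """Exponential waits: base, 2*base, 4*base, ... each capped at cap_s (assumed cap)."""
    return [min(base_s * 2 ** i, cap_s) for i in range(retries)]


per_attempt = attempt_cost()
waits = backoff_schedule(3)
print(f"one attempt: {per_attempt * 100:.2f} cents")
print(f"three tries: {3 * per_attempt * 100:.2f} cents")
print(f"backoff waits: {waits}, total {sum(waits)} s")
```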

Here is the policy as code — the same shape the original sketch had, now with the detection signals, the budget, and the rollback wired in:

def handle(observation, state):
    failure = classify(observation, goal=state.goal)   # uses §2 signals + progress
    log(failure, state)                                # always record evidence

    match failure.type:
        case "timeout" | "rate_limited" if state.attempts < state.budget:
            backoff(0.5 * 2 ** state.attempts)         # 0.5s, 1s, 2s ...
            return retry(state)                        # retry() is assumed to increment state.attempts

        case "invalid_schema":
            return reprompt_with_repair(failure)       # reflection-style self-correction

        case "permission_denied" | "unsafe_action":
            return escalate_to_human(failure)          # never work around (lesson 15)

        case "service_unavailable" if has_fallback(failure.tool):
            return use_fallback_tool(failure)          # the ADK primary→fallback move

        case "test_failed":
            return route_to_debugging(failure)         # running-example self-correction

        case "no_progress" if state.steps_since_gain >= 3:   # 3 consecutive actions without gain (§2)
            return stop_with_partial(state)            # graceful degradation, not a loop

        case _:
            rollback(state)                            # undo half-applied changes
            return fail_visibly(failure)               # honest stop, full trace

Notice what each branch is doing: timeout is transient and idempotent, so it retries under budget; invalid_schema is the model's fault, so it self-corrects; permission_denied is permanent and also a safety boundary, so it escalates rather than retries or works around; service_unavailable takes the fallback path; the default branch rolls back and fails loudly. No branch hides a failure.
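
The policy above leaves its helpers undefined, so here is a runnable sketch of just the retry branch. It assumes the action is idempotent, that transient failures surface as a hypothetical `TransientError`, and that `budget` counts retries after the first attempt. The `sleep` parameter is injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, rate limit)."""


def call_with_retry(action: Callable[[], str], budget: int = 3, base_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep,
                    log: Optional[list] = None) -> str:
    """Run an idempotent action, retrying transient failures up to `budget` times."""
    log = log if log is not None else []
    attempts = 0
    while True:
        try:
            return action()
        except TransientError as exc:
            log.append(("transient", attempts, str(exc)))  # always record evidence
            if attempts >= budget:
                raise                                      # budget spent: fail visibly
            sleep(base_s * 2 ** attempts)                  # 0.5s, 1s, 2s ...
            attempts += 1


calls = {"n": 0}


def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")
    return "ok"


waits: list[float] = []
print(call_with_retry(flaky, sleep=waits.append), waits)
```

Any exception other than `TransientError` propagates on the first attempt, which is the code-level form of "never blind-retry a permanent failure".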

6 · The book's ADK fallback pattern

The chapter's worked code makes the fallback branch concrete using Google ADK's SequentialAgent, which runs its sub-agents in a fixed order and passes shared state between them. Three sub-agents form a layered location lookup:

  1. primary_handler: tries the precise tool get_precise_location_info with the user's address. On failure it sets state["primary_location_failed"] = True.
  2. fallback_handler: reads that state flag. If the primary failed, it extracts just the city and calls the coarser get_general_area_info (graceful degradation: less precise, still useful). If the primary succeeded, it does nothing.
  3. response_agent: reads state["location_result"] and presents it. If it is empty or missing, it apologizes that location could not be determined. That is a clean, honest failure rather than a fabrication.

The architectural insight: recovery here is not a tangle of try/except inside one agent. It is expressed as structure — a sequence where downstream agents inspect shared state and branch on whether upstream succeeded. The failure becomes an observation written to state, exactly the reframe from section 1. That is also why the response agent can degrade gracefully: it treats "no result" as a first-class case with a defined behavior. Other frameworks express the same idea differently — LangGraph routes a failure edge to a recovery node, CrewAI hands a failed task to a fallback crew member — but the contract is identical: failure is data that selects the next node.
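
The same structure can be imitated in plain Python to see how the state flag drives the flow. This sketch deliberately does not use the ADK API: the handlers are ordinary functions, the two tools are stubs (the precise one always fails), and the city extraction is a naive split chosen for illustration.

```python
def get_precise_location_info(address: str) -> str:
    """Stub standing in for the precise tool; here it always fails."""
    raise LookupError(f"no precise match for {address!r}")


def get_general_area_info(city: str) -> str:
    """Stub standing in for the coarser tool."""
    return f"General area: {city}"


def primary_handler(state: dict) -> None:
    try:
        state["location_result"] = get_precise_location_info(state["address"])
        state["primary_location_failed"] = False
    except LookupError as exc:
        state["primary_location_failed"] = True      # failure written to state
        state["trace"].append(f"primary failed: {exc}")


def fallback_handler(state: dict) -> None:
    if not state.get("primary_location_failed"):
        return                                       # primary succeeded, nothing to do
    city = state["address"].split(",")[-1].strip()   # naive city extraction
    state["location_result"] = get_general_area_info(city)
    state["trace"].append("fallback used: general area")


def response_agent(state: dict) -> str:
    result = state.get("location_result")
    return result if result else "Sorry, the location could not be determined."


state = {"address": "1600 Amphitheatre Pkwy, Mountain View", "trace": []}
for step in (primary_handler, fallback_handler):     # fixed order, shared state
    step(state)
print(response_agent(state))
print(state["trace"])
```

No handler calls another directly; each one reads and writes `state`, so the failure of the first step is simply data that the second step branches on.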

7 · Running example: the coding/research assistant

Thread the policy through the track's running agent. The assistant is asked to add a feature and verify it with the test suite.

In every case the trace records what failed and what policy fired, so evaluation (lesson 21) can later ask "how often did the docs API time out?" and "did debugging actually fix the test?" The failure is preserved, not erased.

Failure modes & implementation checklist

Failure modes

  • Blind retry loops that re-attempt permanent failures (404, permission, insufficient funds) and amplify cost without any chance of success.
  • Treating permission denial as a prompt problem — rewording the request or finding a workaround instead of escalating; this also breaches the safety boundary.
  • Silent recovery — swallowing failures and returning a confident answer that hides uncertainty, leaving no trace for debugging or evaluation.
  • Success-shaped failures — a 200 OK with an empty body, or a fabricated citation, that passes status checks but fails semantically.
  • Unbounded budget — retries with no cap and no backoff, turning a transient blip into a runaway bill and a thundering herd on the upstream service.
  • No rollback — leaving external state half-applied after a mid-sequence failure, so the next run starts from an unknown condition.

Implementation checklist

  • What detection signals are wired up (codes, timeouts, schema, semantic, no-progress)?
  • What are the failure types, and which are transient vs permanent?
  • Which failures are retryable (transient and idempotent only)?
  • What is the retry budget and the backoff schedule?
  • Which failures need a fallback tool/model, and which need escalation?
  • Is there a rollback for any action with external side effects?
  • Does every recovery write to the trace (no silent paths)?
  • What does the final failure report contain — enough evidence to debug?

Checkpoint exercise

Try it
Pick one tool in your agent. List five distinct failure types it can produce. For each, decide: transient or permanent? idempotent or not? Then assign exactly one response — retry (with budget), repair, fallback, degrade, escalate, or stop — and say what gets written to the trace. If two failures get the same response, ask whether you are under-classifying.

Where this points next

Several branches in our policy ended at the same place: escalate_to_human. Permission denials, unsafe actions, and irreducible ambiguity are precisely the cases where the agent should not recover on its own — recovering would mean guessing past a boundary that exists for a reason. That hand-off is not an afterthought; it is a designed pattern with its own contract: when to ask, what to show the human, how to resume after approval or correction, and where accountability lives. Lesson 15, Human-in-the-loop, builds exactly that boundary — approval, correction, and escalation — turning "notify a human" from a vague fallback into a first-class control surface.

Takeaway
Reliability is not the absence of failure; it is the presence of a named, bounded, recorded response to every class of failure. Detect with layered signals (codes, timeouts, schema, semantic, no-progress). Match the failure class to the cheapest adequate handler — log always, then retry (transient + idempotent only, under a budget with backoff), fallback, degrade, or notify. Recover to a stable state via rollback, diagnosis, self-correction (reflection), or escalation. The two unbreakable rules: never blind-retry a permanent failure (it only amplifies cost), and never recover silently (it hides uncertainty). Failure is just another observation routing through the loop — handle it on purpose, or fail visibly with a full trace.

Interview prompts