Part II - Core execution patterns
Reflection - critique, repair, and second-pass quality
Lessons 03-05 gave the agent ways to route and spread work, but they all share one blind spot: the loop trusts whatever the model emitted on the first try. Reflection is the pattern that breaks that trust on purpose. It wraps generation in a feedback loop — produce, critique, repair — so the system can catch its own mistakes before they leave the building. This is the first pattern where the agent looks back at what it did and decides it is not good enough yet.
The book's two examples, a LangChain loop that refines a calculate_factorial function with a "senior software engineer" critic and a Google ADK producer-critic pipeline, are threaded through this lesson.
New capability: a bounded producer-critic-reviser loop that inspects an artifact against explicit criteria and improves it before the agent continues. It is the agent's first self-correction mechanism.
1 · The gap a single forward pass cannot fill
By lesson 05 the agent could decompose a task into steps (chaining), pick the right step for the situation (routing), and run independent steps at once (parallelization). Every one of those patterns has the same unstated assumption: the output of each model call is taken as correct and passed forward. If the first draft of a summary misses a key fact, or the first code patch silently breaks a public API, nothing in the chain notices. The error simply flows downstream and becomes the foundation for everything after it.
The reason is mechanical. A language model emits tokens left to right, committing to each one before seeing the consequences of the whole. It has no built-in step that says "now read what you just wrote as if a stranger handed it to you, and look for what is wrong." That re-reading-as-a-stranger step is exactly what humans do — a writer drafts, then edits; an engineer codes, then reviews — and it is what reflection adds to an agent.
The analogy the chapter leans on is the writer's workflow: a first draft is rarely the final draft. What makes the agentic version powerful is that the "edit" step can be far more rigorous than a human re-read — it can run the code, diff against requirements, fact-check claims against sources — because we get to design the critic.
2 · The four-phase loop and the producer-critic split
The book's canonical reflection loop has four phases: execute, critique, refine, and iterate. The first three are the core cycle; the fourth is what makes it iterative rather than a single second pass.
Phase 2 is where reflection lives or dies, and the book makes a strong architectural recommendation about it: split the work into two logical roles, the producer and the critic (also called generator-critic or producer-reviewer). A single agent can critique itself, but two specialized agents — or, cheaply, two LLM calls with different system prompts — give more objective, more structured results. The reason is a cognitive-bias one: a model asked "is what I just wrote good?" is primed to defend it; a model told "you are a strict reviewer, find every flaw" approaches the same text adversarially.
The book's LangChain example makes this concrete. It builds the producer from the task prompt and message history, then builds the critic from a separate reflector_prompt whose system message casts it as a senior engineer and tells it: if the code is perfect and meets every requirement, reply only with CODE_IS_PERFECT; otherwise return a bulleted list of issues. That single sentence does two jobs at once — it defines the rubric ("meets every requirement") and it defines the stop signal (the sentinel string). The Google ADK example uses the same split with explicit agent objects: a DraftWriter (LlmAgent) writes to a shared state key draft_text, and a FactChecker reads that key and writes a structured {status, reasoning} dict to review_output, wired together by a SequentialAgent so the producer always runs before the critic.
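The producer-critic split and the sentinel stop signal can be sketched without any framework. This is a minimal illustration, not the book's code: the prompt wording, the `Review` type, `parse_review`, `one_round`, and the `call_model(system, user)` callable are all hypothetical names introduced here; only the reviewer persona and the `CODE_IS_PERFECT` sentinel come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

SENTINEL = "CODE_IS_PERFECT"  # the clean-bill signal named in the text

# Hypothetical prompt wording: two different system prompts, one per role.
PRODUCER_SYSTEM = "You are a Python developer. Write code that satisfies the task."
CRITIC_SYSTEM = (
    "You are a senior software engineer reviewing code. "
    f"If the code is perfect and meets every requirement, reply only with {SENTINEL}. "
    "Otherwise return a bulleted list of issues."
)


@dataclass
class Review:
    is_clean: bool
    issues: List[str]


def parse_review(reply: str) -> Review:
    """Turn the critic's raw reply into a verdict plus a list of issues."""
    text = reply.strip()
    if text == SENTINEL:  # exact match: the critic was told to reply ONLY with it
        return Review(is_clean=True, issues=[])
    issues = [ln.strip().lstrip("-*• ").strip() for ln in text.splitlines() if ln.strip()]
    return Review(is_clean=False, issues=issues)


def one_round(task: str, call_model: Callable[[str, str], str]) -> Tuple[str, Review]:
    """Produce once, critique once. `call_model(system, user)` is an assumed interface."""
    draft = call_model(PRODUCER_SYSTEM, task)
    # The critic sees the original task alongside the draft.
    reply = call_model(CRITIC_SYSTEM, f"Task:\n{task}\n\nCode:\n{draft}")
    return draft, parse_review(reply)
```

Matching the sentinel exactly, rather than by substring, is a deliberate choice in this sketch: a reply such as "not CODE_IS_PERFECT because..." should count as a critique, not a pass.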
SequentialAgent is a single-step reflection: produce once, critique once, stop. The book is explicit that true iterative reflection (looping until the critic is satisfied) needs stateful orchestration: a LoopAgent in ADK, a graph in LangGraph, or a hand-written for loop in plain code. A linear chain (LangChain LCEL) can demonstrate one round cheaply, but it cannot carry state across rounds on its own. Know which one you are building.
3 · The central trade-off: quality up, cost up, and where to stop
Reflection is never free. Every extra round is at least one more critic call and one more producer call, which means more tokens, more latency, more money, and a conversation history that grows every iteration (the producer's draft, the critique, the next draft all accumulate in context). The book's rule of thumb: reach for reflection when quality, accuracy, and detail matter more than speed and cost — and not otherwise.
The decision is quantitative, so let us actually do the arithmetic. Take the running code-patch task with a reasonably capable model:
- Each producer call: ~1,500 input tokens (task + growing history) and ~600 output tokens.
- Each critic call: ~1,200 input tokens (task + current draft) and ~300 output tokens.
- Pricing, for the math: $3 per million input tokens, $15 per million output tokens. Latency ≈ 2.5 s per call.
cost per round ≈ (1,500 + 1,200) · $3/M + (600 + 300) · $15/M = $0.0081 + $0.0135 = $0.0216
latency per round ≈ 2 calls · 2.5 s = 5 s
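The same arithmetic can be checked in a few lines. This sketch uses only the token counts, prices, and latency listed above; it assumes flat per-call token counts, so it understates the cost of a history that grows each round. The function names are illustrative.

```python
PRICE_IN = 3.0 / 1_000_000    # dollars per input token
PRICE_OUT = 15.0 / 1_000_000  # dollars per output token
LATENCY_PER_CALL_S = 2.5


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model call at the prices above."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT


producer_cost = call_cost(1_500, 600)          # one producer call
critic_cost = call_cost(1_200, 300)            # one critic call
round_cost = producer_cost + critic_cost       # one reflection round = critic + producer
round_latency_s = 2 * LATENCY_PER_CALL_S


def totals(rounds: int):
    """Total (cost, latency) of one bare generation plus `rounds` reflection rounds."""
    cost = producer_cost + rounds * round_cost
    latency_s = LATENCY_PER_CALL_S + rounds * round_latency_s
    return cost, latency_s


if __name__ == "__main__":
    for rounds in range(4):
        cost, latency_s = totals(rounds)
        print(f"{rounds} rounds: ${cost:.4f}, {latency_s:.1f} s")
```

Under these assumptions a bare generation costs about $0.0135 and 2.5 s, and three rounds bring the total to about $0.078 and 17.5 s.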
Worked number. One reflection round adds ~$0.022 and ~5 s on top of the bare generation, which is a single producer call at ~$0.0135 and ~2.5 s. Three rounds add ~$0.065 and ~15 s. At these flat per-call figures, one round brings the total to roughly 2.6× the cost and 3× the latency of a no-reflection answer, and three rounds bring it to roughly 6× and 7×. The question is whether the quality it buys is worth that multiplier, and the catch is that quality improvements diminish. Round 1 typically fixes the obvious, high-severity bugs (the missing negative-input check, the absent docstring). Round 2 catches subtler issues. By round 4 the critic is mostly polishing wording, while you are still paying full freight per round. The economically correct move is to spend rounds only while the marginal quality gain exceeds the marginal cost, which for most tasks means 1 to 3 rounds, with an early exit the moment the critic returns its clean-bill signal.
The trade-off has a simple shape. Quality rises with diminishing returns (each round closes a fraction of the remaining gap to perfect); cost and latency rise linearly. Net value, the worth of the quality gained minus the cost paid, therefore peaks and then falls as rounds are added, and that peak is your stop round.
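A toy model makes the shape of that curve concrete. Everything numeric here is an assumption chosen for illustration, not a measurement: a first draft at 60% quality, each round closing half of the remaining gap, a perfect answer worth $0.50, and the ~$0.0216 per-round cost from the arithmetic above.

```python
def quality(rounds: int, base: float = 0.6, close_fraction: float = 0.5) -> float:
    """Quality in [0, 1]: each round closes `close_fraction` of the remaining gap."""
    return 1.0 - (1.0 - base) * (1.0 - close_fraction) ** rounds


def net_value(
    rounds: int,
    value_of_perfect: float = 0.50,   # assumed dollar value of a perfect answer
    cost_per_round: float = 0.0216,   # per-round cost from the worked arithmetic
) -> float:
    """Dollar value of the quality reached minus the linear cost of the rounds."""
    return value_of_perfect * quality(rounds) - cost_per_round * rounds


def best_stop_round(max_rounds: int = 8) -> int:
    """Round count at which net value peaks."""
    return max(range(max_rounds + 1), key=net_value)


if __name__ == "__main__":
    for r in range(6):
        print(f"round {r}: quality={quality(r):.3f}  net=${net_value(r):.4f}")
    print("stop after round", best_stop_round())
```

With these assumed parameters the marginal gain of round 4 (about $0.0125) drops below the per-round cost, so net value peaks at round 3. A lower value per answer or a weaker critic moves the peak earlier.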
Set a hard max_iterations ceiling (the LangChain example uses 3) in addition to the satisfied-critic exit. Two stop conditions, always.
4 · Reasoning longer vs checking better
Reflection is easy to confuse with "let the model think more," but the book draws a sharp line, and it is the most important pedagogical point in the chapter: reflection is about checking better, not reasoning longer. Giving the same model another turn to "make it better" with no structure produces self-talk — it changes tone, reorders sentences, and feels like progress while leaving correctness untouched. Real reflection needs four ingredients, and without them it degenerates into expensive noise:
What a real critique needs
- A rubric: the explicit criteria to judge against (facts accurate? edge cases handled? style met? requirements covered?). The critic checks the rubric, not vibes.
- The original task: the critic must see the goal contract, or it cannot tell whether the draft satisfies it. The book passes the original task prompt into the critic alongside the draft.
- Evidence or tests: ground truth to check against — source documents for a summary, a test suite or static analysis for code. This is what makes the critique objective rather than opinion.
- A stop rule: a clean-bill signal (CODE_IS_PERFECT), no material critique remaining, a round budget, or escalation to a human/tool.
Properties of a good critique
- Specific: "line 4 does not raise ValueError for negative n," not "improve error handling."
- Testable: phrased so the reviser (or a tool) can verify the fix landed.
- Prioritized: blocking correctness issues before cosmetic ones, so a one-round budget spends on what matters.
- Goal-tied: every point traces to a requirement in the task — a critic that invents new requirements is a failure mode, not rigor.
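One way to enforce these four properties is to make the critic return structured items instead of free text. The sketch below is an assumption about how that could look, not something the book prescribes; `CritiqueItem`, `Severity`, and `triage` are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List, Set


class Severity(IntEnum):
    BLOCKING = 0   # correctness issues come first
    MAJOR = 1
    COSMETIC = 2


@dataclass(frozen=True)
class CritiqueItem:
    location: str      # specific: where the problem is
    problem: str       # specific: what is wrong
    check: str         # testable: how to verify the fix landed
    requirement: str   # goal-tied: which task requirement it traces to
    severity: Severity  # prioritized


def triage(items: Iterable[CritiqueItem], requirements: Set[str], budget: int) -> List[CritiqueItem]:
    """Drop items that trace to no known requirement, then keep the `budget` most severe."""
    tied = [item for item in items if item.requirement in requirements]
    return sorted(tied, key=lambda item: item.severity)[:budget]
```

Filtering on `requirement` is what stops a critic from inventing new requirements, and sorting by severity is what lets a one-round budget spend on blocking issues first.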
The book also notes two compounding effects worth knowing for later lessons. Reflection pairs naturally with goal setting and monitoring (lesson 13): the goal supplies the final standard the critic judges against, and monitoring feeds deviation signals back into the critique. It pairs equally well with memory (lesson 10): given conversation memory, the critic evaluates the draft in light of past interactions and prior critiques, so the agent stops repeating the same mistakes, and reflection becomes cumulative rather than a fresh, isolated event each time.
5 · Running example: the code-patch quality gate
Thread it through our coding/research assistant. Before the agent finalizes a code patch, it does not just emit the diff — it runs a reflection loop with a critic playing senior reviewer, exactly mirroring the book's calculate_factorial example but on a real patch:
    def patch_with_reflection(task, evidence, rubric, max_rounds=3):
        history = [system_producer(), user(task, evidence)]
        draft = producer(history)                      # phase 1: execute
        for _ in range(max_rounds):
            critique = critic({                        # phase 2: critique (different persona)
                "task": task,                          #   - sees the original objective
                "draft": draft,
                "rubric": rubric,                      #   - checks the explicit rubric
                "test_results": run_tests(draft),      #   - grounded in real evidence
            })
            if critique.is_clean:                      # stop rule A: clean-bill signal
                return draft
            history += [assistant(draft), user(f"Apply this critique:\n{critique}")]
            draft = producer(history)                  # phase 3: refine using the critique
        return draft                                   # stop rule B: round budget exhausted
The rubric the critic enforces is the quality gate: is the diff minimal; were tests actually run and passing; are public APIs preserved; does the final report cite evidence? Notice that "were tests run and passing" is not a matter of model opinion — it comes from run_tests(draft). That is the single most reliable critique source, and it is the seam to the next lesson: right now run_tests is hand-waved, but it is really a tool call, and once the critic can invoke tools, its verdict stops being a plausible-sounding judgment and becomes a fact.
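To show what a concrete run_tests could look like, here is a self-contained sketch in which the critic's verdict comes entirely from executing the draft. It is an illustration under stated assumptions: the test cases, `DRAFT_V1`, `DRAFT_V2`, `scripted_producer` (a stand-in for a model call that replays canned drafts), and `reflect` are all hypothetical, and `exec` on model output is only acceptable here because the drafts are hard-coded; real drafts would need a sandbox.

```python
from typing import Callable, List, Tuple

DRAFT_V1 = """
def calculate_factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
"""

DRAFT_V2 = """
def calculate_factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
"""


def run_tests(source: str) -> List[str]:
    """Execute the draft in a fresh namespace; return failure messages (empty = clean)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # ASSUMPTION: trusted input; sandbox real model output
        fn = namespace["calculate_factorial"]
    except Exception as exc:
        return [f"draft does not load: {exc!r}"]
    failures = []
    for n, expected in [(0, 1), (1, 1), (5, 120)]:
        try:
            got = fn(n)
        except Exception as exc:
            failures.append(f"calculate_factorial({n}) raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"calculate_factorial({n}) returned {got}, expected {expected}")
    try:
        fn(-1)
        failures.append("calculate_factorial(-1) did not raise ValueError")
    except ValueError:
        pass
    except Exception as exc:
        failures.append(f"calculate_factorial(-1) raised {exc!r}, expected ValueError")
    return failures


def scripted_producer(drafts: List[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for a model: returns the next canned draft, ignoring feedback."""
    remaining = iter(drafts)
    return lambda feedback: next(remaining)


def reflect(producer: Callable[[List[str]], str], max_rounds: int = 3) -> Tuple[str, bool]:
    """Produce, test, revise; stop on a clean run or when the round budget is spent."""
    draft = producer([])
    for _ in range(max_rounds):
        failures = run_tests(draft)
        if not failures:                 # stop rule A: executable checks all pass
            return draft, True
        draft = producer(failures)       # revise using the concrete failure list
    return draft, not run_tests(draft)   # stop rule B: budget exhausted; report final status


if __name__ == "__main__":
    final, ok = reflect(scripted_producer([DRAFT_V1, DRAFT_V2]))
    print("passed" if ok else "still failing")
```

The failure messages double as the critique: each one is specific, testable, and tied to a stated requirement, which is why executable checks make such a reliable critic.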
Where this points next
Reflection gives the agent a feedback loop, but in everything above the critic's most trustworthy signals — "tests pass," "this claim matches the source," "the API still type-checks" — were borrowed from outside the model. The pattern that lets the agent fetch those signals for itself is tool use. Lesson 07, Tool use - function calling as controlled action, builds the mechanism by which an agent leaves language and interacts with live systems: running a test suite, querying a database, calling an API. Once it exists, the strongest reflection loops replace model-judged critique with executable checks wherever they can — which is the whole reason this lesson's running example reached for run_tests before we had even defined what a tool was.
A SequentialAgent gives single-step reflection; true iterative reflection needs stateful orchestration (LangGraph, ADK LoopAgent, or an explicit loop).
Interview prompts
- What does reflection add that chaining and routing do not? (§1 — a feedback loop: the agent re-reads its own output, finds problems, and produces a better version, instead of passing the first result forward unchecked. It is the agent's first self-correction mechanism.)
- Why use a separate critic agent instead of asking the producer to critique itself? (§2 — to avoid the cognitive bias of self-review; a different system prompt ("you are a strict reviewer") approaches the text adversarially and returns more objective, structured feedback. The book implements this with a distinct critic persona / a second LlmAgent.)
- How do you stop a reflection loop, and why do you need two conditions? (§2, §3 — a satisfied-critic signal like CODE_IS_PERFECT for the normal case, AND a hard max_iterations cap so an over-strict critic cannot loop forever, overflow context, or hit rate limits.)
- Reflection roughly tripled your cost and latency. How do you decide it is worth it and where to stop? (§3 — quality gains diminish each round (~closing a fraction of the remaining gap) while cost grows linearly, so net value peaks at a small round count; reach for reflection only when quality/accuracy outweigh speed/cost, then spend 1-3 rounds with an early exit on a clean critique.)
- What separates real reflection from expensive self-talk? (§4 — a rubric, access to the original task, evidence or executable tests to check against, and a stop rule. Without them "make it better" just changes tone, not correctness; a critic that invents new requirements is a failure, not rigor.)
- Where does reflection connect to other patterns? (§4, §5 — to goal-setting/monitoring (the goal is the standard the critic judges against), to memory (critiques accumulate so mistakes aren't repeated), and to tool use (executable checks replace model-judged critique for a far more reliable verdict).)
Failure modes
- A vague "make it better" pass that changes tone and wording but not correctness — reasoning longer instead of checking better.
- A critic with no rubric or task context that invents new requirements the task never asked for.
- An over-strict critic that never emits the clean-bill signal, so the loop runs to the cap (or forever) burning cost.
- Infinite or unbudgeted revise loops that overflow the context window or trip API rate limits.
- Producer and critic sharing one persona/prompt, so the "critique" is self-defense and finds nothing.
Implementation checklist
- What explicit rubric does the critic check against, and does it see the original objective?
- Is the critic a distinct persona / separate call from the producer?
- Which critique signals can be replaced by executable checks (tests, static analysis, source-matching)?
- Two stop conditions present: a satisfied-critic signal and a hard round cap?
- Is this single-step (chain / SequentialAgent) or iterative (LangGraph / LoopAgent / explicit loop) — and is the orchestration stateful enough for what you chose?
- Are producer, critic, and revision artifacts kept separate and logged for evaluation (lesson 21)?