
Part II - Core execution patterns

Reflection - critique, repair, and second-pass quality

Lessons 03-05 gave the agent ways to route and spread work, but they all share one blind spot: the loop trusts whatever the model emitted on the first try. Reflection is the pattern that breaks that trust on purpose. It wraps generation in a feedback loop — produce, critique, repair — so the system can catch its own mistakes before they leave the building. This is the first pattern where the agent looks back at what it did and decides it is not good enough yet.

Book source
Chapter 4 - Reflection (反思); PDF outline pages 53-62. The chapter's two worked examples — a LangChain loop that iteratively writes and critiques a calculate_factorial function with a "senior software engineer" critic, and a Google ADK producer-critic pipeline — are threaded through this lesson.
The plan
Five moves. (1) Name the gap reflection fills — why a single forward pass is structurally unable to check itself, and how a feedback loop changes that. (2) Build the canonical four-phase loop (execute → critique → revise → iterate) and the producer-critic split that makes the critique honest. (3) Make the central trade-off concrete with a worked cost/latency calculation and an interactive widget: quality rises with diminishing returns while cost grows linearly, so there is an optimal stopping round. (4) Separate the two things people conflate — checking harder (rubric, tests) versus reasoning longer. (5) Thread it through the running code-patch example, then hand off to tool use (lesson 07), where the critique stops being an opinion and becomes a test result.
Linear position
Prerequisite: lessons 03-05 — a prompt chain, a router, or a parallel fan-in that produces an artifact (a draft, a plan, a patch) worth evaluating. Reflection needs something to reflect on.
New capability: a bounded producer-critic-reviser loop that inspects an artifact against explicit criteria and improves it before the agent continues — the agent's first self-correction mechanism.

1 · The gap a single forward pass cannot fill

By lesson 05 the agent could decompose a task into steps (chaining), pick the right step for the situation (routing), and run independent steps at once (parallelization). Every one of those patterns has the same unstated assumption: the output of each model call is taken as correct and passed forward. If the first draft of a summary misses a key fact, or the first code patch silently breaks a public API, nothing in the chain notices. The error simply flows downstream and becomes the foundation for everything after it.

The reason is mechanical. A language model emits tokens left to right, committing to each one before seeing the consequences of the whole. It has no built-in step that says "now read what you just wrote as if a stranger handed it to you, and look for what is wrong." That re-reading-as-a-stranger step is exactly what humans do — a writer drafts, then edits; an engineer codes, then reviews — and it is what reflection adds to an agent.

Definition
Reflection is the pattern in which an agent evaluates its own output (or internal state) and uses that evaluation to improve. Unlike chaining or routing, which pass a result forward, reflection introduces a feedback loop: the agent generates an output, examines it for problems, and produces a better version. The book frames it as a metacognitive layer — the agent thinks about its own thinking — and the central self-correction mechanism that turns a one-shot executor into a system that can recognize and fix its own mistakes.

The analogy the chapter leans on is the writer's workflow: a first draft is rarely the final draft. What makes the agentic version powerful is that the "edit" step can be far more rigorous than a human re-read — it can run the code, diff against requirements, fact-check claims against sources — because we get to design the critic.

2 · The four-phase loop and the producer-critic split

The book's canonical reflection loop has four phases. The first three are the core cycle; the fourth is what makes it iterative rather than a single second pass.

┌───────────────────────────────────────────────────────────────────┐
│                                                                   │
▼                                                                   │
(1) EXECUTE ──▶ draft / plan / patch                                │
│                                                                   │
▼                                                                   │
(2) EVALUATE / CRITIQUE ──▶ check facts, coherence, style,          │
│                           completeness, instruction-following     │
▼                                                                   │
├── critique is empty (e.g. "CODE_IS_PERFECT") ──▶ STOP ──▶ output  │
│                                                                   │
▼                                                                   │
(3) REFINE ──▶ apply the critique to produce a better version ──────┘
│                                                         (4) ITERATE
└── OR: round budget exhausted ──────────────────▶ STOP ──▶ output

Phase 2 is where reflection lives or dies, and the book makes a strong architectural recommendation about it: split the work into two logical roles, the producer and the critic (also called generator-critic or producer-reviewer). A single agent can critique itself, but two specialized agents (or, more cheaply, two LLM calls with different system prompts) give more objective, more structured results. The reason is framing bias: a model asked "is what I just wrote good?" is primed to defend it, while a model told "you are a strict reviewer, find every flaw" approaches the same text adversarially.

Producer
Generates content. Writes the code, drafts the blog post, builds the plan. System prompt is goal-oriented: "produce the best first version you can." Owns phases 1 and 3 (it also applies the critique in the refine step).
Critic
Evaluates content. A different persona — the book's examples literally instruct it: "You are a senior software engineer expert in Python" or "You are a rigorous fact-checker." Checks against specific criteria and returns structured, prioritized feedback (or a clean-bill signal). Owns phase 2.
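
The two roles can be sketched as two calls to one model that differ only in their system prompt. This is a minimal sketch under stated assumptions, not the book's code: `LLM`, `make_role`, and `fake_llm` are hypothetical names introduced here, and `fake_llm` is a stand-in that only echoes which persona handled the message so the snippet runs without an API key. The critic persona combines the book's "senior software engineer" wording with the "strict reviewer" instruction quoted above.

```python
from typing import Callable

# Hypothetical model interface (assumption): (system_prompt, user_message) -> reply.
LLM = Callable[[str, str], str]

PRODUCER_SYSTEM = "Produce the best first version you can."
CRITIC_SYSTEM = (
    "You are a senior software engineer expert in Python. "
    "You are a strict reviewer: find every flaw."
)

def make_role(llm: LLM, system_prompt: str) -> Callable[[str], str]:
    """Bind one system prompt to the shared model, yielding a single-role callable."""
    def role(message: str) -> str:
        return llm(system_prompt, message)
    return role

def fake_llm(system_prompt: str, message: str) -> str:
    """Stand-in model (assumption): tags the reply with the persona that handled it."""
    persona = "critic" if "reviewer" in system_prompt else "producer"
    return f"[{persona}] {message}"

producer = make_role(fake_llm, PRODUCER_SYSTEM)
critic = make_role(fake_llm, CRITIC_SYSTEM)

draft = producer("Write calculate_factorial(n).")
review = critic(f"Review this draft:\n{draft}")
print(draft)
print(review)
```

Swapping `fake_llm` for a real client changes nothing else in the structure: the separation lives entirely in the two system prompts.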

The book's LangChain example makes this concrete. It builds the producer from the task prompt and message history, then builds the critic from a separate reflector_prompt whose system message casts it as a senior engineer and tells it: if the code is perfect and meets every requirement, reply only with CODE_IS_PERFECT; otherwise return a bulleted list of issues. That single sentence does two jobs at once — it defines the rubric ("meets every requirement") and it defines the stop signal (the sentinel string). The Google ADK example uses the same split with explicit agent objects: a DraftWriter (LlmAgent) writes to a shared state key draft_text, and a FactChecker reads that key and writes a structured {status, reasoning} dict to review_output, wired together by a SequentialAgent so the producer always runs before the critic.
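
The sentinel-or-bullets contract can be turned into a small parser on the loop side. The sketch below is one possible reading, not the book's implementation: `Critique` and `parse_critique` are hypothetical names, and requiring the reply to equal the sentinel exactly (rather than merely contain it) is a design choice made here so that a sentence such as "this is not CODE_IS_PERFECT" is not mistaken for a clean bill.

```python
from dataclasses import dataclass, field

SENTINEL = "CODE_IS_PERFECT"  # clean-bill string named in the lesson

@dataclass
class Critique:
    is_clean: bool
    issues: list[str] = field(default_factory=list)

def parse_critique(reply: str) -> Critique:
    """Turn the critic's raw reply into a stop signal plus a list of issues."""
    text = reply.strip()
    if text == SENTINEL:
        return Critique(is_clean=True)
    issues = []
    for line in text.splitlines():
        line = line.strip()
        if line[:1] in ("-", "*", "•"):       # accept common bullet markers
            issues.append(line[1:].strip())
    # A reply with no parseable bullets is kept whole rather than treated as clean.
    return Critique(is_clean=False, issues=issues or [text or "empty critic reply"])

print(parse_critique("CODE_IS_PERFECT"))
print(parse_critique("- no docstring\n- negative n is not rejected"))
```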

Single-step vs truly iterative
That ADK SequentialAgent is a single-step reflection: produce once, critique once, stop. The book is explicit that true iterative reflection (looping until the critic is satisfied) needs stateful orchestration: a LoopAgent in ADK, a graph in LangGraph, or a hand-written for loop in plain code. A linear chain (LangChain LCEL) can demonstrate one round cheaply, but it cannot carry state across rounds on its own, which is why the book's iterative LangChain example drives its calls from an explicit loop with an iteration cap. Know which one you are building.
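
The difference between the two shapes is easiest to see in plain code. The following is a framework-free sketch, not ADK or LangGraph code: `single_step`, `iterative`, `LoopState`, and the toy producer and reviewer are all hypothetical names introduced for illustration. The point is what `LoopState` holds, since that is the state a linear chain has nowhere to keep.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Produce = Callable[[str, Optional[str]], str]   # (task, critique or None) -> draft
Review = Callable[[str, str], Optional[str]]    # (task, draft) -> critique, None if satisfied

def single_step(task: str, produce: Produce, review: Review):
    """Produce once, critique once, stop. Nothing is fed back to the producer."""
    draft = produce(task, None)
    return draft, review(task, draft)

@dataclass
class LoopState:
    """What must survive between rounds for the loop to be iterative."""
    draft: str
    critiques: list[str] = field(default_factory=list)
    rounds: int = 0

def iterative(task: str, produce: Produce, review: Review, max_rounds: int = 3) -> LoopState:
    state = LoopState(draft=produce(task, None))
    while state.rounds < max_rounds:
        critique = review(task, state.draft)
        if critique is None:                      # critic satisfied
            break
        state.critiques.append(critique)
        state.draft = produce(task, critique)     # the critique is fed back
        state.rounds += 1
    return state

# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_produce(task, critique):
    return "v2 with docstring" if critique else "v1"

def toy_review(task, draft):
    return None if "docstring" in draft else "add a docstring"

print(single_step("write f", toy_produce, toy_review))
print(iterative("write f", toy_produce, toy_review))
```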

3 · The central trade-off: quality up, cost up, and where to stop

Reflection is never free. Every extra round is at least one more critic call and one more producer call, which means more tokens, more latency, more money, and a conversation history that grows every iteration (the producer's draft, the critique, the next draft all accumulate in context). The book's rule of thumb: reach for reflection when quality, accuracy, and detail matter more than speed and cost — and not otherwise.

The decision is quantitative, so let us actually do the arithmetic. Take the running code-patch task with a reasonably capable model:

cost per round ≈ (1500+1200)·$3/1e6 + (600+300)·$15/1e6 = $0.0081 + $0.0135 ≈ $0.022
latency per round ≈ 2 calls · 2.5 s = 5 s

Worked number. One reflection round adds ~$0.022 and ~5 s on top of the bare generation. Three rounds: ~$0.066 and ~15 s. Measured against a no-reflection answer, which is a single ~2.5 s call, that is seven calls instead of one: roughly a 7× latency multiplier, with cost rising by a multiple of the same order. The question is whether the quality it buys is worth that multiplier, and the catch is that quality improvements diminish. Round 1 typically fixes the obvious, high-severity bugs (the missing negative-input check, the absent docstring). Round 2 catches subtler issues. By round 4 the critic is mostly polishing wording, while you are still paying full freight per round. The economically correct move is to spend rounds only while the marginal quality gain exceeds the marginal cost, which for most tasks means 1 to 3 rounds, with an early exit the moment the critic returns its clean-bill signal.
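
The same arithmetic can be reproduced in a few lines, which makes it easy to substitute your own token counts and prices. The numbers are copied from the formula above; reading the first group as input tokens and the second as output tokens is an assumption, as are the constant names. Totals are computed from the unrounded per-round cost, so they can differ in the last digit from figures built on the rounded $0.022.

```python
# Token counts and prices come from the lesson's formula; the input/output
# labelling of the two groups is an assumption.
INPUT_TOKENS = 1500 + 1200     # summed over the two calls in one round
OUTPUT_TOKENS = 600 + 300
PRICE_IN = 3 / 1e6             # $ per input token
PRICE_OUT = 15 / 1e6           # $ per output token
CALLS_PER_ROUND = 2            # one critic call plus one producer call
SECONDS_PER_CALL = 2.5

cost_per_round = INPUT_TOKENS * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT
latency_per_round = CALLS_PER_ROUND * SECONDS_PER_CALL

for rounds in (1, 2, 3):
    added_cost = rounds * cost_per_round
    added_latency = rounds * latency_per_round
    print(f"{rounds} round(s): +${added_cost:.4f}, +{added_latency:.1f} s")
```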

The widget below makes this trade-off something you can feel. Quality rises with diminishing returns (each round closes a fraction of the remaining gap to perfect); cost and latency rise linearly. Move the slider and watch the "net value" curve peak and then fall — that peak is your stop round.

Reflection economics — find the round where you should stop
Each round closes a fraction (the critic effectiveness) of the remaining gap between current quality and perfect (1.0), so quality climbs but flattens. Cost and latency grow linearly per round. Net value = value-of-quality minus cost-of-rounds. The bar that peaks is the round you should stop at. Raise the value-of-quality and reflection pays for more rounds; lower the critic's effectiveness and it stops paying almost immediately.
Widget readout as captured: cost per round $0.022; best stop round 1; quality there 0.75; net value there $0.13.
The core JS behind the widget:
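const COST = 0.022;            // $ added per reflection round (producer + critic)
// q0, eff, valueFull, and R are inputs defined outside this excerpt
let q = q0;                    // quality after 0 rounds
let best = { round: 0, net: valueFull * q0 };  // round 0: no reflection, no added cost
for (let r = 1; r <= R; r++) {
  q = q + eff * (1 - q);       // close `eff` of the remaining gap to perfect
  const value = valueFull * q; // $ value of the artifact at this quality
  const net   = value - COST * r;  // minus the cost of r rounds
  if (net > best.net) best = { round: r, net };  // the r that maximizes `net` is the round to stop at
}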
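
For readers who want to experiment without the widget, the same model can be run in Python. This is a sketch of the quality and net-value formulas described above, not a port of the widget: `stop_round` is a hypothetical name, and the inputs in the usage lines (starting quality 0.5, critic effectiveness 0.5, artifact value $0.50) are illustrative assumptions rather than the widget's defaults.

```python
def stop_round(q0: float, eff: float, value_full: float,
               cost_per_round: float = 0.022, max_rounds: int = 6):
    """Return (best_round, rows); each row is (round, quality, net value in $)."""
    q = q0
    rows = [(0, q, value_full * q)]           # round 0: no reflection, no added cost
    for r in range(1, max_rounds + 1):
        q += eff * (1 - q)                    # close `eff` of the remaining gap to 1.0
        rows.append((r, q, value_full * q - cost_per_round * r))
    best_round = max(rows, key=lambda row: row[2])[0]
    return best_round, rows

# Illustrative inputs (assumptions, not the widget's defaults).
best, rows = stop_round(q0=0.5, eff=0.5, value_full=0.50)
for r, q, net in rows:
    marker = "  <- stop here" if r == best else ""
    print(f"round {r}: quality={q:.3f} net=${net:.3f}{marker}")
```

With these inputs the net value peaks at round 3. Dropping `value_full` to 0.20 moves the peak to round 2, and dropping `eff` to 0.1 as well moves it to round 0, meaning reflection never pays for itself under those assumptions.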
Trap: the runaway loop
If the critic is too strict it never returns the clean-bill signal, and a "loop until perfect" with no cap runs forever — burning money and eventually overflowing the context window or hitting API rate limits. The book lists context-window overflow and rate-limiting as named risks. Every iterative reflection loop needs a hard max_iterations ceiling (the LangChain example uses 3) in addition to the satisfied-critic exit. Two stop conditions, always.
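
A loop that enforces both exits, and reports which one fired, might look like the sketch below. All names are hypothetical. The third guard, a character budget used as a crude proxy for context growth, goes beyond the two stop conditions named above and is an assumption added here; a real system would count tokens instead.

```python
from typing import Callable, Optional

def bounded_reflection(draft: str,
                       review: Callable[[str], Optional[str]],
                       revise: Callable[[str, str], str],
                       max_iterations: int = 3,
                       max_context_chars: int = 20_000):
    """Run critique/revise rounds; return (final_draft, stop_reason, rounds_used)."""
    context_chars = len(draft)
    for i in range(max_iterations):
        critique = review(draft)
        if critique is None:                      # exit 1: critic satisfied
            return draft, "critic_satisfied", i
        context_chars += len(critique)
        if context_chars > max_context_chars:     # extra guard: context growth
            return draft, "context_budget", i
        draft = revise(draft, critique)
        context_chars += len(draft)
    return draft, "max_iterations", max_iterations   # exit 2: hard cap

# An over-strict critic (assumption for the demo) that never gives a clean bill.
def never_happy(draft: str) -> Optional[str]:
    return "still not perfect"

def append_fix(draft: str, critique: str) -> str:
    return draft + " (revised)"

print(bounded_reflection("v1", never_happy, append_fix))
```

Returning the stop reason alongside the draft makes an over-strict critic visible in logs: a loop that always ends on `max_iterations` is a sign the critic prompt or rubric needs attention.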

4 · Reasoning longer vs checking better

Reflection is easy to confuse with "let the model think more," but the book draws a sharp line, and it is the most important pedagogical point in the chapter: reflection is about checking better, not reasoning longer. Giving the same model another turn to "make it better" with no structure produces self-talk — it changes tone, reorders sentences, and feels like progress while leaving correctness untouched. Real reflection needs four ingredients, and without them it degenerates into expensive noise:

What a real critique needs

  • A rubric: the explicit criteria to judge against (facts accurate? edge cases handled? style met? requirements covered?). The critic checks the rubric, not vibes.
  • The original task: the critic must see the goal contract, or it cannot tell whether the draft satisfies it. The book passes the original task prompt into the critic alongside the draft.
  • Evidence or tests: ground truth to check against — source documents for a summary, a test suite or static analysis for code. This is what makes the critique objective rather than opinion.
  • A stop rule: a clean-bill signal (CODE_IS_PERFECT), no material critique remaining, a round budget, or escalation to a human/tool.
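
The four ingredients above map directly onto the sections of a critic message. The sketch below shows one way to assemble them; `build_critic_prompt`, the section labels, and the sample task and rubric are hypothetical choices made for illustration, with only the `CODE_IS_PERFECT` sentinel taken from the lesson.

```python
SENTINEL = "CODE_IS_PERFECT"  # clean-bill string named in the lesson

def build_critic_prompt(task: str, draft: str, rubric: list[str], evidence: str) -> str:
    """Assemble a critic message carrying the task, rubric, evidence, and stop rule."""
    rubric_lines = "\n".join(f"{i}. {item}" for i, item in enumerate(rubric, start=1))
    sections = [
        f"ORIGINAL TASK:\n{task}",
        f"RUBRIC (judge only against these points):\n{rubric_lines}",
        f"EVIDENCE (test output, sources):\n{evidence}",
        f"DRAFT UNDER REVIEW:\n{draft}",
        f"If the draft satisfies every rubric point, reply only with {SENTINEL}. "
        "Otherwise reply with a bulleted list of issues, most severe first.",
    ]
    return "\n\n".join(sections)

prompt = build_critic_prompt(
    task="Write calculate_factorial(n); raise ValueError for negative n.",
    draft="def calculate_factorial(n): ...",
    rubric=["Handles n = 0", "Raises ValueError for negative n", "Has a docstring"],
    evidence="test_negative_input: FAILED",
)
print(prompt)
```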

Properties of a good critique

  • Specific: "line 4 does not raise ValueError for negative n," not "improve error handling."
  • Testable: phrased so the reviser (or a tool) can verify the fix landed.
  • Prioritized: blocking correctness issues before cosmetic ones, so a one-round budget spends on what matters.
  • Goal-tied: every point traces to a requirement in the task — a critic that invents new requirements is a failure mode, not rigor.
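
These properties are easier to enforce when the critique is structured data rather than free text. The sketch below is one possible shape, with hypothetical names throughout (`Issue`, `triage`, the sample requirements): each point names the requirement it traces to, points tied to no stated requirement are dropped, and blocking issues sort ahead of cosmetic ones so a small budget is spent on correctness first.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    requirement: str   # the task requirement this point traces to
    detail: str        # specific, testable statement of the problem
    blocking: bool     # correctness issue (True) or cosmetic (False)

def triage(issues: list[Issue], requirements: set[str], budget: int) -> list[Issue]:
    """Drop points not tied to a stated requirement, put blockers first, keep `budget` items."""
    goal_tied = [i for i in issues if i.requirement in requirements]
    ordered = sorted(goal_tied, key=lambda i: not i.blocking)   # stable sort, blockers first
    return ordered[:budget]

requirements = {"negative input", "docstring"}
raw = [
    Issue("docstring", "function has no docstring", blocking=False),
    Issue("negative input", "line 4 does not raise ValueError for negative n", blocking=True),
    # A point that invents a requirement the task never stated:
    Issue("async support", "should be rewritten as a coroutine", blocking=True),
]

for issue in triage(raw, requirements, budget=1):
    print(issue.detail)
```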

The book also notes two compounding effects worth knowing for later lessons. Reflection pairs naturally with goal setting and monitoring (lesson 13): the goal supplies the final standard the critic judges against, and monitoring feeds deviation signals back into the critique. And with memory (lesson 10): with conversation memory, the critic evaluates the draft in light of past interactions and prior critiques, so the agent stops repeating the same mistakes — reflection becomes cumulative rather than a fresh, isolated event each time.

5 · Running example: the code-patch quality gate

Thread it through our coding/research assistant. Before the agent finalizes a code patch, it does not just emit the diff — it runs a reflection loop with a critic playing senior reviewer, exactly mirroring the book's calculate_factorial example but on a real patch:

def patch_with_reflection(task, evidence, rubric, max_rounds=3):
    # producer, critic, run_tests, and the message helpers (system_producer,
    # user, assistant) are placeholders for your own model and test-runner wrappers.
    history = [system_producer(), user(task, evidence)]
    draft = producer(history)                    # phase 1: execute
    for r in range(max_rounds):
        critique = critic({                      # phase 2: critique (different persona)
            "task": task,                        #   - sees the original objective
            "draft": draft,
            "rubric": rubric,                    #   - checks the explicit rubric
            "test_results": run_tests(draft),    #   - grounded in real evidence
        })
        if critique.is_clean:                    # stop rule A: clean-bill signal
            return draft
        history += [assistant(draft), user(f"Apply this critique:\n{critique}")]
        draft = producer(history)                # phase 3: refine using the critique
    return draft                                 # stop rule B: round budget exhausted
                                                 #   (this last draft was not re-critiqued)

The rubric the critic enforces is the quality gate: is the diff minimal; were tests actually run and passing; are public APIs preserved; does the final report cite evidence? Notice that "were tests run and passing" is not a matter of model opinion — it comes from run_tests(draft). That is the single most reliable critique source, and it is the seam to the next lesson: right now run_tests is hand-waved, but it is really a tool call, and once the critic can invoke tools, its verdict stops being a plausible-sounding judgment and becomes a fact.
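
To make the "verdict from execution" idea concrete, here is a self-contained sketch in which the critique comes entirely from running the draft. Everything in it is an assumption for illustration: this `run_tests` is a toy in-process runner rather than a real tool call, the two drafts are hard-coded, and `stub_producer` stands in for the model by returning the fixed draft once it has seen any critique. Executing a draft with `exec` is acceptable only in a demo and should never be done with untrusted code outside a sandbox.

```python
def run_tests(source: str) -> list[str]:
    """Execute the draft and return failure messages (an empty list means all passed)."""
    namespace: dict = {}
    try:
        exec(source, namespace)               # demo only: never exec untrusted code unsandboxed
        f = namespace["calculate_factorial"]
    except Exception as exc:
        return [f"draft does not load: {exc!r}"]
    failures = []
    for n, expected in [(0, 1), (1, 1), (5, 120)]:
        try:
            got = f(n)
        except Exception as exc:
            failures.append(f"calculate_factorial({n}) raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"calculate_factorial({n}) returned {got}, expected {expected}")
    try:
        f(-1)
        failures.append("calculate_factorial(-1) did not raise ValueError")
    except ValueError:
        pass
    except Exception as exc:
        failures.append(f"calculate_factorial(-1) raised {exc!r}, expected ValueError")
    return failures

FIRST_DRAFT = """
def calculate_factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
"""

FIXED_DRAFT = """
def calculate_factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
"""

def stub_producer(critique):
    """Stand-in for the model (assumption): fixes the draft once it sees a critique."""
    return FIXED_DRAFT if critique else FIRST_DRAFT

def patch_with_tests(max_rounds: int = 3):
    draft = stub_producer(None)
    for round_no in range(max_rounds):
        failures = run_tests(draft)
        if not failures:                      # verdict comes from execution, not opinion
            return draft, round_no
        draft = stub_producer("\n".join(failures))
    return draft, max_rounds

final, rounds_used = patch_with_tests()
print(rounds_used, run_tests(final))
```

The first draft silently returns 1 for a negative input, which a reader skimming the code could easily miss; the test run reports it as a concrete failure, and that failure message is what gets handed back to the producer.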

Where this points next

Reflection gives the agent a feedback loop, but in everything above the critic's most trustworthy signals — "tests pass," "this claim matches the source," "the API still type-checks" — were borrowed from outside the model. The pattern that lets the agent fetch those signals for itself is tool use. Lesson 07, Tool use - function calling as controlled action, builds the mechanism by which an agent leaves language and interacts with live systems: running a test suite, querying a database, calling an API. Once it exists, the strongest reflection loops replace model-judged critique with executable checks wherever they can — which is the whole reason this lesson's running example reached for run_tests before we had even defined what a tool was.

Takeaway
Reflection wraps generation in a feedback loop — execute, critique, refine, iterate — so the agent can catch and fix its own errors before continuing, the first true self-correction pattern in the track. The robust implementation is the producer-critic split: two LLM calls with different system prompts (one generating, one playing "senior reviewer" or "strict fact-checker") to get objective, structured critique instead of self-defense. It pays off only when it is checking better, not reasoning longer: it needs a rubric, the original task, evidence/tests, and a stop rule. The trade-off is real — extra calls, latency, cost, and context growth — so spend 1-3 rounds at high-value checkpoints, always with both a satisfied-critic exit and a hard round cap. Single-step reflection fits in a LangChain chain or an ADK SequentialAgent; true iterative reflection needs stateful orchestration (LangGraph, ADK LoopAgent, or an explicit loop).

Interview prompts

Failure modes

  • A vague "make it better" pass that changes tone and wording but not correctness — reasoning longer instead of checking better.
  • A critic with no rubric or task context that invents new requirements the task never asked for.
  • An over-strict critic that never emits the clean-bill signal, so the loop runs to the cap (or forever) burning cost.
  • Infinite or unbudgeted revise loops that overflow the context window or trip API rate limits.
  • Producer and critic sharing one persona/prompt, so the "critique" is self-defense and finds nothing.

Implementation checklist

  • What explicit rubric does the critic check against, and does it see the original objective?
  • Is the critic a distinct persona / separate call from the producer?
  • Which critique signals can be replaced by executable checks (tests, static analysis, source-matching)?
  • Two stop conditions present: a satisfied-critic signal and a hard round cap?
  • Is this single-step (chain / SequentialAgent) or iterative (LangGraph / LoopAgent / explicit loop) — and is the orchestration stateful enough for what you chose?
  • Are producer, critic, and revision artifacts kept separate and logged for evaluation (lesson 21)?