Part IX - Composition and implementation
Capstone - compose a reliable coding and research agent
Twenty-one patterns, each learned in isolation, now have to live in one process at the same time. This final lesson does not introduce a new trick. It teaches the harder skill: how to compose the patterns into a single architecture that acts, remembers, recovers, evaluates, and improves — and how to choose the framework that holds it together without hiding the parts you must control.
1 · The composition problem
Here is the trap this lesson exists to defuse. You can demonstrate routing, reflection, tool use, memory, and guardrails each as a clean five-line example. None of that proves they survive contact with each other. The book's conclusion makes the point bluntly: agent systems need architecture, not only model capability. A frontier model with no architecture around it is a brilliant intern with no desk, no version control, and no manager — capable of any single step and reliable at none of them across a whole task.
Composition fails in specific, predictable ways. Reflection (lesson 06) loops forever because nothing told it when "good enough" is reached — that is a goal-and-monitoring gap (lesson 13). A tool error (lesson 07) crashes the run because recovery (lesson 14) was bolted on as an afterthought rather than designed into the loop. Memory (lesson 10) silently grows the context until a single turn costs ten times what the first turn cost — a resource-optimization failure (lesson 18) that no individual pattern owns. The patterns are not independent features you can union together; they are constraints on one shared state object, and they interact. Composition is the discipline of making those interactions explicit.
2 · The reference architecture
Recall the spine of the whole track (lesson 01): an agent is a stateful control loop around a model. Everything we have learned is a way to strengthen one stage of that loop. Composition means slotting each pattern into its stage and being explicit about the shared state that flows between them. Here is the loop with the patterns named in place.
Read it as a sentence. A goal arrives and is turned into a typed contract (lesson 02). A router (lesson 04) decides whether this is a quick lookup or a multi-step job. If multi-step, the planner (lesson 08) lays out a path, anchored by an explicit goal and a way to monitor progress (lesson 13). The agent then enters the act/observe loop: each step may call a tool (lesson 07) or retrieve evidence (lesson 16), reads and writes scoped memory (lesson 10), and every action passes a guardrail (lesson 20) and decrements a budget (lesson 18). Errors become observations handled by recovery (lesson 14), not exceptions that kill the run. After acting, the agent reflects and verifies (lessons 06 and 19). Risky actions escalate to a human (lesson 15). When the goal is met, the budget is spent, or a stop condition fires, the loop ends and the whole trace is scored (lesson 21).
The non-obvious move is that the patterns share one state object. The plan, the scratchpad, the memory reads, the running cost, and the trace are all fields on the same record that flows around the loop. This is exactly why the book (Appendix C) singles out frameworks that make state explicit and persistent — because the moment state is hidden, the interactions between patterns become invisible, and invisible interactions are where composed agents fail.
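A minimal sketch of that shared record and the loop around it can make the point concrete. The field names, the `run_loop` helper, the flat per-step cost, and the default caps are all illustrative assumptions, not taken from the book or from any framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """One record shared by every stage of the loop (field names are illustrative)."""
    goal: str
    plan: list[str] = field(default_factory=list)
    scratchpad: list[str] = field(default_factory=list)
    memory_reads: list[str] = field(default_factory=list)
    cost_usd: float = 0.0
    steps_taken: int = 0
    trace: list[dict] = field(default_factory=list)
    done: bool = False


def run_loop(state: RunState, act: Callable[[RunState], str],
             max_steps: int = 20, max_cost_usd: float = 2.0,
             step_cost_usd: float = 0.05) -> RunState:
    """Act/observe loop: every step is budgeted, and errors become observations."""
    while not state.done:
        # Budget check happens before the step, so the stop is clean and logged.
        if state.steps_taken >= max_steps or state.cost_usd + step_cost_usd > max_cost_usd:
            state.trace.append({"event": "stop", "reason": "budget_exhausted"})
            break
        state.steps_taken += 1
        state.cost_usd += step_cost_usd
        try:
            observation = act(state)
            state.trace.append({"event": "step", "observation": observation})
        except Exception as exc:  # recovery: the failure is recorded, not fatal
            observation = f"error: {exc}"
            state.trace.append({"event": "recovered_error", "observation": observation})
        state.scratchpad.append(observation)
    return state


def demo_act(state: RunState) -> str:
    """Hypothetical step function: fails once, then works, then finishes."""
    if state.steps_taken == 1:
        raise RuntimeError("test runner timed out")
    if state.steps_taken == 3:
        state.done = True
        return "tests pass"
    return "edited middleware"


final = run_loop(RunState(goal="refactor auth to JWT"), demo_act)
```

Because the plan, scratchpad, cost, and trace live on one object, the budget check and the recovery branch can both see what the other stages did, which is the interaction the paragraph above describes.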
Tracing one task through the loop
Take the running example for this lesson, the same kind of task the book's CLI-agent appendix uses for Claude Code: "The current user authentication uses session cookies. Refactor the codebase to stateless JWTs: update the login and logout endpoints, the middleware, and the front-end token handling." Here is the trace, stage by stage.
3 · The resource arithmetic
The reference architecture is correct but abstract. To make it real, do the math the book's resource-optimization material (lesson 18) demands, because the loop must live inside a finite budget of tokens, latency, and money. Vague claims like "agents are expensive" are useless; a worked number tells you where to cut.
Worked example — the JWT refactor. Suppose the agent loop runs 8 steps before the goal is met (the trace above plus a couple of read-only inspection steps). At each step it sends the full context and gets back a model response. Take realistic sizes:
The input grows as the scratchpad and file contents accumulate; call it a flat 12k-token average per step, with about 1,500 output tokens and 9 seconds of model time per step. Total input is 8 × 12,000 = 96,000 tokens; total output is 8 × 1,500 = 12,000 tokens. At an illustrative price of $3 per million input tokens and $15 per million output tokens, the model cost is (96,000 / 10^6) × $3 + (12,000 / 10^6) × $15 = $0.288 + $0.180 ≈ $0.47 for the whole task. Latency is 8 × 9 = 72 seconds of model time, plus tool time (test runs dominate here; say 8 × 5 = 40 s), so roughly 112 seconds wall-clock.
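The same arithmetic can be kept as a small function so the numbers are easy to rerun with different sizes or prices. This is a sketch; the function name and parameters are illustrative, and the prices are the illustrative ones from the worked example, not a real price list.

```python
def task_cost_usd(steps: int, input_tokens_per_step: int, output_tokens_per_step: int,
                  input_price_per_million: float, output_price_per_million: float) -> float:
    """Model cost of a run with a flat per-step token profile."""
    total_input = steps * input_tokens_per_step
    total_output = steps * output_tokens_per_step
    return (total_input / 1e6) * input_price_per_million \
        + (total_output / 1e6) * output_price_per_million


# The JWT refactor from the worked example.
cost = task_cost_usd(steps=8, input_tokens_per_step=12_000, output_tokens_per_step=1_500,
                     input_price_per_million=3.0, output_price_per_million=15.0)
latency_s = 8 * 9 + 8 * 5  # model time plus tool time, in seconds

print(f"cost = ${cost:.3f}, latency = {latency_s} s")
```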
Now apply the patterns as optimizations and watch the budget move:
- Prompt caching the static context (the system prompt, tool schemas, and unchanging files; say 7,000 of the 12,000 input tokens) cuts repeated input cost. If cached reads bill at one-tenth, the cached portion across 7 reuse steps drops from 7 × 7,000 × $3 / 10^6 = $0.147 to about $0.015, a ~28% cut on total cost for one knob.
- Routing (lesson 04) sends trivial sub-questions to a cheaper, faster model. If 3 of the 8 steps are simple file lookups routed to a model at one-fifth the price, those steps' cost falls proportionally.
- A budget cap (lesson 18) of, say, 20 steps and $2 turns "the agent looped forever and cost $40" from an incident into a clean, logged stop. The cap is not a nicety; it is the difference between a bounded and an unbounded failure.
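Two of these knobs are easy to sketch: the caching saving and the hard cap. The `Budget` class, `BudgetExceeded`, and `caching_savings_usd` below are hypothetical names, and the one-tenth cached-read rate is the illustrative figure from the list above rather than any provider's actual pricing.

```python
from dataclasses import dataclass


def caching_savings_usd(cached_tokens: int, reuse_steps: int,
                        input_price_per_million: float,
                        cached_read_fraction: float) -> float:
    """Money saved when cached tokens bill at a fraction of the normal input price."""
    full_price = reuse_steps * cached_tokens / 1e6 * input_price_per_million
    return full_price - full_price * cached_read_fraction


class BudgetExceeded(Exception):
    """Raised instead of letting the loop run on; the caller logs it and stops."""


@dataclass
class Budget:
    max_steps: int = 20
    max_cost_usd: float = 2.0
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one step, refusing it if either cap would be crossed."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} reached")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of ${self.max_cost_usd:.2f} reached")
        self.steps += 1
        self.cost_usd += cost_usd


savings = caching_savings_usd(cached_tokens=7_000, reuse_steps=7,
                              input_price_per_million=3.0, cached_read_fraction=0.1)

# A pathological loop that never reaches its goal: the cap turns it into a clean stop.
budget = Budget()
stop_reason = None
while stop_reason is None:
    try:
        budget.charge(0.0585)  # average per-step cost from the worked example
    except BudgetExceeded as exc:
        stop_reason = str(exc)
```

The cap is checked before the step is charged, so the run never exceeds either limit; the loop ends with a reason string that can go straight into the trace.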
4 · Choosing a framework (Appendix C)
The architecture is framework-agnostic, but you have to build it on something. Appendix C surveys the ecosystem, and its decision rule is refreshingly simple: do not pick the fashionable framework; pick the abstraction level that exposes the state, control flow, and observability your system needs. Here is the book's own comparison, sharpened for the choice you actually face.
| Framework | Core abstraction | Control flow | State | Reach for it when |
|---|---|---|---|---|
| LangChain | Chain (LCEL: prompt \| model \| parser) | Linear, DAG, no loops | Mostly stateless per run | The flow is A → B → C with no looping: simple RAG, summarization, extraction. |
| LangGraph | Graph of nodes + conditional edges | Loops, retries, branches | Explicit persistent state object passed between nodes | The agent must reason, plan, call tools, reflect, and loop until done — i.e. the reference architecture above. |
| Google ADK | Prebuilt agent types (SequentialAgent, ParallelAgent) on a "team" model | Managed by the framework | Implicit; you do not hand-thread it | You want production-grade multi-agent orchestration fast and accept hidden state for it. |
| CrewAI | Agents with role/goal/backstory + Tasks + a Crew process | Sequential or hierarchical (manager agent) | Implicit, team-charter style | You are modeling a team of specialists and want to design roles, not graphs. |
The book frames it as a ladder of abstraction. LangChain gives you the lowest-level glue: pipe a model call into a parser, ship a predictable linear flow. LangGraph sits above it and is the natural home for our loop, because its whole reason to exist is supporting cycles with an explicit, persistent state object — exactly the shared state field we drew in section 2. Google ADK and CrewAI sit higher still: they hide the graph and give you team primitives, trading control for speed. The book also notes the wider field — AutoGen (conversation-driven multi-agent), LlamaIndex (data/RAG-centric), Haystack (production search pipelines), MetaGPT (SOP-driven software-company simulation), Semantic Kernel (LLM-into-.NET/Python plugins), Strands Agents (lightweight, MCP-native) — each strong in its niche and weak outside it.
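The difference between the first two rungs can be shown without any framework at all. The sketch below uses only the standard library; `run_chain` and `run_graph` are hypothetical helpers written for illustration and are not the LangChain or LangGraph APIs.

```python
from typing import Callable, Optional

State = dict
Node = Callable[[State], State]
Edge = Callable[[State], Optional[str]]  # returns the next node name, or None to stop


def run_chain(steps: list[Node], state: State) -> State:
    """Linear flow: each step runs exactly once, in order."""
    for step in steps:
        state = step(state)
    return state


def run_graph(nodes: dict[str, Node], edges: dict[str, Edge],
              start: str, state: State, max_hops: int = 50) -> State:
    """Graph flow: conditional edges can send control back to an earlier node."""
    current: Optional[str] = start
    hops = 0
    while current is not None:
        if hops >= max_hops:
            raise RuntimeError("graph did not terminate within max_hops")
        state = nodes[current](state)
        current = edges[current](state)
        hops += 1
    return state


def draft(state: State) -> State:
    return {**state, "attempts": state["attempts"] + 1}


def check(state: State) -> State:
    # Stand-in for reflection or verification: accept only the third attempt.
    return {**state, "ok": state["attempts"] >= 3}


chain_result = run_chain([draft, check], {"attempts": 0})
graph_result = run_graph(
    nodes={"draft": draft, "check": check},
    edges={"draft": lambda s: "check",
           "check": lambda s: None if s["ok"] else "draft"},
    start="draft",
    state={"attempts": 0},
)
```

The chain stops after one failed check because it has nowhere to go back to; the graph loops until the check passes, carrying the same state dictionary through every hop.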
5 · The book's strongest capstone: a human-led agent team (Appendix G)
Appendices E and G push past the single autonomous loop into the form the book argues is the real frontier: a human-led team of specialist agents. The framing matters. The point is not to replace the developer but to make them an orchestrator of tireless specialists — the book cites the industry signal that a large share of new code at major companies is now model-assisted. Three principles anchor the framework:
The team is a set of role-prompted "personas" evoked inside the model, each owning one part of the development lifecycle. This is the multi-agent pattern (lesson 09) applied to coding, with the human as the synthesizer:
- Scaffolder (implementer): writes new code to a detailed spec, following the existing patterns in the provided codebase.
- Test engineer (quality guardian): writes unit, integration, and end-to-end tests covering edge cases, to the project's testing philosophy.
- Documenter (recorder): generates clear docs — function comments, API endpoint docs with request/response examples.
- Optimizer (refactoring partner): proposes specific performance and readability improvements with reasons.
- Reviewer (code supervisor): the most important role. It works in two passes — first critique (flag bugs, style violations, logic flaws like a static analyzer), then reflect on its own critique to produce a prioritized, actionable summary that drops the trivial noise. This critique-then-reflect move is reflection (lesson 06) applied to the review itself.
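The Reviewer's two passes reduce to two model calls, with the second consuming the output of the first. The sketch below assumes a model is any callable from prompt string to response string; the prompt wording and the `review` function are illustrative, not the book's prompts.

```python
from typing import Callable

Model = Callable[[str], str]  # assumption: any function from prompt text to response text


def review(diff: str, model: Model) -> str:
    """Critique-then-reflect: pass 1 flags issues, pass 2 prioritizes them."""
    critique = model(
        "Act as a static analyzer. Flag bugs, style violations, and logic flaws "
        "in this diff:\n" + diff
    )
    summary = model(
        "Reflect on the critique below. Drop trivial noise and return a "
        "prioritized, actionable summary:\n" + critique
    )
    return summary


# A stub model that records its prompts, so the data flow is visible without an API call.
calls: list[str] = []


def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


result = review("- session_cookie = make_cookie(user)\n+ token = issue_jwt(user)", stub_model)
```

Keeping the model behind a plain callable means the same function works with a real client in production and with a stub in the regression suite.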
The book's implementation checklist makes this repeatable rather than ad-hoc:
- A context file (context.toml) that declares exactly which files, directories, and URLs are bundled into each prompt, giving full transparency into what the model sees.
- A /prompts directory in Git (reviewer.md, tester.md, documenter.md), with prompts treated as code: reviewed, refined, versioned.
- A pre-commit hook that auto-invokes the Reviewer agent on staged changes and prints its critique-and-reflection summary in the terminal, folding quality review into the development rhythm.

This is also where the CLI-agent landscape from Appendix E becomes concrete. There is no single best tool, only a specialized ecosystem: Claude Code for deep, architecture-aware refactors (it builds a mental model of the repo and explains its plan before acting, extensible via MCP); Gemini CLI for multimodal and Google-Cloud-centric work with a large context window; Aider for Git-driven, auto-committing pair-programming and TDD; GitHub Copilot CLI for native GitHub workflows (assign it an issue, get a PR). And Terminal-Bench exists precisely to evaluate these agents on the terminal, a reminder that even the team workflow must be measured, not just felt.
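The pre-commit hook from the checklist might be sketched as a small Python script. Everything here is an assumption for illustration: the `prompts/reviewer.md` path relative to the repository root, the `call_reviewer` callable standing in for whatever model client is used, and the choice to print the summary without blocking the commit. Only `git diff --cached` and `subprocess.run` are standard tools.

```python
import subprocess
from typing import Callable


def staged_diff() -> str:
    """Return the staged changes as text (what `git diff --cached` prints)."""
    result = subprocess.run(["git", "diff", "--cached"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def build_reviewer_prompt(reviewer_instructions: str, diff: str) -> str:
    """Combine the versioned reviewer prompt with the staged diff."""
    return f"{reviewer_instructions.strip()}\n\nStaged changes:\n{diff}"


def run_hook(call_reviewer: Callable[[str], str],
             prompt_path: str = "prompts/reviewer.md") -> int:
    """Print the Reviewer's summary; always return 0 so the commit is not blocked."""
    diff = staged_diff()
    if not diff.strip():
        return 0  # nothing staged, nothing to review
    with open(prompt_path, encoding="utf-8") as handle:
        instructions = handle.read()
    print(call_reviewer(build_reviewer_prompt(instructions, diff)))
    return 0
```

A script saved as the repository's pre-commit hook would call `sys.exit(run_hook(my_model_client))`, where `my_model_client` is whatever function sends a prompt to the model; returning a non-zero code instead would make the review blocking.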
6 · The release gate — shipping a non-deterministic system
One question separates a demo from a system: can a new version ship without a human re-testing everything by hand? A normal program is deterministic, so a passing unit test today passes tomorrow. An agent is non-deterministic — same input, different trajectory — so you cannot prove correctness, only guard against regression. The release gate is a suite of saved traces, scored by the evaluation pattern (lesson 21), that a new prompt, model, or framework version must clear before it ships.
```python
graph = AgentGraph(state=RunState)  # explicit, persistent state (section 2)
graph.add(intake_goal)              # L02 contract
graph.add(route_task)               # L04
graph.add(plan_steps)               # L08 + L13 goal/monitor
graph.add(execute_step_loop)        # L07 tools, L16 retrieve, L10 memory,
                                    # L20 guardrail + L18 budget per step
graph.add(recover_or_escalate)      # L14 recovery, L15 human gate
graph.add(reflect_and_verify)       # L06 + L19
graph.add(final_report)             # evidence + side effects

release_gate = RegressionSuite([    # L21: must pass before any version ships
    "tool_failure",      # a tool errors mid-run -> does recovery fire?
    "bad_route",         # ambiguous goal -> does it route safely?
    "unsafe_action",     # request to write outside repo -> does guardrail block?
    "missing_citation",  # research claim with no source -> does verify catch it?
    "coding_patch",      # the JWT refactor -> tests pass, PR opened?
    "budget_blowout",    # pathological loop -> does the cap stop it?
])
```
Each entry is a real recorded trajectory plus an assertion about how the agent should behave — note these assertions score the trajectory, not just the final string, because a right answer reached by an unsafe or unbounded path is still a failure. When you swap the underlying model or rewrite the planner prompt, you replay the suite; if "unsafe_action" stops being blocked, the gate stays shut. This is how you operate a probabilistic system with engineering discipline instead of hope.
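A trajectory-scoring gate can be sketched in a few lines. The `RegressionCase` shape, the event dictionaries, and the check function are hypothetical; a real suite would replay the agent against recorded inputs rather than return canned trajectories.

```python
from dataclasses import dataclass
from typing import Callable

Trajectory = list[dict]  # assumption: a trace is a list of event dictionaries


@dataclass
class RegressionCase:
    name: str
    replay: Callable[[], Trajectory]      # re-runs the agent on the saved scenario
    check: Callable[[Trajectory], bool]   # asserts on the whole path, not the final string


def run_gate(cases: list[RegressionCase]) -> dict[str, bool]:
    """Replay every case and record whether its trajectory assertion holds."""
    return {case.name: bool(case.check(case.replay())) for case in cases}


def gate_open(results: dict[str, bool]) -> bool:
    """The version ships only if every case passes."""
    return all(results.values())


def no_write_outside_repo(trajectory: Trajectory) -> bool:
    return not any(e["event"] == "write" and e.get("outside_repo") for e in trajectory)


# Canned trajectories standing in for two versions of the agent.
current_version = [{"event": "guardrail_block", "target": "/etc/passwd"},
                   {"event": "final", "answer": "refused: path is outside the repo"}]
candidate_version = [{"event": "write", "outside_repo": True},
                     {"event": "final", "answer": "done"}]

current_results = run_gate(
    [RegressionCase("unsafe_action", lambda: current_version, no_write_outside_repo)])
candidate_results = run_gate(
    [RegressionCase("unsafe_action", lambda: candidate_version, no_write_outside_repo)])
```

The candidate's final answer reads fine, yet the gate stays shut because the check inspects the events on the way there.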
Where this points next
This is the last lesson, so the hand-off is back to the work. You arrived knowing twenty-one patterns as separate ideas; you leave able to compose them into one architecture, put a dollar-and-second budget around it, choose a framework that keeps the parts visible, and ship behind a regression gate. The natural next step is to build the smallest honest version: take the reference loop from section 2, implement it for one real task (the JWT refactor is a good first target), wire in exactly one guardrail and one budget cap, and write your first five regression traces. Then add patterns only as a failure forces each one: composition by necessity, not by fashion. If you want to revisit a piece, the track index links every pattern by name; the recovery and resource lessons (14 and 18) are the two that most people under-build.
Running example — the full assembly
Pulled together, the capstone task "add a feature to this repo and explain it" exercises the entire book: the agent contracts the goal (02), routes it (04), plans the path (08) with a progress metric (13), inspects files and edits through patches and runs tests as tools (07) while reading scoped memory (10), retrieves any external API docs it needs (16), checks every write against a guardrail (20) and decrements a budget (18), treats a failed test as an observation to recover from (14), reflects on and verifies the diff (06, 19), escalates the PR for human approval (15), reports the evidence and side effects, and logs the whole trace so the evaluation suite (21) can score it and the release gate can defend the next version.
Failure modes
- Demo-grade, not replayable. A run that looked great once but produced no trace, so no one can reproduce, debug, or regression-test it.
- State-hiding framework. An abstraction that conceals the context, permissions, or evaluation — the section-1 interactions fail invisibly inside it.
- Unbounded loop. Reflection or planning loops with no goal/budget, burning tokens until someone notices the bill (section 3).
- Evidence-free answers. A final report that states success but omits the sources, the uncertainty, and the side effects (new env var, deleted file).
- Bolted-on recovery. Errors thrown as exceptions that kill the run instead of being handled as observations inside the loop.
Implementation checklist
- Can an operator replay any run from its trace?
- Is the shared state (plan, memory, cost, trace) explicit and inspectable?
- Can a risky action be blocked by a guardrail before it executes?
- Does every step decrement a budget, with a hard cap that stops cleanly?
- Do failures recover or stop as observations, not crashes?
- Does the final answer carry evidence and named side effects?
- Must a new version pass the regression gate before shipping?
Interview prompts
- Why doesn't having twenty-one working patterns give you a working agent? (§1 — the patterns are constraints on one shared state object and they interact; an unbounded reflection loop, a crash from a bolted-on tool error, or runaway memory growth are interaction failures no single pattern owns. Composition is making those interactions explicit.)
- Walk through where each pattern sits in the agent control loop. (§2 — intake/contract, route, plan + goal-monitor, then an act/observe loop with tools, retrieval, memory, guardrail and budget per step, recovery on error, reflect/verify, human gate on risk, and trace scoring at the end — all over one persistent state object.)
- Estimate the cost and latency of one multi-step coding task, then cut it. (§3 — e.g. 8 steps × ~12k input + ~1.5k output ≈ $0.47 and ~112 s; cut with prompt caching of static context, routing trivial steps to a cheaper model, and a hard budget cap to bound the failure case.)
- LangChain vs LangGraph vs ADK vs CrewAI — how do you choose? (§4 — LangChain for linear DAGs with no loops; LangGraph for stateful loops with explicit persistent state (the home for an agent loop); ADK and CrewAI hide the graph for managed multi-agent teams. Pick by the abstraction that exposes the state, control, and observability you need — not by fashion.)
- Describe the book's human-led agent team and the reviewer's two passes. (§5 — human as orchestrator/architect and final gate; specialist personas for scaffolding, testing, docs, and optimization; the reviewer first critiques like a static analyzer, then reflects on its own critique to emit prioritized, actionable feedback. Backed by deliberate context, versioned prompts, and a pre-commit Git hook.)
- How do you safely ship a new version of a non-deterministic agent? (§6 — you cannot prove correctness, only guard against regression: a suite of saved traces scored on the whole trajectory (tool_failure, unsafe_action, budget_blowout, the coding patch, etc.) that a new prompt/model/framework must clear before release.)
- Where does "vibe coding" belong, and where does it break? (§5 callout — strong for discovery/ideation (beating the blank page, exploring APIs, prototyping), but robust software needs the structured shift to a specialist-agent team with explicit context, tests, and review; don't ship the spark.)