Part IX - Composition and implementation

Capstone - compose a reliable coding and research agent

Twenty-one patterns, each learned in isolation, now have to live in one process at the same time. This final lesson does not introduce a new trick. It teaches the harder skill: how to compose the patterns into a single architecture that acts, remembers, recovers, evaluates, and improves — and how to choose the framework that holds it together without hiding the parts you must control.

Book source

Conclusion plus Appendices B, C, D, E, F, G; PDF outline pages 237-304. This lesson folds in the framework survey (Appendix C: LangChain, LangGraph, Google ADK, CrewAI, and others), the CLI-agent landscape (Appendix E: Claude Code, Gemini CLI, Aider, GitHub Copilot CLI, Terminal-Bench), and the human-led programming-agent team (Appendix G).

The plan

Five moves. (1) State the composition problem — why twenty-one working patterns do not automatically make one working agent. (2) Lay out the reference architecture: the control loop with every pattern slotted into its role, and trace one task through it end to end. (3) Do the resource arithmetic on a real coding task so you feel the token, latency, and cost budget the loop must live inside. (4) Choose a framework honestly using the book's own decision rule (LangChain vs LangGraph vs ADK vs CrewAI). (5) Adopt the book's strongest capstone: the human-led team of specialist agents from Appendix G, with versioned prompts and Git-hook review. We close on the release gate that lets you ship a non-deterministic system at all.

Linear position

Prerequisite: every pattern from lesson 01 (the agent boundary) through lesson 23 (exploration and discovery) — this lesson assumes you can already define each one. New capability: assemble those patterns into one reliable architecture, put a number on its resource budget, pick a framework deliberately, and operate it behind a regression gate.

1 · The composition problem

Here is the trap this lesson exists to defuse. You can demonstrate routing, reflection, tool use, memory, and guardrails each as a clean five-line example. None of that proves they survive contact with each other. The book's conclusion makes the point bluntly: agent systems need architecture, not only model capability. A frontier model with no architecture around it is a brilliant intern with no desk, no version control, and no manager — capable of any single step and reliable at none of them across a whole task.

Composition fails in specific, predictable ways. Reflection (lesson 06) loops forever because nothing told it when "good enough" is reached — that is a goal-and-monitoring gap (lesson 13). A tool error (lesson 07) crashes the run because recovery (lesson 14) was bolted on as an afterthought rather than designed into the loop. Memory (lesson 10) silently grows the context until a single turn costs ten times what the first turn cost — a resource-optimization failure (lesson 18) that no individual pattern owns. The patterns are not independent features you can union together; they are constraints on one shared state object, and they interact. Composition is the discipline of making those interactions explicit.

Mental model

Think of the agent as an orchestra, not a soloist. Each pattern is one section that plays well alone. The architecture is the score and the conductor: it decides who plays when, who waits, who stops, and what happens when the violinist drops the bow mid-phrase. The book's twenty-one chapters taught you the instruments. This lesson is the conducting.

2 · The reference architecture

Recall the spine of the whole track (lesson 01): an agent is a stateful control loop around a model. Everything we have learned is a way to strengthen one stage of that loop. Composition means slotting each pattern into its stage and being explicit about the shared state that flows between them. Here is the loop with the patterns named in place.

┌──────────────── shared run state ────────────────┐
│ goal · plan · scratchpad · memory · trace · cost │
└──────────────────────────────────────────────────┘
                         ▲
                         │
goal in ──▶ [INTAKE] ──▶ [ROUTE] ──▶ [PLAN] ──▶ [ACT] ──────▶ [OBSERVE] ──▶ [REFLECT] ──▶ done?
            contract     L04         L08        loop:         tool result   L06 critique    │
            L02          decide      goal L13   L07 tool      + error       + verify L19    │ no
                         branch      monitor    L16 retrieve  L14 recover                   │
                                                L10 memory                                  │
                                              ▲ L20 guardrail (every action)                │
                                              │ L18 budget (every step)                     │
                                              └─ loop until goal met / budget spent / stop ─┘

risk gate ──▶ [HUMAN REVIEW L15] ──▶ [FINAL REPORT + EVIDENCE]
                                     L21 score the whole trace

Read it as a sentence. A goal arrives and is turned into a typed contract (lesson 02). A router (lesson 04) decides whether this is a quick lookup or a multi-step job. If multi-step, the planner (lesson 08) lays out a path, anchored by an explicit goal and a way to monitor progress (lesson 13). The agent then enters the act/observe loop: each step may call a tool (lesson 07) or retrieve evidence (lesson 16), reads and writes scoped memory (lesson 10), and every action passes a guardrail (lesson 20) and decrements a budget (lesson 18). Errors become observations handled by recovery (lesson 14), not exceptions that kill the run. After acting, the agent reflects and verifies (lessons 06 and 19). Risky actions escalate to a human (lesson 15). When the goal is met, the budget is spent, or a stop condition fires, the loop ends and the whole trace is scored (lesson 21).

The non-obvious move is that the patterns share one state object. The plan, the scratchpad, the memory reads, the running cost, and the trace are all fields on the same record that flows around the loop. This is exactly why the book (Appendix C) singles out frameworks that make state explicit and persistent — because the moment state is hidden, the interactions between patterns become invisible, and invisible interactions are where composed agents fail.
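A minimal sketch of that shared record and the loop around it, under loud assumptions: `RunState`, `run_loop`, and `flaky_tool` are hypothetical names, the "model" is replaced by a fixed plan, and cost is a flat 6 cents per step rather than a metered figure. It only illustrates how budget checks and recovery can live inside the loop and write to the same state object.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """One record that every stage reads and writes (hypothetical shape)."""
    goal: str
    plan: list[str] = field(default_factory=list)
    scratchpad: list[str] = field(default_factory=list)
    trace: list[dict] = field(default_factory=list)
    cost_cents: int = 0
    steps: int = 0
    status: str = "running"


def run_loop(state: RunState, act: Callable[[str], str],
             max_steps: int = 20, max_cost_cents: int = 200,
             step_cost_cents: int = 6) -> RunState:
    """Run the plan; failures become observations, the budget is checked every step."""
    pending = list(state.plan)
    while pending:
        over_budget = (state.steps >= max_steps
                       or state.cost_cents + step_cost_cents > max_cost_cents)
        if over_budget:
            state.status = "budget_exhausted"  # clean, logged stop
            return state
        step = pending[0]
        state.steps += 1
        state.cost_cents += step_cost_cents
        try:
            observation, ok = act(step), True
            pending.pop(0)  # step succeeded, move to the next one
        except Exception as exc:
            # Recovery: keep the step pending and record the error as an observation.
            observation, ok = f"error: {exc}", False
        state.scratchpad.append(observation)
        state.trace.append({"step": step, "ok": ok, "observation": observation})
    state.status = "done"
    return state


attempts = {"rewrite middleware": 0}


def flaky_tool(step: str) -> str:
    """Stub tool: the middleware step fails once, then passes."""
    if step == "rewrite middleware":
        attempts[step] += 1
        if attempts[step] == 1:
            raise RuntimeError("middleware still reads a cookie")
    return f"ok: {step}"


final = run_loop(
    RunState(goal="cookie auth -> JWT",
             plan=["update login", "rewrite middleware", "run tests"]),
    flaky_tool,
)
print(final.status, final.steps, final.cost_cents)  # prints: done 4 24
```

Because the failed attempt still spends budget and still lands in the trace, a tool that never succeeds ends in `budget_exhausted` instead of looping forever.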

Tracing one task through the loop

Take the running example for this lesson, the same kind of task the book's CLI-agent appendix uses for Claude Code: "The current user authentication uses session cookies. Refactor the codebase to stateless JWTs: update the login and logout endpoints, the middleware, and the front-end token handling." Here is the trace, stage by stage.

intake · Contract. Goal: replace cookie auth with JWT. Success: all auth tests pass, no cookie references remain, a PR is opened. Constraints: do not touch unrelated modules; no secrets in code.

route · Route. This is a large refactor across files, not a one-shot edit, so route to the planning path rather than a direct patch.

plan · Plan + goal. Steps: (1) read all auth-touching files; (2) replace the token issuer in login; (3) update logout; (4) rewrite middleware; (5) update front-end storage; (6) run tests. Progress metric: tests passing / tests total.

act/observe · Loop. Each step reads files (tool), edits via a patch (tool), and runs the test suite (tool). Guardrail check before each write: is the path inside the repo and not in a denylisted directory? Budget decrement after each model call.

recover · Recovery. The test run on step 4 fails: the middleware still reads a cookie. The failure is an observation. The agent retries that step with the error in context; the second attempt passes.

reflect/verify · Reflect. Critique the diff: any leftover cookie code? Any token logged in plaintext? Verify against the constraint list before finalizing.

human · Human gate. Opening a PR and changing security-sensitive auth crosses the risk line, so escalate the diff for approval (lesson 15).

report · Report + score. Final answer cites the changed files, the passing test count, and the side effects (a new env var for the signing key). The whole trace is logged for evaluation (lesson 21).
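The guardrail question in the act/observe step ("is the path inside the repo and not in a denylisted directory?") can be sketched as a small pure function. This is an illustration only: `is_write_allowed` and the default denylist entries are assumptions, not something the book specifies.

```python
from pathlib import Path


def is_write_allowed(path: str, repo_root: str,
                     denylist: tuple[str, ...] = (".git", "secrets")) -> bool:
    """Allow a write only if it resolves inside repo_root and avoids denylisted dirs."""
    root = Path(repo_root).resolve()
    target = (root / path).resolve()  # collapses ".." and handles absolute paths
    try:
        relative = target.relative_to(root)
    except ValueError:
        return False  # the resolved path escaped the repository
    return not any(part in denylist for part in relative.parts)


print(is_write_allowed("src/auth/middleware.py", "/tmp/repo"))
print(is_write_allowed("../outside.txt", "/tmp/repo"))
```

Resolving before comparing matters: a check on the raw string would let `src/../../outside.txt` through.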

3 · The resource arithmetic

The reference architecture is correct but abstract. To make it real, do the math the book's resource-optimization material (lesson 18) demands, because the loop must live inside a finite budget of tokens, latency, and money. Vague claims like "agents are expensive" are useless; a worked number tells you where to cut.

Worked example — the JWT refactor. Suppose the agent loop runs 8 steps before the goal is met (the trace above plus a couple of read-only inspection steps). At each step it sends the full context and gets back a model response. Take realistic sizes:

Quantity	Value
Steps in the loop	8
Input tokens / step	~12,000
Output tokens / step	~1,500
Model latency / step	~9 s

The input grows as the scratchpad and file contents accumulate, so treat 12k as a flat average across steps. Total input is 8 × 12,000 = 96,000 tokens; total output is 8 × 1,500 = 12,000 tokens. At an illustrative price of $3 per million input tokens and $15 per million output tokens, the model cost is (96,000 / 10^6) × $3 + (12,000 / 10^6) × $15 = $0.288 + $0.180 = $0.468, or about $0.47 for the whole task. Latency is 8 × 9 = 72 seconds of model time, plus tool time (test runs dominate here; say 8 × 5 = 40 s), so roughly 112 seconds wall-clock.
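The same arithmetic as a small calculator, so the inputs can be changed and the totals recomputed. The function names are hypothetical, and the prices are the illustrative ones from the worked example, not any vendor's rate card.

```python
def model_cost_usd(steps: int, input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Total model spend when every step sends input_tokens and receives output_tokens."""
    input_cost = steps * input_tokens / 1_000_000 * input_price_per_m
    output_cost = steps * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


def wall_clock_s(steps: int, model_latency_s: float, tool_latency_s: float) -> float:
    """Sequential loop: each step waits for the model, then for its tool."""
    return steps * (model_latency_s + tool_latency_s)


cost = model_cost_usd(steps=8, input_tokens=12_000, output_tokens=1_500,
                      input_price_per_m=3.0, output_price_per_m=15.0)
latency = wall_clock_s(steps=8, model_latency_s=9, tool_latency_s=5)
print(f"${cost:.3f} per task, {latency:.0f} s wall-clock")
# prints: $0.468 per task, 112 s wall-clock
```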

Now apply the patterns as optimizations and watch the budget move:

Prompt caching the static context (the system prompt, tool schemas, and unchanging files; say 7,000 of the 12,000 input tokens) cuts repeated input cost. If cached reads bill at one-tenth of the normal input price, the cached portion across the 7 reuse steps drops from 7 × 7,000 × $3 / 10^6 = $0.147 to about $0.015. That saves about $0.13, roughly 28% of the $0.47 total, from one knob.
Routing (lesson 04) sends trivial sub-questions to a cheaper, faster model. If 3 of the 8 steps are simple file lookups routed to a model at one-fifth the price, and those steps are average-sized, their combined cost falls from about $0.18 to about $0.04.
A budget cap (lesson 18) of, say, 20 steps and $2 turns "the agent looped forever and cost $40" from an incident into a clean, logged stop. The cap is not a nicety; it is the difference between a bounded and an unbounded failure.
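The first two optimizations above can be checked numerically. This is a sketch under stated assumptions: the helper names are hypothetical, cache writes are billed at the normal input price, and the routed steps are assumed to be average-sized (12,000 input and 1,500 output tokens each).

```python
INPUT_PRICE_PER_M = 3.0    # illustrative $ per million input tokens
OUTPUT_PRICE_PER_M = 15.0  # illustrative $ per million output tokens
BASELINE_USD = 0.468       # 8-step total from the worked example


def caching_saving_usd(reuse_steps: int, cached_tokens: int,
                       cached_price_ratio: float = 0.1) -> float:
    """Saving when cached_tokens are re-read at cached_price_ratio of the input price."""
    full = reuse_steps * cached_tokens / 1_000_000 * INPUT_PRICE_PER_M
    return full * (1 - cached_price_ratio)


def routing_saving_usd(routed_steps: int, input_tokens: int, output_tokens: int,
                       cheap_price_ratio: float = 0.2) -> float:
    """Saving when routed_steps run on a model priced at cheap_price_ratio of the default."""
    per_step = (input_tokens * INPUT_PRICE_PER_M
                + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return routed_steps * per_step * (1 - cheap_price_ratio)


cache_saving = caching_saving_usd(reuse_steps=7, cached_tokens=7_000)
route_saving = routing_saving_usd(routed_steps=3, input_tokens=12_000, output_tokens=1_500)
print(f"caching saves ${cache_saving:.4f} ({cache_saving / BASELINE_USD:.0%} of baseline)")
print(f"routing saves ${route_saving:.4f} ({route_saving / BASELINE_USD:.0%} of baseline)")
```

The two savings overlap (a routed step also carries cached tokens), so they should not simply be added together.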

Why the number matters

A single task at $0.47 looks free. Multiply by a team of 50 developers each running 30 agent tasks a day: 50 × 30 × $0.47 ≈ $700/day ≈ $250,000/year if that load runs every calendar day (closer to $175,000 over about 250 working days), before any retries or failed runs. Composition without budgets is how a useful tool becomes a line item nobody approved. The arithmetic is the argument for putting lesson 18's controls in the loop, not around it.

4 · Choosing a framework (Appendix C)

The architecture is framework-agnostic, but you have to build it on something. Appendix C surveys the ecosystem, and its decision rule is refreshingly simple: do not pick the fashionable framework; pick the abstraction level that exposes the state, control flow, and observability your system needs. Here is the book's own comparison, sharpened for the choice you actually face.

Framework	Core abstraction	Control flow	State	Reach for it when
LangChain	Chain (LCEL: `prompt | model | parser`)	Linear, DAG, no loops	Mostly stateless per run	The flow is A → B → C with no looping: simple RAG, summarization, extraction.
LangGraph	Graph of nodes + conditional edges	Loops, retries, branches	Explicit persistent state object passed between nodes	The agent must reason, plan, call tools, reflect, and loop until done — i.e. the reference architecture above.
Google ADK	Prebuilt agent types (SequentialAgent, ParallelAgent) on a "team" model	Managed by the framework	Implicit; you do not hand-thread it	You want production-grade multi-agent orchestration fast and accept hidden state for it.
CrewAI	Agents with role/goal/backstory + Tasks + a Crew process	Sequential or hierarchical (manager agent)	Implicit, team-charter style	You are modeling a team of specialists and want to design roles, not graphs.

The book frames it as a ladder of abstraction. LangChain gives you the lowest-level glue: pipe a model call into a parser, ship a predictable linear flow. LangGraph sits above it and is the natural home for our loop, because its whole reason to exist is supporting cycles with an explicit, persistent state object — exactly the shared state field we drew in section 2. Google ADK and CrewAI sit higher still: they hide the graph and give you team primitives, trading control for speed. The book also notes the wider field — AutoGen (conversation-driven multi-agent), LlamaIndex (data/RAG-centric), Haystack (production search pipelines), MetaGPT (SOP-driven software-company simulation), Semantic Kernel (LLM-into-.NET/Python plugins), Strands Agents (lightweight, MCP-native) — each strong in its niche and weak outside it.
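To make the bottom two rungs of that ladder concrete, here is a plain-Python contrast between a linear chain and a graph with conditional edges over explicit state. It deliberately uses no framework: `run_chain`, `run_graph`, `act`, and `reflect` are hypothetical stand-ins and do not reproduce LangChain's or LangGraph's actual APIs.

```python
from typing import Callable

State = dict
END = "END"


def run_chain(steps: list[Callable[[State], State]], state: State) -> State:
    """Linear flow: every step runs exactly once, in order, with no way to loop."""
    for step in steps:
        state = step(state)
    return state


def run_graph(nodes: dict[str, Callable[[State], State]],
              edges: dict[str, Callable[[State], str]],
              start: str, state: State, max_hops: int = 50) -> State:
    """Graph flow: after each node, an edge function reads the state and picks the next node."""
    current, hops = start, 0
    while current != END:
        if hops >= max_hops:
            raise RuntimeError("max_hops exceeded; the graph never reached END")
        state = nodes[current](state)
        state = {**state, "visited": state.get("visited", []) + [current]}
        current = edges[current](state)
        hops += 1
    return state


def act(state: State) -> State:
    return {**state, "attempts": state.get("attempts", 0) + 1}


def reflect(state: State) -> State:
    # Stand-in for a real check: pretend the tests pass on the second attempt.
    return {**state, "tests_pass": state["attempts"] >= 2}


chain_result = run_chain([act, reflect], {})
graph_result = run_graph(
    nodes={"act": act, "reflect": reflect},
    edges={"act": lambda s: "reflect",
           "reflect": lambda s: END if s["tests_pass"] else "act"},
    start="act", state={},
)
print(chain_result["tests_pass"], graph_result["tests_pass"], graph_result["visited"])
```

The chain stops after one attempt with failing tests; the graph loops back to `act` because the edge out of `reflect` can read the state, which is the property the reference loop depends on.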

Trap

The dangerous framework is the one that hides state, permissions, or evaluation behind a clean demo. If you cannot answer "what is in the model's context right now?", "what blocks an unsafe tool call?", and "how do I replay this run?" by reading the framework's own state — you have bought a polished black box, and the failures in section 1 are waiting inside it. For the capstone architecture, prefer explicit state (LangGraph-style) unless you have a concrete reason to climb the abstraction ladder.

5 · The book's strongest capstone: a human-led agent team (Appendix G)

Appendices E and G push past the single autonomous loop into the form the book argues is the real frontier: a human-led team of specialist agents. The framing matters. The point is not to replace the developer but to make them an orchestrator of tireless specialists — the book cites the industry signal that a large share of new code at major companies is now model-assisted. Three principles anchor the framework:

Human-led orchestration

The developer is team lead and architect: defines tasks, chooses which agent to invoke, supplies context, and is the final gate on every output. Agents are collaborators, never decision-makers.

Context is king

A frontier model with poor context is useless. The human prepares a deliberate task brief — full relevant code, external docs and API specs, and an explicit goal/style brief — rather than trusting an automated black-box retrieval.

Direct model access

Connect agents straight to frontier models. Routing through weak models or lossy middle layers degrades every downstream step. A dual-vendor strategy (for example, keys for two independent model providers) hedges outages and enables comparison.

The team is a set of role-prompted "personas" evoked inside the model, each owning one part of the development lifecycle. This is the multi-agent pattern (lesson 09) applied to coding, with the human as the synthesizer:

Scaffolder (implementer): writes new code to a detailed spec, following the existing patterns in the provided codebase.
Test engineer (quality guardian): writes unit, integration, and end-to-end tests covering edge cases, to the project's testing philosophy.
Documenter (recorder): generates clear docs — function comments, API endpoint docs with request/response examples.
Optimizer (refactoring partner): proposes specific performance and readability improvements with reasons.
Reviewer (code supervisor): the most important role. It works in two passes — first critique (flag bugs, style violations, logic flaws like a static analyzer), then reflect on its own critique to produce a prioritized, actionable summary that drops the trivial noise. This critique-then-reflect move is reflection (lesson 06) applied to the review itself.
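The Reviewer's two passes can be sketched as two model calls, with the second one reading the first one's output. Everything named here is an assumption for illustration: the prompt wording, `two_pass_review`, and `stub_model`, which stands in for whatever client calls the real model.

```python
from typing import Callable

CRITIQUE_PROMPT = (
    "You are a code reviewer. List every bug, style violation, and logic flaw "
    "in this diff, one finding per line.\n\nDIFF:\n{diff}"
)
REFLECT_PROMPT = (
    "Here is a raw critique of a diff. Drop trivial findings, rank the rest by "
    "severity, and return a short actionable summary.\n\n"
    "DIFF:\n{diff}\n\nCRITIQUE:\n{critique}"
)


def two_pass_review(diff: str, model: Callable[[str], str]) -> str:
    """Pass 1 critiques the diff; pass 2 reflects on that critique and prioritizes it."""
    critique = model(CRITIQUE_PROMPT.format(diff=diff))
    return model(REFLECT_PROMPT.format(diff=diff, critique=critique))


calls: list[str] = []


def stub_model(prompt: str) -> str:
    """Canned responses so the sketch runs without a model API."""
    calls.append(prompt)
    if len(calls) == 1:
        return "RAW: token logged in plaintext; trailing whitespace"
    return "1. Remove the plaintext token log."


summary = two_pass_review("+ print(token)", stub_model)
print(summary)
```

Keeping the model behind a plain callable means the same function can be pointed at either provider in a dual-vendor setup, or at a stub in tests.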

The book's implementation checklist makes this repeatable rather than ad-hoc:

1. Frontier model access. API keys for at least two strong models; manage credentials like production secrets.

2. Local context orchestrator. A lightweight CLI/runner with a config file (e.g. context.toml) that declares exactly which files, directories, and URLs are bundled into each prompt, giving full transparency into what the model sees.

3. Versioned prompt library. A /prompts directory in Git (reviewer.md, tester.md, documenter.md), with prompts treated as code: reviewed, refined, versioned.

4. Git-hook integration. A pre-commit hook that auto-invokes the Reviewer agent on staged changes and prints its critique-and-reflection summary in the terminal, folding quality review into the development rhythm.
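One possible shape for the Git-hook item in the checklist above, as a hedged sketch. `git diff --cached` is the standard way to read staged changes; `run_review_hook` and the `reviewer` callable are hypothetical, and the reviewer would wrap a real model call in practice.

```python
import subprocess
from typing import Callable


def staged_diff() -> str:
    """Return the staged changes as printed by `git diff --cached`."""
    result = subprocess.run(["git", "diff", "--cached"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def run_review_hook(reviewer: Callable[[str, str], str], reviewer_prompt: str,
                    get_diff: Callable[[], str] = staged_diff) -> int:
    """Print the reviewer's summary for staged changes; return the hook's exit code."""
    diff = get_diff()
    if not diff.strip():
        return 0  # nothing staged, nothing to review
    print(reviewer(reviewer_prompt, diff))
    return 0  # advisory only: the human stays the final gate
```

In a real setup this would be called from an executable `.git/hooks/pre-commit` script that reads the prompt text from `prompts/reviewer.md` and passes the return value to `sys.exit`. Git aborts the commit when a pre-commit hook exits non-zero, so returning 0 keeps the review informational rather than blocking.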

This is also where the CLI-agent landscape from Appendix E becomes concrete. There is no single best tool, only a specialized ecosystem: Claude Code for deep, architecture-aware refactors (it builds a mental model of the repo and explains its plan before acting, extensible via MCP); Gemini CLI for multimodal and Google-Cloud-centric work with a large context window; Aider for Git-driven, auto-committing pair-programming and TDD; GitHub Copilot CLI for native GitHub workflows (assign it an issue, get a PR). And Terminal-Bench exists precisely to evaluate these agents on the terminal — a reminder that even the team workflow must be measured, not just felt.

Where "vibe coding" fits

Appendices B and G both name vibe coding: steering an AI with high-level, conversational intent ("make this landing page clean and modern", "make this function more Pythonic"). The book is clear about its role: it is superb for the discovery and ideation phase (lesson 23), which means beating the blank page, exploring an unfamiliar API, and prototyping fast. But robust, maintainable software needs the structured move away from pure generation toward the specialist-agent team above. Vibe coding is the spark; the architecture is the engine. Do not ship the spark.

6 · The release gate — shipping a non-deterministic system

One question separates a demo from a system: can a new version ship without a human re-testing everything by hand? A normal program is deterministic, so a passing unit test today passes tomorrow. An agent is non-deterministic — same input, different trajectory — so you cannot prove correctness, only guard against regression. The release gate is a suite of saved traces, scored by the evaluation pattern (lesson 21), that a new prompt, model, or framework version must clear before it ships.

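# Pseudocode: AgentGraph, RunState, RegressionSuite, and the node functions are
# placeholder names, not a specific library's API.
graph = AgentGraph(state=RunState)        # explicit, persistent state (section 2)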
graph.add(intake_goal)                    # L02 contract
graph.add(route_task)                     # L04
graph.add(plan_steps)                     # L08 + L13 goal/monitor
graph.add(execute_step_loop)              # L07 tools, L16 retrieve, L10 memory,
                                          #   L20 guardrail + L18 budget per step
graph.add(recover_or_escalate)            # L14 recovery, L15 human gate
graph.add(reflect_and_verify)             # L06 + L19
graph.add(final_report)                   # evidence + side effects

release_gate = RegressionSuite([          # L21 — must pass before any version ships
  "tool_failure",        # a tool errors mid-run            -> does recovery fire?
  "bad_route",           # ambiguous goal                  -> does it route safely?
  "unsafe_action",       # request to write outside repo   -> does guardrail block?
  "missing_citation",    # research claim with no source   -> does verify catch it?
  "coding_patch",        # the JWT refactor                -> tests pass, PR opened?
  "budget_blowout",      # pathological loop               -> does the cap stop it?
])

Each entry is a real recorded trajectory plus an assertion about how the agent should behave — note these assertions score the trajectory, not just the final string, because a right answer reached by an unsafe or unbounded path is still a failure. When you swap the underlying model or rewrite the planner prompt, you replay the suite; if "unsafe_action" stops being blocked, the gate stays shut. This is how you operate a probabilistic system with engineering discipline instead of hope.
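A runnable miniature of such a gate, assuming a trace is just a list of event dicts. `RegressionCase`, `release_gate`, the two checks, and the event fields (`type`, `in_repo`, `blocked`) are all hypothetical; a real suite would replay recorded trajectories against the candidate version instead of returning canned events.

```python
from dataclasses import dataclass
from typing import Callable

Trace = list[dict]  # one dict per event in a run


@dataclass
class RegressionCase:
    name: str
    replay: Callable[[], Trace]     # re-runs the saved scenario on the candidate version
    check: Callable[[Trace], bool]  # assertion over the whole trajectory


def release_gate(cases: list[RegressionCase]) -> tuple[bool, list[str]]:
    """Return (ship?, names of failed cases). Any failure keeps the gate shut."""
    failed = [case.name for case in cases if not case.check(case.replay())]
    return (not failed, failed)


def unsafe_writes_blocked(trace: Trace) -> bool:
    """Every write outside the repo must have been blocked."""
    return all(e["blocked"] for e in trace
               if e["type"] == "write" and not e["in_repo"])


def stopped_within_cap(trace: Trace, max_steps: int = 20) -> bool:
    """The run must end in an explicit stop event within the step cap."""
    return bool(trace) and len(trace) <= max_steps and trace[-1]["type"] == "stop"


def good_run() -> Trace:
    return [{"type": "write", "in_repo": False, "blocked": True}, {"type": "stop"}]


def regressed_run() -> Trace:
    return [{"type": "write", "in_repo": False, "blocked": False}, {"type": "stop"}]


ok_result = release_gate([
    RegressionCase("unsafe_action", good_run, unsafe_writes_blocked),
    RegressionCase("budget_blowout", good_run, stopped_within_cap),
])
bad_result = release_gate([
    RegressionCase("unsafe_action", regressed_run, unsafe_writes_blocked),
    RegressionCase("budget_blowout", regressed_run, stopped_within_cap),
])
print(ok_result, bad_result)
```

Both runs end in the same final event, yet only one clears the gate, which is the point of scoring the trajectory rather than the final string.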

Where this points next

This is the last lesson, so the hand-off is back to the work. You arrived knowing twenty-one patterns as separate ideas; you leave able to compose them into one architecture, put a dollar-and-second budget around it, choose a framework that keeps the parts visible, and ship behind a regression gate. The natural next step is to build the smallest honest version: take the reference loop from section 2, implement it for one real task (the JWT refactor is a good first target), wire in exactly one guardrail and one budget cap, and write your first five regression traces. Then add patterns only as a failure forces each one — composition by necessity, not by fashion. If you want to revisit a piece, the track index links every pattern by name; the recovery and resource lessons (14 and 18) are the two most people under-build.

Running example — the full assembly

Pulled together, the capstone task "add a feature to this repo and explain it" exercises the entire book: the agent contracts the goal (02), routes it (04), plans the path (08) with a progress metric (13), inspects files and edits through patches and runs tests as tools (07) while reading scoped memory (10), retrieves any external API docs it needs (16), checks every write against a guardrail (20) and decrements a budget (18), treats a failed test as an observation to recover from (14), reflects on and verifies the diff (06, 19), escalates the PR for human approval (15), reports the evidence and side effects, and logs the whole trace so the evaluation suite (21) can score it and the release gate can defend the next version.

Failure modes

Demo-grade, not replayable. A run that looked great once but produced no trace, so no one can reproduce, debug, or regression-test it.
State-hiding framework. An abstraction that conceals the context, permissions, or evaluation — the section-1 interactions fail invisibly inside it.
Unbounded loop. Reflection or planning loops with no goal/budget, burning tokens until someone notices the bill (section 3).
Evidence-free answers. A final report that states success but omits the sources, the uncertainty, and the side effects (new env var, deleted file).
Bolted-on recovery. Errors thrown as exceptions that kill the run instead of being handled as observations inside the loop.

Implementation checklist

Can an operator replay any run from its trace?
Is the shared state (plan, memory, cost, trace) explicit and inspectable?
Can a risky action be blocked by a guardrail before it executes?
Does every step decrement a budget, with a hard cap that stops cleanly?
Do failures recover or stop as observations, not crashes?
Does the final answer carry evidence and named side effects?
Must a new version pass the regression gate before shipping?

Takeaway

An agent is a stateful control loop, and the twenty-one patterns are the named ways to strengthen one stage of it — composition is slotting each into its place around a single explicit, persistent state object and making their interactions visible. Put real numbers on the loop (a multi-step task is cents and minutes per run, and a fortune at fleet scale), so cost, latency, and budget caps belong inside the loop. Choose a framework by the abstraction it exposes, not its fashion: LangChain for linear DAGs, LangGraph for stateful loops, ADK/CrewAI for managed teams. The book's strongest form is a human-led team of specialist agents — scaffolder, tester, documenter, optimizer, and a critique-then-reflect reviewer — driven by deliberate context, versioned prompts, and Git-hook review. And because the system is non-deterministic, the thing that lets you ship it at all is a regression gate over saved traces. The book's final word: agent systems are won with architecture, not model capability alone.

Interview prompts

Why doesn't having twenty-one working patterns give you a working agent? (§1: the patterns are constraints on one shared state object and they interact; an unbounded reflection loop, a run that crashes on a tool error because recovery was bolted on, or runaway memory growth are interaction failures no single pattern owns. Composition is making those interactions explicit.)
Walk through where each pattern sits in the agent control loop. (§2 — intake/contract, route, plan + goal-monitor, then an act/observe loop with tools, retrieval, memory, guardrail and budget per step, recovery on error, reflect/verify, human gate on risk, and trace scoring at the end — all over one persistent state object.)
Estimate the cost and latency of one multi-step coding task, then cut it. (§3 — e.g. 8 steps × ~12k input + ~1.5k output ≈ $0.47 and ~112 s; cut with prompt caching of static context, routing trivial steps to a cheaper model, and a hard budget cap to bound the failure case.)
LangChain vs LangGraph vs ADK vs CrewAI — how do you choose? (§4 — LangChain for linear DAGs with no loops; LangGraph for stateful loops with explicit persistent state (the home for an agent loop); ADK and CrewAI hide the graph for managed multi-agent teams. Pick by the abstraction that exposes the state, control, and observability you need — not by fashion.)
Describe the book's human-led agent team and the reviewer's two passes. (§5 — human as orchestrator/architect and final gate; specialist personas for scaffolding, testing, docs, and optimization; the reviewer first critiques like a static analyzer, then reflects on its own critique to emit prioritized, actionable feedback. Backed by deliberate context, versioned prompts, and a pre-commit Git hook.)
How do you safely ship a new version of a non-deterministic agent? (§6 — you cannot prove correctness, only guard against regression: a suite of saved traces scored on the whole trajectory (tool_failure, unsafe_action, budget_blowout, the coding patch, etc.) that a new prompt/model/framework must clear before release.)
Where does "vibe coding" belong, and where does it break? (§5 callout — strong for discovery/ideation (beating the blank page, exploring APIs, prototyping), but robust software needs the structured shift to a specialist-agent team with explicit context, tests, and review; don't ship the spark.)