Part IV - State, collaboration, and protocols

Multi-agent collaboration - roles, handoffs, synthesis

Up to now we have hardened one control loop: it routes, parallelizes its own subtasks, reflects on its own output, calls tools, and plans. This lesson asks the opposite question — when is it better to stop making one loop smarter and instead split the work across several specialized loops that talk to each other? The honest answer is "rarely, and only for specific reasons," and most of this lesson is about earning the right to add a second agent.

Book source

Chapter 7 — Multi-Agent Collaboration (多智能体协作); PDF outline pages 86–96. Examples and frameworks below (CrewAI crews, Google ADK hierarchical / Sequential / Parallel / LoopAgent / agent-as-tool) are the book's own.

The plan

Five moves. (1) State the one decision rule that justifies a second agent — and the cost it must overcome. (2) Name the six collaboration forms the book lists (sequential handoff, parallel, debate, hierarchy, expert team, critic-reviewer) and the six topologies (single, network, supervisor, tool-supervisor, hierarchical, custom). (3) Make the central trade-off numeric with a worked latency-and-cost example, then a widget you can push around. (4) Build the running coding/research assistant as a supervisor + typed handoffs, grounded in the book's CrewAI and ADK code. (5) Catalog the failure modes that kill real multi-agent systems and the checklist that prevents them.

Linear position

Prerequisite: lesson 08 (Planning) — a single agent that can turn a goal into a multi-step executable path, with explicit state and tools. You should also have lesson 05 (Parallelization) in mind: fan-out/fan-in inside one loop.
New capability: coordinating several role-specialized loops through typed artifacts and an explicit communication structure, so that division of expertise reduces complexity instead of multiplying it.
Not yet: remote agents that don't share a process or a trust boundary — that's lesson 17 (A2A). Here, all agents live in one orchestrated system.

1 · The one rule: add an agent only when it owns something

A single agent is efficient on well-scoped problems. The book is blunt that it gets worse, not better, as the task spreads across domains: one loop juggling research, statistics, and report-writing becomes the bottleneck — its context fills with unrelated material, its tool list balloons, and its single prompt has to be good at everything at once. Multi-agent collaboration is the structural answer: decompose the high-level goal into sub-problems and assign each to an agent that has the right tools, data access, or reasoning role for it.

The operative word is owns. A new agent must own at least one of four distinct things, or it is pure overhead:

a skill

A reasoning specialty the others shouldn't carry — e.g. a statistical-analysis agent vs a prose-writer.

a context

An isolated working set so one agent's noise doesn't pollute another's window (the researcher's 40 sources never enter the writer's prompt — only the summary does).

a permission

A tool/credential boundary — only the "deployer" agent may call apply_patch or touch production.

an evaluation

An independent judge — a reviewer whose job is to disagree, which a self-critiquing author cannot do honestly.

Define an agent's role as the pair (responsibility, tool set). If two proposed agents share both, they are the same agent wearing two hats — merge them. This is the multi-agent analogue of the lesson-08 rule that every plan step must change state: every agent must change who is accountable for some artifact.
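
The merge rule above can be sketched as a small check. This is a minimal illustration that assumes a role can be modeled as a responsibility string plus a set of tool names; `Role` and `find_mergeable` are hypothetical names, not taken from the book or from any framework.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Role:
    """Hypothetical model of a role: the pair (responsibility, tool set)."""
    name: str
    responsibility: str
    tools: frozenset


def find_mergeable(roles):
    """Return name pairs of roles that share BOTH responsibility and tool set."""
    return [
        (a.name, b.name)
        for a, b in combinations(roles, 2)
        if a.responsibility == b.responsibility and a.tools == b.tools
    ]


roles = [
    Role("researcher", "find and summarize context", frozenset({"search", "read_docs"})),
    Role("analyst", "find and summarize context", frozenset({"search", "read_docs"})),
    Role("implementer", "write the change", frozenset({"read_file", "apply_patch"})),
]
duplicates = find_mergeable(roles)
print(duplicates)  # [('researcher', 'analyst')]
```

Here the researcher and the analyst are the same agent wearing two hats, so the check flags them as a merge candidate; the implementer differs on both axes and stays.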

Mental model

Think of a newsroom, not a focus group. A reporter gathers, an editor cuts, a fact-checker verifies, a managing editor decides what ships. Each desk owns a deliverable and hands a finished artifact to the next — not a stream of half-thoughts. A focus group, by contrast, is N people talking at once with nobody owning the result; that is exactly the multi-agent anti-pattern.

2 · The six forms and the six topologies

The book separates two things that beginners conflate: the form of collaboration (how work flows) and the topology (who is allowed to talk to whom). Get both right and the system is debuggable; get either wrong and you have an expensive chat room.

Collaboration forms (how the work flows):

Sequential handoff

Agent A finishes, passes its output to B. Like planning, but the steps are explicitly different agents.

Parallel processing

Several agents attack different parts at once; results merged later (fan-out/fan-in across agents).

Debate & consensus

Agents argue from different views/sources, then converge on a better decision. Needs a decision rule.

Hierarchy

A manager agent dynamically assigns subtasks to workers by capability and synthesizes results.

Expert team

Domain specialists (researcher, writer, editor) collaborate on a complex output.

Critic / reviewer

One agent drafts; another reviews for safety, compliance, correctness, quality, goal-alignment; the author revises. Strong for code, research writing, ethics.

Communication topologies (who talks to whom): the book's "relationship and communication structure" axis, from simplest to most flexible.

Topology	Structure	Strength	Risk
1 · Single	One agent, no peers.	Simplest; no coordination cost.	Capability-bound on multi-domain work.
2 · Network	Decentralized peer-to-peer; all share data/tasks.	Resilient, no single point of failure.	Hard to keep decisions consistent; chatty.
3 · Supervisor	One supervisor assigns tasks and resolves conflict.	Clear hierarchy, easy to manage.	Single point of failure / bottleneck.
4 · Tool-supervisor	Supervisor enables (gives resources/guidance) rather than commands.	More flexible; less rigid control.	Looser coordination; weaker guarantees.
5 · Hierarchical	Multiple supervisor layers over operational agents.	Scales to deep problems; distributed decisions.	Latency and complexity grow with depth.
6 · Custom	Bespoke mix tuned to the domain.	Maximal fit; can exploit emergent behavior.	Needs deep design of protocols and coordination.

The book's guidance for picking among these: weigh task complexity, number of agents, autonomy needs, robustness, and communication overhead. For most production systems the sweet spot is topology 3 (supervisor) running the sequential-handoff and critic-reviewer forms — it keeps a single owner of state while still isolating expertise. We'll build exactly that in §4.
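
One way to see why the network topology is "chatty" and the supervisor topology is easier to manage is to count the allowed communication channels. The sketch below assumes a topology can be represented as a set of undirected agent pairs; the function names are illustrative only.

```python
from itertools import combinations


def network_edges(agents):
    """Peer-to-peer: every pair of agents may talk directly, n*(n-1)/2 channels."""
    return {frozenset(pair) for pair in combinations(agents, 2)}


def supervisor_edges(supervisor, workers):
    """Supervisor: workers talk only to the supervisor, one channel per worker."""
    return {frozenset((supervisor, w)) for w in workers}


def may_talk(edges, a, b):
    """True if the topology allows a direct channel between a and b."""
    return frozenset((a, b)) in edges


workers = ["researcher", "implementer", "reviewer"]
net = network_edges(["coordinator", *workers])
sup = supervisor_edges("coordinator", workers)

print(len(net), len(sup))                        # 6 3
print(may_talk(sup, "researcher", "reviewer"))   # False
```

With four agents the network already has twice as many channels as the supervisor layout, and the gap widens quadratically as agents are added.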

3 · The trade-off, made numeric

Every agent you add buys specialization and pays coordination. The coordination tax is real and easy to under-count: each handoff is an extra LLM round-trip with its own latency, its own token cost, and its own chance to garble the artifact. Make it concrete.

Worked example — a research blog, one agent vs a crew. Suppose one strong generalist agent can produce a 500-word AI-trends blog in a single pass: ~6,000 input tokens (instructions + retrieved context) and ~1,200 output tokens, about 9 s of wall-clock, at a model priced \$3 / 1M input and \$15 / 1M output. That is (6000·3 + 1200·15)/1e6 = \$0.018 + \$0.018 = \$0.036 and ~9 s.

Now split it the way the book's CrewAI example does — a researcher agent and a writer agent in sequential handoff:

Researcher: 5,000 in / 1,500 out → (5000·3 + 1500·15)/1e6 = \$0.0375, ~7 s.
Handoff: the writer receives the researcher's 1,500-token summary as context — not the raw 40 sources. Writer: 2,500 in / 1,200 out → (2500·3 + 1200·15)/1e6 = \$0.0255, ~6 s.
Total: \$0.063 and, because it is sequential, ~13 s.

So the crew costs 1.75× the money (0.063 / 0.036) and about 1.4× the latency (13 / 9 ≈ 1.44). That is the coordination tax. It is only worth paying if the split buys back more than it costs: here, if the researcher's isolated context lets it ground 40 sources without blowing the writer's window, and the writer produces noticeably better prose because its prompt is about writing and nothing else. If the generalist's single pass is already good enough, the crew is strictly worse. The lesson: measure the quality lift before you ship the topology.
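
The arithmetic of the worked example can be reproduced in a few lines. The prices, token counts, and timings are the ones stated above; `call_cost` is just an illustrative helper name.

```python
PRICE_IN = 3.0    # USD per 1M input tokens (from the worked example)
PRICE_OUT = 15.0  # USD per 1M output tokens (from the worked example)


def call_cost(tokens_in, tokens_out):
    """Dollar cost of one LLM call at the prices above."""
    return (tokens_in * PRICE_IN + tokens_out * PRICE_OUT) / 1_000_000


generalist_cost, generalist_secs = call_cost(6_000, 1_200), 9
researcher_cost, researcher_secs = call_cost(5_000, 1_500), 7
writer_cost, writer_secs = call_cost(2_500, 1_200), 6

crew_cost = researcher_cost + writer_cost
crew_secs = researcher_secs + writer_secs  # sequential handoff: stages add up

cost_ratio = crew_cost / generalist_cost
latency_ratio = crew_secs / generalist_secs

print(f"generalist: ${generalist_cost:.4f}, {generalist_secs} s")
print(f"crew:       ${crew_cost:.4f}, {crew_secs} s")
print(f"cost x{cost_ratio:.2f}, latency x{latency_ratio:.2f}")  # cost x1.75, latency x1.44
```

Swapping in your own measured token counts and timings gives the premium a proposed split has to earn back.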

When parallelism flips the latency math. If the two roles were independent (say a weather-fetcher and a news-fetcher, the book's ADK ParallelAgent example), running them concurrently makes wall-clock ≈ max(7, 6) = 7 s instead of 13 s — now the crew can be faster than the generalist while still paying the extra tokens. Sequential handoff adds latency; parallel fan-out hides it. Choosing the form is choosing your latency profile.
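
A small `asyncio` sketch makes the sum-versus-max difference observable. The two agents are stand-ins that sleep instead of calling a model, and the 7 s and 6 s latencies are scaled down 100× so the demo finishes quickly; none of this is ADK code.

```python
import asyncio
import time


async def fake_agent(name, secs):
    """Stand-in for an LLM-backed agent; sleeps instead of calling a model."""
    await asyncio.sleep(secs)
    return f"{name} done"


async def run_sequential(jobs):
    """Sequential handoff: each agent starts only after the previous one ends."""
    return [await fake_agent(name, secs) for name, secs in jobs]


async def run_parallel(jobs):
    """Parallel fan-out: all agents start together; gather preserves input order."""
    return list(await asyncio.gather(*(fake_agent(n, s) for n, s in jobs)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = asyncio.run(coro)
    return out, time.perf_counter() - start


jobs = [("weather_fetcher", 0.07), ("news_fetcher", 0.06)]
seq_out, seq_secs = timed(run_sequential(jobs))
par_out, par_secs = timed(run_parallel(jobs))
print(f"sequential ~{seq_secs:.2f} s, parallel ~{par_secs:.2f} s")
```

Both runs return the same results and spend the same simulated work; only the wall-clock differs, roughly the sum of the stages for the sequential run and roughly the slowest stage for the parallel one.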

Coordination tax — when does adding agents pay off?

Set how many specialist agents the work is split into, how much quality lift per agent specialization actually buys (diminishing), and whether handoffs run sequentially or in parallel. The bars show total cost and wall-clock latency vs a single generalist; the verdict tells you whether the split is earning its keep.

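[Interactive widget not reproduced here. Controls: agents N (shown at 2), lift per agent (shown at 18%), sequential vs parallel handoffs. Readouts: cost vs 1 agent, latency vs 1 agent, quality lift, verdict.]

The widget's core JS: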

// inputs come from the widget controls: N (agent count), liftPerAgent (percent), parallel (boolean)
// each agent beyond the first adds a coordination round-trip (extra tokens plus a handoff)
const costMult = 1 + (N - 1) * 0.7;                 // ~+70% tokens per extra agent
const latMult  = parallel ? 1 + 0.15*(N-1)          // parallel: small merge overhead
                          : 1 + 0.45*(N-1);          // sequential: each stage adds wall-clock
// quality lift has diminishing returns: each extra specialist adds 60% of the previous one's lift
let lift = 0;
for (let i = 1; i < N; i++) lift += (liftPerAgent/100) * Math.pow(0.6, i-1);
const qualityPct = lift * 100;
// verdict: worth it only if the lift (in points) reaches at least 80% of the cost premium (in points)
const worth = qualityPct >= (costMult - 1) * 100 * 0.8;
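
For readers who want to run the model outside the page, here is a Python port of the same toy formula. The constants (0.7, 0.45, 0.15, 0.6, 0.8) are the widget's illustrative values, not measurements; `coordination_tax` is a hypothetical function name.

```python
def coordination_tax(n_agents, lift_per_agent_pct, parallel):
    """Toy model of the widget: cost, latency, and quality vs one generalist."""
    extra = n_agents - 1
    cost_mult = 1 + 0.7 * extra
    lat_mult = 1 + (0.15 if parallel else 0.45) * extra
    # Diminishing returns: the i-th extra specialist adds 0.6**(i-1) of the first one's lift.
    quality_pct = sum(lift_per_agent_pct * 0.6 ** (i - 1) for i in range(1, n_agents))
    # Verdict: the lift (in points) must reach 80% of the cost premium (in points).
    worth = quality_pct >= (cost_mult - 1) * 100 * 0.8
    return {
        "cost_mult": cost_mult,
        "lat_mult": lat_mult,
        "quality_pct": quality_pct,
        "worth": worth,
    }


default = coordination_tax(2, 18, parallel=False)
print(default["worth"])  # False: an 18-point lift does not cover a 70% cost premium
```

Under this model a two-agent sequential split needs a lift of at least 56 points to pass, which is a reminder of how strict the widget's verdict is.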

4 · Running example — the coding/research assistant as a supervised crew

Take our running coding/research assistant and split a code-change task into a supervised expert team. Four roles, each owning a skill, a context, a permission, and (for the reviewer) an evaluation:

Role	Owns	Tools	Produces (typed artifact)
Researcher	finding & summarizing context	`search`, `read_docs`	`ResearchBrief{findings[], sources[]}`
Implementer	writing the change	`read_file`, `apply_patch`	`Patch{diff, rationale}`
Reviewer	independent judgment	`diff`, `run_tests`	`Review{verdict, issues[]}`
Coordinator	state & final synthesis	(delegation only)	`Result{patch, review, decision}`

The coordinator is the supervisor (topology 3). The flow is sequential handoff with a critic-reviewer loop bolted on. Crucially, roles do not freely overwrite one another — they emit artifacts; the coordinator owns the shared state and decides when to loop back.

01 · Coordinator decomposes the goal and delegates research → ResearchBrief.

02 · Implementer receives the brief (only the summary, not raw sources) → Patch.

03 · Reviewer runs tests + reads the diff against the brief → Review.

04 · If verdict = reject and loop < max, coordinator hands issues back to implementer (critic loop). Else synthesize Result.
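
The four steps above can be sketched framework-free. The artifact fields follow the table; everything else is an assumption for illustration: the agents are plain stub functions rather than LLM calls, the `loops` field is added for tracing, and the "escalate" decision for an exhausted loop budget is one possible policy, not the book's.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchBrief:
    findings: list
    sources: list


@dataclass
class Patch:
    diff: str
    rationale: str


@dataclass
class Review:
    verdict: str                       # "approve" or "reject"
    issues: list = field(default_factory=list)


@dataclass
class Result:
    patch: Patch
    review: Review
    decision: str                      # "ship" or "escalate" (assumed policy)
    loops: int                         # extra field, added here for tracing


def coordinate(goal, research, implement, review, max_loops=3):
    """Supervisor: owns the state, passes artifacts, bounds the critic loop."""
    if max_loops < 1:
        raise ValueError("max_loops must be at least 1")
    brief = research(goal)                       # step 01
    issues = []
    for loop in range(1, max_loops + 1):
        patch = implement(brief, issues)         # step 02 (and 04 on later loops)
        verdict = review(patch, brief)           # step 03
        if verdict.verdict == "approve":
            return Result(patch, verdict, "ship", loop)
        issues = verdict.issues                  # only the issues go back, not a transcript
    return Result(patch, verdict, "escalate", max_loops)


# Stub agents: deterministic stand-ins for LLM-backed roles.
def research(goal):
    return ResearchBrief(findings=[f"context for: {goal}"], sources=["docs/example.md"])


def implement(brief, issues):
    diff = "+ return a + b" if issues else "+ return a - b"   # first draft is buggy
    return Patch(diff=diff, rationale=brief.findings[0])


def review(patch, brief):
    if "a - b" in patch.diff:
        return Review("reject", ["add() subtracts instead of adding"])
    return Review("approve")


result = coordinate("fix add()", research, implement, review)
print(result.decision, result.loops)  # ship 2
```

Note that the implementer never sees the reviewer's reasoning, only the `issues` list, and that the loop cannot run past `max_loops`.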

How the book's two frameworks express this. CrewAI models it as a Crew of Agents (each with a role, goal, backstory) and Tasks with explicit context=[...] dependencies, run with Process.sequential. The book's worked CrewAI sample builds exactly a two-agent researcher → writer blog crew on Gemini 2.0 Flash, where the writing task declares context=[research_task] so the handoff is a typed dependency, not a chat:

# CrewAI: the book's researcher → writer crew (paraphrased)
# Running it needs the crewai package and a configured LLM; the backstory
# strings are illustrative placeholders, not the book's wording.
from crewai import Agent, Crew, Process, Task

researcher = Agent(role="Senior Research Analyst",
                   goal="Find and summarize the latest AI trends.",
                   backstory="An analyst who tracks emerging AI research.",
                   allow_delegation=False)
writer     = Agent(role="Technical Content Writer",
                   goal="Write a clear blog from the research.",
                   backstory="A writer who turns research into plain prose.",
                   allow_delegation=False)

research_task = Task(description="Research 3 emerging AI trends (2024-25).",
                     expected_output="Detailed summary w/ key points + sources.",
                     agent=researcher)
writing_task  = Task(description="Write a 500-word blog from the research.",
                     expected_output="A complete 500-word blog.",
                     agent=writer,
                     context=[research_task])          # <- typed handoff: writer receives research_task's output

crew = Crew(agents=[researcher, writer],
            tasks=[research_task, writing_task],
            process=Process.sequential)                # run tasks in listed order
result = crew.kickoff()

Google ADK exposes the topologies directly as composable agents. The same patterns from §2 map onto concrete classes:

ADK construct	Realizes	Behavior
`LlmAgent(sub_agents=[...])`	Hierarchy / supervisor	A coordinator delegates to children (e.g. `Greeter`, `TaskExecutor`); parent/child links are explicit.
`SequentialAgent`	Sequential handoff	Runs children in order; `step1` writes `session.state["data"]`, `step2` reads it.
`ParallelAgent`	Parallel processing	Runs children concurrently; each writes its own `output_key` into shared state.
`LoopAgent(max_iterations=N)`	Critic / reviewer loop	Repeats a step + a `ConditionChecker` until status="completed" or N hit (bounded, by design).
`AgentTool(agent=...)`	Agent-as-tool	One agent is wrapped as a callable tool for another (e.g. an `artist_agent` invoking an `ImageGen` agent).

Two design points the ADK examples make sharp. First, shared state is the communication channel — agents read/write keyed entries in session.state rather than passing free-form chat, which keeps the trace inspectable (who wrote data? who read it?). Second, the LoopAgent always carries max_iterations: the book never lets a critic loop run unbounded — there is always a verifier with a stopping rule. That is the antidote to the "debate that never converges" failure in §5.
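
Both design points can be illustrated without any framework. The sketch below is not ADK code: `TracedState` and `run_bounded` are hypothetical names that mimic, respectively, a keyed shared state that records who wrote and read each entry, and a loop that stops on a checker's verdict or on an iteration cap.

```python
class TracedState:
    """Keyed shared state that logs every read and write with the acting agent."""

    def __init__(self):
        self._data = {}
        self.trace = []                      # entries are (agent, op, key)

    def write(self, agent, key, value):
        self._data[key] = value
        self.trace.append((agent, "write", key))

    def read(self, agent, key):
        self.trace.append((agent, "read", key))
        return self._data[key]

    def who(self, op, key):
        """Sorted, de-duplicated agents that performed `op` on `key`."""
        return sorted({a for a, o, k in self.trace if o == op and k == key})


def run_bounded(step, is_done, state, max_iterations):
    """Repeat step until is_done says so or the cap is hit; returns (iterations, done)."""
    for i in range(1, max_iterations + 1):
        step(state)
        if is_done(state):
            return i, True
    return max_iterations, False


def refine(state):
    # Stand-in for a refining agent: bumps a quality score by one per pass.
    state.write("refiner", "quality", state.read("refiner", "quality") + 1)


def good_enough(state):
    # Stand-in for a condition checker with an explicit stopping rule.
    return state.read("checker", "quality") >= 3


state = TracedState()
state.write("coordinator", "quality", 0)
outcome = run_bounded(refine, good_enough, state, max_iterations=5)
print(outcome)                          # (3, True)
print(state.who("write", "quality"))    # ['coordinator', 'refiner']
print(state.who("read", "quality"))     # ['checker', 'refiner']
```

Because every access goes through the state object, the questions "who wrote this?" and "who read it?" are answered by the trace rather than by rereading a transcript.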

The book also notes the "agent-as-tool" pattern blurs lesson 07 (tool use) and this one: from the caller's side a sub-agent is just a tool with a typed signature. That is the cleanest way to add an expert without inventing a whole new protocol — and it previews lesson 17, where the tool happens to live on another machine.
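
A framework-free sketch of the idea, loosely following the artist / image-generation pairing mentioned in the table. All class names, the `as_tool` helper, and the `memory://` URI scheme are hypothetical; ADK's real `AgentTool` has its own interface, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ImageRequest:
    prompt: str


@dataclass
class ImageResult:
    uri: str


class ImageGenAgent:
    """Stand-in sub-agent; a real one would run its own LLM loop."""

    def run(self, request):
        slug = request.prompt.lower().replace(" ", "-")
        return ImageResult(uri=f"memory://images/{slug}.png")   # made-up URI scheme


def as_tool(agent):
    """Wrap an agent so the caller sees a plain callable: ImageRequest -> ImageResult."""
    def tool(request):
        return agent.run(request)
    return tool


class ArtistAgent:
    """Caller agent: it knows only the tool's typed signature, not the agent behind it."""

    def __init__(self, tools):
        self.tools = tools

    def run(self, idea):
        result = self.tools["image_gen"](ImageRequest(prompt=idea))
        return f"Created {result.uri}"


artist = ArtistAgent(tools={"image_gen": as_tool(ImageGenAgent())})
message = artist.run("Red Fox")
print(message)  # Created memory://images/red-fox.png
```

The artist could be handed a local function, a wrapped sub-agent, or a remote call behind the same signature without changing its own code, which is the property the paragraph above points at.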

5 · Failure modes and the checklist

Failure modes

No owner of final state. A "chat room" of peers (topology 2 done badly) where everyone edits and nobody decides — the output is whatever the last agent happened to say.
Overlapping roles. Two agents share responsibility and tools, so they redo each other's work and disagree on what's done. Symptom: cost doubles, quality doesn't move.
Unbounded debate. A debate/critic form with no decision rule or max_iterations loops forever or oscillates. The fix is ADK's bounded LoopAgent + an explicit verdict.
Conversational noise as handoff. Agents pass long transcripts instead of typed artifacts, so each downstream window fills with the upstream agent's thinking-out-loud, costing tokens and inviting drift.
Supervisor bottleneck / SPOF. The topology-3 supervisor is a single point of failure; if it stalls, the whole crew stalls. Budget a timeout and a fallback.
Lost provenance. The trace doesn't record who decided what on which evidence, so a wrong final answer is undebuggable.
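
For the supervisor-bottleneck item, "budget a timeout and a fallback" can be sketched with `asyncio.wait_for`. The supervisor and fallback here are stub coroutines, and falling back to a single-agent answer is an assumed policy chosen for illustration.

```python
import asyncio


async def with_timeout(primary, fallback, timeout_secs):
    """Run primary(); if it exceeds the budget, cancel it and use fallback()."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_secs)
    except asyncio.TimeoutError:
        return await fallback()


async def stalled_supervisor():
    await asyncio.sleep(1.0)          # simulates a supervisor that hangs
    return "crew result"


async def healthy_supervisor():
    return "crew result"


async def single_agent_fallback():
    return "single-agent result"


stalled = asyncio.run(with_timeout(stalled_supervisor, single_agent_fallback, 0.05))
healthy = asyncio.run(with_timeout(healthy_supervisor, single_agent_fallback, 0.05))
print(stalled)  # single-agent result
print(healthy)  # crew result
```

`asyncio.wait_for` cancels the stalled task when the budget expires, so the caller gets an answer within a known bound instead of waiting on the supervisor indefinitely.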

Implementation checklist

For each agent: which of skill / context / permission / evaluation does it own? If none, delete it.
What typed artifact does each role produce, and what's the schema?
Who owns shared state and final synthesis? (Name one coordinator.)
Handoffs pass artifacts and decisions, never raw transcripts.
Every loop (debate, critic) has a decision rule and a max_iterations bound.
Does the trace record decider, evidence, and handoff points for each step?
Did you measure the quality lift against the single-agent baseline before shipping?

Where this points next

We can now coordinate several role-specialized loops through typed handoffs and shared state. But notice what made the handoffs clean: the researcher's brief, the implementer's patch, the reviewer's verdict all had to be stored somewhere the coordinator could read later — ADK's session.state was doing quiet, load-bearing work. That shared store is not just a pass-through buffer; deciding what to keep, for how long, and at what scope is its own discipline. Lesson 10 (Memory management) treats memory as scoped storage with explicit write and read policies — session, state, and long-term knowledge — rather than an ever-growing transcript. Multi-agent systems make that need acute, because now multiple writers contend for the same memory.

Takeaway

Multi-agent collaboration decomposes a multi-domain goal into sub-problems owned by specialized agents. Add an agent only when it owns a distinct skill, context, permission, or evaluation role — otherwise it is pure coordination tax (in our worked case, ~1.75× cost and 1.4× latency for a sequential split). The book's six forms (sequential handoff, parallel, debate, hierarchy, expert team, critic-reviewer) ride on six topologies (single → network → supervisor → tool-supervisor → hierarchical → custom); the production default is a supervisor running sequential handoffs plus a bounded critic loop. Make handoffs typed artifacts through shared state, never chat noise; give every loop a decision rule and a max_iterations bound; and keep a trace of who decided what on which evidence. CrewAI (Crew + Task.context) and Google ADK (Sequential/Parallel/LoopAgent/AgentTool) are just typed spellings of these same structures.

Interview prompts

When does adding a second agent help, and when is it pure overhead? (§1 — only when it owns a distinct skill, context, permission, or evaluation; if two agents share both responsibility and tools, merge them.)
What is the coordination tax, and how do you decide it's worth paying? (§3: each handoff adds an LLM round-trip's latency and tokens; the sequential 2-agent split in the worked example cost ~1.75× the money and ~1.4× the latency, so you ship it only if the measured quality lift beats that premium.)
Distinguish a collaboration form from a topology. (§2 — form = how work flows, e.g. sequential handoff vs debate vs hierarchy; topology = who may talk to whom, e.g. single / network / supervisor / hierarchical / custom.)
How do you stop a debate or critic loop from running forever? (§4/§5 — give it an explicit decision rule plus a bound, exactly like ADK's LoopAgent(max_iterations) with a ConditionChecker verdict.)
Why pass typed artifacts instead of conversation transcripts between agents? (§4/§5 — artifacts keep downstream context small and the trace inspectable; CrewAI's Task(context=[research_task]) makes the handoff a typed dependency, not chat.)
Your multi-agent system is slower and pricier than the old single agent but no more accurate. What's likely wrong? (§5 — overlapping roles redoing work, transcript-as-handoff bloating context, or a split that never bought a quality lift; re-check ownership and measure against the single-agent baseline.)
How does "agent-as-tool" relate to plain tool use? (§4 — wrapping a sub-agent as an AgentTool gives the caller a typed callable; from the caller's side an expert agent is just a tool, which also previews remote A2A in lesson 17.)