Part II - Core execution patterns
Parallelization - fan-out, fan-in, and latency
Chaining (lesson 03) gave the agent a straight line and routing (lesson 04) gave it a fork. Both still do one thing at a time. But most real agent jobs hide several subtasks that do not depend on each other — three searches, four validators, five content blocks. Running them one after another wastes the dominant cost in an agent (waiting on I/O). Parallelization is the pattern that runs the independent parts at the same time and then reduces the results back into one state.
[…] RunnableParallel code, and the Google ADK ParallelAgent/SequentialAgent code are drawn from this chapter.
[…] RunnableParallel and Google ADK ParallelAgent, and confront the asyncio "concurrency is not parallelism" trap. (5) Thread it all through the running research assistant. We close by handing the contradiction problem to lesson 06 (Reflection).
New capability: Concurrent execution of independent model calls, tool calls, or whole sub-agents, plus a reducer (fan-in) that merges, votes, or reconciles their outputs into one state the next step can consume.
1 · Two patterns wearing one name
The word "parallelization" hides two genuinely different designs, and the book is careful to separate them. Mixing them up is the most common reason a parallel agent gets slower, more expensive, and less correct all at once.
Sectioning splits one job into different, complementary subtasks and runs them side by side; voting runs the same task several ways (different prompts or models) and compares the answers. Sectioning increases throughput; voting increases quality. They share the fan-out/fan-in skeleton but demand completely different reducers: a merger that assumes its inputs are complementary will silently average away the disagreement that a voter was specifically built to surface. Decide which one you are building first, because that decision determines the reducer, and the reducer is where the difficulty lives.
The book's catalog of where this pays off is worth keeping in view; every one of these is independent work hiding inside a single user request:
| Scenario | Parallel branches | Pattern |
|---|---|---|
| Research a company | news search · stock data · social media · internal DB | sectioning |
| Analyze customer feedback | sentiment · keyword extraction · classification · urgency flag | sectioning |
| Plan a trip | flights · hotels · local events · restaurants | sectioning |
| Draft a marketing email | subject line · body · image · CTA copy | sectioning |
| Validate a form | email format · phone · address lookup · profanity check | sectioning |
| Process a multimodal post | text sentiment + keywords ‖ image objects + scene | sectioning |
| Generate ad headlines | 3 titles from different prompts / models | voting / A/B |
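To make the contrast between the two reducers concrete, here is a minimal Python sketch. The function names `merge_sections` and `majority_vote` and the sample inputs are hypothetical, chosen for illustration rather than taken from the book, and the merger is deliberately the simplest possible one (no dedupe, no conflict handling).

```python
from collections import Counter


def merge_sections(sections: dict[str, str]) -> str:
    """Sectioning reducer: branches are complementary, so keep every part."""
    return "\n".join(f"[{name}] {text}" for name, text in sections.items())


def majority_vote(answers: list[str]) -> tuple[str, bool]:
    """Voting reducer: branches answer the same question.

    Returns the most common answer and whether the vote was unanimous,
    so disagreement is reported instead of averaged away.
    """
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner, len(counts) == 1


# Hypothetical branch outputs, for illustration only.
report = merge_sections({"news": "Q3 revenue up", "stock": "Shares flat"})
label, unanimous = majority_vote(["urgent", "urgent", "routine"])
```

Feeding the voting inputs to `merge_sections` would simply print all three labels side by side, which is the "wrong reducer" failure described above.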
2 · The latency arithmetic — what you actually buy
Parallelization is the only pattern in this track whose benefit you can compute on a napkin before writing code, so do that first. The reason it works at all is that agent time is dominated by waiting: an LLM call or an API request spends almost all of its wall-clock time idle, waiting for a network round trip and for tokens to stream back. Three such calls in series wait three times in a row. Three in parallel wait once.
Serial cost is the sum of the branches; parallel cost is the slowest single branch (they overlap, so you only wait for the laggard) plus the reducer step that must run after all of them finish — the fan-in is inherently sequential, because it needs every input.
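Written as formulas, with symbols introduced here only for illustration ($t_i$ is the wall-clock time of branch $i$, $t_{\text{reduce}}$ the fan-in step, and the reducer is counted in both totals, as in the worked number that follows):

```latex
T_{\text{serial}} = \sum_{i=1}^{n} t_i + t_{\text{reduce}},
\qquad
T_{\text{parallel}} = \max_{1 \le i \le n} t_i + t_{\text{reduce}}
```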
Worked number. Take the book's research assistant. It hits three retrievers — web search, internal docs, local notes — taking 3.0 s, 2.0 s, and 1.0 s, then a synthesis LLM call to fuse them at 2.0 s.
- Serial: 3.0 + 2.0 + 1.0 + 2.0 = 8.0 s.
- Parallel fan-out, then reduce: max(3.0, 2.0, 1.0) + 2.0 = 3.0 + 2.0 = 5.0 s.
- Speed-up: 8.0 / 5.0 = 1.6×. Time saved: 3.0 s — exactly the two branches you no longer wait on in series.
Notice three things the arithmetic teaches you for free. First, the speed-up is bounded: the synthesis step is not parallel (it needs all three inputs), so it sits in the denominator forever. This is Amdahl's law in agent clothing — if a fraction s of the work is irreducibly serial, your maximum speed-up is capped at 1/s no matter how many branches you fan out. Here the 2 s synthesis caps you hard. Second, more branches help only up to the slowest one; a fourth retriever at 0.5 s is free on the clock (it finishes inside the 3.0 s window) but still costs tokens. Third, the laggard dominates — shaving the 1.0 s branch does nothing; you must attack the 3.0 s branch (or give it a timeout) to move the number.
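The three observations can be checked with a few lines of Python. This is a sketch under the numbers of the worked example; the helper names are hypothetical, and the Amdahl cap treats the 2.0 s synthesis as the only serial part.

```python
def serial_time(branches: list[float], reduce_s: float) -> float:
    """Every branch runs back to back, then the reducer."""
    return sum(branches) + reduce_s


def parallel_time(branches: list[float], reduce_s: float) -> float:
    """Branches overlap, so only the slowest one counts, then the reducer."""
    return max(branches) + reduce_s


def amdahl_cap(serial_fraction: float) -> float:
    """Upper bound on speed-up when serial_fraction of the work cannot overlap."""
    return 1.0 / serial_fraction


branches, reduce_s = [3.0, 2.0, 1.0], 2.0
t_serial = serial_time(branches, reduce_s)
t_parallel = parallel_time(branches, reduce_s)
speedup = t_serial / t_parallel
cap = amdahl_cap(reduce_s / t_serial)  # synthesis is 2 s of the 8 s serial run

# A fourth 0.5 s retriever finishes inside the 3.0 s window.
t_four = parallel_time(branches + [0.5], reduce_s)
# Shaving the 1.0 s branch changes nothing; shaving the laggard does.
t_fast_small = parallel_time([3.0, 2.0, 0.5], reduce_s)
t_fast_laggard = parallel_time([2.5, 2.0, 1.0], reduce_s)
```

With these inputs the cap works out to 4×: even if every retriever returned instantly, the 2.0 s synthesis would still leave 8.0 / 2.0 as the best possible ratio.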
Cost, though, does not parallelize. Wall-clock time overlaps; token spend does not. Running three branches still bills you for three branches' worth of input and output tokens, plus the synthesis call. If the serial design needed three LLM calls and the parallel design needs four (three branches + one reducer), you have spent more money to save time. That trade, buying latency with dollars, is the central tension this pattern hands to lesson 18 (Resource-aware optimization).
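A small sketch makes the two axes visible side by side. The `Call` record and the token counts below are invented for illustration; only the timings come from the worked example.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    seconds: float
    tokens: int  # input + output tokens billed for this call


def wall_clock(branches: list[Call], reducer: Call, parallel: bool) -> float:
    """Latency overlaps across branches only when they run in parallel."""
    times = [c.seconds for c in branches]
    return (max(times) if parallel else sum(times)) + reducer.seconds


def billed_tokens(branches: list[Call], reducer: Call) -> int:
    """Token spend is a plain sum; execution order does not change it."""
    return sum(c.tokens for c in branches) + reducer.tokens


# Hypothetical token counts, chosen only to make the arithmetic visible.
branches = [Call("web", 3.0, 1200), Call("docs", 2.0, 900), Call("notes", 1.0, 400)]
reducer = Call("synthesis", 2.0, 1500)

serial_s = wall_clock(branches, reducer, parallel=False)
parallel_s = wall_clock(branches, reducer, parallel=True)
tokens = billed_tokens(branches, reducer)
```

The latency figure depends on the `parallel` flag; the token figure has no such parameter, which is the point of the paragraph above.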
3 · The shape: fan-out, fan-in, and why the reducer is the hard half
The skeleton is four moves, and the first three are easy. The fourth is where designs live or die.
- Identify independent branches. Two pieces of work are independent only if neither reads the other's output and they share no mutable state. This is the precondition lessons 03–04 trained you to spot. Get it wrong and you have a hidden dependency, not a parallel branch.
- Launch each branch with its own contract. Every branch gets its own typed input and output schema (lesson 02), so the reducer knows exactly what shape comes back from each.
- Collect the artifacts as the branches finish, under a shared budget (timeout, cancellation).
- Fan in with a reducer that does real work — not string concatenation.
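The four moves above can be sketched with plain `asyncio`. This is a minimal illustration, not the book's code: `Evidence`, `retrieve`, and `reduce_evidence` are hypothetical names, and `asyncio.sleep` stands in for a real network call.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class Evidence:
    """Typed artifact every branch returns, so the reducer knows the shape."""
    source: str
    facts: list[str]


async def retrieve(source: str, delay_s: float) -> Evidence:
    """Stand-in for an I/O-bound retriever; the sleep simulates a network wait."""
    await asyncio.sleep(delay_s)
    return Evidence(source=source, facts=[f"fact from {source}"])


def reduce_evidence(results: list[Evidence]) -> dict[str, list[str]]:
    """Fan-in: key facts by source so attribution survives the merge."""
    return {r.source: r.facts for r in results}


async def run() -> tuple[dict[str, list[str]], float]:
    start = time.perf_counter()
    # Fan-out: the three independent branches wait concurrently.
    results = await asyncio.gather(
        retrieve("web", 0.3), retrieve("docs", 0.2), retrieve("notes", 0.1)
    )
    merged = reduce_evidence(list(results))
    return merged, time.perf_counter() - start


merged, elapsed = asyncio.run(run())
```

The elapsed time tracks the slowest branch (about 0.3 s here) rather than the 0.6 s sum, and `asyncio.gather` returns results in the order the branches were passed in, not the order they finished.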
The seductive mistake is to treat fan-in as "\n".join(results). The book is blunt about this: the aggregator matters as much as the fan-out. A reducer has to know which job it is doing, and the four jobs are different:
| Reducer job | When | What it must do |
|---|---|---|
| Merge sections | sectioning — branches are complementary | concatenate, but dedupe overlapping facts and keep source attribution |
| Compare alternatives | voting — branches answer the same question | score each against evidence/constraints, pick the best |
| Reconcile disagreement | branches conflict on a fact | surface the conflict explicitly rather than silently picking one |
| Select the safest | any branch may be unsafe/low-confidence | filter on a guardrail/threshold before merging |
The book's research example makes the stakes concrete: three retrievers can return overlapping, partially contradictory evidence. A reducer that concatenates produces a report that says both "X is true" and "X is false" two paragraphs apart and never notices. A reducer with a contradiction policy either resolves it (by recency, authority, or quorum) or flags it for the next stage. That "flag it for the next stage" is precisely the seam into Reflection (lesson 06).
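As a rough sketch of a reducer with a contradiction policy, the following Python merges attributed claims, dedupes the ones that agree, and surfaces the ones that do not. The `Claim` and `Reduced` types, the claim keys, and the sample values are assumptions made for this example; a real reducer would also need a way to decide that two differently worded claims are about the same fact.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    key: str     # what the claim is about, e.g. "launch_month"
    value: str   # what the source says
    source: str  # which branch said it


@dataclass
class Reduced:
    # key -> (agreed value, every source that reported it)
    facts: dict[str, tuple[str, list[str]]] = field(default_factory=dict)
    # key -> all competing claims, left unresolved for the next stage
    conflicts: dict[str, list[Claim]] = field(default_factory=dict)


def reduce_claims(claims: list[Claim]) -> Reduced:
    """Merge claims, dedupe agreements, and surface disagreements."""
    by_key: dict[str, list[Claim]] = {}
    for claim in claims:
        by_key.setdefault(claim.key, []).append(claim)

    out = Reduced()
    for key, group in by_key.items():
        if len({c.value for c in group}) == 1:
            # Sources agree: keep one value, remember who said it.
            out.facts[key] = (group[0].value, [c.source for c in group])
        else:
            # Sources disagree: flag the conflict instead of picking a side.
            out.conflicts[key] = group
    return out


# Hypothetical branch outputs, for illustration only.
result = reduce_claims([
    Claim("launch_month", "March", "docs"),
    Claim("launch_month", "April", "notes"),
    Claim("price", "49 USD", "web"),
    Claim("price", "49 USD", "docs"),
])
```

Resolution policies such as recency, authority, or quorum would slot into the `else` branch; this sketch only implements the "flag it" option.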
4 · Two framework forms (and the asyncio trap)
The book shows the same pattern in two framework dialects. They differ in what runs in parallel — runnables vs whole agents — but the shape is identical.
LangChain (LCEL) — parallel runnables. You bundle independent chains into a RunnableParallel (a dict of branches); when the bundle is invoked, the LCEL runtime fires the branches concurrently and returns a dict keyed by branch name. The book's example fans one topic out into three chains — summary, questions, key-terms — preserving the original input with RunnablePassthrough, then pipes the merged dict into a synthesis prompt:
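    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.runnables import RunnableParallel, RunnablePassthrough

    # summarize_chain, questions_chain, terms_chain, synthesis_prompt and llm
    # are assumed to be defined earlier.
    map_chain = RunnableParallel({
        "summary": summarize_chain,       # 3 independent LLM chains,
        "questions": questions_chain,     # run concurrently on one topic
        "key_terms": terms_chain,
        "topic": RunnablePassthrough(),   # carry the original input forward
    })
    # fan-in: the merged dict feeds one synthesis prompt → LLM → parser
    full_chain = map_chain | synthesis_prompt | llm | StrOutputParser()
    # ainvoke is a coroutine, so this line belongs inside an async function
    result = await full_chain.ainvoke("the history of space exploration")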
Google ADK — parallel agents. ADK gives first-class primitives: ParallelAgent runs its sub-agents concurrently, and SequentialAgent orchestrates phases. The book's sustainable-tech example wires three LlmAgent researchers (renewable energy, EVs, carbon capture), each using google_search and writing to a distinct output_key in shared session state. A ParallelAgent runs all three; then a synthesis merger_agent reads the three keys and writes a structured report — and crucially is instructed to use only the input summaries, adding no outside knowledge. A SequentialAgent chains "parallel research → merge" as the root:
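    from google.adk.agents import LlmAgent, ParallelAgent, SequentialAgent

    # Researcher agents are defined earlier; name values are illustrative, model config omitted.
    parallel_research = ParallelAgent(
        name="parallel_research",
        sub_agents=[renewable_researcher, ev_researcher, carbon_researcher])
    merger = LlmAgent(
        name="merger",
        instruction="Synthesize ONLY from the input summaries: "
                    "{renewable_energy_result} {ev_technology_result} "
                    "{carbon_capture_result}. Attribute each section.")
    root_agent = SequentialAgent(name="research_pipeline", sub_agents=[parallel_research, merger])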
Both forms put the same guardrail on the reducer: synthesize only from the branch outputs. That instruction is doing real work — it stops the merger from hallucinating glue between sources, which is the failure mode that makes parallel research reports untrustworthy.
asyncio gives concurrency, not true parallelism: a single-threaded event loop switches between tasks whenever one is idle (e.g. waiting on a network response). That is exactly right for agent work, because agent branches are I/O-bound: they spend their time waiting on LLM/API round trips, not burning CPU. The event loop overlaps all that waiting, so you get the full latency win. But if a branch were CPU-bound (heavy local computation, no await points), asyncio.gather would not speed it up at all, because the loop cannot preempt a busy task. For CPU-bound branches in CPython you need separate processes; threads help only when the work releases the GIL (for example inside native extensions). Match the concurrency primitive to where the time actually goes.
5 · Running example: the research assistant, parallelized
Pull the thread together. Our research agent receives one query and runs three retrievers concurrently, each under a shared budget, then reduces with a real policy — not a join:
    branches = [search_web(query), search_docs(query), search_notes(query)]
    # fan-out under a shared budget; the laggard can be cancelled
    results = await gather_with_budget(branches, timeout=8.0)
    # fan-in: a reducer that knows it is MERGING SECTIONS
    evidence = reduce_results(
        results,
        dedupe=True,                         # collapse facts that appear in 2+ sources
        require_source=True,                 # drop any claim without attribution
        conflict_policy="surface_conflict",  # don't silently pick a side
        quorum=2,                            # proceed once 2 of 3 branches return
    )
The timeout=8.0 and quorum=2 together implement the latency lesson from §2: do not wait on a slow, low-value branch once you have enough evidence. If web search is the 3.0 s laggard and the other two return in 2.0 s, a quorum of 2 lets synthesis start at 2.0 s and treat the web branch as best-effort. The surface_conflict policy implements §3: when docs say the launch is in March and notes say April, the reducer emits both, attributed, instead of guessing — and that flagged conflict is the input lesson 06 will critique.
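One way the budget helper could work is sketched below. It is an assumption-laden variant, not the book's `gather_with_budget`: here the branches are passed as a name-to-coroutine dict and the quorum is enforced inside the gather itself, so the caller can start reducing as soon as enough branches have returned. The delays are scaled down from the lesson's 3.0 / 2.0 / 1.0 s so the example runs quickly.

```python
import asyncio
from typing import Any, Coroutine


async def gather_with_budget(
    branches: dict[str, Coroutine[Any, Any, Any]],
    timeout: float,
    quorum: int,
) -> dict[str, Any]:
    """Return results once `quorum` branches succeed or `timeout` expires.

    Branches still running at that point are cancelled (best-effort).
    """
    tasks = {asyncio.create_task(coro): name for name, coro in branches.items()}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    results: dict[str, Any] = {}
    pending = set(tasks)

    while pending and len(results) < quorum:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        done, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task.exception() is None:  # failed branches do not count
                results[tasks[task]] = task.result()

    for task in pending:  # drop the laggards
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    return results


async def branch(name: str, delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # simulated I/O wait
    return f"{name} evidence"


async def demo() -> dict[str, str]:
    return await gather_with_budget(
        {"web": branch("web", 0.5), "docs": branch("docs", 0.1), "notes": branch("notes", 0.05)},
        timeout=1.0,
        quorum=2,
    )


results = asyncio.run(demo())
```

In this run the two fast branches satisfy the quorum and the slow `web` branch is cancelled. A production version would probably also want per-branch timeouts and a record of which branches were dropped, so the reducer can say what evidence is missing.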
Failure modes
- Fake independence. Branches share a constraint or one secretly needs another's output; you get duplicated effort or incompatible results that the reducer can't fuse.
- Concatenation as fan-in. No dedupe, no conflict handling — the report contradicts itself and nobody notices.
- Waiting on the laggard. Holding all branches until the slowest returns even after a quorum already has enough evidence; the timeout/quorum exists for exactly this.
- Parallelizing for show. Fanning out CPU-bound work under asyncio (no speed-up) or fanning out three branches to save 1 s while tripling token cost.
- Lost attribution. Merger forgets which source said what, so conflicts and provenance both vanish.
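The "parallelizing for show" failure with CPU-bound work can be observed directly. In this sketch `time.sleep` is used as a stand-in for a computation that never yields to the event loop (an assumption made to keep the demo short and deterministic); the function names are hypothetical.

```python
import asyncio
import time
from collections.abc import Awaitable, Callable


async def io_branch(delay_s: float) -> None:
    """I/O-bound stand-in: awaiting yields control, so waits overlap."""
    await asyncio.sleep(delay_s)


async def blocking_branch(delay_s: float) -> None:
    """CPU-bound stand-in: time.sleep never yields, like a busy computation."""
    time.sleep(delay_s)


async def timed(
    make_branch: Callable[[float], Awaitable[None]], n: int, delay_s: float
) -> float:
    """Run n copies of a branch under asyncio.gather and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(make_branch(delay_s) for _ in range(n)))
    return time.perf_counter() - start


io_elapsed = asyncio.run(timed(io_branch, 3, 0.2))              # waits overlap
blocking_elapsed = asyncio.run(timed(blocking_branch, 3, 0.2))  # waits add up
```

The first measurement lands near 0.2 s and the second near 0.6 s, even though both go through the same `asyncio.gather` call. For genuinely CPU-bound branches, one option is to hand the work to a `concurrent.futures.ProcessPoolExecutor` via `loop.run_in_executor`.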
Implementation checklist
- Which branches are truly independent (no shared mutable state, no cross-reads)?
- Sectioning or voting? — it picks the reducer.
- What typed artifact does each branch return?
- What does the reducer do: merge, vote, reconcile, or filter?
- Budget: per-branch timeout, quorum, and cancellation of the rest?
- Is the laggard or the synthesis step the real bottleneck (§2)?
- Are per-branch token costs logged so the latency/cost trade is visible?
Where this points next
Parallelization buys speed and coverage, but it manufactures a new problem it cannot solve on its own: when branches disagree, the reducer can at best surface the contradiction — it has no mechanism to judge which branch is right or to repair a weak synthesis. That judgement is a second pass over the output: read what you produced, critique it against the evidence and constraints, and revise. That is Reflection, lesson 06, which adds the feedback loop that turns "here are three answers and a flagged conflict" into "here is the vetted answer, and here is why." Parallelization gives Reflection something worth critiquing.
[…] RunnableParallel for parallel runnables, Google ADK ParallelAgent/SequentialAgent for parallel agents, but remember asyncio gives concurrency, not parallelism, so it only helps I/O-bound branches.
Interview prompts
- What is the difference between parallelizing for latency and for robustness? (§1 — sectioning splits different work and the reducer merges complementary parts; voting runs the same task several ways and the reducer scores/votes. Same skeleton, opposite reducers.)
- You fan out 3 retrievers (3s, 2s, 1s) then synthesize (2s). What's the wall-clock, and what caps the speed-up? (§2 — parallel = max(3,2,1)+2 = 5s vs serial 8s, a 1.6× win; the non-parallel 2s synthesis is the Amdahl cap, and the 3s laggard dominates the fan-out.)
- Does parallelization save money? (§2 — no; latency overlaps but token cost does not. Three branches + a reducer bill four calls regardless of execution order; you trade dollars for time, which is lesson 18's concern.)
- Why is the fan-in reducer the hard part? (§3 — it must dedupe, keep attribution, and reconcile or surface conflicts; naive concatenation produces self-contradicting output and loses provenance.)
- Why does asyncio.gather speed up agent branches but not heavy local computation? (§4: asyncio is single-threaded concurrency; it overlaps I/O waits (LLM/API calls) but cannot preempt CPU-bound work, which in CPython needs separate processes, or threads only where the work releases the GIL.)
- How do you stop a slow branch from holding up the whole fan-out? (§5: use a shared timeout plus a quorum. Start the reducer once enough branches return, and cancel the laggard or treat it as best-effort.)
- Name the framework primitives the book uses and what each parallelizes. (§4: LangChain RunnableParallel runs independent runnables concurrently; Google ADK ParallelAgent runs sub-agents concurrently, with SequentialAgent orchestrating the parallel-then-merge phases.)