
Part II - Core execution patterns

Parallelization - fan-out, fan-in, and latency

Chaining (lesson 03) gave the agent a straight line and routing (lesson 04) gave it a fork. Both still do one thing at a time. But most real agent jobs hide several subtasks that do not depend on each other — three searches, four validators, five content blocks. Running them one after another wastes the dominant cost in an agent (waiting on I/O). Parallelization is the pattern that runs the independent parts at the same time and then reduces the results back into one state.

Book source
Chapter 3 - Parallelization (并行化); PDF outline pages 43-52. Concepts, the research-assistant example, the seven application scenarios, the LangChain RunnableParallel code, and the Google ADK ParallelAgent/SequentialAgent code are drawn from this chapter.
The plan
Five moves. (1) Name the two distinct reasons to parallelize — latency (split independent work) and robustness (try the same work several ways), and why conflating them produces bad designs. (2) Do the latency arithmetic so you can predict the speed-up before building anything, and meet the wall: parallelism only helps the part that is parallel. (3) Build the fan-out / fan-in shape, and argue that the reducer is the hard half. (4) Translate the shape into the book's two concrete framework forms — LangChain RunnableParallel and Google ADK ParallelAgent — and confront the asyncio "concurrency is not parallelism" trap. (5) Thread it all through the running research assistant. We close by handing the contradiction problem to lesson 06 (Reflection).
Linear position
Prerequisite: Prompt chaining (lesson 03) for sequential decomposition, and routing (lesson 04) for choosing a branch — together they tell you when two pieces of work are independent, which is the entire precondition for going parallel.
New capability: Concurrent execution of independent model calls, tool calls, or whole sub-agents, plus a reducer (fan-in) that merges, votes, or reconciles their outputs into one state the next step can consume.

1 · Two patterns wearing one name

The word "parallelization" hides two genuinely different designs, and the book is careful to separate them. Mixing them up is the most common reason a parallel agent gets slower, more expensive, and less correct all at once.

Sectioning (for latency)
Split different work across branches: search the web, query the company database, pull stock data, scan social media — all at once. Each branch does a distinct piece; the reducer concatenates / merges them into a fuller picture. Goal: shorter wall-clock time.
Voting (for robustness)
Run the same task several independent ways: three prompts, three models, three temperatures, all answering one question. The reducer scores, ranks, or votes and keeps the best. Goal: a more reliable answer when any single attempt is uncertain.

Sectioning increases throughput; voting increases quality. They share the fan-out/fan-in skeleton but demand completely different reducers — a merger that assumes its inputs are complementary will silently average away the disagreement that a voter was specifically built to surface. Decide which one you are building first, because that decision determines the reducer, and the reducer is where the difficulty lives.
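To make the contrast concrete, here is a minimal sketch of the two reducer families. The function names `merge_sections` and `majority_vote` are illustrative assumptions, not anything from the book, and a real voter would usually score answers against evidence rather than count exact string matches.

```python
from collections import Counter


def merge_sections(sections: dict[str, str]) -> str:
    """Sectioning reducer: branches are complementary, so keep all of them,
    each labelled with the branch that produced it."""
    return "\n".join(f"[{name}] {text}" for name, text in sections.items())


def majority_vote(answers: list[str]) -> tuple[str, bool]:
    """Voting reducer: branches answer the same question, so pick the most
    common answer and report whether a strict majority agreed on it."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count > len(answers) / 2


# Sectioning: different work per branch, everything survives the merge.
report = merge_sections({"news": "Q3 revenue rose.", "stock": "Shares up 4%."})

# Voting: the same question asked three ways; the flag exposes weak agreement.
answer, has_majority = majority_vote(["Paris", "Paris", "Lyon"])
```

Feeding the voting inputs to `merge_sections` would keep all three answers side by side with no verdict, and feeding complementary sections to `majority_vote` would throw most of the work away, which is the mismatch described above.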

The book's catalog of where this pays off is worth keeping in view; every one of these is independent work hiding inside a single user request:

Scenario                    | Parallel branches                                              | Pattern
Research a company          | news search · stock data · social media · internal DB          | sectioning
Analyze customer feedback   | sentiment · keyword extraction · classification · urgency flag | sectioning
Plan a trip                 | flights · hotels · local events · restaurants                  | sectioning
Draft a marketing email     | subject line · body · image · CTA copy                         | sectioning
Validate a form             | email format · phone · address lookup · profanity check        | sectioning
Process a multimodal post   | text sentiment + keywords ‖ image objects + scene              | sectioning
Generate ad headlines       | 3 titles from different prompts / models                       | voting / A/B

2 · The latency arithmetic — what you actually buy

Parallelization is the only pattern in this track whose benefit you can compute on a napkin before writing code, so do that first. The reason it works at all is that agent time is dominated by waiting: an LLM call or an API request spends almost all of its wall-clock time idle, waiting for a network round trip and for tokens to stream back. Three such calls in series wait three times in a row. Three in parallel wait once.

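$$T_{\text{serial}} = \sum_i t_i \qquad \text{vs} \qquad T_{\text{parallel}} = \max_i t_i + t_{\text{reduce}}$$

Serial cost is the sum of the branches; parallel cost is the slowest single branch (they overlap, so you only wait for the laggard) plus the reducer step that must run after all of them finish. The fan-in is inherently sequential, because it needs every input. A serial pipeline that ends in the same reducer pays $t_{\text{reduce}}$ as well, so end to end the comparison is $\sum_i t_i + t_{\text{reduce}}$ against $\max_i t_i + t_{\text{reduce}}$.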

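Worked number. Take the book's research assistant. It hits three retrievers (web search, internal docs, local notes) taking 3.0 s, 2.0 s, and 1.0 s, then a synthesis LLM call to fuse them at 2.0 s. Run serially, the pipeline takes 3.0 + 2.0 + 1.0 + 2.0 = 8.0 s. Run in parallel, it takes max(3.0, 2.0, 1.0) + 2.0 = 5.0 s, a speed-up of 8.0 / 5.0 = 1.6×.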

Notice three things the arithmetic teaches you for free. First, the speed-up is bounded: the synthesis step is not parallel (it needs all three inputs), so it sits in the denominator forever. This is Amdahl's law in agent clothing: if a fraction s of the work is irreducibly serial, your maximum speed-up is capped at 1/s no matter how many branches you fan out. Here the 2.0 s synthesis is a quarter of the 8.0 s serial total (3.0 + 2.0 + 1.0 + 2.0), so s = 0.25 and the ceiling is 4×. Second, more branches help only up to the slowest one; a fourth retriever at 0.5 s is free on the clock (it finishes inside the 3.0 s window) but still costs tokens. Third, the laggard dominates: shaving the 1.0 s branch does nothing; you must attack the 3.0 s branch (or give it a timeout) to move the number.
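The napkin arithmetic fits in a few lines of Python. This is a sketch under two assumptions: the function name `fan_out_latency` is made up for illustration, and the synthesis step is counted in both the serial and the parallel totals.

```python
def fan_out_latency(branch_secs: list[float], reduce_secs: float) -> dict[str, float]:
    """Predict wall-clock time for serial vs parallel execution of the branches,
    each followed by one sequential reduce step."""
    serial = sum(branch_secs) + reduce_secs    # every branch waits its turn
    parallel = max(branch_secs) + reduce_secs  # only the laggard is on the clock
    serial_fraction = reduce_secs / serial     # the part that can never overlap
    return {
        "serial": serial,
        "parallel": parallel,
        "speedup": serial / parallel,
        "amdahl_cap": 1 / serial_fraction,     # ceiling with unlimited fan-out
    }


# The research assistant: three retrievers, then a 2.0 s synthesis call.
base = fan_out_latency([3.0, 2.0, 1.0], reduce_secs=2.0)

# A fourth 0.5 s retriever hides inside the 3.0 s window.
with_fourth = fan_out_latency([3.0, 2.0, 1.0, 0.5], reduce_secs=2.0)
```

With these inputs `base` gives 8.0 s serial against 5.0 s parallel, and `with_fourth` leaves the parallel figure at 5.0 s because the extra branch finishes before the laggard does.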

Cost, though, does not parallelize. Wall-clock time overlaps; token spend does not. Running three branches still bills you for three branches' worth of input and output tokens, plus the synthesis call. If the serial version costs three LLM calls and the parallel version costs four (three branches + one reducer), you have spent more money to save time. That trade, buying latency with dollars, is the central tension this pattern hands to lesson 18 (Resource-aware optimization). Use the widget to feel both axes move at once.

[Interactive widget: Fan-out latency & cost]
Four candidate branches feed a synthesis (reduce) step. Toggle which branches run, set each branch's latency, and flip serial vs parallel execution. Serial waits for the sum of enabled branches; parallel waits only for the slowest, then adds the reducer. The cost meter counts LLM calls (branches + 1 reducer) and does not shrink when you parallelize; that is the whole point.
Default reading: wall-clock 5.0 s · serial would be 8.0 s · speed-up 1.6× · LLM calls (cost) 4

3 · The shape: fan-out, fan-in, and why the reducer is the hard half

The skeleton is four moves, and the first three are easy. The fourth is where designs live or die.

                 ┌──────────▶ branch A (search web)   ──┐
query ──fan-out──┼──────────▶ branch B (search docs)  ──┼──▶ REDUCER ──▶ one state
                 └──────────▶ branch C (search notes) ──┘    (fan-in)
                              independent contracts          dedupe · rank
                              run concurrently               resolve conflicts

  1. Identify independent branches. Two pieces of work are independent only if neither reads the other's output and they share no mutable state. This is the precondition lessons 03–04 trained you to spot. Get it wrong and you have a hidden dependency, not a parallel branch.
  2. Launch each branch with its own contract. Every branch gets its own typed input and output schema (lesson 02), so the reducer knows exactly what shape comes back from each.
  3. Collect the artifacts as the branches finish, under a shared budget (timeout, cancellation).
  4. Fan in with a reducer that does real work — not string concatenation.
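The four moves can be sketched end to end with plain asyncio. Everything named here is an assumption made for illustration: the `BranchResult` contract, the three stub retrievers (which sleep instead of calling a real service), and the 1.0 s budget.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchResult:
    """Contract every branch returns, so the reducer sees one known shape."""
    source: str
    facts: tuple[str, ...]


async def search_web(query: str) -> BranchResult:
    await asyncio.sleep(0.03)  # stand-in for a network round trip
    return BranchResult("web", (f"web fact about {query}",))


async def search_docs(query: str) -> BranchResult:
    await asyncio.sleep(0.02)
    return BranchResult("docs", (f"docs fact about {query}",))


async def search_notes(query: str) -> BranchResult:
    await asyncio.sleep(0.01)
    return BranchResult("notes", (f"notes fact about {query}",))


def reduce_results(results: list[BranchResult]) -> dict[str, list[str]]:
    """Fan-in: group facts by source so attribution survives the merge."""
    return {r.source: list(r.facts) for r in results}


async def run(query: str) -> dict[str, list[str]]:
    # Moves 1-3: independent branches, launched together, collected under
    # one shared timeout; on expiry wait_for cancels the gather and its tasks.
    results = await asyncio.wait_for(
        asyncio.gather(search_web(query), search_docs(query), search_notes(query)),
        timeout=1.0,
    )
    # Move 4: one reducer turns the branch artifacts into one state.
    return reduce_results(list(results))


state = asyncio.run(run("launch date"))
```

The reducer here only preserves attribution; it is deliberately the thinnest thing that is still not a bare string join.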

The seductive mistake is to treat fan-in as "\n".join(results). The book is blunt about this: the aggregator matters as much as the fan-out. A reducer has to know which job it is doing, and the four jobs are different:

Reducer job            | When                                       | What it must do
Merge sections         | sectioning: branches are complementary     | concatenate, but dedupe overlapping facts and keep source attribution
Compare alternatives   | voting: branches answer the same question  | score each against evidence/constraints, pick the best
Reconcile disagreement | branches conflict on a fact                | surface the conflict explicitly rather than silently picking one
Select the safest      | any branch may be unsafe/low-confidence    | filter on a guardrail/threshold before merging

The book's research example makes the stakes concrete: three retrievers can return overlapping, partially contradictory evidence. A reducer that concatenates produces a report that says both "X is true" and "X is false" two paragraphs apart and never notices. A reducer with a contradiction policy either resolves it (by recency, authority, or quorum) or flags it for the next stage. That "flag it for the next stage" is precisely the seam into Reflection (lesson 06).
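A contradiction policy of the "flag it" kind might look like the sketch below. The `Claim` shape and the `reduce_claims` function are assumptions for illustration, and the sketch assumes claims have already been normalized to comparable topic/value pairs, which is the hard part in practice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    topic: str   # what the claim is about, e.g. "launch_month"
    value: str   # what the source asserts
    source: str  # which branch said it


def reduce_claims(claims: list[Claim]) -> dict[str, dict]:
    """Dedupe agreeing claims, keep attribution, and surface disagreements."""
    by_topic: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for claim in claims:
        by_topic[claim.topic][claim.value].append(claim.source)

    state: dict[str, dict] = {}
    for topic, values in by_topic.items():
        if len(values) == 1:
            # Every source agrees: collapse to one fact, remember who said it.
            value, sources = next(iter(values.items()))
            state[topic] = {"status": "agreed", "value": value,
                            "sources": sorted(set(sources))}
        else:
            # Sources disagree: emit every candidate, attributed, and decide nothing.
            state[topic] = {"status": "conflict",
                            "candidates": {v: sorted(set(s)) for v, s in values.items()}}
    return state


evidence = reduce_claims([
    Claim("launch_month", "March", "docs"),
    Claim("launch_month", "April", "notes"),
    Claim("ceo", "Ada", "web"),
    Claim("ceo", "Ada", "docs"),
])
```

Resolving by recency, authority, or quorum would replace the `else` branch; surfacing keeps the reducer honest about what it does not know.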

4 · Two framework forms (and the asyncio trap)

The book shows the same pattern in two framework dialects. They differ in what runs in parallel — runnables vs whole agents — but the shape is identical.

LangChain (LCEL) — parallel runnables. You bundle independent chains into a RunnableParallel (a dict of branches); when the bundle is invoked, the LCEL runtime fires the branches concurrently and returns a dict keyed by branch name. The book's example fans one topic out into three chains — summary, questions, key-terms — preserving the original input with RunnablePassthrough, then pipes the merged dict into a synthesis prompt:

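from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# summarize_chain, questions_chain, terms_chain, synthesis_prompt, and llm
# are assumed to be defined earlier, as in the book's example.
map_chain = RunnableParallel({
    "summary":   summarize_chain,     # 3 independent LLM chains,
    "questions": questions_chain,     # run concurrently on one topic
    "key_terms": terms_chain,
    "topic":     RunnablePassthrough()  # carry the original input forward
})

# fan-in: the merged dict feeds one synthesis prompt → LLM → parser
full_chain = map_chain | synthesis_prompt | llm | StrOutputParser()

# ainvoke is a coroutine, so await it from inside an async function
result = await full_chain.ainvoke("the history of space exploration")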

Google ADK — parallel agents. ADK gives first-class primitives: ParallelAgent runs its sub-agents concurrently, and SequentialAgent orchestrates phases. The book's sustainable-tech example wires three LlmAgent researchers (renewable energy, EVs, carbon capture), each using google_search and writing to a distinct output_key in shared session state. A ParallelAgent runs all three; then a synthesis merger_agent reads the three keys and writes a structured report — and crucially is instructed to use only the input summaries, adding no outside knowledge. A SequentialAgent chains "parallel research → merge" as the root:

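from google.adk.agents import LlmAgent, ParallelAgent, SequentialAgent

# Abridged: the three researcher agents are defined earlier, and the name
# (and, for LlmAgent, model) arguments are omitted here for brevity.
parallel_research = ParallelAgent(
    sub_agents=[renewable_researcher, ev_researcher, carbon_researcher])

merger = LlmAgent(instruction="Synthesize ONLY from the input summaries: "
                              "{renewable_energy_result} {ev_technology_result} "
                              "{carbon_capture_result}. Attribute each section.")

root_agent = SequentialAgent(sub_agents=[parallel_research, merger])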

Both forms put the same guardrail on the reducer: synthesize only from the branch outputs. That instruction is doing real work — it stops the merger from hallucinating glue between sources, which is the failure mode that makes parallel research reports untrustworthy.

The asyncio trap — concurrency is not parallelism
The book flags this explicitly. Python's asyncio gives concurrency, not true parallelism: a single-threaded event loop switches between tasks whenever one is idle (e.g. waiting on a network response). That is exactly right for agent work, because agent branches are I/O-bound: they spend their time waiting on LLM/API round trips, not burning CPU. The event loop overlaps all that waiting, so you get the full latency win. But if a branch were CPU-bound (heavy local computation, no await points), asyncio.gather would not speed it up at all, because the loop can't preempt a busy task. Threads do not rescue pure-Python CPU work either, since in standard CPython builds the GIL lets only one thread execute Python bytecode at a time; for CPU-bound branches you need separate processes (or native code that releases the GIL). Match the concurrency primitive to where the time actually goes.
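The difference is easy to observe. In this sketch `io_branch` awaits, so the event loop can overlap three of them, while `blocking_branch` never yields; `time.sleep` is used here only as a stand-in for a busy CPU-bound computation, and all names are illustrative assumptions.

```python
import asyncio
import time


async def io_branch(seconds: float) -> None:
    await asyncio.sleep(seconds)  # yields to the event loop while "waiting"


async def blocking_branch(seconds: float) -> None:
    time.sleep(seconds)  # never yields: the loop is stuck until this returns


async def timed_gather(branch, seconds: float, n: int = 3) -> float:
    """Run n copies of a branch under asyncio.gather and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(branch(seconds) for _ in range(n)))
    return time.perf_counter() - start


# Three awaiting branches overlap: roughly one sleep's worth of wall-clock.
io_elapsed = asyncio.run(timed_gather(io_branch, 0.05))

# Three blocking branches run back to back: roughly three sleeps' worth.
blocking_elapsed = asyncio.run(timed_gather(blocking_branch, 0.05))
```

For blocking I/O calls that cannot be awaited, `asyncio.to_thread` is one way to get them off the loop; for genuinely CPU-bound work, a process pool such as `concurrent.futures.ProcessPoolExecutor` is the usual choice.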

5 · Running example — the research assistant, parallelized

Pull the thread together. Our research agent receives one query and runs three retrievers concurrently, each under a shared budget, then reduces with a real policy — not a join:

# three coroutines, created but not yet awaited
branches = [search_web(query), search_docs(query), search_notes(query)]

# fan-out under a shared budget; return once a quorum of branches has
# answered or the timeout expires, and cancel the laggard
results = await gather_with_budget(
    branches,
    timeout=8.0,
    quorum=2                          # proceed once 2 of 3 branches return
)

# fan-in: a reducer that knows it is MERGING SECTIONS
evidence = reduce_results(
    results,
    dedupe=True,                      # collapse facts that appear in 2+ sources
    require_source=True,              # drop any claim without attribution
    conflict_policy="surface_conflict",  # don't silently pick a side
)

The timeout=8.0 and quorum=2 together implement the latency lesson from §2: do not wait on a slow, low-value branch once you have enough evidence. If web search is the 3.0 s laggard and the other two return in 2.0 s, a quorum of 2 lets synthesis start at 2.0 s and treat the web branch as best-effort. The surface_conflict policy implements §3: when docs say the launch is in March and notes say April, the reducer emits both, attributed, instead of guessing — and that flagged conflict is the input lesson 06 will critique.
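The lesson names `gather_with_budget` without defining it, so the following is one possible reading rather than the book's implementation. It assumes the quorum is enforced on the gather side, and it takes a dict of named awaitables (instead of a list) so that results stay attributed to their branch.

```python
import asyncio
from typing import Any, Awaitable


async def gather_with_budget(
    branches: dict[str, Awaitable[Any]], timeout: float, quorum: int
) -> dict[str, Any]:
    """Return results from the first `quorum` branches to succeed, or whatever
    has finished when `timeout` seconds have passed; cancel everything else."""
    tasks = {asyncio.ensure_future(aw): name for name, aw in branches.items()}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    results: dict[str, Any] = {}
    pending = set(tasks)

    while pending and len(results) < quorum:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # shared budget exhausted
        done, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            # Failed branches are skipped; they do not count toward the quorum.
            if not task.cancelled() and task.exception() is None:
                results[tasks[task]] = task.result()

    for task in pending:  # the laggards: best-effort work we stop paying for
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    return results


async def fake_search(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for a real retriever
    return f"{name} evidence"


async def demo() -> dict[str, Any]:
    return await gather_with_budget(
        {"web": fake_search("web", 0.30),
         "docs": fake_search("docs", 0.02),
         "notes": fake_search("notes", 0.01)},
        timeout=1.0,
        quorum=2,
    )


results = asyncio.run(demo())
```

In the demo the slow web branch is cancelled as soon as docs and notes have both returned, so synthesis could start without waiting on it.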

Failure modes

  • Fake independence. Branches share a constraint or one secretly needs another's output; you get duplicated effort or incompatible results that the reducer can't fuse.
  • Concatenation as fan-in. No dedupe, no conflict handling — the report contradicts itself and nobody notices.
  • Waiting on the laggard. Holding all branches until the slowest returns even after a quorum already has enough evidence; the timeout/quorum exists for exactly this.
  • Parallelizing for show. Fanning out CPU-bound work under asyncio (no speed-up) or fanning out three branches to save 1 s while tripling token cost.
  • Lost attribution. Merger forgets which source said what, so conflicts and provenance both vanish.

Implementation checklist

  • Which branches are truly independent (no shared mutable state, no cross-reads)?
  • Sectioning or voting? — it picks the reducer.
  • What typed artifact does each branch return?
  • What does the reducer do: merge, vote, reconcile, or filter?
  • Budget: per-branch timeout, quorum, and cancellation of the rest?
  • Is the laggard or the synthesis step the real bottleneck (§2)?
  • Are per-branch token costs logged so the latency/cost trade is visible?

Where this points next

Parallelization buys speed and coverage, but it manufactures a new problem it cannot solve on its own: when branches disagree, the reducer can at best surface the contradiction — it has no mechanism to judge which branch is right or to repair a weak synthesis. That judgement is a second pass over the output: read what you produced, critique it against the evidence and constraints, and revise. That is Reflection, lesson 06, which adds the feedback loop that turns "here are three answers and a flagged conflict" into "here is the vetted answer, and here is why." Parallelization gives Reflection something worth critiquing.

Takeaway
Parallelization runs independent subtasks concurrently and then reduces them into one state. It wears two faces: sectioning (split different work for latency) and voting (try the same work several ways for robustness), and the choice fixes the reducer. The win is computable: $T_{\text{parallel}} = \max_i t_i + t_{\text{reduce}}$ versus $\sum_i t_i$, capped by the serial synthesis step (Amdahl) and bought with token cost that does not parallelize. The fan-out is trivial; the reducer is the design: it must dedupe, attribute, and surface conflicts rather than concatenate. Frameworks supply the plumbing: LangChain RunnableParallel for parallel runnables, Google ADK ParallelAgent/SequentialAgent for parallel agents. But remember asyncio gives concurrency, not parallelism, so it only helps I/O-bound branches.

Interview prompts