Pipeline parallel — and the bubble

Shard the model across stages and send activations between them. The catch is the bubble: while the pipeline fills and drains, some ranks have no work to do. The whole lesson is how to shrink that idle time.

The setup

TP splits a layer across ranks. PP splits the layers themselves across ranks. With L layers and N stages, stage 0 holds layers 1..L/N, stage 1 holds L/N+1..2L/N, etc. A forward pass walks down the stages; backward walks back up.

Per-stage comm is cheap. At the boundary between stages, you send the output activation of the last layer in stage i to stage i+1 as its input; the backward pass sends the gradient of that activation, which has the same shape, the other way. That is one Send/Recv per stage boundary per micro-batch in each direction. The activation shape is just (B, T, d), with B the micro-batch size, regardless of how many layers are in the stage. This is point-to-point, not a collective, so the cost per boundary does not grow with N. PP across nodes is fine.
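As a rough sanity check on "cheap", the size of one boundary message can be sketched as below. The function name and the 2-bytes-per-element default (bf16 activations) are assumptions for illustration, not something the lesson specifies.

```python
def boundary_message_bytes(B, T, d, bytes_per_elem=2):
    """Bytes in one (B, T, d) activation tensor crossing a stage boundary.

    bytes_per_elem=2 assumes bf16/fp16 activations (an assumption).
    B is the micro-batch size, T the sequence length, d the hidden size.
    """
    return B * T * d * bytes_per_elem


# Example: one 4096-token sequence per micro-batch, hidden size 4096.
size = boundary_message_bytes(B=1, T=4096, d=4096)
print(f"{size / 2**20:.0f} MiB per micro-batch, per direction")
```

The number of layers in the stage does not appear anywhere in the calculation, which is the point made above.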

The naive picture — GPipe

The obvious thing: split a macro-batch into M micro-batches. Stage 0 processes micro-batch 0, sends it to stage 1, then moves on to micro-batch 1, and so on. Each later stage starts on a micro-batch as soon as it arrives, so the forwards overlap along a diagonal until stage N-1 has finished forward on all M micro-batches. Then backward proceeds in reverse.

 rank 0: F0 F1 F2 F3 .  .  .  .  .  .  B3 B2 B1 B0
 rank 1: .  F0 F1 F2 F3 .  .  .  .  B3 B2 B1 B0 .
 rank 2: .  .  F0 F1 F2 F3 .  .  B3 B2 B1 B0 .  .
 rank 3: .  .  .  F0 F1 F2 F3 B3 B2 B1 B0 .  .  .

The .'s are idle time. Ranks 1, 2, 3 sit doing nothing while stage 0 fills the pipeline, and again at the tail when the last micro-batch propagates back. This is the pipeline bubble.
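The schedule can be reproduced with a few lines of Python. This is a toy sketch that assumes every forward and every backward takes exactly one time slot and that sends are free; `gpipe_schedule` is a name made up here. It marks idle slots at the start and end of each row as well as in the middle.

```python
def gpipe_schedule(N, M):
    """Return grid[rank][slot] for GPipe, assuming t_f == t_b == one slot."""
    total_slots = 2 * (M + N - 1)
    grid = [["."] * total_slots for _ in range(N)]
    for rank in range(N):
        # The last rank starts backward right after its final forward;
        # each earlier rank starts one slot later.
        backward_start = (M + N - 1) + (N - 1 - rank)
        for m in range(M):
            # Forward of micro-batch m reaches this rank after `rank` hops.
            grid[rank][rank + m] = f"F{m}"
            # Backwards run in reverse micro-batch order.
            grid[rank][backward_start + (M - 1 - m)] = f"B{m}"
    return grid


for rank, row in enumerate(gpipe_schedule(N=4, M=4)):
    print(f"rank {rank}: " + " ".join(f"{cell:<2}" for cell in row))
```

For N = 4 and M = 4 every rank is idle for 6 of the 14 slots, i.e. 3/7 of the run.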

The bubble formula

For N stages and M micro-batches per macro-batch, let each forward step take t_f and each backward t_b. Total wall-clock:

T_total = (M + N - 1) · (t_f + t_b)

The ideal time, if pipelining were perfect, would be M · (t_f + t_b). The extra (N - 1) · (t_f + t_b) is the bubble, so as a fraction of total wall-clock time:

bubble_fraction = (N - 1) / (M + N - 1)

Two practical readings:

More micro-batches → smaller bubble. At M = 4N, the bubble approaches 20% from below as N grows (about 18% at N = 8). At M = 16N, it stays under 6%.
More stages → bigger bubble. For fixed M much larger than N, doubling N roughly doubles the bubble.
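Both readings are easy to check numerically. A minimal sketch, with `bubble_fraction` as a hypothetical helper name:

```python
def bubble_fraction(N, M):
    """Idle fraction of wall-clock time: (N - 1) / (M + N - 1)."""
    return (N - 1) / (M + N - 1)


# Reading 1: more micro-batches per stage count shrink the bubble.
N = 8
for k in (1, 4, 16):
    print(f"N={N}, M={k}N: {bubble_fraction(N, k * N):.1%}")

# Reading 2: for fixed M, more stages grow the bubble.
M = 256
for stages in (4, 8, 16):
    print(f"M={M}, N={stages}: {bubble_fraction(stages, M):.1%}")
```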

So the tension is: PP gives you more parallelism (more stages → more GPUs share the model state), but more stages also waste more cycles in the bubble. The fix is to crank M — but cranking M requires storing the activations of M in-flight micro-batches per stage. Activation memory grows with M. Memory budget caps how big M can be. This is where 1F1B helps.

Animated · the pipeline filling, GPipe vs 1F1B

Below is the same schedule but rendered cell-by-cell as the wall clock ticks. Press play to watch the diagonal "fill" at the start and the matching diagonal at the end: that's the bubble. Toggle between GPipe (all forwards, then all backwards) and 1F1B (forwards and backwards interleaved as soon as the last stage finishes its first forward). The bubble fraction is identical between the two. What changes is the peak number of in-flight micro-batches per stage, which is what determines activation memory.

2D · bubble fraction (N - 1)/(M + N - 1)

The bubble fraction is a clean two-parameter function. The widget below plots both: (left) a 2D heatmap of bubble fraction over (N, M); (right) a stacked bar of "useful work" vs "wasted bubble" at your current settings. Cranking M shrinks the wasted band; cranking N grows it.

3D · pipeline cube — stages × microbatches × direction

The schedule has three natural axes: pipeline stage (which GPU), micro-batch (which input), and direction (forward vs backward). Visualised as cells in a 3D cube, you can see the diagonal sweep of forward across stages, then backward sweeping back. Cells are colored by activity: compute-forward (blue), compute-backward (orange), idle/bubble (dark), communication (gold dot).

1F1B — same bubble, less memory

The PipeDream / Megatron 1F1B schedule observes: once the pipeline is full, alternate one forward and one backward per stage. As soon as a micro-batch finishes its forward all the way to stage N-1, its backward can start immediately:

 rank 0: F0 F1 F2 F3 .  .  .  B0 F4 B1 F5 B2 F6 B3 …
 rank 1: .  F0 F1 F2 .  .  B0 F3 B1 F4 B2 F5 B3 …
 rank 2: .  .  F0 F1 .  B0 F2 B1 F3 B2 F4 B3 …
 rank 3: .  .  .  F0 B0 F1 B1 F2 B2 F3 B3 …

The bubble fraction is unchanged from GPipe — still (N-1)/(M+N-1). So why bother?

Memory. Under GPipe, each stage holds activations for up to M in-flight forward micro-batches before the first backward starts. Under 1F1B, a stage never has more than N micro-batches in flight (stage i holds at most N - i), because each backward frees that micro-batch's activations before the next forward is admitted. The peak activation memory per stage drops from M micro-batches' worth to at most N. So you can crank M much higher under 1F1B without OOM, and that does shrink the bubble.

The lesson: 1F1B doesn't reduce the bubble directly; it lets you reduce the bubble indirectly by lifting the memory ceiling on M.
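The "same bubble, less memory" claim can be checked with a toy slot-based simulator. This is a sketch under stated assumptions: forward and backward each take one slot, sends are free, and each rank greedily runs the next op in its list once its dependency finished in an earlier slot. All function names here are made up for illustration.

```python
def gpipe_ops(M):
    """GPipe order on every rank: all forwards, then backwards in reverse."""
    return [("F", m) for m in range(M)] + [("B", m) for m in reversed(range(M))]


def one_f_one_b_ops(rank, N, M):
    """1F1B order: warm-up forwards, alternate F/B, then drain the backwards."""
    warmup = min(N - 1 - rank, M)
    ops = [("F", m) for m in range(warmup)]
    for i in range(M - warmup):
        ops += [("F", warmup + i), ("B", i)]
    ops += [("B", m) for m in range(M - warmup, M)]
    return ops


def simulate(ops_per_rank):
    """Return grid[rank][slot]; at most one op per rank per slot."""
    N = len(ops_per_rank)
    done = set()        # (kind, micro_batch, rank) finished in earlier slots
    cursor = [0] * N    # index of the next op on each rank
    grid = [[] for _ in range(N)]
    while any(cursor[r] < len(ops_per_rank[r]) for r in range(N)):
        ran = []
        for r in range(N):
            cell = "."
            if cursor[r] < len(ops_per_rank[r]):
                kind, m = ops_per_rank[r][cursor[r]]
                if kind == "F":
                    dep = ("F", m, r - 1) if r > 0 else None
                else:
                    dep = ("B", m, r + 1) if r < N - 1 else ("F", m, r)
                if dep is None or dep in done:
                    ran.append((kind, m, r))
                    cursor[r] += 1
                    cell = f"{kind}{m}"
            grid[r].append(cell)
        done.update(ran)  # results become visible to neighbours next slot
    return grid


def peak_in_flight(row):
    """Max number of micro-batches with forward done but backward pending."""
    live = peak = 0
    for cell in row:
        if cell.startswith("F"):
            live += 1
        elif cell.startswith("B"):
            live -= 1
        peak = max(peak, live)
    return peak


N, M = 4, 8
gpipe = simulate([gpipe_ops(M) for _ in range(N)])
one_f1b = simulate([one_f_one_b_ops(r, N, M) for r in range(N)])
print("slots:", len(gpipe[0]), len(one_f1b[0]))
print("GPipe peak in-flight:", [peak_in_flight(row) for row in gpipe])
print("1F1B  peak in-flight:", [peak_in_flight(row) for row in one_f1b])
```

With N = 4 and M = 8 both schedules take 22 slots, but GPipe peaks at 8 in-flight micro-batches on every rank while 1F1B peaks at 4, 3, 2, 1 from rank 0 to rank 3.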

Interleaved 1F1B — actually reducing the bubble

Megatron-LM's interleaved schedule (Narayanan et al. 2021) goes further. Instead of each rank owning a contiguous chunk of layers, each rank owns multiple non-contiguous chunks. With V "virtual stages" per physical rank, the pipeline has V · N virtual stages on N ranks. The bubble fraction becomes:

bubble_fraction_interleaved = (N - 1) / (V · M + N - 1)

For V=2 and M = 8N, the bubble drops from ~11% to ~6%. The cost: V times as many activation Send/Recvs per rank, because each micro-batch now passes through every rank V times. Each message is the same size, but only 1/V as much compute sits between consecutive sends to hide it behind, so there's a knee: past V=4 or so, returns diminish.
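A small sketch of both ideas: which virtual stages land on which rank, and what the formula gives. The round-robin placement (virtual stage v · N + rank lives on that rank) is the usual way to describe the interleaved layout; the helper names are hypothetical.

```python
def chunks_for_rank(rank, N, V):
    """Virtual-stage indices owned by `rank` when V chunks are dealt round-robin."""
    return [v * N + rank for v in range(V)]


def interleaved_bubble_fraction(N, M, V=1):
    """(N - 1) / (V * M + N - 1); V=1 recovers the plain 1F1B/GPipe formula."""
    return (N - 1) / (V * M + N - 1)


N, M = 8, 64  # M = 8N
for rank in range(2):
    print(f"rank {rank} owns virtual stages {chunks_for_rank(rank, N, V=2)}")
for V in (1, 2, 4):
    print(f"V={V}: bubble = {interleaved_bubble_fraction(N, M, V):.1%}")
```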

Zero-bubble pipeline (newer)

Subsequent work (Zero Bubble Pipeline Parallelism, Qi et al. 2023) further decomposes backward into its two halves: gradient w.r.t. input (needed by the previous stage) and gradient w.r.t. weights (only needed locally). The weight gradient can be deferred. Carefully scheduled, this asymptotically eliminates the bubble. The cost is complexity and a brief memory spike. Modern frameworks (e.g. NVIDIA NeMo, Megatron-LM) ship this as an option.
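The split is easiest to see on a single linear layer Y = X · W. The sketch below is only that one layer in plain Python lists, not the paper's scheduler; the function names are made up here. The input gradient is what the previous stage is waiting for, while the weight gradient touches nothing outside this rank and can be run whenever there is a gap.

```python
def matmul(A, B):
    """Plain-list matrix product, enough for a tiny worked example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def backward_input(dY, W):
    """Input-gradient half: dX = dY @ W^T. Sent to the previous stage."""
    return matmul(dY, transpose(W))


def backward_weight(X, dY):
    """Weight-gradient half: dW = X^T @ dY. Local only, so it can be deferred."""
    return matmul(transpose(X), dY)


X = [[1, 2]]            # saved input, shape (1, 2)
W = [[3, 4], [5, 6]]    # weights, shape (2, 2)
dY = [[1, 1]]           # gradient arriving from the next stage

dX = backward_input(dY, W)     # do this first and send it upstream
# ... other micro-batches' work could be scheduled here ...
dW = backward_weight(X, dY)    # fill an otherwise idle slot with this
print(dX, dW)
```

The deferred half still needs X and dY kept around until it runs, which is one way to picture the extra memory the text mentions.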

Layer assignment — uneven is sometimes right

A surprising practical detail: layers aren't always evenly distributed across stages. The first stage has the embedding (often big: about 131M params for a 32k vocab × 4096 d model), and the last stage has the LM head (same size). Stages 0 and N-1 are heavier than the middle stages if we just split layers evenly. Sometimes we move the LM head onto a separate stage, or pack fewer layers into stage 0 to compensate. Megatron and friends expose this as a per-stage layer count.
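One way to picture the compensation is a greedy split that treats the embedding and LM head as extra load on the first and last stage. This is an illustrative sketch, not how any particular framework chooses its layer counts: the "layer-equivalent" extras are made-up knobs that in practice would come from profiling.

```python
def split_layers(num_layers, N, first_extra=0, last_extra=0):
    """Per-stage layer counts that roughly balance load.

    first_extra / last_extra: assumed cost of the embedding and LM head,
    expressed in units of one transformer layer (hypothetical values).
    """
    loads = [0] * N
    loads[0] += first_extra
    loads[-1] += last_extra
    counts = [0] * N
    for _ in range(num_layers):
        i = loads.index(min(loads))  # lightest stage so far, lowest index on ties
        loads[i] += 1
        counts[i] += 1
    return counts


print(split_layers(32, 4))                                # even split
print(split_layers(32, 4, first_extra=1, last_extra=2))   # lighter ends
```

Only the counts matter here; stage i still takes a contiguous run of layers of that length.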

Cost summary

Quantity	Per-step cost
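Activation Send between stages	M · 2 · B · T · d bytes per stage boundary, per step (one (B, T, d) message per micro-batch at 2 bytes per element, e.g. bf16), plus the same volume of activation gradients flowing back in backward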
Activation Recv at stages 1..N-1	Same as above
Gradient AllReduce	Zero — PP doesn't AllReduce gradients (each stage's params are unique to that stage)
Peak activation memory per stage	GPipe: ~M micro-batches' worth of stored activations. 1F1B: at most N micro-batches' worth
Bubble fraction	(N - 1) / (M + N - 1), or (N - 1) / (V · M + N - 1) for interleaved

When PP is the right hammer

Across-node sharding of the model. Activation Send between stages is small, infrequent, and tolerates the lower bandwidth of InfiniBand (IB) links between nodes.
Composed with TP intra-node. TP=8 within a node, PP across nodes. This is the canonical "3D parallelism" middle layer (lesson 12).
Not at inference time. PP's bubble penalises single-request latency hard — for a single inference request, you pay a bubble equal to N - 1 stage times. Inference uses TP (lesson 14) or replication. PP for batch-inference of trillions of tokens? Sometimes; for low-latency chat, never.

Interactive · the bubble in motion

Drag N (stages) and M (micro-batches). The widget animates the GPipe and 1F1B schedules side by side, with the bubble fraction in the KPIs. Try M = N — the bubble's nearly half the run. Now M = 4N: better. M = 16N: bubble is small but activation memory under GPipe is huge — that's the slider in real life.

DualPipe and the DP caveat

DualPipe — the step past zero-bubble

Zero-bubble defers the weight-gradient half of backward to fill gaps. DualPipe (DeepSeek-V3) goes one step further: run two pipelines in opposite directions at once, so the forward of a micro-batch travelling one way overlaps the backward of a different micro-batch travelling the other way. With forward and backward kept symmetric and co-scheduled, the bubble is almost entirely filled: the idle triangles get packed with the counter-flowing pipeline's work. The price is duplication: each rank holds ~2× the weights (it participates in both directions) and keeps more micro-batches' activations in flight, so DualPipe trades memory for near-total bubble elimination. It only makes sense when you have the HBM headroom and a comm-heavy regime (DeepSeek pairs it with cross-node expert-parallel AllToAll, which it overlaps with compute).

The DP caveat — "gradient AllReduce: zero" is per-pipeline

The cost table above lists gradient AllReduce: zero, and that is true within a single pipeline: each stage owns unique params, so there is nothing to AllReduce along the PP axis. But PP is essentially never run alone — it is composed with DP (replicate the whole pipeline across more GPUs). Across those outer DP replicas, every stage's gradients are AllReduced, exactly as in lesson 04. So the honest accounting is: zero gradient comm along PP, full DDP-style gradient AllReduce along the DP axis that wraps it.
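The two axes can be made concrete by listing which global ranks talk to each other. The rank layout below (global rank = dp_index · pp + stage) is an assumption chosen for illustration; real frameworks pick their own ordering.

```python
def parallel_groups(dp, pp):
    """Return (pipelines, allreduce_groups) for dp replicas of a pp-stage pipeline.

    Assumed layout: global_rank = dp_index * pp + stage.
    """
    # Ranks in one pipeline exchange activations point-to-point.
    pipelines = [[d * pp + s for s in range(pp)] for d in range(dp)]
    # Ranks holding the same stage in different replicas AllReduce gradients.
    allreduce_groups = [[d * pp + s for d in range(dp)] for s in range(pp)]
    return pipelines, allreduce_groups


pipelines, allreduce_groups = parallel_groups(dp=2, pp=4)
print("PP (Send/Recv) groups:", pipelines)
print("DP (AllReduce) groups:", allreduce_groups)
```

Each AllReduce group only covers one stage's parameters, so the gradient traffic per rank is that stage's share of the model rather than the whole thing.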

Takeaway

PP's bubble is (N - 1) / (M + N - 1). To shrink it you crank M. Cranking M needs activation memory — which is exactly what 1F1B's reduced peak-memory unlocks. Interleaved 1F1B and zero-bubble schedules attack the formula directly. PP is cheap on comm — point-to-point only — which is why it lives across nodes in 3D parallelism.