Pipeline parallel — and the bubble
Shard the model across stages, send activations between them. The catch is the bubble: until the pipeline fills, some ranks have no work to do. The whole lesson is how to shrink that idle time.
The setup
TP splits a layer across ranks. PP splits the layers themselves across ranks. With L layers and N stages, stage 0 holds layers 1..L/N, stage 1 holds L/N+1..2L/N, etc. A forward pass walks down the stages; backward walks back up.
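The contiguous layer split can be sketched as below. This is a minimal illustration, not any framework's partitioner; the function name `stage_layers` and the choice to hand leftover layers to the earliest stages are assumptions made for the example.

```python
def stage_layers(num_layers: int, num_stages: int) -> list[range]:
    """Contiguous, near-even split of layers 1..num_layers across stages.

    Assumption: when num_layers is not divisible by num_stages, the first
    (num_layers % num_stages) stages each take one extra layer.
    """
    base, extra = divmod(num_layers, num_stages)
    parts, start = [], 1
    for stage in range(num_stages):
        count = base + (1 if stage < extra else 0)
        parts.append(range(start, start + count))
        start += count
    return parts


if __name__ == "__main__":
    for stage, layers in enumerate(stage_layers(32, 4)):
        print(f"stage {stage}: layers {layers.start}..{layers.stop - 1}")
```

With L = 32 and N = 4 this gives stage 0 layers 1..8, stage 1 layers 9..16, and so on, matching the 1..L/N, L/N+1..2L/N pattern.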
Per-stage comm is cheap — at the boundary between stages, you send the output activation of the last layer in stage i to stage i+1 as its input. One Send/Recv per stage per micro-batch. The activation shape is just (B, T, d), regardless of how many layers are in the stage. This is point-to-point, not a collective: it doesn't scale poorly with N. PP across nodes is fine.
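To put a rough number on "cheap", the sketch below sizes one boundary message and estimates its transfer time. The 2 bytes per element (bf16 or fp16) and the 50 GB/s link speed are assumptions chosen for illustration, and the estimate ignores latency and any overlap with compute.

```python
def boundary_message_bytes(B: int, T: int, d: int, bytes_per_elem: int = 2) -> int:
    """Size of one (B, T, d) activation sent across a stage boundary.

    Assumption: 2 bytes per element (bf16/fp16 activations).
    """
    return B * T * d * bytes_per_elem


def transfer_ms(num_bytes: int, link_gb_per_s: float) -> float:
    """Bandwidth-only transfer time in milliseconds (no latency term)."""
    return num_bytes / (link_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    size = boundary_message_bytes(B=1, T=4096, d=4096)
    print(f"message size: {size / 2**20:.0f} MiB")
    # 50 GB/s is a hypothetical link speed, not a measured figure.
    print(f"transfer time at 50 GB/s: {transfer_ms(size, 50):.3f} ms")
```

Note that the message size has no dependence on the number of layers in the stage or on N, which is the point made above.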
The naive picture — GPipe
The obvious thing: split a macro-batch into M micro-batches. Stage 0 processes micro-batch 0, sends it to stage 1, processes micro-batch 1, … Then stage 1 takes them in turn, then stage 2, …, until stage N-1 has finished forward on all M micro-batches. Then backward proceeds in reverse.
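rank 0: F0 F1 F2 F3 .. .. .. .. .. .. B3 B2 B1 B0
rank 1: .. F0 F1 F2 F3 .. .. .. .. B3 B2 B1 B0 ..
rank 2: .. .. F0 F1 F2 F3 .. .. B3 B2 B1 B0 .. ..
rank 3: .. .. .. F0 F1 F2 F3 B3 B2 B1 B0 .. .. ..
Each column is one time slot (N = 4, M = 4, drawn with t_f = t_b) and each ".." is an idle slot. Ranks 1, 2, 3 sit doing nothing while the first micro-batch works its way down from stage 0, and the early ranks wait again while the backward pass propagates back up from the last stage. Every rank is idle for 2(N - 1) = 6 of the 14 slots. This is the pipeline bubble.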
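The GPipe timeline can be generated programmatically. The sketch below is a toy scheduler, not GPipe's implementation: it assumes every forward and every backward takes exactly one time slot and that communication is free, and the function name `gpipe_timeline` is made up for the example.

```python
def gpipe_timeline(N: int, M: int) -> list[list[str]]:
    """grid[rank][slot] for a GPipe schedule.

    Assumptions: t_f == t_b == one slot, zero communication cost,
    backwards run in reverse micro-batch order (as in the diagram above).
    """
    total_slots = 2 * (M + N - 1)
    grid = [["."] * total_slots for _ in range(N)]
    for rank in range(N):
        # Forward wave: micro-batch m reaches this rank at slot rank + m.
        for m in range(M):
            grid[rank][rank + m] = f"F{m}"
        # Backward wave starts on the last rank once all forwards are done,
        # then moves up one rank per slot.
        bwd_start = (M + N - 1) + (N - 1 - rank)
        for k, m in enumerate(reversed(range(M))):
            grid[rank][bwd_start + k] = f"B{m}"
    return grid


if __name__ == "__main__":
    N, M = 4, 4
    grid = gpipe_timeline(N, M)
    for rank, row in enumerate(grid):
        print(f"rank {rank}:", " ".join(cell.ljust(2) for cell in row))
    idle = [row.count(".") for row in grid]
    print("idle slots per rank:", idle)
```

Under these assumptions every rank has the same number of idle slots, 2(N - 1); only their position in the timeline differs from rank to rank.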
The bubble formula
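For N stages and M micro-batches per macro-batch, let each forward step take t_f and each backward t_b. Total wall-clock:
$$T_{\text{total}} = (M + N - 1)\,(t_f + t_b)$$
The ideal time, if pipelining were perfect, would be M · (t_f + t_b). So the bubble overhead is:
$$\text{bubble fraction} = \frac{T_{\text{total}} - M\,(t_f + t_b)}{T_{\text{total}}} = \frac{N - 1}{M + N - 1}$$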
Two practical readings:
- More micro-batches → smaller bubble. At M = 4N the bubble approaches 1/5 = 20% as N grows; at M = 16N it approaches 1/17 ≈ 6%.
- More stages → bigger bubble. For fixed M ≫ N, doubling N roughly doubles the bubble.
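Both readings can be checked numerically from the bubble fraction (N - 1)/(M + N - 1). The helper below is a small sketch; the function name and the sample values of N and M are picked only for illustration.

```python
from fractions import Fraction


def bubble_fraction(N: int, M: int) -> Fraction:
    """Fraction of wall-clock time lost to the bubble: (N - 1) / (M + N - 1)."""
    return Fraction(N - 1, M + N - 1)


if __name__ == "__main__":
    # Reading 1: more micro-batches per stage shrinks the bubble.
    for N in (4, 8, 64):
        for k in (1, 4, 16):
            M = k * N
            print(f"N={N:3d} M={M:5d} bubble={float(bubble_fraction(N, M)):.3f}")

    # Reading 2: for fixed M much larger than N, doubling N roughly doubles it.
    M = 4096
    for N in (16, 32, 64):
        print(f"M={M} N={N:3d} bubble={float(bubble_fraction(N, M)):.4f}")
```

Exact fractions are used so the values can be compared without floating-point noise; convert with `float(...)` only for display.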
So the tension is: PP gives you more parallelism (more stages → more GPUs share the model state), but more stages also waste more cycles in the bubble. The fix is to crank M — but cranking M requires storing the activations of M in-flight micro-batches per stage. Activation memory grows with M. Memory budget caps how big M can be. This is where 1F1B helps.
Animated · the pipeline filling, GPipe vs 1F1B
Below is the same schedule but rendered cell-by-cell as the wall clock ticks. Press play to watch the diagonal "fill" at the start and the matching diagonal at the end — that's the bubble. Toggle between GPipe (all forwards, then all backwards) and 1F1B (forwards and backwards interleaved as soon as the last stage finishes). The bubble fraction is identical between the two — what changes is the peak number of in-flight micro-batches per stage, which is what determines activation memory.
2D · bubble fraction (N - 1)/(M + N - 1)
The bubble fraction is a clean two-parameter function. The widget below plots both: (left) a 2D heatmap of bubble fraction over (N, M); (right) a stacked bar of "useful work" vs "wasted bubble" at your current settings. Cranking M shrinks the wasted band; cranking N grows it.
3D · pipeline cube — stages × microbatches × direction
The schedule has three natural axes: pipeline stage (which GPU), micro-batch (which input), and direction (forward vs backward). Visualised as cells in a 3D cube, you can see the diagonal sweep of forward across stages, then backward sweeping back. Cells are colored by activity: compute-forward (blue), compute-backward (orange), idle/bubble (dark), communication (gold dot).
1F1B — same bubble, less memory
The PipeDream / Megatron 1F1B schedule observes: once the pipeline is full, alternate one forward and one backward per stage. As soon as a micro-batch finishes its forward on stage N-1, its backward can start immediately, so stage i runs only N - i forwards before its first backward (N = 4, drawn with t_f = t_b; ".." is an idle slot):
rank 0: F0 F1 F2 F3 .. .. .. B0 F4 B1 F5 B2 …
rank 1: .. F0 F1 F2 .. .. B0 F3 B1 F4 B2 F5 …
rank 2: .. .. F0 F1 .. B0 F2 B1 F3 B2 F4 B3 …
rank 3: .. .. .. F0 B0 F1 B1 F2 B2 F3 B3 F4 …
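The per-rank operation order of a non-interleaved 1F1B schedule can be sketched as three phases: warmup forwards, a steady state of one-forward-one-backward pairs, and cooldown backwards. The code below assumes the common formulation in which rank r runs N - 1 - r warmup forwards; it produces only the order of operations on each rank, not their timing, and the function names are made up for the example.

```python
def one_f_one_b_order(rank: int, N: int, M: int) -> list[str]:
    """Operation order on one rank under non-interleaved 1F1B.

    Assumption: rank r does min(N - 1 - r, M) warmup forwards, then
    alternates F/B, then drains the remaining backwards.
    """
    warmup = min(N - 1 - rank, M)
    ops = [f"F{m}" for m in range(warmup)]
    next_fwd, next_bwd = warmup, 0
    for _ in range(M - warmup):          # steady state: one forward, one backward
        ops.append(f"F{next_fwd}")
        next_fwd += 1
        ops.append(f"B{next_bwd}")
        next_bwd += 1
    while next_bwd < M:                  # cooldown: drain outstanding backwards
        ops.append(f"B{next_bwd}")
        next_bwd += 1
    return ops


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of micro-batches with a forward done but no backward yet."""
    live = peak = 0
    for op in ops:
        live += 1 if op.startswith("F") else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    N, M = 4, 8
    for rank in range(N):
        ops = one_f_one_b_order(rank, N, M)
        print(f"rank {rank}: {' '.join(ops)}  (peak in flight: {peak_in_flight(ops)})")
```

Counting forwards without a matching backward gives the number of micro-batches whose activations a rank must keep alive at once, which is the quantity that drives activation memory.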
The bubble fraction is unchanged from GPipe — still (N-1)/(M+N-1). So why bother?
Memory. Under GPipe, each stage holds activations for up to M in-flight forward micro-batches before the first backward starts. Under 1F1B, a stage holds at most N in-flight micro-batches (stage i holds about N - i, so stage 0 is the worst case), because as soon as a micro-batch finishes forward on the last stage, its backward starts and frees its activations on the way back. The peak memory per stage drops from M micro-batches' worth of activations to ~N. So you can crank M much higher under 1F1B without OOM, and that does shrink the bubble.
The lesson: 1F1B doesn't reduce the bubble directly; it lets you reduce the bubble indirectly by lifting the memory ceiling on M.
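A back-of-the-envelope sketch of that memory ceiling follows. The activation budget and the per-micro-batch activation size are hypothetical numbers chosen for illustration, and the model is deliberately crude: GPipe is charged M micro-batches of activations per stage, 1F1B is charged N.

```python
from fractions import Fraction

GiB = 1024 ** 3


def bubble_fraction(N: int, M: int) -> Fraction:
    return Fraction(N - 1, M + N - 1)


def max_m_gpipe(budget_bytes: int, act_bytes_per_microbatch: int) -> int:
    """GPipe keeps up to M micro-batches of activations, so the budget caps M."""
    return budget_bytes // act_bytes_per_microbatch


def fits_1f1b(N: int, budget_bytes: int, act_bytes_per_microbatch: int) -> bool:
    """1F1B keeps at most N micro-batches of activations, independent of M."""
    return N * act_bytes_per_microbatch <= budget_bytes


if __name__ == "__main__":
    N = 8
    budget = 40 * GiB   # hypothetical activation budget per stage
    per_mb = 2 * GiB    # hypothetical activation footprint of one micro-batch

    m_gpipe = max_m_gpipe(budget, per_mb)
    print(f"GPipe: M <= {m_gpipe}, bubble >= {float(bubble_fraction(N, m_gpipe)):.3f}")

    if fits_1f1b(N, budget, per_mb):
        m_1f1b = 128    # M is no longer limited by activation memory
        print(f"1F1B:  M = {m_1f1b}, bubble = {float(bubble_fraction(N, m_1f1b)):.3f}")
```

The schedule itself does not change the bubble formula; the gain comes entirely from being able to choose a larger M within the same budget.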
Interleaved 1F1B — actually reducing the bubble
Megatron-LM's interleaved schedule (Narayanan et al. 2021) goes further. Instead of each rank owning a contiguous chunk of layers, each rank owns multiple non-contiguous chunks. With V "virtual stages" per physical rank, the pipeline has V · N virtual stages on N ranks. The bubble fraction becomes:
$$\text{bubble fraction} = \frac{N - 1}{V \cdot M + N - 1}$$
For V=2 and M = 8N, the bubble drops from ~11% to ~6%. The cost: V times as many activation Send/Recvs at stage boundaries (each micro-batch now passes through every rank V times). Each message is still (B, T, d), so communication volume grows by a factor of V while the compute between messages shrinks by the same factor, leaving less work to overlap each transfer with. So there's a knee: past V=4 or so, returns diminish.
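The sketch below evaluates the interleaved bubble fraction, using the V·M form that also appears in the cost summary, and shows one way virtual stages can map to ranks. The round-robin mapping (virtual stage k lives on rank k mod N) is an assumption about the layout, and the function names are hypothetical.

```python
from fractions import Fraction


def interleaved_bubble_fraction(N: int, M: int, V: int = 1) -> Fraction:
    """(N - 1) / (V*M + N - 1); V = 1 recovers the non-interleaved formula."""
    return Fraction(N - 1, V * M + N - 1)


def chunks_on_rank(rank: int, N: int, V: int) -> list[int]:
    """Virtual-stage indices owned by one rank.

    Assumption: round-robin layout, so rank r owns virtual stages
    r, r + N, r + 2N, ... (V non-contiguous chunks in total).
    """
    return [v * N + rank for v in range(V)]


if __name__ == "__main__":
    N, M = 8, 64  # M = 8N, as in the example above
    for V in (1, 2, 4):
        frac = interleaved_bubble_fraction(N, M, V)
        print(f"V={V}: bubble={float(frac):.3f}, sends per micro-batch per rank: {V}")
    print("rank 0 owns virtual stages", chunks_on_rank(0, N=4, V=2))
    print("rank 3 owns virtual stages", chunks_on_rank(3, N=4, V=2))
```

Each doubling of V buys a smaller absolute reduction in the bubble while the number of messages keeps doubling, which is one way to see the diminishing returns.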
Zero-bubble pipeline (newer)
Subsequent work (Zero Bubble Pipeline Parallelism, Qi et al. 2023) further decomposes backward into its two halves: gradient w.r.t. input (needed by the previous stage) and gradient w.r.t. weights (only needed locally). The weight gradient can be deferred. Carefully scheduled, this asymptotically eliminates the bubble. The cost is complexity and a brief memory spike. Modern frameworks (e.g. NVIDIA NeMo, Megatron-LM) ship this as an option.
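The split of backward into two halves can be illustrated on a single linear layer y = x · W. The sketch below uses plain Python lists so it runs without dependencies; it only shows that the two gradients are independent computations, with the weight gradient queued for later, and is not the scheduling algorithm from the paper. All function names are made up for the example.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def backward_input(grad_y, W):
    """dL/dx = dL/dy @ W^T. The previous stage is waiting on this."""
    return matmul(grad_y, transpose(W))


def backward_weight(x, grad_y):
    """dL/dW = x^T @ dL/dy. Only needed locally, so it can run later."""
    return matmul(transpose(x), grad_y)


if __name__ == "__main__":
    x = [[1.0, 2.0]]                       # (batch=1, d_in=2)
    W = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]] # (d_in=2, d_out=3)
    grad_y = [[1.0, 1.0, 1.0]]             # gradient arriving from the next stage

    deferred = []                          # weight-gradient work to do later
    grad_x = backward_input(grad_y, W)     # send upstream immediately
    deferred.append((x, grad_y))           # keep inputs needed for dL/dW
    print("grad_x:", grad_x)

    # Later, in a slot that would otherwise be idle:
    for saved_x, saved_grad_y in deferred:
        print("grad_W:", backward_weight(saved_x, saved_grad_y))
```

Deferring the weight gradient means `x` and `grad_y` must stay alive until it runs, which is where the extra memory mentioned above comes from.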
Layer assignment — uneven is sometimes right
A surprising practical detail: layers aren't always evenly distributed across stages. The first stage has the embedding (often big: about 131M params for a 32k vocab × 4096 d model), and the last stage has the LM head (same size). Stages 0 and N-1 are heavier than the middle stages if we just split layers evenly. Sometimes we move the LM head onto a separate stage, or pack fewer layers into stage 0 to compensate. Megatron and friends expose this as a per-stage layer count.
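A rough way to compare layouts is to count parameters per stage. The sketch below assumes about 12 · d² parameters per transformer layer (4 · d² for attention plus 8 · d² for an MLP with 4× hidden width), which is a common approximation rather than a figure from this lesson, and it assumes untied embedding and LM-head matrices. Parameter count is only a proxy for per-stage time; the LM head and loss in particular cost more compute than their size suggests.

```python
def embedding_params(vocab: int, d_model: int) -> int:
    return vocab * d_model


def approx_layer_params(d_model: int) -> int:
    # Assumption: ~12 * d_model^2 parameters per transformer layer.
    return 12 * d_model ** 2


def stage_params(layer_counts: list[int], d_model: int, vocab: int) -> list[int]:
    """Approximate parameters per stage for a given per-stage layer count.

    Stage 0 also carries the embedding and the last stage the LM head
    (assumed untied, each vocab * d_model).
    """
    per_layer = approx_layer_params(d_model)
    totals = [count * per_layer for count in layer_counts]
    totals[0] += embedding_params(vocab, d_model)
    totals[-1] += embedding_params(vocab, d_model)
    return totals


def imbalance(totals: list[int]) -> float:
    """Heaviest stage relative to the mean; 1.0 means perfectly balanced."""
    return max(totals) / (sum(totals) / len(totals))


if __name__ == "__main__":
    d_model, vocab = 4096, 32000
    print("embedding params:", embedding_params(vocab, d_model))
    for counts in ([8, 8, 8, 8], [7, 9, 9, 7]):
        totals = stage_params(counts, d_model, vocab)
        print(counts, [f"{t / 1e9:.2f}B" for t in totals], f"imbalance={imbalance(totals):.3f}")
```

Whether shifting a layer helps depends on how large the embedding is relative to one layer, so it is worth running the numbers for the actual vocabulary and model width before changing the per-stage layer counts.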
Cost summary
| Quantity | Per-step cost |
|---|---|
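| Activation Send between stages | M · 2 · B · T · d bytes per stage boundary, per step (one (B, T, d) activation per micro-batch at 2 bytes per element, e.g. bf16); the backward pass sends the same volume of activation gradients in the opposite direction |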
| Activation Recv at stages 1..N-1 | Same as above |
| Gradient AllReduce | Zero — PP doesn't AllReduce gradients (each stage's params are unique to that stage) |
| Peak activation memory per stage | GPipe: ~M · (B, T, d). 1F1B: ~N · (B, T, d) |
| Bubble fraction | (N - 1) / (M + N - 1); interleaved: (N - 1) / (V·M + N - 1) |
When PP is the right hammer
- Across-node sharding of the model. Activation Send between stages is small, infrequent, and tolerates InfiniBand (IB) latency and bandwidth.
- Composed with TP intra-node. TP=8 within a node, PP across nodes. This is the canonical "3D parallelism" middle layer (lesson 10).
- Not at inference time. PP's bubble penalises single-request latency hard — for a single inference request, you pay a bubble equal to N - 1 stage times. Inference uses TP (lesson 11) or replication. PP for batch-inference of trillions of tokens? Sometimes; for low-latency chat, never.
Interactive · the bubble in motion
Drag N (stages) and M (micro-batches). The widget animates the GPipe and 1F1B schedules side by side, with the bubble fraction in the KPIs. Try M = N — the bubble's nearly half the run. Now M = 4N: better. M = 16N: bubble is small but activation memory under GPipe is huge — that's the slider in real life.