
Pipeline parallel — and the bubble

Shard the model across stages, send activations between them. The catch is the bubble: until the pipeline fills, some ranks have no work to do. The whole lesson is how to shrink that idle time.

The setup

TP splits a layer across ranks. PP splits the layers themselves across ranks. Stage 0 holds layers 1..L/N, stage 1 holds L/N+1..2L/N, etc. A forward pass walks down the stages; backward walks back up.

Per-stage comm is cheap: at the boundary between stages, you send the output activation of the last layer in stage i to stage i+1 as its input. That is one Send/Recv per stage boundary per micro-batch on the forward pass, and one more of the same size for the activation gradient on the way back. The activation shape is just (B, T, d), regardless of how many layers are in the stage. This is point-to-point, not a collective, so it doesn't scale poorly with N. PP across nodes is fine.
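To put a number on one boundary message, here is a minimal sketch. It assumes 2-byte (bf16) activations and made-up sizes; `boundary_message_bytes` is an illustrative helper, not part of any framework.

```python
def boundary_message_bytes(micro_batch: int, seq_len: int, d_model: int,
                           bytes_per_elem: int = 2) -> int:
    """Bytes in one (B, T, d) activation tensor sent across a stage boundary.

    The number of layers inside the stage does not appear: only the output
    of the stage's last layer crosses the boundary.
    """
    return micro_batch * seq_len * d_model * bytes_per_elem


# Hypothetical sizes: B=1, T=4096, d=4096, bf16 activations (assumption).
msg = boundary_message_bytes(micro_batch=1, seq_len=4096, d_model=4096)
print(f"{msg / 2**20:.0f} MiB per Send/Recv")  # prints "32 MiB per Send/Recv"
```

Doubling the layers per stage leaves this figure unchanged, which is the point of the paragraph above.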

The naive picture — GPipe

The obvious thing: split a macro-batch into M micro-batches. Stage 0 processes micro-batch 0, sends it to stage 1, processes micro-batch 1, … Then stage 1 takes them in turn, then stage 2, …, until stage N-1 has finished forward on all M micro-batches. Then backward proceeds in reverse.

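 rank 0: F0 F1 F2 F3 .  .  .  .  .  .  B3 B2 B1 B0
 rank 1: .  F0 F1 F2 F3 .  .  .  .  B3 B2 B1 B0 .
 rank 2: .  .  F0 F1 F2 F3 .  .  B3 B2 B1 B0 .  .
 rank 3: .  .  .  F0 F1 F2 F3 B3 B2 B1 B0 .  .  .

The .'s are idle time, drawn with t_f = t_b = one slot. Ranks 1, 2, 3 sit doing nothing while the pipeline fills, ranks 0, 1, 2 wait in the middle for the first backward to come back up, and the later ranks finish early at the tail while the last backward propagates to stage 0. Every rank is idle for 6 of the 14 slots. This is the pipeline bubble.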
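A grid like the one above can be generated for any N and M. The sketch below assumes forward and backward each take exactly one slot and that backward runs the micro-batches in reverse order; `gpipe_grid` is a hypothetical name chosen for this example.

```python
def gpipe_grid(n_stages: int, n_micro: int) -> list[list[str]]:
    """Slot-by-slot GPipe schedule, assuming t_f == t_b == one slot."""
    fill = n_micro + n_stages - 1          # slots until the last forward ends
    grid = [["." for _ in range(2 * fill)] for _ in range(n_stages)]
    for s in range(n_stages):
        # Backward reaches stage s after walking up from the last stage.
        b_start = fill + (n_stages - 1 - s)
        for m in range(n_micro):
            grid[s][s + m] = f"F{m}"                       # forward, shifted by s hops
            grid[s][b_start + (n_micro - 1 - m)] = f"B{m}"  # backward, reverse order
    return grid


for rank, row in enumerate(gpipe_grid(n_stages=4, n_micro=4)):
    print(f"rank {rank}: " + " ".join(f"{cell:<2}" for cell in row))
```

Counting the "." cells in any row and dividing by the row length gives the idle share of the step for that rank.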

The bubble formula

For N stages and M micro-batches per macro-batch, let each forward step take t_f and each backward t_b. Total wall-clock:

T_total  =  (M + N - 1) · (t_f + t_b)

The ideal time, if pipelining were perfect, would be M · (t_f + t_b). So the bubble overhead is:

bubble_fraction  =  (N - 1) / (M + N - 1)
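The two formulas are easy to check against each other. This is a small sketch with illustrative function names; the timings passed in are arbitrary units, not measurements.

```python
def total_time(n_stages: int, n_micro: int, t_f: float, t_b: float) -> float:
    """Wall-clock for one step: (M + N - 1) * (t_f + t_b)."""
    return (n_micro + n_stages - 1) * (t_f + t_b)


def ideal_time(n_micro: int, t_f: float, t_b: float) -> float:
    """Wall-clock if no rank ever waited: M * (t_f + t_b)."""
    return n_micro * (t_f + t_b)


def bubble_fraction(n_stages: int, n_micro: int) -> float:
    """Share of the step spent idle: (N - 1) / (M + N - 1)."""
    return (n_stages - 1) / (n_micro + n_stages - 1)


# The closed form should equal 1 - ideal/total for any t_f, t_b.
n, m, t_f, t_b = 8, 32, 1.0, 2.0
from_times = 1.0 - ideal_time(m, t_f, t_b) / total_time(n, m, t_f, t_b)
assert abs(from_times - bubble_fraction(n, m)) < 1e-12

for m in (8, 32, 64, 128):
    print(f"N=8, M={m:>3}: bubble = {bubble_fraction(8, m):.1%}")
```

Note that t_f and t_b cancel out of the fraction, which is why the bubble depends only on N and M.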

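Two practical readings:

Fix N and raise M, and the bubble shrinks: at M = N it is (N - 1)/(2N - 1), nearly half the run; at M = 4N it is about 20%; at M = 8N about 11%.

Fix M and raise N, and the bubble grows: every extra stage adds one more fill-and-drain slot to every step.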

So the tension is: PP gives you more parallelism (more stages → more GPUs share the model state), but more stages also waste more cycles in the bubble. The fix is to crank M — but cranking M requires storing the activations of M in-flight micro-batches per stage. Activation memory grows with M. Memory budget caps how big M can be. This is where 1F1B helps.

Animated · the pipeline filling, GPipe vs 1F1B

Below is the same schedule, rendered cell-by-cell as the wall clock ticks. Press play to watch the diagonal "fill" at the start and the matching diagonal at the end: that's the bubble. Toggle between GPipe (all forwards, then all backwards) and 1F1B (each stage starts alternating forwards and backwards as soon as its first backward arrives). The bubble fraction is identical between the two. What changes is the peak number of in-flight micro-batches per stage, which is what determines activation memory.

[Interactive widget: Pipeline schedule, scrub through wall-clock. Rows = pipeline stages; columns = time. Blue = forward, orange = backward, dark grey = idle (bubble). The corner triangles are where the pipe is filling or draining. Readouts: bubble fraction, peak in-flight, activation memory, total slots.]

2D · bubble fraction (N - 1)/(M + N - 1)

The bubble fraction is a clean two-parameter function. The widget below shows it two ways: (left) a 2D heatmap of bubble fraction over (N, M); (right) a stacked bar of "useful work" vs "wasted bubble" at your current settings. Cranking M shrinks the wasted band; cranking N grows it.

[Interactive widget: Bubble fraction explorer. Heat: red = high bubble, green = low. Black dot = your current (N, M). Right pane = compute breakdown. Readouts: bubble (formula), M/N ratio, recommended M, utilisation.]

3D · pipeline cube — stages × microbatches × direction

The schedule has three natural axes: pipeline stage (which GPU), micro-batch (which input), and direction (forward vs backward). Visualised as cells in a 3D cube, you can see the diagonal sweep of forward across stages, then backward sweeping back. Cells are colored by activity: compute-forward (blue), compute-backward (orange), idle/bubble (dark), communication (gold dot).

[Interactive widget: Pipeline schedule as a 3D cube. X = micro-batch index, Y = stage, Z = direction (forward in front, backward behind). Empty cells mean idle. Rotate to see the bubble corners. Readouts: total cells, forward cells, backward cells, activity; a 2D projection is also available.]

1F1B — same bubble, less memory

The PipeDream / Megatron 1F1B schedule observes: once the pipeline is full, alternate one forward and one backward per stage. As soon as a micro-batch finishes its forward all the way to stage N-1, its backward can start immediately:

 rank 0: F0 F1 F2 F3 .  .  .  B0 F4 B1 F5 B2 F6 …
 rank 1: .  F0 F1 F2 .  .  B0 F3 B1 F4 B2 F5 B3 …
 rank 2: .  .  F0 F1 .  B0 F2 B1 F3 B2 F4 B3 F5 …
 rank 3: .  .  .  F0 B0 F1 B1 F2 B2 F3 B3 F4 B4 …

The bubble fraction is unchanged from GPipe — still (N-1)/(M+N-1). So why bother?

Memory. Under GPipe, each stage holds activations for up to M in-flight forward micro-batches before the first backward starts. Under 1F1B, a stage never has more than N micro-batches in flight, because as soon as one finishes forward on the last stage, its backward starts and frees its activations on the way back. Stage i holds at most N - i of them, so stage 0 is the worst case. The peak per stage drops from M micro-batches' worth of activations to ~N. So you can crank M much higher under 1F1B without OOM, and that does shrink the bubble.

The lesson: 1F1B doesn't reduce the bubble directly; it lets you reduce the bubble indirectly by lifting the memory ceiling on M.
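The memory claim can be checked by writing out each rank's op order and counting live micro-batches. The sketch below uses the usual warmup / steady-state / cooldown structure for non-interleaved 1F1B, with N - 1 - rank warmup forwards; the function names are hypothetical and no timing or communication is modelled.

```python
def one_f_one_b_order(rank: int, n_stages: int, n_micro: int) -> list[str]:
    """Op order on one rank under non-interleaved 1F1B."""
    warmup = min(n_stages - 1 - rank, n_micro)
    ops = [f"F{m}" for m in range(warmup)]      # warmup: forwards only
    next_f, next_b = warmup, 0
    while next_f < n_micro:                     # steady state: 1 forward, 1 backward
        ops.append(f"F{next_f}")
        next_f += 1
        ops.append(f"B{next_b}")
        next_b += 1
    while next_b < n_micro:                     # cooldown: drain the backwards
        ops.append(f"B{next_b}")
        next_b += 1
    return ops


def gpipe_order(n_micro: int) -> list[str]:
    """Op order on any rank under GPipe: all forwards, then all backwards."""
    return [f"F{m}" for m in range(n_micro)] + [f"B{m}" for m in reversed(range(n_micro))]


def peak_in_flight(ops: list[str]) -> int:
    """Max number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op.startswith("F") else -1
        peak = max(peak, live)
    return peak


N, M = 4, 8
for rank in range(N):
    ops = one_f_one_b_order(rank, N, M)
    print(f"rank {rank}: peak {peak_in_flight(ops)}  {' '.join(ops)}")
print("GPipe peak on every rank:", peak_in_flight(gpipe_order(M)))
```

With N = 4 and M = 8 the 1F1B peaks come out as 4, 3, 2, 1 across ranks 0 to 3, against 8 on every rank for GPipe, and raising M changes only the GPipe figure.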

Interleaved 1F1B — actually reducing the bubble

Megatron-LM's interleaved schedule (Narayanan et al. 2021) goes further. Instead of each rank owning a contiguous chunk of layers, each rank owns multiple non-contiguous chunks. With V "virtual stages" per physical rank, the pipeline has V · N virtual stages on N ranks. The bubble fraction becomes:

bubble_fraction_interleaved  =  (N - 1) / (V · M + N - 1)

For V=2 and M = 8N, the bubble drops from ~11% to ~6%. The cost: V times as many activation Send/Recvs at stage boundaries, because each micro-batch now passes through every rank V times. Each message is still a full (B, T, d) tensor, but it now sits between chunks holding only 1/V of the layers, so there is less compute per message to hide the transfer behind. That gives a knee: past V=4 or so, returns diminish.
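The effect of V is easy to tabulate from the formula. This is a sketch with an illustrative function name and arbitrary N and M; it models only the bubble and the message count, not bandwidth or overlap.

```python
def interleaved_bubble_fraction(n_stages: int, n_micro: int, v: int = 1) -> float:
    """(N - 1) / (V * M + N - 1); V = 1 recovers the plain 1F1B/GPipe bubble."""
    return (n_stages - 1) / (v * n_micro + n_stages - 1)


n_stages, n_micro = 16, 128   # hypothetical setting with M = 8N
for v in (1, 2, 4, 8):
    frac = interleaved_bubble_fraction(n_stages, n_micro, v)
    # Each micro-batch crosses every rank V times, so boundary messages scale with V.
    print(f"V={v}: bubble = {frac:.1%}, boundary messages = {v}x")
```

Each doubling of V roughly halves the bubble while doubling the number of boundary messages, which is the trade-off described above.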

Zero-bubble pipeline (newer)

Subsequent work (Zero Bubble Pipeline Parallelism, Qi et al. 2023) further decomposes backward into its two halves: gradient w.r.t. input (needed by the previous stage) and gradient w.r.t. weights (only needed locally). The weight gradient can be deferred. Carefully scheduled, this asymptotically eliminates the bubble. The cost is complexity and a brief memory spike. Modern frameworks (e.g. NVIDIA NeMo, Megatron-LM) ship this as an option.
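The split is easiest to see on a single linear layer Y = X · W. The sketch below is a toy in pure Python with tiny hand-written matrices; the helper names and the `deferred` list are illustrative only and do not reflect how any framework schedules the work.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def backward_input(grad_out, weight):
    """dL/dX = dL/dY @ W^T. The previous stage is waiting for this."""
    return matmul(grad_out, transpose(weight))


def backward_weight(inputs, grad_out):
    """dL/dW = X^T @ dL/dY. Only the local optimizer step needs this."""
    return matmul(transpose(inputs), grad_out)


X = [[1.0, 2.0]]          # (batch=1, in=2)
W = [[3.0], [4.0]]        # (in=2, out=1)
dY = [[1.0]]              # gradient arriving from the next stage

deferred = []
grad_in = backward_input(dY, W)   # computed right away and sent upstream
deferred.append((X, dY))          # stash what the weight gradient needs

# ... later, in a slot that would otherwise be idle ...
grad_w = [backward_weight(x, g) for x, g in deferred]
print(grad_in, grad_w)
```

Neither function consumes the other's result, which is what lets a scheduler move the weight-gradient half into bubble slots. The stash kept in `deferred` is where the extra memory comes from.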

Layer assignment — uneven is sometimes right

A surprising practical detail: layers aren't always evenly distributed across stages. The first stage has the embedding (often big: roughly 130M params for a 32k vocab × 4096 d model), and the last stage has the LM head (same size). Stages 0 and N-1 are heavier than the middle stages if we just split layers evenly. Sometimes we move the LM head onto a separate stage, or pack fewer layers into stage 0 to compensate. Megatron and friends expose this as a per-stage layer count.
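One simple way to pick per-stage layer counts is to treat the embedding and LM head as extra load on the end stages and hand out layers greedily. This is a sketch under stated assumptions: costs are in units of "one transformer layer", the extra-load values are made up, and `assign_layers` is a hypothetical helper rather than any framework's balancing logic.

```python
def assign_layers(n_layers: int, n_stages: int,
                  first_extra: float = 0.0, last_extra: float = 0.0) -> list[int]:
    """Per-stage layer counts that roughly equalise load.

    first_extra / last_extra: assumed cost of the embedding and LM head,
    expressed in transformer-layer equivalents.
    """
    load = [0.0] * n_stages
    load[0] += first_extra
    load[-1] += last_extra
    counts = [0] * n_stages
    for _ in range(n_layers):
        # Give the next layer to the currently lightest stage (ties: lowest index).
        s = min(range(n_stages), key=lambda i: load[i])
        counts[s] += 1
        load[s] += 1.0
    return counts


print(32_000 * 4096)                    # embedding params for a 32,000 x 4096 table
print(assign_layers(32, 4))             # even split
print(assign_layers(32, 4, first_extra=1.0, last_extra=2.0))
```

Only the counts are decided here; stage k still takes the next contiguous run of layers. In practice the extra-load values would come from profiling each stage rather than from a guess.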

Cost summary

Quantity | Per-step cost
Activation Send between stages | M · 2 · B · T · d bytes per stage boundary, per step (one message per micro-batch, 2 bytes per element assumed), and the same again for activation gradients on the way back
Activation Recv at stages 1..N-1 | Same as above
Gradient AllReduce | Zero. PP doesn't AllReduce gradients (each stage's params are unique to that stage)
Peak activation memory per stage | GPipe: ~M · (B, T, d). 1F1B: ~N · (B, T, d)
Bubble fraction | (N - 1) / (M + N - 1), or (N - 1) / (V · M + N - 1) for interleaved

When PP is the right hammer

Interactive · the bubble in motion

Drag N (stages) and M (micro-batches). The widget animates the GPipe and 1F1B schedules side by side, with the bubble fraction in the KPIs. Try M = N: the bubble is nearly half the run. Now M = 4N: better. M = 16N: the bubble is small but activation memory under GPipe is huge. That is the trade-off you face in real life.

[Interactive widget: GPipe vs 1F1B, bubble visualisation. Each row is a stage. Coloured cells: forward (blue) and backward (orange) micro-batches. Grey cells are idle (bubble). Toggle the schedule to compare. Readouts: bubble fraction, total slots, active slots, memory class.]
Takeaway
PP's bubble is (N - 1) / (M + N - 1). To shrink it you crank M. Cranking M needs activation memory — which is exactly what 1F1B's reduced peak-memory unlocks. Interleaved 1F1B and zero-bubble schedules attack the formula directly. PP is cheap on comm — point-to-point only — which is why it lives across nodes in 3D parallelism.