Async pipelining & weight sync — engineering for overlap

The throughput equation (22a) said disaggregated topology leaves the trainer idle while rollout runs. Lesson 22b cut rollout dominance from ~75% to ~25%. The next wall is wired into the topology itself: every step still serializes on a weight broadcast. This lesson turns that serial dependency into a pipeline — with an off-policy bill the rest of the lesson explains how to pay.

Where this lesson sits

Third and final lesson in the throughput sub-track. Lesson 22a wrote down τ_step. Lesson 22b lowered τ_R. This lesson lowers τ_step itself, by overlapping the terms instead of summing them — at the cost of a more careful estimator. The mechanics of weight sync (the term that limits how fast you can pipeline) are the second half.

The dependency graph of one step

Three rows: rollout engine, trainer, weight-sync channel. Read left to right.

The fully-async row is the prize: if τ_R, τ_T, and τ_S can be made to run on different hardware and never block each other, wall-clock is just the slowest of the three rather than their sum. The catch is that the trainer at step t is updating weights that the rollout at step t already finished using. The rollouts the trainer is consuming were sampled under a different, older policy. Welcome to off-policy RL.

Versioned weights and the importance-sampling ratio

The cleanest way to think about async is: every rollout carries the version number of the policy it was sampled from. Call this v(traj). The trainer at version v_t reads a batch where v(traj) ∈ {v_t − Δ, …, v_t} for some staleness budget Δ. Each token's policy gradient is then weighted by an importance ratio:

ρ_i = π_{v_t}(y_i | x, y_<i) / π_v(traj)(y_i | x, y_<i)

This is the same ρ that PPO uses for re-using rollouts across multiple gradient epochs (lesson 10). The PPO clip [1 − ε, 1 + ε] exists precisely to keep the gradient stable when ρ drifts away from 1. The async setting reuses that machinery: as long as ρ stays inside the clip range for most tokens, the gradient is approximately unbiased.

The freshness wall

ρ drifts as the number of trainer steps between rollout and use grows. As a rule of thumb across published 7–70B reasoning-RL runs, the clip fraction crosses ~20–25% after Δ ≈ 4–8 trainer steps of staleness; past that, gradient variance climbs and learning destabilizes. The exact number depends on learning rate, model size, PPO ε, and how heterogeneous your prompt distribution is — instrument the metric, don't trust the constant. The "freshness wall" is the largest Δ your IS-corrected estimator can tolerate. It is the central tuning knob of async RL: too low and async buys you nothing; too high and the algorithm breaks.

Two metrics to instrument:

Clip fraction. The percentage of tokens where ρ falls outside [1 − ε, 1 + ε]. Healthy: < 5%. Async-stable: < 15%. Past 25% the gradient is dominated by clipped contributions and the estimator is broken.
Mean log-ρ. The expected log-importance-ratio. Drift is mean log-ρ moving away from zero. Healthy: |mean log-ρ| < 0.05. A persistent drift means the rollout is systematically behind the trainer; either lower Δ or check that the sync is actually delivering new weights.

The replay buffer and "freshness" knobs

Once rollouts carry versions, you can store them in a buffer and consume them out of order. Three pieces of bookkeeping:

Buffer size. Number of (trajectory, version) pairs to keep. Too small and you starve the trainer; too large and you accumulate stale samples whose ρ has drifted past the clip.
Eviction policy. When the buffer fills, evict by version (drop the stalest) or by "expected ρ exceeds threshold" (drop ones that would clip).
Sampling policy. Recent-biased uniform: weight samples by w(v) = γ^{v_t − v} with γ ≈ 0.95 — newer rollouts get sampled more.

One non-obvious failure mode: if rollouts of different prompts arrive at different cadences (some prompts are slow because of long-tail decode — lesson 22b), the buffer accumulates a non-uniform distribution over prompts at any given trainer step. Mitigation: sample buffer entries with prompt-balanced weighting, not uniform.

Weight sync mechanics — the four operations

The lesson 22a equation hides τ_S behind a single symbol. In a real cluster it's a sequence of four operations, each with its own bandwidth and latency profile:

Step 1 — All-gather sharded parameters

If the trainer uses FSDP / ZeRO-3, its weights are sharded across the data-parallel group, and the shards must be assembled before they can be sent anywhere. Skipped when the trainer is pure DDP or TP-only (full params already resident). The bound is NVLink bandwidth inside the node (~600 GB/s on H100). For a 70B model at BF16, the total weight tensor is ~140 GB; an all-gather across 8 GPUs on NVLink runs in roughly 0.5–1 second.

Step 2 — Dtype cast to the inference precision

The rollout engine runs at FP8 or INT8 (lesson 22b). The trainer holds BF16 (plus FP32 master). A fused cast kernel converts BF16 → FP8 in place; for 70B that's a ~70 GB write at HBM bandwidth (~3 TB/s), so ~0.02 seconds per H100. With per-tensor scaling for FP8, you also need to compute the per-tensor max for the scale factor — which is the actual bottleneck (~0.1–0.3 s for a full pass).

Step 3 — Broadcast to the rollout engine

This is the long pole. Send 70 GB (FP8'd weights) across the network from one node to another. At 200 Gb/s (25 GB/s) that's ~3 seconds; at 400 Gb/s it's ~1.5 s. Two optimizations matter:

Broadcast tree. Instead of one trainer → many rollouts as a star, use a tree topology where intermediate rollouts forward to leaf rollouts. Reduces the trainer-side egress bottleneck from O(N) to O(log N) where N is the number of rollout replicas.
Bucket fusion. Group small tensors (LayerNorm γ, attention biases) into one large bucket before broadcasting. NCCL has a fixed per-call latency; fewer larger calls dramatically outperform many small ones for a fixed total byte count.

Step 4 — Reshard for the rollout TP layout

Trainer often uses TP=8 (one node, fully parallel); rollout often uses TP=2 or TP=4 (replicate for higher throughput per request). The freshly-broadcast weights need to be re-split along the new TP axis. This is mostly a memory-bound shuffle on the rollout side, plus a small NCCL all-to-all if the layout reshuffles channels. Cost: 0.5–4 seconds depending on model size and reshard distance.

Five optimizations, in compose order

Optimization	τ_S relative	What it does
Baseline (naive synchronous broadcast)	1.00×	—
+ Broadcast tree (log-fan-out)	0.55×	trainer egress no longer bottlenecks; sender-side bandwidth halved
+ Bucket fusion (group small tensors)	0.45×	amortize NCCL launch latency over fewer larger calls
+ FP8 cast in-flight (cast during broadcast)	0.30×	halve bytes on the wire (~0.5× alone) but the cast kernel + per-tensor scale reduction adds ~0.1–0.3 s back, so net ≈ 0.65× of the prior row
+ Zero-copy via shared memory / RDMA	0.22×	skip the host-side memcpy; GPU-to-GPU direct
+ Overlap with next rollout prefill	0.00×*	τ_S now hidden behind τ_R — it no longer counts toward τ_step

The last row is where the win compounds. τ_S is no longer additive — it's overlapped. The trainer pushes the new weights while the rollout engine is mid-decode on the previous version; by the time the rollout finishes that batch, the new weights are already in place. The cost is exactly one batch of staleness: rollouts sampled during the broadcast are tagged with the old version. The IS correction handles it.

Interactive: dependency-graph throughput

Compose the topology and the staleness budget. The widget renders a Gantt-style view of one or two steps under your chosen topology and reports the achieved throughput plus the clip fraction predicted from the staleness.

Async topology throughput & staleness

Pick a topology and a staleness budget Δ (trainer steps between rollout and use). The bar chart shows the dependency graph of one period of steps; the KPIs report throughput, idle %, and predicted clip fraction.

Topology: τ_R: 20 τ_T: 8 τ_S: 3 Δ staleness: 2

τ_step

—

tok/sec relative

—

trainer idle

—

clip fraction

—

What to try. (1) colocated, τ_R=20, τ_T=8, τ_S=3: τ_step = 31, trainer idle = 0% (because trainer doesn't run during rollout). (2) disagg same: τ_step = max(20, 8)+3 = 23 — 26% faster; trainer idles 12 seconds per step. (3) async with Δ=2: τ_step = max(R, T, S) = 20, throughput +50%; clip fraction climbs to ~10%, still safe. (4) async with Δ=8: τ_step still 20, but clip fraction crosses 25% — the freshness wall. The gradient becomes unreliable; throughput numbers are a lie.

Pipeline parallelism × RL — the bubble interaction

One last interaction worth surfacing. Pipeline parallelism (PP) shards a model's layers across GPUs: GPU 0 has layers 0–7, GPU 1 has layers 8–15, etc. A forward pass walks a sequence through the stages; backward walks it back. The well-known PP cost is the "bubble" — at the start and end of each step, some stages are idle waiting for the pipeline to fill or drain.

Three RL-specific complications:

Trainer bubble compounds with rollout long-tail. If rollouts have heavy-tailed lengths (lesson 22b), the trainer's micro-batches will have heterogeneous lengths. Interleaved 1F1B scheduling (Megatron's default) assumes near-uniform micro-batch cost; a single long micro-batch stretches one stage's work and starves the others. Mitigation: sort or balance the per-micro-batch token count before launching the trainer pass. Caveat: sorting reorders the gradient's mini-batch composition, which subtly biases the group-baseline computation in GRPO — keep the per-group structure intact when you sort.
Async PP can produce version-inconsistent weights across stages. Naive weight-sync push updates each pipeline stage independently. If a rollout starts decoding while only some stages have new weights, the forward will use a mix of versions — a kind of weight-tearing that breaks IS correction silently. Mitigation: atomic per-step swap — all stages' weights become live simultaneously, or the new weights are held back until all stages are ready.
Activation recompute × PP bubbles. PP stages with activation recompute (lesson 22) re-run the forward inside backward to save memory. Under async, the recompute uses the trainer's current weights at backward time — which may have advanced since the forward ran. The fix is to recompute against the saved forward's weight version, not the latest, but most PP scheduler implementations don't expose this knob cleanly.

This is the kind of bug that shows up as "loss is fine for a week, then suddenly diverges" — exactly the symptom that the diagnostic playbook in lesson 28 names "weight version skew." If you run PP + async RL, instrument per-stage weight version as a first-class signal.

When async is worth the engineering — and when it isn't

The throughput equation makes the trade explicit. Async claims back the idle band in the disaggregated row of the dependency graph. The size of that band is:

idle_band = |τ_R − τ_T| + τ_S

If τ_R ≈ τ_T and τ_S is well-overlapped already, async gives you almost nothing. If τ_R ≫ τ_T (the common case before applying lesson 22b's patches), async halves wall-clock. The pre-lesson-22b regime is where most published "async RL gives 2× speedup" numbers come from. After lesson 22b patches, the gap closes; async's win is more like 1.2–1.5×.

Regime	Worth async?	Why
τ_R ≫ τ_T (rollout-dominated, no patches)	Yes (2×+)	large idle band; sync is the only thing serializing
τ_R ≈ τ_T (post-22b patches)	Marginal (1.2–1.5×)	idle band is small; engineering complexity may not pay back
Small cluster, single node	No	colocated topology is simpler and within 10% of disaggregated
PRM / per-step rewards (lesson 17)	Hard	per-step reward propagation across stale rollouts is fragile
RM-only RL with drifting RM (lesson 15)	Hard	RM drift adds a second source of off-policyness; ρ stops being interpretable

Takeaway

Async RL collapses τ_step from sum-of-terms to max-of-terms — but only if the importance ratio ρ between rollout and trainer stays inside the PPO clip. The freshness wall (~4–8 trainer steps of staleness) is the hard limit; clip fraction and mean log-ρ are the metrics that tell you where you sit relative to it. The four sync operations (all-gather, cast, broadcast, reshard) are independently optimizable; the prize is overlapping τ_S with the next rollout's decode so it stops being additive at all. After all three throughput lessons, the dominant term has typically shifted from rollout (lesson 22a baseline) to nothing in particular — which is where you should be.