
Async pipelining & weight sync — engineering for overlap

The throughput equation (22a) said disaggregated topology leaves the trainer idle while rollout runs. Lesson 22b cut rollout dominance from ~75% to ~25%. The next wall is wired into the topology itself: every step still serializes on a weight broadcast. This lesson turns that serial dependency into a pipeline — with an off-policy bill the rest of the lesson explains how to pay.

Where this lesson sits
Third and final lesson in the throughput sub-track. Lesson 22a wrote down τ_step. Lesson 22b lowered τ_R. This lesson lowers τ_step itself, by overlapping the terms instead of summing them — at the cost of a more careful estimator. The mechanics of weight sync (the term that limits how fast you can pipeline) are the second half.

The dependency graph of one step

Three rows: rollout engine, trainer, weight-sync channel. Read left to right.

Figure: per-step dependency graph for three topologies.

colocated: everything serializes; the trainer waits for rollout, and rollout waits for the trainer's new weights. τ = τ_R + τ_T + τ_S
disaggregated: rollout and trainer run on disjoint GPUs and work concurrently, but sync still serializes (the trainer row shows an idle gap after τ_T). τ = max(τ_R, τ_T) + τ_S
fully async (this lesson): rollout runs continuously, the trainer consumes rollouts at its own cadence, and sync overlaps with both (rollout row: τ_R[t], τ_R[t+1], τ_R[t+2]; trainer row: τ_T[t−1], τ_T[t], τ_T[t+1]). τ = max(τ_R, τ_T, τ_S)

Fully async: every term overlaps; throughput is limited by the single largest term.

The fully-async row is the prize: if τ_R, τ_T, and τ_S can be made to run on different hardware and never block each other, wall-clock is just the slowest of the three rather than their sum. The catch is that the trainer at step t is updating weights that the rollout at step t already finished using. The rollouts the trainer is consuming were sampled under a different, older policy. Welcome to off-policy RL.
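The three step-time formulas from the dependency graph can be compared directly. The sketch below is only an illustration: the function name `step_time` is hypothetical, and the timings (τ_R = 20, τ_T = 8, τ_S = 3) are the example values used later in the widget walkthrough.

```python
def step_time(topology: str, tau_r: float, tau_t: float, tau_s: float) -> float:
    """Wall-clock seconds per step under each topology in the dependency graph."""
    if topology == "colocated":
        return tau_r + tau_t + tau_s          # everything serializes
    if topology == "disaggregated":
        return max(tau_r, tau_t) + tau_s      # rollout/trainer overlap, sync does not
    if topology == "async":
        return max(tau_r, tau_t, tau_s)       # every term overlaps
    raise ValueError(f"unknown topology: {topology}")


times = {
    name: step_time(name, tau_r=20.0, tau_t=8.0, tau_s=3.0)
    for name in ("colocated", "disaggregated", "async")
}
for name, tau in times.items():
    speedup = times["colocated"] / tau
    print(f"{name:>14}: tau_step = {tau:4.1f} s, {speedup:.2f}x vs colocated")
```

The async formula assumes the three terms truly never block each other; any residual serialization moves the result back toward the disaggregated row.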

Versioned weights and the importance-sampling ratio

The cleanest way to think about async is: every rollout carries the version number of the policy it was sampled from. Call this v(traj). The trainer at version v_t reads a batch where v(traj) ∈ {v_t − Δ, …, v_t} for some staleness budget Δ. Each token's policy gradient is then weighted by an importance ratio:

ρ_i = π_{v_t}(y_i | x, y_{<i}) / π_{v(traj)}(y_i | x, y_{<i})

This is the same ρ that PPO uses for re-using rollouts across multiple gradient epochs (lesson 10). The PPO clip [1 − ε, 1 + ε] exists precisely to keep the gradient stable when ρ drifts away from 1. The async setting reuses that machinery: as long as ρ stays inside the clip range for most tokens, the gradient is approximately unbiased.
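As a rough sketch of how the ratio and the clip interact, the NumPy snippet below computes ρ from per-token log-probabilities and forms the standard PPO clipped surrogate. The function name, the toy probabilities, and ε = 0.2 are assumptions for illustration; the lesson does not fix a value of ε.

```python
import numpy as np


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Per-token PPO clipped objective (to be maximized), averaged over tokens.

    logp_new: log-probs of the sampled tokens under the trainer policy (version v_t)
    logp_old: log-probs of the same tokens under the rollout policy (version v(traj))
    """
    rho = np.exp(logp_new - logp_old)                          # importance ratio
    unclipped = rho * advantages
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum removes any incentive to push rho outside the clip range.
    return rho, np.minimum(unclipped, clipped).mean()


# Toy batch of three tokens (made-up probabilities).
logp_old = np.log(np.array([0.50, 0.20, 0.10]))
logp_new = np.log(np.array([0.55, 0.20, 0.05]))
adv = np.array([1.0, -0.5, 1.0])

rho, objective = clipped_surrogate(logp_new, logp_old, adv)
print("rho:", np.round(rho, 3))
print("objective:", round(float(objective), 4))
```

The only async-specific change is where `logp_old` comes from: it has to be the log-probability recorded under the version that actually sampled the trajectory, not under the trainer's previous step.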

The freshness wall
ρ drifts as the number of trainer steps between rollout and use grows. As a rule of thumb across published 7–70B reasoning-RL runs, the clip fraction crosses ~20–25% after Δ ≈ 4–8 trainer steps of staleness; past that, gradient variance climbs and learning destabilizes. The exact number depends on learning rate, model size, PPO ε, and how heterogeneous your prompt distribution is — instrument the metric, don't trust the constant. The "freshness wall" is the largest Δ your IS-corrected estimator can tolerate. It is the central tuning knob of async RL: too low and async buys you nothing; too high and the algorithm breaks.

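Two metrics to instrument: the clip fraction (the share of tokens whose ρ falls outside [1 − ε, 1 + ε]) and the mean log-ρ over the batch.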
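Clip fraction and mean log-ρ, the two metrics the takeaway names, could be computed roughly as follows. This is a sketch under assumptions: the function names are hypothetical, ε = 0.2 is a placeholder, and the per-token `staleness` array (trainer version minus rollout version) is assumed to be available from the rollout tags.

```python
import numpy as np


def staleness_metrics(logp_new, logp_old, eps=0.2):
    """Batch-level clip fraction and mean log importance ratio."""
    log_rho = np.asarray(logp_new) - np.asarray(logp_old)
    rho = np.exp(log_rho)
    outside = (rho < 1.0 - eps) | (rho > 1.0 + eps)
    return {
        "clip_fraction": float(outside.mean()),
        "mean_log_rho": float(log_rho.mean()),
    }


def clip_fraction_by_staleness(logp_new, logp_old, staleness, eps=0.2):
    """Clip fraction broken down by staleness, to see where the wall sits."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    outside = (rho < 1.0 - eps) | (rho > 1.0 + eps)
    staleness = np.asarray(staleness)
    return {int(d): float(outside[staleness == d].mean()) for d in np.unique(staleness)}


# Toy batch: two fresh tokens (staleness 0) and two stale ones (staleness 3).
logp_old = np.log(np.array([0.50, 0.20, 0.10, 0.40]))
logp_new = np.log(np.array([0.55, 0.20, 0.05, 0.60]))
metrics = staleness_metrics(logp_new, logp_old)
per_delta = clip_fraction_by_staleness(logp_new, logp_old, [0, 0, 3, 3])
print(metrics)
print(per_delta)
```

Logging the per-staleness breakdown each step is one way to locate the freshness wall empirically instead of relying on a constant.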

The replay buffer and "freshness" knobs

Once rollouts carry versions, you can store them in a buffer and consume them out of order. Three pieces of bookkeeping:

One non-obvious failure mode: if rollouts of different prompts arrive at different cadences (some prompts are slow because of long-tail decode — lesson 22b), the buffer accumulates a non-uniform distribution over prompts at any given trainer step. Mitigation: sample buffer entries with prompt-balanced weighting, not uniform.
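A minimal sketch of a version-tagged buffer with a staleness cutoff and prompt-balanced sampling is shown below. All names (`Rollout`, `VersionedBuffer`, `max_staleness`) are hypothetical, and sampling is done with replacement for simplicity; a production buffer would likely also track consumption counts and memory limits.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: str
    version: int   # v(traj): policy version that sampled this rollout
    tokens: list


class VersionedBuffer:
    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness   # staleness budget Δ
        self.items = []

    def add(self, rollout: Rollout) -> None:
        self.items.append(rollout)

    def evict_stale(self, trainer_version: int) -> None:
        """Drop rollouts whose version lags the trainer by more than Δ."""
        self.items = [
            r for r in self.items
            if trainer_version - r.version <= self.max_staleness
        ]

    def sample(self, batch_size: int, trainer_version: int, rng: random.Random):
        """Pick a prompt uniformly, then one of its rollouts uniformly."""
        self.evict_stale(trainer_version)
        by_prompt = defaultdict(list)
        for r in self.items:
            by_prompt[r.prompt_id].append(r)
        prompts = sorted(by_prompt)
        if not prompts:
            return []
        return [rng.choice(by_prompt[rng.choice(prompts)]) for _ in range(batch_size)]


# A fast prompt floods the buffer; a slow prompt contributes one rollout.
buf = VersionedBuffer(max_staleness=2)
for i in range(9):
    buf.add(Rollout("fast", version=5, tokens=[i]))
buf.add(Rollout("slow", version=4, tokens=[0]))
buf.add(Rollout("old", version=0, tokens=[0]))     # beyond the budget, gets evicted

batch = buf.sample(2000, trainer_version=5, rng=random.Random(0))
slow_share = sum(r.prompt_id == "slow" for r in batch) / len(batch)
print(f"slow-prompt share: {slow_share:.2f}")
```

Uniform sampling over entries would give the slow prompt about one draw in ten here; sampling over prompts first brings it to roughly one in two, at the cost of reusing its single rollout more often.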

Weight sync mechanics — the four operations

The lesson 22a equation hides τ_S behind a single symbol. In a real cluster it's a sequence of four operations, each with its own bandwidth and latency profile:

Figure: the four weight-sync operations, in order.

1 · all-gather: DP × TP → full param; NCCL intra-node; ~0.5–2 s; bound: intra-node NVLink
2 · dtype cast: BF16 → FP8 / INT8; compute kernel; ~0.1–0.3 s; bound: HBM write
3 · broadcast: trainer → rollout; NCCL inter-node; ~1–5 s; bound: inter-node network
4 · reshard: trainer TP → rollout TP; NCCL + permute; ~0.5–4 s; bound: NCCL + reorder

Total τ_S ≈ 2–10 s on a 70B model with an 8×H100 node and 200 Gb/s inter-node links, and it grows with model size.

Step 1 — All-gather sharded parameters

If the trainer uses FSDP / ZeRO-3, its weights are sharded across the data-parallel group, and the shards must be assembled before they can be sent anywhere. This step is skipped when the trainer is pure DDP or TP-only, because the full parameters are already resident. The bound is NVLink bandwidth inside the node (~900 GB/s per GPU on H100 SXM; ~600 GB/s is the A100 figure). For a 70B model at BF16, the total weight tensor is ~140 GB; an all-gather across 8 GPUs on NVLink runs in roughly 0.5–1 second.

Step 2 — Dtype cast to the inference precision

The rollout engine runs at FP8 or INT8 (lesson 22b). The trainer holds BF16 (plus FP32 master). A fused cast kernel converts BF16 → FP8 in place; for 70B that's a ~70 GB write at HBM bandwidth (~3 TB/s), so ~0.02 seconds per H100. With per-tensor scaling for FP8, you also need to compute the per-tensor max for the scale factor — which is the actual bottleneck (~0.1–0.3 s for a full pass).
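The per-tensor scale computation can be sketched in NumPy. Several things here are assumptions rather than part of the lesson: the target format is taken to be FP8 E4M3 (largest finite value 448), the function names are hypothetical, and because NumPy has no FP8 dtype the final rounding to 8 bits is omitted; only the max-reduction and rescale are shown.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed target format)


def per_tensor_scale(weights: np.ndarray) -> float:
    """Scale factor mapping the tensor's largest magnitude onto the FP8 range."""
    amax = float(np.max(np.abs(weights)))   # full read pass: the reduction the text calls the bottleneck
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def scale_for_fp8(weights: np.ndarray):
    """Rescale into the FP8 representable range; the real kernel would then round to 8 bits."""
    scale = per_tensor_scale(weights)
    scaled = np.clip(weights / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return scaled.astype(np.float32), scale


w = np.array([[0.5, -2.0], [1.0, 0.25]], dtype=np.float32)
scaled, scale = scale_for_fp8(w)
print("scale:", scale)
print("max |scaled|:", float(np.max(np.abs(scaled))))
```

The rollout engine would keep `scale` alongside the quantized tensor and multiply it back in at matmul time; how that is done depends on the inference stack.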

Step 3 — Broadcast to the rollout engine

This is the long pole. Send 70 GB (FP8'd weights) across the network from one node to another. At 200 Gb/s (25 GB/s) that's ~3 seconds; at 400 Gb/s it's ~1.5 s. Two optimizations matter:

Step 4 — Reshard for the rollout TP layout

Trainer often uses TP=8 (one node, fully parallel); rollout often uses TP=2 or TP=4 (replicate for higher throughput per request). The freshly-broadcast weights need to be re-split along the new TP axis. This is mostly a memory-bound shuffle on the rollout side, plus a small NCCL all-to-all if the layout reshuffles channels. Cost: 0.5–4 seconds depending on model size and reshard distance.
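The re-split itself is simple to illustrate for one column-parallel weight matrix. The sketch below assumes shards were cut along axis 1 and uses a hypothetical helper name; real layers (fused QKV projections, row-parallel layers) need layout-specific permutations that are not shown.

```python
import numpy as np


def reshard_column_parallel(shards, new_tp: int):
    """Re-split column-parallel weight shards for a different TP degree."""
    full = np.concatenate(shards, axis=1)     # reassemble the full weight
    if full.shape[1] % new_tp != 0:
        raise ValueError("column count must be divisible by the new TP degree")
    return np.split(full, new_tp, axis=1)     # cut along the same axis


# Toy 4x16 weight, sharded TP=8 on the trainer, needed as TP=2 on rollout.
full_weight = np.arange(64, dtype=np.float32).reshape(4, 16)
trainer_shards = np.split(full_weight, 8, axis=1)
rollout_shards = reshard_column_parallel(trainer_shards, new_tp=2)
print([s.shape for s in rollout_shards])
```

Materializing the full tensor keeps the example short. When the rollout TP degree divides the trainer's, each rollout shard is just a concatenation of adjacent trainer shards, so the full gather can be avoided.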

Five optimizations, in compose order

Optimization | τ_S relative | What it does
Baseline (naive synchronous broadcast) | 1.00× |
+ Broadcast tree (log-fan-out) | 0.55× | trainer egress no longer bottlenecks; sender-side bandwidth halved
+ Bucket fusion (group small tensors) | 0.45× | amortize NCCL launch latency over fewer larger calls
+ FP8 cast in-flight (cast during broadcast) | 0.30× | halve bytes on the wire (~0.5× alone), but the cast kernel + per-tensor scale reduction adds ~0.1–0.3 s back, so net ≈ 0.65× of the prior row
+ Zero-copy via shared memory / RDMA | 0.22× | skip the host-side memcpy; GPU-to-GPU direct
+ Overlap with next rollout prefill | 0.00×* | τ_S now hidden behind τ_R; it no longer counts toward τ_step

The last row is where the win compounds. τ_S is no longer additive — it's overlapped. The trainer pushes the new weights while the rollout engine is mid-decode on the previous version; by the time the rollout finishes that batch, the new weights are already in place. The cost is exactly one batch of staleness: rollouts sampled during the broadcast are tagged with the old version. The IS correction handles it.

Interactive: dependency-graph throughput

Compose the topology and the staleness budget. The widget renders a Gantt-style view of one or two steps under your chosen topology and reports the achieved throughput plus the clip fraction predicted from the staleness.

Async topology throughput & staleness
Pick a topology and a staleness budget Δ (trainer steps between rollout and use). The bar chart shows the dependency graph of one period of steps; the KPIs report throughput, idle %, and predicted clip fraction.
What to try. (1) colocated, τ_R=20, τ_T=8, τ_S=3: τ_step = 31, trainer idle = 0% (the trainer does not run during rollout; the same GPUs do both jobs). (2) disaggregated, same values: τ_step = max(20, 8) + 3 = 23, a 26% shorter step (about 35% more throughput); the trainer idles 12 seconds per step. (3) async with Δ=2: τ_step = max(R, T, S) = 20, throughput +55% over colocated (+15% over disaggregated); clip fraction climbs to ~10%, still safe. (4) async with Δ=8: τ_step is still 20, but clip fraction crosses 25%, which is the freshness wall. The gradient becomes unreliable; throughput numbers are a lie.

Pipeline parallelism × RL — the bubble interaction

One last interaction worth surfacing. Pipeline parallelism (PP) shards a model's layers across GPUs: GPU 0 has layers 0–7, GPU 1 has layers 8–15, etc. A forward pass walks a sequence through the stages; backward walks it back. The well-known PP cost is the "bubble" — at the start and end of each step, some stages are idle waiting for the pipeline to fill or drain.

Three RL-specific complications:

This is the kind of bug that shows up as "loss is fine for a week, then suddenly diverges" — exactly the symptom that the diagnostic playbook in lesson 28 names "weight version skew." If you run PP + async RL, instrument per-stage weight version as a first-class signal.
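A per-stage version check can be very small. The sketch below assumes each pipeline stage can report the weight version it currently holds as an integer; the function names and the dictionary shape are hypothetical.

```python
def version_skew(stage_versions: dict) -> int:
    """Max minus min weight version across pipeline stages (0 means consistent)."""
    if not stage_versions:
        raise ValueError("no stages reported a version")
    versions = list(stage_versions.values())
    return max(versions) - min(versions)


def lagging_stages(stage_versions: dict, expected_version: int) -> list:
    """Stage indices whose weights do not match the version the step expects."""
    return sorted(s for s, v in stage_versions.items() if v != expected_version)


# Stage 2 missed the latest weight push.
reported = {0: 12, 1: 12, 2: 11, 3: 12}
skew = version_skew(reported)
behind = lagging_stages(reported, expected_version=12)
print(f"skew={skew}, lagging stages={behind}")
```

Emitting `skew` as a per-step metric, and refusing to start a forward pass when it is nonzero, turns a slow divergence into an immediate, attributable failure.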

When async is worth the engineering — and when it isn't

The throughput equation makes the trade explicit. Async claims back the idle band in the disaggregated row of the dependency graph. The size of that band is:

idle_band = |τ_R − τ_T|   +   τ_S

If τ_R ≈ τ_T and τ_S is well-overlapped already, async gives you almost nothing. If τ_R ≫ τ_T (the common case before applying lesson 22b's patches), async halves wall-clock. The pre-lesson-22b regime is where most published "async RL gives 2× speedup" numbers come from. After lesson 22b patches, the gap closes; async's win is more like 1.2–1.5×.
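The idle-band formula is easy to evaluate for a candidate configuration before committing to the engineering. In the sketch below the function name is hypothetical; the first set of timings reuses the widget example (τ_R = 20, τ_T = 8, τ_S = 3), and the second is a made-up balanced case.

```python
def idle_band(tau_r: float, tau_t: float, tau_s: float) -> float:
    """Seconds per step that async can try to reclaim: |tau_R - tau_T| + tau_S."""
    return abs(tau_r - tau_t) + tau_s


rollout_dominated = idle_band(20.0, 8.0, 3.0)   # widget example values
balanced = idle_band(10.0, 9.0, 0.5)            # hypothetical balanced regime
print(f"rollout-dominated idle band: {rollout_dominated:.1f} s")
print(f"balanced idle band:          {balanced:.1f} s")
```

A band of a second or two per step is the regime where the table below rates async as marginal.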

Regime | Worth async? | Why
τ_R ≫ τ_T (rollout-dominated, no patches) | Yes (2×+) | large idle band; sync is the only thing serializing
τ_R ≈ τ_T (post-22b patches) | Marginal (1.2–1.5×) | idle band is small; engineering complexity may not pay back
Small cluster, single node | No | colocated topology is simpler and within 10% of disaggregated
PRM / per-step rewards (lesson 17) | Hard | per-step reward propagation across stale rollouts is fragile
RM-only RL with drifting RM (lesson 15) | Hard | RM drift adds a second source of off-policyness; ρ stops being interpretable

Takeaway
Async RL collapses τ_step from sum-of-terms to max-of-terms — but only if the importance ratio ρ between rollout and trainer stays inside the PPO clip. The freshness wall (~4–8 trainer steps of staleness) is the hard limit; clip fraction and mean log-ρ are the metrics that tell you where you sit relative to it. The four sync operations (all-gather, cast, broadcast, reshard) are independently optimizable; the prize is overlapping τ_S with the next rollout's decode so it stops being additive at all. After all three throughput lessons, the dominant term has typically shifted from rollout (lesson 22a baseline) to nothing in particular — which is where you should be.