Async pipelining & weight sync — engineering for overlap
The throughput equation (22a) said disaggregated topology leaves the trainer idle while rollout runs. Lesson 22b cut rollout dominance from ~75% to ~25%. The next wall is wired into the topology itself: every step still serializes on a weight broadcast. This lesson turns that serial dependency into a pipeline — with an off-policy bill the rest of the lesson explains how to pay.
The dependency graph of one step
Three rows: rollout engine, trainer, weight-sync channel. Read left to right.
The fully-async row is the prize: if τ_R, τ_T, and τ_S can be made to run on different hardware and never block each other, wall-clock is just the slowest of the three rather than their sum. The catch is that the trainer at step t is updating weights that the rollout at step t already finished using. The rollouts the trainer is consuming were sampled under a different, older policy. Welcome to off-policy RL.
Versioned weights and the importance-sampling ratio
The cleanest way to think about async is: every rollout carries the version number of the policy it was sampled from. Call this v(traj). The trainer at version v_t reads a batch where v(traj) ∈ {v_t − Δ, …, v_t} for some staleness budget Δ. Each token's policy gradient is then weighted by an importance ratio:
This is the same ρ that PPO uses for re-using rollouts across multiple gradient epochs (lesson 10). The PPO clip [1 − ε, 1 + ε] exists precisely to keep the gradient stable when ρ drifts away from 1. The async setting reuses that machinery: as long as ρ stays inside the clip range for most tokens, the gradient is approximately unbiased.
Two metrics to instrument:
- Clip fraction. The percentage of tokens where ρ falls outside [1 − ε, 1 + ε]. Healthy: < 5%. Async-stable: < 15%. Past 25% the gradient is dominated by clipped contributions and the estimator is broken.
- Mean log-ρ. The expected log-importance-ratio. Drift is mean log-ρ moving away from zero. Healthy: |mean log-ρ| < 0.05. A persistent drift means the rollout is systematically behind the trainer; either lower Δ or check that the sync is actually delivering new weights.
The replay buffer and "freshness" knobs
Once rollouts carry versions, you can store them in a buffer and consume them out of order. Three pieces of bookkeeping:
- Buffer size. Number of (trajectory, version) pairs to keep. Too small and you starve the trainer; too large and you accumulate stale samples whose ρ has drifted past the clip.
- Eviction policy. When the buffer fills, evict by version (drop the stalest) or by "expected ρ exceeds threshold" (drop ones that would clip).
- Sampling policy. Recent-biased uniform: weight samples by w(v) = γv_t − v with γ ≈ 0.95 — newer rollouts get sampled more.
One non-obvious failure mode: if rollouts of different prompts arrive at different cadences (some prompts are slow because of long-tail decode — lesson 22b), the buffer accumulates a non-uniform distribution over prompts at any given trainer step. Mitigation: sample buffer entries with prompt-balanced weighting, not uniform.
Weight sync mechanics — the four operations
The lesson 22a equation hides τ_S behind a single symbol. In a real cluster it's a sequence of four operations, each with its own bandwidth and latency profile:
Step 1 — All-gather sharded parameters
If the trainer uses FSDP / ZeRO-3, its weights are sharded across the data-parallel group, and the shards must be assembled before they can be sent anywhere. Skipped when the trainer is pure DDP or TP-only (full params already resident). The bound is NVLink bandwidth inside the node (~600 GB/s on H100). For a 70B model at BF16, the total weight tensor is ~140 GB; an all-gather across 8 GPUs on NVLink runs in roughly 0.5–1 second.
Step 2 — Dtype cast to the inference precision
The rollout engine runs at FP8 or INT8 (lesson 22b). The trainer holds BF16 (plus FP32 master). A fused cast kernel converts BF16 → FP8 in place; for 70B that's a ~70 GB write at HBM bandwidth (~3 TB/s), so ~0.02 seconds per H100. With per-tensor scaling for FP8, you also need to compute the per-tensor max for the scale factor — which is the actual bottleneck (~0.1–0.3 s for a full pass).
Step 3 — Broadcast to the rollout engine
This is the long pole. Send 70 GB (FP8'd weights) across the network from one node to another. At 200 Gb/s (25 GB/s) that's ~3 seconds; at 400 Gb/s it's ~1.5 s. Two optimizations matter:
- Broadcast tree. Instead of one trainer → many rollouts as a star, use a tree topology where intermediate rollouts forward to leaf rollouts. Reduces the trainer-side egress bottleneck from O(N) to O(log N) where N is the number of rollout replicas.
- Bucket fusion. Group small tensors (LayerNorm γ, attention biases) into one large bucket before broadcasting. NCCL has a fixed per-call latency; fewer larger calls dramatically outperform many small ones for a fixed total byte count.
Step 4 — Reshard for the rollout TP layout
Trainer often uses TP=8 (one node, fully parallel); rollout often uses TP=2 or TP=4 (replicate for higher throughput per request). The freshly-broadcast weights need to be re-split along the new TP axis. This is mostly a memory-bound shuffle on the rollout side, plus a small NCCL all-to-all if the layout reshuffles channels. Cost: 0.5–4 seconds depending on model size and reshard distance.
Five optimizations, in compose order
| Optimization | τ_S relative | What it does |
|---|---|---|
| Baseline (naive synchronous broadcast) | 1.00× | — |
| + Broadcast tree (log-fan-out) | 0.55× | trainer egress no longer bottlenecks; sender-side bandwidth halved |
| + Bucket fusion (group small tensors) | 0.45× | amortize NCCL launch latency over fewer larger calls |
| + FP8 cast in-flight (cast during broadcast) | 0.30× | halve bytes on the wire (~0.5× alone) but the cast kernel + per-tensor scale reduction adds ~0.1–0.3 s back, so net ≈ 0.65× of the prior row |
| + Zero-copy via shared memory / RDMA | 0.22× | skip the host-side memcpy; GPU-to-GPU direct |
| + Overlap with next rollout prefill | 0.00×* | τ_S now hidden behind τ_R — it no longer counts toward τ_step |
The last row is where the win compounds. τ_S is no longer additive — it's overlapped. The trainer pushes the new weights while the rollout engine is mid-decode on the previous version; by the time the rollout finishes that batch, the new weights are already in place. The cost is exactly one batch of staleness: rollouts sampled during the broadcast are tagged with the old version. The IS correction handles it.
Interactive: dependency-graph throughput
Compose the topology and the staleness budget. The widget renders a Gantt-style view of one or two steps under your chosen topology and reports the achieved throughput plus the clip fraction predicted from the staleness.
Pipeline parallelism × RL — the bubble interaction
One last interaction worth surfacing. Pipeline parallelism (PP) shards a model's layers across GPUs: GPU 0 has layers 0–7, GPU 1 has layers 8–15, etc. A forward pass walks a sequence through the stages; backward walks it back. The well-known PP cost is the "bubble" — at the start and end of each step, some stages are idle waiting for the pipeline to fill or drain.
Three RL-specific complications:
- Trainer bubble compounds with rollout long-tail. If rollouts have heavy-tailed lengths (lesson 22b), the trainer's micro-batches will have heterogeneous lengths. Interleaved 1F1B scheduling (Megatron's default) assumes near-uniform micro-batch cost; a single long micro-batch stretches one stage's work and starves the others. Mitigation: sort or balance the per-micro-batch token count before launching the trainer pass. Caveat: sorting reorders the gradient's mini-batch composition, which subtly biases the group-baseline computation in GRPO — keep the per-group structure intact when you sort.
- Async PP can produce version-inconsistent weights across stages. Naive weight-sync push updates each pipeline stage independently. If a rollout starts decoding while only some stages have new weights, the forward will use a mix of versions — a kind of weight-tearing that breaks IS correction silently. Mitigation: atomic per-step swap — all stages' weights become live simultaneously, or the new weights are held back until all stages are ready.
- Activation recompute × PP bubbles. PP stages with activation recompute (lesson 22) re-run the forward inside backward to save memory. Under async, the recompute uses the trainer's current weights at backward time — which may have advanced since the forward ran. The fix is to recompute against the saved forward's weight version, not the latest, but most PP scheduler implementations don't expose this knob cleanly.
This is the kind of bug that shows up as "loss is fine for a week, then suddenly diverges" — exactly the symptom that the diagnostic playbook in lesson 28 names "weight version skew." If you run PP + async RL, instrument per-stage weight version as a first-class signal.
When async is worth the engineering — and when it isn't
The throughput equation makes the trade explicit. Async claims back the idle band in the disaggregated row of the dependency graph. The size of that band is:
If τ_R ≈ τ_T and τ_S is well-overlapped already, async gives you almost nothing. If τ_R ≫ τ_T (the common case before applying lesson 22b's patches), async halves wall-clock. The pre-lesson-22b regime is where most published "async RL gives 2× speedup" numbers come from. After lesson 22b patches, the gap closes; async's win is more like 1.2–1.5×.
| Regime | Worth async? | Why |
|---|---|---|
| τ_R ≫ τ_T (rollout-dominated, no patches) | Yes (2×+) | large idle band; sync is the only thing serializing |
| τ_R ≈ τ_T (post-22b patches) | Marginal (1.2–1.5×) | idle band is small; engineering complexity may not pay back |
| Small cluster, single node | No | colocated topology is simpler and within 10% of disaggregated |
| PRM / per-step rewards (lesson 17) | Hard | per-step reward propagation across stale rollouts is fragile |
| RM-only RL with drifting RM (lesson 15) | Hard | RM drift adds a second source of off-policyness; ρ stops being interpretable |