
Placement and weight sync, from the layout mismatch up

09a left a term unexplained: τS, the weight-sync time. This lesson derives it. The chain is short and forces everything else: the actor and the learner want physically different shardings of the same N parameters, so a conversion is unavoidable, that conversion moves bytes, and the bytes cost time on a link. Where you put the two roles — colocated, disaggregated, hybrid, relayed — is nothing more than four ways to pay or hide that one cost.

The question this lesson answers
Where do the actor and learner live, and how do fresh weights get from one to the other cheaply? Almost every clever thing a SOTA framework does — 3D-HybridEngine resharding, bucketed updates, direct-memory sync, relay workers — is an answer to that, and which answer is right is set by a number you can compute: τS as a fraction of τstep.

1 · Why a conversion is forced — the layout mismatch

The learner and the actor are the same N weights, but they are sharded across GPUs differently, because training and serving optimize for different things:

These two partitions almost never coincide. A learner GPU holds "1/d of every tensor"; an actor GPU holds "all of some tensors, none of others." So handing weights from learner to actor is not a copy — it is a gather-then-reshard: reassemble each full parameter from its learner shards, then re-cut it along the actor's partition. That reshape is the irreducible reason weight sync exists, and why it is more than a memcpy.

learner shard layout  ≠  actor shard layout  ⟹  gather + reshard required every update
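To make the mismatch concrete, here is a minimal pure-Python sketch on a toy 4×4 weight. It assumes the learner holds contiguous 1/d slices of the flattened tensor (FSDP-style) and the actor uses a column-wise tensor-parallel split; the function names are illustrative and do not come from any framework.

```python
def fsdp_shard(flat, d):
    """Learner-style layout: rank i holds the i-th contiguous 1/d slice."""
    assert len(flat) % d == 0
    step = len(flat) // d
    return [flat[i * step:(i + 1) * step] for i in range(d)]


def all_gather(shards):
    """Reassemble the full flattened parameter from the learner shards."""
    return [x for shard in shards for x in shard]


def tp_column_shard(flat, rows, cols, tp):
    """Actor-style layout (assumed): split the columns of a rows x cols matrix."""
    assert cols % tp == 0
    width = cols // tp
    return [
        [flat[i * cols + j] for i in range(rows)
         for j in range(r * width, (r + 1) * width)]
        for r in range(tp)
    ]


rows, cols = 4, 4
weight = list(range(rows * cols))            # toy parameter, values 0..15

learner_shards = fsdp_shard(weight, d=4)     # 4 learner ranks, one row each
full = all_gather(learner_shards)            # gather
actor_shards = tp_column_shard(full, rows, cols, tp=2)  # reshard

print(learner_shards[0])   # first learner rank
print(actor_shards[0])     # first actor rank
```

In this toy case the first actor shard draws elements from every learner shard, so no learner rank can hand its slice to an actor rank unchanged.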

2 · The byte cost, and the four operations it decomposes into

How big is the transfer? Lesson 02's weight rule: one full copy of the policy is precision × N bytes, and you must deliver it to every independent actor replica:

sync_bytes ≈ precision_bytes × N × actor_copies
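The rule is simple enough to check by hand, but a small Python sketch keeps the units straight. The function name and the choice of decimal gigabytes (1 GB = 1e9 bytes) are assumptions made for illustration.

```python
GB = 1e9  # decimal gigabytes (assumption)


def sync_bytes(num_params, precision_bytes=2, actor_copies=1):
    """Bytes that must be delivered per update: one full copy per actor replica."""
    return precision_bytes * num_params * actor_copies


one_copy_70b_gb = sync_bytes(70 * 10**9) / GB            # bf16 = 2 bytes/param
one_copy_405b_gb = sync_bytes(405 * 10**9) / GB
fanout_70b_gb = sync_bytes(70 * 10**9, actor_copies=4) / GB  # 4 replicas (hypothetical)

print(one_copy_70b_gb, one_copy_405b_gb, fanout_70b_gb)
```

The `actor_copies=4` case is a made-up fanout; it only shows that the payload scales linearly with the number of independent replicas.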

For a 70B bf16 policy one copy is ~140 GB; at 405B it is ~810 GB. That is the headline number, but it hides structure. The actual sync is four operations in series (RL 22c), and each has its own lever — which is exactly why there are several "weight-sync patterns" rather than one:

| # | Operation | Bound by | Order (70B, in-node) | The lever on it |
| --- | --- | --- | --- | --- |
| 1 | All-gather FSDP shards → full params | intra-node BW (NVLink) | ~0.5–2 s | broadcast tree, overlap with compute |
| 2 | Cast dtype (bf16 → fp8) | HBM write | ~0.1–0.3 s | fuse into the gather; fp8 halves the bytes downstream |
| 3 | Broadcast learner → every actor | inter-node BW (IB) | ~1–5 s | bucketing, RDMA/DMA, fewer copies (the long pole) |
| 4 | Reshard into actor TP layout | NCCL + permute | ~0.5–4 s | resharding engine (HybridFlow), avoid round-trips |

Operation 3 is the long pole, and it is where lesson 02's 18× gap bites: 140 GB over NVLink (~900 GB/s) takes ~0.16 s, but over InfiniBand (~50 GB/s) it takes ~2.8 s, which can rival a whole training step. The byte equation alone tells you the architecture: τS grows linearly with N and inversely with the bandwidth of the slowest link on the path, so the bigger the model and the slower the link, the harder the design must work to avoid doing this globally every step.
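The arithmetic in this section can be reproduced in a few lines. This sketch uses only the bandwidths and per-operation ranges quoted above; summing the four ranges into a serial τS band is an added assumption (it ignores any overlap between operations), and the names are illustrative.

```python
# Link bandwidths quoted in the text, in GB/s.
NVLINK_GB_S = 900.0
INFINIBAND_GB_S = 50.0


def transfer_seconds(payload_gb, bandwidth_gb_s):
    """Ideal transfer time: payload divided by link bandwidth."""
    return payload_gb / bandwidth_gb_s


payload_gb = 140.0  # one bf16 copy of a 70B policy
t_nvlink = transfer_seconds(payload_gb, NVLINK_GB_S)
t_ib = transfer_seconds(payload_gb, INFINIBAND_GB_S)
link_gap = t_ib / t_nvlink

# (low, high) seconds per operation, copied from the four-operation table.
OP_RANGES_S = {
    "all_gather": (0.5, 2.0),
    "cast": (0.1, 0.3),
    "broadcast": (1.0, 5.0),
    "reshard": (0.5, 4.0),
}
# Assumption: the four operations run strictly in series, with no overlap.
tau_s_low = sum(low for low, _ in OP_RANGES_S.values())
tau_s_high = sum(high for _, high in OP_RANGES_S.values())

print(f"NVLink {t_nvlink:.2f} s, IB {t_ib:.2f} s, gap {link_gap:.0f}x")
print(f"serial tau_S band: {tau_s_low:.1f} to {tau_s_high:.1f} s")
```

Under the no-overlap assumption the broadcast contributes roughly half of both ends of the band, which is why most of the named levers target it.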

Weight sync buys nothing — it is pure tax
A gradient step buys learning; a rollout buys data; the sync buys neither. It is the cost of keeping two copies of the weights in two layouts. So unlike τR or τT, the goal for τS is not "make it efficient" — it is "make it disappear," either by not having two copies (colocate) or by hiding it behind the next rollout's prefill.

3 · Placement = four ways to pay or hide τS

Now the placement choices are not a menu to memorize — each is a distinct answer to "what do we do about the conversion cost." Derive the idle they create. In disaggregated synchronous mode the step is max(τR, τT) + τS, so the faster pool sits idle for the difference, plus the whole cluster waits out the sync:

idle_band = |τR − τT| + τS
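As a quick numeric check of the two formulas, here is a sketch with made-up timings (30 s rollout, 20 s training, 2.8 s sync); the function names and the numbers are illustrative assumptions, not measurements.

```python
def step_time(tau_r, tau_t, tau_s):
    """Disaggregated synchronous step: slower pool, then the sync barrier."""
    return max(tau_r, tau_t) + tau_s


def idle_band(tau_r, tau_t, tau_s):
    """Idle time: the faster pool waits out the gap, everyone waits out the sync."""
    return abs(tau_r - tau_t) + tau_s


tau_r, tau_t, tau_s = 30.0, 20.0, 2.8   # hypothetical seconds
step = step_time(tau_r, tau_t, tau_s)
idle = idle_band(tau_r, tau_t, tau_s)
print(f"step = {step:.1f} s, idle_band = {idle:.1f} s")
```

With these numbers the imbalance term (10 s) dominates the sync term (2.8 s), so resizing the pools would reclaim more than a faster sync would.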

Every placement is an attack on one of those two terms:

| Placement | Physical shape | What it does to τS / idle | Failure mode |
| --- | --- | --- | --- |
| Colocated | Actor and learner time-share the same GPUs. | τS → ~0: same process, same device; sync is a pointer swap, no conversion over the wire. | No overlap (the \|τR − τT\| term is the whole step); HBM thrash swapping KV ↔ optimizer state. |
| Disaggregated | Separate actor pool and learner pool. | Pays the full τS broadcast, but lets τR and τT overlap and be sized independently. | O(model size) broadcast every step; a pool idles if the split is wrong (09a §4). |
| Hybrid engine | Shared workers, but train and generate phases reshard in place with zero redundancy. | Kills the duplicate actor copy and shrinks operation 4: reshard locally instead of broadcasting a second copy. | Hard engine integration; the code must understand both training and serving layouts at once. |
| Relay / parameter service | Weights flow through relay workers; actors pull a fresh-enough version when ready. | Removes the global sync barrier entirely: τS leaves the critical path, at the cost of versioned staleness. | Versioning, staleness, and failure recovery become first-class (09c). |

The decision is workload-dependent, and the byte equation decides it. A 7B experiment on one node has a tiny τS and should not pay for any extra machinery: colocate and pointer-swap. A 405B policy with long-CoT rollouts has an ~810 GB payload per actor copy and slow inter-node links, so a naive global broadcast every step is the bug; it must reshard in place, bucket, or relay.

4 · The named patterns, each grounded in an operation

With the four operations and the idle equation in hand, the public "weight-movement patterns" stop being a list of brand names and become a map of which operation each one attacks:

| Pattern | Attacks | Mechanism in one line | Choose it when |
| --- | --- | --- | --- |
| 3D-HybridEngine reshard (HybridFlow / verl) | op 4 + the duplicate copy | Reshard the actor between train and generate layouts in place, with low memory redundancy. | Colocated/hybrid is attractive but naive reloads dominate. |
| Ray + vLLM + ZeRO handoff (OpenRLHF) | scheduling around ops 1–4 | Practical disaggregated orchestration over widely-used engines. | Accessibility and flexible RLHF/RLVR workflows matter most. |
| Bucketed update (slime / SGLang) | op 3 latency | Send weights in fused buckets and overlap each bucket's transfer with the next gather. | Full monolithic reload is too slow but actors stay service-based. |
| Direct-memory sync (LlamaRL) | op 3 copies | GPU-to-GPU RDMA writes that skip host staging. | Hundreds of billions of params and tight hardware coupling is acceptable. |
| Relay workers (Laminar) | the global barrier itself | A parameter tier lets each rollout pull a fresh-enough version without a cluster-wide sync. | Long-tail trajectories make global sync the bottleneck. |
| TransferQueue + staleness knob (Relax / AsyncFlow) | moves τS off the critical path | A data bus carries weights and trajectories; one knob slides from on-policy to fully async. | Multimodal/agentic roles need fault-isolated services. |
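To see why the bucketed-update row helps, here is a toy timing model of a two-stage pipeline in which each bucket's transfer overlaps the next bucket's gather. It is a sketch under stated assumptions (equal-sized buckets, constant per-bucket times, unlimited staging buffer), not a description of how slime or SGLang implement it; all names and numbers are hypothetical.

```python
def serial_seconds(n_buckets, gather_s, send_s):
    """Monolithic behaviour: gather a bucket, send it, then start the next."""
    return n_buckets * (gather_s + send_s)


def pipelined_seconds(n_buckets, gather_s, send_s):
    """Gathers run back to back; a send starts once its bucket is gathered
    and the link has finished the previous bucket."""
    gather_done = 0.0
    send_done = 0.0
    for _ in range(n_buckets):
        gather_done += gather_s
        send_done = max(send_done, gather_done) + send_s
    return send_done


n, g, s = 8, 0.20, 0.35           # hypothetical: 8 buckets, seconds per bucket
t_serial = serial_seconds(n, g, s)
t_pipe = pipelined_seconds(n, g, s)
print(f"serial {t_serial:.2f} s, pipelined {t_pipe:.2f} s")
```

In this model the pipelined time approaches the cost of the slower stage alone, so bucketing hides the gather behind the transfer but cannot go below the transfer time itself.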

5 · The freshness contract — staleness is a budget, not a switch

The moment the sync leaves the critical path (hybrid, relay, async), trajectories start being generated under old weights. The controller must therefore stamp every trajectory with the policy version that produced it, so the learner can decide whether to accept it — because the gradient correction depends on it (the importance ratio of RL 06; the full statistical story is 09c). The contract has a few discrete settings:

| Freshness mode | Rule | System effect | Statistical effect |
| --- | --- | --- | --- |
| Strict on-policy | train update t only on version-t samples | Global barriers, low utilization on long-tail. | Cleanest gradient estimator. |
| Near-on-policy | accept samples within k versions | Actor/learner overlap; stale tail bounded. | Small off-policy bias, usually fine if monitored. |
| Fully async | queue-based admission, no global batch | Highest utilization. | Needs staleness-aware loss / IS correction. |
| Replay / search-heavy | prioritize reward or recency from a buffer | Searcher fleet scales past the learner. | Needs an off-policy-compatible objective. |

Design rule: treat staleness like a latency SLO
It is not "async: yes/no" — it is a budget you spend. If reward at fixed GPU-hours is unchanged when you loosen k, spend the budget to reclaim the idle_band. If reward degrades, tighten k before buying more actors. The measurement that matters is final quality at fixed compute, never steps/hour.
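The version stamp and the k-version budget can be sketched as a small admission filter. The `Trajectory` type, the `admit` function, and the `max_lag` parameter are hypothetical names for illustration; real controllers attach more metadata and may reweight samples rather than drop them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    sample_id: int
    policy_version: int   # version of the weights that generated this sample


def admit(traj, learner_version, max_lag):
    """Accept a sample if it is at most max_lag versions behind the learner.
    max_lag=0 is strict on-policy; larger values spend staleness budget."""
    lag = learner_version - traj.policy_version
    return 0 <= lag <= max_lag


def filter_batch(batch, learner_version, max_lag):
    return [t for t in batch if admit(t, learner_version, max_lag)]


batch = [Trajectory(0, 10), Trajectory(1, 9), Trajectory(2, 7)]
strict = filter_batch(batch, learner_version=10, max_lag=0)
near = filter_batch(batch, learner_version=10, max_lag=2)
print([t.sample_id for t in strict], [t.sample_id for t in near])
```

Sweeping `max_lag` while holding GPU-hours fixed is one way to run the experiment the design rule describes.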

Interactive · sync time and freshness budget

Watch τS grow with the model and turn from a rounding error into the architecture. The exact number is not the point — the point is the threshold where simple broadcast stops working and the design must reshard, bucket, or relay.

Weight sync sizing

Assumes bf16 weights, full-policy update. Effective bandwidth is below the raw link spec because it folds in protocol, reshard, and scheduling losses. Actor copies = independent inference replicas needing fresh weights. The verdict tracks τS / τstep.
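The same sizing can be reproduced offline with a short calculator. This is a sketch under loud assumptions: the whole fanout is pushed through one effective link, and the thresholds that pick a pattern (5% and 25% of the step) are invented for illustration and do not come from the lesson or any framework.

```python
def size_sync(num_params, actor_copies, eff_bw_gb_s, tau_step_s, precision_bytes=2):
    """Return one-copy size, full fanout, tau_S / tau_step, and a suggested pattern."""
    one_copy_gb = precision_bytes * num_params / 1e9
    fanout_gb = one_copy_gb * actor_copies
    tau_s = fanout_gb / eff_bw_gb_s        # assumption: single effective link
    ratio = tau_s / tau_step_s
    # Hypothetical thresholds, chosen only to illustrate the verdict.
    if ratio < 0.05:
        pattern = "simple broadcast"
    elif ratio < 0.25:
        pattern = "bucket and overlap"
    else:
        pattern = "reshard in place or relay"
    return {"one_copy_gb": one_copy_gb, "fanout_gb": fanout_gb,
            "ratio": ratio, "pattern": pattern}


small = size_sync(7 * 10**9, actor_copies=1, eff_bw_gb_s=400.0, tau_step_s=30.0)
large = size_sync(405 * 10**9, actor_copies=8, eff_bw_gb_s=40.0, tau_step_s=60.0)
for name, result in (("7B", small), ("405B", large)):
    print(name, result)
```

The two calls use made-up bandwidths and step times; the point is the direction of the verdict as the ratio grows, not the specific numbers.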


What carries forward

Sources used

| Source | System idea used |
| --- | --- |
| HybridFlow / verl | Hybrid control and 3D-HybridEngine resharding between train and generate layouts. |
| verl repository | FSDP/Megatron training + vLLM/SGLang rollout backends, RL algorithms, MoE scale. |
| OpenRLHF | Ray + vLLM + DeepSpeed/ZeRO disaggregated scheduling. |
| slime | SGLang-native rollout, Megatron-native training, bucketed weight updates. |
| LlamaRL | Async architecture and direct-memory (RDMA) weight synchronization. |
| Laminar | Relay-worker parameter service that breaks the global sync barrier. |
| Relax / AsyncFlow | TransferQueue data bus, service decoupling, continuous staleness knob. |