Placement and weight sync, from the layout mismatch up
09a left a term unexplained: τS, the weight-sync time. This lesson derives it. The chain is short and forces everything else: the actor and the learner want physically different shardings of the same N parameters, so a conversion is unavoidable, that conversion moves bytes, and the bytes cost time on a link. Where you put the two roles — colocated, disaggregated, hybrid, relayed — is nothing more than four ways to pay or hide that one cost.
1 · Why a conversion is forced — the layout mismatch
The learner and the actor hold the same N weights, but they shard them across GPUs differently, because training and serving optimize for different things:
- Learner layout. FSDP/ZeRO shards by parameter: every GPU holds a thin slice of every layer's weights, plus that slice's gradient and optimizer state (the 16N bill of lesson 07). Megatron may instead split with tensor/pipeline/expert/context parallel. The layout is chosen to fit the 16N training state and keep MFU high.
- Actor layout. vLLM/SGLang want tensor-parallel inference weights laid out for fast decode, paged KV alongside (RL 20), continuous batching, and often a lower precision or MoE-specialized kernel. The layout is chosen to maximize decode bandwidth, and it carries no optimizer state.
These two partitions almost never coincide. A learner GPU holds "1/d of every tensor"; an actor GPU holds a different cut of the same tensors: a row or column block of each matrix under tensor parallelism, or whole layers and none of the others under pipeline parallelism. So handing weights from learner to actor is not a copy; it is a gather-then-reshard: reassemble each full parameter from its learner shards, then re-cut it along the actor's partition. That reshard is the irreducible reason weight sync exists, and why it is more than a memcpy.
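The gather-then-reshard step can be sketched on a toy matrix. This is a minimal illustration in plain Python lists, assuming the learner stores equal flat row-major slices (FSDP-style) and the actor wants column-parallel tensor-parallel shards; the function names are hypothetical and no real engine is implemented this way.

```python
# Toy gather-then-reshard on one 4x4 weight matrix, using plain Python lists.
# Assumptions (not from any real engine): the learner keeps equal flat
# row-major slices, and the actor wants column-parallel TP shards.

def fsdp_shard(weight, d):
    """Flatten the matrix row-major and cut it into d equal flat slices."""
    flat = [x for row in weight for x in row]
    assert len(flat) % d == 0, "toy example assumes even divisibility"
    size = len(flat) // d
    return [flat[i * size:(i + 1) * size] for i in range(d)]

def all_gather(shards, n_cols):
    """Reassemble the full matrix from the learner's flat slices."""
    flat = [x for shard in shards for x in shard]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]

def tp_column_shard(weight, tp):
    """Re-cut the full matrix into tp column blocks, one per actor GPU."""
    n_cols = len(weight[0])
    assert n_cols % tp == 0, "toy example assumes even divisibility"
    width = n_cols // tp
    return [[row[r * width:(r + 1) * width] for row in weight] for r in range(tp)]

weight = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
learner_shards = fsdp_shard(weight, d=4)      # 4 learner ranks, 4 values each
full = all_gather(learner_shards, n_cols=4)   # gather: rebuild the full tensor
actor_shards = tp_column_shard(full, tp=2)    # reshard: cut along the actor's axis

print(learner_shards[0])  # one flat slice held by learner rank 0
print(actor_shards[0])    # column block held by actor rank 0
```

In this toy, actor rank 0 ends up with values that originated on every learner rank, which is why a plain rank-to-rank copy cannot produce the actor layout.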
2 · The byte cost, and the four operations it decomposes into
How big is the transfer? Lesson 02's weight rule: one full copy of the policy is precision × N bytes, and you must deliver it to every independent actor replica:
sync_bytes ≈ precision × N × actor_copies
For a 70B bf16 policy one copy is ~140 GB; at 405B it is ~810 GB. That is the headline number, but it hides structure. The actual sync is four operations in series (RL 22c), and each has its own lever — which is exactly why there are several "weight-sync patterns" rather than one:
| # | Operation | Bound by | Order of magnitude (70B, in-node) | The lever on it |
|---|---|---|---|---|
| 1 | All-gather FSDP shards → full params | intra-node BW (NVLink) | ~0.5–2 s | broadcast tree, overlap with compute |
| 2 | Cast dtype (bf16 → fp8) | HBM write | ~0.1–0.3 s | fuse into the gather; fp8 halves the bytes downstream |
| 3 | Broadcast learner → every actor | inter-node BW (IB) | ~1–5 s | bucketing, RDMA/DMA, fewer copies — the long pole |
| 4 | Reshard into actor TP layout | NCCL + permute | ~0.5–4 s | resharding engine (HybridFlow), avoid round-trips |
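As a rough check on the table, the per-operation ranges can be summed under the assumption that the four operations run strictly in series with no overlap. The numbers are copied from the table; the helper names are hypothetical.

```python
# Per-operation time ranges in seconds, copied from the table above (70B policy).
OPS = {
    "all_gather": (0.5, 2.0),
    "cast": (0.1, 0.3),
    "broadcast": (1.0, 5.0),
    "reshard": (0.5, 4.0),
}

def serial_sync_range(ops):
    """Lower and upper bound on the sync time if nothing overlaps."""
    low = sum(lo for lo, _ in ops.values())
    high = sum(hi for _, hi in ops.values())
    return low, high

def long_pole(ops):
    """Operation with the largest upper-bound time."""
    return max(ops, key=lambda name: ops[name][1])

low, high = serial_sync_range(OPS)
print(f"serial sync time: {low:.1f} to {high:.1f} s, long pole: {long_pole(OPS)}")
```

Under that serial assumption the sum lands between about 2.1 and 11.3 s, so any lever that overlaps or removes an operation shows up directly in τS.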
Operation 3 is the long pole, and it is where lesson 02's 18× gap bites: 140 GB over NVLink (~900 GB/s) is ~0.16 s, but over InfiniBand (~50 GB/s) it is ~2.8 s, which can rival a whole training step. The byte equation alone tells you the architecture: τS grows linearly with N and inversely with the bandwidth of the slowest link on the path, so the bigger the model and the slower the link, the more the design must work to avoid doing this globally every step.
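The arithmetic in this paragraph can be reproduced in a few lines. This is a back-of-envelope sketch that assumes the link runs at its nominal bandwidth with no protocol overhead; the bandwidth figures are the ones quoted above and the function names are hypothetical.

```python
def sync_bytes(n_params, bytes_per_param, actor_copies=1):
    """Bytes to deliver: precision x N x actor_copies."""
    return n_params * bytes_per_param * actor_copies

def transfer_seconds(n_bytes, link_gb_per_s):
    """Idealized transfer time over one link, ignoring latency and overhead."""
    return n_bytes / (link_gb_per_s * 1e9)

NVLINK_GB_PER_S = 900  # approximate figure quoted in the text
IB_GB_PER_S = 50       # approximate figure quoted in the text

one_copy = sync_bytes(70e9, bytes_per_param=2)          # 70B params in bf16
t_nvlink = transfer_seconds(one_copy, NVLINK_GB_PER_S)
t_ib = transfer_seconds(one_copy, IB_GB_PER_S)

print(f"one copy: {one_copy / 1e9:.0f} GB")
print(f"NVLink: {t_nvlink:.2f} s, InfiniBand: {t_ib:.2f} s, ratio: {t_ib / t_nvlink:.0f}x")
```

Passing `actor_copies` greater than 1 scales the bytes linearly, which is the case when several independent actor replicas each need a full copy.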
3 · Placement = four ways to pay or hide τS
The placement choices are not a menu to memorize: each is a distinct answer to "what do we do about the conversion cost." Start from the idle time they create. In disaggregated synchronous mode the step is max(τR, τT) + τS, so the faster pool sits idle for the difference, and the whole cluster waits out the sync:
idle ≈ |τR − τT| + τS
Every placement is an attack on one of those two terms:
| Placement | Physical shape | What it does to τS / idle | Failure mode |
|---|---|---|---|
| Colocated | Actor and learner time-share the same GPUs. | τS → ~0: same process, same device, so sync is a pointer swap with no conversion over the wire. | No overlap: rollout and training run back to back, so the step is τR + τT; HBM thrash swapping KV ↔ optimizer state. |
| Disaggregated | Separate actor pool and learner pool. | Pays the full τS broadcast, but lets τR and τT overlap and be sized independently. | O(model size) broadcast every step; a pool idles if the split is wrong (09a §4). |
| Hybrid engine | Shared workers, but train and generate phases reshard in place with zero redundancy. | Kills the duplicate actor copy and shrinks operation 4 — reshard locally instead of broadcasting a second copy. | Hard engine integration; the code must understand both training and serving layouts at once. |
| Relay / parameter service | Weights flow through relay workers; actors pull a fresh-enough version when ready. | Removes the global sync barrier entirely — τS leaves the critical path, at the cost of versioned staleness. | Versioning, staleness, and failure recovery become first-class (09c). |
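The step-time trade between the first two rows can be made concrete with a small calculator. It is a sketch under two stated assumptions: the disaggregated synchronous step is max(τR, τT) + τS as given above, and the colocated step is modeled as rollout and training run back to back with a near-zero sync. The timings in the example are hypothetical.

```python
def disaggregated_sync_step(tau_r, tau_t, tau_s):
    """Step time and the idle time seen by the faster pool.

    step = max(tau_r, tau_t) + tau_s
    idle = |tau_r - tau_t| + tau_s
    """
    step = max(tau_r, tau_t) + tau_s
    idle = abs(tau_r - tau_t) + tau_s
    return step, idle

def colocated_step(tau_r, tau_t, tau_s=0.0):
    """Assumed model: phases share the GPUs, so they run serially."""
    return tau_r + tau_t + tau_s

# Hypothetical timings in seconds: rollout 30, train 20, sync 3.
step, idle = disaggregated_sync_step(tau_r=30.0, tau_t=20.0, tau_s=3.0)
print(f"disaggregated: step {step:.0f} s, faster pool idle {idle:.0f} s")
print(f"colocated:     step {colocated_step(30.0, 20.0):.0f} s")
```

With these made-up numbers the disaggregated step is shorter even after paying τS, but the comparison ignores that the disaggregated layout uses two pools of GPUs rather than one.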
The decision is workload-dependent, and the byte equation decides it. A 7B experiment on one node has a tiny τS and should not pay for any extra machinery: colocate and pointer-swap. A 405B policy with long-CoT rollouts must move 810 GB per actor copy over slow inter-node links, so a naive global broadcast every step is the bug; it must reshard in place, bucket, or relay.
4 · The named patterns, each grounded in an operation
With the four operations and the idle equation in hand, the public "weight-movement patterns" stop being a list of brand names and become a map of which operation each one attacks:
| Pattern | Attacks | Mechanism in one line | Choose it when |
|---|---|---|---|
| 3D-HybridEngine reshard (HybridFlow / verl) | op 4 + the duplicate copy | Reshard the actor between train and generate layouts in place, with low memory redundancy. | Colocated/hybrid is attractive but naive reloads dominate. |
| Ray + vLLM + ZeRO handoff (OpenRLHF) | scheduling around ops 1–4 | Practical disaggregated orchestration over widely-used engines. | Accessibility and flexible RLHF/RLVR workflows matter most. |
| Bucketed update (slime / SGLang) | op 3 latency | Send weights in fused buckets and overlap each bucket's transfer with the next gather. | Full monolithic reload is too slow but actors stay service-based. |
| Direct-memory sync (LlamaRL) | op 3 copies | GPU-to-GPU RDMA writes that skip host staging. | The policy has hundreds of billions of params and tight hardware coupling is acceptable. |
| Relay workers (Laminar) | the global barrier itself | A parameter tier lets each rollout pull a fresh-enough version without a cluster-wide sync. | Long-tail trajectories make global sync the bottleneck. |
| TransferQueue + staleness knob (Relax / AsyncFlow) | moves τS off the critical path | A data bus carries weights and trajectories; one knob slides from on-policy to fully async. | Multimodal/agentic roles need fault-isolated services. |
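To see why bucketing helps, here is a simple timing model of the "overlap each bucket's transfer with the next gather" idea from the table. It treats the update as a two-stage pipeline with identical buckets; it is an illustrative model with hypothetical names and numbers, not the slime or SGLang implementation.

```python
def serial_time(n_buckets, gather_s, send_s):
    """Every bucket is gathered, then sent, with no overlap."""
    return n_buckets * (gather_s + send_s)

def pipelined_time(n_buckets, gather_s, send_s):
    """Two-stage pipeline with identical buckets.

    The first gather and the last send cannot be hidden; the n-1 slots in
    between are each bounded by the slower of the two stages.
    """
    return gather_s + (n_buckets - 1) * max(gather_s, send_s) + send_s

# Hypothetical numbers: 8 buckets, 0.25 s to gather each, 0.5 s to send each.
n, g, s = 8, 0.25, 0.5
print(f"serial:    {serial_time(n, g, s):.2f} s")
print(f"pipelined: {pipelined_time(n, g, s):.2f} s")
```

In this model the pipelined time approaches n × max(gather, send) as the bucket count grows, so the gather is hidden only when the send is the slower stage.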
5 · The freshness contract — staleness is a budget, not a switch
The moment the sync leaves the critical path (relay, async), trajectories start being generated under old weights. The controller must therefore stamp every trajectory with the policy version that produced it, so the learner can decide whether to accept it and how to correct the gradient for it (the importance ratio of RL 06; the full statistical story is 09c). The contract has a few discrete settings:
| Freshness mode | Rule | System effect | Statistical effect |
|---|---|---|---|
| Strict on-policy | train update t only on version-t samples | Global barriers, low utilization on long-tail. | Cleanest gradient estimator. |
| Near-on-policy | accept samples within k versions | Actor/learner overlap; stale tail bounded. | Small off-policy bias, usually fine if monitored. |
| Fully async | queue-based admission, no global batch | Highest utilization. | Needs staleness-aware loss / IS correction. |
| Replay / search-heavy | prioritize reward or recency from a buffer | Searcher fleet scales past the learner. | Needs an off-policy-compatible objective. |
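The first two rows of the table reduce to one admission check on a version stamp. The sketch below assumes integer policy versions that increase by one per learner update; the `Trajectory` fields and function names are hypothetical and not taken from any of the systems named here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    tokens: tuple
    reward: float
    policy_version: int  # stamped by the controller when the rollout is generated

def staleness(traj, learner_version):
    """How many learner updates behind the generating policy is."""
    return learner_version - traj.policy_version

def admit(traj, learner_version, max_staleness):
    """max_staleness=0 is strict on-policy; k > 0 is near-on-policy."""
    lag = staleness(traj, learner_version)
    return 0 <= lag <= max_staleness

batch = [
    Trajectory(tokens=(1, 2, 3), reward=1.0, policy_version=10),
    Trajectory(tokens=(4, 5), reward=0.0, policy_version=8),
    Trajectory(tokens=(6,), reward=1.0, policy_version=5),
]
learner_version = 10
strict = [t for t in batch if admit(t, learner_version, max_staleness=0)]
near = [t for t in batch if admit(t, learner_version, max_staleness=2)]
print(len(strict), len(near))
```

Fully async and replay modes would keep the stamp but replace the hard cutoff with a staleness-aware loss or a prioritized buffer, which this sketch does not attempt.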
Interactive · sync time and freshness budget
Watch τS grow with the model and turn from a rounding error into the architecture. The exact number is not the point — the point is the threshold where simple broadcast stops working and the design must reshard, bucket, or relay.
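The same sweep can be approximated offline. The sketch below considers only the inter-node broadcast of one bf16 copy over a ~50 GB/s link, and flags model sizes where that time exceeds a chosen fraction of the step. The 30 s step and the 10% budget are hypothetical placeholders, not recommendations.

```python
def broadcast_seconds(n_params, bytes_per_param=2, link_gb_per_s=50):
    """Idealized time to push one full copy over the inter-node link."""
    return n_params * bytes_per_param / (link_gb_per_s * 1e9)

def exceeds_budget(n_params, step_seconds, max_sync_fraction):
    """True when a naive per-step broadcast eats more than the allowed share."""
    return broadcast_seconds(n_params) > max_sync_fraction * step_seconds

STEP_SECONDS = 30.0      # hypothetical step time
MAX_SYNC_FRACTION = 0.1  # hypothetical budget: sync may cost 10% of a step

for label, n in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
    t = broadcast_seconds(n)
    flag = exceeds_budget(n, STEP_SECONDS, MAX_SYNC_FRACTION)
    print(f"{label:>5}: {t:6.2f} s  over budget: {flag}")
```

Changing the step time, the link speed, or the budget moves the threshold, which is the behavior the section describes.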
What carries forward
- The mismatch forces the cost. Learner shards by parameter (+16N state); actor shards for decode (no optimizer, paged KV). Different partitions ⟹ a gather-and-reshard every update — that is what τS is.
- The bytes set the architecture: sync_bytes ≈ precision × N × actor_copies (140 GB per bf16 copy at 70B, 810 GB at 405B), decomposing into all-gather → cast → broadcast → reshard, with the inter-node broadcast as the long pole (lesson 02's 18×).
- Placement is four ways to pay or hide it: colocate (τS→0, no overlap), disaggregate (pay it, gain overlap), hybrid (reshard in place, drop the copy), relay (take it off the critical path for staleness).
- Sync is pure tax, so the goal is to make it vanish, not to make it efficient — hence pointer-swap when colocated and overlap-with-prefill when not.
- Once sync leaves the critical path, staleness appears — a budget measured against reward at fixed compute, with version-tagged trajectories as the prerequisite. The statistics of that budget are 09c.
Sources used
| Source | System idea used |
|---|---|
| HybridFlow / verl | Hybrid control and 3D-HybridEngine resharding between train and generate layouts. |
| verl repository | FSDP/Megatron training + vLLM/SGLang rollout backends, RL algorithms, MoE scale. |
| OpenRLHF | Ray + vLLM + DeepSpeed/ZeRO disaggregated scheduling. |
| slime | SGLang-native rollout, Megatron-native training, bucketed weight updates. |
| LlamaRL | Async architecture and direct-memory (RDMA) weight synchronization. |
| Laminar | Relay-worker parameter service that breaks the global sync barrier. |
| Relax / AsyncFlow | TransferQueue data bus, service decoupling, continuous staleness knob. |