Placement and weight sync, from the layout mismatch up

09a left a term unexplained: τ_S, the weight-sync time. This lesson derives it. The chain is short and forces everything else: the actor and the learner want physically different shardings of the same N parameters, so a conversion is unavoidable, that conversion moves bytes, and the bytes cost time on a link. Where you put the two roles — colocated, disaggregated, hybrid, relayed — is nothing more than four ways to pay or hide that one cost.

The question this lesson answers

Where do the actor and learner live, and how do fresh weights get from one to the other cheaply? Almost every clever thing a SOTA framework does — 3D-HybridEngine resharding, bucketed updates, direct-memory sync, relay workers — is an answer to that, and which answer is right is set by a number you can compute: τ_S as a fraction of τ_step.

1 · Why a conversion is forced — the layout mismatch

The learner and the actor are the same N weights, but they are sharded across GPUs differently, because training and serving optimize for different things:

Learner layout. FSDP/ZeRO shards by parameter: every GPU holds a thin slice of every layer's weights, plus that slice's gradient and optimizer state (the 16N bill of lesson 07). Megatron may instead split with tensor, pipeline, expert, or context parallelism. Either way, the layout is chosen to fit the 16N training state and keep MFU high.
Actor layout. vLLM/SGLang want tensor-parallel inference weights laid out for fast decode, paged KV alongside (RL 20), continuous batching, and often a lower precision or MoE-specialized kernel. The layout is chosen to maximize decode bandwidth, and it carries no optimizer state.

These two partitions almost never coincide. With FSDP across d ranks, a learner GPU holds "1/d of every tensor," cut wherever the flattened parameter happens to split; an actor GPU holds the slice of each matrix that its tensor-parallel kernels expect, cut along a specific dimension. So handing weights from learner to actor is not a copy. It is a gather-then-reshard: reassemble each full parameter from its learner shards, then re-cut it along the actor's partition. That reshape is the irreducible reason weight sync exists, and why it is more than a memcpy.

learner shard layout ≠ actor shard layout ⟹ gather + reshard required every update
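The gather-then-reshard step can be sketched on a toy 4×4 matrix. This is a minimal illustration in plain Python lists, not any framework's API: the helper names `fsdp_shard`, `all_gather`, and `tp_column_shard` are hypothetical, and the column-wise tensor-parallel cut is an assumed actor layout chosen for the example.

```python
def fsdp_shard(flat, d):
    """Learner layout: cut the flattened parameter into d contiguous slices."""
    n = len(flat) // d
    return [flat[i * n:(i + 1) * n] for i in range(d)]

def all_gather(shards):
    """Reassemble the full flattened parameter from every learner shard."""
    return [x for shard in shards for x in shard]

def tp_column_shard(flat, rows, cols, tp):
    """Actor layout (assumed): re-cut a rows x cols matrix into tp column blocks."""
    matrix = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    width = cols // tp
    return [[row[k * width:(k + 1) * width] for row in matrix] for k in range(tp)]

rows, cols = 4, 4
weight = list(range(rows * cols))        # row-major 4x4 matrix, values 0..15

learner = fsdp_shard(weight, d=4)        # 4 learner ranks, one contiguous slice each
full = all_gather(learner)               # gather: rebuild the whole parameter
actor = tp_column_shard(full, rows, cols, tp=2)   # reshard: 2 actor ranks

print(learner[0])   # [0, 1, 2, 3]
print(actor[0])     # [[0, 1], [4, 5], [8, 9], [12, 13]]
```

Actor rank 0 ends up with elements drawn from all four learner slices, which is why no single learner rank can hand its shard straight to an actor rank.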

2 · The byte cost, and the four operations it decomposes into

How big is the transfer? Lesson 02's weight rule: one full copy of the policy is precision × N bytes, and you must deliver it to every independent actor replica:

sync_bytes ≈ precision_bytes × N × actor_copies

For a 70B bf16 policy one copy is ~140 GB; at 405B it is ~810 GB. That is the headline number, but it hides structure. The actual sync is four operations in series (RL 22c), and each has its own lever — which is exactly why there are several "weight-sync patterns" rather than one:

#	Operation	Bound by	Order of magnitude (70B)	The lever on it
1	All-gather FSDP shards → full params	intra-node BW (NVLink)	~0.5–2 s	broadcast tree, overlap with compute
2	Cast dtype (bf16 → fp8)	HBM write	~0.1–0.3 s	fuse into the gather; fp8 halves the bytes downstream
3	Broadcast learner → every actor	inter-node BW (IB)	~1–5 s	bucketing, RDMA/DMA, fewer copies — the long pole
4	Reshard into actor TP layout	NCCL + permute	~0.5–4 s	resharding engine (HybridFlow), avoid round-trips

Operation 3 is the long pole, and it is where lesson 02's 18× gap bites: 140 GB over NVLink (~900 GB/s) is ~0.16 s, but over InfiniBand (~50 GB/s) it is ~2.8 s, which can rival a whole training step. The byte equation alone tells you the architecture: τ_S grows linearly with N and inversely with the bandwidth of the slowest link on the path, so the bigger the model and the slower the link, the more the design must work to avoid doing this globally every step.
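The arithmetic behind these figures is short enough to check directly. The sketch below uses the bandwidths quoted above (~900 GB/s NVLink, ~50 GB/s InfiniBand) and assumes an ideal link with no protocol overhead; the function names are illustrative.

```python
GB = 1e9
NVLINK_BPS = 900 * GB       # ~900 GB/s, intra-node figure from the text
INFINIBAND_BPS = 50 * GB    # ~50 GB/s, inter-node figure from the text

def sync_bytes(n_params, precision_bytes, actor_copies=1):
    """sync_bytes = precision_bytes * N * actor_copies."""
    return precision_bytes * n_params * actor_copies

def transfer_seconds(n_bytes, bandwidth_bps):
    """Ideal transfer time: bytes divided by link bandwidth, no overhead."""
    return n_bytes / bandwidth_bps

copy_70b = sync_bytes(70e9, precision_bytes=2)     # bf16 is 2 bytes per parameter
copy_405b = sync_bytes(405e9, precision_bytes=2)

print(copy_70b / GB, copy_405b / GB)                        # 140.0 810.0
print(round(transfer_seconds(copy_70b, NVLINK_BPS), 2))     # 0.16
print(round(transfer_seconds(copy_70b, INFINIBAND_BPS), 2)) # 2.8
print(NVLINK_BPS / INFINIBAND_BPS)                          # 18.0
```

Casting to fp8 (1 byte per parameter) before the broadcast halves every one of these byte counts, which is the lever listed for operation 2.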

Weight sync buys nothing — it is pure tax

A gradient step buys learning; a rollout buys data; the sync buys neither. It is the cost of keeping two copies of the weights in two layouts. So unlike τ_R or τ_T, the goal for τ_S is not "make it efficient" — it is "make it disappear," either by not having two copies (colocate) or by hiding it behind the next rollout's prefill.

3 · Placement = four ways to pay or hide τ_S

Now the placement choices are not a menu to memorize — each is a distinct answer to "what do we do about the conversion cost." Derive the idle they create. In disaggregated synchronous mode the step is max(τ_R, τ_T) + τ_S, so the faster pool sits idle for the difference, plus the whole cluster waits out the sync:

idle_band = |τ_R − τ_T| + τ_S
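To make the two terms concrete, here is a small sketch of the synchronous disaggregated step. The timings are hypothetical placeholders chosen only to show the bookkeeping; the per-pool idle figures follow from the step formula above.

```python
def sync_step(tau_r, tau_t, tau_s):
    """Step time and idle accounting for separate pools with a blocking sync."""
    step = max(tau_r, tau_t) + tau_s
    return {
        "step": step,
        "actor_idle": step - tau_r,      # time the rollout pool is not generating
        "learner_idle": step - tau_t,    # time the learner pool is not training
        "idle_band": abs(tau_r - tau_t) + tau_s,
    }

# Hypothetical timings in seconds: rollout is slower than training here.
timing = sync_step(tau_r=40.0, tau_t=25.0, tau_s=3.0)
print(timing)
# {'step': 43.0, 'actor_idle': 3.0, 'learner_idle': 18.0, 'idle_band': 18.0}
```

The idle_band equals the idle time of the faster pool (the learner in this example), while the slower pool idles only for τ_S. Balancing the pools shrinks the first term; everything else in this lesson is aimed at the second.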

Every placement is an attack on one of those two terms:

Placement	Physical shape	What it does to τ_S / idle	Failure mode
Colocated	Actor and learner time-share the same GPUs.	τ_S → ~0: same process, same device, so sync is a pointer swap with no conversion over the wire.	No overlap: rollout and training run back to back, so the step is τ_R + τ_T rather than max(τ_R, τ_T); HBM thrash swapping KV ↔ optimizer state.
Disaggregated	Separate actor pool and learner pool.	Pays the full τ_S broadcast, but lets τ_R and τ_T overlap and be sized independently.	O(model size) broadcast every step; a pool idles if the split is wrong (09a §4).
Hybrid engine	Shared workers, but train and generate phases reshard in place with zero redundancy.	Kills the duplicate actor copy and shrinks operation 4 — reshard locally instead of broadcasting a second copy.	Hard engine integration; the code must understand both training and serving layouts at once.
Relay / parameter service	Weights flow through relay workers; actors pull a fresh-enough version when ready.	Removes the global sync barrier entirely — τ_S leaves the critical path, at the cost of versioned staleness.	Versioning, staleness, and failure recovery become first-class (09c).

The decision is workload-dependent, and the byte equation decides it. A 7B experiment on one node has a tiny τ_S and should not pay for any extra machinery: colocate and pointer-swap. A 405B policy with long-CoT rollouts has an 810 GB payload per bf16 actor copy and slow inter-node links, so a naive global broadcast every step is the bug; it must reshard in place, bucket, or relay.

4 · The named patterns, each grounded in an operation

With the four operations and the idle equation in hand, the public "weight-movement patterns" stop being a list of brand names and become a map of which operation each one attacks:

Pattern	Attacks	Mechanism in one line	Choose it when
3D-HybridEngine reshard (HybridFlow / verl)	op 4 + the duplicate copy	Reshard the actor between train and generate layouts in place, with low memory redundancy.	Colocated/hybrid is attractive but naive reloads dominate.
Ray + vLLM + ZeRO handoff (OpenRLHF)	scheduling around ops 1–4	Practical disaggregated orchestration over widely-used engines.	Accessibility and flexible RLHF/RLVR workflows matter most.
Bucketed update (slime / SGLang)	op 3 latency	Send weights in fused buckets and overlap each bucket's transfer with the next gather.	Full monolithic reload is too slow but actors stay service-based.
Direct-memory sync (LlamaRL)	op 3 copies	GPU-to-GPU RDMA writes that skip host staging.	Hundreds of billions of params and tight hardware coupling is acceptable.
Relay workers (Laminar)	the global barrier itself	A parameter tier lets each rollout pull a fresh-enough version without a cluster-wide sync.	Long-tail trajectories make global sync the bottleneck.
TransferQueue + staleness knob (Relax / AsyncFlow)	moves τ_S off the critical path	A data bus carries weights and trajectories; one knob slides from on-policy to fully async.	Multimodal/agentic roles need fault-isolated services.
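The bucketed-update row is the easiest of these to model. The sketch below is a generic two-stage pipeline, not slime's or SGLang's actual implementation: it assumes gather and send use separate resources and can overlap fully, and the bucket count and per-bucket times are hypothetical.

```python
def serial_time(num_buckets, gather_s, send_s):
    """Monolithic behaviour: each bucket is gathered, then sent, one after another."""
    return num_buckets * (gather_s + send_s)

def pipelined_time(num_buckets, gather_s, send_s):
    """Bucket i+1 is gathered while bucket i is on the wire."""
    gather_done = 0.0
    send_done = 0.0
    for _ in range(num_buckets):
        gather_done += gather_s                            # gathers run back to back
        send_done = max(send_done, gather_done) + send_s   # send waits for its bucket
    return send_done

# Hypothetical split: 8 buckets, 0.125 s to gather and 0.375 s to send each.
print(serial_time(8, 0.125, 0.375))      # 4.0
print(pipelined_time(8, 0.125, 0.375))   # 3.125
```

With the send stage slower than the gather stage, the pipelined total collapses to one gather plus all the sends, so only the first bucket's gather stays exposed. The transfer itself is not shortened; reducing bytes or copies is what the cast and direct-memory rows address.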

5 · The freshness contract — staleness is a budget, not a switch

The moment the sync leaves the critical path (relay, async), trajectories start being generated under old weights. The controller must therefore stamp every trajectory with the policy version that produced it, so the learner can decide whether to accept it: the gradient correction depends on which version generated the sample (the importance ratio of RL 06; the full statistical story is 09c). The contract has a few discrete settings:

Freshness mode	Rule	System effect	Statistical effect
Strict on-policy	train update t only on version-t samples	Global barriers, low utilization on long-tail.	Cleanest gradient estimator.
Near-on-policy	accept samples within k versions	Actor/learner overlap; stale tail bounded.	Small off-policy bias, usually fine if monitored.
Fully async	queue-based admission, no global batch	Highest utilization.	Needs staleness-aware loss / IS correction.
Replay / search-heavy	prioritize reward or recency from a buffer	Searcher fleet scales past the learner.	Needs an off-policy-compatible objective.
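The first two rows of the table reduce to a single admission check on the version stamp. This is a minimal sketch with hypothetical names (`Trajectory`, `admit`, `max_staleness`); a real controller would also carry the behaviour policy's log-probabilities for the importance ratio.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    tokens: list
    policy_version: int    # stamped by the controller at generation time

def admit(traj, learner_version, max_staleness):
    """Accept a trajectory that lags the learner by at most max_staleness versions."""
    lag = learner_version - traj.policy_version
    return 0 <= lag <= max_staleness

batch = [
    Trajectory(tokens=[1, 2], policy_version=10),
    Trajectory(tokens=[3], policy_version=9),
    Trajectory(tokens=[4, 5], policy_version=6),
]

strict = [t for t in batch if admit(t, learner_version=10, max_staleness=0)]
near = [t for t in batch if admit(t, learner_version=10, max_staleness=2)]
print(len(strict), len(near))   # 1 2
```

Setting `max_staleness=0` gives strict on-policy; raising it to k gives the near-on-policy row. The fully async and replay rows need more than a filter, since they change the objective as well as the admission rule.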

Design rule: treat staleness like a latency SLO

It is not "async: yes/no" — it is a budget you spend. If reward at fixed GPU-hours is unchanged when you loosen k, spend the budget to reclaim the idle_band. If reward degrades, tighten k before buying more actors. The measurement that matters is final quality at fixed compute, never steps/hour.

Sync time and freshness budget

As the model grows, τ_S turns from a rounding error into the architecture. The exact number is not the point; the point is the threshold where simple broadcast stops working and the design must reshard, bucket, or relay.
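A rough way to see that threshold is to tabulate τ_S as a share of the step for a few model sizes. The sketch below assumes one bf16 actor copy sent over a ~50 GB/s inter-node link, and it holds the compute part of the step at a made-up 30 s so that only the sync term moves. In practice compute time also grows with the model, so the shares are illustrative and not measurements.

```python
GB = 1e9
INFINIBAND_BPS = 50 * GB     # ~50 GB/s, the inter-node figure used in this lesson
ASSUMED_COMPUTE_S = 30.0     # hypothetical max(tau_R, tau_T), held fixed on purpose

def sync_share(n_params, precision_bytes=2, actor_copies=1,
               bandwidth=INFINIBAND_BPS, compute_s=ASSUMED_COMPUTE_S):
    """Return (tau_S, tau_S / tau_step) for a blocking broadcast over one link."""
    tau_s = precision_bytes * n_params * actor_copies / bandwidth
    return tau_s, tau_s / (compute_s + tau_s)

sizes = [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]
rows = [(name, *sync_share(n)) for name, n in sizes]

for name, tau_s, share in rows:
    print(f"{name:>5}  tau_S = {tau_s:6.2f} s  share of step = {share:5.1%}")
```

Under these assumptions the sync goes from under a second at 7B to roughly 16 s at 405B, at which point it is no longer a term to tolerate but one the design has to remove.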

What carries forward

The mismatch forces the cost. Learner shards by parameter (+16N state); actor shards for decode (no optimizer, paged KV). Different partitions ⟹ a gather-and-reshard every update — that is what τ_S is.
The bytes set the architecture: sync_bytes ≈ precision × N × actor_copies — 140 GB at 70B, 810 GB at 405B — decomposing into all-gather → cast → broadcast → reshard, with the inter-node broadcast as the long pole (lesson 02's 18×).
Placement is four ways to pay or hide it: colocate (τ_S→0, no overlap), disaggregate (pay it, gain overlap), hybrid (reshard in place, drop the copy), relay (take it off the critical path for staleness).
Sync is pure tax, so the goal is to make it vanish, not to make it efficient — hence pointer-swap when colocated and overlap-with-prefill when not.
Once sync leaves the critical path, staleness appears — a budget measured against reward at fixed compute, with version-tagged trajectories as the prerequisite. The statistics of that budget are 09c.

Sources used

Source	System idea used
HybridFlow / verl	Hybrid control and 3D-HybridEngine resharding between train and generate layouts.
verl repository	FSDP/Megatron training + vLLM/SGLang rollout backends, RL algorithms, MoE scale.
OpenRLHF	Ray + vLLM + DeepSpeed/ZeRO disaggregated scheduling.
slime	SGLang-native rollout, Megatron-native training, bucketed weight updates.
LlamaRL	Async architecture and direct-memory (RDMA) weight synchronization.
Laminar	Relay-worker parameter service that breaks the global sync barrier.
Relax / AsyncFlow	TransferQueue data bus, service decoupling, continuous staleness knob.