Weight sync — closing the loop

The single most important wire in a production RL framework — and the most common source of "training looks fine but isn't learning" bugs.

Why this wire exists at all

From lesson 2: the rollout engine holds a frozen copy of the policy weights. From lesson 5: the trainer holds the trainable copy. After the trainer takes a step, the two copies disagree. If we sample the next batch of rollouts without updating the rollout engine, we are sampling from the policy that produced last step's data — and learning nothing new.

step 0:  rollout = trainer = π_SFT      ← initial sync
         sample, score, compute loss, trainer.step()
         rollout = π_SFT,  trainer = π_1     ← out of sync

sync:    rollout ← trainer.state_dict()
         rollout = trainer = π_1

step 1:  sample, score, compute loss, trainer.step()
         ...

So we need a primitive that pushes trainer.state_dict() into rollout.model.parameters(). In this in-process toy that's a plain copy:

# From rl_framework/weight_sync.py — WeightSyncer.sync
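for p_src, p_dst in zip(trainer.model.parameters(), rollout.model.parameters()):
    # In-place copy: writes into the rollout's existing tensor rather than rebinding it.
    # Assumes both models list their parameters in the same order.
    p_dst.data.copy_(p_src.data)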
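A minimal runnable sketch of the same idea, assuming two identically shaped PyTorch modules. The helper name `sync_weights` and the tiny `nn.Linear` stand-ins are hypothetical, not the framework's actual code. It walks `state_dict()` so that buffers are copied too (iterating `parameters()` alone would skip them), and it checks that the destination storage does not move.

```python
import torch
from torch import nn


@torch.no_grad()
def sync_weights(src: nn.Module, dst: nn.Module) -> None:
    """Copy parameters and buffers from src into dst in place (hypothetical helper)."""
    src_state = src.state_dict()
    for name, t_dst in dst.state_dict().items():
        # state_dict() returns tensors that share storage with the module,
        # so copy_ writes straight into dst's existing memory.
        t_dst.copy_(src_state[name])


trainer_model = nn.Linear(4, 2)   # stand-in for the trainable policy
rollout_model = nn.Linear(4, 2)   # stand-in for the frozen rollout copy

ptr_before = rollout_model.weight.data_ptr()
sync_weights(trainer_model, rollout_model)

weights_match = torch.equal(trainer_model.weight, rollout_model.weight)
same_storage = rollout_model.weight.data_ptr() == ptr_before
```

After the call, `weights_match` and `same_storage` should both be true: the values changed, the buffer did not.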

What sync is in production

That two-line copy hides everything. In a real deployment the trainer's weights are sharded across many GPUs (FSDP, ZeRO-3, Megatron-TP) in bf16, and the rollout engine's weights live on a different set of GPUs in a different tensor-parallel layout, possibly quantized to int8 for inference speed. "Sync" is then five operations:

1. Gather. Reassemble each parameter from its FSDP/ZeRO shards. (all-gather)
2. Cast. Convert to the rollout engine's numeric format (bf16 → int8, etc.).
3. Reshard. Split into the rollout's tensor-parallel layout.
4. Broadcast. Send to each rollout rank over a dedicated NCCL group.
5. Invalidate. Drop any cached speculative-decoding state on the rollout engine.
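The five operations can be sketched on a single CPU process, with plain tensor ops standing in for the collectives. Everything here is an illustrative assumption: the function names, the row-wise sharding (real FSDP shards flattened parameters), the dict-based rollout ranks, and the fp32 → bf16 cast chosen in place of int8 quantization.

```python
import torch


def gather(shards):
    """Step 1: reassemble a sharded parameter (stands in for an all-gather)."""
    return torch.cat(shards, dim=0)


def cast(full, dtype):
    """Step 2: convert to the rollout engine's numeric format."""
    return full.to(dtype)


def reshard(full, tp_size):
    """Step 3: split along dim 0 into the rollout's tensor-parallel layout."""
    return list(torch.chunk(full, tp_size, dim=0))


def broadcast(pieces, rollout_ranks):
    """Step 4: hand each piece to its rank (stands in for an NCCL broadcast)."""
    for rank_state, piece in zip(rollout_ranks, pieces):
        rank_state["weight"] = piece


def invalidate(rollout_ranks):
    """Step 5: drop cached state that was computed under the old weights."""
    for rank_state in rollout_ranks:
        rank_state["cache"].clear()


# Trainer side: an 8x4 weight split row-wise across 4 hypothetical shards.
full_weight = torch.arange(32, dtype=torch.float32).reshape(8, 4)
trainer_shards = list(torch.chunk(full_weight, 4, dim=0))

# Rollout side: 2 tensor-parallel ranks, each holding some stale cache.
rollout_ranks = [{"weight": None, "cache": {"draft": "stale"}} for _ in range(2)]

pieces = reshard(cast(gather(trainer_shards), torch.bfloat16), tp_size=2)
broadcast(pieces, rollout_ranks)
invalidate(rollout_ranks)
```

Each rollout rank ends up with a 4x4 bf16 slice and an empty cache; concatenating the slices recovers the original weight because the small integers used here are exactly representable in bf16.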

Step 1 (gather) is the expensive one: a full gather over a 70B model takes many seconds. Step 4 (broadcast) is a single NCCL broadcast. Frameworks like verl's HybridEngine and OpenRLHF's merger spend a lot of effort making step 1 zero-copy (the gathered tensor lives in the same allocator as the rollout buffer, so step 3, the reshard, is a reinterpretation rather than a memcpy).
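To get a feel for "many seconds", a back-of-envelope calculation helps. The payload size follows directly from the parameter count and dtype; the bandwidth figure below is a made-up placeholder, not a measurement of any real interconnect, and the helper names are hypothetical.

```python
def sync_payload_gb(n_params: float, bytes_per_param: int) -> float:
    """Size of one full copy of the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to move the payload once at the given bandwidth."""
    return payload_gb / bandwidth_gb_per_s


payload = sync_payload_gb(70e9, 2)      # 70B parameters, 2 bytes each in bf16
ASSUMED_BANDWIDTH_GB_S = 50.0           # hypothetical effective bandwidth
seconds = transfer_seconds(payload, ASSUMED_BANDWIDTH_GB_S)
```

Under these assumptions the payload is 140 GB and a single pass takes about 2.8 seconds, before counting any collective overhead or resharding.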

Three deployment patterns

Pattern	Where rollout / trainer run	Sync cost	Trade-off
Colocated	Same GPUs, time-shared	In-process pointer swap	Utilization is bounded by whichever role is running; rollout and training never overlap. (TRL default.)
Disaggregated	Different GPUs, concurrent	NCCL broadcast over a dedicated group	Need to reshard if TP/PP layouts differ. Better utilization. (OpenRLHF.)
Hybrid / Async	Different GPUs, trainer steps ahead	Versioned weight snapshots + traj tags	Highest throughput. Requires "which policy version produced this token" bookkeeping. (verl-async, SLIME.)

Interactive · what happens when sync is stale

The simulation below runs a toy training loop with a knob: how often the trainer syncs to the rollout engine. sync_every = 1 means every step (always on-policy); higher means stale. Watch what happens to the PPO ratio and the fraction of tokens the clip catches.

Stale π_old → ratio explosion

As sync_every grows, the trainer's policy moves further from the rollout's frozen copy between syncs, log-ratios grow in magnitude, and PPO clipping kicks in on more tokens. (This toy uses a hard "in clip / out of clip" gate to make the trend visible. The real PPO surrogate is softer: it takes min(s₁, s₂) of the unclipped and clipped terms, so an out-of-range token still contributes gradient whenever the unclipped term is the smaller one. The big-picture failure mode is real: stale π_old makes the importance-sampling ratio meaningless, and clipping is doing damage control downstream.)

[Interactive simulation: a sync_every slider (initially 1) with readouts for final |log ρ|, final frac_clipped, and final reward.]
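The quantities the simulation reports can be computed per token in a few lines. This is a sketch, not the simulation's own code: `ppo_token_stats` is a hypothetical helper, the log-probabilities are invented, and "clipped" here means the ratio falls outside [1 − ε, 1 + ε] (some frameworks instead count only tokens where the clipped branch is the one selected by the min).

```python
import math


def ppo_token_stats(logp_new, logp_old, advantages, clip_eps=0.2):
    """Return (mean surrogate, frac_clipped, mean |log ratio|) over a batch of tokens."""
    surrogate, n_clipped, abs_log_ratio = 0.0, 0, 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        log_ratio = lp_new - lp_old
        ratio = math.exp(log_ratio)
        ratio_clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        s1 = ratio * adv            # unclipped term
        s2 = ratio_clipped * adv    # clipped term
        surrogate += min(s1, s2)    # PPO keeps the more pessimistic of the two
        n_clipped += int(ratio != ratio_clipped)
        abs_log_ratio += abs(log_ratio)
    n = len(logp_new)
    return surrogate / n, n_clipped / n, abs_log_ratio / n


logp_old = [-1.0, -2.0, -0.5, -1.5]        # recorded by the rollout engine
advantages = [1.0, -1.0, 0.5, 1.0]

# Freshly synced: trainer and rollout agree, so every ratio is exactly 1.
_, fresh_clipped, fresh_drift = ppo_token_stats(logp_old, logp_old, advantages)

# Stale: the trainer has moved several steps since the last sync.
logp_new_stale = [-0.5, -2.6, -0.45, -1.0]
_, stale_clipped, stale_drift = ppo_token_stats(logp_new_stale, logp_old, advantages)
```

With these numbers the fresh batch has no clipped tokens, while three of the four stale tokens land outside the clip range.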

The most common production bug

Stale weight sync is one of the bug families that look healthy on the dashboard:

Loss curve still goes down (KL term keeps decreasing, PPO term is clipped to zero).
Reward EMA stays flat but doesn't crash — the policy isn't moving.
frac_clipped is the canary. Healthy runs typically sit at 10–26%; sustained ≥30% means π_old is drifting too far from π_θ. The fix is to sync more often, that is, after fewer optimizer steps.
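One way to turn the canary into an automatic check is a rolling-window monitor. The class below is a hypothetical sketch rather than part of the framework; the 30% threshold comes from the rule of thumb above, while the window length is an arbitrary choice for illustration.

```python
from collections import deque


class ClipFractionCanary:
    """Flag a sustained high frac_clipped (hypothetical monitor)."""

    def __init__(self, threshold: float = 0.30, window: int = 5):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def update(self, frac_clipped: float) -> bool:
        """Record one step; return True once a full window sits at or above threshold."""
        self.recent.append(frac_clipped)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v >= self.threshold for v in self.recent)


canary = ClipFractionCanary(threshold=0.30, window=3)
history = [0.12, 0.35, 0.18, 0.31, 0.40, 0.33]   # invented per-step values
alerts = [canary.update(v) for v in history]
```

A single spike (the 0.35 at step 1) does not fire; only the last step, where three consecutive values are at or above 0.30, raises the alert.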

The async case is even subtler: the trainer steps ahead while the rollout is mid-generation. Half a trajectory ends up sampled from π_{θ−k} and half from π_{θ−k+1}; the old_logp tensor on the trajectory is no longer a coherent log-likelihood. Versioned weight snapshots and a weights_generation tag on every trajectory are how SLIME and verl-async fix this.
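The bookkeeping can be sketched with a small dataclass. This is an illustration under assumptions, not how SLIME or verl-async implement it: the `Trajectory` class, the per-token `token_versions` list, and `split_by_coherence` are all hypothetical. The idea is that a trajectory only earns a single weights_generation tag if every token was sampled under the same weight version.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Trajectory:
    tokens: List[int] = field(default_factory=list)
    old_logp: List[float] = field(default_factory=list)
    token_versions: List[int] = field(default_factory=list)  # weight version per token

    def append(self, token: int, logp: float, weights_version: int) -> None:
        self.tokens.append(token)
        self.old_logp.append(logp)
        self.token_versions.append(weights_version)

    @property
    def weights_generation(self) -> Optional[int]:
        """The single version that produced every token, or None if versions are mixed."""
        versions = set(self.token_versions)
        return next(iter(versions)) if len(versions) == 1 else None


def split_by_coherence(trajectories):
    """Separate trajectories with one coherent version from mixed-version ones."""
    coherent = [t for t in trajectories if t.weights_generation is not None]
    mixed = [t for t in trajectories if t.weights_generation is None]
    return coherent, mixed


clean = Trajectory()
for tok in (11, 12, 13):
    clean.append(tok, logp=-1.0, weights_version=3)

torn = Trajectory()
for tok, version in ((21, 3), (22, 3), (23, 4), (24, 4)):  # sync landed mid-generation
    torn.append(tok, logp=-1.0, weights_version=version)

coherent, mixed = split_by_coherence([clean, torn])
```

What to do with the mixed set (drop it, truncate at the version boundary, or recompute old_logp) is a policy decision this sketch leaves open.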

Sync frequency in practice

This framework defaults to syncing every step, which is also what most SOTA recipes do: the policy-gradient objective assumes samples drawn from (or close to) the current policy. The cost of a sync varies wildly with the deployment: in colocated runs it's near-free, in disaggregated runs it can be a meaningful fraction of a step, and in async pipelines it overlaps with computation. Tuning sync_every is a real production trade-off; the framework defaults to the cleanest setting and exposes the knob.
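The effect of the knob on staleness can be seen with version counters alone. The loop below is a toy stand-in for the controller, with `run_loop` a hypothetical name; it tracks how many optimizer steps the rollout copy lags behind the trainer at the moment each batch is sampled.

```python
def run_loop(num_steps: int, sync_every: int):
    """Return the rollout's lag (in trainer steps) at each sampling point."""
    trainer_version = 0
    rollout_version = 0
    staleness = []
    for step in range(num_steps):
        # Sample a batch with whatever weights the rollout currently holds.
        staleness.append(trainer_version - rollout_version)
        trainer_version += 1                     # stands in for trainer.step()
        if (step + 1) % sync_every == 0:
            rollout_version = trainer_version    # stands in for syncer.sync(...)
    return staleness


on_policy = run_loop(num_steps=6, sync_every=1)
stale = run_loop(num_steps=6, sync_every=3)
```

With sync_every = 1 the lag is always zero; with sync_every = 3 it cycles 0, 1, 2, so the worst-case lag is sync_every − 1 steps.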

Why this is its own role

Three reasons that justify having a WeightSyncer class rather than just calling load_state_dict inline in the controller:

Subclassable. NCCLBroadcastSyncer, SharedMemorySyncer, ResharderSyncer all expose the same interface. The controller code is unchanged.
Per-tensor reasoning. By iterating parameters (rather than calling load_state_dict) we can insert per-tensor cast/shard logic — which is exactly where a real implementation drops in its all-gather.
Stable buffers. p_dst.data.copy_(...) is not literally zero-copy, but it writes into the existing storage and so keeps the underlying buffer identity stable. The rollout engine's KV cache allocation and any cuBLAS workspaces stay valid across syncs.
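The first two reasons can be made concrete with a small class hierarchy. This is a sketch under assumptions: `InProcessSyncer`, `CastingSyncer`, and the `transform` hook are hypothetical names, and the `SimpleNamespace` objects stand in for the real trainer and rollout roles. Note that `copy_` would convert dtypes on its own; the explicit hook only marks the place where gather, cast, or reshard logic would go.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace

import torch
from torch import nn


class WeightSyncer(ABC):
    """Shared interface: the controller only ever calls sync(trainer, rollout)."""

    @abstractmethod
    def sync(self, trainer, rollout) -> None:
        ...


class InProcessSyncer(WeightSyncer):
    def transform(self, name: str, tensor: torch.Tensor) -> torch.Tensor:
        """Per-tensor hook. A real syncer would all-gather or reshard here."""
        return tensor

    @torch.no_grad()
    def sync(self, trainer, rollout) -> None:
        src = dict(trainer.model.named_parameters())
        for name, p_dst in rollout.model.named_parameters():
            # In-place copy keeps p_dst's storage (and any pointers to it) stable.
            p_dst.copy_(self.transform(name, src[name]))


class CastingSyncer(InProcessSyncer):
    """Hypothetical subclass: cast each tensor to the rollout dtype before copying."""

    def __init__(self, dtype: torch.dtype):
        self.dtype = dtype

    def transform(self, name: str, tensor: torch.Tensor) -> torch.Tensor:
        return tensor.to(self.dtype)


trainer = SimpleNamespace(model=nn.Linear(4, 2))                     # fp32 trainer
rollout = SimpleNamespace(model=nn.Linear(4, 2).to(torch.bfloat16))  # bf16 rollout

syncer: WeightSyncer = CastingSyncer(torch.bfloat16)
syncer.sync(trainer, rollout)   # same call whichever subclass is plugged in
```

Swapping in a different subclass changes what happens per tensor without touching the line that calls `sync`.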

The interface, unchanged across all three patterns

syncer.sync(trainer, rollout)

That call is the last step of every controller iteration. In-process it's a copy. In a Ray-actor deployment it's an RPC. In a multi-node NCCL deployment it's a broadcast. The controller doesn't care — and that indifference is what lets the rest of this framework be re-used across all three deployment patterns without modification.

Takeaway

Without weight sync, the rollout engine samples forever from the SFT checkpoint. With weight sync done wrong (stale, mid-generation, or with a sharding mismatch) you get the classic RL bugs: KL explosion, ratio explosion, silent precision drift, learning that stalls. The fix is to treat sync as a first-class role with one explicit primitive.