Weight sync — closing the loop
The single most important wire in a production RL framework — and the most common source of "training looks fine but isn't learning" bugs.
Why this wire exists at all
From lesson 2: the rollout engine holds a frozen copy of the policy weights. From lesson 5: the trainer holds the trainable copy. After the trainer takes a step, the two copies disagree. If we sample the next batch of rollouts without updating the rollout engine, we are sampling from the policy that produced last step's data — and learning nothing new.
step 0: rollout = trainer = π_SFT ← initial sync
sample, score, compute loss, trainer.step()
rollout = π_SFT, trainer = π_1 ← out of sync
sync. rollout ← trainer.state_dict()
rollout = trainer = π_1
step 1: sample, score, compute loss, trainer.step()
...
So we need a primitive that pushes trainer.state_dict() into rollout.model.parameters(). In this in-process toy that's a plain copy:
# From rl_framework/weight_sync.py — WeightSyncer.sync
for p_src, p_dst in zip(trainer.model.parameters(), rollout.model.parameters()):
    p_dst.data.copy_(p_src.data)  # in-place copy into the rollout's existing buffer
What sync is in production
That two-line copy hides everything. In a real deployment the trainer's weights are sharded across many GPUs (FSDP, ZeRO-3, Megatron-TP) in bf16, and the rollout engine's weights live on a different set of GPUs in a different tensor-parallel layout, possibly quantized to int8 for inference speed. "Sync" is then five operations:
1. Gather. Reassemble each parameter from its FSDP/ZeRO shards. (all-gather)
2. Cast. Convert to the rollout engine's numeric format (bf16 → int8, etc.).
3. Reshard. Split into the rollout's tensor-parallel layout.
4. Broadcast. Send to each rollout rank over a dedicated NCCL group.
5. Invalidate. Drop any cached speculative-decoding state on the rollout engine.
Step 1 is the expensive one: a full gather over a 70B model can take many seconds. Step 4 is a single NCCL broadcast. Frameworks like verl's HybridEngine and OpenRLHF's merger spend a lot of effort making step 1 zero-copy (the gathered tensor lives in the same allocator as the rollout buffer, so step 3 is a reinterpretation rather than a memcpy).
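The five operations can be sketched end to end in plain Python. This is only a toy under stated assumptions: parameters are flat lists of floats, the "all-gather" is a list concatenation, the cast is a simple symmetric per-tensor int8 quantization, the reshard is a contiguous split, and the "broadcast" is a method call. The names `gather`, `cast_to_int8`, `reshard`, `RolloutRank`, and `sync` are hypothetical and do not come from any real framework.

```python
from typing import Dict, List, Tuple


def gather(shards: List[List[float]]) -> List[float]:
    """Step 1: reassemble one flat parameter from its per-rank shards."""
    return [x for shard in shards for x in shard]


def cast_to_int8(values: List[float]) -> Tuple[List[int], float]:
    """Step 2: symmetric per-tensor quantization; returns (int8 values, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def reshard(values: List[int], tp_size: int) -> List[List[int]]:
    """Step 3: split into equal contiguous chunks, one per rollout TP rank."""
    if len(values) % tp_size != 0:
        raise ValueError("parameter length must divide evenly across TP ranks")
    chunk = len(values) // tp_size
    return [values[r * chunk:(r + 1) * chunk] for r in range(tp_size)]


class RolloutRank:
    """Toy stand-in for one rollout worker (assumption, not a real engine API)."""

    def __init__(self) -> None:
        self.weights: Dict[str, List[int]] = {}
        self.scales: Dict[str, float] = {}
        # Stand-in for cached speculative-decoding state tied to the old weights.
        self.spec_cache: Dict[str, str] = {"draft_tokens": "from old weights"}

    def receive(self, name: str, chunk: List[int], scale: float) -> None:
        """Step 4: accept this rank's slice (a real system would use NCCL)."""
        self.weights[name] = chunk
        self.scales[name] = scale

    def invalidate(self) -> None:
        """Step 5: drop state that was computed with the previous weights."""
        self.spec_cache.clear()


def sync(trainer_shards: Dict[str, List[List[float]]],
         ranks: List[RolloutRank]) -> None:
    """Run gather, cast, reshard, broadcast per parameter, then invalidate."""
    for name, shards in trainer_shards.items():
        full = gather(shards)
        quantized, scale = cast_to_int8(full)
        chunks = reshard(quantized, len(ranks))
        for rank, chunk in zip(ranks, chunks):
            rank.receive(name, chunk, scale)
    for rank in ranks:
        rank.invalidate()
```

Real tensor-parallel layouts slice along specific matrix dimensions rather than a flat range, so the `reshard` here only illustrates where that logic would sit in the pipeline.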
Three deployment patterns
| Pattern | Where rollout / trainer run | Sync cost | Trade-off |
|---|---|---|---|
| Colocated | Same GPUs, time-shared | In-process pointer swap | GPU utilization is bounded by whichever role is running; the two never overlap. (TRL default.) |
| Disaggregated | Different GPUs, concurrent | NCCL broadcast over a dedicated group | Need to reshard if TP/PP layouts differ. Better utilization. (OpenRLHF.) |
| Hybrid / Async | Different GPUs, trainer steps ahead | Versioned weight snapshots + traj tags | Highest throughput. Requires "which policy version produced this token" bookkeeping. (verl-async, SLIME.) |
Interactive · what happens when sync is stale
The simulation below runs a toy training loop with a knob: how often the trainer syncs to the rollout engine. sync_every = 1 means every step (always on-policy); higher means stale. Watch what happens to the PPO ratio and the fraction of tokens the clip catches.
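For readers without the interactive widget, a rough stand-in can be written in a few lines. The sketch below is not the page's actual simulation: it assumes a four-armed bandit with a softmax policy, a PPO-style clipped update, and a rollout copy of the logits that is refreshed only every `sync_every` steps. The function names, reward vector, learning rate, and batch size are all illustrative choices.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run(sync_every: int, steps: int = 40, batch: int = 64,
        lr: float = 2.0, eps: float = 0.2, seed: int = 0) -> float:
    """Return the mean fraction of samples whose gradient the PPO clip zeroes."""
    rng = random.Random(seed)
    rewards = [0.0, 0.0, 0.0, 1.0]      # toy bandit: only the last action pays
    k = len(rewards)
    trainer = [0.0] * k                  # trainable logits (pi_theta)
    rollout = list(trainer)              # frozen copy used for sampling (pi_old)
    fracs = []
    for step in range(steps):
        if step % sync_every == 0:
            rollout = list(trainer)      # weight sync
        p_old = softmax(rollout)
        p_new = softmax(trainer)
        actions = rng.choices(range(k), weights=p_old, k=batch)
        baseline = sum(rewards[a] for a in actions) / batch
        grad = [0.0] * k
        clipped = 0
        for a in actions:
            ratio = p_new[a] / p_old[a]
            adv = rewards[a] - baseline
            # The clip is active when the ratio already moved past the bound
            # in the direction the advantage is pushing.
            if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                clipped += 1
                continue
            for j in range(k):
                indicator = 1.0 if j == a else 0.0
                grad[j] += ratio * adv * (indicator - p_new[j])
        trainer = [w + lr * g / batch for w, g in zip(trainer, grad)]
        fracs.append(clipped / batch)
    return sum(fracs) / steps


if __name__ == "__main__":
    for every in (1, 2, 4, 8):
        print(f"sync_every={every}: mean frac_clipped={run(every):.3f}")
```

With `sync_every = 1` the rollout and trainer logits are identical at sampling time, so every ratio is exactly 1 and nothing is clipped. Larger values let the trainer drift between syncs, which is what pushes ratios past the clip bounds.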
The most common production bug
Stale weight sync is one of the bug families that looks healthy on the dashboard:
- Loss curve still goes down (KL term keeps decreasing, PPO term is clipped to zero).
- Reward EMA stays flat but doesn't crash — the policy isn't moving.
- frac_clipped is the canary. Healthy runs typically sit at 10–26%; sustained ≥30% means π_old is drifting too far from π_θ. Sync more often, that is, after fewer optimizer steps.
The async case is even subtler: the trainer steps ahead while the rollout is mid-generation. Half a trajectory ends up sampled from π_{θ−k} and half from π_{θ−k+1}; the old_logp tensor on the trajectory is no longer a coherent log-likelihood. Versioned weight snapshots and a weights_generation tag on every trajectory are how SLIME and verl-async fix this.
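The bookkeeping can be illustrated with a small data structure. This is a hedged sketch, not how SLIME or verl-async implement it: it assumes one `weights_generation` tag per token, and the `Trajectory` class, `is_coherent`, and `usable` helpers (plus the `max_lag` threshold and the drop-on-mismatch policy) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    tokens: List[int] = field(default_factory=list)
    old_logp: List[float] = field(default_factory=list)
    # One tag per token: which weight snapshot produced it (assumed layout).
    weights_generation: List[int] = field(default_factory=list)

    def append(self, token: int, logp: float, generation: int) -> None:
        self.tokens.append(token)
        self.old_logp.append(logp)
        self.weights_generation.append(generation)

    def is_coherent(self) -> bool:
        """True when every token was sampled from the same weight snapshot."""
        return len(set(self.weights_generation)) <= 1


def usable(traj: Trajectory, trainer_generation: int, max_lag: int) -> bool:
    """Accept only non-empty, coherent trajectories within max_lag generations."""
    if not traj.weights_generation or not traj.is_coherent():
        return False
    return trainer_generation - traj.weights_generation[0] <= max_lag


# Usage: a sync lands mid-generation, so the last token carries a newer tag.
mixed = Trajectory()
mixed.append(11, -0.7, generation=3)
mixed.append(42, -1.2, generation=3)
mixed.append(7, -0.3, generation=4)

clean = Trajectory()
clean.append(11, -0.7, generation=3)
clean.append(42, -1.2, generation=3)
```

Dropping mixed trajectories is only one possible policy. A system could instead recompute old_logp under a single snapshot, or correct per token using the tags.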
sync_every is a real production trade-off; the framework defaults to the cleanest setting (sync_every = 1, always on-policy) and exposes the knob.
Why this is its own role
Three reasons that justify having a WeightSyncer class rather than just calling load_state_dict inline in the controller:
- Subclassable. NCCLBroadcastSyncer, SharedMemorySyncer, ResharderSyncer all expose the same interface. The controller code is unchanged.
- Per-tensor reasoning. By iterating parameters (rather than calling load_state_dict) we can insert per-tensor cast/shard logic, which is exactly where a real implementation drops in its all-gather.
- Zero-copy buffers. p_dst.data.copy_(...) keeps the underlying buffer identity stable. The rollout engine's KV cache and any cuBLAS workspaces stay valid across syncs.
The interface, unchanged across all three patterns
syncer.sync(trainer, rollout)
That call is the last step of every controller iteration. In-process it's a copy. In a Ray-actor deployment it's an RPC. In a multi-node NCCL deployment it's a broadcast. The controller doesn't care — and that indifference is what lets the rest of this framework be re-used across all three deployment patterns without modification.
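A minimal sketch of that shape, assuming parameters are named lists of floats rather than tensors, might look like the following. `Model`, `InProcessSyncer`, `RoundingSyncer`, the `transform` hook, and `controller_step` are hypothetical names used for illustration; they are not the framework's actual classes.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Model:
    """Toy stand-in for a model: named flat parameters."""

    def __init__(self, params: Dict[str, List[float]]) -> None:
        self.params = params


class WeightSyncer(ABC):
    @abstractmethod
    def sync(self, trainer: Model, rollout: Model) -> None:
        """Make the rollout's weights match the trainer's."""


class InProcessSyncer(WeightSyncer):
    def transform(self, name: str, values: List[float]) -> List[float]:
        """Per-tensor hook; subclasses can cast or reshard here."""
        return values

    def sync(self, trainer: Model, rollout: Model) -> None:
        for name, src in trainer.params.items():
            # Slice assignment writes into the existing list, so any
            # reference the rollout engine already holds stays valid.
            rollout.params[name][:] = self.transform(name, src)


class RoundingSyncer(InProcessSyncer):
    """Hypothetical subclass mimicking a lower-precision cast."""

    def transform(self, name: str, values: List[float]) -> List[float]:
        return [round(v, 2) for v in values]


def controller_step(trainer: Model, rollout: Model,
                    syncer: WeightSyncer, lr: float = 0.1) -> None:
    # Stand-in for sample / score / compute loss / trainer.step().
    for values in trainer.params.values():
        for i in range(len(values)):
            values[i] -= lr
    # Sync is the last step of every controller iteration.
    syncer.sync(trainer, rollout)
```

Swapping `InProcessSyncer()` for `RoundingSyncer()` changes what arrives at the rollout without touching `controller_step`, which is the property the section is describing.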