System topology — how a real RL cluster is wired
Rollout and training are two workloads with opposite hardware profiles. Rollout is memory-bandwidth-bound and cares about KV cache; training is compute-bound and cares about overlap. Where you put them on the cluster — colocated, disaggregated, or fully async — is the single biggest systems decision in RL post-training.
The fundamental asymmetry
From lesson 02, rollout is autoregressive generation. From lesson 05, training is a single forward + backward + optimizer step on a batched sequence. They look superficially similar — both are "run a transformer" — but their resource profiles are different in every dimension worth measuring:
| Dimension | Rollout (generation) | Training (forward + backward) |
|---|---|---|
| Compute pattern | 1 new token at a time × many sequences in flight | Full sequence in one pass |
| Bound by | Memory bandwidth (KV cache reads) | FLOPs (matmul throughput) |
| Sweet-spot precision | fp16/bf16, sometimes int8 weights | bf16 with fp32 master + Adam state |
| Memory cost | Weights (1×) + KV cache (huge, grows with seq len) | Weights + activations + grads + Adam (≈ 18× param bytes, see lesson 22) |
| Parallelism that fits | Tensor parallel, pipeline parallel | FSDP/ZeRO-3, sequence parallel |
| Optimal batch size | Tens of concurrent sequences | Millions of tokens per step |
| Latency target | Tokens/sec per stream matters | Time per step matters; per-token irrelevant |
So we have two workloads that want different sharding, different precision, different batching, different memory layout. Yet they share the same weights and must stay synchronized on every step. The three deployment patterns below are different answers to the question: given this, how do we wire the cluster?
Pattern 1 · Colocated (the simplest thing that works)
Both rollout engine and trainer live on the same GPUs, time-sharing them. On step n:
[t] load rollout weights ────────────────────┐
sample K rollouts │
unload rollout (free KV cache) │
load trainer state (Adam, grads) ─┤ one GPU's
forward + backward + step │ timeline
unload trainer │
sync weights to rollout copy ────────────┘
This is what TRL does by default, and it's what the toy in RL/framework/01_single_turn.py does — rollout and trainer hold pointers to the same nn.Module. Weight sync is then a no-op (or a single in-process pointer swap from a swap-buffered copy).
- Pro: Simplest possible system. No NCCL groups, no resharding, no cross-host plumbing. Easiest to debug.
- Con: GPU sits idle for whichever phase isn't running. If rollout takes 60% of step time and training 40%, you're at 100% utilization neither phase but never both at once — and your effective FLOPs are well under the chip's peak.
- Con: The two workloads want different memory layouts. Loading one means evicting the other. You pay this eviction cost twice per step.
Colocated is the right answer for: small models (≤ 7B), research prototypes, single-node setups. It's how every TRL-style demo is wired.
Pattern 2 · Disaggregated (parallel pipelines)
Split the cluster into a rollout pool and a trainer pool running concurrently. The trainer steps while the next batch of rollouts is being sampled, and vice versa.
rollout GPUs: sample step n+1 ────────────────────► send trajectories
▲ │
│ broadcast weights ▼
trainer GPUs: train step n ──────────────────────────────────► step n+1
This is the pattern in OpenRLHF, verl (default), NeMo-RL. Concretely:
- Rollout pool runs vLLM/SGLang (lesson 20) at its preferred tensor-parallel layout for inference.
- Trainer pool runs FSDP/Megatron at its preferred sharding for training.
- A dedicated NCCL communicator group joins the two pools just for weight broadcasts.
- An orchestrator (Ray actor, or a custom rendezvous) hands trajectories from rollout to trainer and broadcasts new weights from trainer to rollout.
The reshard problem
Trainer's parameter shards are partitioned by FSDP (one parameter, many GPUs each holding a strip). Rollout's parameter shards are partitioned by tensor-parallel rules (one layer's matmul split across GPUs). The same parameter has different shapes on the two pools.
"Weight sync" must therefore be: gather full parameter on trainer rank 0 (or all ranks), reshape into rollout's TP layout, broadcast to rollout ranks. For a 70B model in bf16 this is 140 GB of NCCL traffic per sync. Production frameworks (verl's HybridEngine, OpenRLHF's merger) heavily optimize this — they use one-sided RDMA, overlap with the next rollout's prefill, and aggressively zero-copy.
gen_id tag (the weight generation it was sampled under) so the trainer can compute old_logp correctly. If you forget, importance-sampling ratios are computed against the wrong reference and PPO silently breaks (lesson 06 has the dashboard symptoms).
Pattern 3 · Fully async (the SLIME/verl-async frontier)
In disaggregated mode the two pools still march in lockstep — the trainer waits for a rollout batch, the rollout waits for fresh weights. Async breaks that constraint. The rollout pool keeps generating with whatever weights it last received; the trainer keeps stepping on whatever trajectories it last got; weight syncs happen on a separate cadence.
Result: both pools at near-100% utilization, with the cost of off-policy data — trajectories sampled from πθ−k are trained against πθ where k can be several steps.
The fix is a versioned-weights protocol plus importance correction:
- Trainer tags each weight broadcast with a monotonic
gen_id. - Rollout tags each emitted token with the
gen_idactive at sampling time. - Trainer sees a batch of tokens, each annotated with its sampling-time
gen_id. It can recomputeold_logpfrom the snapshot if it cached it, or use the PPO ratio as a soft correction up to a clip.
This is what SLIME, verl-async, and similar systems are. The throughput gain over disaggregated is large (often 2–3× steps/hour at the same hardware) but the implementation surface is the largest in this lesson — you are running a partially-on-policy RL system whose policy version is a moving target.
Interactive · pick a pattern, watch the throughput
Below: a simple simulator. You set the rollout step time (R) and the trainer step time (T) and choose a pattern. The simulator draws the timeline for 5 training steps and reports steady-state steps/hour.
Orchestration · who decides what runs when
Beneath all three patterns sits an orchestrator. Two common choices:
- SPMD with collective primitives. One Python process per GPU, all ranks running the same code, synchronizing on NCCL collectives (all-gather, broadcast, all-reduce). Megatron-style. Tight, fast, but rigid — you can't easily mix two different kinds of workload on the same ranks.
- Ray-actor topology. Each pool (rollout, trainer, env) is a Ray placement group of GPU actors. A driver process sends RPCs; actors run independently. More flexible (easy to add a new role like an LLM judge), at the cost of higher per-call overhead. OpenRLHF and verl run this way.
Sharding inside each pool
Within the trainer pool, weights and activations get partitioned. The three workhorse strategies:
- Data parallel (DP). Each GPU holds a full copy; each runs a different micro-batch; gradients all-reduced at backward. Memory cost is unchanged — limits model size to one-GPU's worth.
- FSDP / ZeRO-3. Shard parameters, gradients, and optimizer state across the DP group. Each step does an all-gather of weights for forward, an all-gather for backward, and a reduce-scatter of grads. Memory drops by ~1/N at the cost of two all-gathers per layer per step. Default for 7B–70B training.
- Tensor parallel (TP). Split a single matmul across GPUs (e.g., column-parallel attention QKV). Adds collectives inside the layer. Used for very large models or when sequence parallelism is also needed.
- Pipeline parallel (PP). Different layers on different GPUs; mini-batches flow through the pipeline. Mostly used for >100B models or context lengths that don't fit any single GPU's activation memory.
Production frontier-model training combines FSDP × TP × PP × sequence-parallel — "4D parallelism" — and tuning the split is its own discipline. For RL post-training at 7–70B, FSDP with sequence parallel is almost always enough.
Picking a pattern · decision tree
- Model fits on one node, low-effort prototype? Colocated. Done.
- Model needs multiple nodes for training, rollout fits on a subset of the cluster? Disaggregated. Pay the resharding cost once; gain ~2× throughput.
- You're scaling and rollout dominates wall time (web/SWE-bench)? Async, with versioned weights and a clear KL-divergence budget. Gain another 1.5–3×; accept the complexity.
- You're doing test-time-only inference (e.g., evaluation): rollout-only deployment, no trainer. Use the inference engine alone.