rl_lessons / 19 · topology lesson 5 / 9 · part III

System topology — how a real RL cluster is wired

Rollout and training are two workloads with opposite hardware profiles. Rollout is memory-bandwidth-bound and cares about KV cache; training is compute-bound and cares about overlap. Where you put them on the cluster — colocated, disaggregated, or fully async — is the single biggest systems decision in RL post-training.

Where this lesson sits
Lessons 15–18 covered where the reward signal comes from. Starting here, Part III shifts to the systems side. Given a fixed reward signal, what does the cluster look like that runs the rollout / train loop at scale? This lesson is the highest-level decision (cluster topology); lessons 20–21 zoom inside the rollout engine (KV cache substrate + scheduler tricks); lesson 22 does the memory math that decides whether anything fits at all.

The fundamental asymmetry

From lesson 02, rollout is autoregressive generation. From lesson 05, training is a single forward + backward + optimizer step on a batched sequence. They look superficially similar — both are "run a transformer" — but their resource profiles are different in every dimension worth measuring:

DimensionRollout (generation)Training (forward + backward)
Compute pattern1 new token at a time × many sequences in flightFull sequence in one pass
Bound byMemory bandwidth (KV cache reads)FLOPs (matmul throughput)
Sweet-spot precisionfp16/bf16, sometimes int8 weightsbf16 with fp32 master + Adam state
Memory costWeights (1×) + KV cache (huge, grows with seq len)Weights + activations + grads + Adam (≈ 18× param bytes, see lesson 22)
Parallelism that fitsTensor parallel, pipeline parallelFSDP/ZeRO-3, sequence parallel
Optimal batch sizeTens of concurrent sequencesMillions of tokens per step
Latency targetTokens/sec per stream mattersTime per step matters; per-token irrelevant

So we have two workloads that want different sharding, different precision, different batching, different memory layout. Yet they share the same weights and must stay synchronized on every step. The three deployment patterns below are different answers to the question: given this, how do we wire the cluster?

Pattern 1 · Colocated (the simplest thing that works)

Both rollout engine and trainer live on the same GPUs, time-sharing them. On step n:

  [t]   load rollout weights ────────────────────┐
        sample K rollouts                        │
        unload rollout (free KV cache)           │
        load trainer state (Adam, grads)        ─┤  one GPU's
        forward + backward + step                │  timeline
        unload trainer                           │
        sync weights to rollout copy ────────────┘

This is what TRL does by default, and it's what the toy in RL/framework/01_single_turn.py does — rollout and trainer hold pointers to the same nn.Module. Weight sync is then a no-op (or a single in-process pointer swap from a swap-buffered copy).

Colocated is the right answer for: small models (≤ 7B), research prototypes, single-node setups. It's how every TRL-style demo is wired.

Pattern 2 · Disaggregated (parallel pipelines)

Split the cluster into a rollout pool and a trainer pool running concurrently. The trainer steps while the next batch of rollouts is being sampled, and vice versa.

  rollout GPUs:  sample step n+1 ────────────────────►  send trajectories
                                          ▲                       │
                                          │   broadcast weights   ▼
  trainer GPUs:  train step n ──────────────────────────────────►  step n+1

This is the pattern in OpenRLHF, verl (default), NeMo-RL. Concretely:

The reshard problem

Trainer's parameter shards are partitioned by FSDP (one parameter, many GPUs each holding a strip). Rollout's parameter shards are partitioned by tensor-parallel rules (one layer's matmul split across GPUs). The same parameter has different shapes on the two pools.

"Weight sync" must therefore be: gather full parameter on trainer rank 0 (or all ranks), reshape into rollout's TP layout, broadcast to rollout ranks. For a 70B model in bf16 this is 140 GB of NCCL traffic per sync. Production frameworks (verl's HybridEngine, OpenRLHF's merger) heavily optimize this — they use one-sided RDMA, overlap with the next rollout's prefill, and aggressively zero-copy.

Critical bookkeeping
In disaggregated mode, rollout pool's weights at step n are the trainer's weights as of step n−1. Every trajectory must carry a gen_id tag (the weight generation it was sampled under) so the trainer can compute old_logp correctly. If you forget, importance-sampling ratios are computed against the wrong reference and PPO silently breaks (lesson 06 has the dashboard symptoms).

Pattern 3 · Fully async (the SLIME/verl-async frontier)

In disaggregated mode the two pools still march in lockstep — the trainer waits for a rollout batch, the rollout waits for fresh weights. Async breaks that constraint. The rollout pool keeps generating with whatever weights it last received; the trainer keeps stepping on whatever trajectories it last got; weight syncs happen on a separate cadence.

Result: both pools at near-100% utilization, with the cost of off-policy data — trajectories sampled from πθ−k are trained against πθ where k can be several steps.

The fix is a versioned-weights protocol plus importance correction:

  1. Trainer tags each weight broadcast with a monotonic gen_id.
  2. Rollout tags each emitted token with the gen_id active at sampling time.
  3. Trainer sees a batch of tokens, each annotated with its sampling-time gen_id. It can recompute old_logp from the snapshot if it cached it, or use the PPO ratio as a soft correction up to a clip.

This is what SLIME, verl-async, and similar systems are. The throughput gain over disaggregated is large (often 2–3× steps/hour at the same hardware) but the implementation surface is the largest in this lesson — you are running a partially-on-policy RL system whose policy version is a moving target.

Interactive · pick a pattern, watch the throughput

Below: a simple simulator. You set the rollout step time (R) and the trainer step time (T) and choose a pattern. The simulator draws the timeline for 5 training steps and reports steady-state steps/hour.

Throughput vs. pattern
In colocated, rollout and training are sequential on the same GPUs (cost R + T per step). In disaggregated, they overlap (cost ≈ max(R, T) plus a sync). In async, both run continuously with a weight broadcast on a fixed cadence (cost ≈ max(R, T), minus stall).
Colocated steps/h
Disagg. steps/h
Async steps/h
Async / colocated

Orchestration · who decides what runs when

Beneath all three patterns sits an orchestrator. Two common choices:

Sharding inside each pool

Within the trainer pool, weights and activations get partitioned. The three workhorse strategies:

Production frontier-model training combines FSDP × TP × PP × sequence-parallel — "4D parallelism" — and tuning the split is its own discipline. For RL post-training at 7–70B, FSDP with sequence parallel is almost always enough.

Picking a pattern · decision tree

  1. Model fits on one node, low-effort prototype? Colocated. Done.
  2. Model needs multiple nodes for training, rollout fits on a subset of the cluster? Disaggregated. Pay the resharding cost once; gain ~2× throughput.
  3. You're scaling and rollout dominates wall time (web/SWE-bench)? Async, with versioned weights and a clear KL-divergence budget. Gain another 1.5–3×; accept the complexity.
  4. You're doing test-time-only inference (e.g., evaluation): rollout-only deployment, no trainer. Use the inference engine alone.
When async bites
Async pipelines tend to look great in benchmarks (high throughput!) and worse in convergence (off-policy data degrades the gradient estimator). The slowdown often shows up as "steps-to-target" rather than "steps/hour". Always measure final-model quality at fixed compute, not throughput in isolation.
Takeaway
Rollout and training have opposite hardware profiles. Colocated trades throughput for simplicity; disaggregated overlaps them at the cost of weight-sync engineering; async overlaps everything at the cost of off-policy bookkeeping. The pattern you should pick is the simplest one that fits your wall-clock budget.