System topology — how a real RL cluster is wired

Rollout and training are two workloads with opposite hardware profiles. Rollout is memory-bandwidth-bound and cares about KV cache; training is compute-bound and cares about overlap. Where you put them on the cluster — colocated, disaggregated, or fully async — is the single biggest systems decision in RL post-training.

Where this lesson sits

Lessons 15–18 covered where the reward signal comes from. Starting here, Part III shifts to the systems side. Given a fixed reward signal, what does the cluster look like that runs the rollout / train loop at scale? This lesson is the highest-level decision (cluster topology); lessons 20–21 zoom inside the rollout engine (KV cache substrate + scheduler tricks); lesson 22 does the memory math that decides whether anything fits at all.

The fundamental asymmetry

From lesson 02, rollout is autoregressive generation. From lesson 05, training is a single forward + backward + optimizer step on a batched sequence. They look superficially similar — both are "run a transformer" — but their resource profiles are different in every dimension worth measuring:

Dimension	Rollout (generation)	Training (forward + backward)
Compute pattern	1 new token at a time × many sequences in flight	Full sequence in one pass
Bound by	Memory bandwidth (KV cache reads)	FLOPs (matmul throughput)
Sweet-spot precision	fp16/bf16, sometimes int8 weights	bf16 with fp32 master + Adam state
Memory cost	Weights (1×) + KV cache (huge, grows with seq len)	Weights + activations + grads + Adam (≈ 18× param bytes, see lesson 22)
Parallelism that fits	Tensor parallel, pipeline parallel	FSDP/ZeRO-3, sequence parallel
Optimal batch size	Tens of concurrent sequences	Millions of tokens per step
Latency target	Tokens/sec per stream matters	Time per step matters; per-token irrelevant

So we have two workloads that want different sharding, different precision, different batching, different memory layout. Yet they share the same weights and must stay synchronized on every step. The three deployment patterns below are different answers to the question: given this, how do we wire the cluster?

Pattern 1 · Colocated (the simplest thing that works)

Both rollout engine and trainer live on the same GPUs, time-sharing them. On step n:

  [t]   load rollout weights ────────────────────┐
        sample K rollouts                        │
        unload rollout (free KV cache)           │
        load trainer state (Adam, grads)        ─┤  one GPU's
        forward + backward + step                │  timeline
        unload trainer                           │
        sync weights to rollout copy ────────────┘

This is what TRL does by default, and it's what the toy in RL/framework/01_single_turn.py does — rollout and trainer hold pointers to the same nn.Module. Weight sync is then a no-op (or a single in-process pointer swap from a swap-buffered copy).

Pro: Simplest possible system. No NCCL groups, no resharding, no cross-host plumbing. Easiest to debug.
Con: GPU sits idle for whichever phase isn't running. If rollout takes 60% of step time and training 40%, you're at 100% utilization neither phase but never both at once — and your effective FLOPs are well under the chip's peak.
Con: The two workloads want different memory layouts. Loading one means evicting the other. You pay this eviction cost twice per step.

Colocated is the right answer for: small models (≤ 7B), research prototypes, single-node setups. It's how every TRL-style demo is wired.

Pattern 2 · Disaggregated (parallel pipelines)

Split the cluster into a rollout pool and a trainer pool running concurrently. The trainer steps while the next batch of rollouts is being sampled, and vice versa.

  rollout GPUs:  sample step n+1 ────────────────────►  send trajectories
                                          ▲                       │
                                          │   broadcast weights   ▼
  trainer GPUs:  train step n ──────────────────────────────────►  step n+1

This is the pattern in OpenRLHF, verl (default), NeMo-RL. Concretely:

Rollout pool runs vLLM/SGLang (lesson 20) at its preferred tensor-parallel layout for inference.
Trainer pool runs FSDP/Megatron at its preferred sharding for training.
A dedicated NCCL communicator group joins the two pools just for weight broadcasts.
An orchestrator (Ray actor, or a custom rendezvous) hands trajectories from rollout to trainer and broadcasts new weights from trainer to rollout.

The reshard problem

Trainer's parameter shards are partitioned by FSDP (one parameter, many GPUs each holding a strip). Rollout's parameter shards are partitioned by tensor-parallel rules (one layer's matmul split across GPUs). The same parameter has different shapes on the two pools.

"Weight sync" must therefore be: gather full parameter on trainer rank 0 (or all ranks), reshape into rollout's TP layout, broadcast to rollout ranks. For a 70B model in bf16 this is 140 GB of NCCL traffic per sync. Production frameworks (verl's HybridEngine, OpenRLHF's merger) heavily optimize this — they use one-sided RDMA, overlap with the next rollout's prefill, and aggressively zero-copy.

Critical bookkeeping

In disaggregated mode, rollout pool's weights at step n are the trainer's weights as of step n−1. Every trajectory must carry a gen_id tag (the weight generation it was sampled under) so the trainer can compute old_logp correctly. If you forget, importance-sampling ratios are computed against the wrong reference and PPO silently breaks (lesson 06 has the dashboard symptoms).

Pattern 3 · Fully async (the SLIME/verl-async frontier)

In disaggregated mode the two pools still march in lockstep — the trainer waits for a rollout batch, the rollout waits for fresh weights. Async breaks that constraint. The rollout pool keeps generating with whatever weights it last received; the trainer keeps stepping on whatever trajectories it last got; weight syncs happen on a separate cadence.

Result: both pools at near-100% utilization, with the cost of off-policy data — trajectories sampled from π_θ−k are trained against π_θ where k can be several steps.

The fix is a versioned-weights protocol plus importance correction:

Trainer tags each weight broadcast with a monotonic gen_id.
Rollout tags each emitted token with the gen_id active at sampling time.
Trainer sees a batch of tokens, each annotated with its sampling-time gen_id. It can recompute old_logp from the snapshot if it cached it, or use the PPO ratio as a soft correction up to a clip.

This is what SLIME, verl-async, and similar systems are. The throughput gain over disaggregated is large (often 2–3× steps/hour at the same hardware) but the implementation surface is the largest in this lesson — you are running a partially-on-policy RL system whose policy version is a moving target.

Interactive · pick a pattern, watch the throughput

Below: a simple simulator. You set the rollout step time (R) and the trainer step time (T) and choose a pattern. The simulator draws the timeline for 5 training steps and reports steady-state steps/hour.

Orchestration · who decides what runs when

Beneath all three patterns sits an orchestrator. Two common choices:

SPMD with collective primitives. One Python process per GPU, all ranks running the same code, synchronizing on NCCL collectives (all-gather, broadcast, all-reduce). Megatron-style. Tight, fast, but rigid — you can't easily mix two different kinds of workload on the same ranks.
Ray-actor topology. Each pool (rollout, trainer, env) is a Ray placement group of GPU actors. A driver process sends RPCs; actors run independently. More flexible (easy to add a new role like an LLM judge), at the cost of higher per-call overhead. OpenRLHF and verl run this way.

Sharding inside each pool

Within the trainer pool, weights and activations get partitioned. The three workhorse strategies:

Data parallel (DP). Each GPU holds a full copy; each runs a different micro-batch; gradients all-reduced at backward. Memory cost is unchanged — limits model size to one-GPU's worth.
FSDP / ZeRO-3. Shard parameters, gradients, and optimizer state across the DP group. Each step does an all-gather of weights for forward, an all-gather for backward, and a reduce-scatter of grads. Memory drops by ~1/N at the cost of two all-gathers per layer per step. Default for 7B–70B training.
Tensor parallel (TP). Split a single matmul across GPUs (e.g., column-parallel attention QKV). Adds collectives inside the layer. Used for very large models or when sequence parallelism is also needed.
Pipeline parallel (PP). Different layers on different GPUs; mini-batches flow through the pipeline. Mostly used for >100B models or context lengths that don't fit any single GPU's activation memory.

Production frontier-model training combines FSDP × TP × PP × sequence-parallel — "4D parallelism" — and tuning the split is its own discipline. For RL post-training at 7–70B, FSDP with sequence parallel is almost always enough.

Picking a pattern · decision tree

Model fits on one node, low-effort prototype? Colocated. Done.
Model needs multiple nodes for training, rollout fits on a subset of the cluster? Disaggregated. Pay the resharding cost once; gain ~2× throughput.
You're scaling and rollout dominates wall time (web/SWE-bench)? Async, with versioned weights and a clear KL-divergence budget. Gain another 1.5–3×; accept the complexity.
You're doing test-time-only inference (e.g., evaluation): rollout-only deployment, no trainer. Use the inference engine alone.

When async bites

Async pipelines tend to look great in benchmarks (high throughput!) and worse in convergence (off-policy data degrades the gradient estimator). The slowdown often shows up as "steps-to-target" rather than "steps/hour". Always measure final-model quality at fixed compute, not throughput in isolation.

Takeaway

Rollout and training have opposite hardware profiles. Colocated trades throughput for simplicity; disaggregated overlaps them at the cost of weight-sync engineering; async overlaps everything at the cost of off-policy bookkeeping. The pattern you should pick is the simplest one that fits your wall-clock budget.