Designing an RL post-training system
Every system so far was one shape. Serving (lessons 04–05) was an inference system: KV-bound, decode bandwidth-limited, batched. Pretraining (lesson 07) was a training system: 16N memory, FSDP/TP/PP, scored by MFU. RL post-training is the first system in this track that is both at once, wired in a loop — it generates with an inference engine, scores the output, and takes a gradient step on it, forever. So it inherits every constraint from 04–05 (the generation side) and every constraint from 07 (the learner side). The new, hard part — the entire reason this lesson exists — is coordinating the two.
This loop has four roles. We tag them with the same chips the RL track uses, so the vocabulary lines up:
- ACTOR / ROLLOUT: the generation engine (vLLM/SGLang). This is pure inference: autoregressive decode, bandwidth-bound, KV-cache-limited (lesson 04). It produces trajectories.
- LEARNER / TRAINER: the gradient step (FSDP/Megatron). This is pure training: the 16N memory bill, scored by MFU (lesson 07). It consumes trajectories.
- REWARD / VERIFIER: scores each rollout, using either a reward model (another forward pass, more weights in HBM) or a code/math checker (cheap CPU, runs a unit test). See RL 03.
- REFERENCE: a frozen copy of the policy used for the KL penalty, which means extra read-only weights in memory and one extra forward pass over each scored trajectory. Also RL 03.
1 · The loop, and the memory bill it runs up
One iteration: the actor generates a batch of completions, the reward scores them, the reference supplies the KL baseline, and the learner takes one gradient step. Then the new weights must reach the actor and it all repeats. (Cluster wiring: RL 19.)
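The order of operations in one iteration can be sketched as follows. This is a minimal illustration, not any framework's API: `Trajectory`, `rl_iteration`, and every callable passed in are hypothetical stand-ins for the four roles plus the weight-sync step, and the toy lambdas exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    completion: str
    reward: float = 0.0
    ref_logprob: float = 0.0


def rl_iteration(
    prompts: List[str],
    generate: Callable[[str], str],                 # ACTOR
    score: Callable[[str, str], float],             # REWARD / VERIFIER
    ref_logprob: Callable[[str, str], float],       # REFERENCE
    train_step: Callable[[List[Trajectory]], int],  # LEARNER, returns new weight version
    sync_weights: Callable[[int], None],            # pushes new weights to the actor
) -> int:
    """Run one synchronous iteration and return the new weight version."""
    trajectories = [Trajectory(p, generate(p)) for p in prompts]
    for t in trajectories:
        t.reward = score(t.prompt, t.completion)
        t.ref_logprob = ref_logprob(t.prompt, t.completion)
    version = train_step(trajectories)
    sync_weights(version)  # without this, the next rollout samples a stale policy
    return version


# Toy stand-ins so the sketch runs end to end; real systems plug engines in here.
actor_version = {"v": 0}
version = rl_iteration(
    prompts=["2+2="],
    generate=lambda p: "4",
    score=lambda p, c: 1.0 if c == "4" else 0.0,
    ref_logprob=lambda p, c: -0.1,
    train_step=lambda trajs: actor_version["v"] + 1,
    sync_weights=lambda v: actor_version.update(v=v),
)
print(version, actor_version)
```

Everything that follows in this lesson is about where each of these callables runs and how long each one takes.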
Now total the memory. At peak you may be holding, on the same cluster, all four roles' weights plus the learner's optimizer state plus the actor's KV cache. For a 70B policy in bf16, using the lesson-02/07 rules:
| Tenant | Role | Bytes (70B) | Rule |
|---|---|---|---|
| Learner train state | LEARNER | 1,120 GB | 16N (lesson 07) |
| Reference weights | REFERENCE | 140 GB | 2N, read-only |
| Reward model (if a model) | REWARD | ~140 GB | 2N (a 70B RM) |
| Actor inference weights | ACTOR | 140 GB | 2N, separate copy |
| Actor KV cache | ACTOR | tens to hundreds of GB | grows with batch·context (lesson 04) |
That sums to about 1.54 TB (1,540 GB) before KV, far more than one 80 GB H100 and far more than one 8-GPU node (640 GB). RL post-training is memory-hostile in a way neither pure serving nor pure pretraining is, because it stacks both bills on top of each other.
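The table's arithmetic can be reproduced in a few lines. This is a minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes), a reward model the same size as the policy, and the bytes-per-parameter rules quoted in the table. The names `BYTES_PER_PARAM` and `memory_bill_gb` are illustrative.

```python
import math

# Bytes per parameter for each tenant, taken from the table above.
BYTES_PER_PARAM = {
    "learner_train_state": 16,  # 16N rule (lesson 07)
    "reference_weights": 2,     # bf16, read-only
    "reward_model": 2,          # assumes a reward model as large as the policy
    "actor_weights": 2,         # separate inference copy
}


def memory_bill_gb(n_params: float) -> dict:
    """Per-tenant memory in decimal GB (1 GB = 1e9 bytes), KV cache excluded."""
    bill = {name: b * n_params / 1e9 for name, b in BYTES_PER_PARAM.items()}
    bill["total_before_kv"] = sum(bill.values())
    return bill


bill = memory_bill_gb(70e9)
for name, gb in bill.items():
    print(f"{name:22s} {gb:8.0f} GB")

# Raw capacity only: ignores KV cache, activations, and fragmentation.
gpus_needed = math.ceil(bill["total_before_kv"] / 80)
print(f"minimum 80 GB GPUs to hold the weights and train state: {gpus_needed}")
```

Under these assumptions the total is 1,540 GB, which needs at least 20 GPUs of 80 GB each before a single KV block is allocated.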
2 · Placement — colocated vs disaggregated
Where do the actor and learner physically live? Two answers, and the trade is the central topology decision (RL 19).
| | Colocated | Disaggregated |
|---|---|---|
| Layout | Actor and learner time-share the same GPUs: generate, then swap to train, then swap back. | Separate actor pool and learner pool, sized independently. |
| Overlap | None: the two phases are serial. While the pool generates, no training happens; while it trains, no generation happens. | Yes: the actor generates the next batch while the learner steps on the last one. |
| Idle GPUs | No idle pool (one pool does both), but idle time within each step. | A pool sits idle when its phase isn't the bottleneck (§4). |
| Cost paid | Context switch: tear down KV cache, swap in optimizer state, reload weights — twice per step. | Ship trajectories actor→learner and weights learner→actor across the network (§3). |
| Best for | Small models (≤7B), single node, prototypes. | Multi-node, large models, production scale. |
Colocated is simplest and wastes no GPUs to an idle pool, but it can never run the two phases at once, and the swap (evict KV, load the 16N training state) is real overhead paid every iteration. Disaggregated buys concurrency and independent sizing at the price of inter-pool data movement and the risk that one pool idles. The choice hinges on §4.
3 · Weight sync — the wire that closes the loop
The actor runs a separate, inference-optimized copy of the policy weights (its own TP layout, maybe quantized). After each learner step those weights are stale — sample from them and you are learning from last step's policy. So every iteration the updated policy must be broadcast to every actor engine. (Mechanism and the bugs it causes: RL 06.)
The cost is set by lesson 02's weight rule: the broadcast moves 2N bytes. For a 70B model that is 2 bytes × 70B parameters = 140 GB shipped every single gradient step.
Over intra-node NVLink (~900 GB/s) that 140 GB is ~0.16 s; over inter-node InfiniBand (~50 GB/s, the 18× gap from lesson 02) the same broadcast is ~2.8 s — which can rival a whole training step. Methods, cheapest first:
- Shared-memory / pointer swap when colocated — actor and learner are the same process on the same GPUs, so sync is nearly free (RL 06).
- NCCL broadcast when disaggregated — gather the FSDP-sharded weights on the learner, reshard into the actor's TP layout, broadcast over a dedicated communicator.
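The broadcast times quoted above follow from dividing the payload by the link bandwidth. The sketch below is a bandwidth-only lower bound under that assumption: it ignores latency, the gather and reshard steps, and the collective algorithm's own overhead. The helper name `broadcast_seconds` is illustrative, and the bandwidth figures are the ones used in the text.

```python
def broadcast_seconds(n_params: float, link_gb_per_s: float, bytes_per_param: int = 2) -> float:
    """Lower-bound time to move one full weight copy over a link of the given bandwidth."""
    payload_gb = n_params * bytes_per_param / 1e9  # 2N bytes in bf16
    return payload_gb / link_gb_per_s


N = 70e9
nvlink = broadcast_seconds(N, 900)  # ~900 GB/s intra-node NVLink
ib = broadcast_seconds(N, 50)       # ~50 GB/s inter-node InfiniBand
print(f"NVLink: {nvlink:.2f} s, InfiniBand: {ib:.2f} s, ratio: {ib / nvlink:.0f}x")
```

Because the payload is fixed at 2N, the only levers are the link (keep sync intra-node where possible), the precision of what is shipped, and how often the sync happens.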
4 · The bottleneck diagnosis — rollout-bound or train-bound?
This is the central design skill of the lesson. The loop runs at the speed of its slower half. So the first question on any RL system is: is wall-clock dominated by generation or by the gradient step?
Rollout = generation = decode, which is bandwidth-bound (lesson 04). Its wall-clock is set by how many output tokens you must emit, one step at a time. Train = one forward+backward+optimizer step (lesson 07), set by the batch of tokens you process in parallel. The asymmetry: generation is sequential in the output length; training sees the whole sequence at once.
The method is not a formula, it is a measurement: time the rollout phase and the train phase, take the split, and add GPUs to whichever dominates. If rollout is 80% of the step, doubling learner GPUs barely moves wall-clock; you want more actors.
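The 80% example can be worked through as an Amdahl-style calculation. This is a rough sketch, assuming the two phases run serially and that each phase speeds up linearly with its GPU count, which is optimistic for both. The function name `step_time_after_scaling` is hypothetical.

```python
def step_time_after_scaling(
    rollout_frac: float, actor_scale: float = 1.0, learner_scale: float = 1.0
) -> float:
    """Relative step time of a serial loop after scaling each pool.

    rollout_frac is the measured share of wall-clock spent generating.
    A return value of 1.0 means no change from the measured baseline.
    """
    train_frac = 1.0 - rollout_frac
    return rollout_frac / actor_scale + train_frac / learner_scale


measured_rollout_frac = 0.8  # the 80% example from the text

more_learners = step_time_after_scaling(measured_rollout_frac, learner_scale=2.0)
more_actors = step_time_after_scaling(measured_rollout_frac, actor_scale=2.0)
print(f"2x learners: step time x{more_learners:.2f}")
print(f"2x actors:   step time x{more_actors:.2f}")
```

With these inputs, doubling the learner pool trims the step by about 10%, while doubling the actor pool trims it by about 40%.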
5 · Async / off-policy — trade staleness for overlap
Synchronous RL is wasteful by construction: the actor idles while the learner steps, and the learner idles while the actor generates (in disaggregated mode; in colocated they alternate on one pool). Async RL breaks the lockstep — the actor keeps generating with slightly stale weights (the policy from k steps ago) while the learner steps on the trajectories it already has. Now both halves run continuously and overlap, which can lift throughput 2–3× (RL 19).
The cost is statistical, not computational. Trajectories sampled under π_{θ−k} (the policy from k steps ago) but trained against π_θ are off-policy: the gradient estimator is biased, importance-sampling ratios drift, and learning can destabilize (the ratio-explosion failure mode of RL 06). The defenses are importance-sampling correction and a bounded staleness cap.
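One way to combine the two defenses is to weight each sample by a truncated importance ratio and drop anything older than the cap. The text does not say which correction is meant, so truncated importance sampling is an assumption here, as are the function name `off_policy_weights` and the default values for `max_staleness` and `clip`.

```python
import math
from typing import List, Sequence


def off_policy_weights(
    logp_current: Sequence[float],   # log-prob under the policy being trained
    logp_behavior: Sequence[float],  # log-prob under the stale policy that sampled
    staleness: Sequence[int],        # how many learner steps old each sample is
    max_staleness: int = 2,          # illustrative cap, not a recommended value
    clip: float = 2.0,               # illustrative truncation threshold
) -> List[float]:
    """Per-sample loss weights: truncated importance ratio, zero beyond the cap."""
    weights = []
    for lp_new, lp_old, k in zip(logp_current, logp_behavior, staleness):
        if k > max_staleness:
            weights.append(0.0)  # too stale: drop the sample entirely
            continue
        ratio = math.exp(lp_new - lp_old)  # current probability / behavior probability
        weights.append(min(ratio, clip))   # truncate so one sample cannot dominate
    return weights


w = off_policy_weights(
    logp_current=[-1.0, -0.5, -2.0],
    logp_behavior=[-1.0, -2.0, -1.0],
    staleness=[0, 1, 5],
)
print(w)
```

The first sample is on-policy (ratio 1), the second has a ratio of about 4.5 that is truncated to the clip value, and the third exceeds the staleness cap and contributes nothing.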
6 · The capacity split — how many actors vs learners?
Given a fixed cluster, what fraction of GPUs goes to the actor pool versus the learner pool? This is lesson 08's producer-must-keep-up-with-consumer framing, applied to a loop: the rollout pool is the producer of trajectories, the learner is the consumer. Size the pools so neither waits.
If the actors produce R tokens/s and the learner consumes T tokens/s, the loop runs at min(R, T) and the other side idles by 1 − min(R,T)/max(R,T). The balanced split puts R ≈ T, so no GPU class is starved. Because reasoning rollouts are slow producers (long sequential decode), balancing usually lands most of the cluster on the actor side — the same conclusion §4 reached, now as a sizing rule.
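The sizing rule can be sketched directly from these definitions. Assuming each pool's throughput scales linearly with its GPU count, R = f·G·r and T = (1−f)·G·t for actor fraction f, cluster size G, and per-GPU rates r and t, so R = T at f = t/(r+t). The function names and the throughput numbers below are illustrative, not measurements.

```python
from typing import Tuple


def loop_stats(
    total_gpus: int,
    actor_frac: float,
    actor_tok_s_per_gpu: float,
    learner_tok_s_per_gpu: float,
) -> Tuple[float, float, str]:
    """Return (loop rate in tokens/s, idle fraction of the faster side, bottleneck)."""
    R = total_gpus * actor_frac * actor_tok_s_per_gpu            # produced
    T = total_gpus * (1.0 - actor_frac) * learner_tok_s_per_gpu  # consumed
    loop_rate = min(R, T)
    idle = 1.0 - min(R, T) / max(R, T)
    bottleneck = "rollout" if R < T else "train"
    return loop_rate, idle, bottleneck


def balanced_actor_frac(actor_tok_s_per_gpu: float, learner_tok_s_per_gpu: float) -> float:
    """Actor fraction f that makes R equal T under linear scaling."""
    return learner_tok_s_per_gpu / (actor_tok_s_per_gpu + learner_tok_s_per_gpu)


# Illustrative rates: decode is assumed 4x slower per GPU than training.
r, t, gpus = 1000.0, 4000.0, 64
rate, idle, side = loop_stats(gpus, 0.5, r, t)
print(f"50/50 split: {rate:.0f} tok/s, {side}-bound, other side idle {idle:.0%}")

f_star = balanced_actor_frac(r, t)
rate, idle, side = loop_stats(gpus, f_star, r, t)
print(f"balanced split f={f_star:.2f}: {rate:.0f} tok/s, idle {idle:.0%}")
```

With these made-up rates an even split leaves the learner pool idle three quarters of the time, and the balanced split puts 80% of the GPUs on the actor side.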
What carries forward
- RL post-training is an inference system and a training system in a loop — it inherits the KV/decode constraints of 04–05 and the 16N/MFU constraints of 07. The design work is in the seam between them.
- The memory bill stacks four roles: learner 16N + reference 2N + reward (maybe 2N) + a separate actor copy 2N + KV — over 1.5 TB for a 70B before KV. GRPO's win is deleting the critic, removing a whole model and its 16N.
- Placement: colocated (time-share, simple, no idle pool, no overlap, pay the swap) vs disaggregated (separate pools, concurrent, independently sized, pay data/weight movement and pool idle).
- Weight sync is pure overhead — 2N bytes broadcast every step, ~0.16 s on NVLink but ~2.8 s on IB for a 70B. Free as a pointer swap when colocated.
- Diagnose rollout-bound vs train-bound by measuring the split. Long reasoning outputs push it heavily rollout-bound (70–90%), so most GPUs go to actors.
- Async trades on-policyness for overlap; the knob is max staleness k — more overlap, more off-policy bias. Size pools so rollout throughput ≈ train consumption (producer keeps up with consumer, lesson 08).