Designing an RL post-training system

Every system so far was one shape. Serving (lessons 04–05) was an inference system: KV-bound, decode bandwidth-limited, batched. Pretraining (lesson 07) was a training system: 16N memory, FSDP/TP/PP, scored by MFU. RL post-training is the first system in this track that is both at once, wired in a loop — it generates with an inference engine, scores the output, and takes a gradient step on it, forever. So it inherits every constraint from 04–05 (the generation side) and every constraint from 07 (the learner side). The new, hard part — the entire reason this lesson exists — is coordinating the two.

The framing: two systems you already designed, now in a feedback loop

Nothing in the rollout side is new physics — it is lesson 04's replica, decode-bound. Nothing in the learner side is new physics — it is lesson 07's 16N gradient step. What is new is that they share weights, run on one budget, and each waits on the other. The design questions are therefore all about the seam: where do the two halves live, how do updated weights get from learner to actor, and which half is the bottleneck.

The loop has four roles. We label them with the same tags the RL track uses, so the vocabulary lines up:

ACTOR / ROLLOUT: the generation engine (vLLM/SGLang). This is pure inference: autoregressive decode, bandwidth-bound, KV-cache-limited (lesson 04). It produces trajectories.
LEARNER / TRAINER: the gradient step (FSDP/Megatron). This is pure training: the 16N memory bill, scored by MFU (lesson 07). It consumes trajectories.
REWARD / VERIFIER: scores each rollout, using either a reward model (another forward pass, more weights in HBM) or a code/math checker (cheap, CPU-only, for example running a unit test). See RL 03.
REFERENCE: a frozen copy of the policy used for the KL penalty. It costs extra read-only weights in memory and one forward pass over each scored trajectory to get per-token log-probs. Also RL 03.

1 · The loop, and the memory bill it runs up

One iteration: the actor generates a batch of completions, the reward scores them, the reference supplies the KL baseline, and the learner takes one gradient step. Then the new weights must reach the actor and it all repeats. (Cluster wiring: RL 19.)
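As a reading aid, one iteration can be written as a toy loop. This is a minimal sketch, not any framework's API: `Policy`, `generate`, `score`, `reference_term`, `learner_step`, and `sync_weights` are hypothetical stand-ins, and a version counter plays the part of the weights so the ordering of the four roles and the sync is visible.

```python
import random
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical stand-in for a set of weights; `version` counts gradient steps."""
    version: int = 0


def generate(actor, prompts, rng):
    # ACTOR: one toy "completion" per prompt, tagged with the weights that produced it.
    return [{"prompt": p, "completion": rng.random(), "policy_version": actor.version}
            for p in prompts]


def score(trajectories):
    # REWARD: toy verifier, 1.0 if the fake completion clears a threshold.
    return [1.0 if t["completion"] > 0.5 else 0.0 for t in trajectories]


def reference_term(trajectories):
    # REFERENCE: placeholder for the KL baseline (really a frozen forward pass).
    return [0.0 for _ in trajectories]


def learner_step(learner, trajectories, rewards, kl_terms):
    # LEARNER: placeholder for forward + backward + optimizer step.
    learner.version += 1


def sync_weights(learner, actor):
    # Weight sync: the actor's copy catches up with the learner.
    actor.version = learner.version


def run(num_iterations, seed=0):
    rng = random.Random(seed)
    learner, actor = Policy(), Policy()
    log = []
    for _ in range(num_iterations):
        trajs = generate(actor, ["p0", "p1", "p2", "p3"], rng)
        rewards = score(trajs)
        kl_terms = reference_term(trajs)
        learner_step(learner, trajs, rewards, kl_terms)
        sync_weights(learner, actor)
        # (weights that sampled this batch, learner version, actor version after sync)
        log.append((trajs[0]["policy_version"], learner.version, actor.version))
    return log


history = run(3)
print(history)
```

In this synchronous form every batch is sampled from the weights the learner is about to update, and the actor never falls behind by more than the step in flight.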

Now total the memory. At peak you may be holding, on the same cluster, all four roles' weights plus the learner's optimizer state plus the actor's KV cache. For a 70B policy in bf16, using the lesson-02/07 rules:

Tenant	Role	Bytes (70B)	Rule
Learner train state	LEARNER	1,120 GB	16N (lesson 07)
Reference weights	REFERENCE	140 GB	2N, read-only
Reward model (if a model)	REWARD	~140 GB	2N (a 70B RM)
Actor inference weights	ACTOR	140 GB	2N, separate copy
Actor KV cache	ACTOR	tens–hundreds GB	grows with batch·context (lesson 04)

That sums to about 1.54 TB before KV (1,120 GB + 3 × 140 GB), far more than one 80 GB H100 and more than twice the 640 GB of an 8-GPU H100 node. RL post-training is memory-hostile in a way neither pure serving nor pure pretraining is, because it stacks both bills on top of each other.
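The table's arithmetic can be reproduced in a few lines. This is a sketch under the table's own assumptions (bf16 weights at 2 bytes per parameter, 16N learner state, a reward model the same size as the policy); the function name and the 80 GB divisor are illustrative, and the GPU count is a capacity floor that ignores activations, fragmentation, and KV cache.

```python
import math

GB = 1e9


def memory_bill_gb(n_params, reward_is_model=True, kv_cache_gb=0.0):
    """Per-tenant memory in GB using the 16N / 2N rules from the table."""
    bill = {
        "learner_train_state": 16 * n_params / GB,  # 16N
        "reference_weights": 2 * n_params / GB,     # 2N, read-only
        "actor_weights": 2 * n_params / GB,         # 2N, separate inference copy
    }
    if reward_is_model:
        # Assumption: the reward model is the same size as the policy.
        bill["reward_model"] = 2 * n_params / GB
    bill["actor_kv_cache"] = kv_cache_gb
    return bill


bill = memory_bill_gb(70e9)
total_gb = sum(bill.values())
min_h100s = math.ceil(total_gb / 80)  # capacity floor only

for tenant, gb in bill.items():
    print(f"{tenant:22s} {gb:8.0f} GB")
print(f"{'total (before KV)':22s} {total_gb:8.0f} GB  -> at least {min_h100s} x 80 GB GPUs")
```

Passing `reward_is_model=False` models the code/math-checker case, where the reward role adds no weights to HBM.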

Why GRPO exists: it deletes a whole model from the bill

Classic PPO (RL 10) trains a critic: a second 70B-scale network with its own 16N train state (weights, gradients, and optimizer state), used to estimate the value baseline. That is another ~1.1 TB on top of everything above. GRPO (RL 11) drops the critic entirely and computes the baseline from the group of completions sampled for the same prompt (their mean reward). Removing the critic removes a whole model's worth of weights and its 16N train state, the single biggest memory win available in the algorithm choice. The lesson: in RL, the algorithm is a systems decision.
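The group baseline is small enough to show in full. The sketch below implements only what the paragraph describes, reward minus the group's mean reward; the function name is hypothetical. GRPO as commonly implemented also divides by the group's standard deviation, which is left out here.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Advantage of each completion relative to its own prompt group.

    `rewards` holds the scores of all completions sampled for ONE prompt.
    The baseline is the group mean, so no learned critic is needed.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four completions for one prompt, scored by a pass/fail verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

The baseline costs one mean over numbers the loop already has, which is why it can replace a critic that would otherwise carry its own 16N of train state.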

2 · Placement — colocated vs disaggregated

Where do the actor and learner physically live? Two answers, and the trade is the central topology decision (RL 19).

	Colocated	Disaggregated
Layout	Actor and learner time-share the same GPUs: generate, then swap to train, then swap back.	Separate actor pool and learner pool, sized independently.
Overlap	None. The two phases are serial on the same GPUs: while generating, the learner's train state sits unused; while training, the inference engine is paused.	Yes. The actor generates the next batch while the learner steps on the last one (which makes that batch one step stale, §5).
Idle GPUs	No idle pool (one pool does both), but idle time within each step.	A pool sits idle when its phase isn't the bottleneck (§4).
Cost paid	Context switch: tear down KV cache, swap in optimizer state, reload weights — twice per step.	Ship trajectories actor→learner and weights learner→actor across the network (§3).
Best for	Small models (≤7B), single node, prototypes.	Multi-node, large models, production scale.

Colocated is the simplest option and loses no GPUs to an idle pool, but it can never run the two phases at once, and the swap (evict the KV cache, load the 16N train state) is real overhead paid every iteration. Disaggregated buys concurrency and independent sizing at the price of inter-pool data movement and the risk that one pool idles. The choice hinges on §4.

3 · Weight sync — the wire that closes the loop

The actor runs a separate, inference-optimized copy of the policy weights (its own TP layout, maybe quantized). After each learner step those weights are stale — sample from them and you are learning from last step's policy. So every iteration the updated policy must be broadcast to every actor engine. (Mechanism and the bugs it causes: RL 06.)

The cost is set by lesson 02's weight rule: the broadcast moves 2N bytes. For a 70B that is 140 GB shipped every single gradient step:

sync_bytes = 2N
sync_time ≈ sync_bytes / interconnect_BW

Over intra-node NVLink (~900 GB/s) that 140 GB is ~0.16 s; over inter-node InfiniBand (~50 GB/s, the 18× gap from lesson 02) the same broadcast is ~2.8 s — which can rival a whole training step. Methods, cheapest first:

Shared-memory / pointer swap when colocated — actor and learner are the same process on the same GPUs, so sync is nearly free (RL 06).
NCCL broadcast when disaggregated — gather the FSDP-sharded weights on the learner, reshard into the actor's TP layout, broadcast over a dedicated communicator.

Weight sync is pure overhead, and it scales with model size

Unlike a gradient step (which buys learning) or a rollout (which buys data), the sync buys nothing — it is the tax for having two copies of the weights. It grows linearly with N and inversely with interconnect bandwidth, so for big models on slow links it can dominate. This is why frameworks overlap the sync with the next rollout's prefill, and why the colocated pattern's free pointer-swap is genuinely attractive for small models.
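The sync-time rule is easy to check numerically. A minimal sketch, assuming the broadcast behaves like a single 2N-byte transfer at the quoted link bandwidth (it ignores the gather/reshard step, latency, and any overlap); `sync_time_s` is an illustrative name.

```python
def sync_time_s(n_params, interconnect_gb_per_s, bytes_per_param=2):
    """sync_time ≈ 2N / interconnect_BW, with bandwidth given in GB/s."""
    sync_bytes = bytes_per_param * n_params
    return sync_bytes / (interconnect_gb_per_s * 1e9)


N = 70e9
nvlink_s = sync_time_s(N, interconnect_gb_per_s=900.0)  # intra-node NVLink
ib_s = sync_time_s(N, interconnect_gb_per_s=50.0)       # inter-node InfiniBand

print(f"NVLink: {nvlink_s:.2f} s   InfiniBand: {ib_s:.2f} s   ratio: {ib_s / nvlink_s:.0f}x")
```

Because the cost is linear in N, halving the bytes shipped (for example, a smaller model) halves the tax on the same link.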

4 · The bottleneck diagnosis — rollout-bound or train-bound?

This is the central design skill of the lesson. The loop runs at the speed of its slower half. So the first question on any RL system is: is wall-clock dominated by generation or by the gradient step?

Rollout = generation = decode, which is bandwidth-bound (lesson 04). Its wall-clock is set by how many output tokens you must emit, one step at a time. Train = one forward+backward+optimizer step (lesson 07), set by the batch of tokens you process in parallel. The asymmetry: generation is sequential in the output length; training sees the whole sequence at once.

Reasoning models make this lopsided — and it's Little's Law again

A chat completion emits a few hundred tokens; a reasoning model emits 10,000+ thinking tokens before its answer. From lesson 03, time-in-system scales with output length, and from lesson 04 each of those tokens is a separate bandwidth-bound decode step. So generation can balloon to 70–90% of loop wall-clock for long-output RL: the rollout, not the gradient step, is where the time goes. Most modern reasoning-RL is heavily rollout-bound, which is why most of the GPUs go to actors, not learners. (Inside the rollout engine: RL 20; its memory: RL 22.)

The method is not a formula, it is a measurement: time the rollout phase and the train phase, take the split, and add GPUs to whichever dominates. If rollout is 80% of the step, doubling learner GPUs barely moves wall-clock; you want more actors.
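The measurement and the "which half to scale" projection can be sketched as follows. The helper names are hypothetical, and the projection assumes the two phases run serially and that each phase speeds up linearly with its GPU count, which real decode rarely does exactly.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def phase_split(rollout_s, train_s):
    """Fraction of step wall-clock spent in each phase."""
    total = rollout_s + train_s
    return {"rollout": rollout_s / total, "train": train_s / total}


def projected_speedup(rollout_s, train_s, rollout_scale=1.0, train_scale=1.0):
    """Step speedup if each phase gets `scale` times the GPUs.

    Assumes serial phases and perfectly linear scaling.
    """
    before = rollout_s + train_s
    after = rollout_s / rollout_scale + train_s / train_scale
    return before / after


# Example measurement: 80 s of rollout, 20 s of training per step.
split = phase_split(80.0, 20.0)
more_learners = projected_speedup(80.0, 20.0, train_scale=2.0)
more_actors = projected_speedup(80.0, 20.0, rollout_scale=2.0)
print(split, round(more_learners, 2), round(more_actors, 2))
```

With an 80/20 split, doubling learner GPUs projects to roughly a 1.11× faster step, while doubling actors projects to roughly 1.67×. In practice `timed` would wrap the real rollout and train calls to produce the two durations.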

5 · Async / off-policy — trade staleness for overlap

Synchronous RL is wasteful by construction: the actor idles while the learner steps, and the learner idles while the actor generates (in disaggregated mode; in colocated they alternate on one pool). Async RL breaks the lockstep — the actor keeps generating with slightly stale weights (the policy from k steps ago) while the learner steps on the trajectories it already has. Now both halves run continuously and overlap, which can lift throughput 2–3× (RL 19).

The cost is statistical, not computational. Trajectories sampled under the k-step-old policy π_{θ_{t−k}} but trained against the current policy π_{θ_t} are off-policy: the gradient estimator is biased, importance-sampling ratios drift away from 1, and learning can destabilize (the ratio-explosion failure mode of RL 06). The defenses are importance-sampling correction and a bounded staleness cap.

The knob: max staleness k

k = 0 is fully synchronous and on-policy — best sample quality, worst overlap. Larger k lets the actor run further ahead — better overlap and throughput, but more off-policy data. The trade is throughput (overlap) vs sample quality (on-policyness), and unlike a pure systems knob, pushing it too far hurts the model, not just the clock. Measure final quality at fixed compute, not steps/hour.
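The two defenses can be sketched together. The lesson names them but not a specific recipe, so the code below is one possible form: a version check that drops trajectories older than k learner steps, and a truncated importance ratio. The `Trajectory` fields, the function names, and the truncation value of 2.0 are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_version: int   # learner step count of the weights that sampled it
    logp_behavior: float  # log-prob of the sampled tokens under those weights
    reward: float


def within_staleness(traj, current_version, max_staleness_k):
    """Keep a trajectory only if it is at most k learner steps old."""
    return current_version - traj.policy_version <= max_staleness_k


def importance_ratio(logp_current, logp_behavior, clip=2.0):
    """pi_current / pi_behavior for the sampled tokens, truncated at `clip`.

    Truncation bounds the damage from ratio explosion; the value is an assumption.
    """
    return min(math.exp(logp_current - logp_behavior), clip)


buffer = [
    Trajectory(policy_version=10, logp_behavior=-1.2, reward=1.0),
    Trajectory(policy_version=9, logp_behavior=-0.8, reward=0.0),
    Trajectory(policy_version=7, logp_behavior=-2.0, reward=1.0),
]
current_version, k = 10, 2

usable = [t for t in buffer if within_staleness(t, current_version, k)]
kept_versions = [t.policy_version for t in usable]
ratio = importance_ratio(logp_current=-1.0, logp_behavior=usable[0].logp_behavior)
print(kept_versions, round(ratio, 3))
```

Setting `k = 0` keeps only trajectories from the current weights, which recovers the fully synchronous, on-policy case.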

6 · The capacity split — how many actors vs learners?

Given a fixed cluster, what fraction of GPUs goes to the actor pool versus the learner pool? This is lesson 08's producer-must-keep-up-with-consumer framing, applied to a loop: the rollout pool is the producer of trajectories, the learner is the consumer. Size the pools so neither waits.

rollout_throughput = actors · rollout_tps ≈ learners · learn_tps = train_consumption

If the actors produce R tokens/s and the learner consumes T tokens/s, the loop runs at min(R, T) and the other side idles by 1 − min(R,T)/max(R,T). The balanced split puts R ≈ T, so no GPU class is starved. Because reasoning rollouts are slow producers (long sequential decode), balancing usually lands most of the cluster on the actor side — the same conclusion §4 reached, now as a sizing rule.
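The sizing rule reduces to a few lines of arithmetic. This sketch uses the same model as the paragraph above (pool rate = GPUs × per-GPU rate, loop rate = min of the two) and the interactive's default inputs: 256 GPUs, 60 tok/s per actor GPU, 2,500 tok/s per learner GPU, half the GPUs on actors. Function names are illustrative, and the result is order-of-magnitude only.

```python
def loop_rates(total_gpus, actor_frac, rollout_tps, learn_tps):
    """Producer/consumer rates for a given actor/learner split."""
    actors = total_gpus * actor_frac
    learners = total_gpus * (1 - actor_frac)
    rollout = actors * rollout_tps  # R: tokens/s produced
    train = learners * learn_tps    # T: tokens/s consumed
    loop = min(rollout, train)
    if rollout == train:
        bottleneck = "balanced"
    else:
        bottleneck = "rollout" if rollout < train else "train"
    return {
        "rollout_tps": rollout,
        "train_tps": train,
        "loop_tps": loop,
        "bottleneck": bottleneck,
        "other_side_idle_frac": 1 - loop / max(rollout, train),
    }


def balanced_actor_frac(rollout_tps, learn_tps):
    # Solve f * rollout_tps = (1 - f) * learn_tps for f.
    return learn_tps / (rollout_tps + learn_tps)


even = loop_rates(256, 0.50, rollout_tps=60, learn_tps=2500)
best_frac = balanced_actor_frac(60, 2500)
balanced = loop_rates(256, best_frac, rollout_tps=60, learn_tps=2500)
print(even)
print(best_frac, balanced)
```

At a 50/50 split the loop runs at 7,680 tok/s and the learner pool idles about 98% of the time; the balanced split puts about 98% of the GPUs (250 of 256) on actors and roughly doubles the loop rate to 15,000 tok/s.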

Interactive · rollout-bound or train-bound? + actor/learner split

Set the cluster size, the reasoning knob (avg output tokens), the per-GPU throughputs, and the fraction of GPUs assigned to actors. The widget computes which half is the bottleneck, how badly the other half idles, and the split that balances them.

Bottleneck & capacity split

Assumptions: actors = GPUs·frac, learners = GPUs·(1−frac). Rollout throughput = actors·rollout_tps; train consumption = learners·learn_tps. The loop runs at min(rollout, train); the other side idles by (1 − min/max). The output-tokens slider is the reasoning knob: it doesn't change per-GPU rates but it's why rollout dominates — long outputs mean the actor pool must sustain far more token-seconds of sequential decode per trajectory. Optimal frac balances rollout = train. Order-of-magnitude, per lesson 02's ±30% contract.

Default inputs: total GPUs 256; avg rollout output tokens 8000; rollout tok/s per actor-GPU 60; learner tok/s per learner-GPU 2500; frac GPUs to actors 0.50.

Outputs (computed by the interactive widget): rollout tok/s, train tok/s, bottleneck, idle GPUs.

What carries forward

RL post-training is an inference system and a training system in a loop — it inherits the KV/decode constraints of 04–05 and the 16N/MFU constraints of 07. The design work is in the seam between them.
The memory bill stacks four roles: learner 16N + reference 2N + reward (maybe 2N) + a separate actor copy 2N + KV — over 1.5 TB for a 70B before KV. GRPO's win is deleting the critic, removing a whole model and its 16N.
Placement: colocated (time-share, simple, no idle pool, no overlap, pay the swap) vs disaggregated (separate pools, concurrent, independently sized, pay data/weight movement and pool idle).
Weight sync is pure overhead — 2N bytes broadcast every step, ~0.16 s on NVLink but ~2.8 s on IB for a 70B. Free as a pointer swap when colocated.
Diagnose rollout-bound vs train-bound by measuring the split. Long reasoning outputs push it heavily rollout-bound (70–90%), so most GPUs go to actors.
Async trades on-policyness for overlap; the knob is max staleness k — more overlap, more off-policy bias. Size pools so rollout throughput ≈ train consumption (producer keeps up with consumer, lesson 08).