
Designing an RL post-training system

Every system so far was one shape. Serving (lessons 04–05) was an inference system: KV-bound, decode bandwidth-limited, batched. Pretraining (lesson 07) was a training system: 16N memory, FSDP/TP/PP, scored by MFU. RL post-training is the first system in this track that is both at once, wired in a loop — it generates with an inference engine, scores the output, and takes a gradient step on it, forever. So it inherits every constraint from 04–05 (the generation side) and every constraint from 07 (the learner side). The new, hard part — the entire reason this lesson exists — is coordinating the two.

The framing: two systems you already designed, now in a feedback loop
Nothing in the rollout side is new physics — it is lesson 04's replica, decode-bound. Nothing in the learner side is new physics — it is lesson 07's 16N gradient step. What is new is that they share weights, run on one budget, and each waits on the other. The design questions are therefore all about the seam: where do the two halves live, how do updated weights get from learner to actor, and which half is the bottleneck.

This loop has four roles: actor, reward, reference, and learner. We tag them with the same chips the RL track uses, so the vocabulary lines up.

1 · The loop, and the memory bill it runs up

One iteration: the actor generates a batch of completions, the reward scores them, the reference supplies the KL baseline, and the learner takes one gradient step. Then the new weights must reach the actor and it all repeats. (Cluster wiring: RL 19.)

[Figure: the RL loop. ACTOR (rollout; decode, BW-bound) sends trajectories to REWARD + REF (score + KL), which send rewards to LEARNER (16N, gradient step). A weight sync from learner back to actor closes the loop, 2N bytes per step (§3). Each iteration = one inference job + one training job, chained.]

Now total the memory. At peak you may be holding, on the same cluster, all four roles' weights plus the learner's optimizer state plus the actor's KV cache. For a 70B policy in bf16, using the lesson-02/07 rules:

| Tenant | Role | Bytes (70B) | Rule |
|---|---|---|---|
| Learner train state | LEARNER | 1,120 GB | 16N (lesson 07) |
| Reference weights | REFERENCE | 140 GB | 2N, read-only |
| Reward model (if a model) | REWARD | ~140 GB | 2N (a 70B RM) |
| Actor inference weights | ACTOR | 140 GB | 2N, separate copy |
| Actor KV cache | ACTOR | tens–hundreds GB | grows with batch·context (lesson 04) |

That sums to about 1.54 TB before KV, far more than one 80 GB H100 and more than twice the 640 GB of an 8-GPU node. RL post-training is memory-hostile in a way neither pure serving nor pure pretraining is, because it stacks both bills on top of each other.
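The bill can be totaled in a few lines. This is a minimal sketch that only applies the 16N and 2N rules from the table. The variable names are illustrative, the reward model is assumed to be 70B as in the table, and the KV cache, activations, and fragmentation are deliberately left out, so the GPU count is a floor rather than a deployment plan.

```python
import math

N_PARAMS = 70e9   # 70B policy, as in the table
GB = 1e9          # decimal gigabytes, matching the table's figures
H100_GB = 80      # per-GPU memory used in the text

# Bytes-per-parameter rules: 16N for train state, 2N for bf16 weights.
tenants_gb = {
    "learner_train_state": 16 * N_PARAMS / GB,
    "reference_weights": 2 * N_PARAMS / GB,
    "reward_model": 2 * N_PARAMS / GB,   # assumption: a 70B reward model
    "actor_weights": 2 * N_PARAMS / GB,
}

total_gb = sum(tenants_gb.values())

# Lower bound only: ignores KV cache, activations, and fragmentation.
min_gpus = math.ceil(total_gb / H100_GB)

for name, gb in tenants_gb.items():
    print(f"{name:22s} {gb:8.0f} GB")
print(f"{'total (before KV)':22s} {total_gb:8.0f} GB  ->  at least {min_gpus} x 80 GB GPUs")
```

Under these assumptions the weights and train state alone need at least 20 GPUs' worth of memory, which is already more than two 8-GPU nodes before a single KV block is allocated.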

Why GRPO exists: it deletes a whole model from the bill
Classic PPO (RL 10) trains a critic, a second 70B-scale network with its own 16N train state, to estimate the value baseline. That is another ~1.1 TB on top of everything above. GRPO (RL 11) drops the critic entirely and computes the baseline from the group of completions sampled for a prompt (their mean reward). Removing the critic removes a whole model's weights, gradients, and optimizer state, the single biggest memory win available in the algorithm choice. The lesson: in RL, the algorithm is a systems decision.
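To make the group baseline concrete, here is a minimal sketch of the mean-reward baseline described above. The function name and the reward values are hypothetical. The published GRPO formulation also divides by the group's standard deviation; that step is omitted here because the text only describes the mean.

```python
from statistics import mean

def group_baseline_advantages(rewards):
    """Advantage of each completion = its reward minus the group's mean reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical rewards for 4 completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_baseline_advantages(rewards)

# Train state a 70B critic would have cost under the 16N rule.
critic_train_state_gb = 16 * 70e9 / 1e9

print(advantages)
print(f"critic train state avoided: {critic_train_state_gb:.0f} GB")
```

The baseline here costs one mean over numbers the loop already has, whereas the critic it replaces would carry a full 16N of train state.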

2 · Placement — colocated vs disaggregated

Where do the actor and learner physically live? Two answers, and the trade is the central topology decision (RL 19).

| | Colocated | Disaggregated |
|---|---|---|
| Layout | Actor and learner time-share the same GPUs: generate, then swap to train, then swap back. | Separate actor pool and learner pool, sized independently. |
| Overlap | None: the two phases are serial. While the pool generates, no training happens; while it trains, no generation happens. | Yes: the actor generates the next batch while the learner steps on the last one. |
| Idle GPUs | No idle pool (one pool does both), but idle time within each step. | A pool sits idle when its phase isn't the bottleneck (§4). |
| Cost paid | Context switch: tear down KV cache, swap in optimizer state, reload weights, twice per step. | Ship trajectories actor→learner and weights learner→actor across the network (§3). |
| Best for | Small models (≤7B), single node, prototypes. | Multi-node, large models, production scale. |

Colocated is simplest and wastes no GPUs to an idle pool, but it can never run the two phases at once, and the swap (evict KV, load 16N optimizer state) is real overhead paid every iteration. Disaggregated buys concurrency and independent sizing at the price of inter-pool data movement and the risk that one pool idles. The choice hinges on §4.

3 · Weight sync — the wire that closes the loop

The actor runs a separate, inference-optimized copy of the policy weights (its own TP layout, maybe quantized). After each learner step those weights are stale — sample from them and you are learning from last step's policy. So every iteration the updated policy must be broadcast to every actor engine. (Mechanism and the bugs it causes: RL 06.)

The cost is set by lesson 02's weight rule: the broadcast moves 2N bytes. For a 70B that is 140 GB shipped every single gradient step:

sync_bytes = 2N   |   sync_time ≈ 2N / interconnect_BW

Over intra-node NVLink (~900 GB/s) that 140 GB is ~0.16 s; over inter-node InfiniBand (~50 GB/s, the 18× gap from lesson 02) the same broadcast is ~2.8 s — which can rival a whole training step. Methods, cheapest first:

Weight sync is pure overhead, and it scales with model size
Unlike a gradient step (which buys learning) or a rollout (which buys data), the sync buys nothing — it is the tax for having two copies of the weights. It grows linearly with N and inversely with interconnect bandwidth, so for big models on slow links it can dominate. This is why frameworks overlap the sync with the next rollout's prefill, and why the colocated pattern's free pointer-swap is genuinely attractive for small models.
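The sync-time rule from this section, sync_time ≈ 2N / interconnect_BW, is easy to evaluate. The sketch below is a back-of-envelope calculation only: the function name is illustrative, the bandwidths are the approximate figures quoted in the text, and it ignores broadcast topology, protocol overhead, and any overlap with rollout.

```python
def sync_time_s(n_params, interconnect_gb_per_s, bytes_per_param=2):
    """Seconds to move one bf16 copy of the weights (2N bytes) over a link."""
    sync_bytes = bytes_per_param * n_params
    return sync_bytes / (interconnect_gb_per_s * 1e9)

N = 70e9  # 70B policy

nvlink_s = sync_time_s(N, 900)      # intra-node NVLink, ~900 GB/s
infiniband_s = sync_time_s(N, 50)   # inter-node InfiniBand, ~50 GB/s

print(f"NVLink:     {nvlink_s:.2f} s per sync")
print(f"InfiniBand: {infiniband_s:.2f} s per sync")
print(f"gap:        {infiniband_s / nvlink_s:.0f}x")
```

Because the numerator is linear in N, doubling the model doubles both figures; only a faster link or a smaller payload brings the tax down.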

4 · The bottleneck diagnosis — rollout-bound or train-bound?

This is the central design skill of the lesson. The loop runs at the speed of its slower half. So the first question on any RL system is: is wall-clock dominated by generation or by the gradient step?

Rollout = generation = decode, which is bandwidth-bound (lesson 04). Its wall-clock is set by how many output tokens you must emit, one step at a time. Train = one forward+backward+optimizer step (lesson 07), set by the batch of tokens you process in parallel. The asymmetry: generation is sequential in the output length; training sees the whole sequence at once.

Reasoning models make this lopsided — and it's Little's Law again
A chat completion emits ~hundreds of tokens; a reasoning model emits 10,000+ thinking tokens before its answer. From lesson 03, time-in-system scales with output length, and from lesson 04 each of those tokens is a separate bandwidth-bound decode step. So generation can balloon to 70–90% of loop wall-clock for long-output RL — the rollout, not the gradient, is where the time goes. Most modern reasoning-RL is heavily rollout-bound, which is why most of the GPUs go to actors, not learners. (Inside the rollout engine: RL 20; its memory: RL 22.)

The method is not a formula, it is a measurement: time the rollout phase and the train phase, take the split, and add GPUs to whichever dominates. If rollout is 80% of the step, doubling learner GPUs barely moves wall-clock; you want more actors.
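A minimal sketch of that measurement follows. The `timed` helper and the `time.sleep` calls are stand-ins for a real rollout and train phase, and `speedup_if_halved` assumes the phases run serially and that doubling a phase's GPUs exactly halves its time, which real systems only approximate.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_seconds = defaultdict(float)  # accumulated wall-clock per phase

@contextmanager
def timed(phase):
    """Add the wall-clock time of the enclosed block to phase_seconds[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start

def phase_split(seconds):
    """Convert per-phase seconds into fractions of the total."""
    total = sum(seconds.values())
    return {name: s / total for name, s in seconds.items()}

def speedup_if_halved(split, phase):
    """Step speedup if `phase` takes half as long and the other phases are unchanged."""
    new_total = sum(f / 2 if name == phase else f for name, f in split.items())
    return 1.0 / new_total

# Stand-ins for the real phases; replace the sleeps with actual work.
with timed("rollout"):
    time.sleep(0.02)
with timed("train"):
    time.sleep(0.005)
measured = phase_split(phase_seconds)
print({name: round(frac, 2) for name, frac in measured.items()})

# The 80/20 example from the text.
example = {"rollout": 0.8, "train": 0.2}
train_doubled = speedup_if_halved(example, "train")
rollout_doubled = speedup_if_halved(example, "rollout")
print(f"double learners: {train_doubled:.2f}x, double actors: {rollout_doubled:.2f}x")
```

With an 80/20 split, doubling the learner side yields about 1.11x while doubling the actor side yields about 1.67x, which is the arithmetic behind "you want more actors".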

5 · Async / off-policy — trade staleness for overlap

Synchronous RL is wasteful by construction: the actor idles while the learner steps, and the learner idles while the actor generates (in disaggregated mode; in colocated they alternate on one pool). Async RL breaks the lockstep — the actor keeps generating with slightly stale weights (the policy from k steps ago) while the learner steps on the trajectories it already has. Now both halves run continuously and overlap, which can lift throughput 2–3× (RL 19).

The cost is statistical, not computational. Trajectories sampled under π_{θ−k} but trained against π_θ are off-policy: the gradient estimator is biased, importance-sampling ratios drift, and learning can destabilize (the ratio-explosion failure mode of RL 06). The defenses are importance-sampling correction and a bounded staleness cap.

The knob: max staleness k
k = 0 is fully synchronous and on-policy — best sample quality, worst overlap. Larger k lets the actor run further ahead — better overlap and throughput, but more off-policy data. The trade is throughput (overlap) vs sample quality (on-policyness), and unlike a pure systems knob, pushing it too far hurts the model, not just the clock. Measure final quality at fixed compute, not steps/hour.
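The two defenses can be sketched together: drop trajectories older than k learner steps, then weight the survivors by an importance ratio. Everything here is illustrative. The `Trajectory` fields, the clip value of 2.0, and the use of a truncated sequence-level ratio are assumptions chosen for brevity; real systems typically work per token and may use a different correction.

```python
import math
from dataclasses import dataclass

@dataclass
class Trajectory:
    policy_version: int    # learner step whose weights generated this sample
    logp_behavior: float   # log-prob under the stale sampling policy
    logp_current: float    # log-prob under the current learner policy

def within_staleness(traj, learner_version, max_staleness_k):
    """Keep only samples generated at most k learner steps ago."""
    return learner_version - traj.policy_version <= max_staleness_k

def truncated_is_weight(traj, clip=2.0):
    """Importance ratio pi_current / pi_behavior, capped to limit ratio explosion."""
    ratio = math.exp(traj.logp_current - traj.logp_behavior)
    return min(ratio, clip)

# Hypothetical batch: staleness 0, 1, and 4 steps.
batch = [
    Trajectory(policy_version=10, logp_behavior=-12.0, logp_current=-12.0),
    Trajectory(policy_version=9, logp_behavior=-12.0, logp_current=-11.0),
    Trajectory(policy_version=6, logp_behavior=-12.0, logp_current=-9.0),
]
learner_version, k = 10, 2

kept = [t for t in batch if within_staleness(t, learner_version, k)]
weights = [truncated_is_weight(t) for t in kept]

print(f"kept {len(kept)} of {len(batch)} trajectories, weights = {weights}")
```

Setting k = 0 keeps only the first trajectory and every weight is exactly 1, which is the fully on-policy case; raising k admits more data at the price of ratios that drift from 1.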

6 · The capacity split — how many actors vs learners?

Given a fixed cluster, what fraction of GPUs goes to the actor pool versus the learner pool? This is lesson 08's producer-must-keep-up-with-consumer framing, applied to a loop: the rollout pool is the producer of trajectories, the learner is the consumer. Size the pools so neither waits.

rollout_throughput = actors · rollout_tps   ≈   learners · learn_tps = train_consumption

If the actors produce R tokens/s and the learner consumes T tokens/s, the loop runs at min(R, T) and the other side idles by 1 − min(R,T)/max(R,T). The balanced split puts R ≈ T, so no GPU class is starved. Because reasoning rollouts are slow producers (long sequential decode), balancing usually lands most of the cluster on the actor side — the same conclusion §4 reached, now as a sizing rule.
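The sizing rule can be worked through directly. Setting actors · rollout_tps = learners · learn_tps with actors = GPUs · frac and learners = GPUs · (1 − frac) gives frac = learn_tps / (rollout_tps + learn_tps). The sketch below implements that and the min/max idle formula; the per-GPU throughputs and cluster size are hypothetical, and fractional GPU counts are allowed for simplicity (round to whole GPUs in practice).

```python
def loop_stats(gpus, frac_actors, rollout_tps, learn_tps):
    """Loop rate, bottleneck, and idle fraction for a given actor/learner split."""
    actors = gpus * frac_actors
    learners = gpus * (1 - frac_actors)
    rollout = actors * rollout_tps   # R: tokens/s produced
    train = learners * learn_tps     # T: tokens/s consumed
    loop_rate = min(rollout, train)
    return {
        "rollout_tps": rollout,
        "train_tps": train,
        "loop_rate": loop_rate,
        "bottleneck": "rollout" if rollout < train else "train",
        "idle_frac_of_faster_side": 1 - loop_rate / max(rollout, train),
    }

def balanced_frac(rollout_tps, learn_tps):
    """Actor fraction at which production equals consumption (R = T)."""
    return learn_tps / (rollout_tps + learn_tps)

# Hypothetical per-GPU rates: decode is 4x slower per GPU than training.
GPUS, ROLLOUT_TPS, LEARN_TPS = 64, 1_000, 4_000

even = loop_stats(GPUS, 0.5, ROLLOUT_TPS, LEARN_TPS)
best_frac = balanced_frac(ROLLOUT_TPS, LEARN_TPS)
balanced = loop_stats(GPUS, best_frac, ROLLOUT_TPS, LEARN_TPS)

print(f"50/50 split: bottleneck={even['bottleneck']}, "
      f"learner idle={even['idle_frac_of_faster_side']:.0%}")
print(f"balanced actor fraction: {best_frac:.0%}, "
      f"idle={balanced['idle_frac_of_faster_side']:.1%}")
```

With these made-up rates an even split leaves the learner pool idle 75% of the time, while the balanced split puts 80% of the cluster on actors, consistent with the text's point that slow producers pull most of the GPUs to the rollout side.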

Interactive · rollout-bound or train-bound? + actor/learner split

Set the cluster size, the reasoning knob (avg output tokens), the per-GPU throughputs, and the fraction of GPUs assigned to actors. The widget computes which half is the bottleneck, how badly the other half idles, and the split that balances them.


Assumptions: actors = GPUs·frac, learners = GPUs·(1−frac). Rollout throughput = actors·rollout_tps; train consumption = learners·learn_tps. The loop runs at min(rollout, train); the other side idles by (1 − min/max). The output-tokens slider is the reasoning knob: it doesn't change per-GPU rates but it's why rollout dominates — long outputs mean the actor pool must sustain far more token-seconds of sequential decode per trajectory. Optimal frac balances rollout = train. Order-of-magnitude, per lesson 02's ±30% contract.


What carries forward