
RL frameworks, derived from one equation

Lesson 09 showed the loop has four roles and stacks two memory bills. This lesson asks the next question: what number is the framework actually trying to make bigger, and what stops it? Everything a SOTA RL framework does — colocated engines, streaming queues, async rollout, direct weight sync, repacking, speculative decode — is an answer to that one equation. We derive the equation first, then read the frameworks off it.

The target is not steps/second
A framework turns GPU-seconds into a policy update. The thing worth maximizing is useful, near-on-policy tokens per dollar — not raw throughput. Each word in that phrase is load-bearing: useful (the token survived filtering and was trained on), near-on-policy (it was generated by weights close to the ones being updated, or the gradient is biased), per dollar (the GPU is the cost). Hold those three words; the rest of the lesson is what happens when one of them becomes the wall.

1 · The loop is a pipeline — so its speed is its slowest stage

One iteration of RL post-training runs the four roles in sequence (the actor generates, the reward scores, the reference supplies the KL baseline, the learner steps) and then syncs the new weights back to the actor. That makes five timed stages. Name the wall-clock of each (the RL track's notation, RL 22a):

τR rollout  ·  τV verifier/reward  ·  τref reference forward  ·  τT trainer step  ·  τS weight sync

How these combine into the per-step wall-clock τstep is entirely a function of one thing — how much the stages overlap — which is the placement decision from lesson 09, restated as arithmetic:

| Placement | τstep | Why |
| --- | --- | --- |
| Colocated / synchronous | τR + τV + τref + τT + τS | One pool does everything in turn: pure sum, no overlap. |
| Disaggregated / synchronous | max(τRVref, τT) + τS | Actor pool overlaps learner pool; the sync barrier still serializes. |
| Fully async | max(τRVref, τT, τS) | Everything overlaps; the price is off-policy data (§3, 09c). |
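The three rows are plain arithmetic, so they can be sketched directly. This is a minimal sketch, not any framework's API: `StageTimes` and `step_time` are illustrative names, and it assumes τRVref means the actor-side sum τR + τV + τref (the table does not spell that out, but the worked split in §4 is consistent with it). The sample numbers reuse that split, with τV set to 0 as a further assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    """Per-iteration wall-clock of each stage, in seconds."""
    rollout: float    # tau_R
    verifier: float   # tau_V
    reference: float  # tau_ref
    trainer: float    # tau_T
    sync: float       # tau_S

    @property
    def actor_side(self) -> float:
        # Assumption: tau_RVref is the sum of the actor-pool stages.
        return self.rollout + self.verifier + self.reference


def step_time(t: StageTimes, placement: str) -> float:
    """Compose tau_step for one of the three placements in the table."""
    if placement == "colocated_sync":
        return t.actor_side + t.trainer + t.sync          # pure sum
    if placement == "disaggregated_sync":
        return max(t.actor_side, t.trainer) + t.sync      # overlap, then barrier
    if placement == "fully_async":
        return max(t.actor_side, t.trainer, t.sync)       # everything overlaps
    raise ValueError(f"unknown placement: {placement!r}")


times = StageTimes(rollout=30.0, verifier=0.0, reference=3.0, trainer=8.0, sync=1.0)
for name in ("colocated_sync", "disaggregated_sync", "fully_async"):
    print(f"{name:>20}: {step_time(times, name):.0f} s")
```

The same five inputs give three different clocks, which is the point of the table: placement changes how the terms combine, not what they are.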

And the throughput that τstep buys is just the useful tokens produced per step divided by that time. With a batch of B prompts, K sampled completions each, and mean useful length L̄useful per completion:

goodput = B · K · L̄useful / τstep

This single fraction is the whole game, and reading it tells you what every framework optimization is for. The numerator is "make more useful tokens per step": a bigger effective batch, or fewer tokens thrown away by filtering and staleness. The denominator is "make the step shorter": shrink the dominant τ, or change the placement so the sum becomes a max. Because τstep is a sum or a max of stages, the dominant stage sets the clock: under a max nothing else moves it at all, and under a sum the largest term still has the most to give. A pipeline runs at the speed of its slowest stage. That is why "find the bottleneck" is the first move, and why optimizing any other stage first is mostly wasted work.
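The two levers can be tried on the fraction itself. A small sketch with hypothetical numbers: B = 64 is made up, the discarded fraction is made up, and the two step times are placeholders rather than measurements. The function name `goodput` and its parameters are illustrative.

```python
def goodput(batch_prompts: int, samples_per_prompt: int,
            mean_useful_len: float, step_seconds: float) -> float:
    """Useful tokens per second: B * K * L_useful / tau_step."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return batch_prompts * samples_per_prompt * mean_useful_len / step_seconds


# Hypothetical inputs (B is invented for illustration).
B, K, L_MEAN, TAU_STEP = 64, 16, 1024.0, 34.0

base = goodput(B, K, L_MEAN, TAU_STEP)

# Numerator lever, in the wrong direction: filtering or staleness
# discards a quarter of the tokens, so L_useful drops by 25%.
filtered = goodput(B, K, 0.75 * L_MEAN, TAU_STEP)

# Denominator lever: the step gets shorter at the same useful length.
faster = goodput(B, K, L_MEAN, 19.0)

print(f"base     {base:10.0f} tok/s")
print(f"filtered {filtered:10.0f} tok/s")
print(f"faster   {faster:10.0f} tok/s")
```

A system that shortens the step but loses the same share of tokens to staleness ends up where it started, which is why the numerator has to be tracked alongside the clock.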

The one-line restatement of the whole sub-track
A framework is SOTA when it raises min(actor production, reward throughput, learner consumption) and shortens τstep by overlapping stages, without shrinking L̄useful through stale or filtered-away data. The fastest system and the most on-policy system are usually not the same system. That tension is the subject of 09b–09d.
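One way to read the min(...) in that restatement, as a sketch: treat each role as a rate in tokens per second and take the smallest. The rates below are hypothetical, and `sustained_rate` is an illustrative helper, not part of any framework.

```python
def sustained_rate(rates: dict[str, float]) -> tuple[str, float]:
    """Return the role that caps the loop and the rate it allows."""
    stage = min(rates, key=rates.get)
    return stage, rates[stage]


# Hypothetical steady-state rates, in tokens per second.
rates = {
    "actor production": 12_000.0,
    "reward throughput": 40_000.0,
    "learner consumption": 90_000.0,
}
stage, rate = sustained_rate(rates)
print(f"capped by {stage} at {rate:.0f} tok/s")

# Raising a rate that is not the minimum leaves the loop where it was.
more_learner = {**rates, "learner consumption": 180_000.0}
print(sustained_rate(more_learner))
```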

2 · Where the time goes — five resources, read off the roofline

"Find the bottleneck" needs a candidate list. Don't memorize one — derive it. Each stage consumes a distinct physical resource, and lesson 02's roofline already tells you which resource binds each stage. Walk the loop once:

That walk produces the candidate table — but now each row is a conclusion, not a fact to take on faith. The "framework response" column is just "what do you reach for when this resource is the dominant τ":

| Resource | Stage | Roofline | Becomes the wall when… | Framework response |
| --- | --- | --- | --- | --- |
| Decode bandwidth | τR | BW-bound | Long-CoT / agentic: many sequential tokens. | vLLM/SGLang, continuous batching, speculative decode, more actors. |
| Env / reward latency | τV | off-GPU | Tests, browsers, builds, judge models dominate the tail. | Remote env pools, caching, async reward queues, timeouts. |
| Training FLOPs + HBM | τT | compute-bound | Big policy, fat batches, optimizer + activation memory. | FSDP/Megatron, sequence packing, recompute, LoRA. |
| Weight movement | τS | network-bound | Large N on a slow link, every step (09b). | Resharding, bucketing, DMA, relay, colocate. |
| Freshness budget | (couples to all) | statistical | Async overlap outruns the policy → off-policy bias. | Bounded staleness, IS correction, staleness-aware loss (09c). |

The last row is not a stage — it is the cost of overlapping the others. That coupling is what makes RL systems harder than the training systems of lesson 07, and it deserves its own argument.

3 · Why this is harder than a pretraining system

Pretraining (lesson 07) is a giant streaming matmul over a fixed dataset; the dataloader's job is to be invisible (lesson 08). RL inverts every one of those properties — and each inversion is why a new optimization exists:

| Pretraining assumes… | RL breaks it because… | So the framework must… |
| --- | --- | --- |
| The dataset is fixed and external. | The model generates its own next dataset. | Couple producer (actor) and consumer (learner) in a loop (lesson 08's framing). |
| Samples are uniform-cost. | Lengths are long-tailed; one completion can be 5× the mean (09c). | Tolerate stragglers: repack, cap, abort. |
| Every sample is equally valuable. | Most rollouts get reward 0; a few are gold. | Filter and prioritize, but track what it dropped. |
| The data distribution is stationary. | The generator changes after every learner step. | Tag each sample with the policy version that made it. |
| One parallel layout fits the whole run. | Actor wants an inference layout, learner a training layout. | Convert layouts and sync weights every step (09b). |

The new invariant: policy version is part of the data
Pretraining wants the dataloader invisible. RL needs the policy version of every trajectory to be visible. You cannot correctly optimize the data plane without knowing which weights produced each sample and how stale it is — because the gradient correction (the importance-sampling ratio of RL 06) is a function of exactly that. Throughput without version accounting is just a faster way to train on the wrong distribution.
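Making the version visible can be as simple as carrying it on every sample. The sketch below is an assumption-laden illustration, not any framework's data model: `Trajectory`, `staleness`, `admit`, and the `max_staleness` threshold are all hypothetical names. It shows the bookkeeping for bounded staleness only; the importance-sampling correction itself is not implemented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    tokens: tuple[int, ...]
    reward: float
    policy_version: int  # learner step whose weights generated this sample


def staleness(sample: Trajectory, learner_version: int) -> int:
    """How many learner steps behind the current weights this sample is."""
    return learner_version - sample.policy_version


def admit(batch: list[Trajectory], learner_version: int, max_staleness: int):
    """Split a batch into (fresh enough to train on, dropped as too stale)."""
    kept, dropped = [], []
    for sample in batch:
        if staleness(sample, learner_version) <= max_staleness:
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


batch = [
    Trajectory(tokens=(1, 2, 3), reward=1.0, policy_version=10),
    Trajectory(tokens=(4, 5), reward=0.0, policy_version=9),
    Trajectory(tokens=(6,), reward=1.0, policy_version=7),
]
kept, dropped = admit(batch, learner_version=10, max_staleness=2)
print("kept versions:   ", [s.policy_version for s in kept])
print("dropped versions:", [s.policy_version for s in dropped])
```

Returning the dropped samples rather than discarding them silently keeps the accounting honest: the tokens that never reached the learner are exactly what the goodput numerator loses.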

4 · The bottleneck map, and the method that uses it

Put the stages in a row, let the weight-sync edge close the loop, and label each with the fix you reach for when it dominates. This is the picture to draw on a whiteboard before naming a single framework:

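Figure: the bottleneck map, read left to right, with weight sync closing the loop back to the actors.

| Stage | Clock · binding resource | Fix |
| --- | --- | --- |
| actors | τR · decode, BW | more actors, speculative decode |
| reward/env | τV · tail | env pool, cache, async reward |
| learner | τT · F/B, compute | packing, MFU, parallelism |
| weight sync | τS · network | bucket, DMA, relay, colocate |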

The method is the one this whole track runs (the index's five-step loop): measure the split, fix the dominant term, re-measure — because fixing the wall exposes the next one. Never guess. A worked split makes it concrete; the numbers below are the 7B example from RL 22a (K=16, L≈1024, 8×H100, disaggregated):

τR ≈ 30 s  (+ 1.4× straggler)  >>  τT ≈ 8 s  ·  τref ≈ 3 s  ·  τS ≈ 1 s  ⟹  τstep ≈ max(33, 8) + 1 ≈ 34 s

Read it: the actor side (rollout plus reference forward) is 33 s of the 34 s step, about 97%, and rollout alone is about 88%. Doubling learner GPUs takes the step from 34 s to… 34 s. The only move that matters is shrinking τR: more actors, faster decode, or killing the straggler tail. That is what "linearized" means: the arithmetic named the wall, and the wall named the fix.
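The what-if is worth running rather than trusting. A sketch under stated assumptions: τV is taken as 0 because the split above does not list it, τRVref is taken as the sum τR + τV + τref, and doubling a pool is assumed to halve its stage time exactly, which real systems rarely achieve. The function name is illustrative.

```python
def disaggregated_sync_step(tau_r: float, tau_v: float, tau_ref: float,
                            tau_t: float, tau_s: float) -> float:
    """max(actor side, learner) + sync, per the disaggregated / synchronous row."""
    return max(tau_r + tau_v + tau_ref, tau_t) + tau_s


# The worked split, with tau_V assumed to be 0 (not given in the split).
base = dict(tau_r=30.0, tau_v=0.0, tau_ref=3.0, tau_t=8.0, tau_s=1.0)

# Idealized scaling: doubling a pool halves that pool's stage time.
scenarios = {
    "baseline": base,
    "2x learner GPUs (tau_T halves)": {**base, "tau_t": 4.0},
    "2x actors (tau_R halves)": {**base, "tau_r": 15.0},
}

results = {name: disaggregated_sync_step(**kw) for name, kw in scenarios.items()}
for name, seconds in results.items():
    print(f"{name:<32} {seconds:.0f} s")
```

Under these assumptions the learner upgrade buys nothing and the actor upgrade cuts the step from 34 s to 19 s, at which point the split should be measured again, because the next wall may be a different stage.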

Framework bottleneck triage

Set rough per-iteration times, compose τstep for the sync and async placements, name the binding wall, and pick the optimization family you would reach for first. The triage is deliberately crude: it is for design interviews and first-cut sizing, not for committing a cluster.

Sync loop = τR + τV + τT + τS (sum, no overlap). Async loop ≈ max(τRV, τT) + exposed sync: overlap turns the sum into a max. The binding wall is the largest single term; that is the only one worth optimizing first.
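The triage rules above can be sketched as one small function. Everything named here is illustrative: `triage`, `FIRST_FIX`, and the `exposed_sync_fraction` parameter (the share of τS that overlap fails to hide) are assumptions, the fix strings are copied from the bottleneck map, and the input times are hypothetical. Like the formulas it encodes, the sketch leaves out τref.

```python
# Fix families, keyed by the stage that turns out to be the wall.
FIRST_FIX = {
    "rollout": "more actors, speculative decode",
    "reward/env": "env pool, cache, async reward",
    "learner": "packing, MFU, parallelism",
    "weight sync": "bucket, DMA, relay, colocate",
}


def triage(tau_r: float, tau_v: float, tau_t: float, tau_s: float,
           exposed_sync_fraction: float = 1.0) -> dict:
    """Crude first-cut triage: compose both loops and name the binding wall."""
    terms = {
        "rollout": tau_r,
        "reward/env": tau_v,
        "learner": tau_t,
        "weight sync": tau_s,
    }
    wall = max(terms, key=terms.get)  # largest single term
    return {
        "sync_step": tau_r + tau_v + tau_t + tau_s,
        "async_step": max(tau_r + tau_v, tau_t) + exposed_sync_fraction * tau_s,
        "binding_wall": wall,
        "first_fix": FIRST_FIX[wall],
    }


# Hypothetical per-iteration times, in seconds.
report = triage(tau_r=30.0, tau_v=2.0, tau_t=8.0, tau_s=1.0)
for key, value in report.items():
    print(f"{key:>12}: {value}")
```

Setting `exposed_sync_fraction` below 1.0 models a sync that is partly hidden behind generation or training; the default of 1.0 is the pessimistic case where none of it overlaps.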


What carries forward

Sources used

| Source | System idea used |
| --- | --- |
| HybridFlow / verl | Hybrid controller and actor resharding between training and generation. |
| AReaL | Fully asynchronous generation/training with staleness-aware PPO. |
| AsyncFlow | Streaming data storage (TransferQueue), producer-consumer scheduling, service decoupling. |
| Laminar | Trajectory-level asynchrony, relay workers, and dynamic repacking. |
| Relax | Fault-isolated services, omni-modal stack, and a continuous staleness knob. |
| LlamaRL | Distributed async PyTorch and direct-memory weight synchronization. |