RL frameworks, derived from one equation
Lesson 09 showed that the loop has four roles and stacks two memory bills. This lesson asks the next question: what number is the framework actually trying to make bigger, and what stops it? Everything a state-of-the-art RL framework does (colocated engines, streaming queues, async rollout, direct weight sync, repacking, speculative decode) is a move on one equation for that number. We derive the equation first, then read the frameworks off it.
1 · The loop is a pipeline — so its speed is its slowest stage
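One iteration of RL post-training runs the four roles in sequence (the actor generates, the reward scores, the reference supplies the KL baseline, the learner steps), and then the new weights sync back to the actor. Name the wall-clock of each stage (the RL track's notation, RL 22a): τR for rollout (actor generation), τV for reward / verification, τref for the reference forward pass, τT for the learner's training step, and τS for weight sync.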
How these combine into the per-step wall-clock τstep is entirely a function of one thing — how much the stages overlap — which is the placement decision from lesson 09, restated as arithmetic:
| Placement | τstep | Why |
|---|---|---|
| Colocated / synchronous | τR + τV + τref + τT + τS | One pool does everything in turn — pure sum, no overlap. |
| Disaggregated / synchronous | max(τR+τV+τref, τT) + τS | Actor pool overlaps learner pool; the sync barrier still serializes. |
| Fully async | max(τR+τV+τref, τT, τS) | Everything overlaps; the price is off-policy data (§3, 09c). |
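The three rows of the table can be written as one small function. This is a minimal sketch: the `StageTimes` container, the placement strings, and the example timings are hypothetical names and numbers chosen for illustration, not values from RL 22a.

```python
from dataclasses import dataclass


@dataclass
class StageTimes:
    """Per-step wall-clock of each stage, in seconds."""
    rollout: float    # tau_R: actor generation
    reward: float     # tau_V: reward / verification
    reference: float  # tau_ref: reference forward pass
    train: float      # tau_T: learner step
    sync: float       # tau_S: weight sync


def step_time(t: StageTimes, placement: str) -> float:
    """Compose tau_step from the stage times for a given placement."""
    actor_side = t.rollout + t.reward + t.reference
    if placement == "colocated_sync":
        # One pool does everything in turn: pure sum.
        return actor_side + t.train + t.sync
    if placement == "disaggregated_sync":
        # Actor pool overlaps learner pool; the sync barrier still serializes.
        return max(actor_side, t.train) + t.sync
    if placement == "fully_async":
        # Everything overlaps; the slowest term sets the clock.
        return max(actor_side, t.train, t.sync)
    raise ValueError(f"unknown placement: {placement}")


# Illustrative timings only (seconds).
times = StageTimes(rollout=20.0, reward=2.0, reference=1.0, train=8.0, sync=4.0)
for placement in ("colocated_sync", "disaggregated_sync", "fully_async"):
    print(f"{placement}: {step_time(times, placement):.1f} s")
```

With these made-up timings the step shrinks from 35 s to 27 s to 23 s as more of the sum turns into a max, which is the placement decision restated as arithmetic.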
And the throughput that τstep buys is just the tokens produced per step over that time. With a batch of B prompts, K sampled completions each, and mean length L̄ (of which L̄useful survives filtering and staleness):

goodput = B · K · L̄useful / τstep
This single fraction is the whole game, and reading it tells you what every framework optimization is for. The numerator is "make more useful tokens per step": bigger effective batch, or fewer tokens thrown away by filtering and staleness. The denominator is "make the step shorter": shrink the dominant τ, or change the placement so the sum becomes a max. Because τstep is a sum or a max of stage times, the dominant stage sets the clock: under a max it is the only term that matters, and under a sum it is the term with the most seconds to give back. A pipeline runs at the speed of its slowest stage. That is why "find the bottleneck" is the first move, and why optimizing a non-dominant stage buys little or nothing.
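A short sketch of the two levers, assuming the goodput form stated in the summary of this lesson (B·K·L̄useful / τstep). The function name, the batch shape, and the 25% drop rate are hypothetical and only there to make the arithmetic visible.

```python
def goodput(batch_prompts: int, samples_per_prompt: int,
            mean_useful_len: float, tau_step: float) -> float:
    """Useful tokens per second: B * K * L_useful / tau_step."""
    if tau_step <= 0:
        raise ValueError("tau_step must be positive")
    return batch_prompts * samples_per_prompt * mean_useful_len / tau_step


# Hypothetical baseline: B=64 prompts, K=16 samples, 1024 useful tokens, 40 s step.
base = goodput(64, 16, 1024, 40.0)

# Numerator lever, in reverse: filtering or staleness discards a quarter of the tokens.
filtered = goodput(64, 16, 1024 * 0.75, 40.0)

# Denominator lever: the step is halved (shrunk stage or more overlap).
faster = goodput(64, 16, 1024, 20.0)

print(f"baseline    : {base:,.0f} useful tokens/s")
print(f"25% dropped : {filtered:,.0f} useful tokens/s")
print(f"half step   : {faster:,.0f} useful tokens/s")
```

Dropped tokens cost goodput exactly in proportion, which is why a framework that filters aggressively still has to account for what it threw away.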
2 · Where the time goes — five resources, read off the roofline
"Find the bottleneck" needs a candidate list. Don't memorize one — derive it. Each stage consumes a distinct physical resource, and lesson 02's roofline already tells you which resource binds each stage. Walk the loop once:
- ACTOR emits tokens one at a time; each decode step reads the whole KV cache and the weights to produce one token. Arithmetic intensity ≈ 1–3 → memory-bandwidth-bound (lesson 04). The resource is decode bandwidth, and it queues when outputs are long.
- REWARD / env is whatever scores the output: a verifier (run a unit test, cheap CPU) or a tool/sandbox/judge (a network round-trip). The resource is environment latency, and it queues on the p95 tail — one slow test holds the batch.
- REFERENCE is one forward pass per scored token, no KV reuse across steps → also bandwidth-bound, a smaller cousin of the actor.
- LEARNER processes the whole batch of sequences in parallel: forward + backward + optimizer. Arithmetic intensity ≈ 400 → compute-bound, scored by MFU (lesson 07). The resource is training FLOPs + HBM.
- WEIGHT SYNC moves 2N bytes (N parameters at 2 bytes each, i.e. 16-bit weights) from learner layout to actor layout. It does no math, so it sits off the roofline entirely and is limited by interconnect bandwidth (lesson 02's 18× NVLink-vs-IB gap). The resource is weight movement; it queues on slow links and large N (09b).
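The roofline reading in the walk above can be checked with a few lines. This is a sketch under stated assumptions: the peak-FLOPs and HBM-bandwidth constants are approximate public figures for an H100 SXM and are not taken from this lesson, and the intensities (2 for decode, 400 for the learner) are the rough values quoted in the bullets.

```python
def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and bandwidth limits meet."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline: capped by compute or by bytes moved, whichever is lower."""
    return min(peak_flops, intensity * mem_bw)


def binding_resource(intensity: float, peak_flops: float, mem_bw: float) -> str:
    """Name the resource that limits a kernel of the given intensity."""
    if intensity >= ridge_point(peak_flops, mem_bw):
        return "compute"
    return "memory bandwidth"


# Assumed hardware numbers (approximate, H100 SXM class).
PEAK_FLOPS = 989e12  # dense BF16 FLOP/s
MEM_BW = 3.35e12     # HBM bytes/s

for stage, intensity in [("actor decode", 2.0), ("learner", 400.0)]:
    bound = binding_resource(intensity, PEAK_FLOPS, MEM_BW)
    frac = attainable_flops(intensity, PEAK_FLOPS, MEM_BW) / PEAK_FLOPS
    print(f"{stage}: bound by {bound}, at most {frac:.1%} of peak FLOPs")
```

Under these assumed constants the ridge sits near 295 FLOP/byte, so an intensity of 1 to 3 lands far on the bandwidth side and 400 lands on the compute side, matching the classification in the list.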
That walk produces the candidate table — but now each row is a conclusion, not a fact to take on faith. The "framework response" column is just "what do you reach for when this resource is the dominant τ":
| Resource | Stage | Roofline | Becomes the wall when… | Framework response |
|---|---|---|---|---|
| Decode bandwidth | τR | BW-bound | Long-CoT / agentic: many sequential tokens. | vLLM/SGLang, continuous batching, speculative decode, more actors. |
| Env / reward latency | τV | off-GPU | Tests, browsers, builds, judge models dominate the tail. | Remote env pools, caching, async reward queues, timeouts. |
| Training FLOPs + HBM | τT | compute-bound | Big policy, fat batches, optimizer + activation memory. | FSDP/Megatron, sequence packing, recompute, LoRA. |
| Weight movement | τS | network-bound | Large N on a slow link, every step (09b). | Resharding, bucketing, DMA, relay, colocate. |
| Freshness budget | (couples to all) | statistical | Async overlap outruns the policy → off-policy bias. | Bounded staleness, IS correction, staleness-aware loss (09c). |
The last row is not a stage — it is the cost of overlapping the others. That coupling is what makes RL systems harder than the training systems of lesson 07, and it deserves its own argument.
3 · Why this is harder than a pretraining system
Pretraining (lesson 07) is a giant streaming matmul over a fixed dataset; the dataloader's job is to be invisible (lesson 08). RL inverts every one of those properties — and each inversion is why a new optimization exists:
| Pretraining assumes… | RL breaks it because… | So the framework must… |
|---|---|---|
| The dataset is fixed and external. | The model generates its own next dataset. | Couple producer (actor) and consumer (learner) in a loop (lesson 08's framing). |
| Samples are uniform-cost. | Lengths are long-tailed; one completion can be 5× the mean (09c). | Tolerate stragglers — repack, cap, abort. |
| Every sample is equally valuable. | Most rollouts get reward 0; a few are gold. | Filter and prioritize — but track what it dropped. |
| The data distribution is stationary. | The generator changes after every learner step. | Tag each sample with the policy version that made it. |
| One parallel layout fits the whole run. | Actor wants an inference layout, learner a training layout. | Convert layouts and sync weights every step (09b). |
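The "tag each sample with the policy version" row, combined with "track what it dropped", might look like the following. This is a minimal sketch: the `Rollout` record, the `select_fresh` helper, and the `max_staleness` knob are hypothetical names, not the API of any framework listed in this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """One sampled completion, tagged with the policy version that produced it."""
    prompt_id: int
    tokens: tuple
    reward: float
    policy_version: int


def select_fresh(rollouts, learner_version: int, max_staleness: int):
    """Split rollouts into (kept, dropped) by how many versions behind they are."""
    kept, dropped = [], []
    for r in rollouts:
        lag = learner_version - r.policy_version
        if 0 <= lag <= max_staleness:
            kept.append(r)
        else:
            dropped.append(r)  # returned, not silently discarded
    return kept, dropped


# Illustrative buffer: the learner is at version 10 and tolerates 1 version of lag.
buffer = [
    Rollout(prompt_id=0, tokens=(1, 2, 3), reward=1.0, policy_version=10),
    Rollout(prompt_id=1, tokens=(4, 5), reward=0.0, policy_version=9),
    Rollout(prompt_id=2, tokens=(6,), reward=1.0, policy_version=7),
]
kept, dropped = select_fresh(buffer, learner_version=10, max_staleness=1)
print(f"kept {len(kept)}, dropped {len(dropped)} (stale by more than 1 version)")
```

Returning the dropped samples instead of discarding them keeps the numerator of the goodput fraction honest: the tokens were generated and paid for even if the learner never sees them.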
4 · The bottleneck map, and the method that uses it
Put the five stages in a row, add the sync edge that closes the loop, and label each with the fix you reach for when it dominates. This is the picture to draw on a whiteboard before naming a single framework:
The method is the one this whole track runs (the index's five-step loop): measure the split, fix the dominant term, re-measure — because fixing the wall exposes the next one. Never guess. A worked split makes it concrete; the numbers below are the 7B example from RL 22a (K=16, L≈1024, 8×H100, disaggregated):
Read it: rollout is ~95% of the step. Doubling learner GPUs takes the step from 34 s to… 34 s. The only move that matters is shrinking τR — more actors, faster decode, or killing the straggler tail. That is what "linearized" means: the arithmetic named the wall, and the wall named the fix.
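The same argument as arithmetic. The split below is illustrative only: it is not the RL 22a measurement, just hypothetical stage times chosen so that the disaggregated step comes to about 34 s with rollout near 95% of it. Halving a stage time is also an assumption of perfect scaling.

```python
def disaggregated_step(tau_r: float, tau_v: float, tau_ref: float,
                       tau_t: float, tau_s: float) -> float:
    """Disaggregated / synchronous: max(actor side, learner) + sync."""
    return max(tau_r + tau_v + tau_ref, tau_t) + tau_s


# Hypothetical split in seconds (NOT measured data).
split = {"tau_r": 32.3, "tau_v": 0.5, "tau_ref": 0.7, "tau_t": 4.0, "tau_s": 0.5}

base = disaggregated_step(**split)

# Double the learner GPUs: assume tau_T halves.
more_learners = disaggregated_step(**{**split, "tau_t": split["tau_t"] / 2})

# Double the actors: assume tau_R halves.
more_actors = disaggregated_step(**{**split, "tau_r": split["tau_r"] / 2})

print(f"baseline        : {base:.2f} s")
print(f"2x learner GPUs : {more_learners:.2f} s")
print(f"2x actors       : {more_actors:.2f} s")
```

The learner term sits inside the max and is far below the actor side, so halving it changes nothing; halving the rollout term nearly halves the step, after which the split should be measured again.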
Interactive · framework bottleneck triage
Set rough per-iteration times. The widget composes τstep for the sync and async placements, names the binding wall, and prints the optimization family you would reach for first. It is deliberately crude — for design interviews and first-cut sizing, not for committing a cluster.
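A rough offline stand-in for the triage described above can be sketched in Python. Everything here is an assumption made for illustration: "sync" is read as the colocated pure sum, "async" as the fully overlapped max, the binding wall is taken to be the largest stage inside the largest overlapped term, and the response strings are copied from the candidate table in section 2 (which has no row for the reference stage).

```python
RESPONSES = {
    "rollout": "vLLM/SGLang, continuous batching, speculative decode, more actors",
    "reward": "remote env pools, caching, async reward queues, timeouts",
    "reference": "no row in the table; bandwidth-bound like the actor (assumption)",
    "train": "FSDP/Megatron, sequence packing, recompute, LoRA",
    "sync": "resharding, bucketing, DMA, relay, colocate",
}

# Terms that overlap with each other in the fully async placement.
GROUPS = {
    "actor side": ("rollout", "reward", "reference"),
    "learner": ("train",),
    "weight sync": ("sync",),
}


def triage(times: dict) -> dict:
    """Compose sync and async step times and name the binding wall."""
    missing = set(RESPONSES) - set(times)
    if missing:
        raise ValueError(f"missing stage times: {sorted(missing)}")
    group_time = {g: sum(times[s] for s in stages) for g, stages in GROUPS.items()}
    binding_group = max(group_time, key=group_time.get)
    wall = max(GROUPS[binding_group], key=lambda s: times[s])
    return {
        "sync_step": sum(times[s] for s in RESPONSES),
        "async_step": group_time[binding_group],
        "binding_term": binding_group,
        "wall": wall,
        "reach_for": RESPONSES[wall],
    }


# Illustrative per-iteration times in seconds.
report = triage({"rollout": 12.0, "reward": 3.0, "reference": 1.0,
                 "train": 6.0, "sync": 2.0})
for key, value in report.items():
    print(f"{key}: {value}")
```

Like the widget, this is deliberately crude: it ignores partial overlap and tail effects, so it is suited to first-cut sizing rather than to a measured split.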
What carries forward
- One equation governs everything: goodput = B·K·L̄useful / τstep. Frameworks either grow the numerator (more useful tokens) or shrink the denominator (shorter, more-overlapped step).
- The loop is a pipeline, so τstep is a sum (no overlap) or a max (overlap) of five stage times, and the dominant stage sets the clock: under a max it is the only term that counts, under a sum it is the largest one. That is why you measure the split before touching anything.
- The five resources fall out of the roofline: decode (BW-bound), env (off-GPU tail), reference (BW-bound), trainer (compute-bound), weight sync (network-bound) — plus freshness, the statistical cost of overlapping them.
- RL is harder than pretraining because the model makes its own data, that data is long-tailed and unevenly valuable, the generator moves every step, and actor vs learner want different layouts. The new invariant: policy version is part of the data.
- Next: 09b derives the weight-sync term τS and the placement that pays or hides it; 09c derives the rollout term τR and the staleness it costs to overlap; 09d matches frameworks to whichever wall you measured.
Sources used
| Source | System idea used |
|---|---|
| HybridFlow / verl | Hybrid controller and actor resharding between training and generation. |
| AReaL | Fully asynchronous generation/training with staleness-aware PPO. |
| AsyncFlow | Streaming data storage (TransferQueue), producer-consumer scheduling, service decoupling. |
| Laminar | Trajectory-level asynchrony, relay workers, and dynamic repacking. |
| Relax | Fault-isolated services, omni-modal stack, and a continuous staleness knob. |
| LlamaRL | Distributed async PyTorch and direct-memory weight synchronization. |