RL frameworks, derived from one equation

Lesson 09 showed the loop has four roles and stacks two memory bills. This lesson asks the next question: what number is the framework actually trying to make bigger, and what stops it? Everything a SOTA RL framework does — colocated engines, streaming queues, async rollout, direct weight sync, repacking, speculative decode — is an answer to that one equation. We derive the equation first, then read the frameworks off it.

The target is not steps/second

A framework turns GPU-seconds into a policy update. The thing worth maximizing is useful, near-on-policy tokens per dollar — not raw throughput. Each word in that phrase is load-bearing: useful (the token survived filtering and was trained on), near-on-policy (it was generated by weights close to the ones being updated, or the gradient is biased), per dollar (the GPU is the cost). Hold those three words; the rest of the lesson is what happens when one of them becomes the wall.

1 · The loop is a pipeline — so its speed is its slowest stage

One iteration of RL post-training runs four roles in sequence and then closes the loop: the actor generates, the reward scores, the reference supplies the KL baseline, and the learner steps; then the new weights sync back to the actor. That gives five stages to time. Name the wall-clock of each stage (the RL track's notation, RL 22a):

τ_R rollout · τ_V verifier/reward · τ_ref reference forward · τ_T trainer step · τ_S weight sync

How these combine into the per-step wall-clock τ_step is entirely a function of one thing — how much the stages overlap — which is the placement decision from lesson 09, restated as arithmetic:

Placement	τ_step	Why
Colocated / synchronous	τ_R + τ_V + τ_ref + τ_T + τ_S	One pool does everything in turn — pure sum, no overlap.
Disaggregated / synchronous	max(τ_R+τ_V+τ_ref, τ_T) + τ_S	Actor pool overlaps learner pool; the sync barrier still serializes.
Fully async	max(τ_R+τ_V+τ_ref, τ_T, τ_S)	Everything overlaps; the price is off-policy data (§3, 09c).
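The three rows of the table can be written down as a small function. This is a minimal sketch, not any framework's API: the `StageTimes` container, the placement labels, and the example numbers are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    """Per-iteration wall-clock of each stage, in seconds."""
    rollout: float    # tau_R
    verifier: float   # tau_V
    reference: float  # tau_ref
    trainer: float    # tau_T
    sync: float       # tau_S


def step_time(t: StageTimes, placement: str) -> float:
    """Compose tau_step for one of the three placements in the table."""
    actor_side = t.rollout + t.verifier + t.reference
    if placement == "colocated_sync":
        # One pool does everything in turn: pure sum.
        return actor_side + t.trainer + t.sync
    if placement == "disaggregated_sync":
        # Actor pool overlaps learner pool; sync still serializes.
        return max(actor_side, t.trainer) + t.sync
    if placement == "fully_async":
        # Everything overlaps; the slowest branch sets the clock.
        return max(actor_side, t.trainer, t.sync)
    raise ValueError(f"unknown placement: {placement!r}")


# Illustrative numbers only (hypothetical, not measurements).
times = StageTimes(rollout=30.0, verifier=0.0, reference=3.0, trainer=8.0, sync=1.0)
for name in ("colocated_sync", "disaggregated_sync", "fully_async"):
    print(f"{name:>20}: {step_time(times, name):.1f} s")
```

The function only restates the arithmetic; it says nothing about the off-policy cost that the fully async row pays for its shorter step.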

And the throughput that τ_step buys is just the useful tokens produced per step over that time. With a batch of B prompts, K sampled completions each, and a mean of L̄_useful useful tokens per completion (the tokens that survive filtering and are trained on):

goodput = B · K · L̄_useful / τ_step

This single fraction is the whole game, and reading it tells you what every framework optimization is for. The numerator is "make more useful tokens per step": bigger effective batch, or fewer tokens thrown away by filtering and staleness. The denominator is "make the step shorter": shrink the dominant τ, or change the placement so the sum becomes a max. Because τ_step is a sum or a max of stage times, the dominant stage sets the clock: under a max, shrinking any non-binding stage changes nothing; under a sum, it saves only that stage's small share. A pipeline runs at the speed of its slowest stage. That is the entire reason "find the bottleneck" is the first move, and why optimizing any other stage is mostly wasted work.
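The two levers can be compared numerically. The sketch below is illustrative only: the batch size, the 60% and 80% keep fractions, and the step times are assumed values, not measurements from any system.

```python
def goodput(batch_prompts: int, samples_per_prompt: int,
            mean_useful_len: float, tau_step: float) -> float:
    """Useful tokens per second: B * K * L_useful / tau_step."""
    if tau_step <= 0:
        raise ValueError("tau_step must be positive")
    return batch_prompts * samples_per_prompt * mean_useful_len / tau_step


# Hypothetical baseline: B=64 prompts, K=16 samples, 1024 generated tokens each,
# of which an assumed 60% survive filtering and staleness checks.
B, K, mean_len, keep_fraction = 64, 16, 1024, 0.6
baseline = goodput(B, K, mean_len * keep_fraction, tau_step=34.0)

# Numerator lever: waste fewer tokens (keep 80% instead of 60%).
less_waste = goodput(B, K, mean_len * 0.8, tau_step=34.0)

# Denominator lever: shorten the step (34 s -> 25 s) at the same keep rate.
shorter_step = goodput(B, K, mean_len * keep_fraction, tau_step=25.0)

print(f"baseline     : {baseline:,.0f} useful tok/s")
print(f"less waste   : {less_waste:,.0f} useful tok/s")
print(f"shorter step : {shorter_step:,.0f} useful tok/s")
```

Both levers raise the same number, which is why a faster step that lowers the keep fraction can leave goodput flat.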

The one-line restatement of the whole sub-track

A framework is SOTA when it raises min(actor production, reward throughput, learner consumption) and shortens τ_step by overlapping stages — without shrinking L̄_useful by training on stale or filtered-away data. The fastest system and the most on-policy system are usually not the same system. That tension is the subject of 09b–09d.

2 · Where the time goes — five resources, read off the roofline

"Find the bottleneck" needs a candidate list. Don't memorize one — derive it. Each stage consumes a distinct physical resource, and lesson 02's roofline already tells you which resource binds each stage. Walk the loop once:

ACTOR emits tokens one at a time; each decode step reads the whole KV cache and the weights to produce one token. Arithmetic intensity ≈ 1–3 → memory-bandwidth-bound (lesson 04). The resource is decode bandwidth, and it queues when outputs are long.
REWARD / env is whatever scores the output: a verifier (run a unit test, cheap CPU) or a tool/sandbox/judge (a network round-trip). The resource is environment latency, and it queues on the p95 tail — one slow test holds the batch.
REFERENCE is one forward pass per scored token, no KV reuse across steps → also bandwidth-bound, a smaller cousin of the actor.
LEARNER processes the whole batch of sequences in parallel: forward + backward + optimizer. Arithmetic intensity ≈ 400 → compute-bound, scored by MFU (lesson 07). The resource is training FLOPs + HBM.
WEIGHT SYNC moves 2N bytes (N parameters at 2 bytes each, as in bf16) from learner layout to actor layout. It does no math, so it sits off the roofline entirely, limited by interconnect bandwidth (lesson 02's 18× NVLink-vs-IB gap). The resource is weight movement; it queues on slow links and large N (09b).
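The weight-sync stage is the easiest one to bound by hand, since it is just bytes over bandwidth. A rough sketch follows; the 450 GB/s figure is a hypothetical effective bandwidth, and the slow link is set to one eighteenth of it only to reuse the 18× ratio quoted above. It ignores resharding cost, latency, and parallel links, so treat the result as a floor.

```python
def weight_sync_seconds(num_params: float, link_gb_per_s: float,
                        bytes_per_param: int = 2) -> float:
    """Lower bound on tau_S: bytes to move divided by link bandwidth.

    bytes_per_param=2 matches the lesson's 2N bytes.
    """
    if link_gb_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    total_bytes = bytes_per_param * num_params
    return total_bytes / (link_gb_per_s * 1e9)


# Hypothetical effective bandwidths; substitute measured numbers for a real cluster.
FAST_LINK_GB_S = 450.0
SLOW_LINK_GB_S = FAST_LINK_GB_S / 18

for n_params in (7e9, 70e9):
    fast = weight_sync_seconds(n_params, FAST_LINK_GB_S)
    slow = weight_sync_seconds(n_params, SLOW_LINK_GB_S)
    print(f"N={n_params:.0e}: fast link {fast:.2f} s, slow link {slow:.2f} s")
```

Because the cost scales linearly in N and is paid every step, the same link that is harmless at small N can become the dominant τ at large N.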

That walk produces the candidate table — but now each row is a conclusion, not a fact to take on faith. The "framework response" column is just "what do you reach for when this resource is the dominant τ":

Resource	Stage	Roofline	Becomes the wall when…	Framework response
Decode bandwidth	τ_R	BW-bound	Long-CoT / agentic: many sequential tokens.	vLLM/SGLang, continuous batching, speculative decode, more actors.
Env / reward latency	τ_V	off-GPU	Tests, browsers, builds, judge models dominate the tail.	Remote env pools, caching, async reward queues, timeouts.
Training FLOPs + HBM	τ_T	compute-bound	Big policy, fat batches, optimizer + activation memory.	FSDP/Megatron, sequence packing, recompute, LoRA.
Weight movement	τ_S	network-bound	Large N on a slow link, every step (09b).	Resharding, bucketing, DMA, relay, colocate.
Freshness budget	(couples to all)	statistical	Async overlap outruns the policy → off-policy bias.	Bounded staleness, IS correction, staleness-aware loss (09c).

The last row is not a stage — it is the cost of overlapping the others. That coupling is what makes RL systems harder than the training systems of lesson 07, and it deserves its own argument.

3 · Why this is harder than a pretraining system

Pretraining (lesson 07) is a giant streaming matmul over a fixed dataset; the dataloader's job is to be invisible (lesson 08). RL inverts every one of those properties — and each inversion is why a new optimization exists:

Pretraining assumes…	RL breaks it because…	So the framework must…
The dataset is fixed and external.	The model generates its own next dataset.	Couple producer (actor) and consumer (learner) in a loop (lesson 08's framing).
Samples are uniform-cost.	Lengths are long-tailed; one completion can be 5× the mean (09c).	Tolerate stragglers — repack, cap, abort.
Every sample is equally valuable.	Most rollouts get reward 0; a few are gold.	Filter and prioritize — but track what it dropped.
The data distribution is stationary.	The generator changes after every learner step.	Tag each sample with the policy version that made it.
One parallel layout fits the whole run.	Actor wants an inference layout, learner a training layout.	Convert layouts and sync weights every step (09b).
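The long-tail row is easy to quantify. The sketch below assumes a batch that must wait for its slowest completion and a decode time proportional to token count; both are simplifications, and the lengths are made-up values.

```python
from statistics import mean


def straggler_factor(lengths: list[int]) -> float:
    """Longest completion over the mean: how far the tail stretches a batch
    that waits for its slowest member."""
    return max(lengths) / mean(lengths)


def cap_lengths(lengths: list[int], max_len: int) -> list[int]:
    """Truncate every completion at max_len tokens (the 'cap' response)."""
    return [min(n, max_len) for n in lengths]


# Hypothetical batch: five ordinary completions and one long straggler.
lengths = [800, 900, 1000, 1100, 1200, 5000]
capped = cap_lengths(lengths, max_len=2000)

print(f"uncapped straggler factor: {straggler_factor(lengths):.2f}")
print(f"capped straggler factor  : {straggler_factor(capped):.2f}")
print(f"tokens dropped by the cap: {sum(lengths) - sum(capped)}")
```

The cap shortens the wait but discards tokens, which is the reason the table asks the framework to track what it dropped.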

The new invariant: policy version is part of the data

Pretraining wants the dataloader invisible. RL needs the policy version of every trajectory to be visible. You cannot correctly optimize the data plane without knowing which weights produced each sample and how stale it is — because the gradient correction (the importance-sampling ratio of RL 06) is a function of exactly that. Throughput without version accounting is just a faster way to train on the wrong distribution.
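One way to make the version visible is to carry it on every trajectory and compute staleness at admission time. This is a minimal sketch of bounded staleness only; the `Trajectory` fields and the `admit` helper are hypothetical names, and the importance-sampling correction itself is not implemented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    tokens: tuple[int, ...]
    reward: float
    policy_version: int  # learner step whose weights generated this sample


def staleness(traj: Trajectory, learner_version: int) -> int:
    """Number of learner updates since this sample was generated."""
    return learner_version - traj.policy_version


def admit(batch: list[Trajectory], learner_version: int, max_staleness: int):
    """Split a batch into trainable and dropped samples, returning both."""
    fresh, dropped = [], []
    for traj in batch:
        is_fresh = staleness(traj, learner_version) <= max_staleness
        (fresh if is_fresh else dropped).append(traj)
    return fresh, dropped


batch = [
    Trajectory(tokens=(1, 2, 3), reward=1.0, policy_version=10),
    Trajectory(tokens=(4, 5), reward=0.0, policy_version=7),
]
fresh, dropped = admit(batch, learner_version=10, max_staleness=2)
print(f"train on {len(fresh)} sample(s), dropped {len(dropped)} as too stale")
```

Returning the dropped list rather than discarding it keeps the accounting honest: the useful-token count in the goodput numerator should shrink by exactly what was dropped.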

4 · The bottleneck map, and the method that uses it

Put the five stages in a row, add the sync edge that closes the loop, and label each with the fix you reach for when it dominates. This is the picture to draw on a whiteboard before naming a single framework:

The method is the one this whole track runs (the index's five-step loop): measure the split, fix the dominant term, re-measure — because fixing the wall exposes the next one. Never guess. A worked split makes it concrete; the numbers below are the 7B example from RL 22a (K=16, L≈1024, 8×H100, disaggregated):

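τ_R ≈ 30 s (+ 1.4× straggler) >> τ_T ≈ 8 s · τ_ref ≈ 3 s · τ_S ≈ 1 s ⟹ τ_step ≈ max(τ_R + τ_ref, τ_T) + τ_S = max(33, 8) + 1 = 34 s (τ_V is not listed and is taken as negligible; the straggler factor is not folded into the 33 s)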

Read it: the actor-side path (rollout plus reference) is 33 of the 34 s, about 97% of the step, and rollout alone is about 88%. Doubling learner GPUs takes the step from 34 s to… 34 s. The only move that matters is shrinking τ_R: more actors, faster decode, or killing the straggler tail. That is what "linearized" means: the arithmetic named the wall, and the wall named the fix.
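The same conclusion falls out of a what-if pass over the worked split: halve one stage at a time and see which change moves the step. The sketch assumes τ_V = 0 (it is not listed in the split), leaves the straggler factor out, and treats "doubling learner GPUs" as halving τ_T, which presumes perfect scaling.

```python
def disaggregated_step(rollout, verifier, reference, trainer, sync):
    """tau_step for the disaggregated / synchronous placement."""
    return max(rollout + verifier + reference, trainer) + sync


# Worked split from the text; verifier time is an assumed 0 s.
split = {"rollout": 30.0, "verifier": 0.0, "reference": 3.0,
         "trainer": 8.0, "sync": 1.0}
base = disaggregated_step(**split)

# Halve one stage at a time and record the new step time.
what_if = {
    stage: disaggregated_step(**{**split, stage: seconds / 2})
    for stage, seconds in split.items()
}

for stage, new_step in what_if.items():
    print(f"halve {stage:<9}: {base:.1f} s -> {new_step:.1f} s")
```

Halving the trainer leaves the step unchanged, while halving rollout removes most of it, which is the measure-then-fix loop in miniature.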

Interactive · framework bottleneck triage

Set rough per-iteration times. The widget composes τ_step for the sync and async placements, names the binding wall, and prints the optimization family you would reach for first. It is deliberately crude — for design interviews and first-cut sizing, not for committing a cluster.
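The triage described above can be approximated in a few lines of Python. This is a crude sketch under stated assumptions, not the widget's actual logic: "sync" is taken to mean the disaggregated / synchronous placement, the binding wall is defined as the largest stage on that placement's critical path, and the fix families are copied from the candidate table (the reference stage has no row there, so its entry is a placeholder).

```python
# Fix families from the candidate table; the keys are this sketch's own labels.
FIXES = {
    "rollout": "vLLM/SGLang, continuous batching, speculative decode, more actors",
    "verifier": "remote env pools, caching, async reward queues, timeouts",
    "reference": "(no row in the candidate table; bandwidth-bound like the actor)",
    "trainer": "FSDP/Megatron, sequence packing, recompute, LoRA",
    "sync": "resharding, bucketing, DMA, relay, colocate",
}


def triage(rollout, verifier, reference, trainer, sync):
    """Compose tau_step for both placements and name the binding wall."""
    actor_path = {"rollout": rollout, "verifier": verifier, "reference": reference}
    actor_side = sum(actor_path.values())
    # Critical path of the synchronous placement: the slower branch, then sync.
    critical = dict(actor_path) if actor_side >= trainer else {"trainer": trainer}
    critical["sync"] = sync
    wall = max(critical, key=critical.get)
    return {
        "sync_step": max(actor_side, trainer) + sync,
        "async_step": max(actor_side, trainer, sync),
        "wall": wall,
        "reach_for": FIXES[wall],
    }


report = triage(rollout=30.0, verifier=0.0, reference=3.0, trainer=8.0, sync=1.0)
for key, value in report.items():
    print(f"{key:>10}: {value}")
```

Like the widget, this is for first-cut sizing only; it does not model the freshness cost that the async number hides.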

What carries forward

One equation governs everything: goodput = B·K·L̄_useful / τ_step. Frameworks either grow the numerator (more useful tokens) or shrink the denominator (shorter, more-overlapped step).
The loop is a pipeline, so τ_step is a sum (no overlap) or a max (overlap) of five stage times — and only the dominant stage moves the clock. That is why you measure the split before touching anything.
The resources fall out of the roofline walk over the five stages: decode (BW-bound), env (off-GPU tail), reference (BW-bound, a smaller cousin of decode), trainer (compute-bound), weight sync (network-bound), plus freshness, the statistical cost of overlapping them.
RL is harder than pretraining because the model makes its own data, that data is long-tailed and unevenly valuable, the generator moves every step, and actor vs learner want different layouts. The new invariant: policy version is part of the data.
Next: 09b derives the weight-sync term τ_S and the placement that pays or hides it; 09c derives the rollout term τ_R and the staleness it costs to overlap; 09d matches frameworks to whichever wall you measured.

Sources used

Source	System idea used
HybridFlow / verl	Hybrid controller and actor resharding between training and generation.
AReaL	Fully asynchronous generation/training with staleness-aware PPO.
AsyncFlow	Streaming data storage (TransferQueue), producer-consumer scheduling, service decoupling.
Laminar	Trajectory-level asynchrony, relay workers, and dynamic repacking.
Relax	Fault-isolated services, omni-modal stack, and a continuous staleness knob.
LlamaRL	Distributed async PyTorch and direct-memory weight synchronization.