The throughput equation — what you're actually optimizing
Lessons 19–22 give you topology, KV cache, scheduling, and memory math as a toolkit. This lesson is the assembly: write down the single equation those tools all push on, identify which term dominates under your setup, and the rest of system-level RL optimization writes itself as "lower the largest term until something else becomes the largest."
The optimization target, derived from the loop
Start where every previous lesson started: the controller's loop body, in seven lines.
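As a minimal sketch of what such a loop body could look like, the seven lines and a timer around the five expensive ones might read as follows. Every function name here is a hypothetical stand-in rather than a real API, and the stage bodies are toy placeholders.

```python
import time
from collections import defaultdict

# All stages are hypothetical placeholders; swap in real calls to measure a real run.
def sample_prompts(batch):
    return [f"prompt-{i}" for i in range(batch)]

def rollout(prompts, k):
    # K responses per prompt, kept grouped by prompt.
    return [(p, f"{p}/resp-{j}") for p in prompts for j in range(k)]

def verify(trajs):
    # Toy reward: alternate 1.0 / 0.0.
    return [float(i % 2 == 0) for i, _ in enumerate(trajs)]

def ref_logprobs(trajs):
    return [0.0 for _ in trajs]

def advantages(rewards, k):
    # Group-mean baseline over each block of K rewards (one simple choice).
    out = []
    for i in range(0, len(rewards), k):
        group = rewards[i:i + k]
        mean = sum(group) / len(group)
        out.extend(r - mean for r in group)
    return out

def train_step(trajs, adv, ref_lp):
    return 0.0

def sync_weights():
    return None

def timed(timings, name, fn, *args):
    # Accumulate wall-clock seconds for one stage under its tau label.
    start = time.perf_counter()
    out = fn(*args)
    timings[name] += time.perf_counter() - start
    return out

def controller_step(batch=4, k=2):
    t = defaultdict(float)
    prompts = sample_prompts(batch)                     # trivial
    trajs = timed(t, "tau_R", rollout, prompts, k)      # rollout
    rewards = timed(t, "tau_V", verify, trajs)          # verifier
    ref_lp = timed(t, "tau_ref", ref_logprobs, trajs)   # reference forward
    adv = advantages(rewards, k)                        # trivial
    timed(t, "tau_T", train_step, trajs, adv, ref_lp)   # forward + backward + step
    timed(t, "tau_S", sync_weights)                     # weight sync
    return dict(t)
```

Calling `controller_step()` returns a dict with one wall-clock entry per labelled term, which is the raw measurement the rest of the lesson reasons about.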
Each line consumes wall-clock. Label the five non-trivial terms τ_R (rollout), τ_V (verifier), τ_ref (reference forward), τ_T (trainer forward + backward + step), and τ_S (weight sync). Per-step wall-clock τ_step is some composition of those terms, and the composition depends entirely on the topology you chose in lesson 19:
τ_step (colocated) = τ_R + τ_V + τ_ref + τ_T + τ_S
τ_step (disaggregated) = max(τ_R + τ_V, τ_ref + τ_T) + τ_S
τ_step (fully async) ≈ max(τ_R + τ_V, τ_ref + τ_T, τ_S)
That is the throughput equation. Three lines, one per topology. Everything else in lessons 20–22 — paged attention, continuous batching, chunked CE, FP8 sync — is a constant-factor reduction on one of those terms. The job of an RL infra engineer is to identify which term dominates today and apply the cheapest patch that demotes it below something else.
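The compositions can be sketched as a small function. This is an illustration rather than a reference implementation: the `StepTimes` type and the topology strings are names invented here, and the colocated case is assumed to be a plain serial sum, on the reasoning that one shared pool of GPUs runs the stages back to back.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepTimes:
    """Per-step wall-clock for each term, in seconds (hypothetical container)."""
    rollout: float   # tau_R
    verifier: float  # tau_V
    ref: float       # tau_ref
    trainer: float   # tau_T
    sync: float      # tau_S

def tau_step(t: StepTimes, topology: str) -> float:
    gen = t.rollout + t.verifier    # rollout-side critical path
    train = t.ref + t.trainer       # trainer-side critical path
    if topology == "colocated":
        # Assumption: stages share one GPU pool and run serially.
        return gen + train + t.sync
    if topology == "disaggregated":
        # The two sides overlap; sync is a barrier after both finish.
        return max(gen, train) + t.sync
    if topology == "fully_async":
        # Sync is also overlapped, so only the slowest lane counts (approximately).
        return max(gen, train, t.sync)
    raise ValueError(f"unknown topology: {topology}")
```

With illustrative times of 30, 1, 3, 8 and 1 seconds, `tau_step` gives 43 s colocated, 32 s disaggregated and 31 s fully async, which shows how little the last step buys when the rollout side already dominates.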
The five terms, sized and bounded
For a representative 7B-class run on an 8×H100 node, with K=16, batch=64 prompts, and a mean response length of 1k tokens, the five terms typically size like this. Read the "what bounds it" column first: knowing what limits each term is what tells you where the patches go.
| Term | Typical share | What bounds it | Lowering it |
|---|---|---|---|
| τ_R rollout | 60–80% | HBM bandwidth on decode; compute on prefill | prefix cache · packing · paged KV · continuous batch · spec decode · FP8 |
| τ_V verifier | 1–10% | verifier I/O (sandbox spinup, judge call latency) | cache verifier calls · pre-launched sandboxes · async judges |
| τ_ref ref fwd | 3–8% | HBM bandwidth (it's a forward, no backward) | collocate with trainer · FP8 ref · skip on cached prompts |
| τ_T trainer | 10–26% | compute on forward+backward; activation memory on long L | chunked CE · sequence packing · activation recompute · TP/SP |
| τ_S weight sync | 2–10% | NCCL bandwidth · dtype cast · reshard layout mismatch | broadcast tree · bucket fuse · FP8 sync · zero-copy reshard · overlap with next prefill |
Two patterns that fall out and recur at every model scale:
- Rollout dominates. 60–80% is a real number; lesson 28 attributes most of it to decode-bandwidth. This is why lesson 22b is dedicated to the rollout side: if you optimize anywhere else first, you are optimizing the wrong term.
- Weight sync looks small until it isn't. 2–10% is the typical share, but it scales with model size and reshard distance between trainer and inference layouts. At 70B+ with TP=8 on trainer and TP=4 on rollout, the reshard alone can hit 20%. This is the wall lesson 22c is built against.
Per-role roofline: where each term lives on the hardware
The "what bounds it" column of the table is a roofline statement. Every kernel a GPU runs sits in one of three regimes:
- Compute-bound. Arithmetic intensity (FLOPs / byte) exceeds the hardware's roofline knee; the bottleneck is TFLOPS. Examples: prefill in rollout, forward + backward in trainer at long L.
- Memory-bandwidth-bound. Arithmetic intensity sits below the knee; the bottleneck is HBM bytes/sec. Examples: decode in rollout, reference forward, KV reads.
- Communication-bound. Most of wall-clock is spent moving bytes between GPUs or across the network. Examples: weight sync, all-reduce on backward at very high TP, NCCL allgather during MoE.
One reading of the chart: τ_R and τ_ref live below the knee, τ_T lives above it, τ_S lives off the chart entirely. This is why patches that work on the trainer (chunked CE, sequence packing) do nothing for the rollout (which needs paged KV, prefix cache, continuous batching) — they're on different sides of the roofline.
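To make the knee concrete, here is a rough sketch that classifies a weight-dominated forward pass as compute-bound or memory-bandwidth-bound. The hardware constants are approximate figures assumed for an H100 SXM, and the intensity model is deliberately coarse: it counts 2 FLOPs per parameter per token and assumes the weights are read once per pass, ignoring KV-cache and activation traffic. Communication-bound work is outside this model.

```python
# Assumed, approximate H100 SXM figures; check the datasheet for your part.
H100_PEAK_FLOPS = 989e12   # dense BF16 FLOP/s
H100_HBM_BW = 3.35e12      # HBM bytes/s

def knee(peak_flops=H100_PEAK_FLOPS, mem_bw=H100_HBM_BW):
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return peak_flops / mem_bw

def matmul_intensity(tokens_per_pass, bytes_per_param=2):
    """Coarse intensity of one pass over the weights.

    tokens_per_pass: tokens processed against each weight read
    (decode: one per sequence in the batch; prefill: all prompt tokens).
    """
    return 2 * tokens_per_pass / bytes_per_param

def regime(intensity, peak_flops=H100_PEAK_FLOPS, mem_bw=H100_HBM_BW):
    if intensity >= knee(peak_flops, mem_bw):
        return "compute-bound"
    return "memory-bandwidth-bound"

decode = regime(matmul_intensity(tokens_per_pass=32))     # 32 sequences, 1 token each
prefill = regime(matmul_intensity(tokens_per_pass=4096))  # 4096 prompt tokens
```

Under these assumptions the knee sits near 295 FLOPs/byte, so a decode step over 32 sequences lands well below it and a 4096-token prefill lands well above it, matching the split described above.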
Interactive: where is your wall-clock going?
Pick a model size, a cluster, a topology, and a batch shape. The widget computes each τ from a coarse first-principles model and shows where the wall-clock goes. The point isn't to predict absolute numbers — those depend on a dozen kernels — but to predict which term dominates, which is enough to choose the next optimization to apply.
The "where is my wall-clock going" decision tree
The optimization order is not "apply every patch from lesson 21." It is "measure, then patch the largest term, then re-measure." A decision tree that follows from the equation:
The point of the decision tree is the bottom annotation. You apply one patch, re-measure, and re-decide. If you batch up "all the optimizations" and apply them at once, you can't attribute the gain — and you'll spend the next month inheriting a stack of patches whose individual contributions you can't reproduce.
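One decision step of that measure, patch, re-measure loop might be sketched like this. The patch lists are copied from the "Lowering it" column of the table; the function name, the dict keys, and the choice to try patches in the table's listing order are assumptions made for illustration, not a ranking by cost.

```python
# Patches per term, taken from the "Lowering it" column of the table above.
PATCHES = {
    "tau_R": ["prefix cache", "packing", "paged KV", "continuous batch",
              "spec decode", "FP8"],
    "tau_V": ["cache verifier calls", "pre-launched sandboxes", "async judges"],
    "tau_ref": ["collocate with trainer", "FP8 ref", "skip on cached prompts"],
    "tau_T": ["chunked CE", "sequence packing", "activation recompute", "TP/SP"],
    "tau_S": ["broadcast tree", "bucket fuse", "FP8 sync", "zero-copy reshard",
              "overlap with next prefill"],
}

def next_patch(timings, already_applied=frozenset()):
    """Return (largest term, its share of the summed times, one patch to try).

    timings: measured seconds per term for the current configuration.
    Exactly one patch is returned so the next measurement can attribute the gain.
    """
    term = max(timings, key=timings.get)
    share = timings[term] / sum(timings.values())
    for patch in PATCHES[term]:
        if patch not in already_applied:
            return term, share, patch
    return term, share, None  # nothing left to try for this term
```

A caller would apply the returned patch, re-measure `timings`, add the patch to `already_applied`, and call `next_patch` again. Ranking by raw share is a simplification: under an overlapped topology, the term to attack is the largest one on the critical path.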
The tokens/sec formula, written out
Useful tokens per second expands into something you can actually log:
useful tok/s = (batch · K · mean L · keep_rate) / τ_step
Where keep_rate is the fraction of rollouts that survive dynamic sampling and the data-pipeline's online filter (lessons 13, 18a stage 7). For a healthy run, keep_rate sits at 0.7–0.95; when it drops to 0.4–0.5 you are wasting a lot of decode FLOPs on degenerate groups, which the lesson on data pipelines (18a) addresses.
Three useful invariants fall out of writing it this way:
- Doubling K does not double the gradient signal. Going from K₀ to K multiplies the signal by roughly √(K/K₀) under most variance-reduction baselines (lessons 11, 12, 14), so doubling K buys about 1.4×. But τ_R grows linearly in K (see lesson 22b), so past a point a larger K costs more in tokens/sec than it gains in signal/sec.
- Long L hurts twice. Once in τ_R (more decode steps), and again in τ_T (longer trainer forward — activation memory grows quadratically before mitigations). Length capping is one of the highest-leverage knobs.
- Topology only matters if the two largest terms are commensurate. If τ_R = 0.8 and τ_T = 0.1, switching from colocated to disaggregated saves at most 0.1 — barely worth the engineering. If τ_R = 0.5 and τ_T = 0.4, disaggregated nearly halves the step.
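These invariants can be checked numerically with a toy model. The sketch below assumes useful tokens/sec is kept tokens divided by step time, that signal scales as √(K/K₀), and that step time is a fixed cost plus a per-sample rollout cost that is linear in K. The function names and the linear step-time model are illustrative assumptions, and the topology comparison ignores every term except τ_R and τ_T.

```python
import math

def useful_tokens_per_sec(batch, k, mean_len, keep_rate, tau_step):
    """Tokens that survive filtering, per second of wall-clock."""
    return batch * k * mean_len * keep_rate / tau_step

def signal_per_sec(k, k0, tau_fixed, tau_per_sample):
    """Relative gradient signal per second under the toy model.

    Signal ~ sqrt(k / k0); step time ~ tau_fixed + tau_per_sample * k.
    In this model the ratio peaks at k = tau_fixed / tau_per_sample.
    """
    return math.sqrt(k / k0) / (tau_fixed + tau_per_sample * k)

def disaggregation_saving(tau_r, tau_t):
    """Step time saved by overlapping the two sides: (r + t) - max(r, t) = min(r, t)."""
    return min(tau_r, tau_t)

# With tau_fixed = 8 and tau_per_sample = 0.5, signal/sec peaks at K = 16.
curve = {k: signal_per_sec(k, k0=8, tau_fixed=8.0, tau_per_sample=0.5)
         for k in (8, 16, 32)}
```

In this toy setting `curve[16]` exceeds both `curve[8]` and `curve[32]`, and `disaggregation_saving` returns 0.1 for the 0.8/0.1 split and 0.4 for the 0.5/0.4 split, which is the asymmetry the third bullet describes.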
A worked example: where does a 7B run's hour go?
Concretize. 7B model, K=16, batch=64 prompts, mean L=1024 tokens, disaggregated topology on 8×H100. Each step is roughly:
| Term | Estimate | Why |
|---|---|---|
| τ_R | ~30 s | 64·16=1024 trajectories at 1k tokens each = 1M tokens decoded; at ~12k tok/s/H100 sustained with paged KV + continuous batching (lesson 22), 4 rollout-side H100s deliver ~48k tok/s ≈ 21 s of mean decode, plus a long-tail multiplier of ~1.4× (lesson 22b) |
| τ_V | ~1 s | verifier checks are cheap (~ms each) and parallelize trivially |
| τ_ref | ~3 s | one forward pass over 1M tokens at FP8 on 4 trainer-side H100s ≈ 3 s memory-bound forward |
| τ_T | ~8 s | forward + backward over the same 1M tokens at BF16 with chunked CE ≈ 8 s |
| τ_S | ~1 s | 7B params, FP8 broadcast across the node, ~1 s of NCCL |
Under disaggregated topology, τ_step = max(τ_R + τ_V, τ_ref + τ_T) + τ_S = max(31, 11) + 1 = 32 seconds. Useful tok/s = (64 · 16 · 1024) / 32 ≈ 33k tok/s across the cluster.
Two things to notice. First, the trainer side (τ_ref + τ_T = 11 s) is sitting idle for the last 20 seconds of every step. That idle time is the opportunity async pipelining (lesson 22c) tries to claim. Second, the rollout-side wall-clock is mostly mean decode plus a long-tail multiplier; patching the tail (lesson 22b's bet) cuts τ_R by ~30% without changing the kernel.
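The arithmetic in this worked example can be replayed in a few lines. All inputs are the estimates from the table (rounded to whole seconds where the table rounds them); the variable names are chosen here for illustration.

```python
# Inputs from the worked example.
batch, k, mean_len = 64, 16, 1024
rollout_gpus, tok_per_sec_per_gpu = 4, 12_000
tail_multiplier = 1.4

tokens = batch * k * mean_len                                 # tokens decoded per step
mean_decode = tokens / (rollout_gpus * tok_per_sec_per_gpu)   # seconds, before the tail
tau_r_estimate = mean_decode * tail_multiplier                # the table rounds this to ~30 s

# Rounded per-term estimates from the table, in seconds.
tau_r, tau_v, tau_ref, tau_t, tau_s = 30, 1, 3, 8, 1

rollout_side = tau_r + tau_v
trainer_side = tau_ref + tau_t
tau_step = max(rollout_side, trainer_side) + tau_s   # disaggregated topology
useful_tok_per_sec = tokens / tau_step
trainer_idle = rollout_side - trainer_side           # seconds the trainer waits per step

# Removing the long tail entirely would shrink tau_R by this fraction.
tail_saving = 1 - 1 / tail_multiplier
```

This reproduces the 32 s step, roughly 33k useful tok/s, and 20 s of trainer idle time, and it shows where the "~30%" comes from: eliminating a 1.4× tail removes about 29% of τ_R.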