
The throughput equation — what you're actually optimizing

Lessons 19–22 give you topology, KV cache, scheduling, and memory math as a toolkit. This lesson is the assembly: write down the single equation those tools all push on, then identify which term dominates under your setup. Once you have both, the rest of system-level RL optimization writes itself as "lower the largest term until something else becomes the largest."

Where this lesson sits
This opens a three-lesson sub-track on system throughput. Lesson 22a (here) derives the optimization target from the controller loop. Lesson 22b drills into the largest single term for most setups — rollout long-tail stragglers — and the packing and dynamic-K patches. Lesson 22c covers async pipelining and the weight-sync engineering that makes async worth running.

The optimization target, derived from the loop

Start where every previous lesson started: the controller's loop body, in eight lines.

    for step in range(N):
        prompts   = sample_batch()                           # cheap
        rollouts  = INFERENCE.generate(policy, prompts, K)   # τ_R
        rewards   = ENV.score(rollouts)                      # τ_V
        logp_ref  = REF.forward(rollouts)                    # τ_ref
        logp_θ    = TRAINER.forward(policy, rollouts)        # part of τ_T
        advantages, loss = algo(rewards, logp_θ, logp_ref)
        TRAINER.backward(loss); optimizer.step()             # rest of τ_T
        WEIGHT_SYNC.push(policy → INFERENCE)                 # τ_S

Each line consumes wall-clock. Label the five non-trivial terms τ_R (rollout), τ_V (verifier), τ_ref (reference forward), τ_T (trainer forward + backward + step), and τ_S (weight sync). Per-step wall-clock τ_step is some composition of those terms, and the composition depends entirely on the topology you chose in lesson 19:

τ_step (colocated) = τ_R + τ_V + τ_ref + τ_T + τ_S
τ_step (disaggregated) = max(τ_R + τ_V, τ_ref + τ_T) + τ_S
τ_step (fully async) ≈ max(τ_R + τ_V, τ_ref + τ_T, τ_S)

That is the throughput equation. Three lines, one per topology. Everything else in lessons 20–22 — paged attention, continuous batching, chunked CE, FP8 sync — is a constant-factor reduction on one of those terms. The job of an RL infra engineer is to identify which term dominates today and apply the cheapest patch that demotes it below something else.
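The three compositions can be written as one small function. This is a minimal sketch, not any framework's API: `StepTerms` and `step_time` are hypothetical names introduced here, and the sample values are illustrative placeholders in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepTerms:
    """Per-step wall-clock of each role, in seconds (hypothetical container)."""
    rollout: float   # τ_R
    verifier: float  # τ_V
    ref: float       # τ_ref
    trainer: float   # τ_T
    sync: float      # τ_S


def step_time(t: StepTerms, topology: str) -> float:
    """Compose the five terms into τ_step for one of the three topologies."""
    gen_stream = t.rollout + t.verifier   # rollout-side stream
    train_stream = t.ref + t.trainer      # trainer-side stream
    if topology == "colocated":
        # Everything serializes on the same GPUs.
        return gen_stream + train_stream + t.sync
    if topology == "disaggregated":
        # The two streams overlap; weight sync still sits on the critical path.
        return max(gen_stream, train_stream) + t.sync
    if topology == "async":
        # Weight sync overlaps as well, so only the slowest stream counts.
        return max(gen_stream, train_stream, t.sync)
    raise ValueError(f"unknown topology: {topology!r}")


# Illustrative placeholder values, in seconds.
terms = StepTerms(rollout=30.0, verifier=1.0, ref=3.0, trainer=8.0, sync=1.0)
for name in ("colocated", "disaggregated", "async"):
    print(f"{name:>14}: {step_time(terms, name):.1f} s")
```

With these placeholder values the rollout stream sets the step time in both overlapped topologies, which is the common case the rest of the lesson assumes.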

The single useful definition of "throughput"
Useful tokens per second = (B · K · L̄) / τ_step, where B is prompt batch, K is rollouts per prompt, L̄ is mean response length. Useful means tokens whose log-prob participates in a gradient step. Tokens that were generated but discarded by dynamic sampling (lesson 13) or by online filter (lesson 18a stage 7) are not useful — they are part of τ_R but not the numerator. Optimizations that move generated tokens from "discarded" to "useful" raise throughput without lowering any term.

The five terms, sized and bounded

For a representative 7B-class run on an 8×H100 node, with K=16, batch=64 prompts, and mean response length 1k tokens, the five terms typically size like this. Read the "What bounds it" column closely: it names the resource that limits each term, and that tells you where the patches go.

| Term | Typical share | What bounds it | Lowering it |
| τ_R rollout | 60–80% | HBM bandwidth on decode; compute on prefill | prefix cache · packing · paged KV · continuous batch · spec decode · FP8 |
| τ_V verifier | 1–10% | verifier I/O (sandbox spinup, judge call latency) | cache verifier calls · pre-launched sandboxes · async judges |
| τ_ref ref fwd | 3–8% | HBM bandwidth (it's a forward, no backward) | collocate with trainer · FP8 ref · skip on cached prompts |
| τ_T trainer | 10–26% | compute on forward+backward; activation memory on long L | chunked CE · sequence packing · activation recompute · TP/SP |
| τ_S weight sync | 2–10% | NCCL bandwidth · dtype cast · reshard layout mismatch | broadcast tree · bucket fuse · FP8 sync · zero-copy reshard · overlap with next prefill |

Two patterns that fall out and recur at every model scale:

Per-role roofline: where each term lives on the hardware

The "what bounds it" column of the table is a roofline statement. Every kernel a GPU runs sits in one of three regimes:

[Figure: per-role roofline. x-axis: arithmetic intensity (FLOPs / byte, log scale); y-axis: attained TFLOPS (log scale). The sloped segment has slope = HBM bandwidth and meets the flat roof (peak TFLOPS ≈ 989 on H100) at the knee. Plotted points:]
    rollout decode (τ_R): memory-BW bound · AI ≈ 1
    ref forward (τ_ref): memory-BW bound · AI ≈ 3
    rollout prefill: compute bound · AI ≈ 100–150
    trainer fwd+bwd (τ_T): compute bound · AI ≈ 400
    weight sync (τ_S): network-BW bound · off-roofline

One reading of the chart: τ_R and τ_ref live below the knee, τ_T lives above it, τ_S lives off the chart entirely. This is why patches that work on the trainer (chunked CE, sequence packing) do nothing for the rollout (which needs paged KV, prefix cache, continuous batching) — they're on different sides of the roofline.
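The roofline itself is a single `min`. The sketch below is a rough illustration: the 989 TFLOPS peak comes from the chart, while the 3.35 TB/s HBM bandwidth is an assumption (the commonly quoted H100 SXM figure) and `attained_tflops` is a hypothetical helper name. The knee position depends on which peak and bandwidth figures you plug in, so points near it can land on either side.

```python
def attained_tflops(ai: float,
                    peak_tflops: float = 989.0,    # from the chart (H100)
                    hbm_tb_per_s: float = 3.35     # ASSUMPTION: H100 SXM HBM3
                    ) -> float:
    """Roofline: the lower of the compute roof and the bandwidth slope.

    ai is arithmetic intensity in FLOPs per byte; TB/s x FLOPs/byte = TFLOPS.
    """
    return min(peak_tflops, ai * hbm_tb_per_s)


def knee_intensity(peak_tflops: float = 989.0,
                   hbm_tb_per_s: float = 3.35) -> float:
    """Arithmetic intensity at which the slope meets the compute roof."""
    return peak_tflops / hbm_tb_per_s


# Arithmetic intensities for three of the roles, taken from the chart.
for role, ai in [("rollout decode", 1.0), ("ref forward", 3.0),
                 ("trainer fwd+bwd", 400.0)]:
    tflops = attained_tflops(ai)
    print(f"{role:>16}: AI={ai:>5.0f}  attained ~{tflops:7.2f} TFLOPS "
          f"({100 * tflops / 989.0:5.1f}% of peak)")
print(f"knee at AI ~ {knee_intensity():.0f} FLOPs/byte")
```

Under these assumptions decode at AI ≈ 1 can reach well under 1% of peak FLOPS no matter how good the kernel is, which is why the decode patches all target bytes moved rather than arithmetic.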

First-principles consequence
Most of RL wall-clock is decode-bandwidth-bound. That single sentence explains why the optimization toolkit looks the way it does: prefix caching (read fewer KV bytes), paged attention (pack more sequences into the same HBM), continuous batching (raise effective batch on the same HBM read), FP8 KV (cut bytes per token). None of those touch FLOPS, because FLOPS are not the bottleneck on the term that dominates.

Interactive: where is your wall-clock going?

Pick a model size, a cluster, a topology, and a batch shape. The widget computes each τ from a coarse first-principles model and shows where the wall-clock goes. The point isn't to predict absolute numbers — those depend on a dozen kernels — but to predict which term dominates, which is enough to choose the next optimization to apply.

Per-role wall-clock model
Slide model size, K rollouts per prompt, mean response length, and topology. The bar chart is the predicted breakdown of τ_step; the KPIs report total time, useful tokens/sec, and the dominant term.
What to try.
  (1) 7B / K=16 / L=1024 / disagg: rollout dominates, ~70% of step.
  (2) Crank L to 8192 with same topology: rollout share grows past 85%; long sequences are the heaviest tax.
  (3) Drop to colocated: total τ_step jumps because everything serializes; tok/sec drops by ~30%.
  (4) 70B / async: weight sync becomes visible; at large models reshard cost grows faster than rollout.

The "where is my wall-clock going" decision tree

The optimization order is not "apply every patch from lesson 21." It is "measure, then patch the largest term, then re-measure." A decision tree that follows from the equation:

[Figure: decision tree]
measure τ_step components (torch.profiler · nvidia-smi)
    τ_R dominates? (decode → memory BW)
        prefix cache · continuous batch · paged KV · FP8 KV · + lesson 22b patches
    τ_T dominates? (forward+backward → compute)
        chunked CE · sequence packing · activation recompute · TP / SP
    τ_S dominates? (weight sync → comm BW)
        broadcast tree · bucket fuse · FP8 sync · zero-copy · + lesson 22c patches
re-measure after every patch: the dominant term changes; optimization is iterative, not one-shot

The point of the decision tree is the bottom annotation. You apply one patch, re-measure, and re-decide. If you batch up "all the optimizations" and apply them at once, you can't attribute the gain — and you'll spend the next month inheriting a stack of patches whose individual contributions you can't reproduce.
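The measure-then-decide loop can be sketched with nothing more than a wall-clock timer. `StepTimer` and `PATCHES` are hypothetical names introduced here; the patch lists are copied from the decision tree, and the `time.sleep` calls stand in for real rollout, trainer, and sync work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Candidate patches per dominant term, copied from the decision tree.
PATCHES = {
    "rollout": ["prefix cache", "continuous batch", "paged KV", "FP8 KV"],
    "trainer": ["chunked CE", "sequence packing", "activation recompute", "TP / SP"],
    "sync": ["broadcast tree", "bucket fuse", "FP8 sync", "zero-copy"],
}


class StepTimer:
    """Accumulates wall-clock seconds per named section of the loop."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name: str):
        # NOTE: on a GPU, kernels launch asynchronously, so a real harness
        # must synchronize the device before reading the clock.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def dominant(self) -> str:
        """Name of the section with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


timer = StepTimer()
with timer.section("rollout"):
    time.sleep(0.03)    # stand-in for INFERENCE.generate
with timer.section("trainer"):
    time.sleep(0.01)    # stand-in for forward + backward + step
with timer.section("sync"):
    time.sleep(0.001)   # stand-in for WEIGHT_SYNC.push

worst = timer.dominant()
print(f"dominant term: {worst}; try one of {PATCHES.get(worst, [])}, then re-measure")
```

The sketch deliberately returns a list rather than applying anything: one patch goes in, the timer runs again, and the dominant term is re-read before the next decision.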

The tokens/sec formula, written out

The useful definition above expands into something you can actually log:

tokens/sec = (B · K · L̄ · keep_rate) / τ_step

Here keep_rate is the fraction of rollouts that survive dynamic sampling and the data pipeline's online filter (lessons 13, 18a stage 7). For a healthy run, keep_rate sits at 0.7–0.95; when it drops to 0.4–0.5 you are spending roughly half of your decode time on degenerate groups that never reach a gradient step, which the lesson on data pipelines (18a) addresses.
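As a loggable metric, the formula is a one-liner. This is a minimal sketch: `useful_tokens_per_sec` is a hypothetical helper name, and the 32-second step and the two keep rates are illustrative values chosen from the ranges quoted above.

```python
def useful_tokens_per_sec(batch: int, k: int, mean_len: float,
                          keep_rate: float, step_seconds: float) -> float:
    """tokens/sec = (B * K * mean L * keep_rate) / τ_step."""
    if not 0.0 <= keep_rate <= 1.0:
        raise ValueError("keep_rate must be a fraction in [0, 1]")
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return batch * k * mean_len * keep_rate / step_seconds


# Same batch shape and step time, two different keep rates.
healthy = useful_tokens_per_sec(64, 16, 1024, keep_rate=0.90, step_seconds=32.0)
degraded = useful_tokens_per_sec(64, 16, 1024, keep_rate=0.45, step_seconds=32.0)
print(f"healthy : {healthy:,.0f} tok/s")
print(f"degraded: {degraded:,.0f} tok/s")
```

Halving keep_rate halves useful throughput even though no τ term moved, which is why keep_rate is worth logging next to τ_step rather than folding it into a single number.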

Three useful invariants fall out of writing it this way:

  1. Doubling K is not 2× the gradient signal. It is roughly √(K/K₀) times the signal under most variance-reduction baselines (lessons 11, 12, 14), so doubling K buys about 1.4×. But τ_R grows linearly in K (see lesson 22b). So past a point, larger K costs wall-clock faster than it adds signal, and signal per second falls.
  2. Long L hurts twice. Once in τ_R (more decode steps), and again in τ_T (longer trainer forward — activation memory grows quadratically before mitigations). Length capping is one of the highest-leverage knobs.
  3. Topology only matters if the two largest terms are commensurate. If τ_R = 0.8 and τ_T = 0.1, switching from colocated to disaggregated saves at most 0.1 — barely worth the engineering. If τ_R = 0.5 and τ_T = 0.4, disaggregated nearly halves the step.
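Invariants 1 and 3 are both short pieces of arithmetic. The sketch below uses hypothetical helper names and two simplifying assumptions that are not in the text: step time is taken to scale linearly with K (true only while τ_R dominates), and the topology comparison ignores τ_V, τ_ref, and τ_S.

```python
import math


def relative_signal_per_sec(k: int, k0: int) -> float:
    """Signal per second at K rollouts, relative to a baseline K0.

    ASSUMPTIONS: signal scales as sqrt(K/K0) and step time as K/K0.
    """
    signal = math.sqrt(k / k0)
    time_cost = k / k0
    return signal / time_cost


def disaggregation_speedup(rollout: float, trainer: float) -> float:
    """Colocated step time divided by disaggregated step time.

    ASSUMPTION: only the rollout and trainer terms are counted.
    """
    colocated = rollout + trainer
    disaggregated = max(rollout, trainer)
    return colocated / disaggregated


print(f"K 16 -> 32: signal/sec x{relative_signal_per_sec(32, 16):.2f}")
print(f"tau_R=0.8, tau_T=0.1: speedup x{disaggregation_speedup(0.8, 0.1):.3f}")
print(f"tau_R=0.5, tau_T=0.4: speedup x{disaggregation_speedup(0.5, 0.4):.3f}")
```

Under these assumptions the lopsided split gains about 12% from disaggregation while the balanced split gains about 80%, which matches the "barely worth it" versus "nearly halves" contrast in invariant 3.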

A worked example: where does a 7B run's hour go?

Concretize. 7B model, K=16, batch=64 prompts, mean L=1024 tokens, disaggregated topology on 8×H100. Each step is roughly:

| Term | Estimate | Why |
| τ_R | ~30 s | 64·16=1024 trajectories at 1k tokens each = 1M tokens decoded; at ~12k tok/s/H100 sustained with paged KV + continuous batching (lesson 22), 4 rollout-side H100s deliver ~48k tok/s ≈ 21 s of mean decode, plus a long-tail multiplier of ~1.4× (lesson 22b) |
| τ_V | ~1 s | verifier checks are cheap (~ms each) and parallelize trivially |
| τ_ref | ~3 s | one forward pass over 1M tokens at FP8 on 4 trainer-side H100s ≈ 3 s memory-bound forward |
| τ_T | ~8 s | forward + backward over the same 1M tokens at BF16 with chunked CE ≈ 8 s |
| τ_S | ~1 s | 7B params, FP8 broadcast across the node, ~1 s of NCCL |

Under disaggregated topology, τ_step = max(τ_R + τ_V, τ_ref + τ_T) + τ_S = max(31, 11) + 1 = 32 seconds. Taking keep_rate = 1, useful tok/s = (64 · 16 · 1024) / 32 ≈ 33k tok/s across the cluster.
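The arithmetic of the worked example can be replayed in a few lines. Every number below comes from the table and the paragraph above; the variable names are illustrative, and keep_rate is taken as 1.

```python
# Batch shape from the worked example.
B, K, MEAN_LEN = 64, 16, 1024
tokens = B * K * MEAN_LEN                    # about 1M tokens per step

# Rollout estimate: 4 rollout-side H100s at ~12k tok/s each, 1.4x long tail.
decode_rate = 4 * 12_000                     # tok/s across the rollout side
mean_decode_s = tokens / decode_rate         # about 21-22 s
tau_r_estimate = mean_decode_s * 1.4         # about 30 s, as in the table

# Rounded estimates from the table, in seconds.
tau_r, tau_v, tau_ref, tau_t, tau_s = 30.0, 1.0, 3.0, 8.0, 1.0

# Disaggregated composition, then useful throughput with keep_rate = 1.
tau_step = max(tau_r + tau_v, tau_ref + tau_t) + tau_s
tok_per_sec = tokens / tau_step

print(f"tokens per step   : {tokens:,}")
print(f"tau_R from rates  : {tau_r_estimate:.1f} s")
print(f"tau_step          : {tau_step:.0f} s")
print(f"useful tok/s      : {tok_per_sec:,.0f}")
```

The rate-based estimate lands within a second of the rounded 30 s in the table, so the rounding does not change which stream sets the step time.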

Two things to notice. First, the trainer side (τ_ref + τ_T = 11 s) is sitting idle for the last 20 seconds of every step. That idle time is the opportunity async pipelining (lesson 22c) tries to claim. Second, the rollout-side wall-clock is mostly mean decode plus a long-tail multiplier; patching the tail (lesson 22b's bet) cuts τ_R by ~30% without changing the kernel.
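Both observations reduce to subtraction and division on the table's estimates. The sketch below is a best-case bound, not a prediction: it assumes the long-tail multiplier could be driven all the way from 1.4× to 1.0×, and the variable names are illustrative.

```python
# Rounded estimates from the worked example, in seconds.
TAU_R, TAU_V, TAU_REF, TAU_T, TAU_S = 30.0, 1.0, 3.0, 8.0, 1.0
TAIL_MULTIPLIER = 1.4                      # long-tail factor inside TAU_R

gen_stream = TAU_R + TAU_V                 # rollout-side stream
train_stream = TAU_REF + TAU_T             # trainer-side stream
step = max(gen_stream, train_stream) + TAU_S

# Observation 1: how long the trainer side waits on the rollout side.
trainer_idle = max(0.0, gen_stream - train_stream)
idle_share = trainer_idle / step

# Observation 2: best case if the tail were removed entirely (ASSUMPTION).
tau_r_no_tail = TAU_R / TAIL_MULTIPLIER
tau_r_cut = 1.0 - tau_r_no_tail / TAU_R
step_no_tail = max(tau_r_no_tail + TAU_V, train_stream) + TAU_S

print(f"trainer idle      : {trainer_idle:.0f} s ({idle_share:.1%} of the step)")
print(f"tau_R without tail: {tau_r_no_tail:.1f} s (cut of {tau_r_cut:.1%})")
print(f"tau_step          : {step:.0f} s -> {step_no_tail:.1f} s")
```

Even in this best case the rollout stream remains longer than the trainer stream, so removing the tail shrinks the idle window without closing it.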

Modeling caveats
Two simplifications worth naming. (a) The disaggregated formula folds the verifier into the rollout stream (typical when env.score is in-process with generate) and the reference into the trainer stream (typical when ref shares the trainer's TP group). With dedicated verifier nodes (sandboxed code, LLM-judge) or a separate ref pool, these become their own streams and the equation grows two more max-arguments. (b) The per-role roofline numbers above are calibrated to a 7B BF16 model on H100 with paged KV and continuous batching — real numbers vary 2–3× with sequence length, batch shape, and engine choice.
Takeaway
Per-step wall-clock is max or sum over five terms — rollout, verifier, ref forward, trainer, sync — whose composition is set by topology. Each term has a roofline; each roofline has a different optimization toolkit. The job is to measure, find the dominant term, lower it by one patch, re-measure. Lessons 22b and 22c go deep on the two terms that dominate in practice — rollout long-tail and weight sync — but every system patch in lessons 20–28 maps back to lowering one of these five.