Throughput bottlenecks, the usual optimizations, and how to identify them
An RL training step is a pipeline of five distinct workloads. Each has different hardware physics, different memory pressure, and different failure modes. This lesson is the systems-level answer: where the wall-clock goes, which knob moves it, and how you'd diagnose any of it on a real cluster.
This lesson zooms in on the cost pressure from lesson 26's three-forces frame. Whichever combination of signal + estimator you've picked, the cost question is the same: five workloads, one critical path, where does the wall-clock actually go and how do we move it?
Where the wall-clock goes — the typical step
One training step touches five workloads. The visualization below is a stacked-bar timeline (one step, in seconds) for a representative 7B-parameter run on 8× H100 in a disaggregated topology. Click any segment to see the physics of that workload.
The shape is consistent across published profiles: rollout is 60–80% of wall-clock, training is 10–26%, reference and sync sit in the low single-digit percent each. Any optimization budget should be spent in proportion to that distribution.
Bottleneck 1 · Rollout — memory-bandwidth-bound decode
The rollout engine spends nearly all its time in autoregressive decode: emit one token at a time, each time reading the full set of model weights and the full KV cache so far. The compute per token is trivial; the bandwidth dominates. Concrete numbers, 7B model on H100:
That ceiling tells you where to look for slack: anything that reduces bytes-moved-per-token, or that hides one stream under another, is a real optimization. The most-cited ones, in order of leverage:
| Optimization | Where it lives | Win on typical RL |
|---|---|---|
| Prefix caching — share the prompt's KV across K rollouts | vLLM APC, SGLang RadixAttention | 3–10× on K-rollout RL (depends on prompt/completion ratio) |
| Continuous batching — admit new sequences into freed slots | vLLM default since 2023 | 2–20× over padded static batching (more when length variance is high) |
| Paged KV cache — block-granular allocation, no fragmentation | vLLM, SGLang | 2–4× max-batch over contiguous |
| Chunked prefill — interleave long prefills with ongoing decodes | vLLM scheduler flag | 10–20% on long-prompt workloads |
| Speculative decoding — draft K tokens, verify in one batch | EAGLE-2/3, Medusa | 1.5–3× at greedy/low T (high draft acceptance); degrades toward ~1× as T rises and acceptance drops |
| KV cache quantization (int8/fp8) — halve KV bytes | vLLM kv_cache_dtype flag | 1.5–2× decode; needs log-prob match audit |
| GQA / MQA — fewer KV heads | Model architecture | 4–8× smaller KV; baked into modern models |
| Tensor parallel decode — shard weights across GPUs | vLLM tp_size | shrinks per-GPU weight read; adds collective overhead |
Bottleneck 2 · Training — compute-bound, but memory-fragile
The trainer is the opposite shape: one forward+backward over the packed batch. Forward does one matmul per linear layer (~2·P·tokens FLOPs); backward does two matmuls per linear layer (one for ∂L/∂W, one for ∂L/∂x → ~4·P·tokens FLOPs). Hence the canonical "6·P·tokens" rule per step. At 7B with packed batch of 32k tokens: 1.3 PFLOPs per step. An H100 sustains ~800 TFLOPs/s real-world bf16, so a single H100 takes ~1.6 s; 8× FSDP brings that to ~0.2 s plus all-gather overhead.
The bottleneck flips between two regimes depending on sequence length:
- Compute-bound at short L: matmul throughput dominates. Optimization is "use TF32/BF16/FP8 well" and "minimize FSDP all-gather waste".
- Memory-bound at long L: activation memory (∝ N·L·d for residuals, plus L² for attention) crowds out parameters and gradients. Optimization is "checkpoint activations, shard sequence dim, kill the (B,T,V) logits tensor".
| Optimization | What it does | When it pays |
|---|---|---|
| FSDP / ZeRO-3 | Shard params, grads, optimizer state across DP group. All-gather weights for forward. | Always above one GPU. |
| Activation checkpointing | Don't store all activations; recompute on backward. | Any sequence past ~2k tokens. ~33% extra forward FLOPs for ~10× activation memory. |
| Chunked cross-entropy | Compute log_softmax + gather in chunks over V, never materialize (B,T,V). | Any vocab ≥ 50k. ~30–50% trainer-forward memory savings. |
| Packed varlen attention | Pack many short trajectories into one (B=1, T_total) tensor with cu_seqlens. | Variable-length rollouts (always, in RL). 30–60% fewer FLOPs vs padded. |
| Sequence parallel | Shard the sequence dim across TP group for very long L. | Long-context RL (32k+ tokens). |
| Gradient accumulation | Multiple forwards before one optimizer step. | When effective batch > what fits. |
| BF16 forward + FP32 master + BF16 grad reduction | Mixed precision the safe way. | Default; halves activation + grad memory vs FP32. |
Bottleneck 3 · Reference — forward-only, often hideable
Reference scoring is one forward pass per trajectory. No backward, no optimizer state. Cost is small in absolute terms; the question is whether it lands on the critical path. Three options:
- Run ref inline on the trainer GPUs after rollout. Simplest, but it serializes with everything else.
- Run ref on a separate pool overlapped with rollout. Costs a pool of GPUs but takes ref off the critical path entirely.
- Fuse ref into the trainer's forward. Either via a shared trunk + two heads, or by computing ref_logp during the same packed forward.
For algorithms that need ref every step (PPO, GRPO with KL), the inline cost is ~10% of step time at 7B. For larger models, a separate pool wins.
Bottleneck 4 · Weight sync — bandwidth-bound, hideable
Every step, fresh trainer weights have to reach the rollout engine. The mechanics, from lesson 25:
- FSDP all-gather on the trainer (turn sharded weights into full ones).
- Dtype cast from BF16 master to the inference dtype (BF16 / FP8).
- NCCL broadcast from trainer rank 0 to all rollout ranks.
- Reshard into the rollout engine's TP layout.
For 70B BF16: 140 GB of weights moving across the cluster. Over NVLink (~600 GB/s) this is ~1 s; over InfiniBand (~50 GB/s) it's 3–20 s depending on fabric topology and contention. At 5–10 second total step times, that's already a significant fraction.
| Optimization | Why it pays |
|---|---|
| Sync less often (every N steps) | Drops cost linearly. Costs you a growing π_old vs π gap; clip fires more, but if your training is otherwise stable this is a free win. |
| Overlap sync with the next rollout's prefill | Hides sync entirely. Requires careful layer-boundary swap to avoid mid-forward inconsistency. |
| Fused dtype cast inside the broadcast | One HBM round-trip instead of two. |
| Async pipelines with versioned weights | Sync runs on its own cadence; trainer and rollout both stay busy. Cost: off-policy bookkeeping. |
| IPC handoff (colocated) | If rollout and trainer share GPUs, the "sync" is a pointer swap. ~zero copy. |
Bottleneck 5 · Algorithm + verifier — usually free, sometimes a long tail
Computing advantages, group statistics, KL — the actual loss math — is microseconds on modern GPUs. Except when the verifier is slow. Code execution can take 0.1–5 s per rollout; web verifiers can take 5–60 s. At K=64 rollouts per prompt these become the wall-clock anchor regardless of how fast the GPU is.
The fix is uniformly architectural: verifier dispatcher. Maintain a CPU/CPU-pool that runs verifiers asynchronously while the GPU continues generating other trajectories. The GPU never blocks on a verifier.
Interactive · the budget allocator
The widget below lets you set the percentages of step time spent in each role, then choose which optimizations to apply, and reports the new step time + the recipe-level win. Use it to see why "prefix caching first" is the right opening move.
How to identify the bottleneck in practice
An RL training run is a multi-process distributed system. "It's slow" can mean any of a dozen things. The diagnosis playbook, in the order a real engineer follows it:
Step 1 — wall-clock attribution per role
The first piece of code in any RL job should be per-role timing. From controller.py:
t0 = time.time(); trajs = rollout.generate(env, K); t_rollout = time.time() - t0
t0 = time.time(); reference.score(trajs); t_ref = time.time() - t0
t0 = time.time(); algorithm.compute_advantages(...) t_algo = time.time() - t0
t0 = time.time(); trainer.train_step(batch, loss) t_train = time.time() - t0
t0 = time.time(); weight_syncer.sync(trainer, ...) t_sync = time.time() - t0
log_to_wandb({"t/rollout": t_rollout, "t/ref": t_ref, ...})
Run for ~20 steps. Look at the wandb breakdown. Anything that's not 60–75% rollout means the system is unusual (good or bad). Don't optimize past this step.
Step 2 — GPU utilization vs memory occupancy
nvidia-smi -l 1 in a side window during training. Two diagnostic signatures:
- GPU util ≈ 40–60% but memory ≈ 80%+: You're memory-bound (rollout decode). Look at KV cache size, batch size, prefix caching status.
- GPU util ≈ 90%+ but memory ≈ 30%: You're compute-bound (trainer or prefill). Increase batch size or pack more.
- GPU util oscillating between 0% and 100% with periodicity: You're sync-bound (waiting on weight broadcast) or verifier-bound (waiting on tool-output).
Step 3 — torch profiler for the slow role
Once you know which role to look at, drop in torch.profiler for a few steps of that role only:
from torch.profiler import profile, ProfilerActivity, schedule
with profile(activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
on_trace_ready=lambda p: p.export_chrome_trace("trace.json")) as prof:
for _ in range(5):
trainer.train_step(batch, loss)
prof.step()
Open trace.json in chrome://tracing or Perfetto. Look for:
- NCCL kernels (ncclKernel_AllGather, ncclKernel_Broadcast) taking a noticeable fraction of the timeline → FSDP or sync is the bottleneck.
- Attention kernels (FlashAttn variants) dominating → expected. Compare to expected FLOPs.
- log_softmax / cross_entropy > 10% of train time → chunked CE not on. Easy win.
- Empty / idle stretches on the GPU timeline → CPU stalls (tokenization, verifier, Python overhead).
Step 4 — memory peak
torch.cuda.reset_peak_memory_stats()
trainer.train_step(batch, loss)
peak = torch.cuda.max_memory_allocated() / 1e9
log_to_wandb({"mem/peak_gb": peak})
If peak ≫ allocated, activations are dominating and checkpointing or chunked CE will help. If peak ≈ allocated and high, you're at the memory ceiling and need FSDP or sharding.
Step 5 — Nsight Systems for the cross-GPU story
For multi-node distributed bottlenecks (FSDP all-gather, NCCL broadcast), nsys is the right tool:
nsys profile -t cuda,nvtx,nccl -o trace python train.py
# open trace.nsys-rep in Nsight Systems UI
You'll see per-GPU timelines aligned with NCCL collective bars. Look for misalignment (some ranks waiting for a slow one) and bubble time (gaps between kernels).
Correctness canaries — the bugs that masquerade as performance
Before you optimize anything, make sure the run is correct. A slow-looking run is sometimes a broken run; you don't want to spend a week speeding up training that isn't actually learning. Three canaries worth monitoring every step:
- Log ρ histogram. The PPO ratio should be centered on 1 with light tails. If it skews systematically or the tails grow, you have a log-prob match bug (lesson 25) — rollout and trainer aren't agreeing on what π says.
- frac_clipped. From lesson 23's failure-mode taxonomy: > 30% means π_old has drifted too far from π_θ (stale sync, async without versioning, or a kernel mismatch).
- KL(π_θ ‖ π_ref) trajectory. Should grow smoothly. A sudden spike means the policy lurched (entropy collapse, reward hacking, or numerics bug).
Long-run failure modes — what fails on day 10
Long RL runs always fail. The diagnostic playbook above covers "step 1 is slow". This section covers "step 10000 broke" — the reliability checklist the engineer (lesson 24) owns.
| Failure mode | Detection signal | Mitigation |
|---|---|---|
| KV cache fragmentation creep | Rollout tokens/sec drifts down 20%+ over days; engine reports paged-block hit rate falling | Restart engine on a cadence; or upgrade to a paged allocator that compacts |
| Verifier slowdown | Reward call p99 latency creeping up; verifier queue backing up | Async dispatcher, GPU-accelerated verifier, or hot-reload of stale sandbox VMs |
| Reward hacking | Train reward up, held-out eval flat or down | Held-out eval at every K steps; flag divergence; review verifier for exploits |
| NCCL / network partition | One rank stuck in all-reduce; step time spikes | NCCL timeout + watchdog; auto-restart with rank-aware checkpointing |
| One rollout worker OOMs on adversarial prompt | Single-worker crash; if not isolated, brings down the trainer | Detached process groups, request-level circuit breakers, dead-letter queue for bad prompts |
| Optimizer state corruption / sharded checkpoint mismatch | Loss spike on resume; gradient norm jumps | Versioned sharded checkpoints (params + optimizer + RNG + sync gen_id); statistically-equivalent resume test |
| Throughput regression after a sync | Step time creeps up after weight sync; ratio histogram skews | Compare attention kernels between rollout and trainer; audit log-prob match (canaries above) |
The minimum reliability surface — anything the engineer should ship before a multi-day run:
- Component health metrics: per-step rollout tok/s, trainer step time, weight-sync latency, ref forward time, reward distribution, KL to reference, gradient norm. Per-trajectory: length distribution, tool error rate, verifier latency.
- Sharded checkpointing: trainer params + optimizer + RNG + rollout-engine snapshot + sync generation. Resume should yield a statistically equivalent next step. (Bitwise determinism is broken by NCCL nondeterminism and verifier sandboxes in practice.)
- Failure isolation: a rollout worker dying doesn't take the trainer with it. Detached process groups, request-level circuit breakers, dead-letter queues.
- Held-out eval cadence: every K steps, run a held-out eval; flag when reward goes up but eval flat. This is the canary for reward hacking.
- Throughput regression alarms: a sudden 20% step-time creep is the cheapest signal you'll get for KV fragmentation, verifier degradation, or network contention.
The ROI-ordered playbook
Combining everything: when an engineer joins a slow RL run, the order of attack is:
- Profile first. Wall-clock per role. Anything not ~60–75% rollout means the profile is wrong or the setup is unusual.
- Prefix caching on the rollout engine. Single highest-leverage RL optimization.
- Continuous batching + chunked prefill. Pad-free, decode-prefill mix. Easy 2× rollout.
- Chunked cross-entropy on the trainer. Memory win that unlocks longer sequences.
- Packed varlen attention. 30–60% fewer FLOPs in train.
- Log-prob match audit. Fix this before chasing speed — it's the silent correctness bug surface.
- Speculative decoding for low-T rollouts. Conditional 1.5–2×.
- Sync overlap (or sync-every-N). Multi-node win.
- KV cache quantization (int8/fp8) with explicit log-prob match audit.
- Custom fused PPO / KL kernel. Marginal. Do last.
Where do bottlenecks land? Rollout decode (60–80%), trainer compute and activation memory (10–26%), the rest (sync + ref + algo, ~10% combined). Verifiers can balloon to dominate if they're slow.
What are the usual optimizations? Per role: prefix caching, continuous batching, paged KV, chunked prefill, spec decode, KV quant, GQA (rollout); FSDP, activation checkpointing, chunked CE, packed varlen, sequence parallel (trainer); separate pool (ref); async overlap, sync-less-often (sync). System-wide: disaggregated topology and async pipelines.
How do you find them? Per-role wall-clock timing first, then nvidia-smi for util-vs-memory signatures, then torch.profiler on the slow role, then memory peak, then Nsight for the cross-GPU story, and always log the ρ-histogram to catch silent correctness bugs masquerading as performance bugs.