Throughput bottlenecks, the usual optimizations, and how to identify them

An RL training step is a pipeline of five distinct workloads. Each has different hardware physics, different memory pressure, and different failure modes. This lesson is the systems-level answer: where the wall-clock goes, which knob moves it, and how you'd diagnose any of it on a real cluster.

This lesson zooms in on the cost pressure from lesson 26's three-forces frame. Whichever combination of signal + estimator you've picked, the cost question is the same: five workloads, one critical path, where does the wall-clock actually go and how do we move it?

Where the wall-clock goes — the typical step

One training step touches five workloads. The visualization below is a stacked-bar timeline (one step, in seconds) for a representative 7B-parameter run on 8× H100 in a disaggregated topology. Click any segment to see the physics of that workload.

The shape is consistent across published profiles: rollout is 60–80% of wall-clock, training is 10–26%, reference and sync sit in the low single-digit percent each. Any optimization budget should be spent in proportion to that distribution.

Bottleneck 1 · Rollout — memory-bandwidth-bound decode

The rollout engine spends nearly all its time in autoregressive decode: emit one token at a time, each time reading the full set of model weights and the full KV cache so far. The compute per token is trivial; the bandwidth dominates. Concrete numbers, 7B model on H100:

weight read ≈ 14 GB / decode step (amortized across the batch) KV read at L=2048, B=32, GQA(8 KV heads) ≈ 32 × 2048 × 131 KB ≈ 8.6 GB total HBM bandwidth 3.35 TB/s ⟹ aggregate cluster throughput ≈ 3–4k tokens/s (≈100 tok/s/seq at B=32)

That ceiling tells you where to look for slack: anything that reduces bytes-moved-per-token, or that hides one stream under another, is a real optimization. The most-cited ones, in order of leverage:

Optimization	Where it lives	Win on typical RL
Prefix caching — share the prompt's KV across K rollouts	vLLM APC, SGLang RadixAttention	3–10× on K-rollout RL (depends on prompt/completion ratio)
Continuous batching — admit new sequences into freed slots	vLLM default since 2023	2–20× over padded static batching (more when length variance is high)
Paged KV cache — block-granular allocation, no fragmentation	vLLM, SGLang	2–4× max-batch over contiguous
Chunked prefill — interleave long prefills with ongoing decodes	vLLM scheduler flag	10–20% on long-prompt workloads
Speculative decoding — draft K tokens, verify in one batch	EAGLE-2/3, Medusa	1.5–3× at greedy/low T (high draft acceptance); degrades toward ~1× as T rises and acceptance drops
KV cache quantization (int8/fp8) — halve KV bytes	vLLM kv_cache_dtype flag	1.5–2× decode; needs log-prob match audit
GQA / MQA — fewer KV heads	Model architecture	4–8× smaller KV; baked into modern models
Tensor parallel decode — shard weights across GPUs	vLLM tp_size	shrinks per-GPU weight read; adds collective overhead

The single highest-leverage RL-specific optimization

Prefix caching. K-rollout sampling (K=4–64) means K trajectories share the entire prompt. Without sharing: K prefills + K decodes. With sharing: 1 prefill + K decodes. For prompt:completion ratios where prompt > completion (math, code, multi-turn), this lands as 3–10× rollout speedup. If your rollout engine doesn't have it on, turn it on before doing anything else.

Bottleneck 2 · Training — compute-bound, but memory-fragile

The trainer is the opposite shape: one forward+backward over the packed batch. Forward does one matmul per linear layer (~2·P·tokens FLOPs); backward does two matmuls per linear layer (one for ∂L/∂W, one for ∂L/∂x → ~4·P·tokens FLOPs). Hence the canonical "6·P·tokens" rule per step. At 7B with packed batch of 32k tokens: 1.3 PFLOPs per step. An H100 sustains ~800 TFLOPs/s real-world bf16, so a single H100 takes ~1.6 s; 8× FSDP brings that to ~0.2 s plus all-gather overhead.

The bottleneck flips between two regimes depending on sequence length:

Compute-bound at short L: matmul throughput dominates. Optimization is "use TF32/BF16/FP8 well" and "minimize FSDP all-gather waste".
Memory-bound at long L: activation memory (∝ N·L·d for residuals, plus L² for attention) crowds out parameters and gradients. Optimization is "checkpoint activations, shard sequence dim, kill the (B,T,V) logits tensor".

Optimization	What it does	When it pays
FSDP / ZeRO-3	Shard params, grads, optimizer state across DP group. All-gather weights for forward.	Always above one GPU.
Activation checkpointing	Don't store all activations; recompute on backward.	Any sequence past ~2k tokens. ~33% extra forward FLOPs for ~10× activation memory.
Chunked cross-entropy	Compute log_softmax + gather in chunks over V, never materialize (B,T,V).	Any vocab ≥ 50k. ~30–50% trainer-forward memory savings.
Packed varlen attention	Pack many short trajectories into one (B=1, T_total) tensor with cu_seqlens.	Variable-length rollouts (always, in RL). 30–60% fewer FLOPs vs padded.
Sequence parallel	Shard the sequence dim across TP group for very long L.	Long-context RL (32k+ tokens).
Gradient accumulation	Multiple forwards before one optimizer step.	When effective batch > what fits.
BF16 forward + FP32 master + BF16 grad reduction	Mixed precision the safe way.	Default; halves activation + grad memory vs FP32.

Bottleneck 3 · Reference — forward-only, often hideable

Reference scoring is one forward pass per trajectory. No backward, no optimizer state. Cost is small in absolute terms; the question is whether it lands on the critical path. Three options:

Run ref inline on the trainer GPUs after rollout. Simplest, but it serializes with everything else.
Run ref on a separate pool overlapped with rollout. Costs a pool of GPUs but takes ref off the critical path entirely.
Fuse ref into the trainer's forward. Either via a shared trunk + two heads, or by computing ref_logp during the same packed forward.

For algorithms that need ref every step (PPO, GRPO with KL), the inline cost is ~10% of step time at 7B. For larger models, a separate pool wins.

Bottleneck 4 · Weight sync — bandwidth-bound, hideable

Every step, fresh trainer weights have to reach the rollout engine. The mechanics, from lesson 25:

FSDP all-gather on the trainer (turn sharded weights into full ones).
Dtype cast from BF16 master to the inference dtype (BF16 / FP8).
NCCL broadcast from trainer rank 0 to all rollout ranks.
Reshard into the rollout engine's TP layout.

For 70B BF16: 140 GB of weights moving across the cluster. Over NVLink (~600 GB/s) this is ~1 s; over InfiniBand (~50 GB/s) it's 3–20 s depending on fabric topology and contention. At 5–10 second total step times, that's already a significant fraction.

Optimization	Why it pays
Sync less often (every N steps)	Drops cost linearly. Costs you a growing π_old vs π gap; clip fires more, but if your training is otherwise stable this is a free win.
Overlap sync with the next rollout's prefill	Hides sync entirely. Requires careful layer-boundary swap to avoid mid-forward inconsistency.
Fused dtype cast inside the broadcast	One HBM round-trip instead of two.
Async pipelines with versioned weights	Sync runs on its own cadence; trainer and rollout both stay busy. Cost: off-policy bookkeeping.
IPC handoff (colocated)	If rollout and trainer share GPUs, the "sync" is a pointer swap. ~zero copy.

Bottleneck 5 · Algorithm + verifier — usually free, sometimes a long tail

Computing advantages, group statistics, KL — the actual loss math — is microseconds on modern GPUs. Except when the verifier is slow. Code execution can take 0.1–5 s per rollout; web verifiers can take 5–60 s. At K=64 rollouts per prompt these become the wall-clock anchor regardless of how fast the GPU is.

The fix is uniformly architectural: verifier dispatcher. Maintain a CPU/CPU-pool that runs verifiers asynchronously while the GPU continues generating other trajectories. The GPU never blocks on a verifier.

Interactive · the budget allocator

The widget below lets you set the percentages of step time spent in each role, then choose which optimizations to apply, and reports the new step time + the recipe-level win. Use it to see why "prefix caching first" is the right opening move.

Optimization-budget allocator

Sliders set baseline percentages (must add to 100). Toggle optimizations on/off. The bar chart updates and the bottom KPI reports the resulting step-time reduction.

rollout %: 60 train %: 20 ref %: 10 sync %: 5

prefix caching (rollout × 0.4) continuous batching (rollout × 0.6) speculative decoding (rollout × 0.7) FP8 inference (rollout × 0.7)

chunked CE (train × 0.8) packed varlen (train × 0.7) FSDP+ckpt (train × 0.85) async sync overlap (sync → 0) sync every 5 steps (sync × 0.2) separate ref pool (ref → 0)

Baseline step time

100s

Optimized step time

—

Speedup

—

Largest remaining slice

—

How to identify the bottleneck in practice

An RL training run is a multi-process distributed system. "It's slow" can mean any of a dozen things. The diagnosis playbook, in the order a real engineer follows it:

Step 1 — wall-clock attribution per role

The first piece of code in any RL job should be per-role timing. From controller.py:

t0 = time.time();  trajs = rollout.generate(env, K);  t_rollout = time.time() - t0
t0 = time.time();  reference.score(trajs);            t_ref     = time.time() - t0
t0 = time.time();  algorithm.compute_advantages(...)  t_algo    = time.time() - t0
t0 = time.time();  trainer.train_step(batch, loss)    t_train   = time.time() - t0
t0 = time.time();  weight_syncer.sync(trainer, ...)   t_sync    = time.time() - t0
log_to_wandb({"t/rollout": t_rollout, "t/ref": t_ref, ...})

Run for ~20 steps. Look at the wandb breakdown. Anything that's not 60–75% rollout means the system is unusual (good or bad). Don't optimize past this step.

Step 2 — GPU utilization vs memory occupancy

nvidia-smi -l 1 in a side window during training. Two diagnostic signatures:

GPU util ≈ 40–60% but memory ≈ 80%+: You're memory-bound (rollout decode). Look at KV cache size, batch size, prefix caching status.
GPU util ≈ 90%+ but memory ≈ 30%: You're compute-bound (trainer or prefill). Increase batch size or pack more.
GPU util oscillating between 0% and 100% with periodicity: You're sync-bound (waiting on weight broadcast) or verifier-bound (waiting on tool-output).

Step 3 — torch profiler for the slow role

Once you know which role to look at, drop in torch.profiler for a few steps of that role only:

from torch.profiler import profile, ProfilerActivity, schedule
with profile(activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
             schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
             on_trace_ready=lambda p: p.export_chrome_trace("trace.json")) as prof:
    for _ in range(5):
        trainer.train_step(batch, loss)
        prof.step()

Open trace.json in chrome://tracing or Perfetto. Look for:

NCCL kernels (ncclKernel_AllGather, ncclKernel_Broadcast) taking a noticeable fraction of the timeline → FSDP or sync is the bottleneck.
Attention kernels (FlashAttn variants) dominating → expected. Compare to expected FLOPs.
log_softmax / cross_entropy > 10% of train time → chunked CE not on. Easy win.
Empty / idle stretches on the GPU timeline → CPU stalls (tokenization, verifier, Python overhead).

Step 4 — memory peak

torch.cuda.reset_peak_memory_stats()
trainer.train_step(batch, loss)
peak = torch.cuda.max_memory_allocated() / 1e9
log_to_wandb({"mem/peak_gb": peak})

If peak ≫ allocated, activations are dominating and checkpointing or chunked CE will help. If peak ≈ allocated and high, you're at the memory ceiling and need FSDP or sharding.

Step 5 — Nsight Systems for the cross-GPU story

For multi-node distributed bottlenecks (FSDP all-gather, NCCL broadcast), nsys is the right tool:

nsys profile -t cuda,nvtx,nccl -o trace python train.py
# open trace.nsys-rep in Nsight Systems UI

You'll see per-GPU timelines aligned with NCCL collective bars. Look for misalignment (some ranks waiting for a slow one) and bubble time (gaps between kernels).

Correctness canaries — the bugs that masquerade as performance

Before you optimize anything, make sure the run is correct. A slow-looking run is sometimes a broken run; you don't want to spend a week speeding up training that isn't actually learning. Three canaries worth monitoring every step:

Log ρ histogram. The PPO ratio should be centered on 1 with light tails. If it skews systematically or the tails grow, you have a log-prob match bug (lesson 25) — rollout and trainer aren't agreeing on what π says.
frac_clipped. From lesson 23's failure-mode taxonomy: > 30% means π_old has drifted too far from π_θ (stale sync, async without versioning, or a kernel mismatch).
KL(π_θ ‖ π_ref) trajectory. Should grow smoothly. A sudden spike means the policy lurched (entropy collapse, reward hacking, or numerics bug).

Long-run failure modes — what fails on day 10

Long RL runs always fail. The diagnostic playbook above covers "step 1 is slow". This section covers "step 10000 broke" — the reliability checklist the engineer (lesson 24) owns.

Failure mode	Detection signal	Mitigation
KV cache fragmentation creep	Rollout tokens/sec drifts down 20%+ over days; engine reports paged-block hit rate falling	Restart engine on a cadence; or upgrade to a paged allocator that compacts
Verifier slowdown	Reward call p99 latency creeping up; verifier queue backing up	Async dispatcher, GPU-accelerated verifier, or hot-reload of stale sandbox VMs
Reward hacking	Train reward up, held-out eval flat or down	Held-out eval at every K steps; flag divergence; review verifier for exploits
NCCL / network partition	One rank stuck in all-reduce; step time spikes	NCCL timeout + watchdog; auto-restart with rank-aware checkpointing
One rollout worker OOMs on adversarial prompt	Single-worker crash; if not isolated, brings down the trainer	Detached process groups, request-level circuit breakers, dead-letter queue for bad prompts
Optimizer state corruption / sharded checkpoint mismatch	Loss spike on resume; gradient norm jumps	Versioned sharded checkpoints (params + optimizer + RNG + sync gen_id); statistically-equivalent resume test
Throughput regression after a sync	Step time creeps up after weight sync; ratio histogram skews	Compare attention kernels between rollout and trainer; audit log-prob match (canaries above)

The minimum reliability surface — anything the engineer should ship before a multi-day run:

Component health metrics: per-step rollout tok/s, trainer step time, weight-sync latency, ref forward time, reward distribution, KL to reference, gradient norm. Per-trajectory: length distribution, tool error rate, verifier latency.
Sharded checkpointing: trainer params + optimizer + RNG + rollout-engine snapshot + sync generation. Resume should yield a statistically equivalent next step. (Bitwise determinism is broken by NCCL nondeterminism and verifier sandboxes in practice.)
Failure isolation: a rollout worker dying doesn't take the trainer with it. Detached process groups, request-level circuit breakers, dead-letter queues.
Held-out eval cadence: every K steps, run a held-out eval; flag when reward goes up but eval flat. This is the canary for reward hacking.
Throughput regression alarms: a sudden 20% step-time creep is the cheapest signal you'll get for KV fragmentation, verifier degradation, or network contention.

The ROI-ordered playbook

Combining everything: when an engineer joins a slow RL run, the order of attack is:

Profile first. Wall-clock per role. Anything not ~60–75% rollout means the profile is wrong or the setup is unusual.
Prefix caching on the rollout engine. Single highest-leverage RL optimization.
Continuous batching + chunked prefill. Pad-free, decode-prefill mix. Easy 2× rollout.
Chunked cross-entropy on the trainer. Memory win that unlocks longer sequences.
Packed varlen attention. 30–60% fewer FLOPs in train.
Log-prob match audit. Fix this before chasing speed — it's the silent correctness bug surface.
Speculative decoding for low-T rollouts. Conditional 1.5–2×.
Sync overlap (or sync-every-N). Multi-node win.
KV cache quantization (int8/fp8) with explicit log-prob match audit.
Custom fused PPO / KL kernel. Marginal. Do last.

Takeaway — the three answers

Where do bottlenecks land? Rollout decode (60–80%), trainer compute and activation memory (10–26%), the rest (sync + ref + algo, ~10% combined). Verifiers can balloon to dominate if they're slow.

What are the usual optimizations? Per role: prefix caching, continuous batching, paged KV, chunked prefill, spec decode, KV quant, GQA (rollout); FSDP, activation checkpointing, chunked CE, packed varlen, sequence parallel (trainer); separate pool (ref); async overlap, sync-less-often (sync). System-wide: disaggregated topology and async pipelines.

How do you find them? Per-role wall-clock timing first, then nvidia-smi for util-vs-memory signatures, then torch.profiler on the slow role, then memory peak, then Nsight for the cross-GPU story, and always log the ρ-histogram to catch silent correctness bugs masquerading as performance bugs.