
Throughput bottlenecks, the usual optimizations, and how to identify them

An RL training step is a pipeline of five distinct workloads. Each has different hardware physics, different memory pressure, and different failure modes. This lesson is the systems-level answer: where the wall-clock goes, which knob moves it, and how you'd diagnose any of it on a real cluster.

This lesson zooms in on the cost pressure from lesson 26's three-forces frame. Whichever combination of signal + estimator you've picked, the cost question is the same: five workloads, one critical path, where does the wall-clock actually go and how do we move it?

Where the wall-clock goes — the typical step

One training step touches five workloads. The visualization below is a stacked-bar timeline (one step, in seconds) for a representative 7B-parameter run on 8× H100 in a disaggregated topology.

[Figure: A typical RL step's wall-clock budget, from 0 s to end of step. ROLLOUT 60% (decode-bound), REF 10%, ALGO 5%, TRAIN 20% (compute-bound), SYNC 5%.]

The shape is consistent across published profiles: rollout is 60–80% of wall-clock, training is 10–26%, reference and sync sit in the low single-digit percent each. Any optimization budget should be spent in proportion to that distribution.

Bottleneck 1 · Rollout — memory-bandwidth-bound decode

The rollout engine spends nearly all its time in autoregressive decode: emit one token at a time, each time reading the full set of model weights and the full KV cache so far. The compute per token is trivial; the bandwidth dominates. Concrete numbers, 7B model on H100:

weight read  ≈  14 GB / decode step  (amortized across the batch)
KV read at L=2048, B=32, GQA (8 KV heads)  ≈  32 × 2048 × 131 KB  ≈  8.6 GB total
HBM bandwidth  3.35 TB/s  ⟹  aggregate cluster throughput ≈ 3–4k tokens/s (≈100 tok/s/seq at B=32)
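The arithmetic behind those figures can be checked with a short sketch. The layer count (32) and head dimension (128) are assumptions, not stated in the text; they are chosen because they reproduce the 131 KB per-token KV figure for a 7B model in BF16. The result is an ideal bandwidth ceiling, so real throughput should land somewhat below it.

```python
# Back-of-envelope decode roofline for the numbers quoted above.
PARAMS = 7e9
BYTES_PER_VALUE = 2          # BF16
N_LAYERS = 32                # assumption (not stated in the text)
N_KV_HEADS = 8               # GQA, from the text
HEAD_DIM = 128               # assumption (not stated in the text)
SEQ_LEN = 2048
BATCH = 32
HBM_BYTES_PER_S = 3.35e12    # H100 HBM bandwidth quoted above

weight_bytes = PARAMS * BYTES_PER_VALUE
# K and V, for every layer, KV head, and head dimension.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_bytes = BATCH * SEQ_LEN * kv_bytes_per_token

# One decode step reads the weights once and the whole KV cache once.
bytes_per_decode_step = weight_bytes + kv_bytes
step_seconds = bytes_per_decode_step / HBM_BYTES_PER_S
ideal_tokens_per_s = BATCH / step_seconds   # one new token per sequence per step

print(f"weights:           {weight_bytes / 1e9:.1f} GB")
print(f"KV per token:      {kv_bytes_per_token / 1e3:.0f} KB")
print(f"KV total:          {kv_bytes / 1e9:.1f} GB")
print(f"ideal decode step: {step_seconds * 1e3:.2f} ms")
print(f"ideal throughput:  {ideal_tokens_per_s:.0f} tok/s")
```

Under these assumptions the ceiling comes out a little under 5k tokens/s, which is consistent with the quoted 3–4k once scheduling and kernel overheads are taken off.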

That ceiling tells you where to look for slack: anything that reduces bytes-moved-per-token, or that hides one stream under another, is a real optimization. The most-cited ones, in order of leverage:

Optimization | Where it lives | Win on typical RL
Prefix caching: share the prompt's KV across K rollouts | vLLM APC, SGLang RadixAttention | 3–10× on K-rollout RL (depends on prompt/completion ratio)
Continuous batching: admit new sequences into freed slots | vLLM default since 2023 | 2–20× over padded static batching (more when length variance is high)
Paged KV cache: block-granular allocation, no fragmentation | vLLM, SGLang | 2–4× max-batch over contiguous
Chunked prefill: interleave long prefills with ongoing decodes | vLLM scheduler flag | 10–20% on long-prompt workloads
Speculative decoding: draft K tokens, verify in one batch | EAGLE-2/3, Medusa | 1.5–3× at greedy/low T (high draft acceptance); degrades toward ~1× as T rises and acceptance drops
KV cache quantization (int8/fp8): halve KV bytes | vLLM kv_cache_dtype flag | 1.5–2× decode; needs log-prob match audit
GQA / MQA: fewer KV heads | Model architecture | 4–8× smaller KV; baked into modern models
Tensor parallel decode: shard weights across GPUs | vLLM tp_size | shrinks per-GPU weight read; adds collective overhead

The single highest-leverage RL-specific optimization
Prefix caching. K-rollout sampling (K=4–64) means K trajectories share the entire prompt. Without sharing: K prefills + K decodes. With sharing: 1 prefill + K decodes. For prompt:completion ratios where prompt > completion (math, code, multi-turn), this lands as 3–10× rollout speedup. If your rollout engine doesn't have it on, turn it on before doing anything else.
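The "1 prefill + K decodes" accounting can be illustrated with a toy cache. This is a sketch only: `ToyPrefixCache` is a hypothetical name, it stores a placeholder instead of real KV tensors, and it matches whole prompts exactly, whereas real engines share prefixes at block or radix-tree granularity. It counts prefill tokens, not wall-clock time.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Caches a stand-in for the prompt's KV, keyed by the exact prompt tokens."""

    def __init__(self) -> None:
        self._kv: Dict[Tuple[int, ...], List[int]] = {}
        self.prefilled_tokens = 0

    def get_or_prefill(self, prompt: List[int]) -> List[int]:
        key = tuple(prompt)
        if key not in self._kv:
            # Cache miss: pay for one prefill over the whole prompt.
            self.prefilled_tokens += len(prompt)
            self._kv[key] = list(prompt)   # placeholder for real KV tensors
        return self._kv[key]


prompt = list(range(1500))   # made-up 1500-token prompt
K = 16                       # rollouts sampled from the same prompt

cache = ToyPrefixCache()
for _ in range(K):
    cache.get_or_prefill(prompt)   # each rollout would then decode from this KV

prefill_without_sharing = K * len(prompt)
print(f"prefill tokens without sharing: {prefill_without_sharing}")
print(f"prefill tokens with sharing:    {cache.prefilled_tokens}")
```

The prefill work drops by a factor of K. How much of that shows up as end-to-end rollout speedup depends on the prompt:completion ratio, as noted above.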

Bottleneck 2 · Training — compute-bound, but memory-fragile

The trainer is the opposite shape: one forward+backward over the packed batch. Forward does one matmul per linear layer (~2·P·tokens FLOPs); backward does two matmuls per linear layer (one for ∂L/∂W, one for ∂L/∂x → ~4·P·tokens FLOPs). Hence the canonical "6·P·tokens" rule per step. At 7B with a packed batch of 32k tokens, that is about 1.3 PFLOP per step. Assuming an H100 sustains ~800 TFLOP/s in bf16 (an optimistic figure, not far below its dense peak of roughly 990 TFLOP/s), a single H100 takes ~1.6 s; 8× FSDP brings that to ~0.2 s plus all-gather overhead.
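The 6·P·tokens estimate is easy to reproduce. The sketch below reads "32k" as 32,000 tokens (an assumption; 32,768 gives about 1.38e15 FLOPs) and uses the sustained-throughput figure from the text as given.

```python
PARAMS = 7e9                 # P
TOKENS = 32_000              # packed batch; "32k" read as 32,000 (assumption)
SUSTAINED_FLOPS = 800e12     # per-GPU sustained bf16 throughput quoted in the text
N_GPUS = 8

forward_flops = 2 * PARAMS * TOKENS     # one matmul per linear layer
backward_flops = 4 * PARAMS * TOKENS    # two matmuls per linear layer
step_flops = forward_flops + backward_flops

single_gpu_seconds = step_flops / SUSTAINED_FLOPS
sharded_seconds = single_gpu_seconds / N_GPUS   # ignores all-gather overhead

print(f"FLOPs per step: {step_flops:.3e}")
print(f"one GPU:        {single_gpu_seconds:.2f} s")
print(f"{N_GPUS} GPUs:         {sharded_seconds:.2f} s (before communication)")
```

The unrounded values land slightly above the rounded ones in the text (about 1.34 PFLOP, 1.7 s, and 0.21 s).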

The bottleneck flips between two regimes depending on sequence length:

Optimization | What it does | When it pays
FSDP / ZeRO-3 | Shard params, grads, optimizer state across DP group. All-gather weights for forward. | Always above one GPU.
Activation checkpointing | Don't store all activations; recompute on backward. | Any sequence past ~2k tokens. ~33% extra step compute (one additional forward pass) for ~10× less activation memory.
Chunked cross-entropy | Compute log_softmax + gather in chunks over V, never materialize (B,T,V). | Any vocab ≥ 50k. ~30–50% trainer-forward memory savings.
Packed varlen attention | Pack many short trajectories into one (B=1, T_total) tensor with cu_seqlens. | Variable-length rollouts (always, in RL). 30–60% fewer FLOPs vs padded.
Sequence parallel | Shard the sequence dim across TP group for very long L. | Long-context RL (32k+ tokens).
Gradient accumulation | Multiple forward+backward passes before one optimizer step. | When effective batch > what fits.
BF16 forward + FP32 master + BF16 grad reduction | Mixed precision the safe way. | Default; halves activation + grad memory vs FP32.
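For the packed varlen row, a minimal sketch of how the cu_seqlens boundaries are built and how much of a padded batch is padding. The rollout lengths are made up for illustration, and the number printed is a token fraction, not the FLOP saving quoted in the table (attention cost does not scale linearly with tokens).

```python
from itertools import accumulate
from typing import List


def cu_seqlens(lengths: List[int]) -> List[int]:
    """Cumulative boundaries: sequence i occupies [cu[i], cu[i+1]) in the packed tensor."""
    return [0] + list(accumulate(lengths))


def padding_fraction(lengths: List[int]) -> float:
    """Fraction of a padded (B, T_max) batch that holds padding tokens."""
    padded_tokens = len(lengths) * max(lengths)
    return 1.0 - sum(lengths) / padded_tokens


lengths = [120, 900, 64, 2048, 310]   # made-up rollout lengths
cu = cu_seqlens(lengths)

print(f"cu_seqlens:       {cu}")
print(f"packed T_total:   {cu[-1]}")
print(f"padding fraction: {padding_fraction(lengths):.1%}")
```

The last entry of cu_seqlens is T_total, the length of the single packed row; a varlen attention kernel uses the boundaries to keep sequences from attending to each other.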

Bottleneck 3 · Reference — forward-only, often hideable

Reference scoring is one forward pass per trajectory. No backward, no optimizer state. Cost is small in absolute terms; the question is whether it lands on the critical path. Three options:

For algorithms that need ref every step (PPO, GRPO with KL), the inline cost is ~10% of step time at 7B. For larger models, a separate pool wins.

Bottleneck 4 · Weight sync — bandwidth-bound, hideable

Every step, fresh trainer weights have to reach the rollout engine. The mechanics, from lesson 25:

  1. FSDP all-gather on the trainer (turn sharded weights into full ones).
  2. Dtype cast from the FP32 master weights to the inference dtype (BF16 / FP8).
  3. NCCL broadcast from trainer rank 0 to all rollout ranks.
  4. Reshard into the rollout engine's TP layout.

For a 70B model in BF16, that is 140 GB of weights moving across the cluster. Over NVLink (~600 GB/s) this takes ~1 s; over InfiniBand (~50 GB/s) it takes 3–20 s depending on fabric topology and contention. With total step times of 5–10 seconds, that is already a significant fraction.
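The line-rate lower bounds behind those numbers can be computed directly. This is a sketch using the link speeds quoted above; it models a single full copy at full bandwidth and ignores the all-gather, cast, and reshard steps, so measured sync times sit above these floors.

```python
def transfer_seconds(n_params: float, bytes_per_param: int, link_bytes_per_s: float) -> float:
    """Time to move every weight once over a link running at full line rate."""
    return n_params * bytes_per_param / link_bytes_per_s


N_PARAMS = 70e9
BF16_BYTES = 2

weights_gb = N_PARAMS * BF16_BYTES / 1e9
nvlink_s = transfer_seconds(N_PARAMS, BF16_BYTES, 600e9)     # ~600 GB/s
infiniband_s = transfer_seconds(N_PARAMS, BF16_BYTES, 50e9)  # ~50 GB/s

print(f"weights:          {weights_gb:.0f} GB")
print(f"NVLink floor:     {nvlink_s:.2f} s")
print(f"InfiniBand floor: {infiniband_s:.2f} s")
```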

Optimization | Why it pays
Sync less often (every N steps) | Cuts sync cost by a factor of N. The price is a growing gap between π_old and π, so the clip fires more often; if your training is otherwise stable this is close to a free win.
Overlap sync with the next rollout's prefill | Hides sync entirely. Requires careful layer-boundary swap to avoid mid-forward inconsistency.
Fused dtype cast inside the broadcast | One HBM round-trip instead of two.
Async pipelines with versioned weights | Sync runs on its own cadence; trainer and rollout both stay busy. Cost: off-policy bookkeeping.
IPC handoff (colocated) | If rollout and trainer share GPUs, the "sync" is a pointer swap. ~zero copy.

Bottleneck 5 · Algorithm + verifier — usually free, sometimes a long tail

Computing advantages, group statistics, KL — the actual loss math — is microseconds on modern GPUs. Except when the verifier is slow. Code execution can take 0.1–5 s per rollout; web verifiers can take 5–60 s. At K=64 rollouts per prompt these become the wall-clock anchor regardless of how fast the GPU is.

The fix is uniformly architectural: a verifier dispatcher. Maintain a CPU worker pool that runs verifiers asynchronously while the GPU continues generating other trajectories. The GPU never blocks on a verifier.
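One way such a dispatcher might look, as a minimal sketch built on the standard library. `VerifierDispatcher` and `slow_verifier` are hypothetical names, and the verifier is simulated with a sleep. A thread pool suits verifiers that wait on subprocesses or network calls; a process pool would be the likelier choice for CPU-bound Python verifiers.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple


class VerifierDispatcher:
    """Runs slow verifiers on a worker pool so generation never waits on them."""

    def __init__(self, verify: Callable[[str], float], max_workers: int = 8) -> None:
        self._verify = verify
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: List[Tuple[str, Future]] = []

    def submit(self, trajectory: str) -> None:
        """Queue a trajectory for scoring and return immediately."""
        future = self._pool.submit(self._verify, trajectory)
        self._pending.append((trajectory, future))

    def drain(self) -> List[Tuple[str, float]]:
        """Block until every queued verifier has finished; call once per step."""
        results = [(traj, fut.result()) for traj, fut in self._pending]
        self._pending.clear()
        return results

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def slow_verifier(trajectory: str) -> float:
    time.sleep(0.05)                    # stand-in for code execution or a web call
    return float(len(trajectory) % 2)   # placeholder reward


dispatcher = VerifierDispatcher(slow_verifier, max_workers=8)
start = time.perf_counter()
for i in range(8):
    dispatcher.submit(f"trajectory-{i}")   # the generation loop would continue here
submit_elapsed = time.perf_counter() - start
rewards = dispatcher.drain()               # rewards are only needed at loss time
dispatcher.close()

print(f"submitting 8 verifications took {submit_elapsed * 1e3:.1f} ms")
print(f"collected {len(rewards)} rewards")
```

The design point is that `submit` returns at once, so the only blocking call is `drain`, placed just before the rewards are needed.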

Interactive · the budget allocator

The widget below lets you set the percentages of step time spent in each role, then choose which optimizations to apply, and reports the new step time + the recipe-level win. Use it to see why "prefix caching first" is the right opening move.

[Interactive widget: optimization-budget allocator. Sliders set the baseline per-role percentages (must add to 100) for a 100 s baseline step; toggling optimizations updates the bar chart and reports the optimized step time, the speedup, and the largest remaining slice.]
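The allocator's arithmetic is Amdahl's law applied per role, and can be sketched in a few lines. The baseline shares are taken from the typical-step figure earlier in the lesson; the 3× factors are illustrative, and the model assumes the five roles run serially on the critical path with no overlap.

```python
from typing import Dict


def optimized_step(baseline: Dict[str, float], speedups: Dict[str, float]) -> Dict[str, float]:
    """Divide each role's time by its speedup factor (1.0 when not optimized)."""
    return {role: t / speedups.get(role, 1.0) for role, t in baseline.items()}


def overall_speedup(baseline: Dict[str, float], optimized: Dict[str, float]) -> float:
    return sum(baseline.values()) / sum(optimized.values())


# Per-role seconds for a 100 s step, using the typical-step shares.
baseline = {"rollout": 60.0, "ref": 10.0, "algo": 5.0, "train": 20.0, "sync": 5.0}

rollout_first = optimized_step(baseline, {"rollout": 3.0})   # e.g. low end of prefix caching
train_first = optimized_step(baseline, {"train": 3.0})       # same factor spent on the trainer

print(f"3x on rollout: {sum(rollout_first.values()):.1f} s, "
      f"{overall_speedup(baseline, rollout_first):.2f}x overall")
print(f"3x on trainer: {sum(train_first.values()):.1f} s, "
      f"{overall_speedup(baseline, train_first):.2f}x overall")
```

The same 3× factor is worth roughly 1.67× overall when spent on rollout and about 1.15× when spent on the trainer, which is the case for starting with the largest slice.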

How to identify the bottleneck in practice

An RL training run is a multi-process distributed system. "It's slow" can mean any of a dozen things. The diagnosis playbook, in the order a real engineer follows it:

Step 1 — wall-clock attribution per role

The first piece of code in any RL job should be per-role timing. From controller.py:

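import time

# Each call must block until its GPU work has finished (e.g. via
# torch.cuda.synchronize()); otherwise its time is charged to a later role.
t0 = time.perf_counter();  trajs = rollout.generate(env, K);    t_rollout = time.perf_counter() - t0
t0 = time.perf_counter();  reference.score(trajs);              t_ref     = time.perf_counter() - t0
t0 = time.perf_counter();  algorithm.compute_advantages(...);   t_algo    = time.perf_counter() - t0
t0 = time.perf_counter();  trainer.train_step(batch, loss);     t_train   = time.perf_counter() - t0
t0 = time.perf_counter();  weight_syncer.sync(trainer, ...);    t_sync    = time.perf_counter() - t0
log_to_wandb({"t/rollout": t_rollout, "t/ref": t_ref, ...})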

Run for ~20 steps and look at the wandb breakdown. Anything that's not 60–75% rollout means the system is unusual (good or bad). Don't optimize anything until you have this breakdown.
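The same per-role timing can be packaged as a small context manager, which avoids repeating the clock arithmetic. This is a sketch: `RoleTimer` is a hypothetical helper, and the sleeps stand in for the real rollout and trainer calls. On a GPU, each timed block would need to finish its device work before the clock is read.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class RoleTimer:
    """Accumulates wall-clock samples per role and reports each role's share."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def role(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the sample even if the timed block raises.
            self.samples[name].append(time.perf_counter() - start)

    def shares(self) -> Dict[str, float]:
        totals = {name: sum(vals) for name, vals in self.samples.items()}
        grand_total = sum(totals.values())
        return {name: t / grand_total for name, t in totals.items()}


timer = RoleTimer()
for _ in range(3):
    with timer.role("rollout"):
        time.sleep(0.03)    # stand-in for rollout.generate(...)
    with timer.role("train"):
        time.sleep(0.01)    # stand-in for trainer.train_step(...)

shares = timer.shares()
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```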

Step 2 — GPU utilization vs memory occupancy

Run nvidia-smi -l 1 in a side window during training. Two diagnostic signatures:

Step 3 — torch profiler for the slow role

Once you know which role to look at, drop in torch.profiler for a few steps of that role only:

from torch.profiler import profile, ProfilerActivity, schedule
with profile(activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
             schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
             on_trace_ready=lambda p: p.export_chrome_trace("trace.json")) as prof:
    for _ in range(5):
        trainer.train_step(batch, loss)
        prof.step()

Open trace.json in chrome://tracing or Perfetto. Look for:

Step 4 — memory peak

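import torch

torch.cuda.reset_peak_memory_stats()
trainer.train_step(batch, loss)
peak = torch.cuda.max_memory_allocated() / 1e9    # high-water mark during the step
allocated = torch.cuda.memory_allocated() / 1e9   # still held once the step is done
log_to_wandb({"mem/peak_gb": peak})

Here allocated is what remains resident after the step: parameters, gradients, and optimizer state. If peak ≫ allocated, activations are dominating and checkpointing or chunked CE will help. If peak ≈ allocated and both are high, you're at the memory ceiling for persistent state and need FSDP or further sharding.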
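Chunked CE, named above as one remedy for activation-dominated peaks, rests on a streaming log-sum-exp over the vocabulary. The sketch below shows the idea for a single token in pure Python: logits are produced one vocab chunk at a time from a hypothetical `lm_head` matrix, so the full logit vector is never held. A real trainer would do this with tensors over many tokens; the sizes and values here are made up.

```python
import math
from typing import List, Sequence


def chunked_log_prob(hidden: Sequence[float], lm_head: List[List[float]],
                     target: int, chunk: int) -> float:
    """log softmax(lm_head @ hidden)[target], visiting the vocab axis in chunks."""
    running_max = -math.inf
    running_sum = 0.0            # sum of exp(logit - running_max) seen so far
    target_logit = None
    for start in range(0, len(lm_head), chunk):
        rows = lm_head[start:start + chunk]
        block = [sum(w * h for w, h in zip(row, hidden)) for row in rows]
        if start <= target < start + len(block):
            target_logit = block[target - start]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max, then add this chunk's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max
    if target_logit is None:
        raise IndexError("target index outside the vocabulary")
    return target_logit - (running_max + math.log(running_sum))


def full_log_prob(hidden: Sequence[float], lm_head: List[List[float]], target: int) -> float:
    """Reference version that materializes every logit at once."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]
    top = max(logits)
    return logits[target] - (top + math.log(sum(math.exp(x - top) for x in logits)))


hidden = [0.5, -1.0, 2.0]
lm_head = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5], [0.0, 0.0, 0.0], [2.0, 0.5, -0.5],
           [-1.0, 1.5, 0.25], [0.3, 0.3, 0.3], [0.7, -0.2, 1.1]]

chunked = chunked_log_prob(hidden, lm_head, target=3, chunk=3)
reference = full_log_prob(hidden, lm_head, target=3)
print(f"chunked: {chunked:.6f}  reference: {reference:.6f}")
```

Only one chunk of logits exists at any time, which is where the memory saving comes from; the result matches the unchunked computation up to floating-point rounding.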

Step 5 — Nsight Systems for the cross-GPU story

For multi-node distributed bottlenecks (FSDP all-gather, NCCL broadcast), nsys is the right tool:

nsys profile -t cuda,nvtx,nccl -o trace python train.py
# open trace.nsys-rep in Nsight Systems UI

You'll see per-GPU timelines aligned with NCCL collective bars. Look for misalignment (some ranks waiting for a slow one) and bubble time (gaps between kernels).

Correctness canaries — the bugs that masquerade as performance

Before you optimize anything, make sure the run is correct. A slow-looking run is sometimes a broken run; you don't want to spend a week speeding up training that isn't actually learning. Three canaries worth monitoring every step:
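One canary this lesson refers to elsewhere is the log-prob match audit: right after a weight sync, the trainer and the rollout engine hold the same weights, so the per-token ratio ρ = exp(logp_trainer − logp_rollout) should sit very close to 1. A minimal sketch of that check follows; the function names, the sample values, and the 0.05 tolerance are all hypothetical.

```python
import math
from typing import Dict, List


def ratio_summary(logp_trainer: List[float], logp_rollout: List[float]) -> Dict[str, float]:
    """Summarize per-token rho = exp(logp_trainer - logp_rollout)."""
    ratios = [math.exp(a - b) for a, b in zip(logp_trainer, logp_rollout)]
    return {
        "mean": sum(ratios) / len(ratios),
        "min": min(ratios),
        "max": max(ratios),
        "max_abs_dev": max(abs(r - 1.0) for r in ratios),
    }


def log_probs_match(summary: Dict[str, float], tolerance: float = 0.05) -> bool:
    """True when every ratio is within the (hypothetical) tolerance of 1."""
    return summary["max_abs_dev"] <= tolerance


trainer_logp = [-1.20, -0.35, -2.10]
healthy_rollout_logp = [-1.21, -0.35, -2.08]   # small numerical differences only
broken_rollout_logp = [-1.20, -0.35, -3.10]    # e.g. a mismatched kernel or stale weights

healthy = ratio_summary(trainer_logp, healthy_rollout_logp)
broken = ratio_summary(trainer_logp, broken_rollout_logp)
print(f"healthy: max deviation {healthy['max_abs_dev']:.3f}, match={log_probs_match(healthy)}")
print(f"broken:  max deviation {broken['max_abs_dev']:.3f}, match={log_probs_match(broken)}")
```

Logging the full histogram of ratios, not just the summary, makes a skewed or bimodal distribution visible at a glance.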

Long-run failure modes — what fails on day 10

Long RL runs always fail. The diagnostic playbook above covers "step 1 is slow". This section covers "step 10000 broke" — the reliability checklist the engineer (lesson 24) owns.

Failure mode | Detection signal | Mitigation
KV cache fragmentation creep | Rollout tokens/sec drifts down 20%+ over days; engine reports paged-block hit rate falling | Restart engine on a cadence; or upgrade to a paged allocator that compacts
Verifier slowdown | Reward call p99 latency creeping up; verifier queue backing up | Async dispatcher, GPU-accelerated verifier, or hot-reload of stale sandbox VMs
Reward hacking | Train reward up, held-out eval flat or down | Held-out eval every K steps; flag divergence; review verifier for exploits
NCCL / network partition | One rank stuck in all-reduce; step time spikes | NCCL timeout + watchdog; auto-restart with rank-aware checkpointing
One rollout worker OOMs on adversarial prompt | Single-worker crash; if not isolated, brings down the trainer | Detached process groups, request-level circuit breakers, dead-letter queue for bad prompts
Optimizer state corruption / sharded checkpoint mismatch | Loss spike on resume; gradient norm jumps | Versioned sharded checkpoints (params + optimizer + RNG + sync gen_id); statistically-equivalent resume test
Throughput regression after a sync | Step time creeps up after weight sync; ratio histogram skews | Compare attention kernels between rollout and trainer; audit log-prob match (canaries above)

The minimum reliability surface, meaning what the engineer should ship before a multi-day run:

  1. Component health metrics: per-step rollout tok/s, trainer step time, weight-sync latency, ref forward time, reward distribution, KL to reference, gradient norm. Per-trajectory: length distribution, tool error rate, verifier latency.
  2. Sharded checkpointing: trainer params + optimizer + RNG + rollout-engine snapshot + sync generation. Resume should yield a statistically equivalent next step. (Bitwise determinism is broken by NCCL nondeterminism and verifier sandboxes in practice.)
  3. Failure isolation: a rollout worker dying doesn't take the trainer with it. Detached process groups, request-level circuit breakers, dead-letter queues.
  4. Held-out eval cadence: every K steps, run a held-out eval; flag when reward goes up but eval stays flat. This is the canary for reward hacking.
  5. Throughput regression alarms: a 20% step-time creep is the cheapest signal you'll get for KV fragmentation, verifier degradation, or network contention.
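The throughput regression alarm in item 5 might be sketched as a comparison between a frozen baseline and a rolling window of recent step times. `StepTimeAlarm` is a hypothetical helper, and the window sizes are illustrative; medians are used so that a single slow step does not fire the alarm.

```python
from collections import deque
from statistics import median
from typing import Deque, List


class StepTimeAlarm:
    """Fires when the recent median step time exceeds the baseline median by a factor."""

    def __init__(self, baseline_steps: int = 50, recent_steps: int = 10,
                 threshold: float = 1.20) -> None:
        self._baseline: List[float] = []
        self._baseline_steps = baseline_steps
        self._recent_steps = recent_steps
        self._recent: Deque[float] = deque(maxlen=recent_steps)
        self._threshold = threshold

    def observe(self, step_seconds: float) -> bool:
        """Record one step time and return True if the alarm should fire."""
        if len(self._baseline) < self._baseline_steps:
            self._baseline.append(step_seconds)   # still learning the baseline
            return False
        self._recent.append(step_seconds)
        if len(self._recent) < self._recent_steps:
            return False                          # not enough recent samples yet
        return median(self._recent) > self._threshold * median(self._baseline)


alarm = StepTimeAlarm(baseline_steps=5, recent_steps=3, threshold=1.20)
healthy = [alarm.observe(t) for t in [10.0] * 5 + [10.5, 10.4, 10.6]]
degraded = [alarm.observe(t) for t in [12.5, 12.6, 12.7]]

print(f"healthy phase fired:  {any(healthy)}")
print(f"degraded phase fired: {degraded}")
```

With a 10 s baseline, the alarm stays quiet through normal jitter and fires once the recent median crosses 12 s.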

The ROI-ordered playbook

Combining everything: when an engineer joins a slow RL run, the order of attack is:

  1. Profile first. Wall-clock per role. Anything not ~60–75% rollout means the profile is wrong or the setup is unusual.
  2. Prefix caching on the rollout engine. Single highest-leverage RL optimization.
  3. Continuous batching + chunked prefill. Pad-free, decode-prefill mix. Easy 2× rollout.
  4. Chunked cross-entropy on the trainer. Memory win that unlocks longer sequences.
  5. Packed varlen attention. 30–60% fewer FLOPs in train.
  6. Log-prob match audit. Fix this before chasing speed — it's the silent correctness bug surface.
  7. Speculative decoding for low-T rollouts. Conditional 1.5–2×.
  8. Sync overlap (or sync-every-N). Multi-node win.
  9. KV cache quantization (int8/fp8) with explicit log-prob match audit.
  10. Custom fused PPO / KL kernel. Marginal. Do last.

Takeaway — the three answers

Where do bottlenecks land? Rollout decode (60–80%), trainer compute and activation memory (10–26%), the rest (sync + ref + algo, ~10% combined). Verifiers can balloon to dominate if they're slow.

What are the usual optimizations? Per role: prefix caching, continuous batching, paged KV, chunked prefill, spec decode, KV quant, GQA (rollout); FSDP, activation checkpointing, chunked CE, packed varlen, sequence parallel (trainer); separate pool (ref); async overlap, sync-less-often (sync). System-wide: disaggregated topology and async pipelines.

How do you find them? Per-role wall-clock timing first, then nvidia-smi for util-vs-memory signatures, then torch.profiler on the slow role, then memory peak, then Nsight for the cross-GPU story, and always log the ρ-histogram to catch silent correctness bugs masquerading as performance bugs.