Memory & throughput math — what fits where
Before you turn on training, you should be able to write down on the back of an envelope what the model's memory cost is, what sharding will bring it inside an H100, and what your steady-state tokens/second ceiling looks like. This lesson is that envelope.
Why this matters in RL specifically
In RL you are holding four copies of the policy in memory simultaneously: the trainer's bf16 params, fp32 optimizer master + Adam states, the frozen reference (lesson 03), and the rollout engine's inference copy. A single 70B model is 140 GB just for the trainer's optimizer state. If your envelope is wrong by a factor of 2 you OOM mid-epoch. If it's wrong by 0.5× you're not using half your cluster. The math is small, and you should know it cold.
Param-byte math (the single most useful identity)
Let P = parameter count (e.g., 7e9 for a 7B model). In bf16 each parameter takes 2 bytes. Then:
| What | Bytes per param | Why |
|---|---|---|
| bf16 weights | 2 | Forward + backward pass storage. |
| fp32 master copy | 4 | For numerical stability during the optimizer step. |
| fp32 grad | 4 | Accumulated in fp32; reduce-scattered in FSDP. |
| AdamW state (m, v) | 4 + 4 = 8 | First and second moments, both fp32. |
| Total (training) | 18 | Often cited as "16×" — it's 16 if you skip the fp32 master, 18 if you keep it. Some shops use bf16 Adam state to drop to 12. |
| bf16 weights (inference) | 2 | Plus KV cache (see below). |
So for a 7B model, training memory just for parameters and optimizer ≈ 7e9 × 18 = 126 GB. An H100 has 80 GB HBM. You cannot train a 7B model on a single H100 without sharding.
Sanity-check examples
- 7B training: ~126 GB → 2× H100 with FSDP/ZeRO-3 fits comfortably.
- 70B training: ~1.26 TB → 16× H100 with FSDP. Tight but works.
- 405B training: ~7.3 TB → 96× H100 cluster minimum.
Activation memory · the part the envelope often forgets
Forward pass produces intermediate activations that backward needs. For a transformer with hidden size d, sequence length L, batch B, and number of layers N:
The first term comes from per-layer activations (residual streams, MLP intermediates); the second from attention scores (L × L per head). For a 7B model (d=4096, N=32), L=8192, B=4, in bf16:
This is more than the 126 GB of param+optimizer. Activation memory is often the actual binder, not parameters.
Activation checkpointing
Don't store activations; recompute them on backward. Costs an extra ~33% forward FLOPs but reduces activation memory by an order of magnitude. Off by default in PyTorch; almost always on in production. With selective checkpointing (recompute attention, store everything else) you can get back to ~10 GB of activation memory at the cost of ~10% extra compute.
Sequence parallel
For long contexts, even a single sequence's activations don't fit. Sequence parallel partitions the sequence dimension across GPUs in the tensor-parallel group. Combined with FSDP, you can train 32k-context 70B models that wouldn't otherwise fit anywhere.
KV cache math (the inference side)
For each layer, per token, you cache one K and one V vector per KV head. With full multi-head attention (one KV head per query head) the cache per token is 2 × Nlayers × d × bytes. With Grouped-Query Attention (GQA), KV heads are a fraction of query heads, and the cache shrinks by the same factor.
If we pretend GQA didn't exist: a 32-layer, d=4096 model in bf16 would cost 2 × 32 × 4096 × 2 = 524 KB / token; an 80-layer, d=8192 model would cost 2 × 80 × 8192 × 2 = 2.6 MB / token. Real Llama-3 uses GQA — 8 KV heads against 32 query heads for 8B (4× smaller, ~131 KB / token), and 8 KV heads against 64 query heads for 70B (8× smaller, ~328 KB / token).
Even with GQA the numbers are punishing: serving 256 concurrent 8k-context streams on Llama-3-70B needs 256 × 8192 × 328 KB ≈ 670 GB of KV cache. KV-cache size is what GQA, MQA, sliding-window attention, and KV quantization to int8 are all attacking.
Interactive · the model-sizing envelope
Drag the model size and context length. The widget computes train memory and inference memory, splits across the chosen sharding, and flags when you exceed an H100's 80 GB.
FLOP math · how many tokens/sec can the cluster do?
One forward pass on a transformer: ≈ 2 × P × tokens FLOPs (the 2 is forward only). Forward + backward (training) ≈ 6 × P × tokens. With full activation checkpointing the recompute adds another forward, so ≈ 8 × P × tokens. An H100 SXM5 in bf16 has a dense peak of ~989 TFLOPs; ~800 TFLOPs sustained is a strong real-world MFU.
So a single H100, sustained, processes about:
For 7B: ~19 k tokens/sec per H100. For 70B: ~1.9 k tokens/sec. For 405B: ~330 tokens/sec. These are per-GPU training-token throughputs at perfect utilization, which you will never hit; ~50–65% MFU is excellent in practice.
RL-specific throughput accounting
In RL you don't just train tokens, you generate them. Generation throughput is usually 3–6× lower per token than training because of the memory-bandwidth bottleneck (lesson 20). A balanced setup wants:
If rollout is much slower, you're rollout-bound — add inference GPUs. If training is much slower, add training GPUs. The right ratio for verifiable-reward RL is usually 2:1 inference:training GPUs on the same total cluster.
The "four copies" cost
The total memory footprint of an RL post-training job is:
| Copy | Per-param bytes | What it's for |
|---|---|---|
| Trainer (sharded) | 18 / NGPU | FSDP shard of weights + master + grads + Adam |
| Reference (frozen) | 2 / NGPU | For KL anchor (lesson 03). Bf16 only, no optimizer. |
| Rollout (inference) | 2 / NGPU,inf | Bf16 weights for sampling. Plus KV cache. |
| Old policy snapshot | ~0 to 2 | For async pipelines; versioned weights cache. |
This is why GRPO (which drops the value head — lesson 11) and DPO (which drops the reward model + rollout — lesson 16) are both popular: each removes one copy.
One worked example · a 7B GRPO run on 8× H100
To make this concrete, let's size a real recipe end to end.
- Model: 7B params. Cluster: 8× H100 (640 GB total).
- Trainer (4 GPUs, FSDP): 18 × 7e9 / 4 = 31.5 GB/GPU. Plus activations (with checkpointing) ~15 GB/GPU at L=4096, B=8. ~47 GB/GPU. Fits.
- Reference (1 GPU): 2 × 7e9 = 14 GB. Fits with room for activations during one forward.
- Rollout (3 GPUs, TP=1, vLLM): 14 GB weights per GPU + KV cache. At B=256, L=4096, KV per token ≈ 131 KB for Llama-3-8B-style GQA: 131 KB × 256 × 4096 / 1024² ≈ 131 GB KV total, ~44 GB per GPU. Combined with 14 GB weights → ~58 GB/GPU. Fits.
- Tokens/step: generate 256 × 4096 = 1M rollout tokens (~30s at 35k tok/s). Train one step on those tokens (~5s at 200k tok/s for B=64 micro-batches). Rollout-bound.
- End-to-end: ~35s/step → ~100 steps/hour → a 1000-step run takes ~10 hours.
Those numbers won't exactly match yours, but the way to get the right numbers is identical — fill in the table, find the bottleneck, scale that side.