Memory & throughput math — what fits where

Before you turn on training, you should be able to write down on the back of an envelope what the model's memory cost is, what sharding will bring it inside an H100, and what your steady-state tokens/second ceiling looks like. This lesson is that envelope.

Where this lesson sits
Lesson 19 covered cluster topology. Lessons 20–21 covered the inference engine (KV cache + scheduling). This lesson is the last systems lesson: the back-of-envelope math that ties them together — bytes per parameter for training, KV cache for inference, what a single H100 can sustain, and how the "four copies" of the policy (trainer + reference + rollout + optional snapshot) add up. After this, lesson 23 ties everything in Part III together with named recipes.

Why this matters in RL specifically

In RL you hold up to four copies of the policy in memory at once: the trainer (bf16 params plus the fp32 optimizer master and Adam states), the frozen reference (lesson 03), the rollout engine's inference copy, and optionally an old-policy snapshot. For a single 70B model the bf16 weights alone are 140 GB, and the trainer's fp32 master plus Adam states add another 840 GB. If your envelope is too low by a factor of 2, you OOM mid-epoch. If it is too high by a factor of 2, half your cluster sits idle. The math is small, and you should know it cold.

Param-byte math (the single most useful identity)

Let P = parameter count (e.g., 7e9 for a 7B model). In bf16 each parameter takes 2 bytes. Then:

| What | Bytes per param | Why |
|---|---|---|
| bf16 weights | 2 | Forward + backward pass storage. |
| fp32 master copy | 4 | For numerical stability during the optimizer step. |
| fp32 grad | 4 | Accumulated in fp32; reduce-scattered in FSDP. |
| AdamW state (m, v) | 4 + 4 = 8 | First and second moments, both fp32. |
| Total (training) | 18 | Often cited as "16": that figure keeps gradients in bf16 (2 bytes) rather than fp32; with fp32 gradients it is 18. Some shops also use bf16 Adam state, saving another 4 bytes (14, or 12 with bf16 gradients). |
| bf16 weights (inference) | 2 | Plus KV cache (see below). |

So for a 7B model, training memory just for parameters and optimizer ≈ 7e9 × 18 = 126 GB. An H100 has 80 GB HBM. You cannot train a 7B model on a single H100 without sharding.

Sanity-check examples
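
A few quick checks of the 18-bytes-per-parameter identity, as a minimal Python sketch. The function names are hypothetical, sizes are in decimal GB to match the rest of the lesson, and the GPU count covers only weights, master copy, gradients, and Adam state; activations and framework overhead push the real requirement higher.

```python
import math

# Bytes per parameter for mixed-precision AdamW training (from the table above).
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "fp32_master": 4,
    "fp32_grad": 4,
    "adam_m_v": 8,
}
H100_HBM_GB = 80  # decimal GB


def train_state_gb(params: float) -> float:
    """Weights + master + grads + Adam state in GB (no activations)."""
    return params * sum(BYTES_PER_PARAM.values()) / 1e9


def min_gpus_for_state(params: float, hbm_gb: float = H100_HBM_GB) -> int:
    """Smallest GPU count whose combined HBM holds the fully sharded state."""
    return math.ceil(train_state_gb(params) / hbm_gb)


for name, p in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
    print(f"{name}: {train_state_gb(p):,.0f} GB of state, "
          f"at least {min_gpus_for_state(p)} H100s before activations")
```

For 7B this reports 126 GB and a floor of 2 GPUs; for 70B, 1,260 GB and a floor of 16, which is already two full 8× H100 nodes.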

Activation memory · the part the envelope often forgets

Forward pass produces intermediate activations that backward needs. For a transformer with hidden size d, sequence length L, batch B, and number of layers N:

M_act  ≈  N × L × B × (12 × d + 2 × L) × bytes

The first term comes from per-layer activations (residual streams, MLP intermediates); the second from attention scores (L × L per head). For a 7B model (d=4096, N=32), L=8192, B=4, in bf16:

≈ 32 × 8192 × 4 × (12·4096 + 2·8192) × 2 ≈ 137 GB

This is more than the 126 GB of param+optimizer. Activation memory is often the actual binding constraint, not parameters.
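
The estimate above is easy to script. The sketch below implements the lesson's approximation exactly as written (it is a rule of thumb, not a profiler measurement); the function name and the second example's settings are illustrative assumptions.

```python
def activation_bytes(n_layers: int, seq_len: int, batch: int, d_model: int,
                     bytes_per_elem: int = 2) -> int:
    """Lesson's approximation: N * L * B * (12*d + 2*L) * bytes."""
    linear_term = 12 * d_model      # residual streams, MLP intermediates
    quadratic_term = 2 * seq_len    # attention scores
    return n_layers * seq_len * batch * (linear_term + quadratic_term) * bytes_per_elem


# The 7B example from the text: d=4096, N=32, L=8192, B=4, bf16.
act = activation_bytes(n_layers=32, seq_len=8192, batch=4, d_model=4096)
print(f"L=8192,  B=4: {act / 1e9:.0f} GB")  # 137 GB

# Same token count per step, but one 32k sequence: the quadratic term now dominates.
act_long = activation_bytes(n_layers=32, seq_len=32768, batch=1, d_model=4096)
print(f"L=32768, B=1: {act_long / 1e9:.0f} GB")  # 241 GB
```

Both calls process 32,768 tokens, yet the single long sequence needs about 1.75× the activation memory because the attention-score term grows with L².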

Activation checkpointing

Don't store activations; recompute them on backward. Full checkpointing costs roughly one extra forward pass (about 33% more training FLOPs: 8 × P per token instead of 6 × P) but reduces activation memory by an order of magnitude. Off by default in PyTorch; almost always on in production. With selective checkpointing (recompute attention, store everything else) you can get down to ~10 GB of activation memory at the cost of ~10% extra compute.

Sequence parallel

For long contexts, even a single sequence's activations may not fit on one GPU. Sequence parallel partitions the sequence dimension across the GPUs in the tensor-parallel group. Combined with FSDP, it makes 32k-context training of 70B models feasible where the activations would otherwise not fit.

KV cache math (the inference side)

For each layer, per token, you cache one K and one V vector per KV head. With full multi-head attention (one KV head per query head) the cache per token is 2 × N_layers × d × bytes. With Grouped-Query Attention (GQA), KV heads are a fraction of query heads, and the cache shrinks by the same factor.

If we pretend GQA didn't exist: a 32-layer, d=4096 model in bf16 would cost 2 × 32 × 4096 × 2 = 524 KB / token; an 80-layer, d=8192 model would cost 2 × 80 × 8192 × 2 = 2.6 MB / token. Real Llama-3 uses GQA — 8 KV heads against 32 query heads for 8B (4× smaller, ~131 KB / token), and 8 KV heads against 64 query heads for 70B (8× smaller, ~328 KB / token).

Even with GQA the numbers are punishing: serving 256 concurrent 8k-context streams on Llama-3-70B needs 256 × 8192 × 328 KB ≈ 687 GB of KV cache, more than the 640 GB of HBM on an entire 8× H100 node. KV-cache size is what GQA, MQA, sliding-window attention, and KV quantization to int8 are all attacking.
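
The KV arithmetic above can be checked with a short sketch. The function name is hypothetical; the layer counts, hidden sizes, and head counts are the Llama-3 shapes quoted in the text, and the cache is assumed to be bf16 (2 bytes per element).

```python
def kv_bytes_per_token(n_layers: int, d_model: int, n_heads: int,
                       n_kv_heads: int, bytes_per_elem: int = 2) -> int:
    """One K and one V vector per KV head, per layer, per token."""
    head_dim = d_model // n_heads
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Vanilla MHA (KV heads == query heads) vs. GQA with 8 KV heads.
mha_8b = kv_bytes_per_token(32, 4096, n_heads=32, n_kv_heads=32)
gqa_8b = kv_bytes_per_token(32, 4096, n_heads=32, n_kv_heads=8)
mha_70b = kv_bytes_per_token(80, 8192, n_heads=64, n_kv_heads=64)
gqa_70b = kv_bytes_per_token(80, 8192, n_heads=64, n_kv_heads=8)

print(f"8B : {mha_8b / 1e3:.0f} KB/token MHA, {gqa_8b / 1e3:.0f} KB/token GQA")
print(f"70B: {mha_70b / 1e6:.1f} MB/token MHA, {gqa_70b / 1e3:.0f} KB/token GQA")

# 256 concurrent streams at 8k context on the 70B shape.
total = 256 * 8192 * gqa_70b
print(f"256 x 8k streams on 70B: {total / 1e9:.0f} GB of KV cache")
```

Setting `n_kv_heads=1` gives the Multi-Query Attention figure for the same shape.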

GQA / MQA, in one paragraph
Grouped-Query Attention reduces the number of K/V heads (e.g., 8 instead of 32, or 8 instead of 64) while keeping all the query heads. The result is a 4–8× smaller KV cache, at the cost of slight quality degradation. Multi-Query Attention is the extreme version (1 K/V head total). Almost every modern model (Llama-3, Qwen-2, Mistral) uses GQA precisely because KV cache is the inference memory bottleneck.

Interactive · the model-sizing envelope

[Interactive widget: memory envelope. Inputs: model size, context length, and sharding. Outputs: per-GPU train memory (GB), per-GPU inference memory (GB), KV cache per token (KB), and whether the result fits in an H100's 80 GB. Assumes bf16 training at 18 bytes/param (weights + fp32 master + grads + optimizer state), activation checkpointing that cuts activation memory by 8×, and GQA with 8 KV heads for the KV cache.]

FLOP math · how many tokens/sec can the cluster do?

One forward pass on a transformer costs ≈ 2 × P × tokens FLOPs (one multiply and one add per parameter per token). Forward + backward (training) ≈ 6 × P × tokens. With full activation checkpointing the recompute adds another forward, so ≈ 8 × P × tokens. An H100 SXM5 in bf16 has a dense peak of ~989 TFLOP/s; ~800 TFLOP/s sustained would be about 81% of peak, so treat it as an optimistic ceiling.

So a single H100 at that ceiling processes about:

800e12 / (6 × P) tokens/sec   (no checkpointing)  ·  800e12 / (8 × P) tokens/sec   (with full recompute)

Without checkpointing, that is ~19 k tokens/sec per H100 for 7B, ~1.9 k for 70B, and ~330 for 405B. These are per-GPU training-token ceilings that you will not hit in practice; ~50–65% MFU is excellent, which scales the figures to roughly 60–80% of these values.
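
The per-GPU figures follow directly from the 6 × P and 8 × P costs and the 800 TFLOP/s sustained rate used above. A minimal sketch, with a hypothetical function name and `achieved_flops` as the only tunable assumption:

```python
def train_tokens_per_sec(params: float, achieved_flops: float = 800e12,
                         recompute: bool = False) -> float:
    """Per-GPU training tokens/sec: achieved FLOP/s over FLOPs per token."""
    flops_per_token = (8 if recompute else 6) * params
    return achieved_flops / flops_per_token


for name, p in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
    plain = train_tokens_per_sec(p)
    ckpt = train_tokens_per_sec(p, recompute=True)
    print(f"{name}: {plain:,.0f} tok/s without checkpointing, "
          f"{ckpt:,.0f} tok/s with full recompute")
```

To model a more typical run, pass a lower `achieved_flops`, for example `0.55 * 989e12` for 55% of the H100's dense bf16 peak.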

RL-specific throughput accounting

In RL you don't just train on tokens, you generate them. Per-GPU generation throughput is usually 3–6× lower than training throughput because decoding is memory-bandwidth-bound (lesson 20). A balanced setup wants:

rollout_tokens_per_step / generation_throughput  ≈  training_tokens_per_step / training_throughput

If rollout is much slower, you're rollout-bound — add inference GPUs. If training is much slower, add training GPUs. The right ratio for verifiable-reward RL is usually 2:1 inference:training GPUs on the same total cluster.
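
The balance condition can be turned into a GPU split by comparing GPU-seconds of work on each side. The sketch below is one way to do that; the function name and every number in the example call are hypothetical placeholders (generation assumed 4× slower per GPU, and the trainer assumed to process twice the rollout tokens per step), so substitute your own measurements.

```python
def balanced_split(total_gpus: int, rollout_tokens: float, train_tokens: float,
                   gen_tps_per_gpu: float, train_tps_per_gpu: float) -> tuple[int, int]:
    """Split GPUs so rollout time and training time per step are roughly equal.

    Returns (inference_gpus, training_gpus).
    """
    gen_gpu_seconds = rollout_tokens / gen_tps_per_gpu
    train_gpu_seconds = train_tokens / train_tps_per_gpu
    inf_share = gen_gpu_seconds / (gen_gpu_seconds + train_gpu_seconds)
    # Keep at least one GPU on each side.
    n_inf = min(total_gpus - 1, max(1, round(total_gpus * inf_share)))
    return n_inf, total_gpus - n_inf


# Hypothetical step: 1M generated tokens, 2M trained tokens,
# 2,500 tok/s per GPU generating vs. 10,000 tok/s per GPU training.
n_inf, n_train = balanced_split(24, 1e6, 2e6, 2_500, 10_000)
print(f"{n_inf} inference GPUs : {n_train} training GPUs")
```

With these placeholder numbers the split comes out at 16:8, i.e. the 2:1 inference:training ratio mentioned above; a slower rollout engine or longer generations push it further toward inference.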

The "four copies" cost

The total memory footprint of an RL post-training job is:

| Copy | Per-param bytes | What it's for |
|---|---|---|
| Trainer (sharded) | 18 / N_GPU | FSDP shard of weights + master + grads + Adam |
| Reference (frozen) | 2 / N_GPU | For KL anchor (lesson 03). bf16 only, no optimizer. |
| Rollout (inference) | 2 / N_GPU,inf | bf16 weights for sampling. Plus KV cache. |
| Old policy snapshot | ~0 to 2 | For async pipelines; versioned weights cache. |

This is why GRPO (which drops the value head, lesson 11) and DPO (which drops the reward model + rollout, lesson 16) are both popular: each removes at least one model copy from the memory budget.

One worked example · a 7B GRPO run on 8× H100

To make this concrete, let's size a real recipe end to end.
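
One way to fill in such a sizing table is sketched below. Every setting in `Recipe` is an assumption chosen for illustration, not a measured configuration: a 4 + 4 split between trainer and rollout GPUs, a full bf16 replica on each rollout GPU, the reference model sharded alongside the trainer, and a Llama-style shape with 8 KV heads. Headroom figures ignore CUDA context, fragmentation, and the inference engine's own reservations.

```python
from dataclasses import dataclass

GB = 1e9
H100_HBM_GB = 80.0


@dataclass
class Recipe:
    params: float = 7e9      # 7B policy
    n_train_gpus: int = 4    # assumed FSDP group
    n_rollout_gpus: int = 4  # assumed: one full bf16 replica per GPU
    n_layers: int = 32       # assumed Llama-style shape
    d_model: int = 4096
    n_heads: int = 32
    n_kv_heads: int = 8      # assumed GQA


def size(r: Recipe) -> dict:
    """Per-GPU memory budget for the trainer and rollout sides."""
    trainer = 18 * r.params / r.n_train_gpus / GB    # weights + master + grads + Adam
    reference = 2 * r.params / r.n_train_gpus / GB   # frozen bf16, sharded
    rollout_weights = 2 * r.params / GB              # full replica per rollout GPU
    head_dim = r.d_model // r.n_heads
    kv_per_token = 2 * r.n_layers * r.n_kv_heads * head_dim * 2  # bf16 K and V
    kv_budget_bytes = (H100_HBM_GB - rollout_weights) * GB
    return {
        "train_gpu_state_gb": trainer + reference,
        "train_gpu_headroom_gb": H100_HBM_GB - trainer - reference,
        "rollout_gpu_weights_gb": rollout_weights,
        "kv_bytes_per_token": kv_per_token,
        "kv_tokens_per_rollout_gpu": int(kv_budget_bytes // kv_per_token),
    }


budget = size(Recipe())
for key, value in budget.items():
    print(f"{key:28s} {value:,.1f}" if isinstance(value, float) else f"{key:28s} {value:,}")
```

Under these assumptions each trainer GPU holds 35 GB of sharded state, leaving about 45 GB for activations, and each rollout GPU holds 14 GB of weights with room for roughly half a million cached tokens.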

Those numbers won't exactly match yours, but the way to get the right numbers is identical — fill in the table, find the bottleneck, scale that side.

Two common mistakes
(1) Forgetting that fp32 Adam states cost 8 bytes per parameter: twice an fp32 copy of the weights and four times the bf16 weights, not the same size. (2) Underestimating activation memory at long context: it grows linearly in L for the residual stream and quadratically in L for attention. A bf16 70B at L=32k without sequence parallel doesn't fit on any single 8× H100 node, full stop.

Takeaway
Train memory ≈ 18 bytes/param. Inference ≈ 2 bytes/param + KV cache. KV per token ≈ 2 × N_layers × d × bytes for vanilla MHA, divided by the GQA ratio (4× for Llama-3-8B, 8× for 70B). One H100 SXM = 80 GB HBM, 3.35 TB/s HBM bandwidth, ~989 TFLOP/s bf16 dense peak. Most RL design decisions reduce to staying within those numbers; do the math first.