Napkin math — the numbers every design rests on
A system designer who can't estimate a model's cost in FLOPs, bytes, and dollars in their head is flying blind. This lesson builds the seven estimates you reuse in every later lesson: forward/training FLOPs, weight memory, optimizer memory, KV-cache bytes, activation memory, arithmetic intensity, and $/token. None of them needs more than multiplication.
1 · Compute — the 2N and 6N rules
A transformer's compute is dominated by matrix multiplies. Multiplying an m×k matrix by a k×n matrix costs 2mkn FLOPs (the 2 is one multiply plus one add). Summed over a model with N parameters, every parameter takes part in roughly one multiply-add per token, giving the two rules you will use more than any others:
forward FLOPs per token ≈ 2N
training FLOPs per token ≈ 6N
The backward pass costs about twice the forward (it computes gradients w.r.t. both inputs and weights), so 2N + 4N = 6N. These ignore attention's O(seq²) term, which is negligible until context lengths get long — a correction we revisit in lesson 06.
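The two rules can be turned into a quick calculator. This is a minimal sketch rather than a profiler: the function names, the 7B model size, and the 1T-token run are illustrative assumptions, and attention's O(seq²) term is ignored just as in the rules themselves.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Forward pass: about one multiply-add (2 FLOPs) per parameter per token."""
    return 2.0 * n_params


def training_flops_per_token(n_params: float) -> float:
    """Forward (2N) plus backward (about 4N) per token."""
    return forward_flops_per_token(n_params) + 4.0 * n_params


def total_training_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute for a run over n_tokens tokens."""
    return training_flops_per_token(n_params) * n_tokens


n = 7e9        # hypothetical 7B-parameter model
tokens = 1e12  # hypothetical 1T-token training run
print(f"forward:  {forward_flops_per_token(n):.2e} FLOPs/token")
print(f"training: {training_flops_per_token(n):.2e} FLOPs/token")
print(f"full run: {total_training_flops(n, tokens):.2e} FLOPs")
```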
2 · Memory — four things live in HBM
GPU memory (HBM) is the scarcest resource in most designs. Four tenants compete for it. Know all four or you'll OOM in production:
| Tenant | Size | Present during… |
|---|---|---|
| Weights | 2N bytes (fp16/bf16); N at int8; N/2 at int4 | always |
| Optimizer + grads (Adam, mixed precision) | ≈ 12–16N bytes | training only |
| Activations | depends on batch × seq × layers (see §4) | training (until bwd); small in inference |
| KV cache | grows with concurrent tokens (see §3) | inference (the big one) |
Weights
fp16/bf16 is 2 bytes/param, so a model is 2N bytes. 7B → 14 GB, 70B → 140 GB, 405B → 810 GB. Compare to an 80 GB H100: 7B fits with room for KV; 70B needs ≥2 GPUs; 405B needs ≥11. This single comparison is often the first forcing function in a design.
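The fit check above is easy to script. The sketch below assumes decimal gigabytes (80 GB = 80e9 bytes) and counts weights only, so the GPU counts are lower bounds; the helper names are made up for illustration.

```python
import math

# Bytes per parameter at each precision, from the table in this section.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}
H100_HBM_BYTES = 80e9  # 80 GB, decimal, as in the text


def weight_bytes(n_params: float, dtype: str = "bf16") -> float:
    """Memory needed to hold the weights alone."""
    return n_params * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(n_params: float, dtype: str = "bf16",
                         hbm_bytes: float = H100_HBM_BYTES) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_bytes(n_params, dtype) / hbm_bytes)


for name, n in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
    gb = weight_bytes(n) / 1e9
    print(f"{name}: {gb:.0f} GB of bf16 weights, at least {min_gpus_for_weights(n)} GPU(s)")
```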
Optimizer state — why training needs ~8× the memory of inference
Standard mixed-precision Adam holds, per parameter: an fp32 master weight (4B), fp32 momentum (4B), fp32 variance (4B), plus fp16 gradient (2B) and the fp16 weight (2B). That's ~16 bytes/param — eight times the weights alone.
A 70B model needs ~1.1 TB just for weights plus optimizer state. That is 14 H100s' worth of HBM before a single activation is stored. This is the entire reason FSDP/ZeRO (lesson 07) exist: they shard those 16N bytes across data-parallel ranks so each rank holds 16N / dp bytes.
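The 16 bytes/param budget and the effect of sharding can be checked with a few lines. This is a sketch under the idealized assumption that all 16N bytes shard evenly across ranks; the data-parallel degree of 64 is a hypothetical example, not a recommendation.

```python
import math

# Bytes per parameter for mixed-precision Adam, as itemized in the text.
ADAM_MIXED_PRECISION_BYTES = {
    "fp32 master weight": 4,
    "fp32 momentum": 4,
    "fp32 variance": 4,
    "fp16 gradient": 2,
    "fp16 weight": 2,
}
STATE_BYTES_PER_PARAM = sum(ADAM_MIXED_PRECISION_BYTES.values())  # 16


def training_state_bytes(n_params: float) -> float:
    """Weights, gradients, and optimizer state; activations not included."""
    return n_params * STATE_BYTES_PER_PARAM


def sharded_state_bytes_per_rank(n_params: float, dp: int) -> float:
    """Idealized full sharding: each rank holds 16N / dp bytes."""
    return training_state_bytes(n_params) / dp


n = 70e9
total = training_state_bytes(n)
print(f"total: {total / 1e12:.2f} TB, {math.ceil(total / 80e9)} H100s' worth")
print(f"per rank at dp=64: {sharded_state_bytes_per_rank(n, 64) / 1e9:.1f} GB")
```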
KV cache — the tenant that scales with traffic
Covered next; it is what makes serving memory a moving target rather than a constant.
3 · KV cache — bytes per token
During decode, every past token's key and value vectors are cached so attention doesn't recompute them. Per token:
kv_bytes_per_token = 2 · n_layers · n_kv_heads · head_dim · bytes_per_element
The leading 2 is "K and V." Note n_kv_heads, not n_query_heads: this is exactly what GQA/MQA shrink (vLLM 09). Worked for Llama-3-70B (80 layers, 8 KV heads, head_dim 128, fp16):
2 · 80 · 8 · 128 · 2 bytes = 327,680 bytes ≈ 320 KB per token
So an 8K-token context holds 8192 · 320 KB ≈ 2.6 GB of KV per request. A single 80 GB H100 cannot even hold this model's 140 GB of fp16 weights, so the replica spans several GPUs; if it has ~140 GB of HBM left after weights, you fit roughly 140/2.6 ≈ 50 such requests. That number, not compute, is usually your batch-size ceiling in serving, and it is the headline reason PagedAttention (pack KV tightly) and GQA (fewer KV heads) matter. We size real replicas on this in lesson 04.
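The KV arithmetic above can be reproduced directly. In this sketch the model shape is the Llama-3-70B configuration quoted in the text, while the 140 GB free-memory budget is simply the figure used in the example; the function names are illustrative.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes for one token; the leading 2 counts K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(free_hbm_bytes: float, context_tokens: int,
                            per_token_bytes: int) -> int:
    """How many full-length contexts fit in the HBM left after weights."""
    return int(free_hbm_bytes // (context_tokens * per_token_bytes))


per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
per_request = 8192 * per_token
print(f"{per_token} bytes/token ({per_token / 1024:.0f} KiB)")
print(f"{per_request / 1e9:.2f} GB per 8K-token request")
print(max_concurrent_requests(140e9, 8192, per_token), "requests fit in 140 GB")
```

The exact count comes out slightly above the rounded 50 in the text because the hand calculation rounds the per-request size.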
4 · Activations — the training memory wildcard
In training you must keep layer activations around for the backward pass. A rough estimate per transformer layer is ≈ s · b · h · c bytes, where s = sequence length, b = microbatch size, h = hidden size, and c is a constant of roughly 10–35 depending on what is recomputed. The exact constant matters less than the scaling: activations grow with batch × sequence × layers, while weights and optimizer state don't. This is why long-context training blows up memory and why gradient (activation) checkpointing is standard at scale (lesson 07): it recomputes activations in the backward pass instead of storing them, trading ~33% more compute (one extra 2N forward on top of 6N) for a big memory cut.
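To see the scaling rather than the constant, the estimate can be evaluated at a few sequence lengths. This is a rough sketch: the hidden size, layer count, and microbatch are hypothetical, the factor range is the 10–35 quoted above, and the overhead function assumes full recomputation of one forward pass.

```python
def activation_bytes(seq_len: int, microbatch: int, hidden: int,
                     n_layers: int, factor: float) -> float:
    """Rough stored-activation bytes: s * b * h * factor per layer."""
    return seq_len * microbatch * hidden * factor * n_layers


def full_recompute_overhead() -> float:
    """Extra compute from rerunning one forward pass (2N) on top of 6N."""
    return 2.0 / 6.0


# Hypothetical shape: hidden 8192, 80 layers, microbatch 1.
for s in (2048, 8192, 32768):
    low = activation_bytes(s, 1, 8192, 80, factor=10)
    high = activation_bytes(s, 1, 8192, 80, factor=35)
    print(f"seq {s:>6}: {low / 1e9:7.1f} to {high / 1e9:7.1f} GB")
print(f"checkpointing compute overhead: {full_recompute_overhead():.0%}")
```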
5 · The roofline — are you compute- or bandwidth-bound?
From lesson 01: a kernel's speed is min(compute_roof, bandwidth_roof). The crossover is the machine's arithmetic intensity ridge:
ridge (FLOP/byte) = peak FLOP/s / memory bandwidth (bytes/s)
For an H100 bf16: 990e12 / 3.35e12 ≈ 295 FLOP/byte. A workload doing fewer than ~295 FLOPs per byte read is bandwidth-bound; more, compute-bound. Prefill (big matmuls over many tokens) sits right of the ridge → compute-bound. Decode at small batch sits far left → bandwidth-bound. Batching decode moves it rightward (more tokens per weight-read) — the whole game of lesson 04.
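A small roofline model shows how batching moves decode toward the ridge. The decode intensity used here is a simplifying assumption, not something stated in the text: each step reads every weight once (2 bytes per parameter) and performs 2N FLOPs per sequence in the batch, with KV-cache reads ignored. The batch sizes are arbitrary examples.

```python
def ridge_flop_per_byte(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity at which the two roofs meet."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Roofline: min(compute roof, bandwidth roof)."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Assumed model: 2N FLOPs per sequence, bytes_per_param * N bytes read per step."""
    return 2.0 * batch_size / bytes_per_param


H100_PEAK, H100_BW = 990e12, 3.35e12  # dense bf16 FLOP/s and bytes/s from the text
ridge = ridge_flop_per_byte(H100_PEAK, H100_BW)
for batch in (1, 32, 512):
    ai = decode_intensity(batch)
    bound = "compute" if ai >= ridge else "bandwidth"
    tflops = attainable_flops(ai, H100_PEAK, H100_BW) / 1e12
    print(f"batch {batch:>3}: {ai:.0f} FLOP/byte, {bound}-bound, {tflops:.0f} TFLOP/s attainable")
```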
6 · The spec sheet to memorize
| GPU | HBM | BW (TB/s) | bf16 (TFLOP/s) | ridge (F/B) |
|---|---|---|---|---|
| A100 80GB | 80 GB | 2.0 | 312 | ~156 |
| H100 SXM | 80 GB | 3.35 | 990 | ~295 |
| H200 | 141 GB | 4.8 | 990 | ~206 |
| B200 (≈) | 192 GB | 8.0 | ~2250 | ~280 |
FLOP/s are dense bf16 without sparsity; vendors quote 2× with sparsity — ignore that for design. Interconnect: NVLink ~900 GB/s intra-node (8 GPUs), InfiniBand ~400 Gb/s ≈ 50 GB/s inter-node. The 18× gap between intra- and inter-node bandwidth drives every parallelism-placement decision in lesson 07.
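The ridge column and the interconnect gap both follow from the other numbers, which makes the sheet easier to memorize. The sketch below just re-derives them from the values quoted above; the dictionary layout and helper names are illustrative.

```python
# (HBM GB, bandwidth TB/s, dense bf16 TFLOP/s), copied from the spec table.
SPECS = {
    "A100 80GB": (80, 2.0, 312),
    "H100 SXM": (80, 3.35, 990),
    "H200": (141, 4.8, 990),
    "B200": (192, 8.0, 2250),
}


def ridge_from_spec(tflops: float, bw_tb_s: float) -> float:
    """TFLOP/s divided by TB/s gives FLOP/byte directly."""
    return tflops / bw_tb_s


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Network links are quoted in bits; memory and NVLink in bytes."""
    return gbit_per_s / 8


for gpu, (hbm_gb, bw, tflops) in SPECS.items():
    print(f"{gpu:<10} ridge ~ {ridge_from_spec(tflops, bw):.0f} FLOP/byte")

nvlink_gb_s = 900
infiniband_gb_s = gbit_to_gbyte(400)
print(f"intra/inter-node bandwidth gap: {nvlink_gb_s / infiniband_gb_s:.0f}x")
```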
7 · Dollars — closing the loop
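Everything reduces to GPU-hours, and GPU-hours to dollars. The two master formulas:
training $ ≈ 6 · N · D / (achieved_flops_per_s_per_gpu · 3600) · $/gpu-hr
serving $/1M tok ≈ $/gpu-hr / (throughput_tok_per_s_per_gpu · 3600) · 1e6
Here D is the number of training tokens (from the 6N rule in §1), and achieved FLOP/s is the peak figure scaled by the utilization you actually reach.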
Serving cost is dominated by achieved tokens/sec/GPU, which is set by batch size, which is capped by KV memory, which we computed in §3. The chain bytes → batch → throughput → $ is the spine of every serving design.
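The dollar formulas are short enough to keep as two helper functions. This is a sketch with made-up inputs: the $2/GPU-hour price, the 1,000 tok/s throughput, and the 400 TFLOP/s achieved rate are placeholders, not measurements. The training-cost function is an assumption built from the 6N rule in §1 rather than a formula quoted from this section.

```python
def serving_cost_per_million_tokens(price_per_gpu_hr: float,
                                    tok_per_s_per_gpu: float) -> float:
    """$/1M tokens = $/gpu-hr / (tokens per GPU-hour) * 1e6."""
    return price_per_gpu_hr / (tok_per_s_per_gpu * 3600) * 1e6


def training_cost(n_params: float, n_tokens: float,
                  achieved_flops_per_gpu: float, price_per_gpu_hr: float) -> float:
    """Assumed form: 6*N*D FLOPs, converted to GPU-hours, then to dollars."""
    gpu_hours = 6 * n_params * n_tokens / (achieved_flops_per_gpu * 3600)
    return gpu_hours * price_per_gpu_hr


# Hypothetical inputs for illustration only.
print(f"serving: ${serving_cost_per_million_tokens(2.0, 1000):.3f} per 1M tokens")
print(f"serving at 2x throughput: ${serving_cost_per_million_tokens(2.0, 2000):.3f} per 1M tokens")
print(f"training 7B on 1T tokens: ${training_cost(7e9, 1e12, 400e12, 2.0):,.0f}")
```

Doubling throughput halves the serving cost, which is why the KV-capped batch size from §3 feeds straight into price.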