Napkin math — the numbers every design rests on

A system designer who can't estimate a model's cost in FLOPs, bytes, and dollars in their head is flying blind. This lesson builds the seven estimates you reuse in every later lesson: forward/training FLOPs, weight memory, optimizer memory, KV-cache bytes, activation memory, arithmetic intensity, and $/token. None of them needs more than multiplication.

The contract of this lesson

These are order-of-magnitude estimates, accurate to maybe ±30%. That is exactly the right precision for design: it tells you "you need ~16 GPUs, not 2 and not 200" — enough to choose a topology. The last 30% is a profiler's job, after the design is right.

1 · Compute — the 2N and 6N rules

A transformer's compute is dominated by matrix multiplies. Multiplying an m×k matrix by a k×n matrix costs 2mkn FLOPs (the 2 counts one multiply plus one add). In a model with N parameters, each parameter takes part in roughly one multiply-add per token, giving the two rules you will use more than any others:

forward pass: ≈ 2N FLOPs / token
training step (fwd+bwd): ≈ 6N FLOPs / token

The backward pass costs about twice the forward (it computes gradients w.r.t. both inputs and weights), so 2N + 4N = 6N. These ignore attention's O(seq²) term, which is negligible until context lengths get long — a correction we revisit in lesson 06.

Worked: cost to pretrain a 70B model on 15T tokens

6 · 70e9 · 15e12 = 6.3e24 FLOPs. On 1,024 H100s at 990 TFLOP/s peak and a realistic 40% MFU: usable rate = 1024 · 990e12 · 0.4 ≈ 4.05e17 FLOP/s. Time = 6.3e24 / 4.05e17 ≈ 1.56e7 s ≈ 180 days. At ~$2/GPU-hr that's 1024 · 24·180 · 2 ≈ $8.8M. Now you understand why the parallelism choice in lesson 07 — which sets MFU — is a multi-million-dollar decision.
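The worked example above can be reproduced in a few lines. This is a minimal sketch, not a cost model: the function names and parameters (`training_flops`, `pretrain_estimate`, `usd_per_gpu_hour`, and so on) are hypothetical, and the inputs are the figures quoted in the text.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute using the 6N-FLOPs-per-token rule."""
    return 6.0 * n_params * n_tokens


def pretrain_estimate(n_params, n_tokens, n_gpus, peak_flops, mfu, usd_per_gpu_hour):
    """Return (total FLOPs, wall-clock days, dollars) for one pretraining run."""
    flops = training_flops(n_params, n_tokens)
    usable_rate = n_gpus * peak_flops * mfu   # FLOP/s actually achieved by the cluster
    seconds = flops / usable_rate
    days = seconds / 86_400
    gpu_hours = n_gpus * seconds / 3_600
    return flops, days, gpu_hours * usd_per_gpu_hour


# Inputs from the text: 70B params, 15T tokens, 1,024 H100s, 40% MFU, ~$2/GPU-hr.
flops, days, dollars = pretrain_estimate(
    n_params=70e9, n_tokens=15e12, n_gpus=1024,
    peak_flops=990e12, mfu=0.40, usd_per_gpu_hour=2.0,
)
print(f"{flops:.2e} FLOPs, {days:.0f} days, ${dollars / 1e6:.1f}M")
```

Changing `mfu` from 0.40 to 0.30 is a quick way to see how much a worse parallelism choice would cost in days and dollars.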

2 · Memory — four things live in HBM

GPU memory (HBM) is the scarcest resource in most designs. Four tenants compete for it. Know all four or you'll OOM in production:

Tenant	Size	Present during…
Weights	2N bytes (fp16/bf16); N at int8; N/2 at int4	always
Optimizer + grads (Adam, mixed precision)	≈ 12–16N bytes	training only
Activations	depends on batch × seq × layers (see §4)	training (until bwd); small in inference
KV cache	grows with concurrent tokens (see §3)	inference (the big one)

Weights

fp16/bf16 is 2 bytes/param, so a model is 2N bytes. 7B → 14 GB, 70B → 140 GB, 405B → 810 GB. Compare to an 80 GB H100: 7B fits with room for KV; 70B needs ≥2 GPUs; 405B needs ≥11. This single comparison is often the first forcing function in a design.
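The fit check above is simple enough to script. A minimal sketch, assuming 1 GB = 1e9 bytes and counting weights only (no KV cache, activations, or runtime overhead); `weight_gb` and `min_gpus_for_weights` are hypothetical helper names.

```python
import math

# Bytes per parameter for the precisions listed in the table above.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(n_params: float, dtype: str = "bf16") -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(n_params: float, dtype: str = "bf16", hbm_gb: float = 80.0) -> int:
    """Lower bound on GPU count from weights alone."""
    return math.ceil(weight_gb(n_params, dtype) / hbm_gb)


for n in (7e9, 70e9, 405e9):
    print(f"{n / 1e9:.0f}B: {weight_gb(n):.0f} GB, at least {min_gpus_for_weights(n)} x 80 GB GPU")
```

This is a floor, not a recommendation: a real deployment needs headroom for the other three tenants.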

Optimizer state — why training needs ~10× the memory of inference

Standard mixed-precision Adam holds, per parameter: an fp32 master weight (4B), fp32 momentum (4B), fp32 variance (4B), plus fp16 gradient (2B) and the fp16 weight (2B). That's ~16 bytes/param — eight times the weights alone.

training memory (no parallelism) ≈ (2 + 14) · N + activations = 16N + activations

A 70B model needs ~1.1 TB just for weights and optimizer state: 14 H100s' worth of HBM before a single activation. This is the entire reason FSDP/ZeRO (lesson 07) exists: shard those 16N bytes across data-parallel ranks so each holds 16N / dp.
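The 16N rule and the effect of sharding can be sketched as follows. This assumes the per-parameter breakdown given above and that all 16 bytes are sharded evenly across ranks, which is a simplification; `training_state_bytes` and `dp_shards` are hypothetical names.

```python
def training_state_bytes(n_params: float, dp_shards: int = 1) -> float:
    """Bytes per rank for weights + grads + Adam state (mixed precision).

    Per parameter: fp16 weight (2) + fp16 grad (2) + fp32 master weight (4)
    + fp32 momentum (4) + fp32 variance (4) = 16 bytes.
    Assumes everything is sharded evenly across dp_shards ranks.
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return bytes_per_param * n_params / dp_shards


unsharded = training_state_bytes(70e9)
print(f"unsharded: {unsharded / 1e12:.2f} TB = {unsharded / 80e9:.0f} x 80 GB")

for dp in (8, 64, 512):
    per_rank = training_state_bytes(70e9, dp_shards=dp)
    print(f"dp={dp:>3}: {per_rank / 1e9:.1f} GB per rank (before activations)")
```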

KV cache — the tenant that scales with traffic

Covered next; it is what makes serving memory a moving target rather than a constant.

3 · KV cache — bytes per token

During decode, every past token's key and value vectors are cached so attention doesn't recompute them. Per token:

kv_bytes/token = 2 · n_layers · n_kv_heads · head_dim · dtype_bytes

The leading 2 is "K and V." Note n_kv_heads, not n_query_heads — this is exactly what GQA/MQA shrink (vLLM 09). Worked for Llama-3-70B (80 layers, 8 KV heads, head_dim 128, fp16):

2 · 80 · 8 · 128 · 2 = 327,680 bytes ≈ 320 KB / token

So an 8K-token context holds 8192 · 320 KB ≈ 2.6 GB of KV, per request. A single 80 GB H100 cannot hold the 140 GB of fp16 weights, so a replica spans several GPUs; if ~140 GB of HBM is left across them after weights, you fit roughly 140/2.6 ≈ 50 such requests. That number, not compute, is usually your batch-size ceiling in serving, and the headline reason PagedAttention (pack KV tightly) and GQA (fewer KV heads) matter. We size real replicas on this in lesson 04.
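The per-token formula and the resulting concurrency ceiling can be sketched like this. `free_hbm_bytes` is a hypothetical input (whatever HBM is left for KV on the replica after weights); the sketch ignores fragmentation and runtime overhead, so treat the result as an upper bound.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV-cache bytes per cached token; the leading 2 is K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(free_hbm_bytes: float, context_tokens: int, per_token_bytes: int) -> int:
    """How many full-length requests fit in the HBM left for KV."""
    return int(free_hbm_bytes // (context_tokens * per_token_bytes))


# Llama-3-70B figures from the text: 80 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2)
per_request = 8192 * per_token
print(f"{per_token:,} bytes/token, {per_request / 1e9:.2f} GB per 8K request")
print("requests that fit in 140 GB:", max_concurrent_requests(140e9, 8192, per_token))
```

The exact count lands slightly above the rounded ≈ 50 in the text because the text divides by 2.6 rather than the unrounded per-request size.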

The KV cache is the serving designer's main antagonist

Weights are a fixed admission fee; KV is the variable cost that decides how many users share a GPU. Long contexts, big batches, and many KV heads all inflate it. Half of inference optimization (lesson 06) is, at bottom, "spend fewer bytes per cached token."

4 · Activations — the training memory wildcard

In training you must keep layer activations around for the backward pass. A rough estimate per transformer layer is ≈ c · s · b · h bytes, where s = sequence length, b = microbatch size, h = hidden size, and c is a constant of roughly 10–35 depending on what is recomputed. The exact constant matters less than the scaling: activations grow with batch × sequence × layers, while weights and optimizer state don't. This is why long-context training blows up memory and why gradient (activation) checkpointing is standard at scale (lesson 07): it recomputes activations in the backward pass instead of storing them, trading ~33% more compute for a big memory cut.
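To see the scaling rather than the constant, here is a rough sketch. It makes several assumptions: the per-layer constant is treated as bytes, `factor=34.0` is just one value picked from the 10–35 range, and the shape (hidden size 8192, 80 layers) is illustrative. `activation_bytes` is a hypothetical helper.

```python
def activation_bytes(seq_len: int, microbatch: int, hidden: int, n_layers: int,
                     factor: float = 34.0) -> float:
    """Rough stored-activation estimate: factor * s * b * h per layer, times layers.

    `factor` is assumed to be in bytes and to sit somewhere in the 10-35 range.
    """
    return factor * seq_len * microbatch * hidden * n_layers


base = activation_bytes(seq_len=8192, microbatch=1, hidden=8192, n_layers=80)
long_ctx = activation_bytes(seq_len=32768, microbatch=1, hidden=8192, n_layers=80)
print(f"8K context:  {base / 1e9:.0f} GB")
print(f"32K context: {long_ctx / 1e9:.0f} GB ({long_ctx / base:.0f}x)")

# Full recomputation repeats the forward pass (2N) on top of the usual 6N,
# which is where the ~33% extra compute comes from.
recompute_overhead = (6 + 2) / 6 - 1
print(f"checkpointing compute overhead: {recompute_overhead:.0%}")
```

Quadrupling the sequence length quadruples the estimate, while the 16N weights-and-optimizer term from §2 stays fixed.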

5 · The roofline — are you compute- or bandwidth-bound?

From lesson 01: a kernel's speed is min(compute_roof, bandwidth_roof). The crossover is the machine's arithmetic intensity ridge:

ridge (FLOP/byte) = peak_FLOPs / peak_bandwidth

For an H100 bf16: 990e12 / 3.35e12 ≈ 295 FLOP/byte. A workload doing fewer than ~295 FLOPs per byte read is bandwidth-bound; more, compute-bound. Prefill (big matmuls over many tokens) sits right of the ridge → compute-bound. Decode at small batch sits far left → bandwidth-bound. Batching decode moves it rightward (more tokens per weight-read) — the whole game of lesson 04.
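The roofline itself is one `min`. The sketch below also includes a deliberately crude decode model, labeled as an assumption: each decode step does 2N FLOPs per sequence and reads the bf16 weights once for the whole batch, ignoring KV-cache reads. Under that assumption the intensity is simply the batch size. All function names are hypothetical.

```python
def ridge(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / peak_bw


def attainable_flops(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * peak_bw)


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOP per byte for one decode step, counting weight reads only (assumption).

    2N FLOPs per sequence, bytes_per_param * N bytes read once per batch.
    """
    return 2.0 * batch_size / bytes_per_param


H100_PEAK, H100_BW = 990e12, 3.35e12
print(f"H100 ridge: {ridge(H100_PEAK, H100_BW):.0f} FLOP/byte")
for batch in (1, 32, 512):
    rate = attainable_flops(decode_intensity(batch), H100_PEAK, H100_BW)
    print(f"batch {batch:>3}: {rate / H100_PEAK:.1%} of peak")
```

At batch 1 the GPU reaches well under 1% of its peak FLOP/s in this model, which is why batching is the first lever in serving.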

6 · The spec sheet to memorize

GPU	HBM	BW (TB/s)	bf16 (TFLOP/s)	ridge (F/B)
A100 80GB	80 GB	2.0	312	~156
H100 SXM	80 GB	3.35	990	~295
H200	141 GB	4.8	990	~206
B200 (≈)	192 GB	8.0	~2250	~280

FLOP/s are dense bf16 without sparsity; vendors quote 2× with sparsity — ignore that for design. Interconnect: NVLink ~900 GB/s intra-node (8 GPUs), InfiniBand ~400 Gb/s ≈ 50 GB/s inter-node. The 18× gap between intra- and inter-node bandwidth drives every parallelism-placement decision in lesson 07.
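The ridge column and the interconnect gap follow directly from the other columns, so they need not be memorized separately. A small sketch using only the figures quoted above (the `SPECS` dictionary and helper name are hypothetical):

```python
# (HBM in GB, bandwidth in TB/s, dense bf16 TFLOP/s), copied from the table above.
SPECS = {
    "A100 80GB": (80, 2.0, 312),
    "H100 SXM": (80, 3.35, 990),
    "H200": (141, 4.8, 990),
    "B200": (192, 8.0, 2250),
}


def ridge_flop_per_byte(bw_tb_s: float, tflops: float) -> float:
    """Ridge = peak FLOP/s / peak bytes/s; the 1e12 factors cancel."""
    return tflops / bw_tb_s


ridges = {name: ridge_flop_per_byte(bw, tf) for name, (_, bw, tf) in SPECS.items()}
for name, value in ridges.items():
    print(f"{name:<10} ridge ~ {value:.0f} FLOP/byte")

NVLINK_GB_S = 900            # intra-node figure from the text
INFINIBAND_GB_S = 400 / 8    # 400 Gb/s converted to GB/s
gap = NVLINK_GB_S / INFINIBAND_GB_S
print(f"intra-node vs inter-node bandwidth: {gap:.0f}x")
```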

7 · Dollars — closing the loop

Everything reduces to GPU-hours, and GPU-hours to dollars. The two master formulas:

training $ ≈ (6 · N · D) / (gpus · peak · MFU) / 3600 · gpus · $/gpu-hr
serving $/1M tok ≈ $/gpu-hr / (throughput_tok_per_s_per_gpu · 3600) · 1e6

Serving cost is dominated by achieved tokens/sec/GPU, which is set by batch size, which is capped by KV memory, which we computed in §3. The chain bytes → batch → throughput → $ is the spine of every serving design.
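A small sketch of the serving formula makes the last link of that chain concrete. The throughput values in the loop are hypothetical placeholders, not measurements; the $2/GPU-hr price is the figure used in §1, and the function name is an assumption.

```python
def serving_usd_per_million_tokens(usd_per_gpu_hour: float, tok_per_s_per_gpu: float) -> float:
    """Dollars per 1M generated tokens at a given sustained per-GPU throughput."""
    tokens_per_gpu_hour = tok_per_s_per_gpu * 3600
    return usd_per_gpu_hour / tokens_per_gpu_hour * 1e6


# Hypothetical throughputs: cost falls in direct proportion as throughput rises.
for tput in (500, 1000, 2500, 5000):
    cost = serving_usd_per_million_tokens(usd_per_gpu_hour=2.0, tok_per_s_per_gpu=tput)
    print(f"{tput:>5} tok/s/GPU: ${cost:.2f} per 1M tokens")
```

Because cost is inversely proportional to throughput, anything that raises the feasible batch size (fewer KV bytes per token, more free HBM) shows up directly in the price per token.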

Carry these seven into every later lesson

2N fwd, 6N train (FLOPs/token) · 2N weight bytes · ~16N training bytes · 2·L·H_kv·d·dtype_bytes KV bytes/token · activations ∝ b·s·L · ridge = peak/BW · $ via GPU-hours. If you can reproduce these without the page open, you can size any ML system to the right order of magnitude, which is what step 2 of the loop requires.