Architecture — the shape under every stage
A line-by-line tour of model.py. The five post-training lessons that follow change the data and the loss, never this.
What we are modelling
An autoregressive language model factors a joint distribution over tokens as a product of conditionals:
$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$$
Maximum-likelihood training minimises the sum of per-token negative log-likelihoods, which is equivalent to cross-entropy with the next token as target. That is the entire game. Every architectural decision below exists so that one forward pass can compute the conditional p(x_t | x_{<t}) for every t in parallel, and produce the gradients that lower the joint negative log-likelihood. If a design choice does not directly serve that, it has no business being in model.py.
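The link between the product of conditionals and the summed per-token loss can be checked with a few numbers. This is a minimal sketch; the probabilities in `cond_probs` are made up for illustration and do not come from any model.

```python
import math

# Hypothetical conditionals p(x_t | x_<t) that a model assigned to the
# tokens that actually occurred in a 4-token sequence.
cond_probs = [0.5, 0.25, 0.8, 0.1]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# Training loss: the sum of per-token negative log-likelihoods.
nll = sum(-math.log(p) for p in cond_probs)

# The two views agree: the summed loss is exactly -log(joint).
print(joint, nll, -math.log(joint))
```

Lowering `nll` and raising `joint` are the same objective, which is why a per-token cross-entropy is all the training loop needs.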
The remarkable thing, and the reason this lesson exists once and the next five lessons reuse it verbatim, is that the same predictor p_θ(x_t | x_{<t}) is what we need for pretraining, instruction-tuning (SFT), chain-of-thought, DPO, and RL-from-verifiable-rewards. The post-training pipeline does not add new neural machinery; it only shifts which positions in a sequence count toward the loss and where the supervisory signal comes from. Hold that thought: it is the single most freeing idea in the whole series.
- B — batch size.
- T — sequence length in tokens. Tmax — the model's hard upper bound (position-embedding table size).
- V — vocabulary size.
- d — model dimension (residual stream width).
- h — number of attention heads. k = d / h — per-head dimension.
- L — number of Transformer blocks.
One forward pass, exact shapes
Here is the entire forward pass with every intermediate tensor labelled. Read it once top-to-bottom; we will then walk each line.
| tensor | shape | computed as | note |
|---|---|---|---|
| idx | (B, T) | input | int64 token ids |
| tok_emb | (B, T, d) | Wte[idx] | Wte: (V, d) |
| pos_emb | (T, d) | Wpe[0:T] | Wpe: (T_max, d) |
| x | (B, T, d) | tok_emb + pos_emb | broadcast over B |
| x (repeated L times) | (B, T, d) | x + Attn(LN(x)) | pre-norm residual |
| x (repeated L times) | (B, T, d) | x + MLP(LN(x)) | pre-norm residual |
| x | (B, T, d) | LayerNorm(x) | final norm |
| logits | (B, T, V) | x @ Wteᵀ | weight-tied head |
| loss | scalar | CE(logits[:, :-1], idx[:, 1:]) | |
Notice three things straight away. First, the residual stream is (B, T, d) from start to finish — every block takes that shape in and returns the same shape. Second, the only places that touch the vocabulary are the embedding at the bottom and the head at the top — and those share weights. Third, the loss is built from logits at positions [0..T−2] against targets at positions [1..T−1]: we predict every token from its predecessors, all in one shot. That parallelism is what makes Transformers train faster than RNNs by orders of magnitude.
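The one-position shift between logits and targets is easy to see on a single sequence. The sketch below uses made-up token ids and plain lists in place of tensors; it only illustrates the indexing, not the loss itself.

```python
# One hypothetical sequence of T = 5 token ids.
idx = [7, 3, 9, 4, 1]

# Logits are used at positions 0..T-2, targets come from positions 1..T-1.
logit_positions = list(range(len(idx) - 1))
targets = idx[1:]

# Each training pair: (context visible at position t, token to predict).
pairs = [(idx[: t + 1], idx[t + 1]) for t in logit_positions]

for context, target in pairs:
    print(context, "->", target)
```

A sequence of length T therefore yields T − 1 supervised predictions from a single forward pass.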
Embeddings: addition, not concatenation
The first non-trivial line of the forward pass is:
x = self.tok_emb(idx) + self.pos_emb(pos) # (B, T, d)
Two embedding tables: token (V × d) and position (Tmax × d). The token vector says what this slot holds; the position vector says where in the sequence the slot is. Why add them instead of concatenating?
- Addition preserves d. Concatenation would force every downstream layer to be wider; addition lets the same d-wide attention and MLP operate on either channel of information. The model learns, per feature dimension, how much weight to put on identity vs. position.
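- Linear layers can still use either component. If x = etok + epos, any linear projection W·x equals W·etok + W·epos, so every downstream projection sees both signals. This is concatenation with the two halves of the projection tied to the same W: strictly less expressive, but in a d-wide space the model can learn to place token and position information in nearly orthogonal directions, so little is lost in practice.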
- Empirically it works. Concatenation has been tried; in pure-Transformer language models the gain does not justify the parameter bloat.
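The linearity argument above can be checked numerically. In this sketch the vectors and the matrix `W` are arbitrary illustrative values, and `matvec` is a hypothetical helper written for the example.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

e_tok = [1.0, -2.0, 0.5]    # hypothetical token embedding
e_pos = [0.25, 0.0, -1.0]   # hypothetical position embedding
W = [[2.0, 0.0, 1.0],
     [-1.0, 3.0, 0.5]]      # any downstream linear projection

# Project the sum, as the model does.
added = matvec(W, [a + b for a, b in zip(e_tok, e_pos)])

# Project each component separately, then add.
separate = [a + b for a, b in zip(matvec(W, e_tok), matvec(W, e_pos))]

# Concatenation with both halves of the projection tied to the same W.
W_tied = [row + row for row in W]
concat = matvec(W_tied, e_tok + e_pos)

print(added, separate, concat)
```

All three results coincide, which shows that addition behaves like concatenation followed by a projection whose two halves share weights.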
GPT-2 uses learned position embeddings — one vector per slot, learned by gradient descent. The trade-off:
| scheme | extrapolation | fit in-distribution | cost |
|---|---|---|---|
| learned | fails past Tmax | slightly better | Tmax·d params |
| sinusoidal | defined at any length (formula), but quality usually degrades | slightly worse | 0 params |
| RoPE | limited without rescaling tricks; popular today | strong | 0 params |
| ALiBi | works, simple bias | strong | 0 learned params (h fixed slopes) |
For this codebase the toy sequences fit comfortably under Tmax, so we accept the cap in exchange for one fewer concept. If you ever want to deploy outside the training length, replace nn.Embedding(T_max, d) with RoPE in the attention module — nothing else needs to change.
Causal self-attention, shape by shape
This is the part with the most moving pieces. Walk it with me, shape by shape.
Now the same flow in code. From CausalSelfAttention.forward:
qkv = self.qkv(x) # (B, T, 3d)
q, k, v = qkv.chunk(3, dim=-1) # each (B, T, d)
q = q.view(B, T, self.h, self.k).transpose(1, 2) # (B, h, T, k)
k = k.view(B, T, self.h, self.k).transpose(1, 2) # (B, h, T, k)
v = v.view(B, T, self.h, self.k).transpose(1, 2) # (B, h, T, k)
scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.k) # (B, h, T, T)
scores = scores.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float("-inf"))
attn = F.softmax(scores, dim=-1) # (B, h, T, T)
out = attn @ v # (B, h, T, k)
out = out.transpose(1, 2).contiguous().view(B, T, d) # (B, T, d)
return self.proj(out) # (B, T, d)
Why one fused QKV linear instead of three
The line self.qkv = nn.Linear(d, 3*d, bias=False) is not a stylistic choice — it is a hardware one. Three separate Linears would issue three GEMM kernel launches and read x from HBM three times. The fused Linear is one GEMM, one read of x, one write of (B, T, 3d). Arithmetic intensity goes up, kernel-launch overhead goes down. The math is identical because three independent linears stacked output-wise is a single Linear of triple width.
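The claim that one triple-width Linear equals three stacked Linears can be verified on a tiny example. This is a sketch with plain lists; the weight values are arbitrary, and the row-stacking mirrors how nn.Linear stores its weight as (out_features, in_features).

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Three hypothetical (d, d) projection matrices with d = 2.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[2.0, 1.0], [0.0, -1.0]]
Wv = [[0.5, 0.5], [1.0, -1.0]]
x = [3.0, -2.0]
d = len(x)

# Fused weight: rows stacked along the output dimension, shape (3d, d).
W_qkv = Wq + Wk + Wv

# One matrix-vector product, then split the result into three chunks.
qkv = matvec(W_qkv, x)
q, k, v = qkv[:d], qkv[d:2 * d], qkv[2 * d:]

print(q, k, v)
```

The chunks match what three separate projections would produce, so fusing changes the memory traffic and kernel count but not the math.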
Why split heads and transpose
After chunk we have q, k, v each of shape (B, T, d). We want to do h independent attention computations, each on dim-k queries and keys. The trick is to reshape (B, T, d) → (B, T, h, k) — same memory, just a different view — then transpose to (B, h, T, k) so that the leading dims (B, h) get batched over by the matmul, and the tail dims (T, k) become the actual matrix being multiplied. A single q @ k.transpose(-2, -1) then produces (B, h, T, T) in one shot.
The transpose is what makes that batching work. After the matmul we transpose back and call .contiguous() before .view, because PyTorch's view cannot reinterpret strides that have been permuted — you need a contiguous buffer.
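What the view-then-transpose does to the data can be mimicked with nested lists. The sketch below drops the batch dimension and uses made-up values chosen so that each entry encodes its own (position, feature) index; it is an analogue of the tensor operations, not the PyTorch code itself.

```python
T, h, k = 3, 2, 2
d = h * k

# q with shape (T, d); entry 10*t + j sits at position t, feature j.
q = [[10 * t + j for j in range(d)] for t in range(T)]

# Analogue of view(T, h, k) followed by transpose to (h, T, k):
# head i receives the contiguous feature slice [i*k, (i+1)*k) at every position.
heads = [[row[i * k:(i + 1) * k] for row in q] for i in range(h)]

for i, head in enumerate(heads):
    print("head", i, head)
```

Each head ends up with its own (T, k) matrix, which is exactly the shape the batched matmul needs in its last two dimensions.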
Why divide by √k
Suppose the entries of q and k are roughly i.i.d. with mean 0 and variance 1. Then the dot product of two k-vectors is a sum of k independent products, each with mean 0 and variance 1. The variance of the sum is k; its standard deviation is √k. Without the scaling, raw scores would have typical magnitude √k, growing with head dimension. Plug that into a softmax: as score differences get large, softmax saturates toward one-hot, and a saturated softmax has vanishing gradient. Training stalls.
Dividing by √k standardises score variance to O(1) regardless of head dimension. The softmax stays in its informative regime (gradients of order 1), and you can change h and k without re-tuning anything else. This is one of the very few places in the architecture where a derivation, not an experiment, picks the constant.
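The variance argument can be checked by simulation. This is a rough sketch under the same assumption as the derivation (independent standard-normal entries); `score_variance` is a hypothetical helper, and the estimates are noisy because they come from a finite sample.

```python
import random
import statistics

random.seed(0)

def score_variance(k, n=2000, scale=False):
    """Estimate the variance of q.k for random k-dimensional q and k."""
    scores = []
    for _ in range(n):
        q = [random.gauss(0.0, 1.0) for _ in range(k)]
        key = [random.gauss(0.0, 1.0) for _ in range(k)]
        s = sum(a * b for a, b in zip(q, key))
        scores.append(s / k ** 0.5 if scale else s)
    return statistics.pvariance(scores)

raw = score_variance(64)                 # expected to be near k = 64
scaled = score_variance(64, scale=True)  # expected to be near 1
print(raw, scaled)
```

The unscaled variance tracks the head dimension, while the scaled one stays near 1 whatever k is chosen.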
The causal mask
The causal mask is a lower-triangular matrix of 1s, broadcast as (1, 1, Tmax, Tmax) so it fits the (B, h, T, T) score tensor. Positions in the upper triangle get set to −∞ before the softmax, which makes the corresponding probabilities exactly zero. The mask is the entire reason this architecture can train B · T next-token predictions in parallel: at every position t, the model sees only positions ≤ t, so the prediction for t+1 never "cheats" by reading future tokens that the inference-time model will not have.
Without the mask, predicting x_{t+1} from a context that includes x_{t+1} itself is trivial: train loss would crash to zero on the spot, and the model would be useless at inference time, where future tokens do not exist yet. The mask is what makes "one forward pass = T supervised targets" honest.
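The effect of filling the upper triangle with −∞ can be seen in a small pure-Python version of the masked softmax. This is a sketch for one head and one sequence; `causal_softmax` is a hypothetical helper and the scores are arbitrary.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a (T, T) score matrix with future positions masked."""
    T = len(scores)
    out = []
    for t in range(T):
        # Position t may attend only to positions s <= t.
        masked = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        m = max(masked)                      # subtract the max for stability
        exps = [math.exp(v - m) for v in masked]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

scores = [[0.1, 2.0, -1.0],
          [0.3, 0.3, 5.0],
          [1.0, 0.0, -1.0]]
attn = causal_softmax(scores)
for row in attn:
    print(row)
```

The large score 5.0 at row 1, column 2 points at a future token and receives exactly zero weight, so each row is a distribution over the past and present only.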
Multi-head attention: why split d into h heads
A single attention head with width d can attend in one way at a time: a single similarity geometry, a single set of features that count as "relevant." Splitting the same d into h heads of width k = d / h gives the model h independent attention patterns, mixed back together by the output projection self.proj.
Empirically, heads specialise without being told to. The classic catalog (from interpretability work on GPT-2) includes "previous-token" heads that always copy from t − 1, "duplicate-token" heads that copy the most recent occurrence of the current token, "induction" heads that complete patterns, and various long-range positional heads. The output projection Wo is then in charge of combining these specialised channels into the next residual update.
The FLOP cost is essentially the same as a single big head, because the dominant cost is the q · k⊤ and attn · v matmuls, and those scale with the total hidden dimension. Splitting just rearranges where the bytes go; it does not add work. So you get h different attention patterns for free — a bargain so good it has no real competitor.
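The cost claim follows from a short count. The sketch below covers only the two attention matmuls for one sequence and assumes the common convention of 2 FLOPs per multiply-add; `attn_matmul_flops` is a hypothetical helper.

```python
def attn_matmul_flops(T, d, h):
    """FLOPs of q @ k^T and attn @ v for one sequence, summed over h heads."""
    k = d // h                      # per-head dimension
    qk = h * (2 * T * T * k)        # (T, k) @ (k, T) per head
    av = h * (2 * T * T * k)        # (T, T) @ (T, k) per head
    return qk + av

one_head = attn_matmul_flops(T=11, d=128, h=1)
four_heads = attn_matmul_flops(T=11, d=128, h=4)
print(one_head, four_heads)
```

Because h · k = d, the head count cancels and both configurations cost 4 · T² · d FLOPs.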
MLP: 4× expansion and GELU
After attention has moved information between positions, the MLP transforms features at each position independently:
self.fc = nn.Linear(d, 4 * d)
self.proj = nn.Linear(4 * d, d)
# forward: proj(gelu(fc(x)))
Three points worth being explicit about.
- Per-position. No mixing across the sequence — that has already happened in attention. The MLP is just a 2-layer feedforward applied identically at every t.
- 4× expansion. The historical Transformer ratio. Wide enough that the nonlinearity has room to do useful work; narrow enough that the MLP does not dominate the parameter budget completely. Per block: attention has 4d² params (QKV + proj), MLP has 8d² (two Linears of width 4d). The MLP holds about 2/3 of every block's weights, a fact people often miss.
- GELU, not ReLU. GELU is a smooth ReLU-like function: x · Φ(x). Smoothness gives non-zero gradient for small negative inputs, which empirically trains a little better than the hard ReLU bend, with negligible compute cost.
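The difference between GELU and ReLU near zero is easy to see numerically. This sketch implements the exact form x · Φ(x) with the standard-library error function; the codebase itself may use an approximation, which is not shown here.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 3.0):
    print(x, relu(x), gelu(x))
```

For small negative inputs GELU returns a small negative value instead of a hard zero, and for large positive inputs it approaches the identity, as ReLU does.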
Alternatives exist: SwiGLU (used in LLaMA) replaces GELU with a gated variant and shrinks the expansion to 8/3×, because the gate adds a third weight matrix and the smaller width keeps the parameter count roughly equal. For a teaching codebase, the plain GELU + 4× recipe is one fewer moving part and the qualitative behaviour is the same.
Pre-norm vs post-norm residuals
The Transformer block is:
x = x + self.attn(self.ln1(x))
x = x + self.mlp (self.ln2(x))
Note where the LayerNorm sits: inside the residual branch, applied to the input of each sub-layer. This is pre-norm. The original "Attention Is All You Need" paper used post-norm: x = LN(x + Attn(x)), with LN outside. Both are common in writing; only one trains nicely at depth without warmup.
The difference matters because of how the residual stream grows with depth. In pre-norm, each block adds an LN-normalised perturbation to x: the residual stream's variance grows at most linearly in L, and the final ln_f renormalises before the head. In post-norm, the LN comes after the addition, so the gradient that flows back through the residual passes through L LayerNorms in series — small Jacobian errors compound, the early-layer gradients become tiny or huge, and you need an LR warmup to keep things sane (Xiong et al., 2020, gives the explicit signal-propagation analysis).
Pre-norm is the default in nearly every modern LLM (GPT-2 was the inflection point). The cost is essentially zero (same params, same FLOPs) and it makes training far less dependent on a carefully tuned warmup, which is the kind of "free" you take.
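The structural difference between the two orderings can be sketched with a toy sub-layer. Everything here is illustrative: `sublayer` is a hypothetical stand-in for attention or the MLP, and `layer_norm` omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def sublayer(x):
    """Hypothetical stand-in for Attn or MLP."""
    return [0.5 * v for v in x]

def pre_norm_step(x):
    # LN sits inside the branch; the identity path is left untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_step(x):
    # LN sits after the addition; the identity path is renormalised too.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x_pre = x_post = [1.0, -1.0, 2.0, -2.0]
for _ in range(4):                  # four blocks, matching L = 4
    x_pre = pre_norm_step(x_pre)
    x_post = post_norm_step(x_post)

print(variance(x_pre), variance(x_post))
```

In the pre-norm chain the residual stream keeps an unnormalised identity path, so its variance grows with depth until the final norm; in the post-norm chain every block passes the whole stream through a LayerNorm.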
Weight tying: the head is the embedding
Look at the model's __init__:
self.head = nn.Linear(d, vocab, bias=False)
self.head.weight = self.tok_emb.weight # share storage
Two operations, one weight matrix. The embedding maps a token id to a d-vector by looking up a row of Wte. The head maps a d-vector to logits over the vocabulary by multiplying by Wte⊤. Sharing the storage cuts V · d parameters and empirically improves perplexity. Why?
- Same semantics. The input and output spaces are the same vocabulary — the same tokens. The "vector for token v" should be the same thing whether you are reading v in or trying to predict v out.
- Better gradient flow. Every output prediction sends gradient through the softmax to every row of the matrix, at every position, for every example, whereas an untied input embedding only updates the rows of tokens that actually appear in the batch. The matrix gets vastly more updates than under separate parameterisation.
- Regularising effect. The shared parameterisation prevents the model from learning two unrelated representations and reduces the effective number of free parameters where it matters least.
The trade-off is rigidity: if you want a different output vocabulary from your input vocabulary (some translation setups), you cannot tie. For a same-vocab LM it is essentially always a win.
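The two roles of the shared matrix can be shown with a three-token vocabulary. This sketch skips every Transformer block and uses made-up weights; `embed` and `head` are hypothetical helpers that read the same `Wte`.

```python
# One shared matrix of shape (V=3, d=2).
Wte = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0]]

def embed(token_id):
    """Embedding: look up one row of Wte."""
    return Wte[token_id]

def head(x):
    """Unembedding: multiply by Wte transposed, one logit per vocabulary row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in Wte]

logits = head(embed(2))
print(logits)
```

With no blocks in between, a token scores highest against its own row, which is the "same semantics" intuition in its simplest form.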
Init std 0.02
From _init_weights:
nn.init.normal_(m.weight, mean=0.0, std=0.02)
This is smaller than Kaiming's √(2/d) — for d = 128, Kaiming would give std ≈ 0.125, six times larger. The reason: in a deep residual stack, every block adds a perturbation whose magnitude scales with the magnitudes of its weights. Pile up L such additions and a Kaiming-scale init can blow up the residual stream's variance, especially in attention's value path. Small-std init keeps everything calm at step 0 and lets the optimizer find a good scale.
GPT-2 actually goes one step further: it rescales the residual-projection weights (self.proj in attention and MLP) by an additional 1/√(2L). The idea is to make the contribution of each layer's residual branch scale-stable with depth. We omit that refinement in this codebase to keep the init function one clean rule — it costs a little stability at very large L but is invisible at L = 4.
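The numbers in this section can be reproduced in a few lines. This is plain arithmetic for the toy configuration (d = 128, L = 4); the variable names are chosen for the example.

```python
import math

d, L = 128, 4

kaiming_std = math.sqrt(2.0 / d)              # Kaiming-style scale for width d
flat_std = 0.02                               # the flat rule used in _init_weights
gpt2_resid_std = flat_std / math.sqrt(2 * L)  # GPT-2's extra 1/sqrt(2L) on residual projections

print(kaiming_std, kaiming_std / flat_std, gpt2_resid_std)
```

The factor 2L counts the residual branches in the stack, two per block, which is why the rescaling matters more as depth grows.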
Parameter accounting
How big is this thing, and where do the parameters live? Let us count by hand, then check against the calculator below.
| component | params | note |
|---|---|---|
| token embedding Wte | V·d | shared with head |
| position embedding Wpe | Tmax·d | |
| per block: qkv Linear | 3·d·d = 3d² | fused |
| per block: attn proj | d·d = d² | |
| per block: MLP fc | d·4d + 4d = 4d² + 4d | bias on |
| per block: MLP proj | 4d·d + d = 4d² + d | bias on |
| per block: 2× LayerNorm | 4d | γ, β each (d,) |
| final LayerNorm | 2d | |
| head | 0 | tied to Wte |
Per block: 3d² + d² + 4d² + 4d² + O(d) = 12d² + O(d). Total: N ≈ V·d + Tmax·d + L · 12 d² + O(L·d).
For the SFT toy task with d = 128, h = 4, L = 4, V = 29, Tmax = 11:
- Wte: 29 · 128 = 3,712
- Wpe: 11 · 128 = 1,408
- Per block: 12 · 128² + 9 · 128 = 196,608 + 1,152 = 197,760
- Four blocks: 791,040
- Final LayerNorm: 2 · 128 = 256
- Total: 796,416 params, which is under 1M, fits in a CPU cache, and trains in minutes.
Crucially, the block parameters dominate by roughly 150×. Embeddings barely matter at this size. As you scale d, the L · 12 d² term grows quadratically while the embedding terms grow linearly — eventually almost everything is "in the blocks."
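The table above translates directly into a small counting function. This is a sketch that follows the table's assumptions (no bias on qkv and the attention projection, bias on both MLP Linears, tied head); `count_params` is a hypothetical helper, not part of model.py.

```python
def count_params(V, T_max, d, L):
    """Count parameters component by component, following the table."""
    wte = V * d                       # token embedding, shared with the head
    wpe = T_max * d                   # position embedding
    qkv = 3 * d * d                   # fused qkv Linear, no bias
    attn_proj = d * d                 # attention output projection, no bias
    fc = d * 4 * d + 4 * d            # MLP fc, with bias
    mlp_proj = 4 * d * d + d          # MLP proj, with bias
    layer_norms = 2 * (2 * d)         # two LayerNorms per block, gain and bias
    per_block = qkv + attn_proj + fc + mlp_proj + layer_norms
    final_ln = 2 * d
    total = wte + wpe + L * per_block + final_ln
    return {"wte": wte, "wpe": wpe, "per_block": per_block,
            "blocks": L * per_block, "final_ln": final_ln, "total": total}

counts = count_params(V=29, T_max=11, d=128, L=4)
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Changing d in the call shows the quadratic growth of the block term against the linear growth of the embeddings.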
Interactive: shape calculator and parameter counter
Move the sliders. Watch every intermediate tensor shape, the per-component parameter count, the rough forward-pass FLOPs, and the causal-mask heatmap update in real time. The constraint d mod h = 0 is enforced — when you change d, h is snapped to a divisor.
The design space, in one table
Every choice above had a reasonable alternative. Here is the trade-off matrix to keep next to you when you read other codebases.
| choice in model.py | alternative | why we picked this one |
|---|---|---|
| learned positions, Tmax fixed | sinusoidal · RoPE · ALiBi | simplest; toy seqs fit. Swap to RoPE for length extrapolation. |
| tied unembedding (head = Wte) | separate head matrix | fewer params, better PPL, same-vocab in/out. |
| pre-norm | post-norm | trainable without warmup, stable at depth. |
| MLP 4× expansion, GELU | SwiGLU 8/3× · ReLU · GeGLU | simplest recipe, qualitatively identical at this scale. |
| bias=False on qkv & proj | bias=True | biases barely help in attention; saves a few params. |
| init std 0.02 flat | 0.02 + 1/√(2L) on residual projs | at L=4 the refinement is invisible; we drop it for clarity. |
| fused qkv Linear (d→3d) | 3 separate Linears | one GEMM, one HBM read of x. |
| F.softmax + masked_fill | FlashAttention / SDPA | readable; production stacks use the fused op for memory. |
Where this lesson sits in the pipeline
Take a step back. The next five lessons — pretrain, SFT, CoT, DPO, RLVR — all run this same architecture. They differ only in two ways:
- The data. Raw corpus → (prompt, response) pairs → (prompt, reasoning+answer) → (prompt, y_w, y_l) preference pairs → (prompt, rolled-out completion, verifier score).
- The positions that count in the loss. All positions → response positions only → response positions only → preference-pair logratios → reward-weighted log-probs of response tokens with a KL anchor.
The architecture is fixed. The post-training story is, line for line, about what you put into idx and which entries of logits you turn into a loss. Holding that separation in your head is what makes the whole pipeline tellable as a linear story.
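The idea that stages differ in which positions count can be sketched as a mask over per-token losses. The loss values and masks below are made up, and `masked_mean_nll` is a hypothetical helper; the preference and RL stages use different objectives that this sketch does not cover.

```python
def masked_mean_nll(token_nll, loss_mask):
    """Average the per-token losses over the positions the mask keeps."""
    kept = [n for n, m in zip(token_nll, loss_mask) if m]
    return sum(kept) / len(kept)

# Hypothetical per-token negative log-likelihoods for a 5-token sequence:
# two prompt tokens followed by three response tokens.
token_nll = [2.0, 1.0, 0.5, 0.25, 0.25]

pretrain_mask = [1, 1, 1, 1, 1]   # every position counts
sft_mask = [0, 0, 1, 1, 1]        # response positions only

print(masked_mean_nll(token_nll, pretrain_mask))
print(masked_mean_nll(token_nll, sft_mask))
```

The forward pass that produced `token_nll` is identical in both cases; only the mask changes.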
In the next lesson, we will see this architecture trained on raw text with the dense cross-entropy loss — and watch the per-character probability distribution sharpen as compression discovers grammar, then arithmetic, then structure. That is pretraining.