Architecture — the shape under every stage

A line-by-line tour of model.py. The five post-training lessons that follow change the data and the loss, never this.

What we are modelling

An autoregressive language model factors a joint distribution over tokens as a product of conditionals:

p(x₁, x₂, …, x_T) = ∏_{t = 1..T} p(x_t | x_<t)

Maximum-likelihood training minimises the sum of per-token negative log-likelihoods, which is the same as cross-entropy with the next token as target. That is the entire game. Every architectural decision below exists so that one forward pass can compute the conditional p(x_t | x_<t) for every t in parallel, and produce the gradients that raise the joint log-likelihood (equivalently, lower that loss). If a design choice does not directly serve that, it has no business being in model.py.
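As a small numeric check of that equivalence, the sketch below takes a made-up list of conditional probabilities (the values are illustrative, not produced by any model) and confirms that the negative log of the joint equals the sum of per-token negative log-likelihoods.

```python
import math

# Hypothetical conditionals p(x_t | x_<t) for a 4-token sequence.
cond_probs = [0.5, 0.25, 0.8, 0.1]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# Per-token negative log-likelihoods: the per-position cross-entropy
# when the target is the token that actually came next.
per_token_nll = [-math.log(p) for p in cond_probs]
total_nll = sum(per_token_nll)

print(f"joint probability    = {joint:.4f}")
print(f"-log(joint)          = {-math.log(joint):.4f}")
print(f"sum of per-token NLL = {total_nll:.4f}")
```

The last two printed numbers agree, which is why minimising the summed per-token loss is the same thing as maximising the likelihood of the whole sequence.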

The remarkable thing — the reason this lesson exists once and the next five lessons reuse it verbatim — is that the same predictor p_θ(x_t | x_<t) is what we need for pretraining, instruction-tuning (SFT), chain-of-thought, DPO, and RL-from-verifiable-rewards. The post-training pipeline does not add new neural machinery; it only shifts which positions in a sequence count toward the loss and where the supervisory signal comes from. Hold that thought — it is the single most freeing idea in the whole series.

Symbols, kept compact

B — batch size.
T — sequence length in tokens. T_max — the model's hard upper bound (position-embedding table size).
V — vocabulary size.
d — model dimension (residual stream width).
h — number of attention heads. k = d / h — per-head dimension.
L — number of Transformer blocks.

One forward pass, exact shapes

Here is the entire forward pass with every intermediate tensor labelled. Read it once top-to-bottom; we will then walk each line.

idx         (B, T)              int64 token ids
tok_emb     (B, T, d)           Wte[idx]                Wte: (V, d)
pos_emb        (T, d)           Wpe[0:T]                Wpe: (T_max, d)
x           (B, T, d)           tok_emb + pos_emb       broadcast over B
--- repeat L times -----------------------------------------------------
x           x + Attn(LN(x))                             pre-norm residual
x           x + MLP (LN(x))                             pre-norm residual
-----------------------------------------------------------------------
x           LayerNorm(x)        (B, T, d)               final norm
logits      x @ Wteᵀ            (B, T, V)               weight-tied head
loss        CE(logits[:, :-1], idx[:, 1:])               scalar

Notice three things straight away. First, the residual stream is (B, T, d) from start to finish: every block takes that shape in and returns the same shape. Second, the only places that touch the vocabulary are the embedding at the bottom and the head at the top, and those share weights. Third, the loss is built from logits at positions [0..T−2] against targets at positions [1..T−1]: we predict every token from its predecessors, all in one shot. That parallelism over positions is the main reason Transformers train so much faster than RNNs, which have to step through a sequence one token at a time.
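The shift between logits and targets is easy to get off by one, so here is a minimal plain-Python sketch of the pairing for a single sequence. The helper name `next_token_pairs` and the token ids are hypothetical, chosen only for illustration.

```python
def next_token_pairs(idx):
    """Return (context, target) pairs for one sequence of token ids."""
    inputs = idx[:-1]    # positions 0..T-2 supply the predictions
    targets = idx[1:]    # positions 1..T-1 are what they must predict
    return [(idx[: t + 1], targets[t]) for t in range(len(inputs))]

tokens = [7, 3, 9, 4]    # hypothetical token ids, T = 4
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(f"context {context} -> target {target}")
```

A sequence of length T yields T − 1 supervised targets, because the last position has no successor to predict.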

Embeddings: addition, not concatenation

The first non-trivial line of the forward pass is:

x = self.tok_emb(idx) + self.pos_emb(pos)              # (B, T, d)

Two embedding tables: token (V × d) and position (T_max × d). The token vector says what this slot holds; the position vector says where in the sequence the slot is. Why add them instead of concatenating?

Addition preserves d. Concatenation would force every downstream layer to be wider; addition lets the same d-wide attention and MLP operate on either channel of information. The model learns, per feature dimension, how much weight to put on identity vs. position.
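Linear layers can still treat the two components separately. If x = e_tok + e_pos, any linear projection W x equals W e_tok + W e_pos. Concatenation followed by a linear layer computes W₁ e_tok + W₂ e_pos, so addition is the special case W₁ = W₂. The sum itself is not invertible in general, but with d much larger than either signal needs, the model can learn to keep token and position information in nearly separate subspaces, and little is lost in practice.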
Empirically it works. Concatenation has been tried; in pure-Transformer language models the gain does not justify the parameter bloat.
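The relationship between the two options can be checked numerically. The following is a toy sketch with hand-picked 2-dimensional vectors and matrices (all values are assumptions for illustration): a linear layer on the concatenation splits into one matrix per component, and addition corresponds to using the same matrix for both.

```python
def matvec(W, x):
    """Plain matrix-vector product, W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

e_tok = [1.0, 2.0]
e_pos = [0.5, -1.0]
W_tok = [[1.0, 0.0], [2.0, 1.0]]   # acts on the token half of the concat
W_pos = [[0.0, 3.0], [1.0, 1.0]]   # acts on the position half

x_cat = e_tok + e_pos                              # list concatenation, length 2d
x_add = [a + b for a, b in zip(e_tok, e_pos)]      # elementwise sum, length d

# Linear layer on the concatenation: [W_tok | W_pos] @ [e_tok; e_pos]
W_cat = [rt + rp for rt, rp in zip(W_tok, W_pos)]
concat_out = matvec(W_cat, x_cat)
split_out = [a + b for a, b in zip(matvec(W_tok, e_tok), matvec(W_pos, e_pos))]

# Addition is the special case where both halves share one matrix.
add_out = matvec(W_tok, x_add)
tied_cat_out = matvec([r + r for r in W_tok], x_cat)

print(concat_out, split_out)
print(add_out, tied_cat_out)
```

Each printed pair matches, so concatenation buys extra freedom (two matrices instead of one) at the price of a wider input.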

GPT-2 uses learned position embeddings — one vector per slot, learned by gradient descent. The trade-off:

scheme	extrapolation	fit in-distribution	extra params
learned	fails past T_max	slightly better	T_max·d
sinusoidal	works (formula)	slightly worse	0
RoPE	works, popular today	strong	0
ALiBi	works, simple bias	strong	0 (h fixed slopes)

For this codebase the toy sequences fit comfortably under T_max, so we accept the cap in exchange for one fewer concept. If you ever want to deploy outside the training length, drop the nn.Embedding(T_max, d) position table and apply RoPE to q and k inside the attention module. The rest of the block stays as it is, although the causal-mask buffer still has to cover the longer length.

Causal self-attention, shape by shape

This is the part with the most moving pieces. Walk it with me, shape by shape, in code. From CausalSelfAttention.forward:

B, T, d = x.shape                                      # batch, sequence length, model width
qkv = self.qkv(x)                                      # (B, T, 3d)
q, k, v = qkv.chunk(3, dim=-1)                         # each (B, T, d)

q = q.view(B, T, self.h, self.k).transpose(1, 2)       # (B, h, T, k)
k = k.view(B, T, self.h, self.k).transpose(1, 2)       # (B, h, T, k)
v = v.view(B, T, self.h, self.k).transpose(1, 2)       # (B, h, T, k)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.k) # (B, h, T, T)
scores = scores.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float("-inf"))

attn = F.softmax(scores, dim=-1)                       # (B, h, T, T)
out  = attn @ v                                        # (B, h, T, k)
out  = out.transpose(1, 2).contiguous().view(B, T, d)  # (B, T, d)
return self.proj(out)                                  # (B, T, d)

Why one fused QKV linear instead of three

The line self.qkv = nn.Linear(d, 3*d, bias=False) is not a stylistic choice — it is a hardware one. Three separate Linears would issue three GEMM kernel launches and read x from HBM three times. The fused Linear is one GEMM, one read of x, one write of (B, T, 3d). Arithmetic intensity goes up, kernel-launch overhead goes down. The math is identical because three independent linears stacked output-wise is a single Linear of triple width.
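The "identical math" claim can be verified on a toy example. This sketch uses plain Python lists in place of tensors and made-up 2×2 weights; `linear` is a hypothetical helper that mimics a bias-free Linear whose weight is stored as (out_features, in_features).

```python
def linear(W, x):
    """y = W @ x for a weight stored as (out_features, in_features), no bias."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

d = 2
x = [1.0, -2.0]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[2.0, 1.0], [1.0, 0.0]]
W_v = [[0.5, 0.5], [-1.0, 3.0]]

# Three separate projections: three passes over x.
separate = (linear(W_q, x), linear(W_k, x), linear(W_v, x))

# One fused projection: stack the three weights along the output axis,
# giving shape (3d, d), then split the 3d outputs into thirds.
W_qkv = W_q + W_k + W_v
fused = linear(W_qkv, x)
q, k, v = fused[:d], fused[d:2 * d], fused[2 * d:]

print(separate)
print((q, k, v))
```

Both lines print the same three vectors; the fused version simply reads x once.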

Why split heads and transpose

After chunk we have q, k, v each of shape (B, T, d). We want to do h independent attention computations, each on dim-k queries and keys. The trick is to reshape (B, T, d) → (B, T, h, k) — same memory, just a different view — then transpose to (B, h, T, k) so that the leading dims (B, h) get batched over by the matmul, and the tail dims (T, k) become the actual matrix being multiplied. A single q @ k.transpose(-2, -1) then produces (B, h, T, T) in one shot.

The transpose is what makes that batching work. After the matmul we transpose back and call .contiguous() before .view, because PyTorch's view cannot reinterpret strides that have been permuted — you need a contiguous buffer.
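To see which feature ends up in which head, here is a plain-Python sketch of the same reshape and transpose for a single batch element, using nested lists instead of tensors. The sizes and the encoding of each entry as 10·t + j are assumptions made so the result is easy to read.

```python
T, h, k = 3, 2, 2
d = h * k

# One batch element of shape (T, d); entry 10*t + j marks position t, feature j.
x = [[10 * t + j for j in range(d)] for t in range(T)]

# view(T, h, k): feature j lands in head j // k at within-head index j % k.
x_thk = [[row[head * k:(head + 1) * k] for head in range(h)] for row in x]

# transpose: (T, h, k) -> (h, T, k), so each head owns a (T, k) matrix.
x_htk = [[x_thk[t][head] for t in range(T)] for head in range(h)]

# Inverse: transpose back and flatten the (h, k) axes into d again.
merged = [[v for head in range(h) for v in x_htk[head][t]] for t in range(T)]

for head in range(h):
    print(f"head {head}: {x_htk[head]}")
```

Head 0 receives features 0..k−1 at every position and head 1 receives features k..2k−1, and merging the heads restores the original layout exactly.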

Why divide by √k

Suppose the entries of q and k are roughly i.i.d. with mean 0 and variance 1. Then the dot product of two k-vectors is a sum of k independent products, each with mean 0 and variance 1. The variance of the sum is k; its standard deviation is √k. Without the scaling, the typical magnitude of a raw score would grow as √k. Plug that into a softmax: as the gaps between scores get large, the softmax saturates toward one-hot, and a saturated softmax has almost zero gradient with respect to its inputs. Training stalls.

Dividing by √k standardises score variance to O(1) regardless of head dimension. The softmax stays in its informative regime (gradients of order 1), and you can change h and k without re-tuning anything else. This is one of the very few places in the architecture where a derivation, not an experiment, picks the constant.
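The variance argument can be checked with a short Monte Carlo simulation. This sketch assumes exactly the idealised setting of the derivation (independent standard-normal entries); the function name `score_std`, the sample count, and the seed are arbitrary choices.

```python
import math
import random
import statistics

def score_std(k, n_samples=5000, seed=0):
    """Empirical std of q . key for random k-dimensional unit-variance vectors."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(k)]
        key = [rng.gauss(0.0, 1.0) for _ in range(k)]
        scores.append(sum(a * b for a, b in zip(q, key)))
    return statistics.pstdev(scores)

k = 32
raw = score_std(k)
scaled = raw / math.sqrt(k)   # dividing scores by sqrt(k) divides their std by sqrt(k)

print(f"sqrt(k)            = {math.sqrt(k):.3f}")
print(f"raw score std      = {raw:.3f}")
print(f"std after / sqrt(k) = {scaled:.3f}")
```

The raw standard deviation comes out close to √k and the scaled one close to 1, up to sampling noise.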

Quick check

Try removing the √k divisor in your head. With d = 128 and h = 4 we have k = 32. Score std without scaling is √32 ≈ 5.7. After softmax, a single dominant logit of 5.7 above the rest gives e^5.7 ≈ 300× the mass — already nearly one-hot. With /√k applied, the std is 1, and softmax spreads weight broadly enough for the gradient signal to pass.
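The same check in numbers: a minimal sketch that puts one key a typical score gap ahead of the others and measures how much attention weight it takes. It assumes T = 11 keys with all other scores equal to zero, which is a simplification of a real score row.

```python
import math

def softmax(scores):
    peak = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

T = 11
gap_unscaled = math.sqrt(32)   # about 5.7: one raw-score std for k = 32
gap_scaled = 1.0               # the same gap after dividing by sqrt(k)

peak_unscaled = softmax([gap_unscaled] + [0.0] * (T - 1))[0]
peak_scaled = softmax([gap_scaled] + [0.0] * (T - 1))[0]

print(f"weight on the leading key, unscaled: {peak_unscaled:.3f}")
print(f"weight on the leading key, scaled:   {peak_scaled:.3f}")
```

Without scaling the leading key absorbs well over 90% of the weight; with scaling it gets a bit over 20% and the rest stays spread across the other keys.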

The causal mask

The causal mask is a lower-triangular matrix of 1s, stored with shape (1, 1, T_max, T_max) and sliced to [:T, :T] so that it broadcasts against the (B, h, T, T) score tensor. Positions in the upper triangle get set to −∞ before the softmax, which makes the corresponding probabilities exactly zero. The mask is the entire reason this architecture can train B · T next-token predictions in parallel: at every position t, the model sees only positions ≤ t, so the prediction for t+1 never "cheats" by reading future tokens that the inference-time model will not have.

Without the mask, predicting x_{t+1} from a context that includes x_{t+1} itself is trivial: the model could simply learn to copy the answer, train loss would collapse toward zero, and the model would be useless at inference time, where future tokens do not exist yet. The mask is what makes "one forward pass = T supervised targets" honest.
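Here is a minimal plain-Python sketch of the masking step on a single T × T score matrix, standing in for the masked_fill and softmax calls. The helper names and the score values are assumptions for illustration.

```python
import math

def causal_mask(T):
    """Lower-triangular matrix of 1s: row i may attend to columns j <= i."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    out = []
    for row, mrow in zip(scores, mask):
        masked = [s if m else float("-inf") for s, m in zip(row, mrow)]
        peak = max(masked)                       # finite: the diagonal is never masked
        exps = [math.exp(s - peak) for s in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

T = 4
scores = [[0.1 * (i + j) for j in range(T)] for i in range(T)]
attn = masked_softmax(scores, causal_mask(T))

for row in attn:
    print(" ".join(f"{p:.3f}" for p in row))
```

Every entry above the diagonal prints as 0.000 and each row still sums to 1, so position t distributes all of its attention over positions ≤ t.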

Multi-head attention: why split d into h heads

A single attention head with width d can attend in one way at a time: a single similarity geometry, a single set of features that count as "relevant." Splitting the same d into h heads of width k = d / h gives the model h independent attention patterns, mixed back together by the output projection self.proj.

Empirically, heads specialise without being told to. The classic catalog (from interpretability work on GPT-2) includes "previous-token" heads that attend almost entirely to position t − 1, "duplicate-token" heads that attend to earlier occurrences of the current token, "induction" heads that look up what followed an earlier occurrence of the current token and so complete repeated patterns, and various long-range positional heads. The output projection W_o (self.proj in the code) is then in charge of combining these specialised channels into the next residual update.

The FLOP cost is essentially the same as a single big head. The QKV and output projections are untouched by the split, and the q · k^⊤ and attn · v matmuls each cost h · T² · k = T² · d multiply-adds, which depends only on the total width d. Splitting rearranges how that work is laid out; it does not add matmul work. The only things that grow with h are the (B, h, T, T) score tensor and the softmax over it. So you get h different attention patterns almost for free.
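A quick count makes the cost argument concrete. This sketch tallies multiply-adds for the two attention matmuls of one sequence and the number of score entries; the function names are hypothetical, and projections and softmax are deliberately left out.

```python
def attention_matmul_macs(T, d, h):
    """Multiply-adds for q @ k^T plus attn @ v, one sequence, all heads."""
    assert d % h == 0, "d must be divisible by h"
    k = d // h
    qk = h * T * T * k   # per head: (T, k) @ (k, T)
    av = h * T * T * k   # per head: (T, T) @ (T, k)
    return qk + av

def score_entries(T, h):
    """Size of the attention-score tensor for one sequence: (h, T, T)."""
    return h * T * T

T, d = 11, 128
for h in (1, 4, 8):
    print(h, attention_matmul_macs(T, d, h), score_entries(T, h))
```

The multiply-add count is 2·T²·d for every h, while the score tensor grows linearly with h. That memory growth is the part of the split that is not free.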

MLP: 4× expansion and GELU

After attention has moved information between positions, the MLP transforms features at each position independently:

self.fc   = nn.Linear(d, 4 * d)
self.proj = nn.Linear(4 * d, d)
# forward: proj(gelu(fc(x)))

Three points worth being explicit about.

Per-position. No mixing across the sequence — that has already happened in attention. The MLP is just a 2-layer feedforward applied identically at every t.
4× expansion. The historical Transformer ratio. Wide enough that the nonlinearity has room to do useful work; narrow enough that the MLP does not dominate the parameter budget completely. Per block: attention has 4 d² params (QKV + proj), MLP has 8 d² (two Linears of width 4d). The MLP holds about 2/3 of every block's weights — a fact people often miss.
GELU, not ReLU. GELU is a smooth ReLU-like function: x · Φ(x). Smoothness gives non-zero gradient for small negative inputs, which empirically trains a little better than the hard ReLU bend, with negligible compute cost.
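The difference between the two activations near zero is easy to see numerically. This sketch implements the exact form x · Φ(x) with the standard library's erf and compares finite-difference slopes; `slope` is a hypothetical helper, and whether model.py uses the exact or the tanh-approximate GELU is not something this example assumes.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

def slope(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in (-2.0, -0.5, 0.5, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}  "
          f"gelu'={slope(gelu, x):+.4f}  relu'={slope(relu, x):+.4f}")
```

For small negative inputs ReLU's slope is exactly zero while GELU's is not, and for large positive inputs the two functions nearly coincide.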

Alternatives exist: SwiGLU (used in LLaMA) replaces GELU with a gated variant and tweaks the expansion to 8/3 to preserve param count. For a teaching codebase, the plain GELU + 4× recipe is one fewer moving part and the qualitative behaviour is the same.

Pre-norm vs post-norm residuals

The Transformer block is:

x = x + self.attn(self.ln1(x))
x = x + self.mlp (self.ln2(x))

Note where the LayerNorm sits: inside the residual branch, applied to the input of each sub-layer. This is pre-norm. The original "Attention Is All You Need" paper used post-norm: x = LN(x + Attn(x)), with LN outside. Both are common in writing; only one trains nicely at depth without warmup.

The difference matters because of how the residual stream grows with depth. In pre-norm, each block adds an LN-normalised perturbation to x: the residual stream's variance grows at most linearly in L, and the final ln_f renormalises before the head. In post-norm, the LN comes after the addition, so the gradient that flows back through the residual passes through L LayerNorms in series — small Jacobian errors compound, the early-layer gradients become tiny or huge, and you need an LR warmup to keep things sane (Xiong et al., 2020, gives the explicit signal-propagation analysis).
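The two orderings differ only in where the normalisation is applied. The sketch below writes both as plain-Python functions on a single feature vector, assuming a LayerNorm without learned γ and β and a hypothetical stand-in sublayer (`toy_sublayer` is not from model.py). It is a structural illustration, not a stability experiment.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm over one feature vector, without learned gamma/beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def pre_norm_step(x, sublayer):
    return add(x, sublayer(layer_norm(x)))    # x + f(LN(x))

def post_norm_step(x, sublayer):
    return layer_norm(add(x, sublayer(x)))    # LN(x + f(x))

def toy_sublayer(x):
    """Hypothetical stand-in for Attn or MLP: a fixed linear map."""
    return [0.5 * v for v in reversed(x)]

x0 = [4.0, -2.0, 1.0, 3.0]
pre = pre_norm_step(x0, toy_sublayer)
post = post_norm_step(x0, toy_sublayer)

# Size of the update the pre-norm branch added on top of the untouched x0.
update_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(pre, x0)))
post_mean = sum(post) / len(post)
post_var = sum((v - post_mean) ** 2 for v in post) / len(post)

print(f"pre-norm update size: {update_norm:.4f}")
print(f"post-norm output mean/var: {post_mean:.4f} / {post_var:.4f}")
```

In the pre-norm step the input passes through on an identity path and the branch sees a normalised input, so the size of the update does not depend on how large x already is. In the post-norm step the whole stream, identity path included, is renormalised after every addition.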

Pre-norm is the default in nearly every modern LLM (GPT-2 was the inflection point). The cost is essentially zero (same params, same flops), and it makes training far less dependent on a carefully tuned warmup schedule, which is the kind of "free" you take.

Weight tying: the head is the embedding

Look at the model's __init__:

self.head = nn.Linear(d, vocab, bias=False)
self.head.weight = self.tok_emb.weight                 # share storage

Two operations, one weight matrix. The embedding maps a token id to a d-vector by looking up a row of W_te. The head maps a d-vector to logits over the vocabulary by multiplying by W_te^⊤. Sharing the storage cuts V · d parameters and empirically improves perplexity. Why?

Same semantics. The input and output spaces are the same vocabulary — the same tokens. The "vector for token v" should be the same thing whether you are reading v in or trying to predict v out.
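Better gradient flow. The softmax gradient reaches every row of the head matrix at every position of every example, whereas an untied input-embedding row is updated only when its token actually appears in the input. With tying, rare tokens share in that dense output-side signal, so the matrix gets vastly more updates than under separate parameterisation.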
Regularising effect. The shared parameterisation prevents the model from learning two unrelated representations and reduces the effective number of free parameters where it matters least.

The trade-off is rigidity: if you want a different output vocabulary from your input vocabulary (some translation setups), you cannot tie. For a same-vocab LM it is essentially always a win.
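The "two operations, one matrix" idea can be sketched without any framework. In this toy example (sizes and weight values are arbitrary assumptions) the embedding is a row lookup and the head is a product with the transpose of the same matrix.

```python
V, d = 5, 3   # toy vocabulary size and model width

# One shared matrix W_te of shape (V, d); the values are arbitrary.
W_te = [[0.1 * (v + 1) * (j + 1) for j in range(d)] for v in range(V)]

def embed(token_id):
    """Embedding: look up row token_id of W_te, giving a d-vector."""
    return W_te[token_id]

def head(x):
    """Unembedding: x @ W_te^T, one logit per vocabulary entry."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_te]

logits = head(embed(2))
untied_params = 2 * V * d   # separate embedding and head matrices
tied_params = V * d         # one matrix serving both roles

print(logits)
print(f"params saved by tying: {untied_params - tied_params}")
```

The logit for token v is the dot product between the hidden vector and that token's embedding row, so a single update to a row changes both how the token is read in and how it is scored on the way out.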

Init std 0.02

From _init_weights:

nn.init.normal_(m.weight, mean=0.0, std=0.02)

This is smaller than Kaiming's √(2/d) — for d = 128, Kaiming would give std ≈ 0.125, six times larger. The reason: in a deep residual stack, every block adds a perturbation whose magnitude scales with the magnitudes of its weights. Pile up L such additions and a Kaiming-scale init can blow up the residual stream's variance, especially in attention's value path. Small-std init keeps everything calm at step 0 and lets the optimizer find a good scale.

GPT-2 actually goes one step further: it rescales the residual-projection weights (self.proj in attention and MLP) by an additional 1/√(2L). The idea is to make the contribution of each layer's residual branch scale-stable with depth. We omit that refinement in this codebase to keep the init function one clean rule — it costs a little stability at very large L but is invisible at L = 4.
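The scales discussed here are quick to compute. A small sketch, assuming the d = 128, L = 4 configuration used elsewhere in this lesson; the variable names are illustrative.

```python
import math

d, L = 128, 4

flat_std = 0.02                             # the single rule used in _init_weights
kaiming_std = math.sqrt(2.0 / d)            # Kaiming-style sqrt(2 / fan_in)
ratio = kaiming_std / flat_std

# The GPT-2 refinement for residual projections, which this codebase omits:
resid_std = flat_std / math.sqrt(2 * L)

print(f"Kaiming std for d={d}: {kaiming_std:.4f} ({ratio:.2f}x the flat 0.02)")
print(f"0.02 / sqrt(2L) for L={L}: {resid_std:.5f}")
```

At L = 4 the refinement would shrink the residual-projection std by a factor of √8 ≈ 2.8; the factor grows slowly with depth, which is why it matters more for very deep stacks.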

Parameter accounting

How big is this thing, and where do the parameters live? Let us count by hand, then check against the calculator below.

component	params	note
token embedding W_te	V·d	shared with head
position embedding W_pe	T_max·d
per block: qkv Linear	3·d·d = 3d²	fused
per block: attn proj	d·d = d²
per block: MLP fc	d·4d + 4d = 4d² + 4d	bias on
per block: MLP proj	4d·d + d = 4d² + d	bias on
per block: 2× LayerNorm	4d	γ, β each (d,)
final LayerNorm	2d
head	0	tied to W_te

Per block: 3d² + d² + (4d² + 4d) + (4d² + d) + 4d = 12d² + 9d. Total: N = V·d + T_max·d + L · (12d² + 9d) + 2d.

For the SFT toy task with d = 128, h = 4, L = 4, V = 29, T_max = 11:

W_te: 29 · 128 = 3,712
W_pe: 11 · 128 = 1,408
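Per block: 12 · 128² + 9 · 128 = 196,608 + 1,152 = 197,760
Four blocks: 4 · 197,760 = 791,040
Final LayerNorm: 2 · 128 = 256
Total: 796,416 ≈ 796,000 params. That is under 1M, about 3 MB in fp32, small enough for the last-level cache of many CPUs, and it trains in minutes.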

Crucially, the block parameters dominate by roughly 150×. Embeddings barely matter at this size. As you scale d, the L · 12 d² term grows quadratically while the embedding terms grow linearly — eventually almost everything is "in the blocks."
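The hand count can be reproduced with a few lines of Python. This is a sketch that follows the table row by row (bias off for the attention Linears, bias on for the MLP, head tied); `count_params` is a hypothetical helper, not a function from model.py.

```python
def count_params(d, L, V, T_max):
    """Parameter count per component, following the accounting table."""
    per_block = {
        "qkv": 3 * d * d,               # fused Linear, bias=False
        "attn_proj": d * d,             # bias=False
        "mlp_fc": d * 4 * d + 4 * d,    # weight + bias
        "mlp_proj": 4 * d * d + d,      # weight + bias
        "layernorms": 2 * 2 * d,        # two LayerNorms, gamma and beta each
    }
    block = sum(per_block.values())
    counts = {
        "tok_emb": V * d,
        "pos_emb": T_max * d,
        "blocks": L * block,
        "ln_f": 2 * d,
        "head": 0,                      # tied to tok_emb
    }
    counts["total"] = sum(counts.values())
    return counts

counts = count_params(d=128, L=4, V=29, T_max=11)
for name, n in counts.items():
    print(f"{name:8s} {n:>9,}")
print(f"blocks / embeddings = {counts['blocks'] / (counts['tok_emb'] + counts['pos_emb']):.1f}")
```

Changing d or L in the call shows the quadratic growth of the block term directly.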

Interactive: shape calculator and parameter counter

Move the sliders. Watch every intermediate tensor shape, the per-component parameter count, the rough forward-pass FLOPs, and the causal-mask heatmap update in real time. The constraint d mod h = 0 is enforced — when you change d, h is snapped to a divisor.

Forward-pass shapes and parameter counts

Embedding shapes are blue, attention shapes are orange, MLP shapes are green, head shapes are purple. The FLOPs estimate uses the "2N per token" rule for inference and "6N per token" for training, the approximation used in the Kaplan et al. and Chinchilla scaling-law analyses. It ignores the attention term that grows with T, which is negligible when T is much smaller than d and rivals the dense term only once T reaches several multiples of d.

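[Interactive widget, not reproduced here. Sliders: B, T, d, h, L, V (defaults B = 8, T = 11, d = 128, h = 4, L = 4, V = 29). Readouts: k = d/h, total params, forward FLOPs per token, train FLOPs per token, forward-pass shapes, per-component parameters, and the T × T causal mask (green = attended, grey = masked).]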

When the "2N per token" rule breaks

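The "2N per token" rule counts dense-linear FLOPs and ignores attention's O(T²) term. Per token and per layer, the dense matmuls cost about 24d² FLOPs while the q · k^⊤ and attn · v matmuls cost about 4·T·d, so attention adds a fraction of roughly T / (6d). For our toy T = 11, d = 128 that is about 1.4%, a rounding error. For T = 4096, d = 4096 it is already about 17%, and once T exceeds roughly 6d attention dominates. For production reasoning models with long contexts this matters, and a full FLOP count keeps both terms.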
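To get a feel for when the attention term matters, here is a rough sketch that compares the two contributions. The counting conventions are assumptions: one multiply-add is 2 FLOPs, only the 12d² leading term of the block parameters is kept, the full T × T score matrix is computed with no savings from causality, and softmax, norms and embeddings are ignored. The function name is hypothetical.

```python
def forward_flops_per_token(d, L, T):
    """Return (dense, attention) forward FLOPs per token under the stated assumptions."""
    dense = 2 * (L * 12 * d * d)   # "2N" with N taken as the block matmul weights
    attn = L * 4 * T * d           # q.k^T and attn.v: 2*T*d FLOPs each, per layer
    return dense, attn

for T, d in [(11, 128), (4096, 4096), (32768, 4096)]:
    dense, attn = forward_flops_per_token(d=d, L=4, T=T)
    print(f"T={T:>6} d={d:>5}  attention / dense = {attn / dense:.3f}")
```

Under these conventions the ratio is T / (6d) regardless of L, so it is the context length relative to the model width that decides whether the "2N" shortcut is safe.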

The design space, in one table

Every choice above had a reasonable alternative. Here is the trade-off matrix to keep next to you when you read other codebases.

choice in `model.py`	alternative	why we picked this one
learned positions, T_max fixed	sinusoidal · RoPE · ALiBi	simplest; toy seqs fit. Swap to RoPE for length extrapolation.
tied unembedding (head = W_te)	separate head matrix	fewer params, better PPL, same-vocab in/out.
pre-norm	post-norm	trainable without warmup, stable at depth.
MLP 4× expansion, GELU	SwiGLU 8/3× · ReLU · GeGLU	simplest recipe, qualitatively identical at this scale.
bias=False on attention qkv & proj	bias=True	biases barely help in attention; saves a few params.
init std 0.02 flat	0.02 + 1/√(2L) on residual projs	at L=4 the refinement is invisible; we drop it for clarity.
fused qkv Linear (d→3d)	3 separate Linears	one GEMM, one HBM read of x.
F.softmax + masked_fill	FlashAttention / SDPA	readable; production stacks use the fused op for memory.

Takeaway

Every shape in this architecture serves the same goal: compute p(x_t | x_<t) for every t in parallel, cheaply, and stably enough that gradient descent can drive the negative log-likelihood down. Pre-norm + weight tying + √k scaling + the causal mask are the four design choices that make that goal reachable at depth.

Where this lesson sits in the pipeline

Take a step back. The next five lessons — pretrain, SFT, CoT, DPO, RLVR — all run this same architecture. They differ only in two ways:

The data. Raw corpus → (prompt, response) pairs → (prompt, reasoning+answer) → (prompt, y_w, y_l) preference pairs → (prompt, rolled-out completion, verifier score).
The positions that count in the loss. All positions → response positions only → response positions only → preference-pair logratios → reward-weighted log-probs of response tokens with a KL anchor.

The architecture is fixed. The post-training story is, line for line, about what you put into idx and which entries of logits you turn into a loss. Holding that separation in your head is what makes the whole pipeline tellable as a linear story.
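The "which positions count" idea reduces to a mask over per-token losses. The sketch below covers only the first two cases, pretraining and SFT; the preference and RL losses have different functional forms that are not shown here. The helper name, the loss values and the masks are all hypothetical.

```python
def masked_mean_nll(per_token_nll, loss_mask):
    """Average the per-token NLL over positions whose mask entry is 1."""
    kept = [nll for nll, m in zip(per_token_nll, loss_mask) if m]
    return sum(kept) / len(kept)

# Hypothetical per-position losses from one forward pass over 4 targets.
nll = [2.0, 1.5, 0.5, 0.25]

pretrain_mask = [1, 1, 1, 1]   # every position counts
sft_mask = [0, 0, 1, 1]        # first two targets are prompt tokens: ignored

pretrain_loss = masked_mean_nll(nll, pretrain_mask)
sft_loss = masked_mean_nll(nll, sft_mask)

print(f"pretraining loss: {pretrain_loss}")
print(f"SFT loss:         {sft_loss}")
```

The forward pass and the per-token losses are identical in both cases; only the mask, and therefore which tokens send gradient back, changes.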

In the next lesson, we will see this architecture trained on raw text with the dense cross-entropy loss — and watch the per-character probability distribution sharpen as compression discovers grammar, then arithmetic, then structure. That is pretraining.