
Architecture — the shape under every stage

A line-by-line tour of model.py. The five post-training lessons that follow change the data and the loss, never this.

What we are modelling

An autoregressive language model factors a joint distribution over tokens as a product of conditionals:

$$p(x_1, x_2, \dots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_{<t})$$

Maximum-likelihood training minimises the sum of per-token negative log-likelihoods, which is the same thing as cross-entropy with the next token as target. That is the entire game. Every architectural decision below exists so that one forward pass can compute the conditional p(x_t | x_{<t}) for every t in parallel, and produce the gradients that raise the joint log-likelihood (equivalently, lower the loss). If a design choice does not directly serve that, it has no business being in model.py.
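As a minimal sketch of that objective (the probabilities below are made up for illustration and do not come from model.py), the joint log-likelihood is the sum of the per-token conditional log-probabilities, and the training loss is its negative:

```python
import math

# Hypothetical conditionals p(x_t | x_<t) that a model assigned to the
# tokens that actually occurred in a 4-token sequence.
cond_probs = [0.5, 0.25, 0.8, 0.1]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# Maximum likelihood minimises the summed negative log-likelihood.
nll = -sum(math.log(p) for p in cond_probs)

# The two views agree: nll equals -log(joint).
print(f"joint = {joint:.4f}, nll = {nll:.4f}, -log(joint) = {-math.log(joint):.4f}")
```

Raising any single conditional lowers the loss, which is why one scalar is enough to train every position at once.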

The remarkable thing, and the reason this lesson exists once and the next five lessons reuse it verbatim, is that the same predictor p_θ(x_t | x_{<t}) is what we need for pretraining, instruction-tuning (SFT), chain-of-thought, DPO, and RL-from-verifiable-rewards. The post-training pipeline does not add new neural machinery; it only shifts which positions in a sequence count toward the loss and where the supervisory signal comes from. Hold that thought: it is the single most freeing idea in the whole series.

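Symbols, kept compact

B        batch size
T        sequence length of the current batch (T ≤ Tmax)
Tmax     maximum context length (rows in the position table)
V        vocabulary size
d        model width (residual-stream dimension)
h        number of attention heads
k        per-head width, k = d / h
L        number of Transformer blocks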

One forward pass, exact shapes

Here is the entire forward pass with every intermediate tensor labelled. Read it once top-to-bottom; we will then walk each line.

name        shape        computed as                        note
idx         (B, T)       input                              int64 token ids
tok_emb     (B, T, d)    Wte[idx]                           Wte: (V, d)
pos_emb     (T, d)       Wpe[0:T]                           Wpe: (T_max, d)
x           (B, T, d)    tok_emb + pos_emb                  broadcast over B
--- repeat L times -----------------------------------------------------
x           (B, T, d)    x + Attn(LN(x))                    pre-norm residual
x           (B, T, d)    x + MLP(LN(x))                     pre-norm residual
-----------------------------------------------------------------------
x           (B, T, d)    LayerNorm(x)                       final norm
logits      (B, T, V)    x @ Wteᵀ                           weight-tied head
loss        scalar       CE(logits[:, :-1], idx[:, 1:])     next-token cross-entropy

Notice three things straight away. First, the residual stream is (B, T, d) from start to finish: every block takes that shape in and returns the same shape. Second, the only places that touch the vocabulary are the embedding at the bottom and the head at the top, and those share weights. Third, the loss is built from logits at positions [0..T−2] against targets at positions [1..T−1]: we predict every token from its predecessors, all in one shot. That parallelism over positions is a large part of why Transformers train so much faster than RNNs, which have to step through a sequence one position at a time.
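To make the third point concrete, here is a small sketch of which (context, target) pairs a single sequence yields. The token ids are arbitrary, chosen only for illustration:

```python
# One toy sequence of T = 5 token ids (arbitrary values).
idx = [7, 3, 9, 4, 2]
T = len(idx)

# The logit at position t is computed from idx[0..t] and is scored against
# idx[t+1]. That is exactly logits[:, :-1] paired with idx[:, 1:].
examples = [(idx[: t + 1], idx[t + 1]) for t in range(T - 1)]

for context, target in examples:
    print(f"context {context} -> target {target}")
```

A sequence of length T therefore contributes T − 1 supervised predictions from one forward pass.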

Embeddings: addition, not concatenation

The first non-trivial line of the forward pass is:

x = self.tok_emb(idx) + self.pos_emb(pos)              # (B, T, d)

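Two embedding tables: token (V × d) and position (Tmax × d). The token vector says what this slot holds; the position vector says where in the sequence the slot is. Why add them instead of concatenating? Addition keeps the residual stream at width d, and it gives up little: any linear layer that reads a concatenation [tok; pos] computes W_tok·tok + W_pos·pos, which is itself a sum of two vectors, and learned tables can absorb those projections.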
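One way to see why addition loses little relative to concatenation is the following sketch. All vectors and weights here are invented for illustration; the point is only that a linear layer applied to a concatenation equals a sum of two separately projected vectors:

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

tok = [1.0, 2.0]    # toy token vector
pos = [0.5, -1.0]   # toy position vector

# A linear layer reading the concatenation [tok; pos] has weights [W_tok | W_pos].
W_tok = [[1.0, 0.0], [2.0, 1.0]]
W_pos = [[0.0, 3.0], [1.0, 1.0]]
W_cat = [row_t + row_p for row_t, row_p in zip(W_tok, W_pos)]  # 2 rows, 4 columns

from_concat = matvec(W_cat, tok + pos)  # list "+" concatenates: [tok; pos]
from_sum = [a + b for a, b in zip(matvec(W_tok, tok), matvec(W_pos, pos))]

print(from_concat, from_sum)
```

Because both embedding tables are learned, the two projections can be folded into the tables themselves, which leaves plain addition in a shared d-dimensional space.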

GPT-2 uses learned position embeddings — one vector per slot, learned by gradient descent. The trade-off:

scheme       extrapolation          fit in-distribution   cost
learned      fails past Tmax        slightly best         Tmax·d params
sinusoidal   works (formula)        slightly worse        0 params
RoPE         works, popular today   strong                0 params
ALiBi        works, simple bias     strong                O(h) params

For this codebase the toy sequences fit comfortably under Tmax, so we accept the cap in exchange for one fewer concept. If you ever want to deploy outside the training length, replace nn.Embedding(T_max, d) with RoPE in the attention module — nothing else needs to change.

Causal self-attention, shape by shape

This is the part with the most moving pieces. Walk it with me, shape by shape.

step                        shape after the step
x                           (B, T, d)
qkv (one Linear)            (B, T, 3d)
q, k, v (chunk)             each (B, T, d)
split heads, transpose      (B, h, T, k)
scores = q · kᵀ / √k        (B, h, T, T)
+ causal mask, softmax      (B, h, T, T)
attn · v                    (B, h, T, k)
merge heads                 (B, T, d)
proj (Linear d → d)         (B, T, d)

Each step's annotation is the new shape after that step.

Now the same flow in code. From CausalSelfAttention.forward:

# Assumed (not shown in this excerpt): B, T, d = x.shape earlier in forward.
qkv = self.qkv(x)                                      # (B, T, 3d)
q, k, v = qkv.chunk(3, dim=-1)                         # each (B, T, d)

q = q.view(B, T, self.h, self.k).transpose(1, 2)       # (B, h, T, k)
k = k.view(B, T, self.h, self.k).transpose(1, 2)       # (B, h, T, k)
v = v.view(B, T, self.h, self.k).transpose(1, 2)       # (B, h, T, k)

scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.k) # (B, h, T, T)
scores = scores.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float("-inf"))

attn = F.softmax(scores, dim=-1)                       # (B, h, T, T)
out  = attn @ v                                        # (B, h, T, k)
out  = out.transpose(1, 2).contiguous().view(B, T, d)  # (B, T, d)
return self.proj(out)                                  # (B, T, d)
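The shape comments above can be checked mechanically. The helper below is a hypothetical shape calculator, not part of model.py; it only does the integer bookkeeping for a given B, T, d, h:

```python
def attention_shapes(B, T, d, h):
    """Return the tensor shape after each step of causal self-attention."""
    assert d % h == 0, "d must be divisible by h"
    k = d // h  # per-head width
    return [
        ("x", (B, T, d)),
        ("qkv", (B, T, 3 * d)),
        ("q, k, v", (B, T, d)),
        ("split heads", (B, h, T, k)),
        ("scores", (B, h, T, T)),
        ("softmax", (B, h, T, T)),
        ("attn @ v", (B, h, T, k)),
        ("merge heads", (B, T, d)),
        ("proj", (B, T, d)),
    ]

# B = 2 is an arbitrary batch size; the rest matches the toy configuration.
shapes = attention_shapes(B=2, T=11, d=128, h=4)
for name, shape in shapes:
    print(f"{name:12s} {shape}")
```

The first and last shapes match, which is what lets the block's output be added straight back onto the residual stream.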

Why one fused QKV linear instead of three

The line self.qkv = nn.Linear(d, 3*d, bias=False) is not a stylistic choice — it is a hardware one. Three separate Linears would issue three GEMM kernel launches and read x from HBM three times. The fused Linear is one GEMM, one read of x, one write of (B, T, 3d). Arithmetic intensity goes up, kernel-launch overhead goes down. The math is identical because three independent linears stacked output-wise is a single Linear of triple width.
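The claim that the math is identical can be verified on a tiny example. This sketch uses plain lists and made-up weights in (in, out) layout; nn.Linear stores its weight as (out, in), so the real tensors are the transpose of what is shown here:

```python
def matmul(X, W):
    """(n, d) @ (d, m) -> (n, m) using plain lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

x = [[1.0, 2.0], [3.0, 4.0]]   # (T=2, d=2)
Wq = [[1.0, 0.0], [0.0, 1.0]]  # three separate (d, d) projections
Wk = [[2.0, 0.0], [0.0, 2.0]]
Wv = [[0.0, 1.0], [1.0, 0.0]]

# Fused weight: concatenate along the output axis -> (d, 3d).
W_qkv = [rq + rk + rv for rq, rk, rv in zip(Wq, Wk, Wv)]

qkv = matmul(x, W_qkv)         # (T, 3d), one matmul
d = len(Wq)
q = [row[:d] for row in qkv]   # the same split that chunk(3, dim=-1) performs
k = [row[d:2 * d] for row in qkv]
v = [row[2 * d:] for row in qkv]

print(q == matmul(x, Wq), k == matmul(x, Wk), v == matmul(x, Wv))
```

The fused version reads x once and produces all three projections, which is the whole point of the hardware argument.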

Why split heads and transpose

After chunk we have q, k, v each of shape (B, T, d). We want to do h independent attention computations, each on dim-k queries and keys. The trick is to reshape (B, T, d) → (B, T, h, k) — same memory, just a different view — then transpose to (B, h, T, k) so that the leading dims (B, h) get batched over by the matmul, and the tail dims (T, k) become the actual matrix being multiplied. A single q @ k.transpose(-2, -1) then produces (B, h, T, T) in one shot.

The transpose is what makes that batching work. After the matmul we transpose back and call .contiguous() before .view, because PyTorch's view cannot reinterpret strides that have been permuted — you need a contiguous buffer.

Why divide by √k

Suppose the entries of q and k are roughly i.i.d. with mean 0 and variance 1. Then the dot product of two k-vectors is a sum of k independent products, each with variance 1. The variance of the sum is k; its standard deviation is √k. Without the scaling, raw scores would grow as √k on average. Plug that into a softmax: as scores get large, softmax becomes one-hot, and one-hot softmax has zero gradient. Training stalls.

Dividing by √k standardises score variance to O(1) regardless of head dimension. The softmax stays in its informative regime (gradients of order 1), and you can change h and k without re-tuning anything else. This is one of the very few places in the architecture where a derivation, not an experiment, picks the constant.

Quick check
Try removing the √k divisor in your head. With d = 128 and h = 4 we have k = 32. Score std without scaling is √32 ≈ 5.7. After softmax, a single dominant logit of 5.7 above the rest gives e^5.7 ≈ 300× the mass, which is already nearly one-hot. With /√k applied, the std is 1, and softmax spreads weight broadly enough for the gradient signal to pass.
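The quick check can also be run numerically. This is a small Monte Carlo sketch under the same assumption as the derivation (i.i.d. standard-normal entries); the sample count and the four-way softmax are arbitrary choices for illustration:

```python
import math
import random

random.seed(0)
k = 32            # head width from the quick check (d = 128, h = 4)
n_samples = 5000

def std(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def softmax(zs):
    top = max(zs)
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Dot products of random k-vectors with unit-variance entries.
raw = [sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(k))
       for _ in range(n_samples)]
raw_std = std(raw)                                   # close to sqrt(k)
scaled_std = std([s / math.sqrt(k) for s in raw])    # close to 1

# One logit a full standard deviation above three others, before and after scaling.
peaked = softmax([math.sqrt(k), 0.0, 0.0, 0.0])
mild = softmax([1.0, 0.0, 0.0, 0.0])

print(f"raw std {raw_std:.2f}, scaled std {scaled_std:.2f}")
print(f"top weight unscaled {peaked[0]:.3f}, scaled {mild[0]:.3f}")
```

The unscaled case puts almost all weight on one key, while the scaled case leaves meaningful weight on the others.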

The causal mask

The causal mask is a lower-triangular matrix of 1s, broadcast as (1, 1, Tmax, Tmax) so it fits the (B, h, T, T) score tensor. Positions in the upper triangle get set to −∞ before the softmax, which makes the corresponding probabilities exactly zero. The mask is the entire reason this architecture can train B · T next-token predictions in parallel: at every position t, the model sees only positions ≤ t, so the prediction for t+1 never "cheats" by reading future tokens that the inference-time model will not have.

A 4×4 causal-mask grid. Row = query position t; column = key position j. ✓ = attended, ✗ = masked to −∞.

       j=0  j=1  j=2  j=3
t=0     ✓    ✗    ✗    ✗
t=1     ✓    ✓    ✗    ✗
t=2     ✓    ✓    ✓    ✗
t=3     ✓    ✓    ✓    ✓

Position 0 sees only itself. Position 1 sees 0 and itself. Position 2 sees 0, 1, and itself. Position 3 sees the whole prefix.

Without the mask, predicting x_{t+1} from a context that includes x_{t+1} itself is trivial: the model could learn to copy the answer, train loss would collapse toward zero, and the model would be useless at inference time, where future tokens do not exist yet. The mask is what makes "one forward pass = T supervised targets" honest.
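Here is a minimal sketch of the masking step in plain Python, with uniform toy scores, showing that −∞ entries come out of the softmax as exact zeros. It mirrors the masked_fill followed by softmax pattern but is not the code from model.py:

```python
import math

T = 4
# Lower-triangular mask: mask[t][j] == 1 means query t may attend to key j.
mask = [[1 if j <= t else 0 for j in range(T)] for t in range(T)]

def masked_softmax(row_scores, row_mask):
    masked = [s if m else float("-inf") for s, m in zip(row_scores, row_mask)]
    top = max(masked)                            # the diagonal is always unmasked
    exps = [math.exp(s - top) for s in masked]   # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0] * T for _ in range(T)]           # uniform toy scores
attn = [masked_softmax(s, m) for s, m in zip(scores, mask)]

for row in attn:
    print([round(a, 3) for a in row])
```

With uniform scores, row t spreads its weight evenly over the t + 1 visible positions and gives nothing to the future.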

Multi-head attention: why split d into h heads

A single attention head with width d can attend in one way at a time: a single similarity geometry, a single set of features that count as "relevant." Splitting the same d into h heads of width k = d / h gives the model h independent attention patterns, mixed back together by the output projection self.proj.

Empirically, heads specialise without being told to. The classic catalog (from interpretability work on GPT-2) includes "previous-token" heads that always copy from t − 1, "duplicate-token" heads that copy the most recent occurrence of the current token, "induction" heads that complete patterns, and various long-range positional heads. The output projection Wo is then in charge of combining these specialised channels into the next residual update.

The FLOP cost is essentially the same as a single big head, because the dominant cost is the q · k and attn · v matmuls, and those scale with the total hidden dimension. Splitting just rearranges where the bytes go; it does not add work. So you get h different attention patterns for free — a bargain so good it has no real competitor.
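The cost argument is simple enough to count directly. The function below is a rough sketch that counts only the two attention matmuls for one sequence (2 FLOPs per multiply-add) and ignores the softmax and the projections:

```python
def attention_matmul_flops(T, d, h):
    """FLOPs for q·kᵀ and attn·v over all heads, for one sequence."""
    k = d // h
    qk = h * (2 * T * T * k)   # per head: (T, k) @ (k, T)
    av = h * (2 * T * T * k)   # per head: (T, T) @ (T, k)
    return qk + av

for h in (1, 4, 8):
    print(h, attention_matmul_flops(T=11, d=128, h=h))
```

Since h · k = d, the total is 4 · T² · d whatever the head count, which is the sense in which the extra attention patterns come for free.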

MLP: 4× expansion and GELU

After attention has moved information between positions, the MLP transforms features at each position independently:

self.fc   = nn.Linear(d, 4 * d)
self.proj = nn.Linear(4 * d, d)
# forward: proj(gelu(fc(x)))

Three points worth being explicit about.

  1. Per-position. No mixing across the sequence — that has already happened in attention. The MLP is just a 2-layer feedforward applied identically at every t.
  2. 4× expansion. The historical Transformer ratio. Wide enough that the nonlinearity has room to do useful work; narrow enough that the MLP does not dominate the parameter budget completely. Per block: attention has 4d² params (QKV + proj), MLP has 8d² (two Linears of width 4d). The MLP holds about 2/3 of every block's weights, a fact people often miss.
  3. GELU, not ReLU. GELU is a smooth ReLU-like function: x · Φ(x). Smoothness gives non-zero gradient for small negative inputs, which empirically trains a little better than the hard ReLU bend, with negligible compute cost.

Alternatives exist: SwiGLU (used in LLaMA) replaces GELU with a gated variant and tweaks the expansion to 8/3 to preserve param count. For a teaching codebase, the plain GELU + 4× recipe is one fewer moving part and the qualitative behaviour is the same.
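For reference, the exact form x · Φ(x) can be written with the standard library's error function. This is a sketch for building intuition, not the implementation model.py calls:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x = {x:+.1f}   gelu = {gelu(x):+.4f}   relu = {relu(x):+.4f}")
```

For small negative inputs GELU returns a small negative value where ReLU returns exactly zero, and for large positive inputs the two nearly coincide.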

Pre-norm vs post-norm residuals

The Transformer block is:

x = x + self.attn(self.ln1(x))
x = x + self.mlp (self.ln2(x))

Note where the LayerNorm sits: inside the residual branch, applied to the input of each sub-layer. This is pre-norm. The original "Attention Is All You Need" paper used post-norm: x = LN(x + Attn(x)), with LN outside. Both are common in writing; only one trains nicely at depth without warmup.

pre-norm (used here):   input x → LayerNorm → Attn / MLP → + (residual add of x)
post-norm (original):   input x → Attn / MLP → + (residual add of x) → LayerNorm

The difference matters because of how the residual stream grows with depth. In pre-norm, each block adds an LN-normalised perturbation to x: the residual stream's variance grows at most linearly in L, and the final ln_f renormalises before the head. In post-norm, the LN comes after the addition, so the gradient that flows back through the residual passes through L LayerNorms in series — small Jacobian errors compound, the early-layer gradients become tiny or huge, and you need an LR warmup to keep things sane (Xiong et al., 2020, gives the explicit signal-propagation analysis).

Pre-norm is the default in most modern LLMs (GPT-2 was the inflection point). The cost is essentially zero (same params, same FLOPs), and it makes training far less dependent on a carefully tuned warmup, which is the kind of "free" you take.
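The two orderings can be compared on a single vector. In this sketch the LayerNorm has no learned gain or bias and the sub-layer is a hypothetical stand-in (a plain scaling), so it illustrates only where the normalisation sits, not real training dynamics:

```python
import math

def layer_norm(v, eps=1e-5):
    """LayerNorm without the learned gain and bias (an assumed simplification)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def sublayer(v):
    """Hypothetical stand-in for Attn or MLP: any shape-preserving map."""
    return [0.5 * a for a in v]

def pre_norm_block(x):
    # x + f(LN(x)): the identity path carries x through untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_block(x):
    # LN(x + f(x)): the whole sum, identity path included, is renormalised.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [10.0, 20.0, 30.0, 40.0]
pre = pre_norm_block(x)
post = post_norm_block(x)
print("pre :", [round(a, 3) for a in pre])
print("post:", [round(a, 3) for a in post])
```

The pre-norm output is x plus a bounded perturbation, while the post-norm output has been rescaled to zero mean and unit variance, so the original scale of x is gone after every block.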

Weight tying: the head is the embedding

Look at the model's __init__:

self.head = nn.Linear(d, vocab, bias=False)
self.head.weight = self.tok_emb.weight                 # share storage

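Two operations, one weight matrix. The embedding maps a token id to a d-vector by looking up a row of Wte. The head maps a d-vector to logits over the vocabulary by multiplying by Wteᵀ. Sharing the storage cuts V · d parameters and empirically improves perplexity. Why? One common explanation is that both matrices answer the same question, namely which direction in d-space corresponds to which token, so a single table can serve both roles, and each row then receives gradient both when its token appears as input and whenever the output softmax scores it.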

The trade-off is rigidity: if you want a different output vocabulary from your input vocabulary (some translation setups), you cannot tie. For a same-vocab LM it is essentially always a win.
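A tiny sketch of the two roles played by one matrix. The 3 × 2 table is invented for illustration; in model.py the sharing is done by pointing self.head.weight at self.tok_emb.weight:

```python
# One shared table: V = 3 tokens, d = 2 dimensions (made-up values).
Wte = [[0.1, 0.2],
       [0.3, -0.4],
       [-0.5, 0.6]]

def embed(token_id):
    """Embedding: look up a row of Wte."""
    return Wte[token_id]

def head(hidden):
    """Head: hidden @ Wteᵀ, one dot product per vocabulary row."""
    return [sum(w * x for w, x in zip(row, hidden)) for row in Wte]

hidden = embed(1)        # pretend the blocks left the vector unchanged
logits = head(hidden)
print(logits)
```

Each logit is the dot product between the hidden state and a token's embedding row, so with tying the head scores tokens by similarity in the same space the inputs live in.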

Init std 0.02

From _init_weights:

nn.init.normal_(m.weight, mean=0.0, std=0.02)

This is smaller than Kaiming's √(2/d) — for d = 128, Kaiming would give std ≈ 0.125, six times larger. The reason: in a deep residual stack, every block adds a perturbation whose magnitude scales with the magnitudes of its weights. Pile up L such additions and a Kaiming-scale init can blow up the residual stream's variance, especially in attention's value path. Small-std init keeps everything calm at step 0 and lets the optimizer find a good scale.

GPT-2 actually goes one step further: it rescales the residual-projection weights (self.proj in attention and MLP) by an additional 1/√(2L). The idea is to make the contribution of each layer's residual branch scale-stable with depth. We omit that refinement in this codebase to keep the init function one clean rule — it costs a little stability at very large L but is invisible at L = 4.
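The numbers in this section are quick to reproduce. The sketch below just evaluates the three standard deviations being compared, using the toy d and L from this lesson:

```python
import math

d, L = 128, 4

kaiming_std = math.sqrt(2.0 / d)            # Kaiming-style scale for width d
flat_std = 0.02                             # the flat rule used in _init_weights
ratio = kaiming_std / flat_std              # how much larger Kaiming would be
scaled_resid_std = flat_std / math.sqrt(2 * L)   # GPT-2's extra 1/sqrt(2L) factor

print(f"kaiming {kaiming_std:.4f}, flat {flat_std:.4f}, ratio {ratio:.2f}")
print(f"residual-projection std with the 1/sqrt(2L) rescale: {scaled_resid_std:.5f}")
```

At L = 4 the rescale shrinks the residual projections by a factor of √8 ≈ 2.8, which is small next to the factor of about six already separating the flat rule from Kaiming.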

Parameter accounting

How big is this thing, and where do the parameters live? Let us count by hand, then check against the calculator below.

component                    params                    note
token embedding Wte          V·d                       shared with head
position embedding Wpe       Tmax·d
per block: qkv Linear        3·d·d = 3d²               fused
per block: attn proj         d·d = d²
per block: MLP fc            d·4d + 4d = 4d² + 4d      bias on
per block: MLP proj          4d·d + d = 4d² + d        bias on
per block: 2× LayerNorm      4d                        γ, β each (d,)
final LayerNorm              2d
head                         0                         tied to Wte

Per block: 3d² + d² + 4d² + 4d² + O(d) = 12d² + O(d). Total: N ≈ V·d + Tmax·d + L · 12 d² + O(L·d).

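For the SFT toy task with d = 128, h = 4, L = 4, V = 29, Tmax = 11, the table gives:

token embedding       29 · 128        =   3,712
position embedding    11 · 128        =   1,408
blocks                4 × 197,760     = 791,040
final LayerNorm       2 · 128         =     256
total                                 = 796,416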

Crucially, the block parameters dominate by roughly 150×. Embeddings barely matter at this size. As you scale d, the L · 12 d² term grows quadratically while the embedding terms grow linearly — eventually almost everything is "in the blocks."
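The hand count can be written as a short function that follows the parameter table row by row. The function name and the returned dictionary are illustrative, not part of model.py:

```python
def count_params(V, T_max, d, L):
    """Parameter count following the table above (tied head contributes 0)."""
    embeddings = V * d + T_max * d
    per_block = (
        3 * d * d            # fused qkv Linear, no bias
        + d * d              # attention output proj, no bias
        + 4 * d * d + 4 * d  # MLP fc, with bias
        + 4 * d * d + d      # MLP proj, with bias
        + 4 * d              # two LayerNorms, gain and bias each
    )
    final_ln = 2 * d
    blocks = L * per_block
    return {
        "embeddings": embeddings,
        "blocks": blocks,
        "final_ln": final_ln,
        "total": embeddings + blocks + final_ln,
    }

toy = count_params(V=29, T_max=11, d=128, L=4)
print(toy)
print("blocks / embeddings =", round(toy["blocks"] / toy["embeddings"], 1))
```

For the toy configuration the blocks hold roughly 150 times as many parameters as the two embedding tables.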

Interactive: shape calculator and parameter counter

Move the sliders. Watch every intermediate tensor shape, the per-component parameter count, the rough forward-pass FLOPs, and the causal-mask heatmap update in real time. The constraint d mod h = 0 is enforced — when you change d, h is snapped to a divisor.

Forward-pass shapes and parameter counts
Embedding shapes are blue, attention shapes are orange, MLP shapes are green, head shapes are purple. The FLOPs estimate uses the Chinchilla-style "2N per token" rule for inference and "6N per token" for training; it ignores the attention quadratic term in T, which becomes significant only for T comparable to d.
When the "2N per token" rule breaks
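The Chinchilla rule counts dense-linear FLOPs and ignores attention's O(T²) term. For our toy T = 11, d = 128, attention is a rounding error. For T = 4096, d = 4096, it no longer is: the score and value matmuls cost roughly 4·T·d FLOPs per token per layer against about 24·d² for the dense layers, so attention adds on the order of a sixth, and its share keeps growing linearly in T. Whenever T approaches d the rule starts to under-count; for production reasoning models with long contexts this matters and a full FLOP count keeps both terms.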
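A rough sketch of a count that keeps both terms. It assumes 2 FLOPs per multiply-add, no savings from causal masking, and about 2·T·d FLOPs per token per layer for each of the two attention matmuls; the 32-layer configuration is hypothetical:

```python
def forward_flops_per_token(n_params, L, T, d):
    """Return (dense, attention) forward FLOPs per token."""
    dense = 2 * n_params          # the "2N per token" rule
    attention = L * 4 * T * d     # q·kᵀ plus attn·v, about 2·T·d each per layer
    return dense, attention

# Toy configuration; 796,416 is the total the parameter table implies.
toy_dense, toy_attn = forward_flops_per_token(796_416, L=4, T=11, d=128)

# Hypothetical large configuration with T = d, counting block parameters only.
big_L, big_d, big_T = 32, 4096, 4096
big_dense, big_attn = forward_flops_per_token(
    big_L * 12 * big_d ** 2, L=big_L, T=big_T, d=big_d
)

print(f"toy: attention / dense = {toy_attn / toy_dense:.4f}")
print(f"big: attention / dense = {big_attn / big_dense:.4f}")
```

Under these assumptions the ratio is T / (6d), so the attention term matches the dense term only once T reaches about six times d.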

The design space, in one table

Every choice above had a reasonable alternative. Here is the trade-off matrix to keep next to you when you read other codebases.

choice in model.py               alternative                         why we picked this one
learned positions, Tmax fixed    sinusoidal · RoPE · ALiBi           simplest; toy seqs fit. Swap to RoPE for length extrapolation.
tied unembedding (head = Wte)    separate head matrix                fewer params, better PPL, same-vocab in/out.
pre-norm                         post-norm                           trainable without warmup, stable at depth.
MLP 4× expansion, GELU           SwiGLU 8/3× · ReLU · GeGLU          simplest recipe, qualitatively identical at this scale.
bias=False on qkv & proj         bias=True                           biases barely help in attention; saves a few params.
init std 0.02 flat               0.02 + 1/√(2L) on residual projs    at L=4 the refinement is invisible; we drop it for clarity.
fused qkv Linear (d→3d)          3 separate Linears                  one GEMM, one HBM read of x.
F.softmax + masked_fill          FlashAttention / SDPA               readable; production stacks use the fused op for memory.
Takeaway
Every shape in this architecture serves the same goal: compute p(x_t | x_{<t}) for every t in parallel, cheaply, and stably enough that gradient descent can drive the negative log-likelihood down. Pre-norm + weight tying + √k scaling + the causal mask are the four design choices that make that goal reachable at depth.

Where this lesson sits in the pipeline

Take a step back. The next five lessons — pretrain, SFT, CoT, DPO, RLVR — all run this same architecture. They differ only in two ways:

  1. The data. Raw corpus → (prompt, response) pairs → (prompt, reasoning+answer) → (prompt, y_w, y_l) preference pairs → (prompt, rolled-out completion, verifier score).
  2. The positions that count in the loss. All positions → response positions only → response positions only → preference-pair logratios → reward-weighted log-probs of response tokens with a KL anchor.

The architecture is fixed. The post-training story is, line for line, about what you put into idx and which entries of logits you turn into a loss. Holding that separation in your head is what makes the whole pipeline tellable as a linear story.
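The idea that stages differ in which positions count can be sketched with a loss mask. The log-probabilities and the 3 + 3 prompt/response split below are invented for illustration; the later lessons define the real masks:

```python
import math

def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    picked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(picked) / len(picked)

# Hypothetical log p(x_t | x_<t) for 6 tokens: 3 prompt tokens, then 3 response tokens.
logprobs = [math.log(p) for p in (0.2, 0.5, 0.1, 0.9, 0.8, 0.7)]

pretrain_mask = [1, 1, 1, 1, 1, 1]   # every position counts
sft_mask      = [0, 0, 0, 1, 1, 1]   # only response positions count

pretrain_loss = masked_nll(logprobs, pretrain_mask)
sft_loss = masked_nll(logprobs, sft_mask)
print(f"pretrain loss {pretrain_loss:.4f}, SFT loss {sft_loss:.4f}")
```

The model and its per-token log-probabilities are identical in both cases; only the mask, and therefore the gradient, changes.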

In the next lesson, we will see this architecture trained on raw text with the dense cross-entropy loss — and watch the per-character probability distribution sharpen as compression discovers grammar, then arithmetic, then structure. That is pretraining.