Mini GPT — Post-Training, From First Principles

A linearized tour of the four post-training stages that turn a text-completer into a reasoning assistant — built so you understand why each loss exists before you see the code.

This series of six interactive lessons unwraps a minimal char-level GPT-2 and the four post-training stages stacked on top of it: SFT, CoT, DPO, RLVR (GRPO). The architecture under all four is identical — model.py never changes. Each stage exists because the previous stage cannot do one specific thing; reading them in order answers the question "why bother with X at all?" by showing what the prior stage failed at. Every lesson has at least one interactive widget so you can grab a knob and feel the consequence.

Who this is for

You can read Python and you know what a neural network is, but the full chain from pretraining to RLVR is a blur. By the end you'll be able to point at any line in gpt_mini/ and say what it is doing and why that loss term has the shape it has.

The pipeline you're learning

The picture: raw text → instruction-following → reasoning traces → preference shaping → exploration under a verifier. Each arrow is a single new idea that the previous box could not express. Hover a box to see what it adds.

The lessons

Architecture — the shape under every stage

A line-by-line tour of model.py: embedding addition, scaled dot-product attention, the causal mask, multi-head split/merge, pre-norm residuals, weight tying. Trade-offs at every choice. Interactive: pick (B,T,d,h,L), see every intermediate shape and parameter count.
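The shape and parameter bookkeeping described above can be sketched in a few lines. This is a rough sketch, not the contents of model.py: it assumes a GPT-2-style block (biases on every linear layer, a 4×d MLP, two LayerNorms per block plus a final one, learned position embeddings). The function names `attention_shapes` and `param_count` are hypothetical.

```python
def attention_shapes(B, T, d, h):
    """Shapes of the main intermediates in one multi-head attention layer."""
    assert d % h == 0, "model width d must divide evenly across h heads"
    d_head = d // h
    return {
        "x": (B, T, d),                    # token + position embeddings, summed
        "q_k_v_each": (B, h, T, d_head),   # after the multi-head split
        "scores": (B, h, T, T),            # q @ k^T / sqrt(d_head), before the causal mask
        "merged": (B, T, d),               # heads concatenated back to width d
    }


def param_count(vocab, T_max, d, L, tied=True):
    """Parameter count under the GPT-2-style assumptions stated above."""
    # Per block: qkv (3d^2 + 3d) + output proj (d^2 + d)
    #          + MLP (8d^2 + 5d) + two LayerNorms (4d)
    per_block = 12 * d * d + 13 * d
    embeddings = vocab * d + T_max * d     # token table + position table
    final_norm = 2 * d
    head = 0 if tied else vocab * d        # weight tying reuses the token table
    return embeddings + L * per_block + final_norm + head


if __name__ == "__main__":
    print(attention_shapes(B=2, T=8, d=64, h=4)["scores"])   # (2, 4, 8, 8)
    print(param_count(vocab=50257, T_max=1024, d=768, L=12))  # 124439808
```

With GPT-2 small's dimensions the count comes to 124,439,808. Passing `tied=False` adds another vocab·d parameters for a separate output head, which is the cost weight tying avoids.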

Pretrain — next-token prediction is unreasonably effective

Why one loss — minimizing −Σ log p(x_t | x_<t) — produces grammar, world knowledge, arithmetic, and code as side effects of compression. Why the resulting model can't follow instructions. Interactive: watch a char-level model concentrate probability mass on plausible continuations as training proceeds.
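To make the loss concrete, here is a minimal pure-Python sketch of the mean next-token negative log-likelihood. It assumes the model emits one logit vector per position and that position t scores the token at position t+1. The names `log_softmax` and `next_token_nll` are illustrative, not taken from 00_pretrain.py.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def next_token_nll(logits_per_pos, token_ids):
    """Mean of -log p(x_t | x_<t) over a sequence.

    logits_per_pos[t] is the model's output after reading token_ids[:t+1],
    so it is scored against token_ids[t+1] (the usual one-step shift).
    """
    n_targets = len(token_ids) - 1
    total = 0.0
    for t in range(n_targets):
        total -= log_softmax(logits_per_pos[t])[token_ids[t + 1]]
    return total / n_targets


if __name__ == "__main__":
    tokens = [0, 1, 2, 3]
    uniform = [[0.0] * 4 for _ in range(3)]   # an untrained model: no preference
    print(next_token_nll(uniform, tokens))    # log(4), about 1.386
```

A model with no information pays log(V) per token for a vocabulary of size V. Training pushes the loss below that by moving probability mass onto the continuations that actually occur.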

SFT — loss masking is what makes "supervised fine-tuning" supervised

From P(text) to P(response | prompt) without changing the loss shape — only the mask over which positions count. The chat template as a tokens-only protocol. Interactive: drag the mask boundary and watch the model's behavior degenerate when prompt positions leak into the gradient.
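The masking idea can be sketched without any framework. The snippet below assumes you already have the per-target log-probabilities from the model and only decides which of them count. `response_mask` and `masked_nll` are hypothetical helper names, and a real 01_sft.py may lay out the mask differently.

```python
def response_mask(prompt_len, total_len):
    """1 for targets that are response tokens, 0 for prompt tokens.

    Targets are shifted by one: target index t is the token at position t+1,
    so there are total_len - 1 targets in all.
    """
    return [1 if (t + 1) >= prompt_len else 0 for t in range(total_len - 1)]


def masked_nll(target_logprobs, mask):
    """Average -log p over the unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask removes every position; nothing to train on")
    return -sum(lp * m for lp, m in zip(target_logprobs, mask)) / kept


if __name__ == "__main__":
    # 3 prompt tokens followed by 2 response tokens -> 4 shifted targets.
    mask = response_mask(prompt_len=3, total_len=5)
    logprobs = [-5.0, -5.0, -1.0, -3.0]
    print(mask)                              # [0, 0, 1, 1]
    print(masked_nll(logprobs, mask))        # 2.0
    print(masked_nll(logprobs, [1] * 4))     # 3.5 (pretraining-style, no mask)
```

The loss function is the same in both calls; only the mask differs. With the mask, the gradient ignores how well the model predicts the prompt and trains only P(response | prompt).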

CoT — buying serial compute with response tokens

A depth-L transformer does O(L) sequential ops per forward pass. If a problem takes more serial steps than L, no amount of width helps. Emitting R reasoning tokens, however, buys R forward passes, each able to read what the earlier ones wrote, so the serial budget grows from L to roughly R·L. CoT is adaptive test-time compute as a data choice; the training code is unchanged from SFT. Interactive: slide the number of summands and see the direct-vs-CoT accuracy gap open up.
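Since CoT is a data choice, the difference shows up in how the training targets are written rather than in the training loop. Below is a hedged sketch for the summation task. The text format (the `; ` separator and the `answer=` tag) is invented for illustration and is not necessarily what 02_cot.py uses.

```python
def direct_target(nums):
    """Answer-only target: the model must do the whole sum in one pass."""
    return str(sum(nums))


def cot_target(nums):
    """Target with running partial sums written out before the answer.

    Each step needs only one addition, so the per-token work stays
    constant no matter how many summands there are.
    """
    acc = nums[0]
    steps = []
    for n in nums[1:]:
        steps.append(f"{acc}+{n}={acc + n}")
        acc += n
    return "; ".join(steps + [f"answer={acc}"])


if __name__ == "__main__":
    nums = [3, 4, 5]
    prompt = "+".join(map(str, nums)) + "="
    print(prompt, direct_target(nums))   # 3+4+5= 12
    print(prompt, cot_target(nums))      # 3+4+5= 3+4=7; 7+5=12; answer=12
```

Both targets are trained with the same masked SFT loss. The CoT target is longer, and that extra length is the serial compute the model gets to spend at test time.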

DPO — closed-form preference optimization

The KL-regularized RL optimum is a Gibbs distribution; invert it for r(x,y), substitute into Bradley–Terry, watch log Z(x) cancel across a pair. No reward model, no value head, no on-policy sampling. Interactive: watch the implicit reward margin r̂_w − r̂_l climb through training, and feel what β controls.
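The end result of that derivation is short enough to write out. With the implicit reward r̂(x,y) = β·[log π_θ(y|x) − log π_ref(y|x)], the per-pair loss is −log σ(r̂_w − r̂_l). The sketch below assumes the four sequence log-probabilities have already been summed over response tokens. `dpo_loss` is an illustrative name, not necessarily the one in 03_dpo.py.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta):
    """DPO loss for one preference pair, from sequence log-probs.

    pi_* are log-probs under the policy, ref_* under the frozen reference;
    *_w is the chosen response, *_l the rejected one.
    Returns (loss, implicit reward margin).
    """
    r_w = beta * (pi_w - ref_w)      # implicit reward of the chosen response
    r_l = beta * (pi_l - ref_l)      # implicit reward of the rejected response
    margin = r_w - r_l               # log Z(x) has already cancelled here
    loss = softplus(-margin)         # equals -log(sigmoid(margin))
    return loss, margin


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss = log 2.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1))
    # Policy has shifted mass toward the chosen response: positive margin.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0, beta=0.5))
```

At initialization the policy equals the reference, so every pair starts at a margin of 0 and a loss of log 2. β scales how much log-probability movement counts as one unit of reward, which is why it acts as the strength of the anchor to π_ref.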

RLVR — exploration under a verifier, with GRPO

When correctness is programmatically checkable, skip preferences entirely. Sample K rollouts from the policy, score each, use the group mean as the baseline (no value network), and anchor to π_ref with Schulman's k3 KL estimator. Interactive: sample K rollouts, inspect their advantages, and watch degenerate groups appear when the policy is too confident or too clueless: every rollout gets the same score, so every advantage is zero and the group contributes no gradient.
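The two pieces named above, the group baseline and the k3 estimator, fit in a few lines each. This is a minimal sketch under stated assumptions: rewards are plain floats from a verifier, log-probs are per sampled token, and the optional division by the group standard deviation (used in the original GRPO formulation, dropped by some variants) is off by default. The function names are hypothetical and not necessarily those in 04_rlvr.py.

```python
import math


def group_advantages(rewards, normalize_std=False, eps=1e-6):
    """Advantage of each rollout relative to its own group of K samples."""
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]     # group mean is the baseline
    if not normalize_std:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    return [c / (std + eps) for c in centered]


def k3_kl(policy_logprob, ref_logprob):
    """Schulman's k3 estimate of KL(policy || ref) for one sampled token.

    With log_ratio = log(ref / policy) on a token drawn from the policy,
    k3 = exp(log_ratio) - 1 - log_ratio, which is never negative.
    """
    log_ratio = ref_logprob - policy_logprob
    return math.expm1(log_ratio) - log_ratio


if __name__ == "__main__":
    print(group_advantages([1.0, 0.0, 0.0, 0.0]))  # [0.75, -0.25, -0.25, -0.25]
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
    print(k3_kl(-1.0, -1.0))                       # 0.0
```

The second call is the degenerate group: when every rollout passes (or every rollout fails) the advantages vanish and that prompt teaches nothing. Only groups with mixed outcomes produce a learning signal.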

How to use this

Linearly. Each lesson assumes the previous one. Lesson 3 only makes sense as a delta from lesson 2; lesson 6 only as a delta from lesson 5.
Touch every knob. Each widget has at least one configuration that breaks training. Find it. The bugs are the lesson.
Open the code. Each lesson links the corresponding file. The lessons explain the why; the code shows the what.

Companion code

Lessons map 1-to-1 to files in gpt_mini/: model.py, 00_pretrain.py, 01_sft.py, 02_cot.py, 03_dpo.py, 04_rlvr.py. For deeper coverage of the RL side (PPO, RLOO, DAPO, Dr.GRPO) see the sibling reinforcement_learning/lessons/ series.