Mini GPT — Post-Training, From First Principles
A linearized tour of the four post-training stages that turn a text-completer into a reasoning assistant — built so you understand why each loss exists before you see the code.
This series of six interactive lessons unpacks a minimal char-level GPT-2 and the four post-training stages stacked on top of it: SFT, CoT, DPO, and RLVR (GRPO). The architecture under all four is identical: model.py never changes. Each stage exists because the previous stage cannot do one specific thing, so reading them in order answers the question "why bother with X at all?" by showing what the prior stage failed at. Every lesson has at least one interactive widget so you can grab a knob and feel the consequence.
The goal is that you can open any file in gpt_mini/ and say what it is doing and why that loss term has the shape it has.
The pipeline you're learning
The picture: raw text → instruction-following → reasoning traces → preference shaping → exploration under a verifier. Each arrow is a single new idea that the previous box could not express. Hover a box to see what it adds.
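To make the arrows a little more concrete, here is a minimal sketch of the kind of training signal each stage typically adds. These are the common textbook forms written in plain Python, not the code in gpt_mini/; the function names (`masked_nll`, `dpo_loss`, `group_advantages`) and the default `beta` and `eps` values are assumptions chosen for illustration.

```python
import math

def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    Pretraining: mask is all ones (every next token counts).
    SFT / CoT (assumed): mask is 0 on prompt tokens and 1 on response tokens;
    for CoT the response simply includes the reasoning trace.
    """
    picked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(picked) / len(picked)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on sequence log-probs: -log sigmoid(beta * margin).

    margin = (policy log-ratio on the chosen answer)
           - (policy log-ratio on the rejected answer),
    each ratio taken against a frozen reference model.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # identical to -log(sigmoid(margin))

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize verifier rewards within one group
    of samples drawn for the same prompt (no learned value function)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Prompt token is masked out, so only the two response tokens are averaged.
sft = masked_nll([-1.0, -2.0, -3.0], [0, 1, 1])   # 2.5
# Policy equal to reference: the loss is log(2), the "no preference" value.
dpo = dpo_loss(0.0, 0.0, 0.0, 0.0)
# Two correct and two incorrect samples: correct ones get positive advantage.
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Read in order, the three functions mirror the pipeline: the first changes only which tokens are scored, the second needs a pair of answers and a reference model, and the third needs sampled answers plus a verifier reward.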
The lessons
model.py: embedding addition, scaled dot-product attention, the causal mask, multi-head split/merge, pre-norm residuals, weight tying. Trade-offs at every choice. Interactive: pick (B,T,d,h,L), see every intermediate shape and parameter count.
How to use this
- Linearly. Each lesson assumes the previous one. Lesson 3 only makes sense as a delta from lesson 2; lesson 6 only as a delta from lesson 5.
- Touch every knob. Each widget has at least one configuration that breaks training. Find it. The bugs are the lesson.
- Open the code. Each lesson links the corresponding file. The lessons explain the why; the code is the what.
gpt_mini/: model.py, 00_pretrain.py, 01_sft.py, 02_cot.py, 03_dpo.py, 04_rlvr.py.
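As a rough offline companion to the (B,T,d,h,L) shape widget in the model.py lesson, the sketch below tabulates intermediate shapes and counts parameters for a GPT-2 style stack. It assumes GPT-2 conventions that may not match model.py exactly: learned positional embeddings, biases on every linear layer, a 4x MLP expansion, two LayerNorms per block plus a final one, and an output head tied to the token embedding. The function name and the `vocab_size`, `block_size`, and `tie_weights` parameters are hypothetical.

```python
def gpt_shapes_and_params(B, T, d, h, L, *, vocab_size, block_size=None,
                          tie_weights=True):
    """Return (shapes, total_params) for an assumed GPT-2 style model."""
    if d % h != 0:
        raise ValueError("d must be divisible by h for the head split")
    block_size = T if block_size is None else block_size
    head_dim = d // h

    shapes = {
        "token ids": (B, T),
        "token emb + position emb": (B, T, d),
        "q, k, v per head": (B, h, T, head_dim),
        "attention scores": (B, h, T, T),   # causal mask is applied here
        "heads merged": (B, T, d),
        "MLP hidden": (B, T, 4 * d),
        "logits": (B, T, vocab_size),
    }

    embeddings = vocab_size * d + block_size * d
    attention = (3 * d * d + 3 * d) + (d * d + d)    # qkv proj + output proj
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # expand + contract
    layer_norms = 2 * (2 * d)                        # scale and shift, twice
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d
    lm_head = 0 if tie_weights else vocab_size * d   # tied head adds nothing

    total = embeddings + L * per_block + final_norm + lm_head
    return shapes, total

# Sanity check against the published GPT-2 small configuration.
shapes, total = gpt_shapes_and_params(
    B=2, T=8, d=768, h=12, L=12, vocab_size=50257, block_size=1024)
for name, shape in shapes.items():
    print(f"{name:28s} {shape}")
print("parameters:", total)
```

With the GPT-2 small configuration shown, the count comes to 124,439,808, which matches the commonly quoted figure for that model. Passing `tie_weights=False` adds `vocab_size * d` parameters, which is one way to see what weight tying saves.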
For deeper coverage of the RL side (PPO, RLOO, DAPO, Dr.GRPO) see the sibling RL/lessons/ series.