Famous recipes & failure modes — putting it all together

Six post-training recipes that defined the modern reasoning era — DeepSeek-R1-Zero, R1, Tülu 3, Qwen-2.5, InstructGPT, and the o-series sketch — described in the language of the previous twenty-two lessons. Then a taxonomy of the ways an RL run fails and the dashboards that catch each one.

Where this lesson sits

This is the closing lesson of Part III. Part I gave you the framework, Part II gave you the algorithms, and Part III gave you the reward and systems context to read modern papers. This lesson is the demonstration: every famous post-training recipe of the last three years factors into pieces you've already met. After this, Part IV (lessons 24–25) covers the engineer's perspective; Part V (lessons 26–28) is the synthesis layer.

How to read this lesson

Each recipe below is described as a tuple: (data, algorithm, reward source, system pattern, scale). The point is not to memorise them — the point is to see that every modern frontier post-training stack is some combination of pieces you've already met in lessons 01–22, with maybe one twist that justified the paper. By the end, you should be able to look at any new paper and chart it onto this map.
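One way to make the tuple concrete is a small record type. The sketch below is illustrative only: the `Recipe` class and the `shared_stages` helper are hypothetical names, and the field values are paraphrased from the recipe cards in this lesson rather than taken from any official source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Recipe:
    """One post-training recipe as the five-part tuple used in this lesson."""
    name: str
    data: str
    algorithm: Tuple[str, ...]  # training stages, in order
    reward: str
    system: str
    scale: str


# Field values are paraphrased from the recipe cards in this lesson.
INSTRUCT_GPT = Recipe(
    name="InstructGPT",
    data="~13k SFT demos + ~33k RM prompts + ~31k PPO prompts",
    algorithm=("SFT", "reward model", "PPO"),
    reward="learned 6B reward model",
    system="colocated PPO",
    scale="175B",
)

TULU_3 = Recipe(
    name="Tulu 3",
    data="~940k SFT mixture + DPO pairs + RLVR prompts",
    algorithm=("SFT", "DPO", "PPO"),
    reward="RM for general preferences + verifiers",
    system="disaggregated rollout/trainer",
    scale="8B / 70B / 405B",
)


def shared_stages(a: Recipe, b: Recipe) -> List[str]:
    """Stages of recipe `a` that also appear in recipe `b`, in a's order."""
    return [stage for stage in a.algorithm if stage in b.algorithm]


print(shared_stages(INSTRUCT_GPT, TULU_3))  # ['SFT', 'PPO']
```

Writing recipes down this way makes the "plus or minus one thing" comparisons mechanical: diff two records and the twist is whatever is left over.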

Here's the bird's-eye view first. Each recipe is a sequence of stages running left to right; the color tells you what kind of stage each is:

Three observations from this layout. (1) Every recipe ends in on-policy RL (green); that's the irreducible step. Preference-only pipelines stall short; verifier-only pipelines without SFT collapse readability (R1-Zero). (2) The "DPO-then-on-policy-RL" shape (Tülu 3 with PPO, Qwen-2.5 with GRPO) is now standard for general assistants. (3) The R1 recipe inserts a rejection-sampled SFT (purple) between two RL stages; that's the practical trick that lets you mix verifiable reasoning with general helpfulness.

Recipe 1 · InstructGPT (OpenAI '22) — the reference RLHF

Data	~13k SFT demonstrations + ~33k RM-stage prompts (each with K = 4–9 sampled responses ranked by a labeler, yielding hundreds of thousands of pairwise comparisons) + ~31k PPO prompts. Human-labeler-written, expert-curated.
Algorithm	SFT → BT reward model (lesson 15) → PPO with KL-to-SFT.
Reward	Learned 6B reward model from human preferences.
System	Colocated PPO on internal infra; details largely unpublished.
Why famous	First public demonstration that RLHF on a 175B model yields large quality wins over SFT alone; established the three-stage template.

Everything you learned in lesson 15 (RLHF) is the InstructGPT recipe. The descendants below are all "InstructGPT plus or minus one thing".
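The three stages leave two small pieces of arithmetic worth seeing in code: how one ranking of K responses expands into pairwise comparisons, and how the Bradley-Terry loss and the KL-to-SFT penalty are computed. This is a minimal scalar sketch, not OpenAI's implementation; the function names are hypothetical and `beta=0.02` is an illustrative placeholder.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked):
    """Expand one ranking (best first) into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))


def bt_loss(score_w, score_l):
    """Bradley-Terry loss: -log sigmoid(score_w - score_l), computed stably."""
    margin = score_w - score_l
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """RM score minus a penalty for drifting away from the SFT policy."""
    return rm_score - beta * (logp_policy - logp_sft)


for k in (4, 9):
    # K ranked responses give K*(K-1)/2 comparisons: 6 for K=4, 36 for K=9.
    print(k, len(ranking_to_pairs(range(k))))
```

With 6 to 36 comparisons per prompt, ~33k ranked prompts comfortably reach the "hundreds of thousands" of pairs quoted in the data row.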

Recipe 2 · DeepSeek-R1-Zero ('25) — RL from scratch, no SFT

Data	Math + code prompts with verifiable answers. No reasoning demonstrations, no SFT.
Algorithm	GRPO (lesson 11) directly on the pretrained base model.
Reward	Pure verifier: r = 1 if answer matches gold, plus a small format reward for using `<think>…</think>` tags. No reward model.
System	DeepSeek's internal infrastructure; training framework not publicly named. Disaggregated rollout + trainer is the likely shape.
Why famous	RL alone (no SFT step) elicited and amplified long chain-of-thought, self-reflection, and back-tracking from a pretrained base model — using only a binary verifier reward. (Follow-up work, e.g. Liu et al. 2025's Understanding R1-Zero-Like Training, shows these patterns are already present in DeepSeek-V3-Base at epoch 0, so RL amplifies rather than purely "invents" them — but the gap between base and post-RL behavior is large enough that the original framing still stands.)
Caveat	R1-Zero's outputs are unstable, often switch languages mid-chain, and are hard for humans to read — which motivated R1 (next).
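The reward and advantage computation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: exact string match stands in for a real answer checker, the `format_bonus` value of 0.1 is hypothetical, and the advantage is the commonly cited group-normalized form (reward minus group mean, divided by group standard deviation), not code from DeepSeek.

```python
import re
from statistics import mean, pstdev

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def verifier_reward(response, gold, format_bonus=0.1):
    """1 if the final answer matches gold, plus a small bonus for think tags."""
    has_format = bool(THINK_RE.search(response))
    # Treat whatever remains after removing the think block as the answer.
    final_answer = THINK_RE.sub("", response).strip()
    correct = final_answer == gold.strip()
    return float(correct) + (format_bonus if has_format else 0.0)


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rollouts = ["<think>2+2 is 4</think>4", "<think>maybe 5</think>5", "4"]
rewards = [verifier_reward(r, "4") for r in rollouts]
print(rewards)
print(group_advantages(rewards))
```

Note that a group in which every rollout earns the same reward produces all-zero advantages, which is exactly the "many groups have zero advantage" failure in the taxonomy later in this lesson.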

Recipe 3 · DeepSeek-R1 ('25) — the practical R1-Zero

Data	Four stages: (1) cold-start SFT on a few thousand human-curated reasoning chains; (2) reasoning-oriented RL on verifiable problems; (3) rejection-sampled SFT — ~600k filtered reasoning rollouts mixed with ~200k non-reasoning SFT data (writing, factual QA, self-cognition) so the model is a general assistant, not just a math reasoner; (4) RL across all scenarios with mixed reward sources.
Algorithm	Stage 2: GRPO on verifiable problems. Stage 3: SFT on the mixed ~800k corpus. Stage 4: GRPO again on a mix of verifiable + general (RM-scored) data.
Reward	Verifier for math/code (lesson 18); RM for general; format/language reward to keep chains human-readable.
System	Same framework as R1-Zero; multi-stage data pipeline is the engineering contribution.
Why famous	Took R1-Zero from "weird genius" to "production reasoning model that matches o1 on AIME". Released the 671B model weights plus reasoning distillations at 1.5B–70B.
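Stage 4's mixed reward amounts to a router: verifiable prompts go to a verifier, everything else goes to a reward model, and a readability term is added on top. The sketch below is a toy under loud assumptions: `make_mixed_reward` and `ascii_word_fraction` are hypothetical names, the ASCII-word fraction is a crude stand-in for a language-consistency score, and `lang_weight=0.1` is illustrative.

```python
def ascii_word_fraction(text):
    """Crude stand-in for language consistency: share of ASCII-only words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w.isascii() for w in words) / len(words)


def make_mixed_reward(verifier, reward_model, language_score, lang_weight=0.1):
    """Build a reward function that routes by task kind."""
    def reward(prompt, response, kind, gold=None):
        if kind in ("math", "code"):
            base = verifier(response, gold)        # verifiable: check the answer
        else:
            base = reward_model(prompt, response)  # general: learned preference
        return base + lang_weight * language_score(response)
    return reward


# Stub scorers so the sketch runs end to end; real ones would be models.
def toy_verifier(response, gold):
    return 1.0 if response.strip() == gold else 0.0


def toy_reward_model(prompt, response):
    return 0.5


mixed = make_mixed_reward(toy_verifier, toy_reward_model, ascii_word_fraction)
print(mixed("What is 2+2?", "4", kind="math", gold="4"))
print(mixed("Write a haiku.", "a fine short poem", kind="writing"))
```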

The "rejection-sampled SFT" trick, in one line

Sample many rollouts from your current RL'd model on a fresh prompt set; keep only the ones with reward 1; SFT the model on those keepers. This is supervised distillation from your own model's successes — a powerful way to lock in RL gains, smooth the policy, and create training data for smaller distilled models all at once.
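The data-building half of that trick fits in one function. A minimal sketch, assuming a binary reward and a sampler you supply: `sample_fn`, `reward_fn`, `n_samples`, and `max_keep_per_prompt` are hypothetical names, and the toy policy below simply cycles through fixed answers so the example is deterministic.

```python
from itertools import cycle


def build_rejection_sampled_sft(prompts, sample_fn, reward_fn,
                                n_samples=8, max_keep_per_prompt=1):
    """Sample rollouts per prompt and keep only those with reward 1."""
    dataset = []
    for prompt in prompts:
        keepers = []
        for _ in range(n_samples):
            response = sample_fn(prompt)
            if reward_fn(prompt, response) == 1:
                keepers.append(response)
        # Drop exact duplicates while preserving sampling order.
        unique = list(dict.fromkeys(keepers))
        for response in unique[:max_keep_per_prompt]:
            dataset.append({"prompt": prompt, "completion": response})
    return dataset


# Toy stand-ins: a "policy" that cycles through answers and an exact-match verifier.
_answers = cycle(["3", "4", "5"])
GOLD = {"2+2": "4"}


def toy_policy(prompt):
    return next(_answers)


def toy_verifier(prompt, response):
    return 1 if response == GOLD[prompt] else 0


sft_data = build_rejection_sampled_sft(["2+2"], toy_policy, toy_verifier, n_samples=6)
print(sft_data)  # [{'prompt': '2+2', 'completion': '4'}]
```

The resulting prompt/completion pairs feed an ordinary SFT run, either on the same model or on a smaller one for distillation.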

Recipe 4 · Tülu 3 (AllenAI '24) — the open recipe

Data	~940k SFT mixture (instruction tuning) + DPO preference pairs (~273k at 8B, ~334k at 70B, ~366k at 405B) + RLVR data with verifiable-by-design tasks.
Algorithm	SFT → DPO (lesson 16) → RLVR with PPO on verifiable subsets. (Tülu 3.1 later added GRPO; original Tülu 3 was PPO-only for RLVR.)
Reward	Mixed: BT-style RM for general preferences; verifier for math, code, IFEval, etc.
System	Open framework (open-instruct → built atop TRL → moved to in-house tooling). Disaggregated rollout/trainer.
Why famous	First fully-open recipe with frontier-comparable scores at every stage. Their decomposition of "what each stage actually buys you" is the reference benchmark for ablations.
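For reference, the DPO stage optimizes a loss that depends only on the margin between the chosen and rejected log-probability ratios. The scalar sketch below uses the standard DPO objective; the function name is hypothetical and `beta=0.1` is an illustrative value, not Tülu 3's setting.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)]) for one pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # softplus(-margin) equals -log(sigmoid(margin)); branch for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# At initialization the policy equals the reference, so the margin is zero.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # log 2

# Both log-probs fall, but the rejected one falls faster: the loss still improves.
print(dpo_loss(-11.0, -15.0, -10.0, -12.0))
```

The second call shows why the taxonomy later in this lesson lists "chosen-prob and rejected-prob both drop" as a failure to monitor: the loss alone cannot distinguish that case from healthy training.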

Recipe 5 · Qwen-2.5 / Qwen-3 ('24–'25) — the iterated RL recipe

Data	Heavy SFT mix + DPO pairs + RL prompts. Synthetic data generation at scale.
Algorithm	SFT → offline DPO → online DPO → GRPO. Two distinct DPO stages (different data sources) followed by on-policy RL.
Reward	RM for general preferences + verifiers for math/code in the GRPO stage.
System	Alibaba's internal framework; published less detail than DeepSeek, but the model series is the working evidence.
Why famous	One of the cleanest public demonstrations that a hybrid pipeline (offline preference + online preference + verifier-based RL) outperforms any single stage. The "iterated DPO" pattern as a named technique is more associated with Meta's Self-Rewarding LMs / Snorkel's Iterative DPO; Qwen-2.5 popularized the offline-then-online DPO progression in particular.

Recipe 6 · OpenAI o-series ('24–'25) — sketch only

OpenAI hasn't published the recipe, but the public information plus the model behavior strongly suggests:

Reasoning chains generated at scale with verifiers — math, code, possibly PRMs.
RL with a heavily filtered, deliberately long-form reasoning signal.
Test-time scaling via parallel/sequential sampling and possibly tree-style search (lesson 17), made cheap by aggressive prefix caching (lesson 21).
The user-visible "thinking time" is a knob that trades latency for test-time-search budget.

If you wanted to reproduce the public-facing behavior using only this lesson series, you'd start with R1's recipe and add a test-time-search wrapper around the trained model. As a rough estimate, that gets you ~80% of the way there given the mechanisms that are publicly understood.
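The simplest test-time-search wrapper is parallel sampling with a majority vote over final answers (often called self-consistency). The sketch below is one such wrapper, chosen for brevity; it is not a claim about what OpenAI runs. `sample_fn` and `n` are hypothetical names, with `n` playing the role of the "thinking time" budget.

```python
from collections import Counter
from itertools import cycle


def majority_vote(sample_fn, prompt, n):
    """Sample n answers and return the most common one with its vote share."""
    answers = [sample_fn(prompt) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n


# Deterministic toy sampler standing in for a trained reasoning model.
_toy_answers = cycle(["42", "41", "42", "40"])


def toy_model(prompt):
    return next(_toy_answers)


print(majority_vote(toy_model, "What is 6 * 7?", n=4))  # ('42', 0.5)
```

Raising `n` spends more compute per query in exchange for a more reliable vote, which is the latency-for-budget trade described above.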

Trace a paper onto the lesson map

Take any post-training paper (the abstract is enough) and ask: "What is this paper, in our terms?" Four questions pin it to the lessons that describe what it is doing:

Q1. What is the reward source?

Q2. Is training on-policy?

Q3. Does it have a critic?

Q4. Is it multi-turn?

The failure-mode taxonomy

Every RL run that goes wrong goes wrong in roughly one of the ways below. The dashboard signal that catches each one is in column 3 — and the lesson that introduced the mechanism is in column 4. Pin this table near your monitor.

Symptom	Likely cause	Dashboard tell	Lesson
Train reward goes up, held-out drops	Reward hacking / verifier leak	Held-out reward, frac_format_only correct	18
Reward is flat, loss looks fine	Stale weight sync (rollout policy not updating)	frac_clipped > 30%, |log ρ| growing	06
Outputs become very short or very long	Length bias: GRPO's per-rollout 1/|y| divisor (the Dr.GRPO finding); also DAPO's overlong-cutoff false negatives	mean response length over time	14, 13
Outputs become deterministic, exploration dies	Entropy collapse from symmetric PPO clip	policy entropy, frac_unique tokens	13
KL spikes mid-training	Reference drift; β too low; rollout numerical mismatch	KL(π‖π_ref), per-step KL distribution	03, 20
Many groups have zero advantage	Every rollout in the group gets the same reward: all succeed (prompt too easy) or all fail (too hard)	frac_groups_with_signal	13
Loss is NaN at the first step	fp16 overflow; ratio explosion from stale snapshot	grad norm distribution, first-step ratio histogram	10, 19
DPO chosen-prob and rejected-prob both drop	BT overconfidence on hard pairs	log π(y_w), log π(y_l) over time	16
Rollout much slower than trainer	Long prompts without chunked prefill; KV thrashing	rollout tok/s vs train tok/s	20
OOM mid-training at long sequence	Activation memory at long L; no checkpointing	peak memory per step	21
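Several of the dashboard tells in the table are cheap to compute from data the trainer already has. The definitions below are plausible readings of the metric names, not a standard: in particular, `frac_clipped` here counts tokens whose importance ratio falls outside [1 - ε, 1 + ε] regardless of advantage sign, which is a simplification, and `clip_eps=0.2` is an illustrative default.

```python
import math


def frac_groups_with_signal(group_rewards):
    """Share of rollout groups whose rewards are not all identical."""
    with_signal = sum(1 for g in group_rewards if max(g) != min(g))
    return with_signal / len(group_rewards)


def frac_clipped(log_ratios, clip_eps=0.2):
    """Share of tokens whose ratio exp(log rho) leaves [1 - eps, 1 + eps]."""
    outside = sum(
        1 for lr in log_ratios
        if not (1 - clip_eps <= math.exp(lr) <= 1 + clip_eps)
    )
    return outside / len(log_ratios)


def mean_token_entropy(token_dists):
    """Average Shannon entropy (nats) over per-token probability vectors."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists
    ]
    return sum(entropies) / len(entropies)


def mean_response_length(token_counts):
    """Average number of tokens per response in the batch."""
    return sum(token_counts) / len(token_counts)


print(frac_groups_with_signal([[1, 1], [0, 1], [0, 0], [1, 0]]))  # 0.5
print(frac_clipped([0.0, math.log(1.5), math.log(0.5), 0.05]))    # 0.5
```

Logging these four numbers every step covers the stale-sync, length-bias, entropy-collapse, and zero-advantage rows of the table.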

The single most diagnostic plot

Held-out task accuracy vs. wall-clock. Not train reward, not loss, not KL. Held-out accuracy is the only number that's safe from every form of reward hacking. If it isn't going up, something is wrong, and the rest of the dashboards are diagnostics for which thing is wrong.
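A crude automated version of that plot is a trend check: flag the run when train reward is rising while held-out accuracy is falling over the same recent window. This is a sketch under assumptions: evaluation points are taken to be equally spaced in time, `window=5` is arbitrary, and the function names are hypothetical.

```python
def slope(values):
    """Least-squares slope of values against equally spaced steps 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_reward_hacking(train_reward, heldout_acc, window=5):
    """True if train reward trends up while held-out accuracy trends down."""
    if window < 2 or min(len(train_reward), len(heldout_acc)) < window:
        return False  # not enough points to estimate a trend
    return slope(train_reward[-window:]) > 0 and slope(heldout_acc[-window:]) < 0


train = [0.20, 0.30, 0.40, 0.50, 0.60]
heldout = [0.50, 0.48, 0.45, 0.40, 0.38]
print(looks_like_reward_hacking(train, heldout))  # True
```

A check like this is only a tripwire; once it fires, the per-symptom dashboards in the table are what tell you which mechanism is responsible.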

You now know enough to read any 2025 RL paper

Every paper introducing a new post-training recipe is some recombination of:

An algorithm from lessons 09–14 (or DPO, lesson 16).
A reward source from lesson 15 (RM), 17 (PRM/verifier), or 18 (env-specific).
A system pattern from lesson 19 and an engine choice from lessons 20–21.
A multi-stage data flow that does SFT → preference → RL → distillation in some order.
One or two genuinely new ideas — usually the abstract's "we propose" sentence.

The first four are the substrate; the fifth is the actual contribution. When you read a new paper, factor it into the first four pieces and then read the fifth carefully — that's where the new physics is.

The end of Part III

You've now covered the four pillars of modern RL post-training: the math of the loss, the framework around it, the algorithms that improve it, and the production system that runs it. The reading list ahead is still long, but every paper you read will refer to mechanisms you can now name from first principles.