Famous recipes & failure modes — putting it all together

Six post-training recipes that defined the modern reasoning era — DeepSeek-R1-Zero, R1, Tülu 3, Qwen-2.5, InstructGPT, and the o-series sketch — described in the language of the previous twenty-two lessons. Then a taxonomy of the ways an RL run fails and the dashboards that catch each one.

Where this lesson sits
This is the closing lesson of Part III. Part I gave you the framework, Part II gave you the algorithms, and Part III gave you the reward and systems context to read modern papers. This lesson is the demonstration: every famous post-training recipe of the last three years factors into pieces you've already met. After this, Part IV (lessons 24–25) covers the engineer's perspective; Part V (lessons 26–28) is the synthesis layer.

How to read this lesson

Each recipe below is described as a tuple: (data, algorithm, reward source, system pattern, scale). The point is not to memorise them — the point is to see that every modern frontier post-training stack is some combination of pieces you've already met in lessons 01–22, with maybe one twist that justified the paper. By the end, you should be able to look at any new paper and chart it onto this map.
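One way to make the tuple concrete is a small record type. The sketch below is illustrative only: the class name `Recipe`, its field names, and the `stages` helper are assumptions made for this example, and the field values are copied from the InstructGPT description later in this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical record mirroring (data, algorithm, reward source, system pattern, scale)."""
    name: str
    data: str
    algorithm: str       # stages joined by " → "
    reward_source: str
    system_pattern: str
    scale: str

    def stages(self) -> list[str]:
        """Split the algorithm string into its ordered stages."""
        return [s.strip() for s in self.algorithm.split("→")]


instruct_gpt = Recipe(
    name="InstructGPT",
    data="~13k SFT demos, ~33k RM prompts, ~31k PPO prompts",
    algorithm="SFT → BT reward model → PPO with KL-to-SFT",
    reward_source="learned 6B reward model",
    system_pattern="colocated PPO",
    scale="175B",
)
print(instruct_gpt.stages())  # ['SFT', 'BT reward model', 'PPO with KL-to-SFT']
```

Filling in one such record per paper is a quick way to see which field carries the twist.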

Here's the bird's-eye view first. Each recipe is a sequence of stages running left to right; the color tells you what kind of stage each is:

Figure: recipe stages, left to right (stage 1 → deliverable). Stage types in the legend: SFT; preference (DPO / RM); on-policy RL (PPO / GRPO); rejection-sampled SFT; verifier.

InstructGPT '22: SFT (13k demos) → BT reward model → PPO (KL to π_SFT)
R1-Zero '25: (no SFT) → GRPO from pretrained base → verifier reward
R1 '25: cold SFT → GRPO (verif) → rs-SFT ~800k mixed → GRPO (mixed)
Tülu 3 '24: SFT (~940k mix) → DPO (~273k pairs, 8B) → PPO on RLVR data
Qwen-2.5 '24: SFT → offline DPO → online DPO → GRPO

Three observations from this layout. (1) Every recipe ends in on-policy RL (green): that's the irreducible step. Preference-only pipelines stall short; verifier-only pipelines without SFT collapse readability (R1-Zero). (2) The preference-then-RL shape, DPO followed by on-policy RL (PPO in Tülu 3, GRPO in Qwen-2.5), is now standard for general assistants. (3) The R1 recipe inserts a rejection-sampled SFT (purple) between two RL stages: that's the practical trick that lets you mix verifiable reasoning with general helpfulness.

Recipe 1 · InstructGPT (OpenAI '22) — the reference RLHF

Data: ~13k SFT demonstrations + ~33k RM-stage prompts (each ranked into K=4–9 responses, yielding hundreds of thousands of pairwise comparisons) + ~31k PPO prompts. Human-labeler-written, expert-curated.
Algorithm: SFT → BT reward model (lesson 15) → PPO with KL-to-SFT.
Reward: Learned 6B reward model from human preferences.
System: Colocated PPO on internal infra; details largely unpublished.
Why famous: First public demonstration that RLHF on a 175B model yields large quality wins over SFT alone; established the three-stage template.
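The reward-model stage and the KL-to-SFT penalty can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: the function names, the β value, and the toy inputs are assumptions. It shows how a K-way ranking expands into K(K−1)/2 pairwise comparisons, the Bradley-Terry loss on one pair, and a per-sample reward with a KL penalty against the SFT policy.

```python
import math
from itertools import combinations


def pairs_from_ranking(ranked):
    """Expand a best-to-worst ranking into (chosen, rejected) pairs: K*(K-1)/2 of them."""
    return list(combinations(ranked, 2))


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """RM score minus beta * (log pi - log pi_SFT), a single-sample KL penalty estimate.

    beta=0.02 is a placeholder value chosen for this sketch.
    """
    return rm_score - beta * (logp_policy - logp_sft)


print(len(pairs_from_ranking(["a", "b", "c", "d"])))  # K=4 -> 6
print(len(pairs_from_ranking(list(range(9)))))         # K=9 -> 36
print(round(bt_loss(1.0, 1.0), 4))                     # equal scores -> 0.6931 (log 2)
```

With 6 to 36 pairs per ranked prompt, a few tens of thousands of prompts are enough to reach the comparison counts quoted above.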

Everything you learned in lesson 15 (RLHF) is the InstructGPT recipe. The descendants below are all "InstructGPT plus or minus one thing".

Recipe 2 · DeepSeek-R1-Zero ('25) — RL from scratch, no SFT

Data: Math + code prompts with verifiable answers. No reasoning demonstrations, no SFT.
Algorithm: GRPO (lesson 11) directly on the pretrained base model.
Reward: Pure verifier: r = 1 if answer matches gold, plus a small format reward for using <think>…</think> tags. No reward model.
System: DeepSeek's internal infrastructure; training framework not publicly named. Disaggregated rollout + trainer is the likely shape.
Why famous: RL alone (no SFT step) elicited and amplified long chain-of-thought, self-reflection, and back-tracking from a pretrained base model, using only a binary verifier reward. (Follow-up work, e.g. Liu et al. 2025's Understanding R1-Zero-Like Training, shows these patterns are already present in DeepSeek-V3-Base at epoch 0, so RL amplifies rather than purely "invents" them; the gap between base and post-RL behavior is nonetheless large enough that the original framing still stands.)
Caveat: R1-Zero's outputs are unstable, often switch languages mid-chain, and are hard for humans to read, which motivated R1 (next).
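The reward and the group-relative advantage can be sketched as follows. This is a toy illustration under stated assumptions, not DeepSeek's code: the function names, the 0.1 format bonus, the exact-string answer match, and the rule that a completion without tags scores zero are all choices made for this example.

```python
import re
from statistics import mean, pstdev

# Matches a completion that opens with a <think> block; group 1 is the text after it.
THINK_RE = re.compile(r"^<think>.*?</think>\s*(.*)$", re.DOTALL)


def verifier_reward(completion, gold, format_bonus=0.1):
    """1.0 if the text after </think> equals the gold answer, plus a small bonus for well-formed tags."""
    m = THINK_RE.match(completion.strip())
    if m is None:
        return 0.0  # assumption: without the tags no answer is extracted
    answer = m.group(1).strip()
    return format_bonus + (1.0 if answer == gold.strip() else 0.0)


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized against its own group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = ["<think>2+2</think>4", "<think>guess</think>5", "4"]
rewards = [verifier_reward(c, "4") for c in group]
advantages = group_advantages(rewards)
```

Note that a group whose rewards are all identical gets all-zero advantages and therefore contributes no gradient.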

Recipe 3 · DeepSeek-R1 ('25) — the practical R1-Zero

Data: Four stages: (1) cold-start SFT on a few thousand human-curated reasoning chains; (2) reasoning-oriented RL on verifiable problems; (3) rejection-sampled SFT: ~600k filtered reasoning rollouts mixed with ~200k non-reasoning SFT data (writing, factual QA, self-cognition) so the model is a general assistant, not just a math reasoner; (4) RL across all scenarios with mixed reward sources.
Algorithm: Stage 2: GRPO on verifiable problems. Stage 3: SFT on the mixed ~800k corpus. Stage 4: GRPO again on a mix of verifiable + general (RM-scored) data.
Reward: Verifier for math/code (lesson 18); RM for general; format/language reward to keep chains human-readable.
System: Same framework as R1-Zero; multi-stage data pipeline is the engineering contribution.
Why famous: Took R1-Zero from "weird genius" to "production reasoning model that matches o1 on AIME". Released the 671B model weights plus distilled reasoning models from 1.5B to 70B.
The "rejection-sampled SFT" trick, in one line
Sample many rollouts from your current RL'd model on a fresh prompt set; keep only the ones with reward 1; SFT the model on those keepers. This is supervised distillation from your own model's successes — a powerful way to lock in RL gains, smooth the policy, and create training data for smaller distilled models all at once.
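The filtering step can be sketched as below. This is a minimal sketch of the sample, filter, and collect loop only; `sample_fn` and `reward_fn` are hypothetical stand-ins for a real policy and verifier, the toy arithmetic task is invented for the example, and the per-prompt cap `max_keep` is an assumption rather than a documented setting.

```python
import random


def rejection_sample_sft(prompts, sample_fn, reward_fn, n_samples=8, max_keep=1):
    """Keep up to max_keep reward-1 rollouts per prompt as (prompt, completion) SFT pairs."""
    dataset = []
    for prompt in prompts:
        rollouts = [sample_fn(prompt) for _ in range(n_samples)]
        keepers = [y for y in rollouts if reward_fn(prompt, y) == 1.0]
        dataset.extend((prompt, y) for y in keepers[:max_keep])
    return dataset


# Toy stand-ins (hypothetical): a "policy" that adds two integers with occasional off-by-one errors.
rng = random.Random(0)


def toy_policy(prompt):
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1, -1]))


def toy_verifier(prompt, completion):
    a, b = map(int, prompt.split("+"))
    return 1.0 if int(completion) == a + b else 0.0


sft_data = rejection_sample_sft(["1+2", "10+5"], toy_policy, toy_verifier)
```

Prompts where no rollout passes simply drop out of the dataset, so the resulting SFT corpus is biased toward problems the current policy can already solve at least sometimes.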

Recipe 4 · Tülu 3 (AllenAI '24) — the open recipe

Data: ~940k SFT mixture (instruction tuning) + DPO preference pairs (~273k at 8B, ~334k at 70B, ~366k at 405B) + RLVR data with verifiable-by-design tasks.
Algorithm: SFT → DPO (lesson 16) → RLVR with PPO on verifiable subsets. (Tülu 3.1 later added GRPO; original Tülu 3 was PPO-only for RLVR.)
Reward: Mixed: BT-style RM for general preferences; verifier for math, code, IFEval, etc.
System: Open framework (open-instruct → built atop TRL → moved to in-house tooling). Disaggregated rollout/trainer.
Why famous: First fully-open recipe with frontier-comparable scores at every stage. Their decomposition of "what each stage actually buys you" is the reference benchmark for ablations.

Recipe 5 · Qwen-2.5 / Qwen-3 ('24–'25) — the iterated RL recipe

Data: Heavy SFT mix + DPO pairs + RL prompts. Synthetic data generation at scale.
Algorithm: SFT → offline DPO → online DPO → GRPO. Two distinct DPO stages (different data sources) followed by on-policy RL.
Reward: RM for general preferences + verifiers for math/code in the GRPO stage.
System: Alibaba's internal framework; published less detail than DeepSeek, but the model series is the working evidence.
Why famous: One of the cleanest public demonstrations that a hybrid pipeline (offline preference + online preference + verifier-based RL) outperforms any single stage. The "iterated DPO" pattern as a named technique is more associated with Meta's Self-Rewarding LMs / Snorkel's Iterative DPO; Qwen-2.5 popularized the offline-then-online DPO progression in particular.
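Offline and online DPO share one loss; what changes is where the (chosen, rejected) pairs come from. The sketch below writes that loss for a single pair in plain Python. The function name, the β default, and the toy log-probabilities are assumptions for illustration, not values from the Qwen reports.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))


# Policy equal to the reference: zero margin, so the loss is log 2.
at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Both log-probs fall, but the rejected one falls faster, so the loss still improves.
both_drop = dpo_loss(-12.0, -20.0, -10.0, -10.0)
print(both_drop < at_init)  # True
```

The second case is why the failure-mode table later in this lesson tracks log π(y_w) and log π(y_l) separately: the loss alone cannot distinguish "chosen went up" from "both went down".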

Recipe 6 · OpenAI o-series ('24–'25) — sketch only

OpenAI hasn't published the recipe, but the public information plus the model behavior strongly suggests:

If you wanted to reproduce the public-facing behavior using only this lesson series, you'd start with R1's recipe and add a test-time-search wrapper around the trained model. As a rough estimate based on currently understood mechanisms, that gets you about 80% of the way there.
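The lesson does not pin down what the test-time-search wrapper looks like, so the sketch below assumes the simplest form: sample several completions and take a majority vote over the extracted answers. This is an illustrative assumption, not a description of what OpenAI does; `sample_fn` and `extract_answer` are hypothetical callables, and the canned completions stand in for a real model.

```python
from collections import Counter
from itertools import cycle


def majority_vote(prompt, sample_fn, extract_answer, n=16):
    """Sample n completions and return the most common extracted answer with its vote share."""
    answers = [extract_answer(sample_fn(prompt)) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n


# Toy stand-in for a sampler: cycles through three canned completions.
canned = cycle(["<think>a</think>42", "<think>b</think>42", "<think>c</think>41"])
answer, share = majority_vote(
    "q",
    sample_fn=lambda p: next(canned),
    extract_answer=lambda text: text.split("</think>")[-1].strip(),
    n=6,
)
print(answer, round(share, 2))  # 42 0.67
```

Swapping the vote for a verifier or reward-model score turns the same wrapper into best-of-N selection.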

The failure-mode taxonomy

Every RL run that goes wrong goes wrong in roughly one of the ways below. The dashboard signal that catches each one is in column 3 — and the lesson that introduced the mechanism is in column 4. Pin this table near your monitor.

Symptom | Likely cause | Dashboard tell | Lesson
Train reward goes up, held-out drops | Reward hacking / verifier leak | Held-out reward, frac_format_only correct | 18
Reward is flat, loss looks fine | Stale weight sync (rollout policy not updating) | frac_clipped > 30%, |log ρ| growing | 06
Outputs become very short or very long | Length bias: GRPO's per-rollout 1/|y| divisor (the Dr.GRPO finding); also DAPO's overlong-cutoff false negatives | mean response length over time | 14, 13
Outputs become deterministic, exploration dies | Entropy collapse from symmetric PPO clip | policy entropy, frac_unique tokens | 13
KL spikes mid-training | Reference drift; β too low; rollout numerical mismatch | KL(π‖π_ref), per-step KL distribution | 03, 20
Many groups have zero advantage | Saturated success (too easy) or zero (too hard) | frac_groups_with_signal | 13
Loss is NaN at the first step | fp16 overflow; ratio explosion from stale snapshot | grad norm distribution, first-step ratio histogram | 10, 19
DPO chosen-prob and rejected-prob both drop | BT overconfidence on hard pairs | log π(y_w), log π(y_l) over time | 16
Rollout much slower than trainer | Long prompts without chunked prefill; KV thrashing | rollout tok/s vs train tok/s | 20
OOM mid-training at long sequence | Activation memory at long L; no checkpointing | peak memory per step | 21
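Several of the dashboard tells in the table are one-liners over data the trainer already has. The sketch below computes three of them in plain Python. The exact definitions are assumptions made for this example; in particular, `frac_clipped` here counts tokens whose ratio leaves the clip range regardless of advantage sign, which is a simplification of what PPO actually clips.

```python
import math


def frac_groups_with_signal(group_rewards):
    """Share of prompt groups whose rewards are not all identical (so advantages are nonzero)."""
    return sum(len(set(g)) > 1 for g in group_rewards) / len(group_rewards)


def frac_clipped(log_ratios, clip_eps=0.2):
    """Share of tokens whose ratio rho = exp(log_ratio) falls outside [1 - eps, 1 + eps]."""
    outside = [not (1 - clip_eps <= math.exp(lr) <= 1 + clip_eps) for lr in log_ratios]
    return sum(outside) / len(outside)


def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Two saturated groups (all success, all failure) and two groups with mixed outcomes.
groups = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0]]
print(frac_groups_with_signal(groups))             # 0.5
print(frac_clipped([0.0, 0.05, 0.5, -0.5]))        # 0.5
print(round(token_entropy([0.25] * 4), 4))         # 1.3863 (log 4)
```

Logging these per step, rather than per epoch, is what makes the trends in the table visible early.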
The single most diagnostic plot
Held-out task accuracy vs. wall-clock. Not train reward, not loss, not KL. Held-out accuracy is the only number that's safe from every form of reward hacking. If it isn't going up, something is wrong, and the rest of the dashboards are diagnostics for which thing is wrong.
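A crude automated version of this check compares the trend of train reward against the trend of held-out accuracy. The sketch below is an assumption-laden illustration: the function name, the window size, and the endpoint comparison are invented for this example, and a real dashboard would smooth both series first.

```python
def reward_hacking_alert(train_reward, heldout_acc, window=3):
    """True if train reward rose while held-out accuracy fell over the last `window` evals."""
    if len(train_reward) < window or len(heldout_acc) < window:
        return False  # not enough history to compare
    train_up = train_reward[-1] > train_reward[-window]
    heldout_down = heldout_acc[-1] < heldout_acc[-window]
    return train_up and heldout_down


print(reward_hacking_alert([0.2, 0.4, 0.6], [0.50, 0.48, 0.41]))  # True
print(reward_hacking_alert([0.2, 0.4, 0.6], [0.40, 0.45, 0.50]))  # False
```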

You now know enough to read any 2025 RL paper

Every paper introducing a new post-training recipe is some recombination of:

The first four are the substrate; the fifth is the actual contribution. When you read a new paper, factor it into the first four pieces and then read the fifth carefully — that's where the new physics is.

The end of Part III
You've now covered the four pillars of modern RL post-training: the math of the loss, the framework around it, the algorithms that improve it, and the production system that runs it. The reading list ahead is still long, but every paper you read will refer to mechanisms you can now name from first principles.