Famous recipes & failure modes — putting it all together
Six post-training recipes that defined the modern reasoning era — DeepSeek-R1-Zero, R1, Tülu 3, Qwen-2.5, InstructGPT, and the o-series sketch — described in the language of the previous twenty-two lessons. Then a taxonomy of the ways an RL run fails and the dashboards that catch each one.
How to read this lesson
Each recipe below is described as a tuple: (data, algorithm, reward source, system pattern, scale). The point is not to memorise them — the point is to see that every modern frontier post-training stack is some combination of pieces you've already met in lessons 01–22, with maybe one twist that justified the paper. By the end, you should be able to look at any new paper and chart it onto this map.
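To make the tuple concrete, here is a minimal sketch of one way to hold a recipe in code. The `Recipe` class, its field names, and the two helper functions are illustrative assumptions, not anything the papers define; the field values are paraphrased from the tables later in this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One post-training recipe as the five-part tuple used in this lesson."""
    name: str
    data: str
    algorithm: tuple[str, ...]  # ordered stages, left to right
    reward: str
    system: str
    scale: str


# Values paraphrased from the recipe tables in this lesson.
INSTRUCTGPT = Recipe(
    name="InstructGPT",
    data="human-written demonstrations + ranked comparisons + PPO prompts",
    algorithm=("SFT", "BT reward model", "PPO with KL-to-SFT"),
    reward="learned reward model from human preferences",
    system="colocated PPO",
    scale="175B policy, 6B reward model",
)

R1_ZERO = Recipe(
    name="DeepSeek-R1-Zero",
    data="math + code prompts with verifiable answers",
    algorithm=("GRPO",),
    reward="verifier + small format reward",
    system="likely disaggregated rollout + trainer",
    scale="DeepSeek-V3-Base",
)


def shared_stages(a: Recipe, b: Recipe) -> set[str]:
    """Stage names that appear in both recipes, ignoring order."""
    return set(a.algorithm) & set(b.algorithm)


def ends_in_on_policy_rl(recipe: Recipe, rl_names=("PPO", "GRPO")) -> bool:
    """True if the last stage mentions one of the on-policy RL algorithms."""
    return any(name in recipe.algorithm[-1] for name in rl_names)
```

Writing a new paper down in this shape is a quick way to see which field actually changed relative to a recipe you already know.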
Here's the bird's-eye view first. Each recipe is a sequence of stages running left to right; the color tells you what kind of stage each is:
Three observations from this layout. (1) Every recipe ends in on-policy RL (green); that's the irreducible step. Preference-only pipelines stall short; verifier-only pipelines without SFT collapse readability (R1-Zero). (2) The "DPO-then-RL" sandwich shape (Tülu 3 with PPO, Qwen-2.5 with GRPO) is now standard for general assistants. (3) The R1 recipe inserts a rejection-sampled SFT (purple) between two RL stages; that's the practical trick that lets you mix verifiable reasoning with general helpfulness.
Recipe 1 · InstructGPT (OpenAI '22) — the reference RLHF
| Data | ~13k SFT demonstrations + ~33k RM-stage prompts (each with K=4–9 sampled responses ranked by labelers, yielding hundreds of thousands of pairwise comparisons) + ~31k PPO prompts. Human-labeler-written, expert-curated. |
| Algorithm | SFT → BT reward model (lesson 15) → PPO with KL-to-SFT. |
| Reward | Learned 6B reward model from human preferences. |
| System | Colocated PPO on internal infra; details largely unpublished. |
| Why famous | First public demonstration that RLHF on a 175B model yields large quality wins over SFT alone; established the three-stage template. |
Everything you've learned in lesson 15 (RLHF) is the InstructGPT recipe. The descendants below are all "InstructGPT plus or minus one thing".
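The step from "K ranked responses" to "hundreds of thousands of pairwise comparisons" is plain combinatorics: a full ranking of K responses implies K·(K−1)/2 ordered pairs, so K=4 gives 6 pairs and K=9 gives 36. The sketch below shows that count together with the Bradley-Terry pair loss, −log σ(r_chosen − r_rejected), that a reward model is trained on. The function names are illustrative, and the scalar inputs stand in for reward-model outputs.

```python
import math


def pairs_per_prompt(k: int) -> int:
    """Number of pairwise comparisons implied by a full ranking of k responses."""
    return math.comb(k, 2)


def bt_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one pair: -log sigmoid(r_chosen - r_rejected).

    Computed as softplus(-diff) in a numerically stable form.
    """
    diff = r_chosen - r_rejected
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)


if __name__ == "__main__":
    for k in (4, 9):
        print(f"K={k}: {pairs_per_prompt(k)} pairs per prompt")
    # The loss shrinks as the reward model separates chosen from rejected.
    print(bt_pair_loss(2.0, 0.0), bt_pair_loss(0.0, 0.0), bt_pair_loss(0.0, 2.0))
```

With roughly 33k prompts and 6 to 36 pairs each, the total lands in the hundreds of thousands even if many prompts contribute fewer than the maximum.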
Recipe 2 · DeepSeek-R1-Zero ('25) — RL from scratch, no SFT
| Data | Math + code prompts with verifiable answers. No reasoning demonstrations, no SFT. |
| Algorithm | GRPO (lesson 11) directly on the pretrained base model. |
| Reward | Pure verifier: r = 1 if answer matches gold, plus a small format reward for using <think>…</think> tags. No reward model. |
| System | DeepSeek's internal infrastructure; training framework not publicly named. Disaggregated rollout + trainer is the likely shape. |
| Why famous | RL alone (no SFT step) elicited and amplified long chain-of-thought, self-reflection, and back-tracking from a pretrained base model — using only a binary verifier reward. (Follow-up work, e.g. Liu et al. 2025's Understanding R1-Zero-Like Training, shows these patterns are already present in DeepSeek-V3-Base at epoch 0, so RL amplifies rather than purely "invents" them — but the gap between base and post-RL behavior is large enough that the original framing still stands.) |
| Caveat | R1-Zero's outputs are unstable, often switch languages mid-chain, and are hard for humans to read — which motivated R1 (next). |
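
As a rough illustration of the reward and algorithm rows above, the sketch below pairs a rule-based reward (1 for a matching answer plus a small format bonus) with GRPO's group-normalized advantage. The bonus value of 0.1, the exact-string answer match, and the convention that the final answer follows `</think>` are all assumptions made for the example; the source only says the format reward is "small".

```python
import re
import statistics

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
FORMAT_BONUS = 0.1  # assumed value, chosen only for illustration


def verifier_reward(response: str, gold: str) -> float:
    """1.0 if the final answer matches gold, plus a bonus for <think> tags."""
    has_format = THINK_RE.search(response) is not None
    # Assumption: the final answer is whatever follows the last </think>.
    final_answer = response.rsplit("</think>", 1)[-1].strip()
    correct = final_answer == gold.strip()
    return float(correct) + (FORMAT_BONUS if has_format else 0.0)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: reward standardized within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "<think>2 + 2 is 4</think> 4",
        "<think>maybe 5</think> 5",
        "4",
        "<think>guess</think> 3",
    ]
    rewards = [verifier_reward(resp, "4") for resp in group]
    print(rewards)
    print(group_advantages(rewards))
```

Note that a group in which every rollout earns the same reward gets all-zero advantages, so it contributes no gradient signal.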
Recipe 3 · DeepSeek-R1 ('25) — the practical R1-Zero
| Data | Four stages: (1) cold-start SFT on a few thousand human-curated reasoning chains; (2) reasoning-oriented RL on verifiable problems; (3) rejection-sampled SFT — ~600k filtered reasoning rollouts mixed with ~200k non-reasoning SFT data (writing, factual QA, self-cognition) so the model is a general assistant, not just a math reasoner; (4) RL across all scenarios with mixed reward sources. |
| Algorithm | Stage 2: GRPO on verifiable problems. Stage 3: SFT on the mixed ~800k corpus. Stage 4: GRPO again on a mix of verifiable + general (RM-scored) data. |
| Reward | Verifier for math/code (lesson 18); RM for general; format/language reward to keep chains human-readable. |
| System | Same framework as R1-Zero; multi-stage data pipeline is the engineering contribution. |
| Why famous | Took R1-Zero from "weird genius" to "production reasoning model that matches o1 on AIME". Released the 671B base + reasoning distillations to 1.5B–70B. |
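
Stage 3 is the least familiar step, so here is a minimal sketch of rejection-sampled SFT data construction: keep only rollouts that pass a check, then mix them with general-purpose examples. The function names, the `is_correct` predicate, and the per-prompt cap are hypothetical; the actual R1 filtering criteria are not reproduced here.

```python
from typing import Callable, Iterable


def rejection_sample_sft(
    rollouts: Iterable[tuple[str, str, str]],  # (prompt, response, gold)
    is_correct: Callable[[str, str], bool],
    max_per_prompt: int = 1,
) -> list[dict[str, str]]:
    """Keep up to max_per_prompt rollouts per prompt that pass is_correct."""
    kept: list[dict[str, str]] = []
    counts: dict[str, int] = {}
    for prompt, response, gold in rollouts:
        if counts.get(prompt, 0) >= max_per_prompt:
            continue  # this prompt already has enough accepted samples
        if is_correct(response, gold):
            kept.append({"prompt": prompt, "response": response})
            counts[prompt] = counts.get(prompt, 0) + 1
    return kept


def mix_sft(
    reasoning: list[dict[str, str]], general: list[dict[str, str]]
) -> list[dict[str, str]]:
    """Tag each example with its source and concatenate the two pools."""
    tagged = [dict(ex, source="reasoning") for ex in reasoning]
    tagged += [dict(ex, source="general") for ex in general]
    return tagged


if __name__ == "__main__":
    rollouts = [("2+2?", "5", "4"), ("2+2?", "4", "4"), ("3+3?", "7", "6")]
    kept = rejection_sample_sft(rollouts, lambda resp, gold: resp == gold)
    print(mix_sft(kept, [{"prompt": "Say hi", "response": "Hello!"}]))
```

In a real pipeline the mixed list would be shuffled before training; that step is left out to keep the sketch deterministic.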
Recipe 4 · Tülu 3 (AllenAI '24) — the open recipe
| Data | ~940k SFT mixture (instruction tuning) + DPO preference pairs (~273k at 8B, ~334k at 70B, ~366k at 405B) + RLVR data with verifiable-by-design tasks. |
| Algorithm | SFT → DPO (lesson 16) → RLVR with PPO on verifiable subsets. (Tülu 3.1 later added GRPO; original Tülu 3 was PPO-only for RLVR.) |
| Reward | Mixed: BT-style RM for general preferences; verifier for math, code, IFEval, etc. |
| System | Open framework (open-instruct → built atop TRL → moved to in-house tooling). Disaggregated rollout/trainer. |
| Why famous | First fully-open recipe with frontier-comparable scores at every stage. Their decomposition of "what each stage actually buys you" is the reference benchmark for ablations. |
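
For reference, the DPO stage in the middle of this pipeline optimizes the standard DPO objective, −log σ(β·[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). The sketch below evaluates it for a single pair from scalar sequence log-probabilities; the function name and the default β=0.1 are illustrative assumptions, not Tülu 3's settings.

```python
import math


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit reward margin between chosen (w) and rejected (l).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written as a numerically stable softplus(-margin).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


if __name__ == "__main__":
    # Policy equal to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Both log-probs fall by 5, but the margin (and the loss) is unchanged.
    print(dpo_loss(-15.0, -17.0, -10.0, -12.0))
```

Because the loss depends only on the margin, it can keep improving while the chosen and rejected log-probabilities both fall, which is why the failure-mode table later in this lesson tracks them separately.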
Recipe 5 · Qwen-2.5 / Qwen-3 ('24–'25) — the iterated RL recipe
| Data | Heavy SFT mix + DPO pairs + RL prompts. Synthetic data generation at scale. |
| Algorithm | SFT → offline DPO → online DPO → GRPO. Two distinct DPO stages (different data sources) followed by on-policy RL. |
| Reward | RM for general preferences + verifiers for math/code in the GRPO stage. |
| System | Alibaba's internal framework; published less detail than DeepSeek, but the model series is the working evidence. |
| Why famous | One of the cleanest public demonstrations that a hybrid pipeline (offline preference + online preference + verifier-based RL) outperforms any single stage. The "iterated DPO" pattern as a named technique is more associated with Meta's Self-Rewarding LMs / Snorkel's Iterative DPO; Qwen-2.5 popularized the offline-then-online DPO progression in particular. |
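
The difference between the two DPO stages is where the pairs come from: offline pairs are fixed in advance, while online pairs are built from the current policy's own samples. A minimal sketch of the online case, assuming hypothetical `sample` and `score` callables standing in for the policy and a reward model (this is a generic pattern, not Qwen's published procedure):

```python
from typing import Callable


def build_online_pairs(
    prompts: list[str],
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n_samples: int = 4,
) -> list[dict[str, str]]:
    """For each prompt, pair the best- and worst-scored policy samples."""
    pairs: list[dict[str, str]] = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(n_samples)]
        scored = sorted((score(prompt, c), c) for c in candidates)
        (low, rejected), (high, chosen) = scored[0], scored[-1]
        if high > low:  # skip prompts where all samples score the same
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    canned = iter(["ok", "a longer answer", "meh", "fine"])
    pairs = build_online_pairs(
        ["Explain KL divergence"],
        sample=lambda prompt: next(canned),          # stand-in for the policy
        score=lambda prompt, resp: float(len(resp)),  # stand-in for a reward model
    )
    print(pairs)
```

Regenerating the pairs after each policy update is what makes the stage online; feeding the same function a frozen set of samples reduces it to the offline case.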
Recipe 6 · OpenAI o-series ('24–'25) — sketch only
OpenAI hasn't published the recipe, but the public information plus the model behavior strongly suggests:
- Reasoning chains generated at scale with verifiers — math, code, possibly PRMs.
- RL with a heavily filtered, deliberately long-form reasoning signal.
- Test-time scaling via parallel/sequential sampling and possibly tree-style search (lesson 17), made cheap by aggressive prefix caching (lesson 21).
- The user-visible "thinking time" is a knob that trades latency for test-time-search budget.
If you wanted to reproduce the public-facing behavior using this lesson series alone, you'd start with R1's recipe and add a test-time-search wrapper around the trained model. As a rough estimate based on currently understood mechanisms, that gets you about 80% of the way there.
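The simplest test-time-search wrapper is parallel sampling with a majority vote over final answers (often called self-consistency). The sketch below assumes a hypothetical `sample_answer` callable that runs the trained model once and returns its final answer; `budget` plays the role of the "thinking time" knob. It illustrates the general idea only and makes no claim about what OpenAI actually does.

```python
from collections import Counter
from typing import Callable


def majority_vote(
    prompt: str,
    sample_answer: Callable[[str], str],
    budget: int,
) -> tuple[str, float]:
    """Sample `budget` answers and return the most common one with its vote share."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    answers = [sample_answer(prompt) for _ in range(budget)]
    # Ties are broken in favor of the answer that appeared first.
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / budget


if __name__ == "__main__":
    canned = iter(["42", "41", "42", "42", "17"])
    print(majority_vote("What is 6 * 7?", lambda p: next(canned), budget=5))
```

Raising `budget` spends more latency and compute per query in exchange for a more reliable vote, which is the trade the last bullet above describes.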
Interactive · trace a paper onto the lesson map
Below: paste any post-training paper into your head — abstract is enough — and click through the questions. The widget points you at which lessons describe what the paper is doing.
The failure-mode taxonomy
Every RL run that goes wrong goes wrong in roughly one of the ways below. The dashboard signal that catches each one is in column 3 — and the lesson that introduced the mechanism is in column 4. Pin this table near your monitor.
| Symptom | Likely cause | Dashboard tell | Lesson |
|---|---|---|---|
| Train reward goes up, held-out drops | Reward hacking / verifier leak | Held-out reward, frac_format_only correct | 18 |
| Reward is flat, loss looks fine | Stale weight sync (rollout policy not updating) | frac_clipped > 30%, \|log ρ\| growing | 06 |
| Outputs become very short or very long | Length bias: GRPO's per-rollout 1/\|y\| divisor (the Dr.GRPO finding); also DAPO's overlong-cutoff false negatives | mean response length over time | 14, 13 |
| Outputs become deterministic, exploration dies | Entropy collapse from symmetric PPO clip | policy entropy, frac_unique tokens | 13 |
| KL spikes mid-training | Reference drift; β too low; rollout numerical mismatch | KL(π‖π_ref), per-step KL distribution | 03, 20 |
| Many groups have zero advantage | Prompts where every rollout succeeds (too easy) or every rollout fails (too hard), so within-group reward variance is zero | frac_groups_with_signal | 13 |
| Loss is NaN at the first step | fp16 overflow; ratio explosion from stale snapshot | grad norm distribution, first-step ratio histogram | 10, 19 |
| DPO chosen-prob and rejected-prob both drop | BT overconfidence on hard pairs | log π(y_w), log π(y_l) over time | 16 |
| Rollout much slower than trainer | Long prompts without chunked prefill; KV thrashing | rollout tok/s vs train tok/s | 20 |
| OOM mid-training at long sequence | Activation memory at long L; no checkpointing | peak memory per step | 21 |
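
Several of the dashboard tells in the table are a few lines of code each. The sketch below computes four of them from plain Python lists; the function signatures are illustrative assumptions, and a real training loop would compute the same quantities on tensors. The clip fraction uses the common convention of counting importance ratios outside [1 − ε, 1 + ε].

```python
import math
import statistics


def frac_groups_with_signal(group_rewards: list[list[float]]) -> float:
    """Fraction of prompt groups whose rollouts did not all get the same reward."""
    if not group_rewards:
        return 0.0
    with_signal = sum(1 for group in group_rewards if len(set(group)) > 1)
    return with_signal / len(group_rewards)


def frac_clipped(ratios: list[float], clip_eps: float = 0.2) -> float:
    """Fraction of importance ratios falling outside [1 - eps, 1 + eps]."""
    if not ratios:
        return 0.0
    return sum(1 for r in ratios if abs(r - 1.0) > clip_eps) / len(ratios)


def mean_token_entropy(token_probs: list[list[float]]) -> float:
    """Mean Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0.0) for dist in token_probs]
    return statistics.fmean(entropies)


def mean_response_length(lengths: list[int]) -> float:
    """Average response length in tokens for one training step."""
    return statistics.fmean(lengths)


if __name__ == "__main__":
    print(frac_groups_with_signal([[1, 1, 1], [0, 1, 0], [0, 0, 0]]))
    print(frac_clipped([1.0, 1.3, 0.7, 1.1]))
    print(mean_token_entropy([[0.5, 0.5], [0.9, 0.1]]))
    print(mean_response_length([120, 340, 95]))
```

Logging these once per step and plotting them over time is usually enough to tell the rows of the table apart.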
You now know enough to read any 2025 RL paper
Every paper introducing a new post-training recipe is some recombination of:
- An algorithm from lessons 09–14 (or DPO, lesson 16).
- A reward source from lesson 15 (RM), 17 (PRM/verifier), or 18 (env-specific).
- A system pattern from lesson 19 and an engine choice from lessons 20–21.
- A multi-stage data flow that does SFT → preference → RL → distillation in some order.
- One or two genuinely new ideas — usually the abstract's "we propose" sentence.
The first four are the substrate; the fifth is the actual contribution. When you read a new paper, factor it into the first four pieces and then read the fifth carefully — that's where the new physics is.
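As a toy version of that factoring step, the sketch below scans an abstract for a handful of keywords and returns the lessons this series associates with each. The keyword table is an illustrative assumption assembled from the lesson references in this document, and it is deliberately incomplete; matching uses word boundaries so that, for example, "endpoint" does not count as a mention of DPO.

```python
import re

# Keyword -> lessons, taken from the cross-references in this lesson.
LESSON_MAP: dict[str, list[int]] = {
    "grpo": [11],
    "dpo": [16],
    "reward model": [15],
    "prm": [17],
    "verifier": [17, 18],
    "tree search": [17],
    "colocated": [19],
    "disaggregated": [19],
    "prefix caching": [21],
}


def trace(abstract: str) -> dict[str, list[int]]:
    """Return the keywords found in the abstract and their lesson numbers."""
    text = abstract.lower()
    return {
        keyword: lessons
        for keyword, lessons in LESSON_MAP.items()
        if re.search(r"\b" + re.escape(keyword) + r"\b", text)
    }


if __name__ == "__main__":
    abstract = (
        "We train a base model with GRPO using a rule-based verifier "
        "on a disaggregated rollout and trainer setup."
    )
    print(trace(abstract))
```

Whatever the abstract claims that this lookup does not cover is a good first guess at the paper's fifth piece, the genuinely new idea.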