Missing concepts — what the curriculum doesn't yet cover
Lessons 01–25 explain the spine of modern post-training RL. They name several concepts in passing and leave them for later. This lesson is "later" — a catalog of the concepts that get mentioned, what each one means, why it matters, and how the curriculum could expand to teach it.
Organizing the gaps
The curriculum's gaps cluster naturally along the three forces from lesson 26:
- Signal gaps — places where reward shaping, PRM training, RM internals, or environment design are bigger topics than the lessons covered.
- Estimator gaps — algorithm-side machinery that's named but not derived (GAE, KL estimators, entropy regularization, off-policy correction).
- Cost gaps — systems-side machinery that's referenced but not unpacked (FP8 training, MoE, KV offload, NVLink topology, FSDP nuances).
The diagram below maps each gap to where it would slot in. Hover any cell for the elevator pitch.
Signal gaps — where reward design is bigger than one lesson
1. Process Reward Models (PRMs) in depth
Lesson 17 introduces PRMs as "score every step in a chain of thought" and notes that DeepSeek abandoned them for R1 in favor of pure verifiers. What it doesn't cover:
- How PRMs are actually trained. Three families: (a) human-labeled step correctness (PRM800K, OpenAI); (b) Monte-Carlo bootstrapping — a step is "good" if completing from it succeeds > threshold % of the time; (c) self-training from a stronger model.
- The step-boundary problem. What counts as a "step" — a sentence? a line of code? a reasoning beat? PRM quality is sensitive to the segmentation.
- PRM-driven RL training (vs PRM-driven search). With per-step credit you can apply per-token advantage rather than per-trajectory advantage, which can reduce gradient variance (as a rough heuristic, by ~√L for L-step chains). This needs a GAE-style estimator (see below).
- Why PRMs lose to pure verifiers when both exist. A verifier checks the outcome directly and is much harder to hack; a PRM is a learned proxy. Whenever the task has a verifier, the PRM's only job is variance reduction, and you can get that more cheaply from K-group baselines.
How to expand: a lesson "27.1 — Training a PRM" that walks through Monte-Carlo bootstrapping with a worked example, then plugs the PRM into a GAE-flavored advantage estimator and compares variance to GRPO with a flat reward.
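As a rough illustration of the Monte-Carlo bootstrapping family (b) above, the sketch below labels each step prefix by how often completions from it succeed. Everything here is a stand-in: `toy_rollout` is a hypothetical replacement for "sample a completion from the policy, then run the verifier", and the step strings, rollout count, and threshold are made-up values.

```python
import random

def mc_step_labels(steps, rollout_fn, n_rollouts=64, threshold=0.5, seed=0):
    """Label each step prefix by the empirical success rate of completions from it."""
    rng = random.Random(seed)
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(1 for _ in range(n_rollouts) if rollout_fn(prefix, rng))
        rate = wins / n_rollouts
        # A step is "good" if completing from it succeeds more often than the threshold.
        labels.append((rate, rate > threshold))
    return labels

def toy_rollout(prefix, rng):
    # Hypothetical stand-in for "complete from this prefix, then verify the result".
    # Once a bad step is in the prefix, no completion can recover.
    if "bad" in prefix:
        return False
    return rng.random() < 0.9

steps = ["setup", "expand", "bad", "conclude"]
labels = mc_step_labels(steps, toy_rollout)
for step, (rate, good) in zip(steps, labels):
    print(f"{step:>8}: success rate {rate:.2f} -> {'good' if good else 'bad'}")
```

In a real pipeline the cost is dominated by the rollouts: each labeled step needs `n_rollouts` completions, which is why the choice of step boundaries matters for budget as well as for label quality.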
2. Reward Model internals — the BT loss isn't the whole story
Lesson 15 introduces the Bradley-Terry preference loss and walks the toy. What's missing:
- RM architecture choices. Last-token regression head vs. mean-pool head vs. classifier head. Each has different reward-hacking failure modes.
- RM ensembling and uncertainty. When the RM is uncertain (variance across an ensemble), pessimistic policies (don't trust the high mean) can avoid hacks.
- KL-regularized RM training. If you regularize the RM toward staying close to the original LM, you get an RM that's harder to game — but harder to train, too.
- Reward model debiasing. Length bias (RMs prefer longer responses) is severe enough that several papers (e.g. Dubois et al. 2024) explicitly debias against length before policy updates.
- RM transferability. An RM trained on one base model often doesn't generalize to a different policy. Why this is and how Tülu 3 handled it.
How to expand: a lesson "15.5 — RM design beyond Bradley-Terry" covering ensemble heads, debiasing for length and sycophancy, and uncertainty-aware RL with the RM ensemble.
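One simple way to picture the "don't trust the high mean" idea from the ensembling bullet is to penalize the ensemble mean by the ensemble spread. This is a minimal sketch under that assumption; the penalty form (mean minus `k` standard deviations), the name `pessimistic_reward`, and the scores are illustrative choices, not something the curriculum prescribes.

```python
import statistics

def pessimistic_reward(ensemble_scores, k=1.0):
    """Mean RM score minus k times the disagreement across ensemble members."""
    mean = statistics.fmean(ensemble_scores)
    spread = statistics.pstdev(ensemble_scores)
    return mean - k * spread

# Two hypothetical responses scored by a 3-member RM ensemble.
confident = [0.9, 1.0, 1.1]   # members agree
uncertain = [0.0, 1.5, 3.0]   # higher mean, but members disagree strongly

print(pessimistic_reward(confident))
print(pessimistic_reward(uncertain))
```

The second response has the higher mean score, yet the pessimistic reward ranks it below the first, which is the behavior you want when large disagreement is a symptom of a hack.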
3. Curriculum, data mixing, and rejection sampling between stages (now covered in 18a)
This gap is now addressed in lesson 18a · Data pipelines & curation, which frames data curation as gradient-signal engineering around the p̂(1−p̂) curve and walks through the eight pipeline stages (source → license → verifier-feasibility → dedup/decon → stratify → mix → online filter → recycle). The three sub-topics that originally lived here:
- Difficulty stratification — stage 5 of 18a.
- Rejection-sampled SFT — stage 8 of 18a, traced through R1's stage 3.
- Mixing verifiable and preference data — stage 6 of 18a, with the per-source reward-whitening fix.
4. Constitutional AI / RLAIF
Not covered: a different answer to "where does reward come from" — let another LLM be the judge, guided by a written constitution (a list of principles). Used by Anthropic for Claude.
- RLAIF in one sentence: replace the human labeler in RLHF with a strong LLM judging against a written principle. Often produces preferences as cheaply as a verifier.
- The constitution. A short document of harms to avoid and behaviors to prefer; the LLM critic is asked, per principle, "did the response uphold or violate this?".
- Self-critique / self-refine loops. The policy generates, critiques itself, revises, and submits the revision — a different RL flavor where the trajectory is "draft → critique → revise".
How to expand: a lesson "15.7 — RLAIF and constitutional methods" since RLAIF is currently the dominant signal source for safety/values training at frontier labs and is conspicuously absent from the curriculum.
5. Adversarial robustness and reward hacking taxonomy
Lesson 18 has the canonical reward-hacking widget. What's not covered:
- Adversarial verifier construction. Once a model gets good, you can use the model itself to find verifier bugs and patch them — a Goodhart's-law arms race.
- Held-out verifier vs. training verifier. Treating the verifier itself as the thing you might overfit, and holding back a stricter version for eval.
- Reward shaping vs. reward hacking. Intentional shaping (format reward in R1) sits one step from accidental hacking; how to tell the difference.
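The held-out-verifier idea can be sketched as two checkers of different strictness and the gap between their pass rates. Both checkers below are hypothetical toys (substring match for training, exact last-line match for eval), and the samples are invented for the illustration.

```python
def training_verifier(response, gold):
    # Lenient check: the gold answer appears anywhere in the response.
    return gold in response

def heldout_verifier(response, gold):
    # Stricter check held back for eval: the last non-empty line must equal the gold answer.
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return bool(lines) and lines[-1] == gold

def pass_rates(samples):
    n = len(samples)
    train = sum(training_verifier(r, g) for r, g in samples) / n
    held = sum(heldout_verifier(r, g) for r, g in samples) / n
    return train, held

samples = [
    ("Work shown.\n42", "42"),
    ("The answer could be 41, 42, or 43.", "42"),  # games the lenient check
    ("7", "42"),
    ("so x = 42\n42", "42"),
]
train_rate, held_rate = pass_rates(samples)
print(f"training verifier: {train_rate:.2f}, held-out verifier: {held_rate:.2f}")
```

A gap that widens over training is a signal that the policy is fitting the training verifier rather than the task.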
6. Sandbox and safety design for code-execution rewards
Lesson 18 mentions Firecracker/gVisor sandboxes in passing. The full topic — seccomp profiles, network isolation, filesystem ephemerality, throughput-aware pooling, deterministic execution for replay — is its own discipline. Worth a "verifier infrastructure" lesson when the lesson series grows.
Estimator gaps — algorithm machinery named but not derived
7. Generalized Advantage Estimation (GAE)
Mentioned in lesson 10 as "λ-blended returns", used in original PPO. Not derived. The pitch:
GAE blends a low-variance, high-bias estimator (TD(0): δ_t alone) with a high-variance, unbiased estimator (Monte-Carlo: Σ_k γ^k r_{t+k}) using λ as the knob. λ=0 picks TD(0); λ=1 picks Monte-Carlo. In practice λ ∈ [0.95, 0.99].
Why the gap matters: the curriculum's algorithms (REINFORCE → Dr.GRPO) all assume terminal rewards and no value function, so GAE doesn't apply (there's no per-step δ to blend). The moment you have per-step rewards (PRM, agentic with per-turn checks) or a learned value head, GAE becomes the right advantage estimator.
How to expand: a lesson "10.5 — GAE" derived as bias-variance interpolation between TD and Monte-Carlo, with a widget that varies λ on a sparse-reward toy and shows the variance drop.
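For reference, here is a minimal sketch of the standard GAE recursion, A_t = δ_t + γλ·A_{t+1} with δ_t = r_t + γ·V(s_{t+1}) − V(s_t). The function name, the convention that `values` carries one extra bootstrap entry, and the toy numbers are assumptions of this example.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: per-step rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_T); use 0.0 for a terminal s_T
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # λ-weighted sum of future TD errors
        advantages[t] = running
    return advantages

# Sparse-reward toy: reward only at the last step, hypothetical value estimates.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7, 0.0]

td0 = gae(rewards, values, gamma=1.0, lam=0.0)  # each entry is just δ_t
mc = gae(rewards, values, gamma=1.0, lam=1.0)   # each entry is return-to-go minus V(s_t)
print(td0, mc)
```

With λ=0 every advantage is the one-step TD error; with λ=1 the value terms telescope away and each advantage becomes the Monte-Carlo return minus the baseline V(s_t).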
8. KL estimators — k1, k2, k3 (Schulman's note)
The framework code uses the k3 estimator. Using the consistent definition r = log π_θ − log π_ref (so all three estimate KL(π_θ ‖ π_ref) when sampled from π_θ):
- k1 = r. Unbiased, but signed: individual-token values can be negative, and the variance is high.
- k2 = ½·r². Always non-negative and lower variance than k1, but biased; the bias is small when the two policies are close. Sometimes used for symmetric KL.
- k3 = e^(−r) + r − 1. Always non-negative, unbiased in expectation, and typically lower variance than k1. This is the production choice.
The framework code writes this as exp(log π_ref − log π_θ) − (log π_ref − log π_θ) − 1 — the same expression with −r substituted. Lesson 03 uses that form; the form here is algebraically identical.
How to expand: a small "lesson 03.5 — KL estimators" with a widget that draws all three on the same plot for a sample distribution shift, showing why k3 is the production winner.
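The claims about the three estimators can be checked exactly on a small categorical distribution, where the expectation under π_θ is a finite sum. The two distributions below are made-up numbers; `r` follows the definition used above, r = log π_θ − log π_ref.

```python
import math

# Hypothetical next-token distributions over a 3-token vocabulary.
pi_theta = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]

def estimators(p, q):
    """Per-token k1, k2, k3 for a token with prob p under pi_theta and q under pi_ref."""
    r = math.log(p) - math.log(q)
    return r, 0.5 * r * r, math.exp(-r) + r - 1.0

per_token = [estimators(p, q) for p, q in zip(pi_theta, pi_ref)]

# Exact KL(pi_theta || pi_ref) and the exact expectation of each estimator under pi_theta.
true_kl = sum(p * (math.log(p) - math.log(q)) for p, q in zip(pi_theta, pi_ref))
expected = [sum(p * est[i] for p, est in zip(pi_theta, per_token)) for i in range(3)]

print("true KL:", true_kl)
print("E[k1], E[k2], E[k3]:", expected)
print("per-token k1:", [est[0] for est in per_token])
print("per-token k3:", [est[2] for est in per_token])
```

E[k1] and E[k3] match the true KL, E[k2] does not, and only k1 produces negative per-token values. The direction of the k2 bias depends on the pair of distributions, so this one example should not be read as a general sign.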
9. Entropy regularization (vs. KL anchor)
The curriculum uses KL-to-reference as the only regularizer. Classical PPO often adds an entropy bonus, a term β·H[π_θ(·|s)] added to the objective with a small coefficient β.
KL keeps the policy near a fixed anchor; the entropy bonus keeps the policy from collapsing onto a delta. They overlap but aren't the same. DAPO's "clip-higher" (lesson 13) is a back-door form of entropy preservation; an explicit entropy bonus is the front-door version.
How to expand: a "lesson 13.5 — entropy and exploration" with a widget showing entropy collapse under symmetric clip and three different mitigations (entropy bonus, clip-higher, KL-only).
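A minimal sketch of what an entropy bonus computes, assuming a categorical next-token distribution and a coefficient named `beta` (both the name and the default value are choices made for this example):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution; 0*log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def objective_with_entropy_bonus(pg_objective, probs, beta=0.01):
    # The bonus rewards spread-out distributions; a collapsed (delta) policy earns nothing.
    return pg_objective + beta * entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]
collapsed = [1.0, 0.0, 0.0, 0.0]
print(entropy(uniform), entropy(collapsed))
print(objective_with_entropy_bonus(1.0, uniform), objective_with_entropy_bonus(1.0, collapsed))
```

Note that nothing here references a reference policy: the bonus only measures how spread out the current policy is, which is the sense in which it differs from a KL anchor.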
10. Off-policy correction and replay buffers
The framework is purely on-policy: rollouts are used once and discarded. Async pipelines (SLIME) trade strict on-policy-ness for throughput, and you'd want some form of off-policy correction (importance reweighting, V-trace) to handle it cleanly. None of this is taught.
How to expand: a lesson "19.5 — Off-policy corrections" introducing the V-trace clipped importance ratio (from IMPALA), and showing a widget where you tune the "staleness" of rollouts and watch convergence with/without correction.
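The core of the V-trace correction is a truncated importance ratio, min(ρ̄, π/μ), where μ is the (stale) behavior policy that generated the rollout. The sketch below shows only that ratio applied to per-token advantages, not the full V-trace value target; the function names and the numbers are illustrative.

```python
import math

def truncated_is_weights(logp_current, logp_behavior, rho_bar=1.0):
    """Per-token importance ratios pi/mu, truncated from above at rho_bar."""
    return [min(rho_bar, math.exp(lc - lb))
            for lc, lb in zip(logp_current, logp_behavior)]

def corrected_advantages(advantages, weights):
    # Tokens the current policy likes less than the stale policy did get down-weighted;
    # tokens it likes more are capped at rho_bar so no single token can blow up the gradient.
    return [w * a for w, a in zip(weights, advantages)]

# Hypothetical log-probs for two tokens from a stale rollout.
logp_current = [math.log(0.5), math.log(0.1)]
logp_behavior = [math.log(0.25), math.log(0.2)]

weights = truncated_is_weights(logp_current, logp_behavior)
print(weights)
print(corrected_advantages([1.0, 1.0], weights))
```

The raw ratios here are 2.0 and 0.5; truncation caps the first at 1.0 and leaves the second unchanged.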
11. Reward scaling, whitening, and normalization
Dr.GRPO (lesson 14) drops the /std normalization and notes the consequence: reward magnitudes now matter. The general topic — how to keep gradients on a stable scale across very different reward sources — isn't unpacked.
- Reward whitening. Subtract running mean, divide by running std (à la batch norm) across rollouts.
- Per-source scaling. When mixing verifier (∈{0,1}) and RM (∈ℝ) rewards in one batch, scale each source so their advantages have comparable magnitudes.
- Clipping reward outliers. A single rollout with 10× the normal reward (numerical bug, mislabeled gold answer) can dominate a batch's gradient.
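The three ideas above can be sketched in a few lines. This version whitens within a single batch rather than with running statistics, which is a simplification; the source labels, clip bounds, and reward values are invented for the example.

```python
import statistics

def whiten(rewards, eps=1e-8):
    """Subtract the batch mean and divide by the batch std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clip(rewards, lo, hi):
    """Clamp outliers so one rollout cannot dominate the batch."""
    return [max(lo, min(hi, r)) for r in rewards]

def whiten_per_source(batch):
    """batch: list of (source, reward). Whitens within each source, preserving order."""
    by_source = {}
    for source, reward in batch:
        by_source.setdefault(source, []).append(reward)
    whitened = {s: iter(whiten(rs)) for s, rs in by_source.items()}
    return [(s, next(whitened[s])) for s, _ in batch]

# Mixed batch: a binary verifier and a real-valued RM on a very different scale.
batch = [("verifier", 0.0), ("rm", 12.0), ("verifier", 1.0), ("rm", 8.0)]
scaled = whiten_per_source(batch)
print(scaled)
print(clip([1.0, 50.0, -3.0], -5.0, 5.0))
```

After per-source whitening both sources contribute values of comparable magnitude, so neither dominates the gradient just because of its units.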
Cost gaps — systems machinery referenced but not unpacked
12. FP8 training and inference
Lesson 22 uses bf16 throughout. FP8 is the next step: half the memory and bandwidth of bf16. Two flavors:
- FP8 inference: weights quantized to FP8 (E4M3 for forward) with per-tensor scales. Rollout engines (vLLM, TensorRT-LLM) support it.
- FP8 training: forward in FP8, master weights in BF16/FP32, careful loss-scaling. NVIDIA Transformer Engine. Effective in 2025+ for ≥70B training.
RL twist: if rollout is FP8 and trainer is BF16, the log-prob match problem (lesson 25) gets harder. Strict recompute on the trainer side is the safe path.
13. Mixture-of-Experts (MoE) under RL
Frontier models are increasingly sparse MoE (DeepSeek-V3, Mixtral, Qwen-MoE, Llama-4). The curriculum doesn't cover what MoE adds to RL post-training:
- Router weights. A small linear layer that routes each token to top-k experts. The router is also trained — and policy-gradient interacts with router load-balance loss in unintuitive ways.
- All-to-all collectives. MoE forward needs an all-to-all to send tokens to their expert's GPU. On the rollout side this is a new bottleneck; vLLM and SGLang have specialized MoE kernels.
- Async / engine drift. If rollout and trainer have slightly different router weights (because of disaggregated sync timing), tokens route to different experts → log-prob mismatch in a new flavor.
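To make the routing and drift bullets concrete, here is a toy top-k router: softmax over router logits, keep the k largest, renormalize their weights. Renormalizing over the selected experts is one common convention, assumed here; the logits are made-up numbers chosen so that a tiny difference flips the second expert.

```python
import math

def top_k_route(router_logits, k=2):
    """Return {expert_index: gate_weight} for the top-k experts of one token."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# Same token, router logits as seen by the trainer and by a slightly stale rollout engine.
trainer_logits = [2.00, 1.00, 0.99, -1.0]
rollout_logits = [2.00, 0.99, 1.00, -1.0]

trainer_route = top_k_route(trainer_logits)
rollout_route = top_k_route(rollout_logits)
print(sorted(trainer_route), sorted(rollout_route))
```

A difference of 0.01 in two logits sends the token to a different second expert, so the two engines compute the token's log-prob through different weights even though the router barely moved.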
14. KV cache offload / CPU spill
When KV memory is tight, swap LRU pages to host (CPU pinned) memory and bring them back when scheduled. Trade-off: PCIe is ~50× slower than HBM. Used in production for very long-context agentic tasks.
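The swap policy described above can be modeled with an ordered map: GPU-resident pages are kept in recency order, and the least recently used page spills to host memory when space runs out. This is a toy bookkeeping model only (no real memory transfers); the class name, the counters, and the page payloads are hypothetical.

```python
from collections import OrderedDict

class KVPagePool:
    """Toy LRU model of KV pages split between GPU memory and host memory."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # page_id -> payload, least recently used first
        self.host = {}            # spilled pages
        self.swaps_in = 0
        self.swaps_out = 0

    def access(self, page_id, payload=None):
        if page_id in self.gpu:
            self.gpu.move_to_end(page_id)         # mark as most recently used
        else:
            if page_id in self.host:              # bring a spilled page back
                payload = self.host.pop(page_id)
                self.swaps_in += 1
            self._make_room()
            self.gpu[page_id] = payload
        return self.gpu[page_id]

    def _make_room(self):
        while len(self.gpu) >= self.gpu_capacity:
            victim, payload = self.gpu.popitem(last=False)  # evict the LRU page
            self.host[victim] = payload
            self.swaps_out += 1

pool = KVPagePool(gpu_capacity=2)
pool.access("a", "A")
pool.access("b", "B")
pool.access("a")           # touch "a", so "b" becomes the LRU page
pool.access("c", "C")      # spills "b" to host
result = pool.access("b")  # swaps "b" back in and spills "a"
print(result, list(pool.gpu), list(pool.host), pool.swaps_in, pool.swaps_out)
```

Every swap counted here corresponds to a transfer over the slow host link, which is why the scheduler's access pattern matters as much as the raw capacity.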
15. NCCL / NVLink / InfiniBand topology
Lesson 25 mentions weight sync is bandwidth-bound. The full topic — how much bandwidth each fabric provides, when an all-gather is faster than a broadcast, how rail-aware reduce-scatter works — is its own subdiscipline (mostly covered in gpu_kernel_serving/, but not from an RL angle).
16. Checkpointing and recovery
The framework has no save/load story. Real systems checkpoint:
- FSDP-sharded weights (consolidated or sharded files).
- Optimizer state (momentum and variance for Adam: two extra tensors, 2× the parameter count).
- Rollout engine snapshot + the "weight generation" tag for async pipelines.
- The replay/rollout queue (for resumable runs).
Recovery from preemption is non-trivial — you can lose hours of rollouts if checkpointing is too sparse. A "lesson 28.5 — checkpointing" would be useful when scaling beyond toy runs.
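One way to tie the four items together is a small manifest written atomically next to the heavy files, so a resumed run knows which shards, optimizer state, weight generation, and pending rollouts belong together. The sketch below assumes a JSON manifest; every field name and file name in it is hypothetical.

```python
import json
import os
import tempfile

def save_manifest(path, manifest):
    """Write the manifest atomically: a preempted write never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous manifest

def load_manifest(path):
    with open(path) as f:
        return json.load(f)

# Hypothetical manifest contents for one checkpoint.
manifest = {
    "step": 1200,
    "weight_generation": 37,                       # tag the rollout engine last synced to
    "weight_shards": ["shard_00.pt", "shard_01.pt"],
    "optimizer_shards": ["optim_00.pt", "optim_01.pt"],
    "pending_rollout_ids": [9001, 9002],           # queue entries to replay on resume
}

with tempfile.TemporaryDirectory() as ckpt_dir:
    manifest_path = os.path.join(ckpt_dir, "manifest.json")
    save_manifest(manifest_path, manifest)
    restored = load_manifest(manifest_path)

print(restored["step"], restored["weight_generation"])
```

Writing the manifest last, after the shard files are safely on disk, means its presence can serve as the "this checkpoint is complete" marker.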
17. Multi-LoRA serving and weight versioning
Production RL often wants to serve the in-training policy alongside a stable baseline (for A/B), or to maintain multiple specialized policies that share a base. LoRA adapters make this cheap — but the rollout engine has to schedule them, and the log-prob match story now has to hold per-adapter.
[Interactive widget: pick a gap, see where it slots in]
If you had to add just one
If the curriculum can absorb only a single new lesson, the highest-leverage candidate is GAE. Three reasons: (a) it unlocks per-step rewards (PRMs, agentic environments), which the curriculum is otherwise locked out of; (b) it's the bridge from the classical control-RL literature to LLM-RL, so it makes the surrounding RL textbooks readable; (c) the bias-variance interpolation with λ is one of the cleanest first-principles widgets you can build in this style. The second pick is RLAIF, since it's currently the dominant safety-RL signal source and not represented at all.