Missing concepts — what the curriculum doesn't yet cover

Lessons 01–25 explain the spine of modern post-training RL. They name several concepts in passing and leave them for later. This lesson is "later" — a catalog of the concepts that get mentioned, what each one means, why it matters, and how the curriculum could expand to teach it.

Organizing the gaps

The curriculum's gaps cluster naturally along the three forces from lesson 26: signal, estimator, and cost. The map below shows where each gap would slot in.

SIGNAL gaps: PRM training in depth; Reward model internals; Curriculum & data mixing; Constitutional AI / RLAIF; Adversarial verifier robustness; Sandbox & safety design
ESTIMATOR gaps: GAE (λ-returns); KL estimators (k1/k2/k3); Entropy regularization; Off-policy & replay buffers; V-trace, IMPALA-style corrections; Reward scaling & whitening
COST gaps: FP8 training & inference; MoE routing under RL; KV offload / CPU spill; NVLink / IB / NCCL topology; Checkpointing & recovery; Multi-LoRA serving for RL

Signal gaps — where reward design is bigger than one lesson

1. Process Reward Models (PRMs) in depth

Lesson 17 introduces PRMs as "score every step in a chain of thought" and notes that DeepSeek abandoned them for R1 in favor of pure verifiers. What it doesn't cover:

How to expand: a lesson "27.1 — Training a PRM" that walks through Monte-Carlo bootstrapping with a worked example, then plugs the PRM into a GAE-flavored advantage estimator and compares variance to GRPO with a flat reward.
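A minimal sketch of what Monte-Carlo step labelling could look like, assuming the common recipe of scoring each step prefix by the fraction of sampled continuations that end in a verified-correct answer. The `rollout` callable and `toy_rollout` are hypothetical stand-ins for a policy sampler plus verifier; neither comes from the framework.

```python
import random

def mc_step_labels(steps, rollout, n_samples=8, seed=0):
    """Estimate a soft correctness label for every step prefix of one solution.

    steps   : list of reasoning steps
    rollout : hypothetical callable (prefix, rng) -> bool, True when a sampled
              continuation of `prefix` reaches a verified-correct final answer
    """
    rng = random.Random(seed)
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(1 for _ in range(n_samples) if rollout(prefix, rng))
        labels.append(wins / n_samples)
    return labels

def toy_rollout(prefix, rng):
    # Toy assumption: once a prefix contains a wrong step it can never recover;
    # otherwise every continuation succeeds. `rng` is unused in this toy.
    return "WRONG" not in prefix

labels = mc_step_labels(["expand", "simplify", "WRONG", "conclude"], toy_rollout)
print(labels)  # [1.0, 1.0, 0.0, 0.0]
```

The label drops at the first wrong step, which is the per-step signal a PRM would be trained to predict. A real rollout is stochastic, so the labels would be noisy fractions rather than clean 0/1 values.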

2. Reward Model internals — the BT loss isn't the whole story

Lesson 15 introduces the Bradley-Terry preference loss and walks the toy. What's missing:

How to expand: a lesson "15.5 — RM design beyond Bradley-Terry" covering ensemble heads, debiasing for length and sycophancy, and uncertainty-aware RL with the RM ensemble.
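For reference, the Bradley-Terry loss from lesson 15 and one possible uncertainty-aware use of an ensemble can be sketched in a few lines. The "mean minus k standard deviations" rule is an assumption chosen for illustration, not something the curriculum prescribes, and `conservative_reward` is a hypothetical helper name.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected)."""
    d = r_chosen - r_rejected
    # Numerically stable softplus(-d).
    if d > 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def conservative_reward(head_scores, k=1.0):
    """Penalize disagreement between ensemble heads (illustrative rule)."""
    n = len(head_scores)
    mean = sum(head_scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in head_scores) / n)
    return mean - k * std

print(bt_loss(2.0, 0.0))                     # small loss: pair ranked correctly
print(bt_loss(0.0, 2.0))                     # large loss: pair ranked wrongly
print(conservative_reward([1.0, 1.0, 1.0]))  # heads agree: reward unchanged
print(conservative_reward([0.0, 2.0]))       # heads disagree: reward discounted
```

The penalty only bites when the heads disagree, which is typically where the policy has drifted away from the reward model's training data.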

3. Curriculum, data mixing, and rejection sampling between stages (now covered in 18a)

This gap is now addressed in lesson 18a · Data pipelines & curation, which frames data curation as gradient-signal engineering around the p̂(1−p̂) curve and walks through the eight pipeline stages (source → license → verifier-feasibility → dedup/decon → stratify → mix → online filter → recycle). The three sub-topics that originally lived here:

4. Constitutional AI / RLAIF

Not covered: a different answer to "where does reward come from" — let another LLM be the judge, guided by a written constitution (a list of principles). Used by Anthropic for Claude.

How to expand: a lesson "15.7 — RLAIF and constitutional methods" since RLAIF is currently the dominant signal source for safety/values training at frontier labs and is conspicuously absent from the curriculum.

5. Adversarial robustness and reward hacking taxonomy

Lesson 18 has the canonical reward-hacking widget. What's not covered:

6. Sandbox and safety design for code-execution rewards

Lesson 18 mentions Firecracker/gVisor sandboxes in passing. The full topic — seccomp profiles, network isolation, filesystem ephemerality, throughput-aware pooling, deterministic execution for replay — is its own discipline. Worth a "verifier infrastructure" lesson when the lesson series grows.

Estimator gaps — algorithm machinery named but not derived

7. Generalized Advantage Estimation (GAE)

Mentioned in lesson 10 as "λ-blended returns", used in original PPO. Not derived. The pitch:

$$A_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \, \delta_{t+k}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

GAE blends a low-variance, high-bias estimator (TD(0): δ_t alone) with a high-variance, unbiased estimator (Monte-Carlo: Σ_k γ^k r_{t+k} − V(s_t), the discounted return minus the value baseline) using λ as the knob. λ=0 picks TD(0); λ=1 picks Monte-Carlo. In practice λ ∈ [0.95, 0.99].

Why the gap matters: the curriculum's algorithms (REINFORCE → Dr.GRPO) all assume terminal rewards and no value function, so GAE doesn't apply (there's no per-step δ to blend). The moment you have per-step rewards (PRM, agentic with per-turn checks) or a learned value head, GAE becomes the natural advantage estimator.

How to expand: a lesson "10.5 — GAE" derived as bias-variance interpolation between TD and Monte-Carlo, with a widget that varies λ on a sparse-reward toy and shows the variance drop.
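The GAE sum can be computed with a single backward pass over a trajectory. The sketch below assumes a finite episode, per-step rewards, and a list of value estimates with one extra bootstrap entry at the end; the function name and toy numbers are illustrative only.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards : r_0 .. r_{T-1}
    values  : V(s_0) .. V(s_T); the last entry is the bootstrap value
              (0.0 when the episode terminates at step T)
    """
    T = len(rewards)
    assert len(values) == T + 1
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # (gamma*lam)-discounted sum
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]        # sparse terminal reward
values = [0.5, 0.6, 0.7, 0.0]    # toy value estimates, terminal bootstrap = 0

td0 = gae_advantages(rewards, values, gamma=1.0, lam=0.0)  # each entry is delta_t
mc = gae_advantages(rewards, values, gamma=1.0, lam=1.0)   # return minus V(s_t)
print(td0)
print(mc)
```

With λ=0 every advantage is just the one-step residual δ_t; with λ=1 the residuals telescope into the full return minus V(s_t), which is the Monte-Carlo end of the interpolation.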

8. KL estimators — k1, k2, k3 (Schulman's note)

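The framework code uses the k3 estimator. With the consistent definition r = log π_θ − log π_ref (so all three estimate KL(π_θ ‖ π_ref) when samples are drawn from π_θ):

k1 = r (unbiased, high variance, can be negative on individual samples)
k2 = r² / 2 (biased, low variance, always ≥ 0)
k3 = exp(−r) − 1 + r (unbiased, low variance, always ≥ 0)

The framework code writes k3 as exp(log π_ref − log π_θ) − (log π_ref − log π_θ) − 1, which is the same expression written in terms of −r = log π_ref − log π_θ. Lesson 03 uses that form; the form here is algebraically identical.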
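A small check of the three estimators from Schulman's note on a toy categorical distribution. Instead of sampling, the sketch takes the exact expectation under π_θ, so the bias of each estimator is visible without Monte-Carlo noise. The distributions `p_theta` and `p_ref` are made-up illustration values.

```python
import math

def kl_estimators(logp_theta, logp_ref):
    """Per-sample estimators of KL(pi_theta || pi_ref) for x ~ pi_theta."""
    r = logp_theta - logp_ref
    k1 = r                        # unbiased, can be negative
    k2 = 0.5 * r * r              # biased, non-negative
    k3 = math.exp(-r) - 1.0 + r   # unbiased, non-negative
    return k1, k2, k3

p_theta = [0.5, 0.3, 0.2]   # toy policy
p_ref = [0.4, 0.4, 0.2]     # toy reference

per_token = [kl_estimators(math.log(p), math.log(q)) for p, q in zip(p_theta, p_ref)]
exact_kl = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref))

# Exact expectation of each estimator under pi_theta.
e_k1, e_k2, e_k3 = (
    sum(p * ks[i] for p, ks in zip(p_theta, per_token)) for i in range(3)
)

print("exact KL:", exact_kl)
print("E[k1], E[k2], E[k3]:", e_k1, e_k2, e_k3)
print("per-token k1:", [ks[0] for ks in per_token])
print("per-token k3:", [ks[2] for ks in per_token])
```

k1 and k3 match the exact KL in expectation, k2 is slightly off, and only k1 goes negative on an individual token. That combination of unbiasedness and non-negativity is the usual argument for k3.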

How to expand: a small "lesson 03.5 — KL estimators" with a widget that draws all three on the same plot for a sample distribution shift, showing why k3 is the production winner.

9. Entropy regularization (vs. KL anchor)

The curriculum uses KL-to-reference as the only regularizer. Classical PPO often adds an entropy bonus:

$$L = L_{\mathrm{PG}} - \beta_{\mathrm{KL}} \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) + c_{\mathrm{ent}} \cdot H(\pi_\theta)$$

KL keeps the policy near a fixed anchor; an entropy bonus keeps the policy from collapsing onto a delta. They overlap but aren't the same. DAPO's "clip-higher" (lesson 13) is a back-door form of entropy preservation; an explicit entropy bonus is the front-door version.

How to expand: a "lesson 13.5 — entropy and exploration" with a widget showing entropy collapse under symmetric clip and three different mitigations (entropy bonus, clip-higher, KL-only).
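To make the difference between the two regularizers concrete, here is a toy single-token sketch. It treats L as a quantity to maximize (the sign convention of the formula in this section), and the coefficient defaults and distributions are arbitrary illustration values, not recommended settings.

```python
import math

def entropy(probs):
    """Shannon entropy H(p) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl(p, q):
    """KL(p || q) for two categorical distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_objective(pg_term, probs, ref_probs, beta_kl=0.05, c_ent=0.01):
    """L = L_PG - beta_KL * KL(pi || pi_ref) + c_ent * H(pi)."""
    return pg_term - beta_kl * kl(probs, ref_probs) + c_ent * entropy(probs)

ref = [0.98, 0.01, 0.01]        # a peaked reference policy
uniform = [1 / 3, 1 / 3, 1 / 3]

# Staying on the anchor: zero KL penalty, but very little entropy.
print(kl(ref, ref), entropy(ref))
# Moving to uniform: maximal entropy, but a large KL penalty.
print(kl(uniform, ref), entropy(uniform))
```

When the reference itself is peaked, the KL term is satisfied by a near-deterministic policy while the entropy term is not, which is one way to see that the two regularizers pull in different directions.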

10. Off-policy correction and replay buffers

The framework is purely on-policy: rollouts are used once and discarded. Async pipelines (SLIME) trade strict on-policy-ness for throughput, and you'd want some form of off-policy correction (importance reweighting, V-trace) to handle it cleanly. None of this is taught.

How to expand: a lesson "19.5 — Off-policy corrections" introducing the V-trace clipped importance ratio (from IMPALA), and showing a widget where you tune the "staleness" of rollouts and watch convergence with/without correction.
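A compact sketch of V-trace targets in the style of IMPALA, assuming a per-step value estimate is available (the curriculum's framework has none, so `values` and `bootstrap_value` are hypothetical inputs). The importance ratio π/μ between the current policy and the stale behavior policy is truncated at `rho_bar` and `c_bar`.

```python
import math

def vtrace_targets(rewards, values, bootstrap_value, logp_target, logp_behavior,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Return (value targets v_t, truncated importance weights rho_t)."""
    T = len(rewards)
    ratios = [math.exp(lt - lb) for lt, lb in zip(logp_target, logp_behavior)]
    rhos = [min(rho_bar, w) for w in ratios]  # truncates the TD residual weight
    cs = [min(c_bar, w) for w in ratios]      # truncates the trace coefficient
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * T
    acc = 0.0  # holds v_{t+1} - V(x_{t+1})
    for t in reversed(range(T)):
        delta = rhos[t] * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * cs[t] * acc
        targets[t] = values[t] + acc
    return targets, rhos

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]

# On-policy rollouts: ratios are 1 and targets reduce to the plain return.
on_policy, _ = vtrace_targets(rewards, values, 0.0, [0.0] * 3, [0.0] * 3, gamma=1.0)

# Stale rollouts: the current policy halves the probability of the last action.
stale, stale_rhos = vtrace_targets(
    rewards, values, 0.0, [0.0, 0.0, math.log(0.5)], [0.0] * 3, gamma=1.0
)
print(on_policy)
print(stale, stale_rhos)
```

Truncation caps how much any single stale sample can move the estimate, trading a little bias for bounded variance as rollouts get older.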

11. Reward scaling, whitening, and normalization

Dr.GRPO (lesson 14) drops the /std normalization and notes the consequence: reward magnitudes now matter. The general topic — how to keep gradients on a stable scale across very different reward sources — isn't unpacked.
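A toy illustration of that consequence, assuming group-relative advantages with a population standard deviation (implementations differ on the exact std and epsilon; `group_advantages` is a hypothetical helper, not the framework's function).

```python
import math

def group_advantages(rewards, use_std=True, eps=1e-6):
    """Group-relative advantages: mean-centered, optionally divided by std."""
    n = len(rewards)
    mean = sum(rewards) / n
    centered = [r - mean for r in rewards]
    if not use_std:
        return centered  # Dr.GRPO-style: keep the raw reward scale
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / (std + eps) for c in centered]

binary = [0.0, 1.0, 1.0, 0.0]    # e.g. a pass/fail verifier
scaled = [0.0, 10.0, 10.0, 0.0]  # same outcomes, reward source scaled by 10

print(group_advantages(binary), group_advantages(scaled))
print(group_advantages(binary, use_std=False), group_advantages(scaled, use_std=False))
```

With /std the two reward sources give nearly identical advantages; without it the second source produces gradients ten times larger, so mixing sources requires choosing their scales deliberately.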

Cost gaps — systems machinery referenced but not unpacked

12. FP8 training and inference

Lesson 22 uses BF16 throughout. FP8 is the next step: half the memory footprint and half the bandwidth of BF16. Two flavors:

RL twist: if rollout is FP8 and trainer is BF16, the log-prob match problem (lesson 25) gets harder. Strict recompute on the trainer side is the safe path.

13. Mixture-of-Experts (MoE) under RL

Frontier models are increasingly sparse MoE (DeepSeek-V3, Mixtral, Qwen-MoE, Llama-4). The curriculum doesn't cover what MoE adds to RL post-training:

14. KV cache offload / CPU spill

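When KV memory is tight, swap least-recently-used (LRU) pages to pinned host (CPU) memory and bring them back when their sequence is scheduled again. Trade-off: PCIe bandwidth is roughly 50× lower than HBM bandwidth, so every swap is expensive. Used in production for very long-context agentic tasks.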
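The bookkeeping behind LRU spill can be sketched with plain Python containers. This models only the eviction policy; `KVPagePool` is a hypothetical class, page payloads are placeholder strings, and no real GPU or PCIe transfer happens.

```python
from collections import OrderedDict

class KVPagePool:
    """Toy model of KV pages split between a small GPU pool and host memory."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # page_id -> payload, least recently used first
        self.host = {}            # spilled pages
        self.swaps_in = 0
        self.swaps_out = 0

    def access(self, page_id, payload=None):
        """Touch a page, swapping it in (and evicting LRU pages) if needed."""
        if page_id in self.gpu:
            self.gpu.move_to_end(page_id)  # mark as most recently used
            return self.gpu[page_id]
        if page_id in self.host:
            data = self.host.pop(page_id)  # would be a host-to-device copy
            self.swaps_in += 1
        else:
            data = payload                 # brand-new page
        while len(self.gpu) >= self.gpu_capacity:
            victim, victim_data = self.gpu.popitem(last=False)
            self.host[victim] = victim_data  # would be a device-to-host copy
            self.swaps_out += 1
        self.gpu[page_id] = data
        return data

pool = KVPagePool(gpu_capacity=2)
pool.access("a", "A")
pool.access("b", "B")
pool.access("a")        # "a" is now more recent than "b"
pool.access("c", "C")   # evicts "b" to host
restored = pool.access("b")  # swaps "b" back in, evicts "a"
print(list(pool.gpu), list(pool.host), pool.swaps_in, pool.swaps_out)
```

Each swap counted here corresponds to a transfer over PCIe in a real engine, which is why schedulers try to keep the pages of soon-to-run sequences resident.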

15. NCCL / NVLink / InfiniBand topology

Lesson 25 mentions weight sync is bandwidth-bound. The full topic — how much bandwidth each fabric provides, when an all-gather is faster than a broadcast, how rail-aware reduce-scatter works — is its own subdiscipline (mostly covered in gpu_kernel_serving/, but not from an RL angle).

16. Checkpointing and recovery

The framework has no save/load story. Real systems checkpoint:

Recovery from preemption is non-trivial — you can lose hours of rollouts if checkpointing is too sparse. A "lesson 28.5 — checkpointing" would be useful when scaling beyond toy runs.
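One small piece of the recovery story is making each checkpoint write atomic, so a preemption mid-write never leaves a truncated file behind. The sketch below uses the standard write-to-temp-then-rename pattern; the JSON format and the fields in `state` are placeholders for illustration, not a description of what a real RL system would save.

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    """Write `state` to `path` atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())     # make sure bytes reach disk before renaming
        os.replace(tmp_path, path)   # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as workdir:
    ckpt_path = os.path.join(workdir, "ckpt.json")
    save_checkpoint({"step": 120, "rng_seed": 7}, ckpt_path)  # placeholder fields
    restored_state = load_checkpoint(ckpt_path)
    leftover_files = os.listdir(workdir)

print(restored_state, leftover_files)
```

A reader of `path` sees either the previous complete checkpoint or the new one, never a partial write. How often to call it is the sparse-versus-dense trade-off described above.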

17. Multi-LoRA serving and weight versioning

Production RL often wants to serve the in-training policy alongside a stable baseline (for A/B), or to maintain multiple specialized policies that share a base. LoRA adapters make this cheap — but the rollout engine has to schedule them, and the log-prob match story now has to hold per-adapter.

If you had to add just one

If the curriculum can absorb only a single new lesson, the highest-leverage candidate is GAE. Three reasons: (a) it unlocks per-step rewards (PRMs, agentic environments), which the curriculum is otherwise locked out of; (b) it's the bridge from the classical control-RL literature to LLM-RL, so it makes the surrounding RL textbooks readable; (c) the bias-variance interpolation with λ is one of the cleanest first-principles widgets you can build in this style. The second pick is RLAIF, since it's currently the dominant safety-RL signal source and not represented at all.

Takeaway
Three forces, three columns of gaps. Signal gaps want richer reward sources (PRM, RM, RLAIF, curriculum). Estimator gaps want sharper algorithmic machinery (GAE, KL flavors, entropy, off-policy correction). Cost gaps want deeper systems coverage (FP8, MoE, offload, checkpointing). The curriculum's spine is solid; these are the branches.