Data pipelines & curation — engineering the gradient signal
Lesson 18 told you where the reward number comes from. This lesson asks the question one step upstream: where do the prompts come from, and which ones are worth spending rollouts on? The answer reframes data curation from "good examples to imitate" into something more precise — gradient-signal engineering.
Why data is different in RL than SFT
In supervised fine-tuning, data is the target. You hand the model (x, y*) pairs and the loss −log π(y* | x) tells the model to be the demonstrator. The curation question for SFT is "what behavior do we want?" — and the answer is the response field of your dataset.
In RL post-training, data is only the prompt x. The model produces its own candidate y via rollout. The verifier or reward model scores it. The loss reshapes π toward responses that scored high. Your dataset never says what a good response looks like — the verifier does.
This is why a curriculum that works for SFT — "give the model lots of examples of correct behavior" — actively hurts in RL. A prompt the model already solves perfectly is a prompt with zero gradient. The most "high quality" examples for SFT are often the lowest-information examples for RL.
The single most important quantity: per-prompt pass rate
Pick a prompt x. Sample K rollouts y1, …, yK from the current policy. Score each with a binary verifier so each ri ∈ {0, 1}. Let p = P(r=1 | x) be the true pass rate of the prompt under the current policy.
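Now write the advantage for the rollouts of this prompt under a group-mean baseline (RLOO, Dr. GRPO, REINFORCE with a group baseline). Idealizing the baseline as the true pass rate, each Ai = ri − p, and the expected squared advantage is

$$\mathbb{E}[A_i^2] = \mathbb{E}[(r_i - p)^2] = p(1 - p).$$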
That is just the variance of a Bernoulli with mean p. It is also, up to a constant, the magnitude of the per-prompt contribution to the unnormalized policy gradient. The function p(1 − p) is the most important curve in this lesson: it is zero at p = 0 and p = 1 and peaks at 1/4 when p = 0.5.
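To make the shape of the curve concrete, here is a minimal sketch that tabulates p(1 − p) at a few pass rates. The function name `signal_density` is an illustrative choice, not something the lesson defines.

```python
def signal_density(p: float) -> float:
    """Bernoulli variance p(1 - p): expected squared advantage at pass rate p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return p * (1.0 - p)


if __name__ == "__main__":
    # Text rendering of the curve: one bar per pass rate.
    for p in (0.0, 0.05, 0.25, 0.5, 0.75, 0.95, 1.0):
        density = signal_density(p)
        bar = "#" * round(density * 100)
        print(f"p = {p:4.2f}  p(1-p) = {density:.4f}  {bar}")
```

The curve is symmetric around p = 0.5, so a prompt the policy solves 20% of the time carries the same expected squared advantage as one it solves 80% of the time.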
Two consequences fall out immediately and the rest of the lesson is downstream of them:
- At p̂ = 0 (all K rollouts fail) or p̂ = 1 (all succeed), the empirical advantage of every rollout in the group is identically zero. The gradient contribution of that prompt is structurally zero, no matter how many rollouts you took.
- You already paid for those rollouts. K decode passes hit the inference engine, the verifier ran, log-probs were computed under the trainer and reference — and the gradient is zero. That is not "small signal." It is wasted FLOPs.
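The two consequences above can be checked numerically. The sketch below uses a plain group-mean baseline (RLOO's leave-one-out baseline differs in detail but also yields all-zero advantages when every reward in the group is equal); the function names are illustrative. It also computes how likely a group of K independent binary rollouts is to be degenerate at a given true pass rate, which follows directly from the Bernoulli model.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to the mean reward of its group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def degenerate_probability(p: float, k: int) -> float:
    """Chance that k independent binary rollouts all pass or all fail."""
    return p ** k + (1.0 - p) ** k


if __name__ == "__main__":
    print(group_advantages([0, 0, 0, 0]))   # all-fail group: every advantage is zero
    print(group_advantages([1, 0, 0, 0]))   # mixed group: nonzero advantages
    for p in (0.02, 0.4, 0.95):
        print(p, round(degenerate_probability(p, 8), 4))
```

At p = 0.02 and K = 8, about 85% of groups come back all-fail, so most of the rollout budget on such a prompt produces no gradient at all.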
The "edge of capability" is non-stationary
If pass rate is the curation target, the second crucial fact is that pass rate is a function of the current policy. Three points on the same prompt during a single training run:
- At step 0, the policy is the base model. A hard AIME prompt has p̂ ≈ 0.02 — degenerate, all-fail.
- At step 5,000, the policy has learned to chain steps. The same prompt now has p̂ ≈ 0.4 — squarely in the informative band.
- At step 20,000, the policy solves it routinely. p̂ ≈ 0.95 — degenerate, all-pass.
The informative band slides over the course of training. A static prompt pool — even one perfectly difficulty-balanced for the base model — will decay into mostly-degenerate as the policy improves. Curation is not a one-shot dataset construction step; it is a control loop that runs alongside training.
This is the root cause of three otherwise puzzling observations in the literature:
- R1-Zero's "aha moment" requires hard prompts. The reasoning behavior emerges only because the prompts stay just out of reach — too easy a pool and there's no pressure for chain-of-thought.
- DAPO's "dynamic sampling" exists. It is the online version of this: at training time, oversample and discard groups whose K rewards are all equal (collapses to p̂ ∈ {0, 1} under a binary verifier) before computing the loss. The lesson here is the offline complement — don't generate them in the first place.
- R1's stage 3 self-distillation works. Rejection-sampled SFT (keep only successful rollouts and SFT on them) is a way to compress what the model has learned so it can re-enter RL with the informative band re-centered on newly-hard prompts.
The pipeline, linearized
With the gradient-signal frame in hand, the rest of the pipeline writes itself. Eight stages, each one a filter that improves expected gradient per FLOP at some bookkeeping cost.
Stage 1 — Source
Where prompts originate. Four kinds, roughly ordered from most to least expensive per prompt, with quality growing less uniform down the list:
- Human-written. Expensive (~$1–10 per prompt), high quality, low scale (10³–10⁵). Used for cold-start SFT and high-stakes verifier targets where ambiguity is costly (InstructGPT's 13k demos, R1's "few thousand" cold-start reasoning chains).
- Public benchmarks & competitions. Free, high quality, low-to-medium scale (10⁴–10⁵). MATH, GSM8K, AIME, HumanEval, MBPP. The unspoken cost: if it's public, it might be in pretraining, which moves the contamination problem to stage 4.
- Model-synthesized. Cheap per-prompt (a single decode), unlimited scale (10⁶+), quality depends entirely on the generator and the filter that follows. R1's reasoning corpus is essentially this. The danger: a synthetic prompt distribution looks like its generator, not like the real world.
- Logged / scraped traffic. Cheap, realistic, license-ambiguous. Most production deployments lean on this for the "general assistant" half of their mix.
The source stage's job is not to "pick the best one" but to state the distribution explicitly. Every later stage operates on this distribution; if you can't write it down, you'll struggle to debug why training stalled three weeks in.
Stage 2 — License + safety filter
Cheap, mechanical, easy to skip and then expensive to redo. Drop anything that violates license, leaks PII, or contains content you don't want the model to optimize toward. Skipping this stage typically blocks a release for a legal review pass on the entire training pool — every prompt has to be re-cleared, retroactively.
Stage 3 — Verifier-feasibility
The verifier from lesson 18 needs something to compare against. For math: a gold answer that the verifier's normalization can canonicalize. For code: a unit-test suite that distinguishes correct from incorrect solutions on at least the held-out cases. For tool-use: an end-state predicate. A prompt without a robust check is unusable for RLVR — it'll get rewarded for the wrong reasons.
Concretely, a healthy stage-3 check, for each prompt, runs the verifier against (a) the gold answer (should return 1), (b) a hand-picked incorrect distractor (should return 0), and (c) a paraphrase of the gold answer (should usually return 1). Prompts where the verifier fails any of these are verifier-incompatible regardless of how good the prompt itself is.
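The three-way check described above can be sketched as follows. Everything here is hypothetical scaffolding: `toy_math_verifier` is a stand-in that compares answers as exact rationals, and a real verifier from lesson 18 would replace it.

```python
from fractions import Fraction
from typing import Callable

Verifier = Callable[[str, str], bool]  # (candidate, gold) -> passes?


def toy_math_verifier(candidate: str, gold: str) -> bool:
    """Hypothetical verifier: canonicalize both answers to exact rationals."""
    try:
        return Fraction(candidate.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return False  # unparseable answers never pass


def feasibility_report(verifier: Verifier, gold: str,
                       distractor: str, paraphrase: str) -> dict[str, bool]:
    """Run the gold / distractor / paraphrase checks for one prompt."""
    return {
        "accepts_gold": verifier(gold, gold),
        "rejects_distractor": not verifier(distractor, gold),
        "accepts_paraphrase": verifier(paraphrase, gold),
    }


def is_feasible(report: dict[str, bool]) -> bool:
    """A prompt is kept only if the verifier behaves correctly on all three."""
    return all(report.values())


if __name__ == "__main__":
    report = feasibility_report(toy_math_verifier, "1/2", "1/3", "0.5")
    print(report, is_feasible(report))
```

A verifier that compares raw strings would accept the gold answer and reject the distractor but fail the paraphrase check on "0.5" versus "1/2", which is exactly the kind of prompt this stage is meant to flag.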
Stage 4 — Dedup + decontamination
Two distinct jobs that share infrastructure:
- Dedup. Near-duplicate prompts within the training pool collapse to one effective example. MinHash with Jaccard threshold in [0.7, 0.8] on character or token n-grams catches the obvious cases (0.8 is the common pretraining default; 0.7 is more aggressive); exact-string dedup catches the embarrassing ones. Why it matters in RL specifically: a duplicated prompt gets sampled more often than it should, biasing the gradient toward whatever quirk that one prompt teaches.
- Decontamination. Eval prompts that leak into the training pool make your held-out numbers a lie. N-gram exact-match (8 to 13 grams are typical) against every eval set you'll report against, plus an embedding-based near-match sweep. The damage from a single contaminated AIME problem is "your headline number is overstated by an unknown amount."
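Both jobs reduce to n-gram set operations, which the sketch below illustrates. It is a simplification under stated assumptions: it computes exact Jaccard similarity pairwise (quadratic in pool size) rather than MinHash, which approximates the same quantity at scale; it uses whitespace-split word n-grams; and it omits the embedding-based sweep. All function names are illustrative.

```python
def shingles(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedup(prompts: list[str], threshold: float = 0.8, n: int = 3) -> list[str]:
    """Keep the first prompt of each exact or near-duplicate cluster."""
    kept, kept_shingles, seen_exact = [], [], set()
    for prompt in prompts:
        normalized = " ".join(prompt.lower().split())
        if normalized in seen_exact:
            continue  # exact duplicate after case/whitespace normalization
        current = shingles(prompt, n)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_exact.add(normalized)
        kept.append(prompt)
        kept_shingles.append(current)
    return kept


def build_eval_index(eval_prompts: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Union of all word n-grams across every eval set you report against."""
    index: set[tuple[str, ...]] = set()
    for prompt in eval_prompts:
        index |= shingles(prompt, n)
    return index


def is_contaminated(prompt: str, eval_index: set, n: int = 8) -> bool:
    """True if the prompt shares at least one n-gram with the eval index."""
    return not shingles(prompt, n).isdisjoint(eval_index)
```

Exact n-gram matching misses paraphrased leaks, which is why the text pairs it with an embedding-based near-match sweep.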
Stage 5 — Difficulty stratification
The stage that makes the most direct use of the pass-rate frame. For each surviving prompt, sample K=8 rollouts from your current checkpoint (not the base — the one you're about to train from). Bucket by p̂:
- p̂ ∈ [0.0, 0.05): too hard. Set aside; revisit after a few thousand steps.
- p̂ ∈ [0.05, 0.95]: informative band. The training pool.
- p̂ ∈ (0.95, 1.0]: too easy. Promote to held-out eval, or drop.
Two non-obvious points. First, the buckets aren't fixed — they're calibrated to the current checkpoint's capability. When you re-stratify after 5k steps, prompts move buckets. Second, the "too hard" set is not garbage: it's the curriculum for the next stratification. Hold it; you'll need it.
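A minimal sketch of the bucketing step, assuming the rollout rewards for each prompt have already been collected from the current checkpoint. The bucket names and the dict-based interface are illustrative choices.

```python
def pass_rate(rewards: list[int]) -> float:
    """Empirical pass rate of one prompt from its binary rollout rewards."""
    return sum(rewards) / len(rewards)


def bucket(p_hat: float, lo: float = 0.05, hi: float = 0.95) -> str:
    """Assign a prompt to a difficulty bucket; the band [lo, hi] is inclusive."""
    if p_hat < lo:
        return "too_hard"
    if p_hat > hi:
        return "too_easy"
    return "informative"


def stratify(rollout_rewards: dict[str, list[int]]) -> dict[str, list[str]]:
    """Map prompt ids to buckets given K binary rewards per prompt."""
    pools: dict[str, list[str]] = {"too_hard": [], "informative": [], "too_easy": []}
    for prompt_id, rewards in rollout_rewards.items():
        pools[bucket(pass_rate(rewards))].append(prompt_id)
    return pools


if __name__ == "__main__":
    rewards = {
        "a": [0] * 8,            # all fail
        "b": [1] + [0] * 7,      # one pass in eight
        "c": [1] * 8,            # all pass
    }
    print(stratify(rewards))
```

With K = 8 the estimate p̂ can only take multiples of 1/8, so the 0.05 and 0.95 cutoffs effectively separate all-fail and all-pass groups from everything else. A larger K would be needed to resolve prompts near the band edges.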
Stage 6 — Mix
Two axes of mixing, and they interact:
- Domain mix. Math vs code vs tool-use vs general instruct. R1 trained almost pure math+code in the verifier phase, then mixed in 200k general SFT to avoid catastrophic forgetting of conversational ability. Tülu 3 explicitly weights its 940k SFT mixture across ~20 subsets.
- Reward-source mix. Verifier rewards (binary, sparse, sharp) and RM rewards (continuous, smooth, drifty) in the same training batch have advantages on different scales and with different statistics. Mixing them without normalization makes the larger-scale source dominate the gradient, regardless of which one is more informative.
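One way to keep the larger-scale source from dominating is to standardize advantages separately per reward source before they are combined. The sketch below does this with batch statistics. That is an assumption for illustration: a production pipeline might prefer running statistics, and the source labels are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def whiten_per_source(samples: list[tuple[str, float]]) -> list[float]:
    """Standardize each value using the mean and std of its own reward source.

    samples: (source_label, advantage_or_reward) pairs in batch order.
    Returns the whitened values in the same order.
    """
    by_source: dict[str, list[float]] = defaultdict(list)
    for source, value in samples:
        by_source[source].append(value)
    stats = {s: (fmean(v), pstdev(v)) for s, v in by_source.items()}

    whitened = []
    for source, value in samples:
        mean, std = stats[source]
        # A source with zero spread carries no signal in this batch.
        whitened.append((value - mean) / std if std > 0 else 0.0)
    return whitened


if __name__ == "__main__":
    batch = [("verifier", 0), ("verifier", 1), ("rm", 2.0), ("rm", 8.0)]
    print(whiten_per_source(batch))
```

After whitening, both sources have zero mean and unit standard deviation within the batch, so the mix ratio rather than the raw reward scale decides how much each source contributes.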
Stage 7 — Online filter (dynamic sampling)
This is DAPO's contribution as a data design pattern. At every training step, after rollouts come back, drop any group with p̂ ∈ {0, 1} before the loss is computed. The trade is: extra rollouts on freshly-sampled prompts to backfill the batch, in exchange for never wasting a gradient step on degenerate groups.
The relationship between stages 5 and 7 is the relationship between "schedule" and "interrupt." Stage 5 sets up the prompt pool with offline difficulty estimates; stage 7 catches the cases where those estimates are stale (the checkpoint has moved since stratification ran) or unlucky (low-K noise put a 0.4 prompt in the 0.0 bucket by chance).
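The online filter can be sketched as a loop that keeps drawing prompts until the batch is full of non-degenerate groups. This is an illustration of the pattern, not DAPO's actual implementation; `rollout` is a hypothetical callable that returns the K rewards for a prompt.

```python
from typing import Callable, Iterable


def is_degenerate(rewards: list[float]) -> bool:
    """A group is degenerate when every rollout received the same reward."""
    return len(set(rewards)) <= 1


def dynamic_batch(
    prompts: Iterable[str],
    rollout: Callable[[str], list[float]],
    batch_size: int,
) -> tuple[list[tuple[str, list[float]]], int]:
    """Fill a batch with informative groups, backfilling past degenerate ones.

    Returns (kept groups, number of degenerate groups dropped). The batch may
    come back short if the prompt stream runs out first.
    """
    kept: list[tuple[str, list[float]]] = []
    dropped = 0
    for prompt in prompts:
        rewards = rollout(prompt)
        if is_degenerate(rewards):
            dropped += 1  # rollouts were paid for but carry no gradient
            continue
        kept.append((prompt, rewards))
        if len(kept) == batch_size:
            break
    return kept, dropped
```

The `dropped` count is worth logging per step: it is the rollout cost of stale or noisy stage-5 estimates, and a rising value suggests it is time to re-stratify.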
Stage 8 — Recycle
Two distinct mechanisms, both essential at scale:
- Rejection-sampled SFT. Sample many rollouts from the current RL'd policy on a fresh prompt set. Keep only the ones with reward 1 (R1 reports ~600k reasoning keepers, mixed with ~200k non-reasoning SFT data). SFT on those keepers. Why it works: it compresses what the policy has learned into a curriculum the model can re-enter RL from, with a re-centered informative band. Why it can fail: keeping only successes biases the SFT toward easy prompts (mode collapse to the trivial subset).
- Replay buffers. If you're running async RL (lesson 19), rollouts produced under an older policy can still be useful if you correct for the importance ratio. The replay buffer keeps them around for one or two more steps before discarding.
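Both recycle mechanisms can be sketched in a few lines. The per-prompt cap in `select_keepers` is an assumption added here as one possible guard against the easy-prompt bias mentioned above; it is not something the lesson or R1 prescribes. Likewise the cap in `truncated_importance_weight` is an illustrative choice for bounding the variance of stale rollouts.

```python
import math
from collections import defaultdict


def select_keepers(
    rollouts: list[tuple[str, str, int]],
    max_per_prompt: int = 2,
) -> list[tuple[str, str]]:
    """Rejection sampling for SFT: keep reward-1 responses only.

    rollouts: (prompt_id, response, reward) triples.
    At most max_per_prompt distinct responses are kept per prompt, so prompts
    the policy solves every time cannot flood the SFT set (assumed heuristic).
    """
    kept: dict[str, list[str]] = defaultdict(list)
    for prompt_id, response, reward in rollouts:
        if reward != 1:
            continue
        responses = kept[prompt_id]
        if response not in responses and len(responses) < max_per_prompt:
            responses.append(response)
    return [(pid, resp) for pid, resps in kept.items() for resp in resps]


def truncated_importance_weight(
    logp_current: float, logp_behavior: float, cap: float = 2.0
) -> float:
    """Sequence-level ratio pi_current / pi_behavior for a replayed rollout.

    The ratio is capped from above to limit variance (cap value is assumed).
    """
    return min(math.exp(logp_current - logp_behavior), cap)
```

A rollout whose log-probability has not changed since it was generated gets weight 1.0; one the current policy now finds far more likely is clipped at the cap rather than dominating the update.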
Interactive: how curation moves your effective gradient
The diagram below simulates a prompt pool of 1000 prompts at three difficulty distributions, and a policy whose strength you control. The green curve is the signal density p̂(1−p̂); the blue histogram is the prompt distribution after the curation aggressiveness you set. Watch the KPI on the right — that is the number a data pipeline exists to move.
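For readers without access to the interactive diagram, the toy simulation below reproduces the idea in text form. It is not the widget's actual model: the logistic link between prompt difficulty and pass rate, the single Gaussian difficulty distribution, and the hard [0.05, 0.95] filter are all assumptions chosen for illustration.

```python
import math
import random


def pass_rates(difficulties: list[float], strength: float) -> list[float]:
    """Assumed model: pass rate = sigmoid(policy strength - prompt difficulty)."""
    return [1.0 / (1.0 + math.exp(d - strength)) for d in difficulties]


def mean_signal(ps: list[float]) -> float:
    """Average signal density p(1 - p) over a pool; 0.0 for an empty pool."""
    return sum(p * (1.0 - p) for p in ps) / len(ps) if ps else 0.0


def curate(ps: list[float], lo: float = 0.05, hi: float = 0.95) -> list[float]:
    """Keep only prompts whose pass rate lies in the informative band."""
    return [p for p in ps if lo <= p <= hi]


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [rng.gauss(0.0, 3.0) for _ in range(1000)]  # 1000 prompt difficulties
    for strength in (-4.0, 0.0, 4.0):
        ps = pass_rates(pool, strength)
        kept = curate(ps)
        print(
            f"strength={strength:+.1f}  kept={len(kept):4d}/1000  "
            f"signal before={mean_signal(ps):.3f}  after={mean_signal(kept):.3f}"
        )
```

Sweeping `strength` on a fixed pool shows the informative band sliding across the difficulty distribution, and the before/after columns show how much average signal per rollout the filter recovers.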
The six data-driven failure modes
The widget makes one point about compute waste. The pipeline framing makes a second point: every common failure of an RL run that looks like "the algorithm is broken" turns out, on inspection, to be a data design choice that the algorithm honestly reflected. The taxonomy below maps the symptom you see on the dashboard to the pipeline stage that should have caught it — which is also the stage you should instrument first when you see the symptom.
| Symptom | Data root cause | The stage that should have caught it |
|---|---|---|
| Reward signal flatlines after N steps | Prompt pool fully solved; p̂ → 1 for almost everything | Stage 5 re-stratify; stage 8 inject harder pool |
| Responses become absurdly long | RM-rated prompts dominate; RM has length bias | Stage 6 per-source whitening; stage 6 cap mix ratio |
| Mode collapse to a single phrasing | RS-SFT recycle ran on a too-narrow keeper set | Stage 8 RS-SFT must mix in non-keepers or general SFT |
| Reported eval climbs to ceiling while the gap to fresh held-out problems stays positive | Contamination: eval prompt is in training pool | Stage 4 decon; n-gram + embedding match against every reported eval |
| Policy "solves" the verifier in unexpected ways | Verifier-feasibility check passed prompts the verifier can be gamed on | Stage 3 distractor + paraphrase test |
| Training stalls and per-step variance spikes | 40%+ degenerate groups; mostly all-pass or all-fail rollouts | Stage 7 dynamic sampling; revisit stage 5 |
Reading R1 as a sequence of data choices
Lesson 23 walks through R1 as a sequence of algorithm choices. With the pipeline frame in hand, the same recipe re-reads as a sequence of data choices. The four stages correspond to the four roles a data pipeline plays.
Read down the rows: each stage shifts the model's capability, and each data choice moves the prompt pool to keep the informative band centered. Stage 3's purpose, in this frame, is not "more supervised data" — it is distribution shift in the prompt pool. The model can solve too many of the original prompts; rather than spend training budget on those zeros, the recipe re-centers via self-distillation and re-enters RL with the band intact.
A diagnostic checklist for your own pipeline
Per stage, one metric you should be able to read off the dashboard at any time. If you can't, that's the next thing to instrument.
| Stage | Healthy metric | Alarm threshold |
|---|---|---|
| 1 · Source | Prompt-count by source, written down | "It's a mix" with no numbers |
| 2 · License + safety | % rejected, sampled hand-audit | 0% rejected (filter isn't running) |
| 3 · Verifier-feasibility | Distractor + paraphrase pass rates per prompt | > 5% of prompts fail either |
| 4 · Dedup + decon | n-gram overlap to every reported eval | any non-trivial match |
| 5 · Stratify | Histogram of p̂ on current checkpoint | > 40% mass outside [0.05, 0.95] |
| 6 · Mix | Per-source reward mean & std, per-step | RM mean drifting > 0.1 / 1k steps |
| 7 · Online filter | Degenerate-group rate after dynamic sampling | > 10% still degenerate |
| 8 · Recycle | Diversity (n-gram entropy) of RS-SFT keepers | collapse: top-5 templates > 50% mass |
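A few of the alarm thresholds in the table translate directly into code. The sketch below covers stages 2, 5, and 7 using the thresholds stated above; the function names and input shapes are illustrative, and the remaining stages would need pipeline-specific inputs.

```python
def safety_filter_alarm(num_rejected: int) -> bool:
    """Stage 2: zero rejections suggests the filter is not running."""
    return num_rejected == 0


def stratify_alarm(
    p_hats: list[float], lo: float = 0.05, hi: float = 0.95,
    max_outside: float = 0.40,
) -> bool:
    """Stage 5: alarm when more than 40% of prompts fall outside the band."""
    outside = sum(1 for p in p_hats if p < lo or p > hi)
    return outside / len(p_hats) > max_outside


def online_filter_alarm(
    groups: list[list[int]], max_degenerate: float = 0.10
) -> bool:
    """Stage 7: alarm when more than 10% of post-filter groups are degenerate."""
    degenerate = sum(1 for rewards in groups if len(set(rewards)) <= 1)
    return degenerate / len(groups) > max_degenerate


if __name__ == "__main__":
    print(stratify_alarm([0.0, 0.0, 1.0, 0.5, 0.5]))   # 60% outside the band
    print(online_filter_alarm([[1, 0], [0, 0]]))        # half the groups degenerate
```

Wiring checks like these into the training dashboard makes the "if you can't read it off, instrument it" advice actionable per step rather than per post-mortem.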
What's not in this lesson and where to look
The pipeline frame keeps the lesson finite by collapsing several real and important sub-topics into single sentences. The ones you will hit in production, with pointers:
- Synthetic prompt generation. Stage 1 calls out "model-synth" in one phrase. The actual mechanics — self-instruct, evol-instruct, persona-hub, generator/solver model splits, difficulty-conditioned generation — deserve their own treatment. Sanity check: a synthetic prompt distribution looks like its generator, not like reality.
- Chat / prompt templating drift. The exact system prompt, BOS/EOS placement, and tool-call schema used at rollout must match the trainer's tokenizer state. A silent template mismatch causes log-prob disagreement (the bug surface lesson 25 names "silent"). Stage 3 should include a templating round-trip check.
- SFT-size vs RL-prompt-count ratio. Tülu 3 trains on ~940k SFT and an order-of-magnitude fewer RLVR prompts. The right ratio is empirical and recipe-specific; the lesson lets the recipes (lesson 23) carry that point.
- Length / token-budget stratification. Long prompts dominate rollout FLOPs. Stratifying by expected output length is a separate axis from difficulty stratification and equally important on bottleneck pages (lesson 28).
- Multi-turn / agentic prompt curation. When the trajectory is a multi-turn rollout with tool calls (lesson 08), the "prompt" is really an episode template plus tool-response distribution. Curating those distributions is a research problem in its own right.
- Replay-buffer importance sampling. The stage 8 sentence on replay glosses over the IS bias/variance tradeoff that async pipelines (lesson 19) introduce.
What this lesson does contribute: a single axis — the signal-density curve p(1−p) — onto which every data choice projects. Lesson 03's KL anchor and lesson 18's verifier define the reward signal; this lesson defines the set of inputs that signal is being computed on. Lesson 13's DAPO is stage 7 of an eight-stage pipeline whose offline stages do most of the work before any rollout runs. Lesson 23's recipes (R1, Tülu 3, InstructGPT) all make implicit data-pipeline choices; here those choices are explicit and named. Lesson 27 listed "difficulty stratification" and "mixing verifiable + preference" as missing — stages 5 and 6 are those concepts, recast as gradient-signal engineering.