Data pipelines & curation — engineering the gradient signal

Lesson 18 told you where the reward number comes from. This lesson asks the question one step upstream: where do the prompts come from, and which ones are worth spending rollouts on? The answer reframes data curation from "good examples to imitate" into something more precise — gradient-signal engineering.

Where this lesson sits

Lesson 18 covered the verifier — the function that emits a number for a (prompt, response) pair. A verifier is useless without prompts to feed it, and not every prompt is worth feeding it. This lesson is the data side of the reward signal: the pipeline that sources, filters, stratifies, mixes, and recycles prompts so the verifier you just built actually produces gradient. After this we leave the reward half of Part III and turn to the systems half (lessons 19–22).

Why data is different in RL than SFT

In supervised fine-tuning, data is the target. You hand the model (x, y*) pairs and the loss −log π(y* | x) tells the model to be the demonstrator. The curation question for SFT is "what behavior do we want?" — and the answer is the response field of your dataset.

In RL post-training, data is only the prompt x. The model produces its own candidate y via rollout. The verifier or reward model scores it. The loss reshapes π toward responses that scored high. Your dataset never says what a good response looks like — the verifier does.

Reframe

RL data curation is not "what should the model say?" It's "which prompts produce useful gradient signal per FLOP?" The unit of analysis is a prompt and the gradient it generates against the current policy. Everything in this lesson follows from that one shift.

This is why a curriculum that works for SFT — "give the model lots of examples of correct behavior" — actively hurts in RL. A prompt the model already solves perfectly is a prompt with zero gradient. The most "high quality" examples for SFT are often the lowest-information examples for RL.

The single most important quantity: per-prompt pass rate

Pick a prompt x. Sample K rollouts y₁, …, y_K from the current policy. Score each with a binary verifier so each r_i ∈ {0, 1}. Let p = P(r=1 | x) be the true pass rate of the prompt under the current policy.

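Now write the advantage for the rollouts of this prompt under a group-mean baseline (RLOO, Dr.GRPO, REINFORCE-with-group-baseline). Idealize the baseline as the true pass rate p, so each A_i = r_i − p. (With a finite group the baseline is itself an estimate of p, which rescales the result by a constant that depends only on K: (K − 1)/K for the plain group mean, K/(K − 1) for the leave-one-out mean.) Expected squared advantage: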

E[ A² ] = p (1 − p)

That is just the variance of a Bernoulli with mean p. It is also, up to a constant, the magnitude of the per-prompt contribution to the unnormalized policy gradient. The function p(1 − p) is the most important curve in this lesson: it is zero at p = 0 and p = 1 and peaks at 0.25 when p = 0.5.

Two consequences fall out immediately and the rest of the lesson is downstream of them:

At p̂ = 0 (all K rollouts fail) or p̂ = 1 (all succeed), the empirical advantage of every rollout in the group is identically zero. The gradient contribution of that prompt is structurally zero, no matter how many rollouts you took.
You already paid for those rollouts. K decode passes hit the inference engine, the verifier ran, log-probs were computed under the trainer and reference — and the gradient is zero. That is not "small signal." It is wasted FLOPs.
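A minimal Python sketch of the two quantities above, assuming the K rollouts are independent Bernoulli draws with true pass rate p. The helper names `signal_density` and `degenerate_prob`, the sample pass rates, and the group size K = 8 are illustrative choices, not part of any library or recipe.

```python
def signal_density(p: float) -> float:
    """Variance of a Bernoulli(p) reward, i.e. the p(1 - p) curve."""
    return p * (1.0 - p)


def degenerate_prob(p: float, k: int) -> float:
    """Probability that k independent rollouts come back all-pass or all-fail."""
    return p ** k + (1.0 - p) ** k


# Illustrative pass rates only; K = 8 is an assumed group size.
for p in (0.02, 0.4, 0.95):
    print(f"p={p:.2f}  p(1-p)={signal_density(p):.4f}  "
          f"P(degenerate | K=8)={degenerate_prob(p, 8):.3f}")
```

The second function is the useful one for budgeting: it says how often a prompt with a given true pass rate produces a group with no gradient at all, which is a different question from how large the gradient is when the group survives.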

Caveat — which estimators is this clean for?

The p(1−p) identity is exact for estimators that center on the group mean and don't divide by group std: RLOO, Dr.GRPO, and REINFORCE-with-group-baseline (lessons 12, 14, 09). Vanilla GRPO (lesson 11) divides each advantage by the group's standard deviation, which normalizes magnitudes within a non-degenerate group — so for GRPO specifically, p(1−p) governs the probability the group survives at all (the std is non-zero), not the gradient magnitude once it does. PPO with a learned critic (lesson 10) substitutes a different baseline entirely. For non-binary or shaped rewards, replace p(1−p) with Var(R | x) — same shape, same conclusion. The curve below is the binary case because that is what RLVR ships in production.
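The difference between centering and std-normalizing can be seen on three toy groups. This is a sketch, not any library's implementation: the function names are hypothetical, the population standard deviation (`statistics.pstdev`) and the `eps` constant are assumptions, and real GRPO implementations may differ on both.

```python
from statistics import mean, pstdev


def group_mean_advantages(rewards):
    """Center rewards on the group mean; no division by std."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def std_normalized_advantages(rewards, eps=1e-6):
    """Center on the group mean, then divide by the group std (plus eps)."""
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps  # eps is an assumed stabilizer
    return [(r - baseline) / scale for r in rewards]


def mean_square(values):
    """Average squared advantage within one group."""
    return mean([v * v for v in values])


groups = {
    "one_pass": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "half_pass": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "all_fail": [0.0] * 8,
}
for name, rewards in groups.items():
    centered = mean_square(group_mean_advantages(rewards))
    normalized = mean_square(std_normalized_advantages(rewards))
    print(f"{name:10s} centered={centered:.4f} normalized={normalized:.4f}")
```

With centering only, the mean squared advantage inside a group equals p̂(1 − p̂), so it shrinks as the group gets more lopsided. With std normalization it is close to 1 for every non-degenerate group and exactly 0 for the degenerate one, which is the "survives at all" behavior described above.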

Why this matters more than it sounds

Lesson 28 attributes 60–80% of RL wall-clock to rollout. If a third of your prompts are degenerate (all-pass or all-fail under the current policy) and cost about as much to roll out as the rest, roughly 20–27% of total training time (one third of 60–80%) goes to generating zero-gradient tokens. That is the cost a data pipeline exists to recover.

The "edge of capability" is non-stationary

If pass rate is the curation target, the second crucial fact is that pass rate is a function of the current policy. Three points on the same prompt during a single training run:

At step 0, the policy is the base model. A hard AIME prompt has p̂ ≈ 0.02 — degenerate, all-fail.
At step 5,000, the policy has learned to chain steps. The same prompt now has p̂ ≈ 0.4 — squarely in the informative band.
At step 20,000, the policy solves it routinely. p̂ ≈ 0.95 — degenerate, all-pass.

The informative band slides over the course of training. A static prompt pool — even one perfectly difficulty-balanced for the base model — will decay into mostly-degenerate as the policy improves. Curation is not a one-shot dataset construction step; it is a control loop that runs alongside training.

This is the root cause of three otherwise puzzling observations in the literature:

R1-Zero's "aha moment" requires hard prompts. The reasoning behavior emerges only because the prompts stay just out of reach — too easy a pool and there's no pressure for chain-of-thought.
DAPO's "dynamic sampling" exists. It is the online version of this: at training time, oversample and discard groups whose K rewards are all equal (collapses to p̂ ∈ {0, 1} under a binary verifier) before computing the loss. The lesson here is the offline complement — don't generate them in the first place.
R1's stage 3 self-distillation works. Rejection-sampled SFT (keep only successful rollouts and SFT on them) is a way to compress what the model has learned so it can re-enter RL with the informative band re-centered on newly-hard prompts.

The pipeline, linearized

With the gradient-signal frame in hand, the rest of the pipeline writes itself. Eight stages, each one a filter that improves expected gradient per FLOP at some bookkeeping cost.

Stage 1 — Source

Where prompts originate. Four kinds, in rough order of decreasing cost per prompt and increasing quality variance:

Human-written. Expensive (~$1–10 per prompt), high quality, low scale (10³–10⁵). Used for cold-start SFT and high-stakes verifier targets where ambiguity is costly (InstructGPT's 13k demos, R1's "few thousand" cold-start reasoning chains).
Public benchmarks & competitions. Free, high quality, low-to-medium scale (10⁴–10⁵). MATH, GSM8K, AIME, HumanEval, MBPP. The unspoken cost: if it's public, it might be in pretraining, which moves the contamination problem to stage 4.
Model-synthesized. Cheap per-prompt (a single decode), unlimited scale (10⁶+), quality depends entirely on the generator and the filter that follows. R1's reasoning corpus is essentially this. The danger: a synthetic prompt distribution looks like its generator, not like the real world.
Logged / scraped traffic. Cheap, realistic, license-ambiguous. Most production deployments lean on this for the "general assistant" half of their mix.

The source stage's job is not to "pick the best one" but to state the distribution explicitly. Every later stage operates on this distribution; if you can't write it down, you'll struggle to debug why training stalled three weeks in.

Stage 2 — License + safety filter

Cheap, mechanical, easy to skip and then expensive to redo. Drop anything that violates license, leaks PII, or contains content you don't want the model to optimize toward. Skipping this stage typically blocks a release for a legal review pass on the entire training pool — every prompt has to be re-cleared, retroactively.

Stage 3 — Verifier-feasibility

The verifier from lesson 18 needs something to compare against. For math: a gold answer that the verifier's normalization can canonicalize. For code: a unit-test suite that distinguishes correct from incorrect solutions on at least the held-out cases. For tool-use: an end-state predicate. A prompt without a robust check is unusable for RLVR — it'll get rewarded for the wrong reasons.

Concretely, a healthy stage-3 check, for each prompt, runs the verifier against (a) the gold answer (should return 1), (b) a hand-picked incorrect distractor (should return 0), and (c) a paraphrase of the gold answer (should usually return 1). Prompts where the verifier fails any of these are verifier-incompatible regardless of how good the prompt itself is.
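The three probes can be sketched as a small report function. Everything named here is hypothetical: `feasibility_report` is an illustrative helper, and `toy_numeric_verifier` is a stand-in for the real verifier from lesson 18, assumed to take `(response, gold)` and return 0 or 1.

```python
from typing import Callable, Dict

Verifier = Callable[[str, str], int]  # assumed shape: (response, gold) -> 0 or 1


def feasibility_report(verify: Verifier, gold: str, distractor: str,
                       paraphrase: str) -> Dict[str, bool]:
    """Run the three stage-3 probes for one prompt."""
    return {
        "gold_accepted": verify(gold, gold) == 1,
        "distractor_rejected": verify(distractor, gold) == 0,
        "paraphrase_accepted": verify(paraphrase, gold) == 1,
    }


def toy_numeric_verifier(response: str, gold: str) -> int:
    """Toy stand-in verifier: compare answers as floats after light cleanup."""
    def canon(text: str) -> float:
        return float(text.strip().replace(",", "").replace("$", ""))
    try:
        return int(canon(response) == canon(gold))
    except ValueError:
        return 0  # unparseable text scores 0


report = feasibility_report(toy_numeric_verifier, gold="1,250",
                            distractor="1205", paraphrase="$1250.00")
print(report)
```

Reporting the three probes separately, rather than a single pass/fail, keeps the paraphrase probe soft: the text above says it should usually return 1, so a pipeline might tolerate some paraphrase misses while treating a gold or distractor failure as disqualifying.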

Stage 4 — Dedup + decontamination

Two distinct jobs that share infrastructure:

Dedup. Near-duplicate prompts within the training pool collapse to one effective example. MinHash with Jaccard threshold in [0.7, 0.8] on character or token n-grams catches the obvious cases (0.8 is the common pretraining default; 0.7 is more aggressive); exact-string dedup catches the embarrassing ones. Why it matters in RL specifically: a duplicated prompt gets sampled more often than it should, biasing the gradient toward whatever quirk that one prompt teaches.
Decontamination. Eval prompts that leak into the training pool make your held-out numbers a lie. N-gram exact-match (8 to 13 grams are typical) against every eval set you'll report against, plus an embedding-based near-match sweep. The damage from a single contaminated AIME problem is "your headline number is overstated by an unknown amount."
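Both jobs reduce to set operations on n-grams. The sketch below computes exact Jaccard similarity, which is the quantity MinHash approximates at scale; it is not a MinHash implementation. Whitespace tokenization, the 3-gram shingle size for dedup, and all function names are assumptions made for illustration. The embedding-based sweep is not shown.

```python
def ngrams(text: str, n: int) -> set:
    """Lowercased, whitespace-tokenized n-grams (an assumed tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_near_duplicate(p1: str, p2: str, n: int = 3,
                      threshold: float = 0.8) -> bool:
    """Dedup check: n-gram Jaccard at or above the threshold."""
    return jaccard(ngrams(p1, n), ngrams(p2, n)) >= threshold


def is_contaminated(train_prompt: str, eval_prompts, n: int = 8) -> bool:
    """Decontamination check: any shared n-gram with any eval prompt."""
    grams = ngrams(train_prompt, n)
    return any(grams & ngrams(e, n) for e in eval_prompts)


eval_set = ["find the sum of all positive integers n such that "
            "n squared plus one is divisible by five"]
leaky = ("Problem 7: Find the sum of all positive integers n such that "
         "the result is prime")
clean = "compute the area of a triangle with sides three four and five"
print(is_contaminated(leaky, eval_set), is_contaminated(clean, eval_set))
```

Exact n-gram overlap of this kind catches verbatim reuse only; paraphrased prompts pass straight through it.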

A common silent failure

Model-synthesized prompts (stage 1, type 3) regenerated from public benchmarks land in the training pool with surface paraphrasing that defeats n-gram dedup but preserves the answer. The model "solves" eval problems it has memorized. Test this directly: at the end of training, compare per-eval-prompt pass rate to the pretrained base. If a benchmark went from 5% to 95% but training never touched the topic, that's not learning, it's leakage.

Stage 5 — Difficulty stratification

The stage that makes the most direct use of the pass-rate frame. For each surviving prompt, sample K=8 rollouts from your current checkpoint (not the base — the one you're about to train from). Bucket by p̂:

p̂ ∈ [0.0, 0.05): too hard. Set aside; revisit after a few thousand steps.
p̂ ∈ [0.05, 0.95]: informative band. The training pool.
p̂ ∈ (0.95, 1.0]: too easy. Promote to held-out eval, or drop.

Two non-obvious points. First, the buckets aren't fixed — they're calibrated to the current checkpoint's capability. When you re-stratify after 5k steps, prompts move buckets. Second, the "too hard" set is not garbage: it's the curriculum for the next stratification. Hold it; you'll need it.
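The bucketing step can be sketched in a few lines. The function name, the bucket labels, and the input shape (a mapping from prompt id to number of passing rollouts) are assumptions; values landing exactly on 0.05 or 0.95 are treated as informative here.

```python
from typing import Dict, List


def stratify(pass_counts: Dict[str, int], k: int = 8,
             lo: float = 0.05, hi: float = 0.95) -> Dict[str, List[str]]:
    """Bucket prompts by empirical pass rate p_hat = passes / k."""
    buckets: Dict[str, List[str]] = {
        "too_hard": [], "informative": [], "too_easy": [],
    }
    for prompt_id, passes in pass_counts.items():
        p_hat = passes / k
        if p_hat < lo:
            buckets["too_hard"].append(prompt_id)   # hold for a later pass
        elif p_hat > hi:
            buckets["too_easy"].append(prompt_id)   # eval candidate or drop
        else:
            buckets["informative"].append(prompt_id)
    return buckets


# Hypothetical pass counts out of K = 8 rollouts on the current checkpoint.
counts = {"a": 0, "b": 3, "c": 8, "d": 1, "e": 7}
print(stratify(counts))
```

One consequence worth noticing: with K = 8 the empirical pass rate only takes values that are multiples of 1/8, so thresholds of 0.05 and 0.95 are equivalent to "0 of 8" and "8 of 8". Finer thresholds only start to bite with a larger K.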

Stage 6 — Mix

Two axes of mixing, and they interact:

Domain mix. Math vs code vs tool-use vs general instruct. R1 trained almost pure math+code in the verifier phase, then mixed in 200k general SFT to avoid catastrophic forgetting of conversational ability. Tülu 3 explicitly weights its 940k SFT mixture across ~20 subsets.
Reward-source mix. Verifier rewards (binary, sparse, sharp) and RM rewards (continuous, smooth, drifty) in the same training batch have advantages on different scales and with different statistics. Mixing them without normalization makes the larger-scale source dominate the gradient, regardless of which one is more informative.

The reward-source mix trap

Naive mix: concatenate verifier-rated and RM-rated prompts in one batch, normalize advantages globally. The RM's mean reward drifts up over training (the RM doesn't know the policy is improving); the verifier's mean reward is anchored. After enough steps, the verifier-rated half has negative-mean advantage and is fighting the RM-rated half. Three common fixes: (a) per-source reward whitening — compute mean and std within each source and normalize before mixing; (b) cap the RM contribution by clipping its reward range; (c) run per-source batches (Tülu 3's RLVR phase is verifier-only by step). All three appear in the literature; the right choice depends on how stationary your RM is.
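Fix (a) can be sketched as follows, assuming each sample in the batch is tagged with its reward source. The function name, the `(source, reward)` tuple shape, the use of population std, and the `eps` constant are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def whiten_per_source(samples, eps=1e-6):
    """Normalize rewards to zero mean and unit std within each source.

    samples: list of (source, reward) pairs; output preserves input order.
    """
    by_source = defaultdict(list)
    for source, reward in samples:
        by_source[source].append(reward)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_source.items()}
    return [(r - stats[s][0]) / (stats[s][1] + eps) for s, r in samples]


# Hypothetical batch: binary verifier rewards next to larger-scale RM scores.
batch = [("verifier", 1.0), ("rm", 3.1), ("verifier", 0.0), ("rm", 3.5),
         ("verifier", 0.0), ("rm", 2.9), ("verifier", 1.0), ("rm", 3.3)]
for (source, raw), white in zip(batch, whiten_per_source(batch)):
    print(f"{source:8s} raw={raw:.1f} whitened={white:+.3f}")
```

After whitening, each source has zero mean on its own, so a drift in the RM's mean no longer pushes the verifier-rated half toward negative advantage. The cost is that the statistics are only as good as the per-source sample count in the batch.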

Stage 7 — Online filter (dynamic sampling)

This is DAPO's contribution as a data design pattern. At every training step, after rollouts come back, drop any group with p̂ ∈ {0, 1} before the loss is computed. The trade is: extra rollouts on freshly-sampled prompts to backfill the batch, in exchange for never wasting a gradient step on degenerate groups.

The relationship between stages 5 and 7 is the relationship between "schedule" and "interrupt." Stage 5 sets up the prompt pool with offline difficulty estimates; stage 7 catches the cases where those estimates are stale (the checkpoint has moved since stratification ran) or unlucky (low-K noise put a 0.4 prompt in the 0.0 bucket by chance, which happens about 1.7% of the time at K=8).
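The filter-and-backfill loop can be sketched as below. This is an illustration of the pattern, not DAPO's reference implementation: `fill_batch`, the `rollout_fn` callback, and the `max_prompts` safety cap are hypothetical, and the toy rollout table stands in for real generation plus verification.

```python
def is_degenerate(rewards) -> bool:
    """True when every reward in the group is identical."""
    return len(set(rewards)) <= 1


def fill_batch(prompts, rollout_fn, batch_size: int, max_prompts: int):
    """Keep sampling prompts until batch_size non-degenerate groups are kept.

    Returns (kept_groups, prompts_tried). max_prompts is an assumed cap so a
    fully degenerate pool cannot loop forever.
    """
    kept = []
    tried = 0
    for prompt in prompts:
        if tried >= max_prompts:
            break
        tried += 1
        rewards = rollout_fn(prompt)
        if not is_degenerate(rewards):
            kept.append((prompt, rewards))
            if len(kept) == batch_size:
                break
    return kept, tried


# Toy stand-in for "generate K rollouts and verify them".
toy_rewards = {
    "easy": [1, 1, 1, 1], "hard": [0, 0, 0, 0],
    "mid1": [1, 0, 0, 1], "mid2": [0, 0, 1, 0], "mid3": [1, 1, 1, 0],
}
kept, tried = fill_batch(list(toy_rewards), toy_rewards.get,
                         batch_size=2, max_prompts=10)
print([p for p, _ in kept], tried)
```

The ratio of prompts tried to groups kept is the price of the online filter, and it is the number that better offline stratification is supposed to push toward 1.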

Stage 8 — Recycle

Two distinct mechanisms, both essential at scale:

Rejection-sampled SFT. Sample many rollouts from the current RL'd policy on a fresh prompt set. Keep only the ones with reward 1 (R1 reports ~600k reasoning keepers, mixed with ~200k non-reasoning SFT data). SFT on those keepers. Why it works: it compresses what the policy has learned into a curriculum the model can re-enter RL from, with a re-centered informative band. Why it can fail: keeping only successes biases the SFT toward easy prompts (mode collapse to the trivial subset).
Replay buffers. If you're running async RL (lesson 19), rollouts produced under an older policy can still be useful if you correct for the importance ratio. The replay buffer keeps them around for one or two more steps before discarding.
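The keeper-selection half of rejection-sampled SFT, and the easy-prompt bias it introduces, can be made concrete with a toy example. The function names and the `(prompt_id, response, reward)` tuple shape are assumptions for illustration; the rollouts are made up.

```python
from collections import Counter


def select_keepers(rollouts):
    """Keep only rollouts with reward 1.

    rollouts: list of (prompt_id, response, reward) tuples.
    """
    return [(pid, resp) for pid, resp, reward in rollouts if reward == 1]


def keeper_share_by_prompt(keepers):
    """Fraction of the keeper set contributed by each prompt."""
    counts = Counter(pid for pid, _ in keepers)
    total = sum(counts.values())
    return {pid: n / total for pid, n in counts.items()}


# Hypothetical rollouts: an easy prompt passes 4 of 4, a hard one 1 of 4.
rollouts = (
    [("easy", f"easy-answer-{i}", 1) for i in range(4)]
    + [("hard", "hard-answer-0", 1)]
    + [("hard", f"hard-miss-{i}", 0) for i in range(3)]
)
keepers = select_keepers(rollouts)
print(len(keepers), keeper_share_by_prompt(keepers))
```

Both prompts received the same number of rollouts, yet the easy one supplies most of the keepers. Tracking this share per prompt is one way to see the bias toward the trivial subset before it reaches the SFT step.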

Interactive: how curation moves your effective gradient

The diagram below simulates a prompt pool of 1000 prompts at three difficulty distributions, and a policy whose strength you control. The green curve is the signal density p̂(1−p̂); the blue histogram is the prompt distribution after the curation aggressiveness you set. Watch the KPI on the right — that is the number a data pipeline exists to move.

Gradient-signal density simulator

Slide policy strength right to simulate training progress (it shifts the prompt pool's effective pass-rate to the right). Slide curation up to filter aggressively toward p̂ ≈ 0.5. Watch the wasted-compute and step-efficiency KPIs.

Controls (defaults): policy strength 20, curation 0, rollouts K = 8. KPIs shown: prompts kept (1000 at the defaults), expected signal, % compute wasted, signal per FLOP.

What to try. The "% wasted" KPI is the share of K-rollout groups that came back all-pass or all-fail — the DAPO degenerate-group condition. (1) Curation = 0, policy = 20: an untuned pool against an early policy wastes 40%+ to the all-fail tail. (2) Curation = 0, policy = 80: the same untuned pool now wastes 40%+ to the all-pass tail — same problem, opposite side, and exactly what happens to a static pool as training proceeds. (3) Curation = 80, policy = 80: re-stratification recenters the band and waste drops well below 15%. This is the whole reason stage 5 has to be re-run periodically — not once at the start.
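For readers without the widget, the same qualitative effect can be reproduced with a toy model. This is not the widget's implementation and will not match its numbers: the logistic link between policy strength and prompt difficulty, the `scale` parameter, and the evenly spaced difficulty pool are all assumptions chosen for simplicity.

```python
import math


def pass_rate(strength: float, difficulty: float, scale: float = 10.0) -> float:
    """Assumed logistic link between policy strength and prompt difficulty."""
    return 1.0 / (1.0 + math.exp(-(strength - difficulty) / scale))


def pool_kpis(strength: float, difficulties, k: int = 8):
    """Return (expected degenerate-group share, mean p(1 - p)) for a pool."""
    rates = [pass_rate(strength, d) for d in difficulties]
    wasted = sum(p ** k + (1.0 - p) ** k for p in rates) / len(rates)
    signal = sum(p * (1.0 - p) for p in rates) / len(rates)
    return wasted, signal


static_pool = list(range(0, 101))  # hypothetical difficulties, 0 to 100
for strength in (20, 50, 80):
    wasted, signal = pool_kpis(strength, static_pool)
    print(f"strength={strength}  wasted={wasted:.3f}  signal={signal:.3f}")
```

Under these assumptions a static pool wastes the same share of groups against a weak policy (all-fail tail) and a strong one (all-pass tail), and wastes least when the policy sits in the middle of the pool's difficulty range. Re-stratification amounts to moving the pool so that the policy stays near that middle.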

The six data-driven failure modes

The widget makes one point about compute waste. The pipeline framing makes a second point: every common failure of an RL run that looks like "the algorithm is broken" turns out, on inspection, to be a data design choice that the algorithm honestly reflected. The taxonomy below maps the symptom you see on the dashboard to the pipeline stage that should have caught it — which is also the stage you should instrument first when you see the symptom.

Symptom	Data root cause	The stage that should have caught it
Reward signal flatlines after N steps	Prompt pool fully solved; p̂ → 1 for almost everything	Stage 5 re-stratify; stage 8 inject harder pool
Responses become absurdly long	RM-rated prompts dominate; RM has length bias	Stage 6 per-source whitening; stage 6 cap mix ratio
Mode collapse to a single phrasing	RS-SFT recycle ran on a too-narrow keeper set	Stage 8 RS-SFT must mix in non-keepers or general SFT
Held-out gap stays positive but eval climbs to ceiling	Contamination: eval prompt is in training pool	Stage 4 decon; n-gram + embedding match against every reported eval
Policy "solves" the verifier in unexpected ways	Verifier-feasibility check passed prompts the verifier can be gamed on	Stage 3 distractor + paraphrase test
Training stalls and per-step variance spikes	40%+ degenerate groups; mostly all-pass or all-fail rollouts	Stage 7 dynamic sampling; revisit stage 5

Reading R1 as a sequence of data choices

Lesson 23 walks through R1 as a sequence of algorithm choices. With the pipeline frame in hand, the same recipe re-reads as a sequence of data choices. The four stages correspond to the four roles a data pipeline plays.

Read down the rows: each stage shifts the model's capability, and each data choice moves the prompt pool to keep the informative band centered. Stage 3's purpose, in this frame, is not "more supervised data" — it is distribution shift in the prompt pool. The model can solve too many of the original prompts; rather than spend training budget on those zeros, the recipe re-centers via self-distillation and re-enters RL with the band intact.

A diagnostic checklist for your own pipeline

Per stage, one metric you should be able to read off the dashboard at any time. If you can't, that's the next thing to instrument.

Stage	Healthy metric	Alarm threshold
1 · Source	Prompt-count by source, written down	"It's a mix" with no numbers
2 · License + safety	% rejected, sampled hand-audit	0% rejected (filter isn't running)
3 · Verifier-feasibility	Distractor + paraphrase pass rates per prompt	> 5% of prompts fail either
4 · Dedup + decon	n-gram overlap to every reported eval	any non-trivial match
5 · Stratify	Histogram of p̂ on current checkpoint	> 40% mass outside [0.05, 0.95]
6 · Mix	Per-source reward mean & std, per-step	RM mean drifting > 0.1 / 1k steps
7 · Online filter	Degenerate-group rate after dynamic sampling	> 10% still degenerate
8 · Recycle	Diversity (n-gram entropy) of RS-SFT keepers	collapse: top-5 templates > 50% mass

What's not in this lesson and where to look

The pipeline frame keeps the lesson finite by collapsing several real and important sub-topics into single sentences. The ones you will hit in production, with pointers:

Synthetic prompt generation. Stage 1 calls out "model-synth" in one phrase. The actual mechanics — self-instruct, evol-instruct, persona-hub, generator/solver model splits, difficulty-conditioned generation — deserve their own treatment. Sanity check: a synthetic prompt distribution looks like its generator, not like reality.
Chat / prompt templating drift. The exact system prompt, BOS/EOS placement, and tool-call schema used at rollout must match the trainer's tokenizer state. A silent template mismatch causes log-prob disagreement (the bug surface lesson 25 names "silent"). Stage 3 should include a templating round-trip check.
SFT-size vs RL-prompt-count ratio. Tülu 3 trains on ~940k SFT and an order-of-magnitude fewer RLVR prompts. The right ratio is empirical and recipe-specific; the lesson lets the recipes (lesson 23) carry that point.
Length / token-budget stratification. Long generations dominate rollout FLOPs. Stratifying by expected output length is a separate axis from difficulty stratification and matters just as much wherever rollout is the bottleneck (lesson 28).
Multi-turn / agentic prompt curation. When the trajectory is a multi-turn rollout with tool calls (lesson 08), the "prompt" is really an episode template plus tool-response distribution. Curating those distributions is a research problem in its own right.
Replay-buffer importance sampling. The stage 8 sentence on replay glosses over the IS bias/variance tradeoff that async pipelines (lesson 19) introduce.

What this lesson does contribute: a single axis — the signal-density curve p(1−p) — onto which every data choice projects. Lesson 03's KL anchor and lesson 18's verifier define the reward signal; this lesson defines the set of inputs that signal is being computed on. Lesson 13's DAPO is stage 7 of an eight-stage pipeline whose offline stages do most of the work before any rollout runs. Lesson 23's recipes (R1, Tülu 3, InstructGPT) all make implicit data-pipeline choices; here those choices are explicit and named. Lesson 27 listed "difficulty stratification" and "mixing verifiable + preference" as missing — stages 5 and 6 are those concepts, recast as gradient-signal engineering.

Takeaway

In RL post-training, data is not a target — it is a gradient-signal substrate. The right curation question is not "what should the model say?" but "where is p(1−p) non-zero under the current policy, and what is the cheapest pipeline that keeps the prompt distribution there?" Eight stages, one feedback loop, periodically re-run as the policy moves.