rl_lessons / 18a · data pipelines lesson 4½ / 9 · part III

Data pipelines & curation — engineering the gradient signal

Lesson 18 told you where the reward number comes from. This lesson asks the question one step upstream: where do the prompts come from, and which ones are worth spending rollouts on? The answer reframes data curation from "good examples to imitate" into something more precise — gradient-signal engineering.

Where this lesson sits
Lesson 18 covered the verifier — the function that emits a number for a (prompt, response) pair. A verifier is useless without prompts to feed it, and not every prompt is worth feeding it. This lesson is the data side of the reward signal: the pipeline that sources, filters, stratifies, mixes, and recycles prompts so the verifier you just built actually produces gradient. After this we leave the reward half of Part III and turn to the systems half (lessons 19–22).

Why data is different in RL than SFT

In supervised fine-tuning, data is the target. You hand the model (x, y*) pairs and the loss −log π(y* | x) tells the model to be the demonstrator. The curation question for SFT is "what behavior do we want?" — and the answer is the response field of your dataset.

In RL post-training, data is only the prompt x. The model produces its own candidate y via rollout. The verifier or reward model scores it. The loss reshapes π toward responses that scored high. Your dataset never says what a good response looks like — the verifier does.

Reframe
RL data curation is not "what should the model say?" It's "which prompts produce useful gradient signal per FLOP?" The unit of analysis is a prompt and the gradient it generates against the current policy. Everything in this lesson follows from that one shift.

This is why a curriculum that works for SFT — "give the model lots of examples of correct behavior" — actively hurts in RL. A prompt the model already solves perfectly is a prompt with zero gradient. The most "high quality" examples for SFT are often the lowest-information examples for RL.

The single most important quantity: per-prompt pass rate

Pick a prompt x. Sample K rollouts y_1, …, y_K from the current policy. Score each with a binary verifier so each r_i ∈ {0, 1}. Let p = P(r=1 | x) be the true pass rate of the prompt under the current policy, and let p̂ = (1/K) Σ r_i be its empirical estimate from the K rollouts.

Now write the advantage for the rollouts of this prompt under a group-mean baseline (RLOO, Dr.GRPO, REINFORCE-with-group-baseline). Idealizing the baseline as the true pass rate p, each A_i = r_i − p. Expected squared advantage:

E[ A² ] = p (1 − p)

That is just the variance of a Bernoulli with mean p. It is also, up to a constant, the magnitude of the per-prompt contribution to the unnormalized policy gradient. The function p(1 − p) is the most important curve in this lesson:

[Figure: signal p̂(1−p̂) plotted against per-prompt pass rate p̂ on [0, 1]. The curve peaks at 0.25 when p̂ = 0.5 (the informative band, where most gradient lives) and falls to zero at both ends: too hard (p̂ ≈ 0, no signal) and too easy (p̂ ≈ 1, no signal).]
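A quick numerical check of the identity, as a minimal sketch. It computes E[A²] directly from the definition A = r − p with r drawn from a Bernoulli(p), and compares it with the closed form p(1−p). The function names are illustrative and not part of any library.

```python
def signal_density(p: float) -> float:
    """Closed form: variance of a Bernoulli reward with pass rate p."""
    return p * (1.0 - p)


def expected_sq_advantage(p: float) -> float:
    """E[A^2] from the definition A = r - p, with r ~ Bernoulli(p)."""
    # r = 1 happens with probability p and gives A = 1 - p.
    # r = 0 happens with probability 1 - p and gives A = -p.
    return p * (1.0 - p) ** 2 + (1.0 - p) * (0.0 - p) ** 2


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(
        f"p={p:.1f}  closed form={signal_density(p):.3f}  "
        f"from definition={expected_sq_advantage(p):.3f}"
    )
```

The two columns agree at every p, and both vanish at the endpoints, which is the shape the figure describes.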

Two consequences fall out immediately and the rest of the lesson is downstream of them:

  1. At p̂ = 0 (all K rollouts fail) or p̂ = 1 (all succeed), the empirical advantage of every rollout in the group is identically zero. The gradient contribution of that prompt is structurally zero, no matter how many rollouts you took.
  2. You already paid for those rollouts. K decode passes hit the inference engine, the verifier ran, log-probs were computed under the trainer and reference — and the gradient is zero. That is not "small signal." It is wasted FLOPs.
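Both consequences can be made concrete with a short sketch. It assumes the K rollouts of a prompt are independent draws with the same pass rate p, so a group is degenerate with probability p^K + (1−p)^K. The helper names are hypothetical.

```python
from typing import List


def group_mean_advantages(rewards: List[int]) -> List[float]:
    """Advantages under a plain group-mean baseline: A_i = r_i - mean(r)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def prob_degenerate(p: float, k: int) -> float:
    """Probability that all k independent rollouts pass or all k fail."""
    return p ** k + (1.0 - p) ** k


# An all-fail group: every advantage is exactly zero, however large k is.
print(group_mean_advantages([0] * 8))

# How often a prompt with true pass rate p yields a zero-gradient group at k = 8.
for p in (0.02, 0.4, 0.5, 0.95):
    print(f"p={p:.2f}  P(degenerate group)={prob_degenerate(p, 8):.3f}")
```

Under this independence assumption, even a prompt with p = 0.4 occasionally produces an all-fail group, while prompts near either end of the range are degenerate most of the time.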
Caveat — which estimators is this clean for?
The p(1−p) identity is exact for estimators that center on the group mean and don't divide by group std: RLOO, Dr.GRPO, and REINFORCE-with-group-baseline (lessons 12, 14, 09). Vanilla GRPO (lesson 11) divides each advantage by the group's standard deviation, which normalizes magnitudes within a non-degenerate group. For GRPO specifically, then, p(1−p) governs the probability that the group survives at all (that is, that the std is non-zero), not the gradient magnitude once it does. PPO with a learned critic (lesson 10) substitutes a different baseline entirely. For non-binary or shaped rewards, replace p(1−p) with Var(R | x): same shape, same conclusion. The curve above is the binary case because that is what RLVR ships in production.
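The difference between the two families can be seen on toy groups. The sketch below compares mean-centered advantages with std-normalized ones. The population standard deviation and the `eps` stabilizer are assumptions made for illustration; real implementations differ in both details.

```python
import math
from typing import List


def mean_centered(rewards: List[float]) -> List[float]:
    """Group-mean baseline only (the Dr.GRPO / RLOO-style family in the text)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def std_normalized(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-mean baseline followed by division by the group std (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def mean_square(values: List[float]) -> float:
    return sum(v * v for v in values) / len(values)


groups = {
    "p_hat=0.125": [1, 0, 0, 0, 0, 0, 0, 0],
    "p_hat=0.500": [1, 1, 1, 1, 0, 0, 0, 0],
    "p_hat=1.000": [1, 1, 1, 1, 1, 1, 1, 1],
}
for name, rewards in groups.items():
    centered = mean_square(mean_centered(rewards))
    normalized = mean_square(std_normalized(rewards))
    print(f"{name}  mean-centered={centered:.3f}  std-normalized={normalized:.3f}")
```

Mean-centering tracks p̂(1−p̂), while std-normalization flattens every non-degenerate group to roughly unit scale. The degenerate group is zero under both.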
Why this matters more than it sounds
Lesson 28 attributes 60–80% of RL wall-clock to rollout. If a third of your prompts are degenerate (all-pass or all-fail under the current policy), a third of that rollout time, roughly 20–27% of total training time, went to generating zero-gradient tokens. That is the cost a data pipeline exists to recover.
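The arithmetic behind that estimate, as a small sketch. It assumes rollout cost is spread evenly across prompts, which is a simplification: degenerate groups may be shorter or longer than average. The function name is illustrative.

```python
def wasted_time_share(rollout_share: float, degenerate_share: float) -> float:
    """Share of total wall-clock spent on rollouts that yield zero gradient.

    Assumes every prompt costs the same rollout time (a simplification).
    """
    return rollout_share * degenerate_share


# Rollout is 60-80% of wall-clock (per the text); a third of prompts degenerate.
for rollout_share in (0.6, 0.8):
    wasted = wasted_time_share(rollout_share, 1.0 / 3.0)
    print(f"rollout share={rollout_share:.0%}  wasted share of total={wasted:.1%}")
```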

The "edge of capability" is non-stationary

If pass rate is the curation target, the second crucial fact is that pass rate is a function of the current policy. Three points on the same prompt during a single training run:

The informative band slides over the course of training. A static prompt pool — even one perfectly difficulty-balanced for the base model — will decay into mostly-degenerate as the policy improves. Curation is not a one-shot dataset construction step; it is a control loop that runs alongside training.

This is the root cause of three otherwise puzzling observations in the literature:

The pipeline, linearized

With the gradient-signal frame in hand, the rest of the pipeline writes itself. Eight stages, each one a filter that improves expected gradient per FLOP at some bookkeeping cost.

[Figure: the eight-stage pipeline. Offline (stages 1–6), built once per checkpoint era: 1. Source (human · synth · public); 2. License + safety (drop unshippable); 3. Verifier-feasible (gold answer · sandbox); 4. Dedup + decon (MinHash · n-gram); 5. Stratify (probe p̂ on base); 6. Mix (domain · reward type). Online (stages 7–8), runs alongside every step: 7. Online filter (DAPO dynamic sampling); 8. Recycle (RS-SFT · replay). Replay and self-distilled prompts re-enter the pipeline. Each stage tightens the distribution toward p(x) ≈ 0.5 under the current policy.]

Stage 1 — Source

Where prompts originate. Four kinds, in rough order of cost-per-prompt and decreasing-quality-variance:

The source stage's job is not "pick the best one" but state the distribution explicitly. Every later stage operates on this distribution; if you can't write it down, you'll struggle to debug why training stalled three weeks in.

Stage 2 — License + safety filter

Cheap, mechanical, easy to skip and then expensive to redo. Drop anything that violates license, leaks PII, or contains content you don't want the model to optimize toward. Skipping this stage typically blocks a release for a legal review pass on the entire training pool — every prompt has to be re-cleared, retroactively.

Stage 3 — Verifier-feasibility

The verifier from lesson 18 needs something to compare against. For math: a gold answer that the verifier's normalization can canonicalize. For code: a unit-test suite that distinguishes correct from incorrect solutions on at least the held-out cases. For tool-use: an end-state predicate. A prompt without a robust check is unusable for RLVR — it'll get rewarded for the wrong reasons.

Concretely, a healthy stage-3 check, for each prompt, runs the verifier against (a) the gold answer (should return 1), (b) a hand-picked incorrect distractor (should return 0), and (c) a paraphrase of the gold answer (should usually return 1). Prompts where the verifier fails any of these are verifier-incompatible regardless of how good the prompt itself is.
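The three-way check above can be sketched as follows. The `Verifier` signature, the toy math verifier, and the result keys are all hypothetical stand-ins. The lesson 18 verifier is not reproduced here, and this sketch treats a rejected paraphrase as a failure.

```python
from fractions import Fraction
from typing import Callable, Dict

# Hypothetical signature: (candidate_answer, gold_answer) -> 0 or 1.
Verifier = Callable[[str, str], int]


def toy_math_verifier(candidate: str, gold: str) -> int:
    """Toy stand-in: canonicalize both answers as exact rationals and compare."""
    try:
        return int(Fraction(candidate) == Fraction(gold))
    except (ValueError, ZeroDivisionError):
        return 0  # unparseable answers are scored as incorrect


def stage3_check(verify: Verifier, gold: str, distractor: str, paraphrase: str) -> Dict[str, bool]:
    """Run the gold, distractor, and paraphrase probes for one prompt."""
    results = {
        "gold_accepted": verify(gold, gold) == 1,
        "distractor_rejected": verify(distractor, gold) == 0,
        "paraphrase_accepted": verify(paraphrase, gold) == 1,
    }
    results["feasible"] = all(results.values())
    return results


# "0.5" is a paraphrase of the gold answer "1/2"; "1/3" is a distractor.
print(stage3_check(toy_math_verifier, gold="1/2", distractor="1/3", paraphrase="0.5"))

# An exact-string verifier rejects the paraphrase, so the prompt is flagged.
print(stage3_check(lambda c, g: int(c == g), gold="1/2", distractor="1/3", paraphrase="0.5"))
```

Running the probes per prompt also yields the per-prompt pass rates that a dashboard can aggregate.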

Stage 4 — Dedup + decontamination

Two distinct jobs that share infrastructure:

A common silent failure
Model-synthesized prompts (stage 1, type 3) regenerated from public benchmarks land in the training pool with surface paraphrasing that defeats n-gram dedup but preserves the answer. The model "solves" eval problems it has memorized. Test this directly: at the end of training, compare per-eval-prompt pass rate to the pretrained base. If a benchmark went from 5% to 95% but training never touched the topic, that's not learning, it's leakage.
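The end-of-training leakage test described above might look like the following sketch. The `EvalResult` record, the benchmark names, and the `min_jump` threshold of 0.5 are assumptions for illustration. A flagged benchmark is a prompt for manual inspection, not proof of contamination.

```python
from typing import Dict, List, NamedTuple


class EvalResult(NamedTuple):
    base_pass_rate: float      # pass rate of the pretrained base
    final_pass_rate: float     # pass rate at the end of training
    topic_in_training: bool    # did the training pool cover this topic?


def flag_possible_leakage(results: Dict[str, EvalResult], min_jump: float = 0.5) -> List[str]:
    """Flag evals that improved sharply although training never touched the topic."""
    flagged = []
    for name, result in results.items():
        jump = result.final_pass_rate - result.base_pass_rate
        if jump >= min_jump and not result.topic_in_training:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical benchmark names and numbers.
results = {
    "bench_geometry": EvalResult(0.05, 0.95, topic_in_training=False),
    "bench_algebra": EvalResult(0.30, 0.85, topic_in_training=True),
    "bench_trivia": EvalResult(0.40, 0.45, topic_in_training=False),
}
print(flag_possible_leakage(results))
```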

Stage 5 — Difficulty stratification

The stage that makes the most direct use of the pass-rate frame. For each surviving prompt, sample K=8 rollouts from your current checkpoint (not the base, but the one you're about to train from). Bucket by p̂:

Two non-obvious points. First, the buckets aren't fixed — they're calibrated to the current checkpoint's capability. When you re-stratify after 5k steps, prompts move buckets. Second, the "too hard" set is not garbage: it's the curriculum for the next stratification. Hold it; you'll need it.
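A minimal sketch of the bucketing step, assuming the K=8 rollout rewards per prompt have already been collected. The bucket names and the 0.05 / 0.95 cut-offs are assumptions borrowed from the diagnostic checklist later in this lesson, not thresholds prescribed for this stage.

```python
from typing import Dict, List


def estimate_pass_rate(rewards: List[int]) -> float:
    """Empirical pass rate p_hat over one prompt's rollouts."""
    return sum(rewards) / len(rewards)


def bucket_for(p_hat: float, low: float = 0.05, high: float = 0.95) -> str:
    """Assumed cut-offs: below `low` is too hard, above `high` is too easy."""
    if p_hat < low:
        return "too_hard"
    if p_hat > high:
        return "too_easy"
    return "informative"


def stratify(rollout_rewards: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Map prompt id -> rollout rewards into difficulty buckets."""
    buckets: Dict[str, List[str]] = {"too_hard": [], "informative": [], "too_easy": []}
    for prompt_id, rewards in rollout_rewards.items():
        buckets[bucket_for(estimate_pass_rate(rewards))].append(prompt_id)
    return buckets


# Hypothetical probe results from the current checkpoint, K = 8 each.
probe = {
    "p1": [0, 0, 0, 0, 0, 0, 0, 0],
    "p2": [1, 0, 0, 1, 0, 1, 0, 0],
    "p3": [1, 1, 1, 1, 1, 1, 1, 1],
    "p4": [1, 1, 1, 1, 1, 1, 1, 0],
}
buckets = stratify(probe)
print(buckets)
# The too-hard set is held back rather than discarded, per the text.
held_for_next_round = buckets["too_hard"]
```

Re-running `stratify` on fresh probes from a later checkpoint is what moves prompts between buckets.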

Stage 6 — Mix

Two axes of mixing, and they interact:

The reward-source mix trap
Naive mix: concatenate verifier-rated and RM-rated prompts in one batch, normalize advantages globally. The RM's mean reward drifts up over training (the RM doesn't know the policy is improving); the verifier's mean reward is anchored. After enough steps, the verifier-rated half has negative-mean advantage and is fighting the RM-rated half. Three common fixes: (a) per-source reward whitening — compute mean and std within each source and normalize before mixing; (b) cap the RM contribution by clipping its reward range; (c) run per-source batches (Tülu 3's RLVR phase is verifier-only by step). All three appear in the literature; the right choice depends on how stationary your RM is.
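Fix (a), per-source reward whitening, can be sketched in a few lines. The source labels, the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
import statistics
from collections import defaultdict
from typing import Dict, List


def whiten_per_source(rewards: List[float], sources: List[str], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own source."""
    idx_by_source: Dict[str, List[int]] = defaultdict(list)
    for i, source in enumerate(sources):
        idx_by_source[source].append(i)

    whitened = [0.0] * len(rewards)
    for idxs in idx_by_source.values():
        values = [rewards[i] for i in idxs]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        for i in idxs:
            whitened[i] = (rewards[i] - mean) / (std + eps)
    return whitened


# Verifier rewards live in {0, 1}; the RM scores here sit on a drifted scale.
rewards = [1.0, 0.0, 1.0, 0.0, 3.2, 3.6, 3.4, 3.8]
sources = ["verifier"] * 4 + ["rm"] * 4
print([round(w, 2) for w in whiten_per_source(rewards, sources)])
```

After whitening, each source is centered at zero on its own scale, so a drifting RM mean no longer pushes the verifier-rated half toward negative advantage when the two are mixed.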

Stage 7 — Online filter (dynamic sampling)

This is DAPO's contribution as a data design pattern. At every training step, after rollouts come back, drop any group with p̂ ∈ {0, 1} before the loss is computed. The trade is: extra rollouts on freshly-sampled prompts to backfill the batch, in exchange for never wasting a gradient step on degenerate groups.

The relationship between stages 5 and 7 is the relationship between "schedule" and "interrupt." Stage 5 sets up the prompt pool with offline difficulty estimates; stage 7 catches the cases where those estimates are stale (the checkpoint has moved since stratification ran) or unlucky (low-K noise put a 0.4 prompt in the 0.0 bucket by chance).
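The drop-and-backfill loop can be sketched as below. This is an illustration of the pattern, not DAPO's reference implementation: `rollout_fn`, the toy pass-rate table, and the `max_prompts` safety cap are all assumptions.

```python
import random
from typing import Callable, Iterable, List, Tuple

Group = Tuple[str, List[int]]


def is_degenerate(rewards: List[int]) -> bool:
    """All-pass or all-fail: every reward in the group is identical."""
    return len(set(rewards)) == 1


def fill_batch(
    prompts: Iterable[str],
    rollout_fn: Callable[[str, int], List[int]],
    batch_size: int,
    k: int = 8,
    max_prompts: int = 1000,
) -> Tuple[List[Group], int]:
    """Keep sampling fresh prompts until batch_size non-degenerate groups exist."""
    kept: List[Group] = []
    tried = 0
    stream = iter(prompts)
    while len(kept) < batch_size and tried < max_prompts:
        prompt = next(stream, None)
        if prompt is None:
            break  # prompt pool exhausted before the batch filled
        tried += 1
        rewards = rollout_fn(prompt, k)
        if not is_degenerate(rewards):
            kept.append((prompt, rewards))
    return kept, tried


# Toy policy: each prompt has a hidden true pass rate; rollouts are Bernoulli draws.
rng = random.Random(0)
true_pass_rate = {f"prompt-{i}": rng.random() for i in range(200)}


def toy_rollout(prompt: str, k: int) -> List[int]:
    p = true_pass_rate[prompt]
    return [1 if rng.random() < p else 0 for _ in range(k)]


batch, tried = fill_batch(true_pass_rate, toy_rollout, batch_size=16)
print(f"kept {len(batch)} groups after sampling {tried} prompts")
```

The gap between `tried` and the batch size is the extra rollout cost the text describes; a well-stratified pool keeps that gap small.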

Stage 8 — Recycle

Two distinct mechanisms, both essential at scale:

Interactive: how curation moves your effective gradient

The diagram below simulates a prompt pool of 1000 prompts at three difficulty distributions, and a policy whose strength you control. The green curve is the signal density p̂(1−p̂); the blue histogram is the prompt distribution after the curation aggressiveness you set. Watch the KPI on the right — that is the number a data pipeline exists to move.

Gradient-signal density simulator
Slide policy strength right to simulate training progress (it shifts the prompt pool's effective pass-rate to the right). Slide curation up to filter aggressively toward p̂ ≈ 0.5. Watch the wasted-compute and step-efficiency KPIs.
KPI readouts: prompts kept (1000), expected signal, % compute wasted, signal per FLOP.
What to try. The "% wasted" KPI is the share of K-rollout groups that came back all-pass or all-fail — the DAPO degenerate-group condition. (1) Curation = 0, policy = 20: an untuned pool against an early policy wastes 40%+ to the all-fail tail. (2) Curation = 0, policy = 80: the same untuned pool now wastes 40%+ to the all-pass tail — same problem, opposite side, and exactly what happens to a static pool as training proceeds. (3) Curation = 80, policy = 80: re-stratification recenters the band and waste drops well below 15%. This is the whole reason stage 5 has to be re-run periodically — not once at the start.

The six data-driven failure modes

The widget makes one point about compute waste. The pipeline framing makes a second point: every common failure of an RL run that looks like "the algorithm is broken" turns out, on inspection, to be a data design choice that the algorithm honestly reflected. The taxonomy below maps the symptom you see on the dashboard to the pipeline stage that should have caught it — which is also the stage you should instrument first when you see the symptom.

| Symptom | Data root cause | The stage that should have caught it |
|---|---|---|
| Reward signal flatlines after N steps | Prompt pool fully solved; p̂ → 1 for almost everything | Stage 5 re-stratify; stage 8 inject harder pool |
| Responses become absurdly long | RM-rated prompts dominate; RM has length bias | Stage 6 per-source whitening; stage 6 cap mix ratio |
| Mode collapse to a single phrasing | RS-SFT recycle ran on a too-narrow keeper set | Stage 8 RS-SFT must mix in non-keepers or general SFT |
| Held-out gap stays positive but eval climbs to ceiling | Contamination: eval prompt is in training pool | Stage 4 decon; n-gram + embedding match against every reported eval |
| Policy "solves" the verifier in unexpected ways | Verifier-feasibility check passed prompts the verifier can be gamed on | Stage 3 distractor + paraphrase test |
| Training stalls and per-step variance spikes | 40%+ degenerate groups; mostly all-pass or all-fail rollouts | Stage 7 dynamic sampling; revisit stage 5 |

Reading R1 as a sequence of data choices

Lesson 23 walks through R1 as a sequence of algorithm choices. With the pipeline frame in hand, the same recipe re-reads as a sequence of data choices. The four stages correspond to the four roles a data pipeline plays.

[Figure: the R1 recipe as four data stages, with the informative band sliding over training.]
1 · cold SFT: ~ few thousand curated CoT. Pipeline stage 1 (human source). Role: seed reasoning prior. Post-stage 1 peak p̂ ≈ 0.15.
2 · GRPO (verifier): math + code corpus + online filter. Pipeline stages 3, 5, 7 (verifier · stratify · DAPO). Post-stage 2 peak p̂ ≈ 0.40.
3 · RS-SFT: ~600k reasoning keepers + ~200k non-reasoning SFT. Pipeline stage 8 (recycle). Role: re-center capability. Post-stage 3 re-centered ≈ 0.30.
4 · GRPO (mixed): verifier + RM, per-source whitening. Pipeline stage 6 (mix). Role: general assistant. Post-stage 4 peak p̂ ≈ 0.55.

Read down the rows: each stage shifts the model's capability, and each data choice moves the prompt pool to keep the informative band centered. The purpose of R1's stage 3 (RS-SFT), in this frame, is not "more supervised data"; it is distribution shift in the prompt pool. The model can solve too many of the original prompts. Rather than spend training budget on those zeros, the recipe re-centers via self-distillation and re-enters RL with the band intact.

A diagnostic checklist for your own pipeline

Per stage, one metric you should be able to read off the dashboard at any time. If you can't, that's the next thing to instrument.

| Stage | Healthy metric | Alarm threshold |
|---|---|---|
| 1 · Source | Prompt-count by source, written down | "It's a mix" with no numbers |
| 2 · License + safety | % rejected, sampled hand-audit | 0% rejected (filter isn't running) |
| 3 · Verifier-feasibility | Distractor + paraphrase pass rates per prompt | > 5% of prompts fail either |
| 4 · Dedup + decon | n-gram overlap to every reported eval | any non-trivial match |
| 5 · Stratify | Histogram of p̂ on current checkpoint | > 40% mass outside [0.05, 0.95] |
| 6 · Mix | Per-source reward mean & std, per-step | RM mean drifting > 0.1 / 1k steps |
| 7 · Online filter | Degenerate-group rate after dynamic sampling | > 10% still degenerate |
| 8 · Recycle | Diversity (n-gram entropy) of RS-SFT keepers | collapse: top-5 templates > 50% mass |
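Three of the numeric alarms (stages 5, 7, and 8) are easy to turn into dashboard checks. The sketch below uses the thresholds from the checklist. The function names are illustrative, and how an RS-SFT keeper gets assigned a template id is left as an assumption.

```python
from collections import Counter
from typing import List


def stratify_alarm(p_hats: List[float], low: float = 0.05, high: float = 0.95,
                   max_outside: float = 0.40) -> bool:
    """Stage 5: more than 40% of prompt mass outside [0.05, 0.95]."""
    outside = sum(1 for p in p_hats if p < low or p > high)
    return outside / len(p_hats) > max_outside


def online_filter_alarm(group_rewards: List[List[int]], max_degenerate: float = 0.10) -> bool:
    """Stage 7: more than 10% of groups still all-pass or all-fail."""
    degenerate = sum(1 for rewards in group_rewards if len(set(rewards)) == 1)
    return degenerate / len(group_rewards) > max_degenerate


def recycle_alarm(template_ids: List[str], top_n: int = 5, max_mass: float = 0.50) -> bool:
    """Stage 8: the top-5 templates cover more than 50% of keepers."""
    counts = Counter(template_ids)
    top_mass = sum(count for _, count in counts.most_common(top_n)) / len(template_ids)
    return top_mass > max_mass


# Hypothetical dashboard snapshot.
print("stage 5 alarm:", stratify_alarm([0.0, 0.0, 1.0, 0.5, 0.25, 0.75, 1.0, 0.0]))
print("stage 7 alarm:", online_filter_alarm([[1, 0, 1, 0]] * 9 + [[1, 1, 1, 1]]))
print("stage 8 alarm:", recycle_alarm(["t1"] * 6 + [f"u{i}" for i in range(10)]))
```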

What's not in this lesson and where to look

The pipeline frame keeps the lesson finite by collapsing several real and important sub-topics into single sentences. The ones you will hit in production, with pointers:

What this lesson does contribute: a single axis — the signal-density curve p(1−p) — onto which every data choice projects. Lesson 03's KL anchor and lesson 18's verifier define the reward signal; this lesson defines the set of inputs that signal is being computed on. Lesson 13's DAPO is stage 7 of an eight-stage pipeline whose offline stages do most of the work before any rollout runs. Lesson 23's recipes (R1, Tülu 3, InstructGPT) all make implicit data-pipeline choices; here those choices are explicit and named. Lesson 27 listed "difficulty stratification" and "mixing verifiable + preference" as missing — stages 5 and 6 are those concepts, recast as gradient-signal engineering.

Takeaway
In RL post-training, data is not a target — it is a gradient-signal substrate. The right curation question is not "what should the model say?" but "where is p(1−p) non-zero under the current policy, and what is the cheapest pipeline that keeps the prompt distribution there?" Eight stages, one feedback loop, periodically re-run as the policy moves.