Quality & validation

Quality work happens in the silver layer — after dedup (lesson 06) and before tokenization (lesson 07). It splits into two distinct jobs: structural validation (does the data conform to a contract?) and content filtering (is this record worth training on?).

Where we are

Lesson 06 deduplicated the corpus. Lesson 07 tokenized and packed the survivors. This lesson is the gate that sits between them: it rejects records that are malformed or low-quality before they consume tokenization compute, GPU scoring time, or — worst — training tokens.

Two distinct kinds of checks

Pipeline quality work is often lumped together under "data cleaning," but that conflates two very different activities with different failure modes and different remedies.

                  Structural validation                               Content quality filtering
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Question asked  Does this record conform to the schema / contract?  Is this record worth training on?
  Answer type     Binary — valid or not                               Scored — better or worse
  Failure action  Hard gate: fail the build, quarantine the batch     Soft gate: tune the threshold, route for review
  When to run     At ingest / schema change                           After structural validation, cheapest checks first
  Tools           Great Expectations, Pandera, Pydantic               Heuristics, classifiers, reward models

1 · Structural validation — data contracts

A data contract is a formal assertion about the shape of your data: required fields are present, types match, values fall within allowed ranges, enum columns only take known values, referential integrity holds. These are invariants — if they're violated, the downstream transform is operating on garbage and will produce garbage, silently.

Contract checks are cheap (column statistics, row counts, type checks) and should run as a hard gate: if the check fails, you halt the pipeline, quarantine the offending batch, and page the engineer. You do not let malformed records flow forward and hope the model learns around them.

  Contract assertions (examples)
  ─────────────────────────────
  required:    prompt ≠ null, response ≠ null, source ∈ {annotation, synth, distill}
  types:       token_count :: int, reward_score :: float ∈ [−1, 1]
  ranges:      len(prompt) ≥ 10, len(response) ≥ 20
  referential: split_id must exist in splits_manifest table
  statistical: null_rate(reward_score) < 2%, row_count within 5% of yesterday

  Failure → quarantine batch → alert → pipeline halts.  No silent pass-through.

Libraries express contracts as code so they can be version-controlled alongside the pipeline:

Great Expectations — suite of "expectations" that run as a validation step in your DAG; produces an HTML data-quality report.
Pandera — schema decorators on DataFrames; raises at the function boundary that produced bad data.
Pydantic — row-level validation; parse each record into a typed model and reject at ingest.
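
To make the row-level assertions concrete without committing to any one of these libraries, here is a minimal plain-Python sketch. The `check_record` function and `Violation` class are hypothetical names, not part of Great Expectations, Pandera, or Pydantic. It assumes a null `reward_score` is allowed per row (the null rate is a batch-level check) and leaves out the referential check, which needs access to the manifest table.

```python
from dataclasses import dataclass

# Enum values taken from the contract example; everything else is illustrative.
ALLOWED_SOURCES = {"annotation", "synth", "distill"}


@dataclass(frozen=True)
class Violation:
    field: str
    reason: str


def check_record(rec: dict) -> list[Violation]:
    """Return every contract violation for one record (empty list = valid)."""
    out: list[Violation] = []

    # required + ranges: prompt and response must be present and long enough
    for name, min_len in (("prompt", 10), ("response", 20)):
        value = rec.get(name)
        if not isinstance(value, str):
            out.append(Violation(name, "missing or not a string"))
        elif len(value) < min_len:
            out.append(Violation(name, f"shorter than {min_len} characters"))

    # enum: source must be one of the known values
    if rec.get("source") not in ALLOWED_SOURCES:
        out.append(Violation("source", "not an allowed value"))

    # types: token_count must be an int (bool subclasses int, so exclude it)
    tc = rec.get("token_count")
    if not isinstance(tc, int) or isinstance(tc, bool):
        out.append(Violation("token_count", "not an int"))

    # types + range: reward_score, when present, is a float in [-1, 1]
    rs = rec.get("reward_score")
    if rs is not None and not (isinstance(rs, float) and -1.0 <= rs <= 1.0):
        out.append(Violation("reward_score", "not a float in [-1, 1]"))

    return out
```

Returning the full list of violations, rather than stopping at the first, makes the quarantine report more useful when an upstream change breaks several fields at once.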

Hard gate means hard

A contract violation that "only" affects 0.3% of rows is still a build failure. Small violations are often the leading edge of an upstream schema change, an annotation tool bug, or a synthetic-generation prompt drift. Treat them as such.
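
A batch-level gate with zero tolerance might look like the sketch below. `gate_batch` and `ContractViolation` are hypothetical names; the 2% null-rate and 5% row-count limits come from the contract example, and the quarantine and alerting steps are assumed to live in whatever catches the exception.

```python
class ContractViolation(Exception):
    """Raised to halt the pipeline; the caller quarantines the batch and alerts."""


def gate_batch(records: list[dict], is_valid, yesterday_rows: int) -> None:
    """Raise ContractViolation on any failure; return None if the batch passes."""
    problems = []

    # Row contract: no tolerance, a single bad row fails the build.
    bad_rows = sum(1 for r in records if not is_valid(r))
    if bad_rows:
        problems.append(f"{bad_rows}/{len(records)} rows violate the row contract")

    # Statistical check: null_rate(reward_score) must stay under 2%.
    nulls = sum(1 for r in records if r.get("reward_score") is None)
    if records and nulls / len(records) >= 0.02:
        problems.append(f"null_rate(reward_score) = {nulls / len(records):.1%}")

    # Statistical check: row count must be within 5% of yesterday's.
    if abs(len(records) - yesterday_rows) > 0.05 * yesterday_rows:
        problems.append(f"row_count {len(records)} drifts >5% from {yesterday_rows}")

    if problems:
        raise ContractViolation("; ".join(problems))
```

With this shape, a batch of 1,000 rows containing 3 invalid ones (0.3%) raises just as a batch with 300 invalid rows would.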

2 · Content quality filtering — deciding what to keep

Structural validation tells you a record is well-formed. Content quality filtering tells you whether it's actually useful for training. This is a policy decision expressed as a scored threshold: records above the cut stay; records below are dropped or sent for review.

Heuristic filters — fast, cheap, imperfect

Length bounds — too short (a one-word response) or too long (runaway generation) signals low utility. Tuned per regime: SFT responses have different expected lengths than RL rollouts.
Language identification — keep only the target language(s) unless you're building a multilingual set. fastText's language-identification model runs in under 1 ms per record.
Repetition / n-gram ratio — a response that repeats the same 5-gram 40 times is a degenerate generation. Measure: fraction of tokens that appear in the top-k n-gram types.
Perplexity (LM score) — score against a reference language model; extremely low perplexity (memorized boilerplate) or extremely high perplexity (garbled text) both warrant filtering. The CCNet pipeline applied this kind of filter to web-scale text.
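
The two model-free heuristics in this list can be sketched in a few lines. The function names and default bounds below are illustrative assumptions, and `top_ngram_fraction` is one simple reading of the repetition measure: it counts the share of n-gram occurrences taken by the k most common n-grams rather than the share of tokens. Language ID and perplexity are omitted because they need a trained model.

```python
from collections import Counter


def within_length_bounds(text: str, min_chars: int = 20, max_chars: int = 8000) -> bool:
    """Keep text whose character length falls inside the (hypothetical) bounds."""
    return min_chars <= len(text) <= max_chars


def top_ngram_fraction(tokens: list[str], n: int = 5, k: int = 1) -> float:
    """Fraction of all n-gram occurrences taken up by the k most common n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # fewer than n tokens: nothing to measure
    counts = Counter(ngrams)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(ngrams)
```

A response that loops over the same five tokens scores far higher than ordinary text, where almost every 5-gram is unique, so a threshold on this value separates the two cases. Where to put that threshold is a per-regime tuning decision.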

Model-based scorers — slow, expensive, accurate

Reward model score — a trained RM predicts human preference for the response. High RM score = response a human would prefer. Used as a quality cutoff or to rank examples for curriculum ordering.
Toxicity / safety classifier — a fast discriminator (e.g. a fine-tuned BERT) flags harmful content. Runs at ~50 ms/record on CPU, versus hundreds of milliseconds per record for a large RM on GPU.
LLM-as-judge — a frontier model rates the example on axes like helpfulness, factuality, instruction-following. Expensive (API cost or GPU time) but calibrated on nuanced quality signals. Reserve for the final cut or for preference-pair creation.

Ordering: predicate pushdown for quality stages

The same principle from lesson 04 (predicate pushdown: filter early, read less) and lesson 05 (filter before the shuffle) applies inside the quality pipeline: run cheap filters first so expensive scorers never see rows that would have been dropped anyway.

  Ordered filter pipeline (cheapest → most expensive)
  ────────────────────────────────────────────────────
  1. Schema / contract check          (~0.001 ms/row · CPU)
  2. Length bounds                    (~0.01  ms/row · CPU)
  3. Language ID                      (~0.5   ms/row · CPU)
  4. Repetition / n-gram ratio        (~1     ms/row · CPU)
  5. Perplexity filter                (~5     ms/row · CPU)
  6. Toxicity classifier              (~50    ms/row · CPU)
  7. Reward-model score               (~200   ms/row · GPU) ← most expensive, runs last

  Each stage drops records; later stages process a smaller set.
  Running stage 7 first would spend ~200 ms of GPU time on every input row,
  including rows a cheaper stage would have dropped, for the same output.
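
The ordering above can be sketched as a short-circuiting loop. `run_funnel` and the `Stage` tuple are hypothetical names, and the per-row costs are the estimates from the stage list, used only for sorting. A production pipeline would batch rows per stage instead of looping row by row, but the principle is the same.

```python
from typing import Callable, Iterable

# (stage name, estimated ms/row, predicate returning True to keep the record)
Stage = tuple[str, float, Callable[[dict], bool]]


def run_funnel(records: Iterable[dict], stages: list[Stage]):
    """Apply stages cheapest-first; return survivors and per-stage [seen, kept]."""
    ordered = sorted(stages, key=lambda s: s[1])
    stats = {name: [0, 0] for name, _, _ in ordered}
    survivors = []
    for rec in records:
        for name, _, keep in ordered:
            stats[name][0] += 1
            if not keep(rec):
                break  # short-circuit: costlier stages never see this row
            stats[name][1] += 1
        else:
            survivors.append(rec)  # passed every stage
    return survivors, stats
```

The `stats` counters double as a yield report: the `seen` count of the most expensive stage is exactly the number of rows you pay GPU time for.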

The cost implication is significant. Scoring 1 M records with the RM at 200 ms each takes about 56 GPU-hours (200,000 GPU-seconds). If the 0.01 ms length filter drops 30% of them first, you save roughly 17 GPU-hours for about 10 seconds of CPU time.
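
The arithmetic is easy to check. The sketch below assumes the per-row figures from the stage list (~200 ms for the RM, ~0.01 ms for the length filter), a single GPU, and no batching speedups; `gpu_hours` is a hypothetical helper.

```python
def gpu_hours(rows: int, ms_per_row: float) -> float:
    """Convert a row count and a per-row latency into hours of device time."""
    return rows * ms_per_row / 1000 / 3600


baseline = gpu_hours(1_000_000, 200)          # every row reaches the RM
after_filter = gpu_hours(700_000, 200)        # 30% dropped by the length filter
saved = baseline - after_filter
filter_cpu_seconds = 1_000_000 * 0.01 / 1000  # length filter at ~0.01 ms/row

print(f"RM on all rows:      {baseline:.1f} GPU-hours")
print(f"RM after the filter: {after_filter:.1f} GPU-hours")
print(f"Saved:               {saved:.1f} GPU-hours")
print(f"Filter cost:         {filter_cpu_seconds:.0f} CPU-seconds")
```

The saving scales linearly with the drop rate of the cheap stages, which is why the cumulative pass rate ahead of the RM matters more than any single filter.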

Hard vs soft gates

Structural violations are always a hard gate. Content quality thresholds are a soft gate: you set a cutoff, but where you set it is a tunable policy choice. A batch where 40% of records are below your RM threshold might be worth reviewing rather than dropping — especially if the batch represents a rare domain. Route low-scoring records to a review queue; don't silently discard information you may later wish you had.
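
A soft gate can be sketched as a split rather than a delete. The `soft_gate` function, the `rm_score` field, and both thresholds are assumptions chosen for illustration; the 40% figure mirrors the example above.

```python
def soft_gate(records, threshold=0.5, batch_review_fraction=0.4, score_key="rm_score"):
    """Split a batch at the threshold; nothing is discarded.

    Returns (kept, review_queue, batch_needs_review).
    """
    kept = [r for r in records if r[score_key] >= threshold]
    review_queue = [r for r in records if r[score_key] < threshold]
    below = len(review_queue) / len(records) if records else 0.0
    # A large share of low scores flags the whole batch for a human look.
    return kept, review_queue, below >= batch_review_fraction
```

Because the low scorers land in a queue instead of being dropped, moving the threshold later only requires re-splitting, not re-scoring.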

Over-filtering: the other failure mode

It is possible to filter your dataset down to almost nothing — or, subtler, to introduce a distribution bias by aggressively removing low-RM-score records. If your RM scores are calibrated on a different distribution than your target task, you may be removing exactly the examples that would transfer well. Monitor yield at each stage and set floor thresholds on what fraction must survive.
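
Yield monitoring with floor thresholds might look like the following sketch. `check_yield_floors` is a hypothetical helper, and the floor values in the usage lines are placeholders, not recommendations.

```python
def check_yield_floors(stage_counts, floors):
    """Warn for each stage whose survival rate falls below its floor.

    stage_counts: ordered list of (stage_name, rows_in, rows_out).
    floors: mapping of stage_name to the minimum acceptable survival rate.
    """
    warnings = []
    for stage, rows_in, rows_out in stage_counts:
        rate = rows_out / rows_in if rows_in else 0.0
        floor = floors.get(stage, 0.0)  # stages without a floor never warn
        if rate < floor:
            warnings.append(f"{stage}: {rate:.0%} survived, floor is {floor:.0%}")
    return warnings


counts = [("length", 1000, 850), ("reward_model", 850, 170)]
print(check_yield_floors(counts, {"length": 0.5, "reward_model": 0.4}))
```

Here the reward-model stage keeps only 20% of what it sees against a 40% floor, so it is the one stage reported.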

Quality vs quantity — the LIMA intuition

For post-training, a smaller high-quality set usually beats a larger noisy one. The LIMA paper demonstrated that 1,000 carefully curated examples can produce competitive SFT behavior: the training signal is in the quality of the gradient, not the count of the steps. The same intuition carries over to other regimes. For preference data, a clean 50 k set of genuine human judgments can outperform a noisy 5 M set of synthetic auto-labeled pairs.

The practical consequence: invest engineering time in raising quality thresholds, not in ingesting more raw data. More raw data that passes poor quality gates just adds noise to the gradient. See the companion RL lesson 18a on signal-density curation for the gradient-signal framing of the same intuition.

Interactive · quality funnel

A corpus of 1,000,000 records flows through ordered filter stages. Adjust the threshold sliders to see survivors at each stage, final yield, and the compute cost of model-based scoring. Notice how tightening the reward-model threshold raises quality but cuts yield — and how the per-record cost of the RM means it dominates total GPU spend.

Quality funnel simulator

Records flow left-to-right through cheapest-first filters. Sliders set what fraction survives each stage. GPU cost is charged only to records that reach the RM scorer — earlier drops are free.

Slider settings shown:

min-length pass rate: 85%
language-ID pass rate: 78%
repetition filter pass rate: 90%
toxicity filter pass rate: 95%
RM-score keep percentile: top 60%

Simulator outputs for an input of 1,000,000 records: final yield, yield %, GPU-hours (RM), GPU cost saved, and RM cost @ $3/h.
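
The funnel can be reproduced offline with a few lines of Python. This is a sketch under stated assumptions, not the simulator's actual code: the `funnel` function is hypothetical, the RM is charged ~200 ms per row (the figure from the ordered stage list), and "GPU cost saved" is read as the GPU-hours avoided compared with scoring every input row.

```python
def funnel(n_input, pass_rates, rm_keep, rm_ms_per_row=200.0, usd_per_gpu_hour=3.0):
    """Push n_input rows through cheap stages, then the RM; report yield and cost."""
    rows = n_input
    for rate in pass_rates:  # cheap CPU stages, applied in order
        rows = round(rows * rate)
    scored = rows  # rows that actually reach the RM scorer
    kept = round(scored * rm_keep)
    hours_per_row = rm_ms_per_row / 1000 / 3600
    gpu_h = scored * hours_per_row
    return {
        "scored_by_rm": scored,
        "final_yield": kept,
        "yield_pct": 100 * kept / n_input,
        "gpu_hours": gpu_h,
        "gpu_hours_saved": n_input * hours_per_row - gpu_h,
        "rm_cost_usd": gpu_h * usd_per_gpu_hour,
    }


# Pass rates in stage order: min-length, language ID, repetition, toxicity.
result = funnel(1_000_000, [0.85, 0.78, 0.90, 0.95], rm_keep=0.60)
for key, value in result.items():
    print(f"{key}: {value:,.2f}")
```

With the settings shown, 566,865 rows reach the RM and 340,119 survive it, a yield of about 34%. Under the 200 ms assumption that is roughly 31.5 GPU-hours (about $94 at $3/h), around 24 GPU-hours fewer than scoring all 1,000,000 rows.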

Takeaway

What to carry to lesson 09

Quality work is two separate jobs: structural contracts (hard gates, fail the build) and content filtering (soft gates, tune the threshold). Run cheap heuristics first so expensive GPU scorers only see records that will survive. For post-training, quality beats quantity — a tighter RM cutoff is almost always the right call until yield drops to a level that starves the trainer. Lesson 09 (orchestration) is where you'll see how each of these stages becomes a DAG task with dependencies, retries, and data-versioned outputs — so a failed quality gate can be re-run in isolation without re-processing the whole corpus.