Quality & validation
Quality work happens in the silver layer — after dedup (lesson 06) and before tokenization (lesson 07). It splits into two distinct jobs: structural validation (does the data conform to a contract?) and content filtering (is this record worth training on?).
Two distinct kinds of checks
Pipeline quality work is often lumped together under "data cleaning," but that conflates two very different activities with different failure modes and different remedies.
| | Structural validation | Content quality filtering |
|---|---|---|
| Question asked | Does this record conform to the schema / contract? | Is this record worth training on? |
| Answer type | Binary — valid or not | Scored — better or worse |
| Failure action | Hard gate: fail the build, quarantine the batch | Soft gate: tune the threshold, route for review |
| When to run | At ingest and on every schema change | After structural validation, cheapest checks first |
| Tools | Great Expectations, Pandera, Pydantic | Heuristics, classifiers, reward models |
1 · Structural validation — data contracts
A data contract is a formal assertion about the shape of your data: required fields are present, types match, values fall within allowed ranges, enum columns only take known values, referential integrity holds. These are invariants — if they're violated, the downstream transform is operating on garbage and will produce garbage, silently.
Contract checks are cheap (column statistics, row counts, type checks) and should run as a hard gate: if the check fails, you halt the pipeline, quarantine the offending batch, and page the engineer. You do not let malformed records flow forward and hope the model learns around them.
Contract assertions (examples)
─────────────────────────────
required: prompt ≠ null, response ≠ null, source ∈ {annotation, synth, distill}
types: token_count :: int, reward_score :: float ∈ [−1, 1]
ranges: len(prompt) ≥ 10, len(response) ≥ 20
referential: split_id must exist in splits_manifest table
statistical: null_rate(reward_score) < 2%, row_count within 5% of yesterday
Failure → quarantine batch → alert → pipeline halts. No silent pass-through.
Libraries express contracts as code so they can be version-controlled alongside the pipeline:
- Great Expectations — suite of "expectations" that run as a validation step in your DAG; produces an HTML data-quality report.
- Pandera — schema decorators on DataFrames; raises at the function boundary that produced bad data.
- Pydantic — row-level validation; parse each record into a typed model and reject at ingest.
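The example assertions above can also be sketched without any library, which makes the hard-gate behavior explicit. This is a stdlib-only illustration, not the Great Expectations, Pandera, or Pydantic API: the names `validate_record`, `validate_batch`, and `ContractViolation` are hypothetical, the thresholds are copied from the example contract, and the referential and row-count checks are omitted because they need external state.

```python
ALLOWED_SOURCES = {"annotation", "synth", "distill"}


class ContractViolation(Exception):
    """Raised to halt the pipeline; the caller quarantines the batch."""


def validate_record(rec: dict) -> list:
    """Return the row-level contract violations (an empty list means valid)."""
    errors = []
    prompt, response = rec.get("prompt"), rec.get("response")
    if not isinstance(prompt, str) or len(prompt) < 10:
        errors.append("prompt must be a string of at least 10 characters")
    if not isinstance(response, str) or len(response) < 20:
        errors.append("response must be a string of at least 20 characters")
    if rec.get("source") not in ALLOWED_SOURCES:
        errors.append("source must be one of annotation, synth, distill")
    # bool is a subclass of int in Python, so exclude it explicitly.
    token_count = rec.get("token_count")
    if not isinstance(token_count, int) or isinstance(token_count, bool):
        errors.append("token_count must be an int")
    # A null reward_score is allowed per row; the batch-level null rate is checked below.
    score = rec.get("reward_score")
    if score is not None and not (isinstance(score, float) and -1.0 <= score <= 1.0):
        errors.append("reward_score must be a float in [-1, 1]")
    return errors


def validate_batch(batch: list, max_null_rate: float = 0.02) -> None:
    """Hard gate: raise on any row-level or batch-level violation."""
    for i, rec in enumerate(batch):
        errors = validate_record(rec)
        if errors:
            raise ContractViolation(f"row {i}: {errors}")
    nulls = sum(1 for rec in batch if rec.get("reward_score") is None)
    if batch and nulls / len(batch) >= max_null_rate:
        raise ContractViolation(f"null_rate(reward_score) = {nulls / len(batch):.1%}")
```

Raising an exception, rather than returning a flag, means a caller cannot forget to check the result: the batch either passes or the run stops.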
2 · Content quality filtering — deciding what to keep
Structural validation tells you a record is well-formed. Content quality filtering tells you whether it's actually useful for training. This is a policy decision expressed as a scored threshold: records above the cut stay; records below are dropped or sent for review.
Heuristic filters — fast, cheap, imperfect
- Length bounds: too short (a one-word response) or too long (runaway generation) signals low utility. Tune the bounds per regime: SFT responses have different expected lengths than RL rollouts.
- Language identification: keep only the target language(s) unless you're building a multilingual set. fastText's language-ID model runs in under 1 ms per record.
- Repetition / n-gram ratio: a response that repeats the same 5-gram 40 times is a degenerate generation. One measure is the fraction of tokens covered by the top-k most frequent n-gram types.
- Perplexity (LM score): score against a reference language model; extremely low perplexity (memorized boilerplate) or extremely high perplexity (garbled text) both warrant filtering. The CCNet pipeline filters web text this way.
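The two heuristics that need no external model, length bounds and the repetition ratio, can be sketched in a few lines. This is an illustration under assumptions: whitespace tokenization, the function names, and the default bounds are all hypothetical, and only n-gram types that occur more than once are counted as repeats.

```python
from collections import Counter


def within_length_bounds(text: str, min_chars: int = 20, max_chars: int = 8000) -> bool:
    """Length filter; the bounds are placeholders to tune per regime."""
    return min_chars <= len(text) <= max_chars


def top_ngram_token_fraction(tokens: list, n: int = 5, k: int = 1) -> float:
    """Fraction of tokens covered by the k most frequent repeated n-gram types."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Keep only types seen more than once, so fully unique text scores 0.
    top = {gram for gram, count in Counter(ngrams).most_common(k) if count > 1}
    covered = [False] * len(tokens)
    for i, gram in enumerate(ngrams):
        if gram in top:
            covered[i:i + n] = [True] * n
    return sum(covered) / len(tokens)


degenerate = ("the cat sat on the mat " * 40).split()
print(top_ngram_token_fraction(degenerate))                      # high: one 5-gram covers most tokens
print(top_ngram_token_fraction("a b c d e f g h i j".split()))   # 0.0: no repeated 5-gram
```

A record would then be dropped when the fraction exceeds a tuned cutoff; the cutoff itself is a policy choice, not something the function decides.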
Model-based scorers — slow, expensive, accurate
- Reward model score: a trained RM predicts human preference for the response. A high RM score indicates a response a human would likely prefer. Used as a quality cutoff or to rank examples for curriculum ordering.
- Toxicity / safety classifier: a fast discriminator (e.g. a fine-tuned BERT) flags harmful content. Runs at ~50 ms/record on CPU, versus roughly 200 ms/record for a large RM on GPU.
- LLM-as-judge: a frontier model rates the example on axes like helpfulness, factuality, and instruction-following. Expensive (API cost or GPU time) but able to pick up nuanced quality signals. Reserve it for the final cut or for preference-pair creation.
Ordering: predicate pushdown for quality stages
The same principle from lesson 04 (predicate pushdown: filter early, read less) and lesson 05 (filter before the shuffle) applies inside the quality pipeline: run cheap filters first so expensive scorers never see rows that would have been dropped anyway.
Ordered filter pipeline (cheapest → most expensive)
────────────────────────────────────────────────────
1. Schema / contract check     (~0.001 ms/row · CPU)
2. Length bounds               (~0.01 ms/row · CPU)
3. Language ID                 (~0.5 ms/row · CPU)
4. Repetition / n-gram ratio   (~1 ms/row · CPU)
5. Perplexity filter           (~5 ms/row · CPU)
6. Toxicity classifier         (~50 ms/row · CPU)
7. Reward-model score          (~200 ms/row · GPU)  ← most expensive, runs last
Each stage drops records; later stages process a smaller set. Running stage 7 first would spend ~200 ms of GPU time on every row, including rows a sub-millisecond check would have dropped, for the same final output.
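One way to make the ordering concrete is a runner that sorts stages by per-row cost and short-circuits on the first failure. This is a sketch, not a production scheduler: `Stage`, `run_pipeline`, and the toy predicates are hypothetical, and the costs are the rough per-row figures from the list above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    cost_ms: float                  # rough per-row cost, used for ordering and accounting
    keep: Callable[[dict], bool]    # returns True if the row survives this stage


def run_pipeline(rows, stages):
    """Apply stages cheapest-first; return survivors and estimated ms spent per stage."""
    ordered = sorted(stages, key=lambda s: s.cost_ms)
    spent = {s.name: 0.0 for s in ordered}
    survivors = []
    for row in rows:
        for stage in ordered:
            spent[stage.name] += stage.cost_ms
            if not stage.keep(row):
                break               # dropped: costlier stages never see this row
        else:
            survivors.append(row)   # no stage rejected the row
    return survivors, spent


stages = [
    # Stand-in for a real reward-model call; here the score is precomputed.
    Stage("reward_model", 200.0, lambda r: r["rm_score"] >= 0.5),
    Stage("length", 0.01, lambda r: 20 <= len(r["response"]) <= 8000),
]
rows = [
    {"response": "ok", "rm_score": 0.9},
    {"response": "A sufficiently long and useful answer.", "rm_score": 0.7},
    {"response": "A sufficiently long but weak answer...", "rm_score": 0.1},
]
kept, spent = run_pipeline(rows, stages)
print(len(kept), spent)
```

In this toy run the length stage sees all three rows, while the reward-model stage is charged for only the two that survive it, even though the stages were declared in the opposite order.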
The cost implication is significant. If your RM scores 1 M records at 200 ms each, that is about 55.6 GPU-hours (1,000,000 × 0.2 s ÷ 3,600 s/h). If the length filter, at ~0.01 ms/row on CPU, drops 30% first, you save roughly 16.7 GPU-hours before touching a GPU at all.
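The arithmetic is easy to check. The helper below is a hypothetical sketch that assumes a single GPU scoring rows one at a time with no batching; real throughput with batched inference will differ.

```python
def rm_gpu_hours(n_records: int, ms_per_record: float = 200.0, drop_fraction: float = 0.0) -> float:
    """GPU-hours needed to score the records that survive an upstream filter."""
    surviving = n_records * (1.0 - drop_fraction)
    return surviving * ms_per_record / 1000.0 / 3600.0


full = rm_gpu_hours(1_000_000)                           # no upstream filtering
filtered = rm_gpu_hours(1_000_000, drop_fraction=0.30)   # a cheap filter drops 30% first
print(f"{full:.1f} unfiltered, {filtered:.1f} after filtering, {full - filtered:.1f} saved")
```

Run as written, this reports 55.6 GPU-hours unfiltered, 38.9 after filtering, and 16.7 saved.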
Hard vs soft gates
Structural violations are always a hard gate. Content quality thresholds are a soft gate: you set a cutoff, but where you set it is a tunable policy choice. A batch where 40% of records are below your RM threshold might be worth reviewing rather than dropping — especially if the batch represents a rare domain. Route low-scoring records to a review queue; don't silently discard information you may later wish you had.
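The two gate types map to different control flow: one raises, the other routes. A minimal sketch, assuming each record already carries an `rm_score` and a hypothetical `is_valid` flag set by the contract check; `route_batch` and the 0.5 threshold are illustrative only.

```python
def route_batch(batch, rm_threshold=0.5):
    """Hard gate on structure, soft gate on quality."""
    # Hard gate: any structural violation rejects the whole batch.
    invalid = [i for i, rec in enumerate(batch) if not rec["is_valid"]]
    if invalid:
        raise ValueError(f"quarantine batch: contract violated at rows {invalid}")

    # Soft gate: low scorers go to a review queue instead of being discarded.
    keep, review = [], []
    for rec in batch:
        (keep if rec["rm_score"] >= rm_threshold else review).append(rec)
    return keep, review
```

Because the review queue preserves low-scoring records, moving `rm_threshold` later only requires re-routing, not re-ingesting.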
Quality vs quantity — the LIMA intuition
For post-training, a smaller high-quality set often beats a larger noisy one. The LIMA paper demonstrated that 1,000 carefully curated examples can produce competitive SFT behavior: the training signal is in the quality of the gradient, not the count of the steps. The same intuition carries over to other regimes: for preference data, a clean 50 k set of genuine human judgments can outperform a noisy 5 M set of synthetic auto-labeled pairs.
The practical consequence: invest engineering time in raising quality thresholds, not in ingesting more raw data. More raw data that passes poor quality gates just adds noise to the gradient. See the companion RL lesson 18a on signal-density curation for the gradient-signal framing of the same intuition.
Interactive · quality funnel
A corpus of 1,000,000 records flows through ordered filter stages. Adjust the threshold sliders to see survivors at each stage, final yield, and the compute cost of model-based scoring. Notice how tightening the reward-model threshold raises quality but cuts yield — and how the per-record cost of the RM means it dominates total GPU spend.