Quality & validation

Quality work happens in the silver layer — after dedup (lesson 06) and before tokenization (lesson 07). It splits into two distinct jobs: structural validation (does the data conform to a contract?) and content filtering (is this record worth training on?).

Where we are
Lesson 06 deduplicated the corpus. Lesson 07 tokenized and packed the survivors. This lesson is the gate that sits between them: it rejects records that are malformed or low-quality before they consume tokenization compute, GPU scoring time, or — worst — training tokens.

Two distinct kinds of checks

Pipeline quality work is often lumped together under "data cleaning," but that conflates two very different activities with different failure modes and different remedies.

                  Structural validation                               Content quality filtering
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Question asked  Does this record conform to the schema / contract?  Is this record worth training on?
  Answer type     Binary: valid or not                                Scored: better or worse
  Failure action  Hard gate: fail the build, quarantine the batch     Soft gate: tune the threshold, route for review
  When to run     At ingest / schema change                           After structural validation, cheapest checks first
  Tools           Great Expectations, Pandera, Pydantic               Heuristics, classifiers, reward models

1 · Structural validation — data contracts

A data contract is a formal assertion about the shape of your data: required fields are present, types match, values fall within allowed ranges, enum columns only take known values, referential integrity holds. These are invariants — if they're violated, the downstream transform is operating on garbage and will produce garbage, silently.

Contract checks are cheap (column statistics, row counts, type checks) and should run as a hard gate: if the check fails, you halt the pipeline, quarantine the offending batch, and page the engineer. You do not let malformed records flow forward and hope the model learns around them.

  Contract assertions (examples)
  ─────────────────────────────
  required:   prompt ≠ null, response ≠ null, source ∈ {annotation, synth, distill}
  types:      token_count :: int, reward_score :: float ∈ [−1, 1]
  ranges:     len(prompt) ≥ 10, len(response) ≥ 20
  referential: split_id must exist in splits_manifest table
  statistical: null_rate(reward_score) < 2%, row_count within 5% of yesterday

  Failure → quarantine batch → alert → pipeline halts.  No silent pass-through.

Libraries such as Great Expectations, Pandera, and Pydantic express contracts as code, so the contract can be version-controlled alongside the pipeline.
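To make the hard-gate idea concrete, here is a minimal, library-free sketch of the example assertions listed above. It is not the API of Great Expectations, Pandera, or Pydantic; the names `record_errors`, `validate_batch`, and `ContractViolation` are hypothetical, and it assumes a null `reward_score` is tolerated per record because the statistical check only caps the null rate.

```python
ALLOWED_SOURCES = {"annotation", "synth", "distill"}


class ContractViolation(Exception):
    """Raised to halt the pipeline when a batch breaks the contract."""


def record_errors(rec: dict, known_split_ids: set) -> list:
    """Return the contract violations for one record (empty list if valid)."""
    errors = []
    prompt, response = rec.get("prompt"), rec.get("response")
    if not isinstance(prompt, str) or len(prompt) < 10:
        errors.append("prompt missing or shorter than 10 chars")
    if not isinstance(response, str) or len(response) < 20:
        errors.append("response missing or shorter than 20 chars")
    if rec.get("source") not in ALLOWED_SOURCES:
        errors.append("source not in allowed enum")
    # bool is a subclass of int in Python, so exclude it explicitly.
    tc = rec.get("token_count")
    if not isinstance(tc, int) or isinstance(tc, bool):
        errors.append("token_count is not an int")
    # Assumption: None is allowed here; the batch-level null-rate check caps it.
    score = rec.get("reward_score")
    if score is not None and not (isinstance(score, float) and -1.0 <= score <= 1.0):
        errors.append("reward_score is not a float in [-1, 1]")
    if rec.get("split_id") not in known_split_ids:
        errors.append("split_id not found in splits manifest")
    return errors


def validate_batch(batch: list, known_split_ids: set, yesterday_rows: int) -> int:
    """Hard gate: raise ContractViolation on any record- or batch-level failure."""
    failures = {}
    for i, rec in enumerate(batch):
        errs = record_errors(rec, known_split_ids)
        if errs:
            failures[i] = errs
    n = len(batch)
    null_rate = sum(r.get("reward_score") is None for r in batch) / n if n else 1.0
    if null_rate >= 0.02:
        failures["batch:null_rate"] = [f"null_rate(reward_score)={null_rate:.3f}"]
    if abs(n - yesterday_rows) > 0.05 * yesterday_rows:
        failures["batch:row_count"] = [f"row_count={n}, yesterday={yesterday_rows}"]
    if failures:
        raise ContractViolation(failures)  # caller quarantines the batch and alerts
    return n
```

Raising an exception rather than returning a flag is a deliberate choice in this sketch: a caller that forgets to check the result cannot let a bad batch pass through silently.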

Hard gate means hard
A contract violation that "only" affects 0.3% of rows is still a build failure. Small violations are often the leading edge of an upstream schema change, an annotation tool bug, or a synthetic-generation prompt drift. Treat them as such.

2 · Content quality filtering — deciding what to keep

Structural validation tells you a record is well-formed. Content quality filtering tells you whether it's actually useful for training. This is a policy decision expressed as a scored threshold: records above the cut stay; records below are dropped or sent for review.

Heuristic filters — fast, cheap, imperfect
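Two of the heuristic stages named in the ordering later in this lesson, length bounds and repetition / n-gram ratio, can be sketched in a few lines of standard-library Python. The function names and every threshold here (`min_chars`, `max_chars`, `max_repeat_ratio`) are illustrative assumptions, not recommended values.

```python
from collections import Counter


def within_length_bounds(text: str, min_chars: int = 20, max_chars: int = 20_000) -> bool:
    """Cheap character-length check."""
    return min_chars <= len(text) <= max_chars


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def passes_heuristics(text: str, max_repeat_ratio: float = 0.3) -> bool:
    # Cheapest check first; `and` short-circuits, so the n-gram pass is
    # skipped for records that already failed the length bound.
    return within_length_bounds(text) and repeated_ngram_ratio(text) <= max_repeat_ratio
```

Filters like these are imperfect by construction: a legitimately repetitive record (a table, a poem with a refrain) will score badly, which is one reason to treat the cutoffs as tunable.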

Model-based scorers — slow, expensive, accurate

Ordering: predicate pushdown for quality stages

The same principle from lesson 04 (predicate pushdown: filter early, read less) and lesson 05 (filter before the shuffle) applies inside the quality pipeline: run cheap filters first so expensive scorers never see rows that would have been dropped anyway.

  Ordered filter pipeline (cheapest → most expensive)
  ────────────────────────────────────────────────────
  1. Schema / contract check          (~0.001 ms/row · CPU)
  2. Length bounds                    (~0.01  ms/row · CPU)
  3. Language ID                      (~0.5   ms/row · CPU)
  4. Repetition / n-gram ratio        (~1     ms/row · CPU)
  5. Perplexity filter                (~5     ms/row · CPU)
  6. Toxicity classifier              (~50    ms/row · CPU)
  7. Reward-model score               (~200   ms/row · GPU) ← most expensive, runs last

  Each stage drops records; later stages process a smaller set.
  Running stage 7 first would put every input row through the 200 ms GPU scorer for the same output.
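The cheapest-first ordering can be sketched as a small driver that sorts stages by estimated per-row cost and stops evaluating a record at the first stage that rejects it. `Stage` and `run_funnel` are hypothetical names, and the lambdas below are stand-ins for real filters and scorers (the "reward_model" stage just reads a precomputed field).

```python
from typing import Callable, Iterable, NamedTuple


class Stage(NamedTuple):
    name: str
    cost_ms: float                 # rough per-row cost, used only for ordering
    keep: Callable[[dict], bool]   # True if the record survives this stage


def run_funnel(records: Iterable[dict], stages: list) -> tuple:
    """Apply stages cheapest-first; return (survivors, rows seen per stage)."""
    ordered = sorted(stages, key=lambda s: s.cost_ms)
    seen = {s.name: 0 for s in ordered}
    survivors = []
    for rec in records:
        for stage in ordered:
            seen[stage.name] += 1
            if not stage.keep(rec):
                break              # dropped: later, pricier stages never run
        else:
            survivors.append(rec)  # no stage rejected the record
    return survivors, seen


# Stages are declared out of order on purpose; run_funnel sorts them by cost.
stages = [
    Stage("reward_model", 200.0, lambda r: r["rm_score"] >= 0.5),
    Stage("length", 0.01, lambda r: len(r["text"]) >= 20),
    Stage("repetition", 1.0, lambda r: len(set(r["text"].split())) >= 3),
]
records = [
    {"text": "short", "rm_score": 0.9},
    {"text": "word word word word word word", "rm_score": 0.9},
    {"text": "a reasonably long and varied response", "rm_score": 0.2},
    {"text": "another long and suitably varied response", "rm_score": 0.8},
]
survivors, seen = run_funnel(records, stages)
# seen == {"length": 4, "repetition": 3, "reward_model": 2}
```

Only two of the four records reach the most expensive stage, which is the point of the ordering: the `seen` counts are what you would multiply by per-row cost to estimate spend.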

The cost implication is significant. If your RM scores 1 M records at 200 ms each, that is about 56 GPU-hours (1,000,000 × 0.2 s = 200,000 s). If the 0.01 ms length filter drops 30% of them first, you save about 17 GPU-hours before touching a GPU at all.

Hard vs soft gates

Structural violations are always a hard gate. Content quality thresholds are a soft gate: you set a cutoff, but where you set it is a tunable policy choice. A batch where 40% of records are below your RM threshold might be worth reviewing rather than dropping — especially if the batch represents a rare domain. Route low-scoring records to a review queue; don't silently discard information you may later wish you had.
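A soft gate can be sketched as a three-way router rather than a keep/drop boolean. The cutoffs `keep_at` and `review_at`, the batch-level trigger `max_below_frac`, and the function names are all hypothetical; the 0.4 default simply mirrors the 40% example in the paragraph above.

```python
from collections import Counter


def route(score: float, keep_at: float = 0.6, review_at: float = 0.3) -> str:
    """Map a quality score to 'keep', 'review', or 'drop'."""
    if score >= keep_at:
        return "keep"
    if score >= review_at:
        return "review"
    return "drop"


def route_batch(scores, keep_at=0.6, review_at=0.3, max_below_frac=0.4):
    """Route each record and flag the batch if too much of it falls below the cut.

    Returns (decisions, needs_batch_review).
    """
    decisions = [route(s, keep_at, review_at) for s in scores]
    counts = Counter(decisions)
    below = (len(decisions) - counts["keep"]) / len(decisions) if decisions else 0.0
    return decisions, below >= max_below_frac
```

Because the thresholds are plain arguments, changing policy is a config change rather than a code change, and the "review" bucket keeps borderline records recoverable.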

Over-filtering: the other failure mode
It is possible to filter your dataset down to almost nothing — or, subtler, to introduce a distribution bias by aggressively removing low-RM-score records. If your RM scores are calibrated on a different distribution than your target task, you may be removing exactly the examples that would transfer well. Monitor yield at each stage and set floor thresholds on what fraction must survive.
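Yield monitoring with floors can be sketched as a check over per-stage row counts. The function name `check_yield` and both floor values are assumptions for illustration; appropriate floors depend on the corpus and the stage.

```python
def check_yield(stage_counts, min_stage_yield=0.5, min_total_yield=0.05):
    """stage_counts: ordered list of (stage_name, rows_in, rows_out).

    Returns a list of human-readable warnings; empty means yield looks healthy.
    """
    warnings = []
    for name, rows_in, rows_out in stage_counts:
        stage_yield = rows_out / rows_in if rows_in else 0.0
        if stage_yield < min_stage_yield:
            warnings.append(
                f"{name}: stage yield {stage_yield:.1%} below floor {min_stage_yield:.0%}"
            )
    if stage_counts:
        first_in = stage_counts[0][1]
        total = stage_counts[-1][2] / first_in if first_in else 0.0
        if total < min_total_yield:
            warnings.append(
                f"overall yield {total:.1%} below floor {min_total_yield:.0%}"
            )
    return warnings
```

A warning here is a prompt to inspect the threshold or the scorer's calibration, not proof that the filter is wrong.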

Quality vs quantity — the LIMA intuition

For post-training, a smaller high-quality set almost always beats a larger noisy one. The LIMA paper demonstrated that 1,000 carefully curated examples can produce competitive SFT behavior: the training signal is in the quality of the gradient, not the count of the steps. The same intuition carries over to preference data, where a clean 50 k set of genuine human judgments can outperform a noisy 5 M set of synthetic auto-labeled pairs.

The practical consequence: invest engineering time in raising quality thresholds, not in ingesting more raw data. More raw data that passes poor quality gates just adds noise to the gradient. See the companion RL lesson 18a on signal-density curation for the gradient-signal framing of the same intuition.

Interactive · quality funnel

A corpus of 1,000,000 records flows through ordered filter stages. Adjust the threshold sliders to see survivors at each stage, final yield, and the compute cost of model-based scoring. Notice how tightening the reward-model threshold raises quality but cuts yield — and how the per-record cost of the RM means it dominates total GPU spend.

[Interactive widget: quality funnel simulator. Records flow left to right through cheapest-first filters; sliders set what fraction survives each stage. GPU cost is charged only to records that reach the RM scorer, so earlier drops are free. Input: 1,000,000 records. Readouts: final yield, yield %, GPU-hours (RM), GPU cost saved, RM cost at $3/h.]
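The simulator's readouts can be approximated offline with a short function. This is a sketch under stated assumptions, not the widget's actual code: `simulate_funnel` is a hypothetical name, the last stage in `survival` is taken to be the reward model, and the defaults (200 ms per row, $3 per GPU-hour) come from the figures used in this lesson.

```python
def simulate_funnel(n_input, survival, rm_ms_per_row=200.0, gpu_usd_per_hour=3.0):
    """survival: ordered (stage_name, fraction_kept) pairs; the last stage is the RM.

    Returns survivors per stage plus the GPU-hours and cost of RM scoring.
    """
    rows = float(n_input)
    rm_input = rows
    survivors = {}
    for i, (name, frac) in enumerate(survival):
        if i == len(survival) - 1:
            rm_input = rows            # rows that actually reach the RM scorer
        rows *= frac
        survivors[name] = round(rows)
    gpu_hours = rm_input * rm_ms_per_row / 1000 / 3600
    baseline_hours = n_input * rm_ms_per_row / 1000 / 3600   # RM run on every row
    return {
        "survivors": survivors,
        "final_yield": round(rows) / n_input,
        "gpu_hours": gpu_hours,
        "gpu_hours_saved": baseline_hours - gpu_hours,
        "rm_cost_usd": gpu_hours * gpu_usd_per_hour,
    }


# A length filter that keeps 70%, followed by an RM cutoff that keeps half.
result = simulate_funnel(1_000_000, [("length", 0.7), ("reward_model", 0.5)])
```

With these inputs the RM scores 700,000 rows instead of 1,000,000, so `gpu_hours` is roughly 38.9 against a baseline of roughly 55.6. Tightening the RM fraction changes final yield but not RM cost, since every row that reaches the scorer is charged whether or not it survives.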

Takeaway

What to carry to lesson 09
Quality work is two separate jobs: structural contracts (hard gates, fail the build) and content filtering (soft gates, tune the threshold). Run cheap heuristics first so expensive GPU scorers only see records that survived the cheaper stages. For post-training, quality beats quantity: a tighter RM cutoff is almost always the right call until yield drops to a level that starves the trainer. Lesson 09 (orchestration) is where you'll see how each of these stages becomes a DAG task with dependencies, retries, and data-versioned outputs, so a failed quality gate can be re-run in isolation without re-processing the whole corpus.