data_engineering / 06 · dedup & decon lesson 6 / 11

Dedup & decontamination

Duplicates waste compute, teach the model to memorize, and skew the distribution. Test-set leakage makes your benchmark numbers a lie. This lesson shows you exactly how to catch both — and why the core operation is the canonical shuffle you met in lesson 05.

Where we are
Lesson 05 showed that a shuffle — grouping a dataset by key across all workers — is the expensive, all-to-all operation that dominates distributed pipeline cost. Dedup is precisely that: to find duplicates you must group every document by a similarity key, which requires every worker to communicate with every other. Everything from lesson 05 applies here directly. The silver layer we build in this lesson feeds into lesson 07's tokenization stage.

Why duplicates hurt, three ways

Training on a corpus with duplicates causes three distinct problems, each serious on its own:

ProblemMechanismSymptom
MemorizationA document seen k times contributes k gradient updates on identical tokens. The model learns to regurgitate the exact string rather than generalize.Verbatim output of training text at inference; privacy risk if the string is personal data.
Compute wasteProcessing a duplicate is strictly wasted FLOP — the model is already fit to that example. At scale, a 30% duplicate rate means ~30% of training tokens are repeated work.Higher cost for the same model quality; or lower quality for the same cost.
Distribution skewHigh-duplicate sources (code hosts, forum mirrors, news syndication) are over-represented in the loss, biasing the model toward those styles and topics.Domain imbalance, over-confident predictions on common phrasings.

Dedup is the canonical shuffle

In lesson 05 we learned that a wide (shuffle) operation requires all-to-all data movement: every worker partitions its rows by key and sends each bucket to the worker that owns that key. Dedup is exactly this pattern. To decide whether document A and document B are duplicates, they must land on the same worker. With N documents across W workers, there is no escaping the all-to-all exchange — the only question is how much data you exchange, and that is where the algorithm choice matters enormously.

Worker 0      Worker 1      Worker 2
  doc A  ──┐                           "A and B share the same
  doc D    ├─── all-to-all ───────▶     similarity key — route
  ...      │    (shuffle)               both to the same worker"
  doc B  ──┘
                                       Without this shuffle: O(n²)
                                       pairwise comparisons.
                                       With it + smart hashing: ~O(n).

Step 1 — normalization

Before any hashing, normalize the content so that trivially equivalent documents hash identically. The standard pipeline:

  1. Unicode NFC — collapse composed/decomposed forms of the same glyph.
  2. Lowercasing — "The Cat" and "the cat" are the same document for dedup purposes.
  3. Whitespace collapse — replace all whitespace runs (tabs, newlines, double spaces) with a single space; strip leading/trailing.
  4. Boilerplate stripping (optional but high-value) — remove navigation bars, cookie banners, repeated headers/footers using regex or a DOM parser. This stops the boilerplate from dominating the hash while the real content is different.
  5. PII normalization — replace email addresses, phone numbers, and SSN patterns with placeholder tokens (<EMAIL>, <PHONE>) both to prevent memorization of personal data and so that two documents differing only in a name/email still hash as duplicates.
PII is a dedup concern, not just a privacy concern
A news article scraped from 500 mirrors differs only in a byline or cookie banner. Without normalization, exact hashing sees 500 unique documents. With it, all 500 collapse to one. Do normalization before hashing, not as a post-processing step.

Step 2 — exact dedup

After normalization, hash each document (SHA-256 or xxHash — both fast and collision-resistant enough for this) and insert into a distributed hash set. O(n) time and one narrow pass — no shuffle needed because you're deduplicating against a set you build incrementally. Remove any document whose hash is already in the set.

Exact dedup catches verbatim copies and near-verbatim copies that normalization collapsed. It misses paraphrases, lightly edited versions, and documents with the same ideas in different words — that's what near-duplicate detection is for.

Step 3 — near-duplicate dedup via MinHash + LSH

Near-duplicate detection is the expensive part. The naive approach — compare every pair — is O(n2), which is prohibitive at tens of millions of documents. The standard solution is MinHash + Locality-Sensitive Hashing (LSH), which reduces it to roughly O(n) by hashing similar documents into the same bucket with high probability.

The chain has four steps:

3a — Shingles (k-grams)

Represent each document as a set of k-grams (consecutive k-token or k-character windows). With k = 5, "the quick brown fox" produces shingles: {the quick brown, quick brown fox, ...}. Two documents that share most shingles are near-duplicates. Similarity between two sets A and B is captured by Jaccard similarity:

J(A, B) = |A ∩ B| / |A ∪ B|

J = 1 means identical shingle sets (near-exact duplicate); J = 0 means no overlap at all. The practical threshold for "near-duplicate" is typically J ≥ 0.7 – 0.8.

3b — MinHash signature

Computing exact Jaccard for every pair is the O(n2) problem. MinHash sidesteps it: for each of h independent hash functions, compute the minimum hash value over all shingles in the document. The resulting vector of h minimums is the MinHash signature. The key property: the probability that two documents agree on a single MinHash position equals their Jaccard similarity.

Pr[min(h(A)) = min(h(B))] = J(A, B)

This is an unbiased estimator. The h-position signature reduces each document to a compact fixed-length vector regardless of document size.

3c — LSH banding

Even comparing signatures pairwise is O(n2). LSH banding converts it to a bucket problem. Split the signature of length L = b × r into b bands of r rows each. Hash each band independently. Two documents become a candidate pair if any one of their b bands hashes to the same bucket. This is the shuffle: all documents go into buckets (group-by on band hashes), and each worker handles the candidate pairs in its buckets.

The probability that a pair with Jaccard similarity s becomes a candidate pair (the S-curve) is:

P(candidate | s) = 1 − (1 − sr)b

This curve is steep: near 0 for s below a threshold, near 1 above it. The effective threshold — where the S-curve crosses 0.5 — is approximately:

threshold ≈ (1/b)1/r

You control the threshold and sensitivity by tuning b and r. More bands b lowers the threshold (catches looser near-dups, more false-positive candidates to verify). More rows per band r raises the threshold (stricter matching, fewer candidates).

3d — Candidate verification

For each candidate pair produced by banding, compute the exact Jaccard (or edit distance) and apply a hard threshold. Discard one document from each pair above the threshold. Because LSH already grouped the likely pairs, this verification step is cheap — it's proportional to the number of true near-duplicates, not O(n2).

Interactive · MinHash / LSH S-curve explorer

Tune the banding parameters and see how the detection probability curve shifts. The goal is a steep S-curve whose threshold sits at or below the similarity you care about — catching near-dups while rejecting distinct documents.

MinHash / LSH parameter explorer
Adjust the number of bands b and rows per band r. The signature length is b × r. Move the Jaccard slider to read off the candidate probability at that similarity. The S-curve updates live.
Signature length
Threshold ≈ (1/b)^(1/r)
P(candidate) at s
P(miss) at s

Step 4 — semantic dedup (brief)

MinHash/LSH catches lexical near-duplicates — documents that share a lot of the same words. It cannot detect paraphrases that express the same content in different vocabulary. Semantic dedup handles those: embed each document with a dense encoder, build an approximate-nearest-neighbor (ANN) index (FAISS, ScaNN, or similar), and remove documents whose embedding cosine similarity exceeds a threshold. The cost is higher (embedding every document) but it catches reformulations that fool MinHash entirely. In practice, you run exact dedup then MinHash/LSH first, then optionally run semantic dedup on the remainder if the dataset is large and paraphrase-heavy.

Decontamination — preventing eval-set leakage

Decontamination solves a different problem from dedup: ensuring that benchmark test questions or passages do not appear in the training set. If they do, the model has "seen the answer" and benchmark scores are inflated — your numbers are lying.

The standard approach is n-gram overlap:

  1. Build a set (or inverted index) of all n-grams in the eval/test sets. 13-grams are common for open-ended text; 8-grams for shorter questions.
  2. For each training document, compute what fraction of its n-grams appear in the eval set.
  3. Flag or remove any document above a threshold (e.g. a 13-gram exact match anywhere in the document is often enough to flag it).

The key dependency: you need the eval sets on hand before you finalize the training set. This connects directly to lesson 03's provenance requirement — eval set versions must be tracked and reproducible, or you can't know whether a future training run is contaminated.

Real failure: eval leakage inflating benchmarks
Several high-profile post-training releases have been found to have benchmark contamination after the fact — test questions from MMLU, HumanEval, or GSM8K leaking into the pretraining or fine-tuning corpus. The published score is then not a measure of generalization; it's a measure of memorization. Downstream users who trust those numbers make wrong decisions: choosing a model that performs poorly on truly unseen tasks. The fix — n-gram decontamination against the held-out eval sets — is cheap. Skipping it is not a time saving; it's a scientific error.

Putting it all together: the dedup/decon stage in the pipeline

  bronze (raw Parquet)
       │
       ▼  narrow pass (embarrassingly parallel)
  normalize + PII-normalize
       │
       ▼  narrow pass
  exact dedup (hash → distributed set, drop matches)
       │
       ▼  WIDE / SHUFFLE  ◀── this is the expensive operation
  MinHash → LSH banding → candidate pairs per bucket
       │
       ▼  narrow pass (per candidate pair)
  Jaccard verify → remove near-duplicates
       │
       ▼  optional narrow pass
  semantic dedup (embed + ANN)
       │
       ▼  narrow pass
  decontamination (n-gram overlap vs eval sets)
       │
       ▼
  silver (clean, deduped, decontaminated Parquet)

The stage has exactly one wide operation (the LSH banding shuffle). Everything else is embarrassingly parallel. Optimize the shuffle: push exact dedup before it so the document count is already reduced; use tight MinHash signatures (only as many hash functions as you need); partition the banded hashes into enough buckets that each worker's candidate list is manageable. If your corpus is 1 TB before exact dedup, it might be 700 GB after — that is the data you pay to shuffle, not the original 1 TB.

Takeaway

What to carry to lesson 07
Dedup and decontamination are silver-layer concerns: run them before tokenization, not after. The core algorithm — MinHash + LSH banding — is a controlled shuffle whose cost you manage by tuning the signature length and band structure. Exact dedup first (cheap, narrow) so the shuffle sees less data. Decontamination requires the eval sets on hand — another reason provenance (lesson 03) matters. The output of this stage is a clean, deduplicated, decontaminated Parquet corpus ready for lesson 07's tokenization and sequence-packing step.