Dedup & decontamination
Duplicates waste compute, teach the model to memorize, and skew the distribution. Test-set leakage makes your benchmark numbers a lie. This lesson shows you exactly how to catch both — and why the core operation is the canonical shuffle you met in lesson 05.
Why duplicates hurt, three ways
Training on a corpus with duplicates causes three distinct problems, each serious on its own:
| Problem | Mechanism | Symptom |
|---|---|---|
| Memorization | A document seen k times contributes k gradient updates on identical tokens. The model learns to regurgitate the exact string rather than generalize. | Verbatim output of training text at inference; privacy risk if the string is personal data. |
| Compute waste | Processing a duplicate is strictly wasted FLOP — the model is already fit to that example. At scale, a 30% duplicate rate means ~30% of training tokens are repeated work. | Higher cost for the same model quality; or lower quality for the same cost. |
| Distribution skew | High-duplicate sources (code hosts, forum mirrors, news syndication) are over-represented in the loss, biasing the model toward those styles and topics. | Domain imbalance, over-confident predictions on common phrasings. |
Dedup is the canonical shuffle
In lesson 05 we learned that a wide (shuffle) operation requires all-to-all data movement: every worker partitions its rows by key and sends each bucket to the worker that owns that key. Dedup is exactly this pattern. To decide whether document A and document B are duplicates, they must land on the same worker. With N documents across W workers, there is no escaping the all-to-all exchange — the only question is how much data you exchange, and that is where the algorithm choice matters enormously.
Worker 0 Worker 1 Worker 2
doc A ──┐ "A and B share the same
doc D ├─── all-to-all ───────▶ similarity key — route
... │ (shuffle) both to the same worker"
doc B ──┘
Without this shuffle: O(n²)
pairwise comparisons.
With it + smart hashing: ~O(n).
Step 1 — normalization
Before any hashing, normalize the content so that trivially equivalent documents hash identically. The standard pipeline:
- Unicode NFC — collapse composed/decomposed forms of the same glyph.
- Lowercasing — "The Cat" and "the cat" are the same document for dedup purposes.
- Whitespace collapse — replace all whitespace runs (tabs, newlines, double spaces) with a single space; strip leading/trailing.
- Boilerplate stripping (optional but high-value) — remove navigation bars, cookie banners, repeated headers/footers using regex or a DOM parser. This stops the boilerplate from dominating the hash while the real content is different.
- PII normalization — replace email addresses, phone numbers, and SSN patterns with placeholder tokens (
<EMAIL>,<PHONE>) both to prevent memorization of personal data and so that two documents differing only in a name/email still hash as duplicates.
Step 2 — exact dedup
After normalization, hash each document (SHA-256 or xxHash — both fast and collision-resistant enough for this) and insert into a distributed hash set. O(n) time and one narrow pass — no shuffle needed because you're deduplicating against a set you build incrementally. Remove any document whose hash is already in the set.
Exact dedup catches verbatim copies and near-verbatim copies that normalization collapsed. It misses paraphrases, lightly edited versions, and documents with the same ideas in different words — that's what near-duplicate detection is for.
Step 3 — near-duplicate dedup via MinHash + LSH
Near-duplicate detection is the expensive part. The naive approach — compare every pair — is O(n2), which is prohibitive at tens of millions of documents. The standard solution is MinHash + Locality-Sensitive Hashing (LSH), which reduces it to roughly O(n) by hashing similar documents into the same bucket with high probability.
The chain has four steps:
3a — Shingles (k-grams)
Represent each document as a set of k-grams (consecutive k-token or k-character windows). With k = 5, "the quick brown fox" produces shingles: {the quick brown, quick brown fox, ...}. Two documents that share most shingles are near-duplicates. Similarity between two sets A and B is captured by Jaccard similarity:
J(A, B) = |A ∩ B| / |A ∪ B|
J = 1 means identical shingle sets (near-exact duplicate); J = 0 means no overlap at all. The practical threshold for "near-duplicate" is typically J ≥ 0.7 – 0.8.
3b — MinHash signature
Computing exact Jaccard for every pair is the O(n2) problem. MinHash sidesteps it: for each of h independent hash functions, compute the minimum hash value over all shingles in the document. The resulting vector of h minimums is the MinHash signature. The key property: the probability that two documents agree on a single MinHash position equals their Jaccard similarity.
Pr[min(h(A)) = min(h(B))] = J(A, B)
This is an unbiased estimator. The h-position signature reduces each document to a compact fixed-length vector regardless of document size.
3c — LSH banding
Even comparing signatures pairwise is O(n2). LSH banding converts it to a bucket problem. Split the signature of length L = b × r into b bands of r rows each. Hash each band independently. Two documents become a candidate pair if any one of their b bands hashes to the same bucket. This is the shuffle: all documents go into buckets (group-by on band hashes), and each worker handles the candidate pairs in its buckets.
The probability that a pair with Jaccard similarity s becomes a candidate pair (the S-curve) is:
P(candidate | s) = 1 − (1 − sr)b
This curve is steep: near 0 for s below a threshold, near 1 above it. The effective threshold — where the S-curve crosses 0.5 — is approximately:
threshold ≈ (1/b)1/r
You control the threshold and sensitivity by tuning b and r. More bands b lowers the threshold (catches looser near-dups, more false-positive candidates to verify). More rows per band r raises the threshold (stricter matching, fewer candidates).
3d — Candidate verification
For each candidate pair produced by banding, compute the exact Jaccard (or edit distance) and apply a hard threshold. Discard one document from each pair above the threshold. Because LSH already grouped the likely pairs, this verification step is cheap — it's proportional to the number of true near-duplicates, not O(n2).
Interactive · MinHash / LSH S-curve explorer
Tune the banding parameters and see how the detection probability curve shifts. The goal is a steep S-curve whose threshold sits at or below the similarity you care about — catching near-dups while rejecting distinct documents.
Step 4 — semantic dedup (brief)
MinHash/LSH catches lexical near-duplicates — documents that share a lot of the same words. It cannot detect paraphrases that express the same content in different vocabulary. Semantic dedup handles those: embed each document with a dense encoder, build an approximate-nearest-neighbor (ANN) index (FAISS, ScaNN, or similar), and remove documents whose embedding cosine similarity exceeds a threshold. The cost is higher (embedding every document) but it catches reformulations that fool MinHash entirely. In practice, you run exact dedup then MinHash/LSH first, then optionally run semantic dedup on the remainder if the dataset is large and paraphrase-heavy.
Decontamination — preventing eval-set leakage
Decontamination solves a different problem from dedup: ensuring that benchmark test questions or passages do not appear in the training set. If they do, the model has "seen the answer" and benchmark scores are inflated — your numbers are lying.
The standard approach is n-gram overlap:
- Build a set (or inverted index) of all n-grams in the eval/test sets. 13-grams are common for open-ended text; 8-grams for shorter questions.
- For each training document, compute what fraction of its n-grams appear in the eval set.
- Flag or remove any document above a threshold (e.g. a 13-gram exact match anywhere in the document is often enough to flag it).
The key dependency: you need the eval sets on hand before you finalize the training set. This connects directly to lesson 03's provenance requirement — eval set versions must be tracked and reproducible, or you can't know whether a future training run is contaminated.
Putting it all together: the dedup/decon stage in the pipeline
bronze (raw Parquet)
│
▼ narrow pass (embarrassingly parallel)
normalize + PII-normalize
│
▼ narrow pass
exact dedup (hash → distributed set, drop matches)
│
▼ WIDE / SHUFFLE ◀── this is the expensive operation
MinHash → LSH banding → candidate pairs per bucket
│
▼ narrow pass (per candidate pair)
Jaccard verify → remove near-duplicates
│
▼ optional narrow pass
semantic dedup (embed + ANN)
│
▼ narrow pass
decontamination (n-gram overlap vs eval sets)
│
▼
silver (clean, deduped, decontaminated Parquet)
The stage has exactly one wide operation (the LSH banding shuffle). Everything else is embarrassingly parallel. Optimize the shuffle: push exact dedup before it so the document count is already reduced; use tight MinHash signatures (only as many hash functions as you need); partition the banded hashes into enough buckets that each worker's candidate list is manageable. If your corpus is 1 TB before exact dedup, it might be 700 GB after — that is the data you pay to shuffle, not the original 1 TB.