Orientation
A 3-minute map: the three data regimes, why data engineering is a role of its own, and how the eleven lessons fit together.
Three regimes, three units of data
Everything in post-training falls into one of three regimes, and each defines a different unit of data. The whole pipeline exists to manufacture these units cleanly and at scale.
| Regime | Unit of data | Where it comes from | Lesson |
|---|---|---|---|
| SFT (supervised) | (prompt, response) | Human demos, distillation, synthetic | 01 |
| Preference (DPO/RLHF) | (prompt, chosen, rejected) | Pairwise human/AI judgments | 01 |
| RL / RLVR | (prompt, verifier) → rollouts | Prompt bank + the model itself, live | 01, 10 |
The first two are static: you build the dataset once, then train. The third is dynamic — the data is generated by the policy during training, so the pipeline runs inside the training loop. That difference is the spine of this series: Parts I–II build the static pipeline; Part III turns it online.
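To make the three units concrete, here is a minimal Python sketch of how they might be typed. Every name in it (`SFTExample`, `PreferencePair`, `RLTask`, `Rollout`, `collect_rollouts`) is hypothetical and chosen for illustration; the series may structure its records differently. The point it shows is the static/dynamic split: the first two records can be written to disk before training, while rollouts only exist once a policy is called.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SFTExample:
    """Static unit: one supervised (prompt, response) pair."""
    prompt: str
    response: str


@dataclass(frozen=True)
class PreferencePair:
    """Static unit: one (prompt, chosen, rejected) judgment."""
    prompt: str
    chosen: str
    rejected: str


@dataclass(frozen=True)
class RLTask:
    """Dynamic unit, stored half: a prompt plus a verifier that scores a completion."""
    prompt: str
    verifier: Callable[[str], float]


@dataclass(frozen=True)
class Rollout:
    """Dynamic unit, generated half: produced by the policy during training."""
    prompt: str
    completion: str
    reward: float


def collect_rollouts(task: RLTask, policy: Callable[[str], str], n: int) -> List[Rollout]:
    """Sample n completions from the policy and score each with the task's verifier."""
    rollouts = []
    for _ in range(n):
        completion = policy(task.prompt)
        rollouts.append(Rollout(task.prompt, completion, task.verifier(completion)))
    return rollouts


# Toy usage: a constant "policy" and an exact-match verifier (both assumptions).
task = RLTask(
    prompt="What is 2 + 2?",
    verifier=lambda out: 1.0 if out.strip() == "4" else 0.0,
)
toy_policy = lambda prompt: "4"
batch = collect_rollouts(task, toy_policy, n=3)
```

Only `RLTask` lives in the prompt bank; `batch` has to be regenerated whenever the policy changes, which is why the RL pipeline cannot be built once and frozen.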
Why this is a separate discipline
Model engineers optimize the loss; data engineers optimize the substrate the loss reads. The skills barely overlap. A data pipeline lives or dies on questions the trainer never asks:
- Format & layout — Parquet or JSONL? Partitioned how? (lesson 04) Pick wrong and every job re-scans terabytes you didn't need.
- Distributed transformation — Spark, Ray, or a single box? (lesson 05) The shuffle, not the map, is where the time and money go.
- Dedup & decontamination — exact and near-duplicates, and test-set leakage (lesson 06). Invisible until your eval numbers are a lie.
- Reproducibility — same inputs, same bytes out, every time (lessons 02, 09). Without it you cannot debug a regression.
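As a rough illustration of the dedup and decontamination bullet, the sketch below removes exact duplicates by hashing normalized text, then drops any training record that shares a word n-gram with an eval set. The function names, the normalization rule, and the choice of `n=5` are assumptions made for this example, not the method lesson 06 prescribes. Near-duplicate detection (for instance MinHash) is deliberately left out.

```python
import hashlib
import re
from typing import Iterable, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(records: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each normalized text, preserving input order."""
    seen: Set[str] = set()
    kept = []
    for text in records:
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams over the normalized text (empty if fewer than n words)."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(records: Iterable[str], eval_texts: Iterable[str], n: int = 5) -> List[str]:
    """Drop any record sharing at least one word n-gram with the eval set."""
    eval_grams: Set[Tuple[str, ...]] = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [r for r in records if not (ngrams(r, n) & eval_grams)]


# Toy data, invented for this example.
train = [
    "The cat sat on the mat today.",
    "the cat  sat on the mat today.",  # exact duplicate after normalization
    "What is the capital of France? Paris is the answer.",  # leaks an eval prompt
    "Sort a list in Python with sorted().",
]
eval_set = ["What is the capital of France?"]

deduped = exact_dedup(train)
clean = decontaminate(deduped, eval_set, n=5)
```

Both passes preserve input order, so the same inputs produce the same outputs, which also serves the reproducibility bullet.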
The shape you'll see in every lesson
    SOURCES        BRONZE          SILVER                      GOLD
    annotate   ──▶ raw         ──▶ clean · dedup · decon ──▶   tokenized
    synth          immutable       quality-gated               packed
    logs           provenance      (lessons 06–08)             (lesson 07)
    scrape         (lesson 03)                                     │
    (lesson 03)        ▲                                           ▼
                       │                                        TRAINER
                       └────────── RL: rollouts re-enter ◀────── SFT/DPO/RL
                                   as data every step (lesson 10)
Bronze/silver/gold is the medallion layout (lesson 02): raw data is immutable and append-only; each downstream layer is a deterministic function of the one before it. That property — every layer reproducible from the last — is what lets you re-run, backfill, and audit a dataset months later.
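The "deterministic function of the layer before it" property can be sketched in a few lines of Python. The sketch assumes each layer is a pure function with a stable output order, and that a layer can be fingerprinted by hashing a canonical serialization. The names `to_silver`, `to_gold`, and `fingerprint`, the record fields, and the whitespace "tokenizer" are placeholders for illustration only.

```python
import hashlib
import json
from typing import Iterable, List


def to_silver(bronze: Iterable[dict]) -> List[dict]:
    """Clean step: strip whitespace, drop empty responses, sort by id for stable order."""
    cleaned = [
        {"id": r["id"], "prompt": r["prompt"].strip(), "response": r["response"].strip()}
        for r in bronze
    ]
    kept = [r for r in cleaned if r["response"]]
    return sorted(kept, key=lambda r: r["id"])


def to_gold(silver: Iterable[dict]) -> List[dict]:
    """Tokenize step: whitespace splitting stands in for a real tokenizer."""
    return [
        {"id": r["id"], "tokens": (r["prompt"] + " " + r["response"]).split()}
        for r in silver
    ]


def fingerprint(layer: Iterable[dict]) -> str:
    """SHA-256 over a canonical JSON-lines serialization of the layer."""
    payload = "\n".join(json.dumps(r, sort_keys=True, ensure_ascii=False) for r in layer)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Bronze is a tuple here to mirror "immutable"; the records are toy data.
bronze = (
    {"id": "a1", "prompt": " What is 2 + 2? ", "response": "4 "},
    {"id": "b2", "prompt": "Name a color.", "response": "   "},
    {"id": "c3", "prompt": "Say hi.", "response": "hi"},
)

silver = to_silver(bronze)
run_1 = fingerprint(to_gold(silver))
# Re-running from bronze, even with a different arrival order, yields the same bytes.
run_2 = fingerprint(to_gold(to_silver(reversed(bronze))))
```

If `run_1` and `run_2` ever differ for the same bronze input, some step is non-deterministic (unsorted output, wall-clock timestamps, unseeded sampling), and that is the step to look at before auditing or backfilling.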
How the lessons build
- Part I (01–03) — the substrate: the data units, the ETL skeleton, and ingestion.
- Part II (04–09) — one pipeline stage per lesson, in flow order: storage → transformation → dedup → tokenization/packing → quality → orchestration.
- Part III (10–11) — the RL online dataplane, then the cost/throughput/monitoring model for the whole system.
Start with lesson 01: the data itself, before any pipe touches it.