Orientation

A 3-minute map: the three data regimes, why data engineering is a role of its own, and how the eleven lessons fit together.

The one-sentence version

Post-training data does not arrive as training batches — it arrives as mess, and turning the mess into batches is an ETL problem with the same shape every time: ingest → clean → dedup → tokenize → pack → serve, run either as a nightly batch job or, in RL, as an online loop.
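To make that shape concrete, here is a toy sketch of the six stages as chained Python generators. Everything in it is an illustrative assumption rather than the series' actual code: the stage functions, the whitespace "tokenizer", and the greedy packer are stand-ins, and a real packer would also insert separator tokens and build attention masks.

```python
from typing import Iterable, Iterator

def ingest(raw_lines: Iterable[str]) -> Iterator[str]:
    # Ingest: yield raw records one at a time (here, plain strings).
    yield from raw_lines

def clean(records: Iterable[str]) -> Iterator[str]:
    # Clean: collapse runs of whitespace and drop empty records.
    for r in records:
        text = " ".join(r.split())
        if text:
            yield text

def dedup(records: Iterable[str]) -> Iterator[str]:
    # Dedup: keep only the first occurrence of each exact string.
    seen = set()
    for r in records:
        if r not in seen:
            seen.add(r)
            yield r

def tokenize(records: Iterable[str]) -> Iterator[list[str]]:
    # Tokenize: a whitespace split stands in for a real tokenizer.
    for r in records:
        yield r.split()

def pack(token_lists: Iterable[list[str]], max_len: int) -> Iterator[list[str]]:
    # Pack: greedily concatenate examples into sequences of at most max_len
    # tokens. Over-long examples are truncated in this toy version.
    current: list[str] = []
    for toks in token_lists:
        toks = toks[:max_len]
        if current and len(current) + len(toks) > max_len:
            yield current
            current = []
        current = current + toks
    if current:
        yield current

def serve(packed: Iterable[list[str]], batch_size: int) -> Iterator[list[list[str]]]:
    # Serve: group packed sequences into batches (the last one may be short).
    batch: list[list[str]] = []
    for seq in packed:
        batch.append(seq)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def run_pipeline(raw_lines, max_len=8, batch_size=2):
    # ingest -> clean -> dedup -> tokenize -> pack -> serve
    packed = pack(tokenize(dedup(clean(ingest(raw_lines)))), max_len)
    return list(serve(packed, batch_size))
```

Because every stage is a generator, the same chain can be drained once as a batch job or pulled from incrementally, which is the property the online RL variant relies on.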

Three regimes, three units of data

Every post-training method falls into one of three regimes, and each regime defines a different unit of data. The whole pipeline exists to manufacture these units cleanly and at scale.

Regime	Unit of data	Where it comes from	Lesson
SFT (supervised)	`(prompt, response)`	Human demos, distillation, synthetic	01
Preference (DPO/RLHF)	`(prompt, chosen, rejected)`	Pairwise human/AI judgments	01
RL / RLVR	`(prompt, verifier)` → rollouts	Prompt bank + the model itself, live	01, 10

The first two are static: you build the dataset once, then train. The third is dynamic — the data is generated by the policy during training, so the pipeline runs inside the training loop. That difference is the spine of this series: Parts I–II build the static pipeline; Part III turns it online.
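The three units can be written down as plain record types. This is a minimal sketch: the class names, the `verifier` signature (rollout text in, scalar reward out), and the `collect_rollouts` helper are assumptions made for illustration, not definitions taken from the lessons.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class SFTExample:
    # Static unit for supervised fine-tuning.
    prompt: str
    response: str

@dataclass(frozen=True)
class PreferencePair:
    # Static unit for preference training: one prompt, two ranked responses.
    prompt: str
    chosen: str
    rejected: str

@dataclass(frozen=True)
class RLPrompt:
    # Dynamic unit: the dataset stores only the prompt and how to score it.
    prompt: str
    verifier: Callable[[str], float]  # assumed signature: rollout -> reward

def collect_rollouts(item: RLPrompt, policy: Callable[[str], str], n: int):
    # The policy generates the data during training; the verifier scores it.
    rollouts = [policy(item.prompt) for _ in range(n)]
    return [(r, item.verifier(r)) for r in rollouts]
```

Note what the RL record leaves out: there is no response field, because the responses do not exist until the policy produces them.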

Why this is a separate discipline

Model engineers optimize the loss; data engineers optimize the substrate the loss reads. The skills barely overlap. A data pipeline lives or dies on questions the trainer never asks:

Format & layout — Parquet or JSONL? Partitioned how? (lesson 04) Pick wrong and every job re-scans terabytes you didn't need.
Distributed transformation — Spark, Ray, or a single box? (lesson 05) The shuffle, not the map, is where the time and money go.
Dedup & decontamination — exact and near-duplicates, and test-set leakage (lesson 06). Invisible until your eval numbers are a lie.
Reproducibility — same inputs, same bytes out, every time (lessons 02, 09). Without it you cannot debug a regression.
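As a taste of the dedup and decontamination question, the sketch below does exact deduplication on normalized text and flags test-set leakage by n-gram overlap. The normalization rule, the choice of `n=5`, and the function names are assumptions for illustration. It does not catch near-duplicates, which need a different technique.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())

def exact_dedup(texts):
    # Keep the first occurrence of each normalized text, preserving order.
    seen, kept = set(), []
    for t in texts:
        h = hashlib.sha256(normalize(t).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept

def ngrams(text: str, n: int):
    # Set of word n-grams; empty if the text has fewer than n words.
    toks = normalize(text).split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_text: str, test_texts, n: int = 5) -> bool:
    # Flag a training example that shares any n-gram with any test example.
    train_grams = ngrams(train_text, n)
    return any(train_grams & ngrams(t, n) for t in test_texts)
```

A check like this runs before training precisely because nothing downstream will complain if it is skipped: a contaminated run produces better-looking eval numbers, not an error.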

Companion lesson

The RL series' data-curation lesson (18a) answers which examples to keep to maximize gradient signal. This series answers the orthogonal question: how to build the pipeline that ingests, cleans, and serves them. Read them together.

The shape you'll see in every lesson

  SOURCES        BRONZE          SILVER                      GOLD
  annotate  ──▶  raw       ──▶   clean · dedup · decon  ──▶  tokenized
  synth          immutable       quality-gated               packed
  logs           provenance      (lessons 06, 08)            (lesson 07)
  scrape         (lesson 03)                                      │
  (lesson 03)        ▲                                            ▼
                     │                                        TRAINER
                     └────────── RL: rollouts re-enter ◀──────  SFT/DPO/RL
                                 as data every step (lesson 10)

Bronze/silver/gold is the medallion layout (lesson 02): raw data is immutable and append-only; each downstream layer is a deterministic function of the one before it. That property — every layer reproducible from the last — is what lets you re-run, backfill, and audit a dataset months later.
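The "deterministic function of the one before it" property can be checked mechanically. In this sketch, which assumes toy in-memory records and hypothetical `to_silver`, `to_gold`, and `fingerprint` helpers, each layer is a pure function and a content hash over canonical JSON confirms that a re-run yields the same bytes.

```python
import hashlib
import json

def fingerprint(records) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal data gives equal bytes.
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def to_silver(bronze):
    # Pure function of bronze: strip text, drop empties, dedup, in stable order.
    seen, out = set(), []
    for rec in bronze:
        text = rec["text"].strip()
        if text and text not in seen:
            seen.add(text)
            out.append({"text": text, "source": rec["source"]})
    return out

def to_gold(silver):
    # Pure function of silver: whitespace tokens stand in for tokenizer ids.
    return [{"tokens": rec["text"].split(), "source": rec["source"]}
            for rec in silver]

# Bronze is never edited in place; a tuple stands in for append-only storage.
BRONZE = (
    {"source": "logs", "text": " hello world "},
    {"source": "synth", "text": "hello world"},
    {"source": "scrape", "text": ""},
)

first_run = to_gold(to_silver(BRONZE))
second_run = to_gold(to_silver(BRONZE))
same_bytes = fingerprint(first_run) == fingerprint(second_run)
```

Storing the fingerprint next to each layer gives a cheap audit trail: if a backfill months later produces a different hash from the same bronze, some stage has stopped being deterministic.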

How the lessons build

Part I (01–03) — the substrate: the data units, the ETL skeleton, and ingestion.
Part II (04–09) — one pipeline stage per lesson, in flow order: storage → transformation → dedup → tokenization/packing → quality → orchestration.
Part III (10–11) — the RL online dataplane, then the cost/throughput/monitoring model for the whole system.

Start with lesson 01: the data itself, before any pipe touches it.