Orientation

A 3-minute map: the three data regimes, why data engineering is a role of its own, and how the eleven lessons fit together.

The one-sentence version
Post-training data does not arrive as training batches — it arrives as mess, and turning the mess into batches is an ETL problem with the same shape every time: ingest → clean → dedup → tokenize → pack → serve, run either as a nightly batch job or, in RL, as an online loop.
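To make the stage sequence concrete, here is a minimal sketch of the ingest → clean → dedup → tokenize → pack → serve chain as composed Python generators. Everything in it is an assumption for illustration: the record shape (`prompt`/`response` dicts), the function names, the whitespace "tokenizer", and the greedy packer are toy stand-ins, not the implementations the later lessons develop.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]  # hypothetical record shape: {"prompt": ..., "response": ...}


def clean(records: Iterable[Record]) -> Iterator[Record]:
    """Strip whitespace and drop records with an empty field."""
    for r in records:
        prompt, response = r["prompt"].strip(), r["response"].strip()
        if prompt and response:
            yield {"prompt": prompt, "response": response}


def dedup(records: Iterable[Record]) -> Iterator[Record]:
    """Exact-match dedup; keeps the first occurrence of each pair."""
    seen = set()
    for r in records:
        key = (r["prompt"], r["response"])
        if key not in seen:
            seen.add(key)
            yield r


def tokenize(records: Iterable[Record]) -> Iterator[List[str]]:
    """Whitespace split stands in for a real tokenizer (assumption)."""
    for r in records:
        yield (r["prompt"] + " " + r["response"]).split()


def pack(token_lists: Iterable[List[str]], max_len: int) -> Iterator[List[str]]:
    """Greedily concatenate sequences into batches of at most max_len tokens."""
    batch: List[str] = []
    for toks in token_lists:
        toks = toks[:max_len]  # truncate oversized sequences (a simplification)
        if batch and len(batch) + len(toks) > max_len:
            yield batch
            batch = []
        batch.extend(toks)
    if batch:
        yield batch


def run_pipeline(raw: Iterable[Record], max_len: int) -> Iterator[List[str]]:
    """'Serve' here is simply handing the trainer an iterator of packed batches."""
    return pack(tokenize(dedup(clean(raw))), max_len)


raw = [  # "ingest": messy records from a hypothetical source
    {"prompt": " hi ", "response": "there"},
    {"prompt": "hi", "response": "there "},   # duplicate after cleaning
    {"prompt": "", "response": "x"},          # dropped: empty prompt
    {"prompt": "a b", "response": "c d"},
]
batches = list(run_pipeline(raw, max_len=4))
print(batches)  # [['hi', 'there'], ['a', 'b', 'c', 'd']]
```

Because each stage is a generator, the same chain could be driven once over a nightly dump or fed continuously; only the source iterator changes.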

Three regimes, three units of data

Everything in post-training is one of three regimes, and each defines a different unit of data. The whole pipeline exists to manufacture these units cleanly and at scale.

  Regime                  Unit of data                    Where it comes from                    Lesson
  SFT (supervised)        (prompt, response)              Human demos, distillation, synthetic   01
  Preference (DPO/RLHF)   (prompt, chosen, rejected)      Pairwise human/AI judgments            01
  RL / RLVR               (prompt, verifier) → rollouts   Prompt bank + the model itself, live   01, 10

The first two are static: you build the dataset once, then train. The third is dynamic — the data is generated by the policy during training, so the pipeline runs inside the training loop. That difference is the spine of this series: Parts I–II build the static pipeline; Part III turns it online.
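The three units, and the static/dynamic split, can be sketched as plain Python types. The class names, the `verifier` signature (completion in, scalar reward out), and the `generate_rollouts` helper are all hypothetical choices made for this illustration; the series may define them differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class SFTExample:            # static: written once, then trained on
    prompt: str
    response: str


@dataclass(frozen=True)
class PreferencePair:        # static: one pairwise judgment
    prompt: str
    chosen: str
    rejected: str


@dataclass(frozen=True)
class RLTask:                # the stored half of the RL unit
    prompt: str
    verifier: Callable[[str], float]  # assumed signature: completion -> reward


@dataclass(frozen=True)
class Rollout:               # the generated half, produced during training
    prompt: str
    completion: str
    reward: float


def generate_rollouts(
    tasks: Iterable[RLTask], policy: Callable[[str], str]
) -> Iterator[Rollout]:
    """Dynamic regime: the data depends on whatever the policy is right now."""
    for task in tasks:
        completion = policy(task.prompt)
        yield Rollout(task.prompt, completion, task.verifier(completion))


# Toy stand-ins: a "policy" that always answers "4" and an exact-match verifier.
def toy_policy(prompt: str) -> str:
    return "4"


def is_four(completion: str) -> float:
    return 1.0 if completion.strip() == "4" else 0.0


rollouts = list(generate_rollouts([RLTask("2 + 2 =", is_four)], toy_policy))
print(rollouts[0].reward)  # 1.0
```

The static units are complete the moment they are stored. An `RLTask` is not: it only becomes training data once a policy is called, which is why the RL pipeline cannot be finished before training starts.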

Why this is a separate discipline

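Model engineers optimize the loss; data engineers optimize the substrate the loss reads. The skills barely overlap. A data pipeline lives or dies on questions the trainer never asks: where each example came from, whether it is a duplicate or contaminates an eval, and whether the dataset can be rebuilt and audited months later.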

Companion lesson
The RL series' data-curation lesson (18a) answers which examples to keep to maximize gradient signal. This series answers the orthogonal question: how to build the pipeline that ingests, cleans, and serves them. Read them together.

The shape you'll see in every lesson

  SOURCES        BRONZE          SILVER                      GOLD
  annotate  ──▶  raw       ──▶   clean · dedup · decon  ──▶  tokenized
  synth          immutable       quality-gated               packed
  logs           provenance      (lessons 06–08)             (lesson 07)
  scrape         (lesson 03)                                      │
  (lesson 03)        ▲                                            ▼
                     │                                        TRAINER
                     └────────── RL: rollouts re-enter ◀──────  SFT/DPO/RL
                                 as data every step (lesson 10)

Bronze/silver/gold is the medallion layout (lesson 02): raw data is immutable and append-only; each downstream layer is a deterministic function of the one before it. That property — every layer reproducible from the last — is what lets you re-run, backfill, and audit a dataset months later.
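A small sketch of that reproducibility property, under stated assumptions: the record fields (`text`, `source`), the function names, and the use of a SHA-256 fingerprint over canonical JSON are illustrative choices, not the layout lesson 02 prescribes. The point is only that when a layer is a pure function of the one before it, rebuilding it yields a byte-identical result that can be checked.

```python
import hashlib
import json


def fingerprint(records) -> str:
    """Content hash of a layer; canonical JSON keeps it stable across runs."""
    payload = json.dumps(list(records), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_silver(bronze):
    """Deterministic: no clocks, no randomness, output order follows input order."""
    seen = set()
    silver = []
    for rec in bronze:
        text = " ".join(rec["text"].split())  # normalize whitespace
        if text and text not in seen:         # drop empties and exact duplicates
            seen.add(text)
            silver.append({"text": text, "source": rec["source"]})
    return silver


def build_gold(silver):
    """Whitespace split stands in for real tokenization (assumption)."""
    return [{"tokens": rec["text"].split(), "source": rec["source"]} for rec in silver]


# Bronze is only ever read here, never edited in place.
bronze = (
    {"text": "Hello   world", "source": "annotate"},
    {"text": "Hello world", "source": "logs"},   # duplicate after normalization
    {"text": "   ", "source": "scrape"},         # empty after normalization
)

silver = build_silver(bronze)
gold = build_gold(silver)

# Re-running months later from the same bronze gives the same layers.
assert fingerprint(build_silver(bronze)) == fingerprint(silver)
assert fingerprint(build_gold(build_silver(bronze))) == fingerprint(gold)
```

Storing a fingerprint alongside each layer is one way to make an audit cheap: a mismatch after a re-run means either bronze was mutated or a transform stopped being deterministic.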

How the lessons build

  1. Part I (01–03) — the substrate: the data units, the ETL skeleton, and ingestion.
  2. Part II (04–09) — one pipeline stage per lesson, in flow order: storage → transformation → dedup → tokenization/packing → quality → orchestration.
  3. Part III (10–11) — the RL online dataplane, then the cost/throughput/monitoring model for the whole system.

Start with lesson 01: the data itself, before any pipe touches it.