The other half of post-training: the ETL pipelines that turn raw text, annotations, and rollouts into training-ready batches — built from first principles, the same way the RL series builds the trainer.
A post-training run is only as good as the data you feed it, and that data does not arrive ready to use. It arrives as messy JSONL dumps, human-annotation exports, scraped corpora, synthetic generations, and — in RL — as a firehose of rollouts the model produces about itself during training. Turning all of that into clean, deduplicated, tokenized, packed batches is a distinct engineering discipline with its own tools (Spark, Ray, Daft, Parquet, orchestrators) and its own failure modes. This series builds that discipline in order: Part I (01–03) defines the substrate — what post-training data is, the ETL skeleton, and how it gets in. Part II (04–09) is the pipeline itself, one stage per lesson — storage formats, distributed transformation, dedup, tokenization & packing, quality gates, orchestration. Part III (10–11) crosses from batch ETL into the online dataplane that RL needs, and ends with the cost, throughput, and monitoring model for the whole thing.
Who this is for
You can read Python and you've trained or fine-tuned a model, but the data plumbing around it — Spark jobs, Parquet layout, dedup at scale, orchestration DAGs — is a black box. By the end you'll be able to design the pipeline for an SFT, preference, or RL dataset and reason about its throughput and cost. Pairs with the RL data-curation lesson, which covers which data to keep; this series covers how to build the pipeline that processes it.
New here? Start with orientation
Read 00 · Orientation first — a 3-minute map of the three data regimes (SFT, preference, RL), why data engineering is its own role, and how the eleven lessons fit together.
The pipeline you're learning
Every post-training dataset flows through the same shape: raw sources land in a bronze layer, get cleaned and curated into silver, and are tokenized and packed into gold batches the trainer consumes. RL adds a feedback loop (the dashed loop in the diagram): the trainer generates its own data every step.
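To make the bronze, silver, gold shape concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the pipeline the lessons build: the function names `to_silver` and `to_gold`, the `text` field, the whitespace tokenizer, and the greedy packing rule are all hypothetical stand-ins chosen for illustration. A real pipeline would use a trained tokenizer and a distributed engine.

```python
import hashlib
import json


def to_silver(bronze_lines):
    """Bronze -> silver: parse raw JSONL, drop bad rows, exact-dedup on normalized text."""
    seen = set()
    silver = []
    for line in bronze_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed rows never leave the bronze layer
        if not isinstance(record, dict):
            continue
        # Assumption: each record carries its content in a "text" field.
        text = " ".join(str(record.get("text", "")).split())
        if not text:
            continue
        # Exact dedup on a case-insensitive hash of the whitespace-normalized text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        silver.append({"text": text})
    return silver


def to_gold(silver, max_len=8):
    """Silver -> gold: toy whitespace tokenization, then greedy packing up to max_len tokens."""
    vocab = {}
    batches, current = [], []
    for record in silver:
        # Toy tokenizer: assign each new whitespace-separated token the next integer id.
        ids = [vocab.setdefault(tok, len(vocab)) for tok in record["text"].split()]
        ids = ids[:max_len]  # truncate examples that cannot fit in one sequence
        if len(current) + len(ids) > max_len:
            batches.append(current)  # current sequence is full, start a new one
            current = []
        current.extend(ids)
    if current:
        batches.append(current)
    return batches


bronze = [
    '{"text": "Hello  world"}',
    '{"text": "hello world"}',   # duplicate after normalization
    'not json',                  # malformed row
    '{"text": ""}',              # empty row
    '{"text": "a b c d e f g"}',
]
silver = to_silver(bronze)
gold = to_gold(silver, max_len=8)
```

Each function maps one layer to the next, so either stage can be rerun without repeating the stage before it. That separation is the main design point of the layered shape.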
Part I · The substrate (01–03 · what the data is and how it gets in)
Read linearly. Each lesson assumes the previous one. By lesson 07 we pack the tokens that lesson 04 told us how to store; by lesson 10 the batch pipeline becomes an online one.
Touch every knob. Each interactive widget has a setting that blows up your cost or starves the trainer. Find it — that's the lesson.
Map it to a tool. Every stage names the real tools (Spark, Ray, Daft, Parquet, Dagster) so you can go from the concept to the library you'd actually reach for.