Data Engineering for Post-Training

The other half of post-training: the ETL pipelines that turn raw text, annotations, and rollouts into training-ready batches — built from first principles, the same way the RL series builds the trainer.

A post-training run is only as good as the data you feed it, and that data does not arrive ready to use. It arrives as messy JSONL dumps, human-annotation exports, scraped corpora, synthetic generations, and, in RL, as a firehose of rollouts the model produces about itself during training. Turning all of that into clean, deduplicated, tokenized, packed batches is a distinct engineering discipline with its own tools (Spark, Ray, Daft, Parquet, orchestrators) and its own failure modes.

This series builds that discipline in order. Part I (01–03) defines the substrate: what post-training data is, the ETL skeleton, and how data gets in. Part II (04–09) is the pipeline itself, one stage per lesson: storage formats, distributed transformation, dedup, tokenization and packing, quality gates, orchestration. Part III (10–11) crosses from batch ETL into the online dataplane that RL needs, and ends with the cost, throughput, and monitoring model for the whole pipeline.

Who this is for

You can read Python and you've trained or fine-tuned a model, but the data plumbing around it — Spark jobs, Parquet layout, dedup at scale, orchestration DAGs — is a black box. By the end you'll be able to design the pipeline for an SFT, preference, or RL dataset and reason about its throughput and cost. Pairs with the RL data-curation lesson, which covers which data to keep; this series covers how to build the pipeline that processes it.

New here? Start with orientation

Read 00 · Orientation first — a 3-minute map of the three data regimes (SFT, preference, RL), why data engineering is its own role, and how the eleven lessons fit together.

The pipeline you're learning

Every post-training dataset follows the same shape: raw sources land in a bronze layer, get cleaned and curated into silver, and are tokenized and packed into gold batches the trainer consumes. RL adds a feedback loop, drawn dashed in the pipeline diagram: the trainer generates its own data every step.

Part I · The substrate (01–03 · what the data is and how it gets in)

The data, by regime

The unit of data for each post-training regime: SFT = (prompt, response); preference = (prompt, chosen, rejected); RLVR (RL with verifiable rewards) = (prompt, verifier/answer). The shared schema, and why "the dataset is the product."
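
As a rough illustration of the three record shapes, here is a minimal Python sketch. The class names, the `source` field, and the `to_row` helper are assumptions made for this example, not a schema the series prescribes.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SFTRecord:
    prompt: str
    response: str


@dataclass(frozen=True)
class PreferenceRecord:
    prompt: str
    chosen: str
    rejected: str


@dataclass(frozen=True)
class RLVRRecord:
    prompt: str
    answer: str  # reference answer that a verifier checks against


def to_row(record, source: str) -> dict:
    """Flatten any regime into one row shape: regime tag, source, payload fields."""
    return {"regime": type(record).__name__, "source": source, **asdict(record)}


# Hypothetical example row from a human-annotation export.
row = to_row(PreferenceRecord("2+2?", "4", "5"), source="annotation-export")
```

Keeping a regime tag and a source field on every row is one way to let a single table hold all three regimes while still recording where each record came from.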

The ETL / ELT skeleton

Extract → transform → load. The medallion layout (bronze/silver/gold = raw/curated/training-ready). Batch vs streaming. Idempotency and determinism — the two properties that make a pipeline reproducible.
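
To make the two properties concrete, here is a small sketch under simplifying assumptions: the "store" is an in-memory dict standing in for a silver table, and `record_id` and `load` are hypothetical helper names. The ID is a deterministic content hash, and the load is an upsert keyed by that ID, so retrying the same job changes nothing.

```python
import hashlib
import json


def record_id(record: dict) -> str:
    """Deterministic ID: SHA-256 of the canonical JSON form (sorted keys)."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load(store: dict, records: list[dict]) -> dict:
    """Idempotent load: upsert keyed by content hash, so re-runs add nothing."""
    for rec in records:
        store[record_id(rec)] = rec
    return store


# Two rows with the same content but different key order.
batch = [
    {"prompt": "hi", "response": "hello"},
    {"response": "hello", "prompt": "hi"},
]
silver = load({}, batch)
silver_again = load(dict(silver), batch)  # simulate a retry of the same job
```

Both rows hash to the same ID, so the store holds one record, and the retry leaves it unchanged.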

Ingestion & provenance

Where data comes from (human annotation, synthetic generation, logs, public datasets, scrape) and how it lands. Incremental / CDC ingestion, schema-on-read, and capturing license + provenance at the door — before it's too late.

Part II · The pipeline, one stage per lesson (04–09 · bronze → gold)

Storage & file formats

JSONL vs Parquet vs Arrow; row vs columnar; compression, partitioning, sharding; predicate and projection (column) pushdown. Why the format you pick decides your scan cost. Live format / scan-cost comparator.

Transformation at scale — Spark, Ray, Daft

The distributed compute model: map / filter / shuffle over partitions. Why the shuffle is the cost center. Spark vs Ray Data vs Daft vs single-node, and how to choose. Live partition → worker throughput simulator.

Dedup & decontamination

Normalization, PII handling, exact + near-duplicate detection via MinHash / LSH, and n-gram decontamination against eval sets. Why duplicates and test leakage quietly wreck a run. Live LSH bucketing demo.
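
The MinHash / LSH idea can be sketched in pure Python. This is a toy version for intuition only: the signature length, band count, 3-word shingles, and function names are all assumptions chosen for the example, and a production pipeline would normally use a dedicated library and tuned parameters.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # signature length (assumed for this sketch)
BANDS = 8        # 8 bands of 4 rows each (assumed)


def shingles(text: str, n: int = 3) -> set[str]:
    """Lowercased word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(text: str) -> list[int]:
    """One minimum per seeded hash function over the shingle set."""
    grams = shingles(text)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big"
            )
            for g in grams
        )
        for seed in range(NUM_HASHES)
    ]


def lsh_buckets(docs: dict[str, str]) -> dict[tuple, list[str]]:
    """Band each signature; documents sharing any band become candidate pairs."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",  # differs only in case
    "c": "parquet files store columns together on disk",
}
candidates = lsh_buckets(docs)
```

Only documents that land in a shared bucket are compared further, which is what keeps near-duplicate detection from being a full pairwise scan.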

Tokenization & packing

Tokenize as a pipeline stage. Sequence packing to fill the context window; loss masking for prompt and tool tokens; the padding waste you pay without packing. The throughput math. Live packing-efficiency simulator.
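
The padding-waste arithmetic can be shown with a small sketch. It assumes the unpacked baseline pads every sequence to the full context length, and it uses a simple first-fit-decreasing heuristic; the function names and the example lengths are illustrative, not the packing algorithm the lesson itself uses.

```python
def pack(lengths: list[int], max_len: int) -> list[list[int]]:
    """First-fit-decreasing: put each sequence in the first row with room."""
    rows: list[list[int]] = []
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of {n} tokens exceeds context {max_len}")
        for row in rows:
            if sum(row) + n <= max_len:
                row.append(n)
                break
        else:
            rows.append([n])  # no existing row had room
    return rows


def padding_fraction(num_rows: int, max_len: int, real_tokens: int) -> float:
    """Share of token slots in the batch that hold padding."""
    return 1 - real_tokens / (num_rows * max_len)


lengths = [700, 300, 512, 512, 100, 900]  # token counts per example (made up)
max_len = 1024

unpacked_waste = padding_fraction(len(lengths), max_len, sum(lengths))
packed = pack(lengths, max_len)
packed_waste = padding_fraction(len(packed), max_len, sum(lengths))
```

With these numbers, six padded rows become three packed rows, and the padding share falls from roughly 51% to under 2%.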

Quality & validation

Schema contracts, expectation checks, and gates that fail the build. Quality scoring and filtering (length, language, toxicity, reward-model score). The quality-vs-quantity trade, and where to put each filter. Live quality funnel.
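
A gate that fails the build can be as small as the sketch below. The contract (`REQUIRED`), the length limit, and the 1% threshold are hypothetical values picked for the example; real pipelines often express the same checks in a validation framework.

```python
REQUIRED = {"prompt": str, "response": str}  # hypothetical schema contract


def violations(row: dict, max_chars: int = 8000) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    for field, typ in REQUIRED.items():
        if not isinstance(row.get(field), typ):
            problems.append(f"{field}: missing or not {typ.__name__}")
    response = row.get("response")
    if isinstance(response, str):
        if not response.strip():
            problems.append("response: empty")
        elif len(response) > max_chars:
            problems.append("response: too long")
    return problems


def gate(rows: list[dict], max_bad_fraction: float = 0.01) -> list[dict]:
    """Drop bad rows, but fail the whole build if too many rows are bad."""
    good = [r for r in rows if not violations(r)]
    bad_fraction = (1 - len(good) / len(rows)) if rows else 0.0
    if bad_fraction > max_bad_fraction:
        raise ValueError(f"quality gate failed: {bad_fraction:.1%} of rows invalid")
    return good


rows = [{"prompt": "2+2?", "response": "4"}, {"prompt": "2+2?", "response": "  "}]
kept = gate(rows, max_bad_fraction=0.5)  # tolerant threshold: filters, does not fail
```

The split between filtering individual rows and failing the whole batch is the design choice: a few bad rows are noise, while a high bad fraction usually signals an upstream break.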

Orchestration & versioning

DAG orchestrators (Airflow, Dagster, Prefect, Flyte): dependencies, retries, backfills, idempotency. Data versioning and lineage so a dataset is reproducible. Materialization and caching. Live DAG task-state walk-through.
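
The core of what an orchestrator does, running tasks in dependency order and skipping work that is already materialized, can be sketched with the standard library. The task names and the `run` helper are hypothetical; real orchestrators add scheduling, retries, and persistence on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
DAG = {
    "ingest": set(),
    "dedup": {"ingest"},
    "tokenize": {"dedup"},
    "quality_gate": {"dedup"},
    "pack": {"tokenize", "quality_gate"},
}


def run(dag: dict, tasks: dict, done: set) -> list[str]:
    """Run tasks in dependency order, skipping ones already materialized."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        if name in done:
            continue  # idempotent re-run: finished work is not repeated
        tasks[name]()
        done.add(name)
        executed.append(name)
    return executed


log: list[str] = []
tasks = {name: (lambda n=name: log.append(n)) for name in DAG}

done = {"ingest", "dedup"}        # pretend an earlier run got this far
executed = run(DAG, tasks, done)  # resumes from the first unfinished tasks
```

A second call to `run` with the same `done` set executes nothing, which is the behavior that makes retries and backfills safe.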

Part III · From batch to online (10–11 · the RL dataplane and the cost model)

The RL online dataplane

In RL the pipeline runs inside the training loop: rollout → verify → buffer → train, every step. Replay buffers, on-policy freshness, streaming trajectory data. Why batch ETL assumptions break and what replaces them. Live freshness / staleness simulator.
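
One way to picture the freshness constraint is a buffer that tags each rollout with the policy step that produced it and refuses to serve anything too old. The class, its fields, and the staleness rule below are assumptions for this sketch, not the design the lesson commits to.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_step: int  # trainer step of the policy that generated this rollout


class FreshnessBuffer:
    """Bounded buffer that evicts rollouts older than max_staleness steps."""

    def __init__(self, capacity: int, max_staleness: int) -> None:
        self.items: deque = deque(maxlen=capacity)
        self.max_staleness = max_staleness

    def add(self, rollout: Rollout) -> None:
        self.items.append(rollout)

    def sample_batch(self, current_step: int, batch_size: int) -> list:
        """Evict stale rollouts, then return the newest batch_size of the rest."""
        fresh = [
            r for r in self.items
            if current_step - r.policy_step <= self.max_staleness
        ]
        self.items = deque(fresh, maxlen=self.items.maxlen)
        return fresh[-batch_size:]


buf = FreshnessBuffer(capacity=100, max_staleness=2)
for step in range(6):
    buf.add(Rollout(prompt="2+2?", completion="4", reward=1.0, policy_step=step))

batch = buf.sample_batch(current_step=5, batch_size=2)
```

Setting `max_staleness` to 0 makes the buffer strictly on-policy; larger values trade freshness for rollout reuse.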

Scaling, cost & monitoring (capstone)

The throughput and cost model of the whole pipeline: records/sec, $/M tokens, where the bottleneck sits (I/O vs compute vs shuffle). Monitoring volume, freshness, and quality drift. The end-to-end reference architecture. Live pipeline sizer.
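
A back-of-the-envelope version of that model fits in a few lines. It assumes the stages run as a pipeline so the slowest stage sets the rate, and that throughput scales linearly with worker count; both are simplifications, and every number below is made up for illustration.

```python
def pipeline_cost(
    total_tokens: float,
    stage_tokens_per_sec: dict[str, float],  # per-worker rate of each stage
    workers: int,
    dollars_per_worker_hour: float,
) -> dict:
    """Estimate wall-clock hours and $/M tokens from the bottleneck stage."""
    bottleneck = min(stage_tokens_per_sec, key=stage_tokens_per_sec.get)
    throughput = stage_tokens_per_sec[bottleneck] * workers  # tokens/sec overall
    hours = total_tokens / throughput / 3600
    cost = hours * workers * dollars_per_worker_hour
    return {
        "bottleneck": bottleneck,
        "hours": hours,
        "usd_per_m_tokens": cost / (total_tokens / 1e6),
    }


estimate = pipeline_cost(
    total_tokens=1e9,
    stage_tokens_per_sec={"read": 200_000, "dedup": 50_000, "tokenize": 100_000},
    workers=10,
    dollars_per_worker_hour=2.0,
)
```

Under these assumptions the $/M-token figure depends only on the bottleneck rate and the worker price, so adding workers shortens the run without changing the cost per token.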

How to use this

Linearly. Each lesson assumes the previous one. By lesson 07 we are packing the tokens that lesson 04 showed us how to store; by lesson 10 the batch pipeline becomes an online one.
Touch every knob. Each interactive widget has a setting that blows up your cost or starves the trainer. Find it — that's the lesson.
Map it to a tool. Every stage names the real tools (Spark, Ray, Daft, Parquet, Dagster) so you can go from the concept to the library you'd actually reach for.