The ETL / ELT skeleton

Before a single byte reaches a trainer, it flows through a pipeline with a fixed shape: extract → transform → load. This lesson formalizes that skeleton, introduces the medallion layout that organizes it, and establishes the two correctness properties — idempotency and determinism — that make it trustworthy months later.

Where we are

Lesson 01 named the three data units: SFT pairs, preference triples, RL prompts. Now we formalize the pipeline those units flow through. Lesson 03 picks up immediately after: it covers ingestion — how raw data actually lands in the bronze layer introduced here.

ETL vs ELT: the trade

ETL (Extract → Transform → Load) is the classical pattern: data is extracted from a source, scrubbed and shaped in a separate compute engine, then loaded into a target store. The transformation happens before storage, so only clean data ever lands.

ELT (Extract → Load → Transform) flips the order: raw data lands first, then transformations are applied in-place inside the storage layer. This is the pattern that won in modern data infrastructure — and it is the right default for post-training pipelines — for three concrete reasons:

Object storage is cheap. Keeping an immutable raw copy in S3 / GCS costs almost nothing. If you discover a bug in your cleaning logic six months later, you can replay from the original bytes instead of re-contacting data providers.
Columnar formats are queryable in place. Engines such as DuckDB or Spark can run complex SQL directly on stored Parquet files, so the "load first" step is not a liability: the storage itself is queryable.
Post-training pipelines are heavily iterative. Quality criteria, dedup thresholds, and tokenizer versions all change. Landing raw first means you re-run only the transform, not the extract.
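The replay argument above can be made concrete with a small sketch. It assumes a local directory standing in for object storage; the file layout, the function names (land_raw, transform_v1, transform_v2), and the sample records are all hypothetical. The point it illustrates is that a cleaning bug is fixed by re-running the transform over the raw file that was already landed, without a second extract.

```python
import json
import tempfile
from pathlib import Path


def land_raw(bronze_dir: Path, batch_name: str, raw_lines: list[str]) -> Path:
    """Load step: persist the source lines untouched (hypothetical layout)."""
    bronze_dir.mkdir(parents=True, exist_ok=True)
    path = bronze_dir / f"{batch_name}.jsonl"
    if path.exists():
        raise FileExistsError(f"bronze is immutable, refusing to overwrite {path}")
    path.write_text("\n".join(raw_lines) + "\n", encoding="utf-8")
    return path


def transform_v1(raw_path: Path) -> list[dict]:
    """First cleaning attempt. Buggy: it drops every record with a short response."""
    rows = [json.loads(line) for line in raw_path.read_text(encoding="utf-8").splitlines()]
    return [r for r in rows if len(r["response"]) > 5]


def transform_v2(raw_path: Path) -> list[dict]:
    """Fixed cleaning logic: only drop records whose response is empty."""
    rows = [json.loads(line) for line in raw_path.read_text(encoding="utf-8").splitlines()]
    return [r for r in rows if r["response"].strip()]


with tempfile.TemporaryDirectory() as tmp:
    raw = land_raw(
        Path(tmp) / "bronze",
        "batch_20240901",
        [
            json.dumps({"prompt": "2+2?", "response": "4"}),
            json.dumps({"prompt": "Capital of France?", "response": "Paris is the capital."}),
            json.dumps({"prompt": "Say nothing", "response": ""}),
        ],
    )
    silver_v1 = transform_v1(raw)  # loses the valid short answer "4"
    silver_v2 = transform_v2(raw)  # replayed from the same raw file, no re-extract
```

Under ETL, the short answer dropped by the first version would never have been stored, and recovering it would mean going back to the source.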

When ETL still wins

ETL is appropriate when you have strict regulatory constraints on what raw data may be stored (PII, HIPAA) or when the source is ephemeral (a live API with no replay). In those cases, strip the sensitive fields in the extract step before anything is persisted. For most post-training data — annotation exports, synthetic generations, public corpora — ELT is the right default.
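As a minimal sketch of stripping in the extract step, the snippet below removes sensitive fields while records are still in memory, so the load step never sees them. The field names in SENSITIVE_FIELDS, the sample records, and the persist stand-in are assumptions made for illustration, not a compliance recipe.

```python
import json

# Hypothetical set of field names that must never be persisted.
SENSITIVE_FIELDS = {"email", "phone", "patient_id"}


def extract_and_strip(source_records):
    """Yield records with sensitive top-level fields removed before anything is stored."""
    for record in source_records:
        yield {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def persist(records) -> list[str]:
    """Stand-in for the load step: returns the lines that would be written to storage."""
    return [json.dumps(r, sort_keys=True) for r in records]


live_api_batch = [
    {"prompt": "Summarize the visit", "response": "Routine check-up.", "patient_id": "p-17"},
    {"prompt": "Draft a reply", "response": "Thanks!", "email": "a@example.com"},
]
stored = persist(extract_and_strip(live_api_batch))
```

This only drops named top-level fields. Sensitive values embedded in free text (a name inside a prompt, for instance) need a separate redaction pass, which this sketch does not attempt.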

The medallion architecture

The canonical ELT layout for post-training data has three named layers. Bronze is the immutable record of what was ingested; silver and gold are each a deterministic function of the layer before them, so you can always reconstruct a downstream layer by re-running the stage that produced it.

Layer	Bronze	Silver	Gold
Contents	Exactly as ingested: raw JSONL, annotation exports, scrape dumps, rollout logs. No mutations.	Cleaned, deduplicated, decontaminated records with a conformed, versioned schema.	Tokenized, sequence-packed tensors ready to feed the training loop.
Mutability	Immutable, append-only. Old files are never overwritten.	Overwritten on each pipeline run (idempotent rewrite, not append).	Overwritten on each run. May be ephemeral (re-generated on demand).
Schema	Schema-on-read; whatever the source emitted. May vary per batch.	Strict, enforced schema. Type coercions and renames happen here.	Model- and context-length specific. Changes when the tokenizer or packing config changes.
Who writes and reads it	The ingestion stage (lesson 03) writes it; the silver transform reads it.	Gold packaging reads it; ad-hoc analysts query it for data profiling.	The trainer exclusively. Not for human inspection.
Mapped to	Raw	Curated	Training-ready
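To illustrate the bronze-to-silver schema boundary in the table, here is a hedged sketch of conforming schema-on-read records to a strict schema. The SilverSFT dataclass, the RENAMES mapping, and the source field names are hypothetical; a real pipeline would more likely enforce the schema in the storage format or a validation library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SilverSFT:
    """Hypothetical conformed schema for one SFT pair in silver."""
    id: str
    prompt: str
    response: str
    schema_version: int = 1


# Hypothetical source-specific field names mapped onto the silver names.
RENAMES = {
    "instruction": "prompt",
    "question": "prompt",
    "output": "response",
    "answer": "response",
}


def conform(raw: dict) -> SilverSFT:
    """Rename and coerce one bronze record; raise if a required field is missing."""
    renamed = {RENAMES.get(k, k): v for k, v in raw.items()}
    missing = {"id", "prompt", "response"} - renamed.keys()
    if missing:
        raise ValueError(f"record is missing required fields: {sorted(missing)}")
    return SilverSFT(
        id=str(renamed["id"]),  # ids may arrive as ints or strings
        prompt=str(renamed["prompt"]).strip(),
        response=str(renamed["response"]).strip(),
    )


bronze_batch = [
    {"id": 7, "instruction": "Translate 'chat'", "output": " cat "},
    {"id": "8", "question": "2+2?", "answer": "4"},
]
silver_batch = [asdict(conform(r)) for r in bronze_batch]
```

Two batches with different field names and id types come out with one shape, which is what lets downstream stages treat silver as a single table.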

Batch vs streaming vs micro-batch

Most static post-training pipelines (SFT, preference) are batch: the full dataset is processed once, or on a nightly/weekly schedule. Batch is simpler: the DAG has a clear start and end, failures are easy to retry, and the state space is small.

Micro-batch (e.g., Spark Structured Streaming with a trigger interval) is appropriate when new data arrives continuously — say, a human-annotation platform delivers judgments as they're completed — but you don't need sub-second latency. A micro-batch job wakes every few minutes, processes the new arrivals, and appends to bronze.
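The wake-process-append cycle can be sketched without any streaming framework. The snippet below assumes a hypothetical inbox directory where the annotation platform drops files, and keeps the set of already-processed file names in memory; a real job would persist that checkpoint and be invoked by a scheduler or trigger rather than called by hand.

```python
import json
import tempfile
from pathlib import Path


def run_micro_batch(inbox: Path, bronze: Path, seen: set[str]) -> int:
    """Land files that arrived since the last wake-up; return how many were landed."""
    bronze.mkdir(parents=True, exist_ok=True)
    landed = 0
    for src in sorted(inbox.glob("*.jsonl")):
        if src.name in seen:
            continue  # already landed on an earlier trigger
        # Append-only: each arrival becomes a new bronze file, nothing is rewritten.
        (bronze / src.name).write_bytes(src.read_bytes())
        seen.add(src.name)
        landed += 1
    return landed


with tempfile.TemporaryDirectory() as tmp:
    inbox, bronze = Path(tmp) / "inbox", Path(tmp) / "bronze"
    inbox.mkdir()
    seen: set[str] = set()

    (inbox / "judgments_0001.jsonl").write_text(json.dumps({"id": 1, "label": "A"}) + "\n")
    first = run_micro_batch(inbox, bronze, seen)   # one new arrival

    (inbox / "judgments_0002.jsonl").write_text(json.dumps({"id": 2, "label": "B"}) + "\n")
    second = run_micro_batch(inbox, bronze, seen)  # only the newer file is processed
    third = run_micro_batch(inbox, bronze, seen)   # nothing new, nothing to do
    bronze_files = sorted(p.name for p in bronze.iterdir())
```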

True streaming (Kafka → Flink or Ray Data streaming) applies when data must be processed with low latency and the pipeline must never "stop." This is the RL case: the policy generates rollouts continuously, and they must be verified, buffered, and fed back to the trainer before the policy weights drift too far. The RL online dataplane is the subject of lesson 10, and it breaks several assumptions made here for batch pipelines.

Rule of thumb

Static dataset for SFT or DPO? Use batch. Continuous annotation feed? Use micro-batch. RL training loop generating its own data? That's the streaming case in lesson 10. Pick the simplest model that meets your freshness requirement.

Idempotency and determinism

A pipeline is trustworthy if you can re-run it (for a backfill, to reproduce a bug, to audit a dataset six months later) and get exactly the same answer from the same inputs and code. Two properties together make that possible.

Idempotency

A stage is idempotent if running it multiple times produces the same result as running it once. The classic violation is a stage that appends on each run: run it twice and you have duplicates; run it ten times and the dataset is ten times the intended size.

  NON-IDEMPOTENT (append-on-rerun):

  run 1:  bronze/raw/batch_20240901.jsonl ─append→ silver/clean.jsonl   # 1 M rows
  run 2:  bronze/raw/batch_20240901.jsonl ─append→ silver/clean.jsonl   # 2 M rows ← BUG

  IDEMPOTENT FIX (overwrite / key-upsert):

  run 1:  bronze/raw/batch_20240901.jsonl ─overwrite→ silver/clean/dt=20240901/  # 1 M rows
  run 2:  bronze/raw/batch_20240901.jsonl ─overwrite→ silver/clean/dt=20240901/  # 1 M rows ✓

  Content-addressed alternative: hash(record) → key; upsert on key.
  Re-running with the same input produces the same keys → no new rows inserted.

The idempotency patterns are: (a) partition-scoped overwrite — write results to a date-partitioned path and overwrite the entire partition; (b) content-addressed upsert — assign a deterministic key from the record content, then upsert rather than append; (c) atomic swap — write to a staging path, then rename/swap into the production path.
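The three patterns can be sketched with the standard library alone. The example below combines (a) and (c) into one function that builds a partition in a staging directory and swaps it into place, and shows (b) as a content-hash upsert into an in-memory dict standing in for a keyed store. The directory layout, file name, and function names are hypothetical.

```python
import hashlib
import json
import os
import shutil
import tempfile
from pathlib import Path


def overwrite_partition(silver_root: Path, dt: str, rows: list[dict]) -> Path:
    """Patterns (a) + (c): build the partition in staging, then swap it into place."""
    final = silver_root / f"dt={dt}"
    staging = silver_root / f".staging-dt={dt}"
    if staging.exists():
        shutil.rmtree(staging)  # leftover from a crashed run
    staging.mkdir(parents=True)
    with open(staging / "part-0000.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, sort_keys=True) + "\n")
    if final.exists():
        shutil.rmtree(final)  # drop the old partition as a whole
    # Note: delete-then-rename leaves a brief window with no partition at all,
    # so this is a sketch of the swap, not a fully atomic commit.
    os.rename(staging, final)
    return final


def record_key(row: dict) -> str:
    """Pattern (b): a deterministic key derived from the record content."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert(store: dict[str, dict], rows: list[dict]) -> None:
    """Upsert on the content key: same input, same keys, no new rows."""
    for row in rows:
        store[record_key(row)] = row


rows = [{"prompt": "hi", "response": "hello"}, {"prompt": "2+2?", "response": "4"}]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "silver" / "clean"
    for _ in range(3):  # three runs over the same bronze batch
        partition = overwrite_partition(root, "20240901", rows)
    line_count = len((partition / "part-0000.jsonl").read_text().splitlines())

store: dict[str, dict] = {}
for _ in range(3):
    upsert(store, rows)
```

After three runs the partition still holds two rows and the keyed store still holds two entries. Object stores and table formats provide their own commit mechanisms for the swap; the local rename here only stands in for them.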

Determinism

A stage is deterministic if the same inputs always produce byte-identical outputs. Idempotency does not guarantee this: a stage can be idempotent at the row level yet non-deterministic, for example when each overwrite leaves the same set of rows in a different shuffle order. Determinism requires pinning:

Library and rule versions: tokenizer, dedup library, cleaning rules. Pin exact versions in requirements.txt or pyproject.toml, ideally with a lock file.
Random seeds: any sampling or shuffling must be seeded with a fixed, recorded value.
Sort order: distributed jobs can return rows in arbitrary order; an explicit ORDER BY id (on a unique key) before writing makes output stable.
External state: avoid reading from a mutable source (a live database, a model checkpoint) mid-pipeline without snapshotting it first.
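A small sketch of the seed and sort-order points: the function below samples with a locally seeded generator, sorts by id before writing, and serializes with sorted keys, so two different arrival orders hash to the same bytes. The SEED value, the record shape, and the function name are hypothetical.

```python
import hashlib
import json
import random

SEED = 1234  # hypothetical value; record it alongside the dataset


def build_output(rows: list[dict], sample_size: int, seed: int = SEED) -> bytes:
    """Sample with a fixed seed, sort by id, and serialize canonically."""
    # Sort first so the sample does not depend on the arrival order of rows.
    ordered = sorted(rows, key=lambda r: r["id"])
    rng = random.Random(seed)  # local generator, no dependence on global state
    sample = rng.sample(ordered, sample_size)
    sample.sort(key=lambda r: r["id"])  # stable order on disk
    lines = [json.dumps(r, sort_keys=True, ensure_ascii=False) for r in sample]
    return ("\n".join(lines) + "\n").encode("utf-8")


rows = [{"id": i, "text": f"example {i}"} for i in range(100)]
arrival_order_a = rows[:]
arrival_order_b = list(reversed(rows))  # simulates a distributed job's row order

digest_a = hashlib.sha256(build_output(arrival_order_a, 10)).hexdigest()
digest_b = hashlib.sha256(build_output(arrival_order_b, 10)).hexdigest()
```

Comparing the two digests is a cheap determinism check that can run in CI. A fixed seed gives the same sample only within a given runtime; sampling routines are not guaranteed to stay identical across interpreter or library versions, which is one reason the version pin matters alongside the seed.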

Why you need both

Idempotency alone lets you re-run safely without duplicating data, but the bytes may still differ from run to run. Determinism alone guarantees that each run computes the same output, but if a second run appends instead of overwriting, the data still doubles. You need idempotency so re-runs don't grow the dataset and determinism so the bytes match. Together they make backfills and audits possible.

The pipeline is a DAG

Each stage depends on the outputs of earlier stages and produces outputs consumed by later stages. That structure is a directed acyclic graph (DAG). Cycles are ruled out by construction: gold depends on silver, so silver cannot also depend on gold. The DAG structure is what allows an orchestrator to:

Determine which stages can run in parallel (independent branches).
Skip a stage whose output is already fresher than its inputs (incremental computation).
Re-run only the affected downstream stages when an upstream stage changes.
Retry a failed stage without re-running its predecessors.
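The capabilities listed above follow directly from the dependency structure, which the sketch below demonstrates with Python's standard-library graphlib. The stage names and the split of silver into parallel dedup and decontaminate branches are hypothetical, chosen only to give the graph an independent pair of stages.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage names; each stage maps to the set of stages it depends on.
DEPENDS_ON = {
    "ingest": set(),
    "clean": {"ingest"},
    "dedup": {"clean"},
    "decontaminate": {"clean"},
    "merge_silver": {"dedup", "decontaminate"},
    "tokenize_pack": {"merge_silver"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """One valid execution order: every stage appears after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(deps: dict[str, set[str]], changed: str) -> set[str]:
    """Stages that must be re-run when `changed` changes (transitive dependents)."""
    affected: set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for stage, parents in deps.items():
            if current in parents and stage not in affected:
                affected.add(stage)
                frontier.append(stage)
    return affected


order = run_order(DEPENDS_ON)
rerun = downstream_of(DEPENDS_ON, "dedup")

# Stages whose dependencies are all finished can run in parallel.
sorter = TopologicalSorter(DEPENDS_ON)
sorter.prepare()
first_wave = set(sorter.get_ready())   # only "ingest" has no dependencies
sorter.done(*first_wave)
second_wave = set(sorter.get_ready())  # "clean" becomes ready next
sorter.done(*second_wave)
third_wave = set(sorter.get_ready())   # "dedup" and "decontaminate" are independent
```

If the mapping contained a cycle, TopologicalSorter would raise graphlib.CycleError instead of producing an order, which is how a declared dependency graph catches the impossible case early.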

The tooling for managing that DAG (Airflow, Dagster, Prefect, Flyte) is the subject of lesson 09. For now, the key mental model is: design each stage as a pure function (inputs → outputs, no hidden state, no side effects outside the declared output path), and the orchestrator can take care of ordering, retries, and re-runs.

Forward reference

Lesson 09 covers orchestration in depth: how to declare the DAG in code, how retries and backfills interact with idempotency, and how to version both the pipeline code and the data it produces so a dataset is fully reproducible months later.

Putting it together: the full pipeline as a DAG of pure stages

  SOURCES
    │  stage: ingest (lesson 03: connectors, schema-on-read, provenance capture)
    │  pattern: append new partitions only; never modify existing ones
    ▼
  BRONZE  ──  raw, immutable, append-only
    │  stage: transform (lessons 05–08)
    │  pattern: overwrite partition; keyed upsert; or atomic swap
    ▼
  SILVER  ──  clean, deduped, decontaminated, quality-gated
    │  stage: tokenize + pack (lesson 07)
    │  pattern: overwrite; version by tokenizer hash + packing config
    ▼
  GOLD  ──  tokenized, packed, model-ready
    │
    ▼
  TRAINER  (SFT · DPO · RL)

  Each arrow is a pure function: same inputs + pinned env → same bytes.
  The DAG orchestrator (lesson 09) runs them in dependency order.
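The "version by tokenizer hash + packing config" pattern in the diagram can be sketched as a derived version id. The function names, the gold path layout, and the config keys below are hypothetical; the only idea taken from the diagram is that the gold location changes whenever the tokenizer or the packing settings change.

```python
import hashlib
import json


def gold_version(tokenizer_bytes: bytes, packing_config: dict) -> str:
    """Derive a short version id from the tokenizer contents and packing config."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(tokenizer_bytes).digest())
    # Canonical JSON so that key order in the config does not change the id.
    h.update(json.dumps(packing_config, sort_keys=True).encode("utf-8"))
    return h.hexdigest()[:12]


def gold_path(dataset: str, version: str) -> str:
    """Hypothetical layout: one gold directory per (dataset, version)."""
    return f"gold/{dataset}/v={version}/"


tokenizer_bytes = b'{"vocab": {"hello": 0, "world": 1}}'  # stand-in for a tokenizer file
config_a = {"max_seq_len": 4096, "pack": True}
config_b = {"pack": True, "max_seq_len": 4096}   # same settings, different key order
config_c = {"max_seq_len": 8192, "pack": True}   # longer context

version_a = gold_version(tokenizer_bytes, config_a)
version_b = gold_version(tokenizer_bytes, config_b)
version_c = gold_version(tokenizer_bytes, config_c)
path_a = gold_path("sft_mix", version_a)
```

Because the id is a pure function of its inputs, a re-run with an unchanged tokenizer and config overwrites the same gold directory, while a changed context length lands in a new one and leaves the old tensors untouched.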

Takeaway — setting up lesson 03

The medallion skeleton gives every post-training pipeline the same three-layer shape: bronze is the source of truth, silver is the curated view, gold is what the trainer reads. The two correctness invariants — idempotency and determinism — are what let you trust a dataset you built six months ago, backfill a partition after fixing a bug, and reproduce a training run exactly. Lesson 03 (ingestion) picks up at the very first arrow: how raw data from annotation platforms, synthetic generators, public corpora, and RL rollouts actually lands in bronze — with provenance intact from the moment it arrives.