The ETL / ELT skeleton
Before a single byte reaches a trainer, it flows through a pipeline with a fixed shape: extract → transform → load. This lesson formalizes that skeleton, introduces the medallion layout that organizes it, and establishes the two correctness properties — idempotency and determinism — that make it trustworthy months later.
ETL vs ELT: the trade
ETL (Extract → Transform → Load) is the classical pattern: data is extracted from a source, scrubbed and shaped in a separate compute engine, then loaded into a target store. The transformation happens before storage, so only clean data ever lands.
ELT (Extract → Load → Transform) flips the order: raw data lands first, then transformations are applied in-place inside the storage layer. This is the pattern that won in modern data infrastructure — and it is the right default for post-training pipelines — for three concrete reasons:
- Object storage is cheap. Keeping an immutable raw copy in S3 / GCS costs almost nothing. If you discover a bug in your cleaning logic six months later, you can replay from the original bytes instead of re-contacting data providers.
- Columnar formats are powerful in-place. Parquet + DuckDB or Spark can run complex SQL directly on the stored files. The "load first" step is not a liability — the storage is queryable.
- Post-training pipelines are heavily iterative. Quality criteria, dedup thresholds, and tokenizer versions all change. Landing raw first means you re-run only the transform, not the extract.
The medallion architecture
The canonical ELT layout for post-training data has three named layers. Each is a deterministic function of the previous one — you can always reconstruct a downstream layer by re-running the stage that produced it.
| Bronze | Silver | Gold | |
|---|---|---|---|
| Contents | Exactly as ingested: raw JSONL, annotation exports, scrape dumps, rollout logs. No mutations. | Cleaned, deduplicated, decontaminated records with a conformed, versioned schema. | Tokenized, sequence-packed tensors ready to feed the training loop. |
| Mutability | Immutable, append-only. Old files are never overwritten. | Overwritten on each pipeline run (idempotent rewrite, not append). | Overwritten on each run. May be ephemeral (re-generated on demand). |
| Schema | Schema-on-read; whatever the source emitted. May vary per batch. | Strict, enforced schema. Type coercions and renames happen here. | Model- and context-length specific. Changes when the tokenizer or packing config changes. |
| Who reads it | The ingestion stage (lesson 03) writes it; the silver transform reads it. | Gold packaging reads it; ad-hoc analysts query it for data profiling. | The trainer exclusively. Not for human inspection. |
| Mapped to | Raw | Curated | Training-ready |
Batch vs streaming vs micro-batch
Most static post-training pipelines (SFT, preference) are batch: the full dataset is processed once, or on a nightly/weekly schedule. Batch is simpler: the DAG has a clear start and end, failures are easy to retry, and the state space is small.
Micro-batch (e.g., Spark Structured Streaming with a trigger interval) is appropriate when new data arrives continuously — say, a human-annotation platform delivers judgments as they're completed — but you don't need sub-second latency. A micro-batch job wakes every few minutes, processes the new arrivals, and appends to bronze.
True streaming (Kafka → Flink or Ray Data streaming) applies when data must be processed with low latency and the pipeline must never "stop." This is the RL case: the policy generates rollouts continuously; they must be verified, buffered, and fed back to the trainer before the policy weights drift too far. Forward-reference: the RL online dataplane is lesson 10, and it breaks several assumptions we make here for batch pipelines.
Idempotency and determinism
A pipeline is correct if you can re-run it — for a backfill, to reproduce a bug, to audit a dataset six months later — and get exactly the right answer. Two properties together make that possible.
Idempotency
A stage is idempotent if running it multiple times produces the same result as running it once. The opposite of idempotent is a stage that appends on each run: re-run it twice and you have duplicates; re-run it ten times and the dataset is ten times the intended size.
NON-IDEMPOTENT (append-on-rerun): run 1: bronze/raw/batch_20240901.jsonl ─append→ silver/clean.jsonl # 1 M rows run 2: bronze/raw/batch_20240901.jsonl ─append→ silver/clean.jsonl # 2 M rows ← BUG IDEMPOTENT FIX (overwrite / key-upsert): run 1: bronze/raw/batch_20240901.jsonl ─overwrite→ silver/clean/dt=20240901/ # 1 M rows run 2: bronze/raw/batch_20240901.jsonl ─overwrite→ silver/clean/dt=20240901/ # 1 M rows ✓ Content-addressed alternative: hash(record) → key; upsert on key. Re-running with the same input produces the same keys → no new rows inserted.
The idempotency patterns are: (a) partition-scoped overwrite — write results to a date-partitioned path and overwrite the entire partition; (b) content-addressed upsert — assign a deterministic key from the record content, then upsert rather than append; (c) atomic swap — write to a staging path, then rename/swap into the production path.
Determinism
A stage is deterministic if the same inputs always produce byte-identical outputs. Idempotency is a weaker property: you can have an idempotent stage that is not deterministic (each overwrite produces a different shuffle order). Determinism requires pinning:
- Library versions — tokenizer, dedup library, cleaning rules. Lock in requirements.txt / pyproject.toml.
- Random seeds — any sampling or shuffling must be seeded with a fixed, recorded value.
- Sort order — distributed jobs can return rows in arbitrary order; explicit
ORDER BY idbefore writing makes output stable. - External state — avoid reading from a mutable source (a live database, a model checkpoint) mid-pipeline without snapshotting it first.
The pipeline is a DAG
Each stage depends on the outputs of earlier stages and produces outputs consumed by later stages. That structure is a directed acyclic graph (DAG). Cycles are impossible by definition: gold cannot depend on silver if silver depends on gold. The DAG structure is what allows an orchestrator to:
- Determine which stages can run in parallel (independent branches).
- Skip a stage whose output is already fresher than its inputs (incremental computation).
- Re-run only the affected downstream stages when an upstream stage changes.
- Retry a failed stage without re-running its predecessors.
The tooling for managing that DAG — Airflow, Dagster, Prefect, Flyte — is the subject of lesson 09. For now, the key mental model is: design each stage as a pure function (inputs → outputs, no hidden state, no side effects outside the declared output path), and the DAG takes care of the rest.
Putting it together: the full pipeline as a DAG of pure stages
SOURCES
│ (lesson 03: connectors, schema-on-read, provenance capture)
▼
BRONZE ── raw, immutable, append-only
│ stage: ingest
│ pattern: append new partitions only; never modify existing ones
▼
SILVER ── clean, deduped, decontaminated, quality-gated
│ stage: transform (lessons 05–08)
│ pattern: overwrite partition; keyed upsert; or atomic swap
▼
GOLD ── tokenized, packed, model-ready
│ stage: tokenize + pack (lesson 07)
│ pattern: overwrite; version by tokenizer hash + packing config
▼
TRAINER (SFT · DPO · RL)
Each arrow is a pure function: same inputs + pinned env → same bytes.
The DAG orchestrator (lesson 09) runs them in dependency order.