data_engineering / 02 · etl skeleton lesson 2 / 11

The ETL / ELT skeleton

Before a single byte reaches a trainer, it flows through a pipeline with a fixed shape: extract → transform → load. This lesson formalizes that skeleton, introduces the medallion layout that organizes it, and establishes the two correctness properties — idempotency and determinism — that make it trustworthy months later.

Where we are
Lesson 01 named the three data units: SFT pairs, preference triples, RL prompts. Now we formalize the pipeline those units flow through. Lesson 03 picks up immediately after: it covers ingestion — how raw data actually lands in the bronze layer introduced here.

ETL vs ELT: the trade

ETL (Extract → Transform → Load) is the classical pattern: data is extracted from a source, scrubbed and shaped in a separate compute engine, then loaded into a target store. The transformation happens before storage, so only clean data ever lands.

ELT (Extract → Load → Transform) flips the order: raw data lands first, then transformations are applied in-place inside the storage layer. This is the pattern that won in modern data infrastructure — and it is the right default for post-training pipelines — for three concrete reasons:

When ETL still wins
ETL is appropriate when you have strict regulatory constraints on what raw data may be stored (PII, HIPAA) or when the source is ephemeral (a live API with no replay). In those cases, strip the sensitive fields in the extract step before anything is persisted. For most post-training data — annotation exports, synthetic generations, public corpora — ELT is the right default.

The medallion architecture

The canonical ELT layout for post-training data has three named layers. Each is a deterministic function of the previous one — you can always reconstruct a downstream layer by re-running the stage that produced it.

SOURCES annotate · synth logs · scrape BRONZE raw · immutable append-only provenance SILVER clean · dedup decon · quality conformed schema GOLD tokenized packed batches training-ready TRAINER SFT · DPO · RL ingest clean dedup tokenize pack batches lesson 03 lessons 06–08 lesson 07 (storage: 04) RL: rollouts re-enter as data every step (lesson 10)
BronzeSilverGold
ContentsExactly as ingested: raw JSONL, annotation exports, scrape dumps, rollout logs. No mutations.Cleaned, deduplicated, decontaminated records with a conformed, versioned schema.Tokenized, sequence-packed tensors ready to feed the training loop.
MutabilityImmutable, append-only. Old files are never overwritten.Overwritten on each pipeline run (idempotent rewrite, not append).Overwritten on each run. May be ephemeral (re-generated on demand).
SchemaSchema-on-read; whatever the source emitted. May vary per batch.Strict, enforced schema. Type coercions and renames happen here.Model- and context-length specific. Changes when the tokenizer or packing config changes.
Who reads itThe ingestion stage (lesson 03) writes it; the silver transform reads it.Gold packaging reads it; ad-hoc analysts query it for data profiling.The trainer exclusively. Not for human inspection.
Mapped toRawCuratedTraining-ready

Batch vs streaming vs micro-batch

Most static post-training pipelines (SFT, preference) are batch: the full dataset is processed once, or on a nightly/weekly schedule. Batch is simpler: the DAG has a clear start and end, failures are easy to retry, and the state space is small.

Micro-batch (e.g., Spark Structured Streaming with a trigger interval) is appropriate when new data arrives continuously — say, a human-annotation platform delivers judgments as they're completed — but you don't need sub-second latency. A micro-batch job wakes every few minutes, processes the new arrivals, and appends to bronze.

True streaming (Kafka → Flink or Ray Data streaming) applies when data must be processed with low latency and the pipeline must never "stop." This is the RL case: the policy generates rollouts continuously; they must be verified, buffered, and fed back to the trainer before the policy weights drift too far. Forward-reference: the RL online dataplane is lesson 10, and it breaks several assumptions we make here for batch pipelines.

Rule of thumb
Static dataset for SFT or DPO? Use batch. Continuous annotation feed? Use micro-batch. RL training loop generating its own data? That's the streaming case in lesson 10. Pick the simplest model that meets your freshness requirement.

Idempotency and determinism

A pipeline is correct if you can re-run it — for a backfill, to reproduce a bug, to audit a dataset six months later — and get exactly the right answer. Two properties together make that possible.

Idempotency

A stage is idempotent if running it multiple times produces the same result as running it once. The opposite of idempotent is a stage that appends on each run: re-run it twice and you have duplicates; re-run it ten times and the dataset is ten times the intended size.

  NON-IDEMPOTENT (append-on-rerun):

  run 1:  bronze/raw/batch_20240901.jsonl ─append→ silver/clean.jsonl   # 1 M rows
  run 2:  bronze/raw/batch_20240901.jsonl ─append→ silver/clean.jsonl   # 2 M rows ← BUG

  IDEMPOTENT FIX (overwrite / key-upsert):

  run 1:  bronze/raw/batch_20240901.jsonl ─overwrite→ silver/clean/dt=20240901/  # 1 M rows
  run 2:  bronze/raw/batch_20240901.jsonl ─overwrite→ silver/clean/dt=20240901/  # 1 M rows ✓

  Content-addressed alternative: hash(record) → key; upsert on key.
  Re-running with the same input produces the same keys → no new rows inserted.

The idempotency patterns are: (a) partition-scoped overwrite — write results to a date-partitioned path and overwrite the entire partition; (b) content-addressed upsert — assign a deterministic key from the record content, then upsert rather than append; (c) atomic swap — write to a staging path, then rename/swap into the production path.

Determinism

A stage is deterministic if the same inputs always produce byte-identical outputs. Idempotency is a weaker property: you can have an idempotent stage that is not deterministic (each overwrite produces a different shuffle order). Determinism requires pinning:

Why you need both
Idempotency alone lets you re-run safely without duplicating data. Determinism alone guarantees the same output order — but if a second run appends instead of overwrites, the data still doubles. You need idempotency so re-runs don't grow the dataset and determinism so the bytes match. Together they make backfills and audits possible.

The pipeline is a DAG

Each stage depends on the outputs of earlier stages and produces outputs consumed by later stages. That structure is a directed acyclic graph (DAG). Cycles are impossible by definition: gold cannot depend on silver if silver depends on gold. The DAG structure is what allows an orchestrator to:

The tooling for managing that DAG — Airflow, Dagster, Prefect, Flyte — is the subject of lesson 09. For now, the key mental model is: design each stage as a pure function (inputs → outputs, no hidden state, no side effects outside the declared output path), and the DAG takes care of the rest.

Forward reference
Lesson 09 covers orchestration in depth: how to declare the DAG in code, how retries and backfills interact with idempotency, and how to version both the pipeline code and the data it produces so a dataset is fully reproducible months later.

Putting it together: the full pipeline as a DAG of pure stages

  SOURCES
    │  (lesson 03: connectors, schema-on-read, provenance capture)
    ▼
  BRONZE  ──  raw, immutable, append-only
    │  stage: ingest
    │  pattern: append new partitions only; never modify existing ones
    ▼
  SILVER  ──  clean, deduped, decontaminated, quality-gated
    │  stage: transform (lessons 05–08)
    │  pattern: overwrite partition; keyed upsert; or atomic swap
    ▼
  GOLD  ──  tokenized, packed, model-ready
    │  stage: tokenize + pack (lesson 07)
    │  pattern: overwrite; version by tokenizer hash + packing config
    ▼
  TRAINER  (SFT · DPO · RL)

  Each arrow is a pure function: same inputs + pinned env → same bytes.
  The DAG orchestrator (lesson 09) runs them in dependency order.
Takeaway — setting up lesson 03
The medallion skeleton gives every post-training pipeline the same three-layer shape: bronze is the source of truth, silver is the curated view, gold is what the trainer reads. The two correctness invariants — idempotency and determinism — are what let you trust a dataset you built six months ago, backfill a partition after fixing a bug, and reproduce a training run exactly. Lesson 03 (ingestion) picks up at the very first arrow: how raw data from annotation platforms, synthetic generators, public corpora, and RL rollouts actually lands in bronze — with provenance intact from the moment it arrives.