Ingestion & provenance

Before any transformation can happen the data has to land. This lesson covers where post-training data comes from, how it arrives in the bronze layer, and — most critically — why the metadata that describes what it is and where it came from must be captured at the door or it is lost forever.

Where we are

Lesson 02 established the medallion layout: bronze = raw, immutable, append-only. Lesson 03 answers the prior question — how does anything get into bronze in the first place? By the end you will know the five source archetypes, the ingestion patterns that keep bronze reliable, and why provenance is a first-class column, not an afterthought.

The five source archetypes

Post-training data arrives from a small number of source types. Each has a characteristic volume, quality, cost, and risk profile that shapes how you ingest and how much you trust it.

Source	Volume	Quality	Cost	Main risk
Human annotation (SFT demos, preference pairs, ratings)	Low–Medium (K–M rows)	High	Very high ($5–100/example)	Schema drift from labeling-spec changes; annotator disagreement not captured
Synthetic generation (LLM instructions/responses, distillation, self-instruct)	High (M–B rows)	Medium	Low ($0.001–0.01/example)	Mode collapse; teacher-model contamination; license of the generating model
Production logs & telemetry (user prompts, thumbs up/down, edits)	Very high (M–B events/day)	Low–Medium (implicit labels)	Near-zero (already collected)	PII; consent; survivorship bias in logged signals
Public datasets (Hugging Face Hub, academic corpora)	Medium–High (varies widely)	Medium	Very low (download cost only)	License ambiguity; unknown quality; test-set contamination
Web scrape (Common Crawl, targeted crawlers)	Very high (B–T tokens)	Low	Low (infra cost only)	PII at scale; copyright; TOS violations; requires heavy downstream cleaning

The tradeoff is consistent: human annotation and web scrape are at opposite corners — one is expensive and clean, the other is cheap and noisy. Synthetic data is the middle ground that post-training pipelines lean on heavily, but it carries hidden risks (mode collapse, contamination from the teacher model) that require their own mitigations.

The fan-in: sources landing in bronze

Regardless of source type, every ingest job terminates at the same destination: an append-only bronze partition stamped with provenance. The diagram below shows the five source types converging on bronze, each edge annotated with the metadata it must carry.

Ingestion patterns

Full load vs incremental load

A full load re-ingests the entire source on every run. It is simple and correct but expensive for large sources. Use it when the source is small or when the upstream system does not expose a change boundary.

An incremental load ingests only records newer than a watermark (a timestamp, an offset, a sequence ID). It is the default for any source that grows continuously — production logs, annotation queues, synthetic-generation jobs. The watermark is persisted in the pipeline state so a restart does not re-ingest old data.
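
As a rough illustration of the watermark pattern, the sketch below keeps the last ingested sequence ID in a small JSON state file and appends only rows past it. The names `incremental_load`, `seq_id`, and the state-file layout are assumptions made for this example, and a plain Python list stands in for the bronze partition.

```python
import json
import tempfile
from pathlib import Path

def incremental_load(source_rows, state_path, bronze):
    """Append rows newer than the persisted watermark, then advance it."""
    state_file = Path(state_path)
    # Hypothetical state format: {"watermark": <last seq_id ingested>}
    if state_file.exists():
        watermark = json.loads(state_file.read_text())["watermark"]
    else:
        watermark = -1  # nothing ingested yet

    new_rows = [r for r in source_rows if r["seq_id"] > watermark]
    bronze.extend(new_rows)  # append only, never rewrite earlier rows

    if new_rows:
        # Persist the watermark only after the append has happened,
        # so a crash in between re-reads rows instead of skipping them.
        watermark = max(r["seq_id"] for r in new_rows)
        state_file.write_text(json.dumps({"watermark": watermark}))
    return len(new_rows)

with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "state.json"
    bronze = []
    rows = [{"seq_id": i, "text": f"row {i}"} for i in range(3)]
    first = incremental_load(rows, state, bronze)   # initial run
    rows.append({"seq_id": 3, "text": "row 3"})
    second = incremental_load(rows, state, bronze)  # one new row upstream
    third = incremental_load(rows, state, bronze)   # restart, nothing new
```

Writing the watermark after the append means a crash between the two steps can re-deliver rows, which is one reason the landing step also needs a duplicate guard.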

Change-data capture (CDC) is the incremental pattern for log streams and database replicas. Instead of polling, the source emits a stream of insert/update/delete events (e.g. Debezium off a Postgres WAL, or a Kafka topic of model telemetry). The ingestion job consumes this stream and appends landed records to bronze — never modifying what is already there, because bronze is immutable.

Append-only landing and schema-on-read

Bronze is always append-only. Once a record lands it is never updated in place. If the source sends a correction, the correction lands as a new record alongside the original; reconciliation is a silver-layer concern. This is what "immutable" meant in lesson 02, and it is what makes bronze a trustworthy audit trail.
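
A minimal sketch of what "corrections land as new records" can look like, assuming a simplified change-event shape with hypothetical `key`, `op`, and `seq` fields (real CDC tools emit richer envelopes). The `reconcile` function is a stand-in for the silver-layer step and never modifies the bronze rows.

```python
bronze = []  # stand-in for an append-only bronze partition

def land(event):
    """Append the event as-is; earlier rows are never touched."""
    bronze.append(dict(event))

land({"key": "ex-1", "op": "insert", "label": "good", "seq": 1})
land({"key": "ex-1", "op": "update", "label": "bad", "seq": 2})  # correction
land({"key": "ex-2", "op": "insert", "label": "good", "seq": 3})
land({"key": "ex-2", "op": "delete", "seq": 4})

def reconcile(rows):
    """Silver-side view: latest state per key, with deleted keys dropped."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["seq"]):
        if row["op"] == "delete":
            latest.pop(row["key"], None)
        else:
            latest[row["key"]] = row
    return latest

current = reconcile(bronze)
```

Bronze still holds all four events, including the original label and the deleted row, so the history stays auditable while the reconciled view shows only the current state.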

The corollary is schema-on-read: land the raw bytes (JSON, JSONL, Avro, whatever the source emits) without enforcing a strict schema at ingest time. Parse and validate later, in the silver transformation. This avoids a common failure mode where a schema-on-write ingestor rejects records because the source added a new field, causing a silent data gap. The raw bytes are the ground truth; the schema is the pipeline's interpretation of them, and it can evolve without losing history.
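
One way to picture schema-on-read, as a sketch only: the ingest side stores the bytes exactly as received, and a later reader parses them and routes anything unusable to a reject list. The function name `read_silver` and the `required` field names are assumptions for illustration.

```python
import json

raw_lines = [
    b'{"prompt": "hi", "response": "hello"}',
    b'{"prompt": "2+2?", "response": "4", "annotator_id": "a17"}',  # new field
    b'not json at all',
]

# Ingest: land the bytes untouched, with no schema check.
bronze = list(raw_lines)

def read_silver(rows, required=("prompt", "response")):
    """Parse at read time; collect bad rows instead of dropping them silently."""
    parsed, rejects = [], []
    for raw in rows:
        try:
            obj = json.loads(raw)
        except ValueError:  # covers JSON and UTF-8 decoding errors
            rejects.append(raw)
            continue
        if isinstance(obj, dict) and all(k in obj for k in required):
            parsed.append(obj)  # extra fields are kept, not rejected
        else:
            rejects.append(raw)
    return parsed, rejects

parsed, rejects = read_silver(bronze)
```

The record with the extra `annotator_id` field parses normally, and the malformed line is still present in bronze for later inspection.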

Idempotent landing: content-hash dedupe at the door

Ingestion jobs fail and retry. Networks deliver duplicates. A source dataset can be re-uploaded by a vendor. Without a guard, bronze fills with duplicate records that poison every downstream stage.

The guard is a content hash computed over the payload bytes (SHA-256 or xxHash is typical). Before writing a record to bronze, the job checks whether that hash already exists in the landing manifest. If it does, the record is skipped — idempotent. This is the "dedupe-at-the-door" principle: coarse, cheap, and based only on exact identity. It is not the same as near-duplicate detection (that is lesson 06's job); it only prevents the same bytes from landing twice.
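
The check described above can be sketched as follows, assuming an in-memory set as the landing manifest (a real pipeline would persist it) and a hypothetical helper named `land_once`. It uses SHA-256 from Python's standard `hashlib`.

```python
import hashlib

def land_once(payload: bytes, manifest: set, bronze: list) -> bool:
    """Write the payload unless its exact bytes have landed before."""
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    if digest in manifest:
        return False  # already landed: skip
    bronze.append({"record_id": digest, "payload": payload})
    manifest.add(digest)
    return True

manifest, bronze = set(), []
batch = [b'{"a": 1}', b'{"a": 2}', b'{"a": 1}']  # third item is a duplicate

results = [land_once(p, manifest, bronze) for p in batch]
retry = [land_once(p, manifest, bronze) for p in batch]  # simulated job retry
```

Because the hash is over raw bytes, two payloads that differ only in whitespace or key order count as different records here; catching those is left to the near-duplicate step.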

Tie to lesson 02

Lesson 02 defined idempotency as the property that running a pipeline step multiple times produces the same result as running it once. Content-hash deduplication at the door is how ingestion earns that property: re-running the ingest job on the same source files produces the same bronze partition, no extras.

Provenance: capture it at the door

Every record that lands in bronze must carry a set of provenance fields — metadata that describes what the record is, where it came from, and what constraints apply to its use. These fields are not optional extras; they are columns written at ingest time and propagated through every downstream layer unchanged.

{
  "record_id":    "sha256:a3f8...",          # content hash = dedup key
  "source_type":  "human_annotation",        # one of the five archetypes
  "source_id":    "vendor=scale/batch=2024-11-04",
  "origin_url":   "s3://landing/scale/2024-11-04/batch_007.jsonl",
  "license":      "CC-BY-4.0",
  "consent_flag": true,                      # user consent obtained?
  "pii_flag":     false,                     # known/suspected PII?
  "ingest_ts":    "2024-11-04T18:22:01Z",   # wall-clock at landing
  "pipeline_run": "ingest-20241104-1822",    # for lineage / audit
  "payload":      { ... }                    # the actual record, untouched
}

The rule is absolute: you can never reconstruct provenance after the fact. If a record lands in bronze without a license field, you cannot go back to the source two months later and ask what license applied to that batch — the source may have changed, the vendor may have different terms by then, or the record may have been anonymized and the link to origin broken. The provenance must be captured at the moment of ingest, when the link to the original source is live.
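
A sketch of stamping provenance at landing time, using the field names from the example record above. The `stamp` helper and its choice to refuse records with missing provenance are assumptions of this example, not a prescribed implementation; the payload is stored as decoded text here purely so the record is JSON-serializable.

```python
import hashlib
from datetime import datetime, timezone

REQUIRED = ("source_type", "source_id", "origin_url",
            "license", "consent_flag", "pii_flag")

def stamp(payload_bytes: bytes, provenance: dict, pipeline_run: str) -> dict:
    """Wrap a raw payload with provenance; fail loudly if any field is missing."""
    missing = [k for k in REQUIRED if k not in provenance]
    if missing:
        raise ValueError(f"refusing to land record without provenance: {missing}")
    return {
        "record_id": "sha256:" + hashlib.sha256(payload_bytes).hexdigest(),
        **{k: provenance[k] for k in REQUIRED},
        "ingest_ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "pipeline_run": pipeline_run,
        "payload": payload_bytes.decode("utf-8"),  # content itself is unchanged
    }

prov = {
    "source_type": "human_annotation",
    "source_id": "vendor=scale/batch=2024-11-04",
    "origin_url": "s3://landing/scale/2024-11-04/batch_007.jsonl",
    "license": "CC-BY-4.0",
    "consent_flag": True,
    "pii_flag": False,
}
record = stamp(b'{"prompt": "hi"}', prov, "ingest-20241104-1822")

try:
    stamp(b'{"prompt": "hi"}', {"source_type": "web_scrape"}, "ingest-x")
    rejected = False
except ValueError:
    rejected = True
```

Failing the write when a field is absent is one possible policy; it trades a visible ingest error now for not discovering the gap months later.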

Why does this matter in practice?

Legal use gating. If a license turns out to be incompatible with your use case (e.g. a dataset that prohibits commercial use), you need to be able to filter every record from that source out of your training set. That requires license and source_id to be present on every record.
Decontamination and audit. When an evaluation set is found to overlap with your training data, you need to trace exactly which ingestion batch contributed the contaminated records. Without pipeline_run and origin_url, this is impossible at scale.
Source removal. If a data vendor changes their terms or a human-annotation batch is found to have annotator-quality issues, you need to drop every record from that source — not guess which ones they are. source_id makes this a single filter predicate.
PII compliance. Right-to-erasure requests require knowing which records contain data from a specific user. Without pii_flag and origin_url at the record level, you cannot narrow the search to the records and source files that may hold that user's data, let alone respond to these requests programmatically.
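
The cases above reduce to simple filters once the fields are on every record. The sketch below uses made-up records and hypothetical helper names (`drop_source`, `allowed_by_license`, `runs_for`) to show the shape of those predicates.

```python
records = [
    {"record_id": "r1", "source_id": "vendor=a/batch=1",
     "license": "CC-BY-4.0", "pipeline_run": "run-1"},
    {"record_id": "r2", "source_id": "vendor=b/batch=7",
     "license": "CC-BY-NC-4.0", "pipeline_run": "run-1"},
    {"record_id": "r3", "source_id": "vendor=b/batch=8",
     "license": "CC-BY-NC-4.0", "pipeline_run": "run-2"},
]

def drop_source(rows, source_prefix):
    """Source removal: exclude everything from one vendor or batch."""
    return [r for r in rows if not r["source_id"].startswith(source_prefix)]

def allowed_by_license(rows, allowed):
    """Legal gating: keep only records whose license is on the allow list."""
    return [r for r in rows if r["license"] in allowed]

def runs_for(rows, record_ids):
    """Audit: which ingest runs contributed the given records?"""
    return sorted({r["pipeline_run"] for r in rows if r["record_id"] in record_ids})

without_vendor_b = drop_source(records, "vendor=b/")
commercial_ok = allowed_by_license(records, {"CC-BY-4.0"})
contaminated_runs = runs_for(records, {"r2", "r3"})
```

None of these filters can be written if the fields were never captured, which is the practical meaning of "first-class column".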

The trap: "we'll add provenance later"

The most common provenance mistake is landing the raw payload in bronze and treating license/source/consent as pipeline-level configuration that can be added in the silver transformation. It cannot. By the time silver runs, records from multiple sources have been mixed and re-partitioned. The link between a record and its origin exists only at the moment the record crosses the wire from the source system. Miss that window and you have a bronze lake of data you cannot legally audit, selectively remove, or trace. The cost of capturing provenance at ingest is a few extra columns. The cost of not having it is discovered the first time a legal or compliance team asks for a source-level breakdown.

Synthetic data: the hidden license risk

A synthetic record generated by calling a proprietary model API (e.g. GPT-4, Claude) may inherit the generating model's terms of service, which can prohibit using outputs to train a competing model. For synthetic sources, a generator_model provenance field (an addition to the base fields in the example record above) is not decorative; it gates whether that record is legally usable. This information exists only at generation time. If you land the output without recording which model produced it, you cannot selectively exclude restricted outputs later.
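
As a sketch of how a generator_model field could gate usage: the policy table below is entirely hypothetical (the model names and their allowed/disallowed status are placeholders, not statements about any real provider's terms), and `usable_for_training` is an illustrative helper.

```python
# Hypothetical policy table; real entries must come from a legal review
# of each generating model's terms.
TRAINING_ALLOWED = {
    "open-model-x": True,
    "proprietary-api-y": False,
}

def usable_for_training(record):
    """Gate synthetic records on the model that generated them."""
    if record.get("source_type") != "synthetic_generation":
        return True  # this sketch only gates synthetic data
    model = record.get("generator_model")
    # Unknown or missing generator: exclude rather than guess.
    return TRAINING_ALLOWED.get(model, False)

records = [
    {"record_id": "s1", "source_type": "synthetic_generation",
     "generator_model": "open-model-x"},
    {"record_id": "s2", "source_type": "synthetic_generation",
     "generator_model": "proprietary-api-y"},
    {"record_id": "s3", "source_type": "synthetic_generation"},  # not recorded
    {"record_id": "h1", "source_type": "human_annotation"},
]
kept = [r["record_id"] for r in records if usable_for_training(r)]
```

Record s3 shows the cost of a missing field: with no generator recorded, the conservative choice is to exclude it, even if it would have been usable.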

What ingestion does NOT do

Ingestion's scope is narrow by design. The ingest job does not:

Validate payload schema (that is a silver-layer quality gate, lesson 08).
Detect near-duplicates across records (that is lesson 06's MinHash/LSH step).
Normalize or clean content (silver transformation, lesson 05).
Tokenize (gold layer, lesson 07).
Score quality (lesson 08).

Its only jobs are: land the raw bytes, stamp provenance, compute the content hash, and skip if already seen. Everything else is downstream. This narrow scope is what keeps bronze an honest audit trail — if the ingest job also "fixed" things, you would lose the original signal and your bronze would not be truly raw.

Takeaway

What to carry to lesson 04

Data arrives from five source archetypes that differ by orders of magnitude in volume and cost, and widely in quality and risk. All of them land in bronze via an append-only, idempotent ingest job that writes the raw payload unchanged and stamps provenance (source, license, consent, PII flag, timestamp) at the moment of landing. That provenance is irreplaceable: miss it and you cannot legally gate, selectively drop, or audit by source later. Lesson 04 takes over once the data is in bronze and asks the next question: how should bronze (and silver) be physically stored? The answer (JSONL vs Parquet, row vs columnar, partitioning) determines how cheaply and quickly every downstream stage can read what the ingest job just landed.