Ingestion & provenance

Before any transformation can happen, the data has to land. This lesson covers where post-training data comes from, how it arrives in the bronze layer, and, most critically, why the metadata that describes what it is and where it came from must be captured at the door or it is lost forever.

Where we are
Lesson 02 established the medallion layout: bronze = raw, immutable, append-only. Lesson 03 answers the prior question — how does anything get into bronze in the first place? By the end you will know the five source archetypes, the ingestion patterns that keep bronze reliable, and why provenance is a first-class column, not an afterthought.

The five source archetypes

Post-training data arrives from a small number of source types. Each has a characteristic volume, quality, cost, and risk profile that shapes how you ingest and how much you trust it.

| Source | Volume | Quality | Cost | Main risk |
| --- | --- | --- | --- | --- |
| Human annotation (SFT demos, preference pairs, ratings) | Low–Medium (K–M rows) | High | Very high ($5–100/example) | Schema drift from labeling-spec changes; annotator disagreement not captured |
| Synthetic generation (LLM instructions/responses, distillation, self-instruct) | High (M–B rows) | Medium | Low ($0.001–0.01/example) | Mode collapse; teacher-model contamination; license of the generating model |
| Production logs & telemetry (user prompts, thumbs up/down, edits) | Very high (M–B events/day) | Low–Medium (implicit labels) | Near-zero (already collected) | PII; consent; survivorship bias in logged signals |
| Public datasets (Hugging Face Hub, academic corpora) | Medium–High (varies widely) | Medium | Very low (download cost only) | License ambiguity; unknown quality; test-set contamination |
| Web scrape (Common Crawl, targeted crawlers) | Very high (B–T tokens) | Low | Low (infra cost only) | PII at scale; copyright; TOS violations; requires heavy downstream cleaning |

The tradeoff is consistent: human annotation and web scrape are at opposite corners — one is expensive and clean, the other is cheap and noisy. Synthetic data is the middle ground that post-training pipelines lean on heavily, but it carries hidden risks (mode collapse, contamination from the teacher model) that require their own mitigations.
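To make the cost gap concrete, the per-example prices from the table can be multiplied out for a dataset of a given size. The sketch below is only arithmetic on those table figures; the dataset size `N` and the helper name `budget_range` are assumptions chosen for illustration.

```python
from decimal import Decimal

# Per-example cost ranges in USD, copied from the table above.
COST_PER_EXAMPLE = {
    "human_annotation": (Decimal("5"), Decimal("100")),
    "synthetic_generation": (Decimal("0.001"), Decimal("0.01")),
}


def budget_range(source_type: str, n_examples: int) -> tuple[Decimal, Decimal]:
    """Return the (low, high) total cost in USD for n_examples of one source."""
    low, high = COST_PER_EXAMPLE[source_type]
    return low * n_examples, high * n_examples


N = 100_000  # hypothetical dataset size, chosen only for illustration
for source in COST_PER_EXAMPLE:
    low, high = budget_range(source, N)
    print(f"{source}: ${low:,.0f} to ${high:,.0f}")
```

At this assumed size, human annotation comes to between $500,000 and $10,000,000, while synthetic generation comes to between $100 and $1,000.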

The fan-in: sources landing in bronze

Regardless of source type, every ingest job terminates at the same destination: an append-only bronze partition stamped with provenance. The diagram below shows the five source types converging on bronze, each edge annotated with the metadata it must carry.

[Figure: fan-in diagram. The five source types (Human Annotation: SFT demos, prefs, ratings; Synthetic Generation: distill, self-instruct, LLM; Prod Logs & Telemetry: prompts, thumbs, edits; Public Datasets: HF Hub, academic corpora; Web Scrape: Common Crawl, crawlers) feed an INGESTION stage (idempotent landing, content-hash dedupe, schema-on-read, provenance attached, license / PII flags set, ingest timestamp written), which emits a landed record into BRONZE (raw, immutable, provenance-stamped). Edge labels, in source order: spec-version, teacher model id, consent flag, license, TOS / PII risk.]

Ingestion patterns

Full load vs incremental load

A full load re-ingests the entire source on every run. It is simple and correct but expensive for large sources. Use it when the source is small or when the upstream system does not expose a change boundary.

An incremental load ingests only records newer than a watermark (a timestamp, an offset, a sequence ID). It is the default for any source that grows continuously — production logs, annotation queues, synthetic-generation jobs. The watermark is persisted in the pipeline state so a restart does not re-ingest old data.
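A watermark-based load might look roughly like the sketch below. It assumes each source row carries a monotonically increasing `seq` field and that pipeline state lives in a small JSON file; both are illustrative choices, as are the function names. A plain list stands in for the bronze partition.

```python
import json
import tempfile
from pathlib import Path


def load_watermark(state_path: Path) -> int:
    """Return the last ingested sequence ID, or -1 if no run has completed."""
    if state_path.exists():
        return json.loads(state_path.read_text())["watermark"]
    return -1


def incremental_load(source_rows: list[dict], state_path: Path, bronze: list[dict]) -> int:
    """Append rows newer than the watermark to bronze, then advance the watermark."""
    watermark = load_watermark(state_path)
    new_rows = [row for row in source_rows if row["seq"] > watermark]
    if not new_rows:
        return 0
    bronze.extend(new_rows)  # land the data first ...
    new_watermark = max(row["seq"] for row in new_rows)
    state_path.write_text(json.dumps({"watermark": new_watermark}))  # ... then persist state
    return len(new_rows)


# Demo: the second run only picks up the row added after the first run.
state_path = Path(tempfile.mkdtemp()) / "ingest_state.json"
bronze: list[dict] = []
source = [{"seq": 1, "text": "a"}, {"seq": 2, "text": "b"}]
first = incremental_load(source, state_path, bronze)
source.append({"seq": 3, "text": "c"})
second = incremental_load(source, state_path, bronze)
print(first, second, len(bronze))
```

The watermark is written only after the rows have landed, so a crash between the two steps causes a re-read on restart (which exact-duplicate checks can absorb) rather than a gap.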

Change-data capture (CDC) is the incremental pattern for log streams and database replicas. Instead of polling, the source emits a stream of insert/update/delete events (e.g. Debezium off a Postgres WAL, or a Kafka topic of model telemetry). The ingestion job consumes this stream and appends landed records to bronze — never modifying what is already there, because bronze is immutable.

Append-only landing and schema-on-read

Bronze is always append-only. Once a record lands it is never updated in place. If the source sends a correction, the correction lands as a new record alongside the original; reconciliation is a silver-layer concern. This is what "immutable" meant in lesson 02, and it is what makes bronze a trustworthy audit trail.
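The append-only rule can be sketched as a store that offers append and read but no update. The class and field names below (`LandedRecord`, `BronzeLog`, `source_key`, `op`) are hypothetical, and an in-memory list stands in for real storage.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class LandedRecord:
    source_key: str  # the source system's own identifier for the row
    op: str          # "insert", "update", or "delete", as emitted by the source
    payload: str     # raw event body, stored as received


class BronzeLog:
    """Append-only store: exposes append and read, but no update or delete."""

    def __init__(self) -> None:
        self._records: list[LandedRecord] = []

    def append(self, record: LandedRecord) -> None:
        self._records.append(record)

    def history(self, source_key: str) -> list[LandedRecord]:
        """Every landed version of one source row, in arrival order."""
        return [r for r in self._records if r.source_key == source_key]


bronze = BronzeLog()
bronze.append(LandedRecord("row-17", "insert", '{"rating": 4}'))
bronze.append(LandedRecord("row-17", "update", '{"rating": 5}'))  # correction lands alongside

versions = bronze.history("row-17")
print([v.op for v in versions])

try:
    versions[0].payload = '{"rating": 5}'  # an in-place edit is refused
except FrozenInstanceError:
    print("landed records cannot be modified")
```

Both versions of the row remain readable; choosing which one wins is left to the silver layer.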

The corollary is schema-on-read: land the raw bytes (JSON, JSONL, Avro, whatever the source emits) without enforcing a strict schema at ingest time. Parse and validate later, in the silver transformation. This avoids a common failure mode where a schema-on-write ingestor rejects records because the source added a new field, causing a silent data gap. The raw bytes are the ground truth; the schema is the pipeline's interpretation of them, and it can evolve without losing history.
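A minimal sketch of the split between landing and interpreting follows. The required field names `prompt` and `response` are assumptions for illustration, and the function names are hypothetical.

```python
import json


def land_raw(line: bytes, bronze: list[bytes]) -> None:
    """Ingest step: store the bytes exactly as received, with no parsing."""
    bronze.append(line)


def read_for_silver(bronze: list[bytes], required=("prompt", "response")):
    """Silver step: parse on read. Extra fields are kept; bad rows are set aside."""
    parsed, rejected = [], []
    for raw in bronze:
        try:
            row = json.loads(raw)
        except ValueError:  # covers malformed JSON and undecodable bytes
            rejected.append(raw)
            continue
        if isinstance(row, dict) and all(field in row for field in required):
            parsed.append(row)
        else:
            rejected.append(raw)
    return parsed, rejected


bronze: list[bytes] = []
for line in [
    b'{"prompt": "hi", "response": "hello"}',
    b'{"prompt": "hi", "response": "hello", "rating": 5}',  # source added a field
    b'{"prompt": "truncated',                               # malformed row
]:
    land_raw(line, bronze)

parsed, rejected = read_for_silver(bronze)
print(len(bronze), len(parsed), len(rejected))
```

All three lines land in bronze. The row with the new `rating` field parses normally, and the malformed row is rejected at read time while its original bytes stay available for a later, more tolerant parser.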

Idempotent landing: content-hash dedupe at the door

Ingestion jobs fail and retry. Networks deliver duplicates. A source dataset can be re-uploaded by a vendor. Without a guard, bronze fills with duplicate records that poison every downstream stage.

The guard is a content hash computed over the payload bytes (SHA-256 or xxHash is typical). Before writing a record to bronze, the job checks whether that hash already exists in the landing manifest. If it does, the record is skipped — idempotent. This is the "dedupe-at-the-door" principle: coarse, cheap, and based only on exact identity. It is not the same as near-duplicate detection (that is lesson 06's job); it only prevents the same bytes from landing twice.

Tie to lesson 02
Lesson 02 defined idempotency as the property that running a pipeline step multiple times produces the same result as running it once. Content-hash deduplication at the door is how ingestion earns that property: re-running the ingest job on the same source files produces the same bronze partition, no extras.
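Dedupe at the door can be sketched with `hashlib` from the Python standard library. The in-memory `manifest` set and `bronze` list are stand-ins for whatever persistent manifest and storage a real pipeline uses, and `land_batch` is a hypothetical name.

```python
import hashlib


def content_hash(payload: bytes) -> str:
    """SHA-256 over the exact payload bytes, prefixed with the algorithm name."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def land_batch(payloads, manifest: set[str], bronze: list[tuple[str, bytes]]) -> int:
    """Write each payload once; skip any whose hash is already in the manifest."""
    written = 0
    for payload in payloads:
        key = content_hash(payload)
        if key in manifest:
            continue  # same bytes already landed: skip
        manifest.add(key)
        bronze.append((key, payload))
        written += 1
    return written


manifest: set[str] = set()
bronze: list[tuple[str, bytes]] = []
batch = [b'{"a":1}', b'{"a":2}', b'{"a":1}']  # third item repeats the first

first_run = land_batch(batch, manifest, bronze)
retry = land_batch(batch, manifest, bronze)               # simulated job retry
reformatted = land_batch([b'{"a": 1}'], manifest, bronze)  # differs by one space
print(first_run, retry, reformatted)
```

The retry writes nothing, which is the idempotency property. The reformatted payload does land, because the guard checks exact byte identity only.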

Provenance: capture it at the door

Every record that lands in bronze must carry a set of provenance fields — metadata that describes what the record is, where it came from, and what constraints apply to its use. These fields are not optional extras; they are columns written at ingest time and propagated through every downstream layer unchanged.

{
  "record_id":    "sha256:a3f8...",          # content hash = dedup key
  "source_type":  "human_annotation",        # one of the five archetypes
  "source_id":    "vendor=scale/batch=2024-11-04",
  "origin_url":   "s3://landing/scale/2024-11-04/batch_007.jsonl",
  "license":      "CC-BY-4.0",
  "consent_flag": true,                      # user consent obtained?
  "pii_flag":     false,                     # known/suspected PII?
  "ingest_ts":    "2024-11-04T18:22:01Z",   # wall-clock at landing
  "pipeline_run": "ingest-20241104-1822",    # for lineage / audit
  "payload":      { ... }                    # the actual record, untouched
}

The rule is absolute: you can never reconstruct provenance after the fact. If a record lands in bronze without a license field, you cannot go back to the source two months later and ask what license applied to that batch — the source may have changed, the vendor may have different terms by then, or the record may have been anonymized and the link to origin broken. The provenance must be captured at the moment of ingest, when the link to the original source is live.
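One way to enforce capture at the door is a wrapper that refuses to land a payload unless every provenance field is supplied. This is a sketch, not a prescribed implementation: the function name `stamp` is hypothetical, and only `human_annotation` appears in the lesson as a `source_type` value, so the other four names are assumed.

```python
import hashlib
from datetime import datetime, timezone

# Only "human_annotation" appears in the lesson; the other names are assumed.
SOURCE_TYPES = {
    "human_annotation", "synthetic_generation", "production_logs",
    "public_dataset", "web_scrape",
}
REQUIRED = ("source_type", "source_id", "origin_url", "license",
            "consent_flag", "pii_flag", "pipeline_run")


def stamp(payload: bytes, **provenance) -> dict:
    """Wrap raw payload bytes in a bronze record; fail if provenance is incomplete."""
    missing = [name for name in REQUIRED if provenance.get(name) is None]
    if missing:
        raise ValueError(f"refusing to land record without provenance: {missing}")
    if provenance["source_type"] not in SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {provenance['source_type']!r}")
    record = {"record_id": "sha256:" + hashlib.sha256(payload).hexdigest()}
    record.update({name: provenance[name] for name in REQUIRED})
    record["ingest_ts"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record["payload"] = payload  # stored untouched
    return record


record = stamp(
    b'{"prompt": "hi"}',
    source_type="human_annotation",
    source_id="vendor=scale/batch=2024-11-04",
    origin_url="s3://landing/scale/2024-11-04/batch_007.jsonl",
    license="CC-BY-4.0",
    consent_flag=True,
    pii_flag=False,
    pipeline_run="ingest-20241104-1822",
)
print(record["record_id"][:15], record["ingest_ts"])
```

Failing loudly on a missing field is a design choice: a rejected batch can be retried with the right metadata, while a landed record with no license cannot be repaired later.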

Why does this matter in practice?

The trap: "we'll add provenance later"
The most common provenance mistake is landing the raw payload in bronze and treating license/source/consent as pipeline-level configuration that can be added in the silver transformation. It cannot. By the time silver runs, records from multiple sources have been mixed and re-partitioned. The link between a record and its origin exists only at the moment the record crosses the wire from the source system. Miss that window and you have a bronze lake of data you cannot legally audit, selectively remove, or trace. The cost of capturing provenance at ingest is a few extra columns. The cost of not having it is discovered the first time a legal or compliance team asks for a source-level breakdown.

Each source type has different provenance fields that matter. The payload stays raw and untouched; only the wrapper metadata is written by the ingestion job.

Synthetic data: the hidden license risk
A synthetic record generated by calling a proprietary model API (e.g. GPT-4, Claude) may inherit the generating model's terms of service — which can prohibit using outputs to train a competing model. The generator_model provenance field is not decorative; it gates whether that record is legally usable. This information exists only at generation time. If you land the output without recording which model produced it, you cannot selectively exclude restricted outputs later.
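With `generator_model` recorded, selective exclusion becomes a simple filter. In the sketch below the restricted-model list, the model names, and the `synthetic_generation` source-type value are all placeholders, and treating a missing `generator_model` as unusable is an assumed policy, not one the lesson specifies.

```python
# Placeholder identifier; a real list would come from a license review.
RESTRICTED_GENERATORS = {"example-vendor/model-a"}


def passes_generator_gate(record: dict, restricted: set[str]) -> bool:
    """Gate synthetic records on their generating model; other types pass through."""
    if record.get("source_type") != "synthetic_generation":
        return True  # this gate only concerns synthetic data
    generator = record.get("generator_model")
    if generator is None:
        return False  # assumed policy: unknown generator cannot be cleared
    return generator not in restricted


records = [
    {"record_id": "r1", "source_type": "synthetic_generation",
     "generator_model": "example-vendor/model-a"},
    {"record_id": "r2", "source_type": "synthetic_generation",
     "generator_model": "example-lab/open-model"},
    {"record_id": "r3", "source_type": "synthetic_generation"},  # generator not recorded
    {"record_id": "r4", "source_type": "human_annotation"},
]
kept = [r["record_id"] for r in records if passes_generator_gate(r, RESTRICTED_GENERATORS)]
print(kept)
```

Record `r3` shows the cost of the missing field: without a recorded generator there is nothing to check against the restricted list, so the record cannot be cleared for use.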

What ingestion does NOT do

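Ingestion's scope is narrow by design. The ingest job does not enforce a strict schema, parse or validate the payload, reconcile corrections against earlier records, or detect near-duplicates. As the earlier sections note, those belong to the silver layer and to lesson 06.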

Its only jobs are: land the raw bytes, stamp provenance, compute the content hash, and skip if already seen. Everything else is downstream. This narrow scope is what keeps bronze an honest audit trail — if the ingest job also "fixed" things, you would lose the original signal and your bronze would not be truly raw.

Takeaway

What to carry to lesson 04
Data arrives from five source archetypes whose volume and cost differ by several orders of magnitude, and whose quality and risk differ just as sharply. All of them land in bronze via an append-only, idempotent ingest job that writes the raw payload unchanged and stamps provenance (source, license, consent, PII flag, timestamp) at the moment of landing. That provenance is irreplaceable: miss it and you cannot legally gate, selectively drop, or audit by source later. Lesson 04 takes over once the data is in bronze and asks the next question: how should bronze (and silver) be physically stored? The answer (JSONL vs Parquet, row vs columnar, partitioning) determines how cheaply and quickly every downstream stage can read what the ingest job just landed.