Ingestion & provenance
Before any transformation can happen the data has to land. This lesson covers where post-training data comes from, how it arrives in the bronze layer, and — most critically — why the metadata that describes what it is and where it came from must be captured at the door or it is lost forever.
The five source archetypes
Post-training data arrives from a small number of source types. Each has a characteristic volume, quality, cost, and risk profile that shapes how you ingest and how much you trust it.
| Source | Volume | Quality | Cost | Main risk |
|---|---|---|---|---|
| Human annotation (SFT demos, preference pairs, ratings) | Low–Medium (K–M rows) | High | Very high ($5–100/example) | Schema drift from labeling-spec changes; annotator disagreement not captured |
| Synthetic generation (LLM instructions/responses, distillation, self-instruct) | High (M–B rows) | Medium | Low ($0.001–0.01/example) | Mode collapse; teacher-model contamination; license of the generating model |
| Production logs & telemetry (user prompts, thumbs up/down, edits) | Very high (M–B events/day) | Low–Medium (implicit labels) | Near-zero (already collected) | PII; consent; survivorship bias in logged signals |
| Public datasets (Hugging Face Hub, academic corpora) | Medium–High (varies widely) | Medium | Very low (download cost only) | License ambiguity; unknown quality; test-set contamination |
| Web scrape (Common Crawl, targeted crawlers) | Very high (B–T tokens) | Low | Low (infra cost only) | PII at scale; copyright; TOS violations; requires heavy downstream cleaning |
The tradeoff is consistent: human annotation and web scrape are at opposite corners — one is expensive and clean, the other is cheap and noisy. Synthetic data is the middle ground that post-training pipelines lean on heavily, but it carries hidden risks (mode collapse, contamination from the teacher model) that require their own mitigations.
The fan-in: sources landing in bronze
Regardless of source type, every ingest job terminates at the same destination: an append-only bronze partition stamped with provenance. The diagram below shows the five source types converging on bronze, each edge annotated with the metadata it must carry.
Ingestion patterns
Full load vs incremental load
A full load re-ingests the entire source on every run. It is simple and correct but expensive for large sources. Use it when the source is small or when the upstream system does not expose a change boundary.
An incremental load ingests only records newer than a watermark (a timestamp, an offset, a sequence ID). It is the default for any source that grows continuously — production logs, annotation queues, synthetic-generation jobs. The watermark is persisted in the pipeline state so a restart does not re-ingest old data.
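The watermark pattern above can be sketched in a few lines. This is a minimal illustration, not a production job: the `seq` field, the JSON state file, and the in-memory `bronze` list are all assumptions standing in for whatever ordering key, state store, and storage layer your pipeline actually uses.

```python
import json
from pathlib import Path


def read_watermark(state_path: Path) -> int:
    """Return the last committed watermark, or 0 if no run has completed."""
    if state_path.exists():
        return json.loads(state_path.read_text())["watermark"]
    return 0


def incremental_load(source_rows, state_path: Path, bronze: list) -> list:
    """Append only rows newer than the watermark, then advance the watermark.

    `seq` is a hypothetical monotonically increasing sequence ID on each row.
    """
    watermark = read_watermark(state_path)
    new_rows = [row for row in source_rows if row["seq"] > watermark]
    if not new_rows:
        return []
    # Land the batch first, commit the watermark last.
    bronze.extend(new_rows)
    new_watermark = max(row["seq"] for row in new_rows)
    state_path.write_text(json.dumps({"watermark": new_watermark}))
    return new_rows
```

Committing the watermark after the write means a crash between the two steps re-lands the batch on the next run rather than skipping it, which is one reason landing also needs to be idempotent.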
Change-data capture (CDC) is the incremental pattern for log streams and database replicas. Instead of polling, the source emits a stream of insert/update/delete events (e.g. Debezium off a Postgres WAL, or a Kafka topic of model telemetry). The ingestion job consumes this stream and appends landed records to bronze — never modifying what is already there, because bronze is immutable.
Append-only landing and schema-on-read
Bronze is always append-only. Once a record lands it is never updated in place. If the source sends a correction, the correction lands as a new record alongside the original; reconciliation is a silver-layer concern. This is what "immutable" meant in lesson 02, and it is what makes bronze a trustworthy audit trail.
The corollary is schema-on-read: land the raw bytes (JSON, JSONL, Avro, whatever the source emits) without enforcing a strict schema at ingest time. Parse and validate later, in the silver transformation. This avoids a common failure mode where a schema-on-write ingestor rejects records because the source added a new field, causing a silent data gap. The raw bytes are the ground truth; the schema is the pipeline's interpretation of them, and it can evolve without losing history.
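As a rough sketch of the split between landing and interpreting, the example below lands source lines byte-for-byte and applies a schema only when silver reads them. The function names, the `prompt`/`response` fields, and the in-memory `bronze` list are illustrative assumptions, not part of any particular framework.

```python
import json


def land_raw(raw_lines, bronze: list) -> None:
    """Append each source line to bronze untouched; no parsing at ingest."""
    for line in raw_lines:
        bronze.append(line)


def read_for_silver(bronze, wanted=("prompt", "response")):
    """Apply a schema at read time.

    Unknown fields are ignored rather than rejected, so a source that adds
    a field does not create a gap. Unparseable lines are set aside, and
    they still exist in bronze for later inspection.
    """
    parsed, rejected = [], []
    for line in bronze:
        try:
            obj = json.loads(line)
        except ValueError:  # covers malformed JSON and bad encodings
            rejected.append(line)
            continue
        if not isinstance(obj, dict):
            rejected.append(line)
            continue
        parsed.append({key: obj.get(key) for key in wanted})
    return parsed, rejected
```

If the silver schema later starts using a newly added field, the historical bronze records can simply be re-read with a wider `wanted` tuple.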
Idempotent landing: content-hash dedupe at the door
Ingestion jobs fail and retry. Networks deliver duplicates. A source dataset can be re-uploaded by a vendor. Without a guard, bronze fills with duplicate records that poison every downstream stage.
The guard is a content hash computed over the payload bytes (SHA-256 or xxHash is typical). Before writing a record to bronze, the job checks whether that hash already exists in the landing manifest. If it does, the record is skipped — idempotent. This is the "dedupe-at-the-door" principle: coarse, cheap, and based only on exact identity. It is not the same as near-duplicate detection (that is lesson 06's job); it only prevents the same bytes from landing twice.
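A minimal sketch of dedupe-at-the-door, assuming the landing manifest fits in a Python set and bronze is a list (both are stand-ins; a real manifest would live in a durable store). The function name `land_if_new` is hypothetical.

```python
import hashlib


def land_if_new(payload: bytes, manifest: set, bronze: list) -> bool:
    """Land the payload unless identical bytes have already landed.

    Returns True if the record was written, False if it was skipped.
    """
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    if digest in manifest:
        return False  # same bytes already landed; retrying is a no-op
    bronze.append((digest, payload))
    manifest.add(digest)
    return True
```

Because the hash covers exact bytes, a payload that differs by a single trailing space counts as a new record. That is intended here: anything fuzzier belongs to near-duplicate detection downstream.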
Provenance: capture it at the door
Every record that lands in bronze must carry a set of provenance fields — metadata that describes what the record is, where it came from, and what constraints apply to its use. These fields are not optional extras; they are columns written at ingest time and propagated through every downstream layer unchanged.
{
"record_id": "sha256:a3f8...", # content hash = dedup key
"source_type": "human_annotation", # one of the five archetypes
"source_id": "vendor=scale/batch=2024-11-04",
"origin_url": "s3://landing/scale/2024-11-04/batch_007.jsonl",
"license": "CC-BY-4.0",
"consent_flag": true, # user consent obtained?
"pii_flag": false, # known/suspected PII?
"ingest_ts": "2024-11-04T18:22:01Z", # wall-clock at landing
"pipeline_run": "ingest-20241104-1822", # for lineage / audit
"payload": { ... } # the actual record, untouched
}
The rule is absolute: you can never reconstruct provenance after the fact. If a record lands in bronze without a license field, you cannot go back to the source two months later and ask what license applied to that batch — the source may have changed, the vendor may have different terms by then, or the record may have been anonymized and the link to origin broken. The provenance must be captured at the moment of ingest, when the link to the original source is live.
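One way to enforce capture at the door is to have the ingest job refuse to build a bronze record when any provenance field is missing. The sketch below assumes that policy (rejecting rather than landing with blanks is a design choice, not something the schema itself mandates) and stores the payload as raw bytes; the `stamp` helper and `REQUIRED` tuple are hypothetical names.

```python
import hashlib
from datetime import datetime, timezone

# Provenance fields the caller must supply; mirrors the record layout above.
REQUIRED = (
    "source_type",
    "source_id",
    "origin_url",
    "license",
    "consent_flag",
    "pii_flag",
)


def stamp(payload: bytes, provenance: dict, pipeline_run: str) -> dict:
    """Wrap raw payload bytes in a bronze record with provenance attached."""
    missing = [key for key in REQUIRED if key not in provenance]
    if missing:
        raise ValueError(f"refusing to land record without provenance: {missing}")
    record = {"record_id": "sha256:" + hashlib.sha256(payload).hexdigest()}
    record.update({key: provenance[key] for key in REQUIRED})
    record["ingest_ts"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record["pipeline_run"] = pipeline_run
    record["payload"] = payload  # untouched
    return record
```

Failing loudly at ingest surfaces the gap while the link to the source is still live and someone can still answer the question.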
Why does this matter in practice?
- Legal use gating. If a license turns out to be incompatible with your use case (e.g. a dataset that prohibits commercial use), you need to be able to filter every record from that source out of your training set. That requires source_id to be present on every record.
- Decontamination and audit. When an evaluation set is found to overlap with your training data, you need to trace exactly which ingestion batch contributed the contaminated records. Without pipeline_run and origin_url, this is impossible at scale.
- Source removal. If a data vendor changes their terms or a human-annotation batch is found to have annotator-quality issues, you need to drop every record from that source, not guess which ones they are. source_id makes this a single filter predicate.
- PII compliance. Right-to-erasure requests require knowing which records contain data from a specific user. Without pii_flag and origin_url at the record level, you cannot respond to these requests programmatically.
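To make the "single filter predicate" point concrete, here is a small sketch over records shaped like the bronze example earlier. The helper names and the in-memory list of dicts are assumptions for illustration; in practice these would be filters in whatever query engine sits over your tables.

```python
def drop_sources(records, blocked_source_ids=(), blocked_licenses=()):
    """Return only records whose source and license are still allowed."""
    blocked_ids = set(blocked_source_ids)
    blocked_lic = set(blocked_licenses)
    return [
        r
        for r in records
        if r["source_id"] not in blocked_ids and r["license"] not in blocked_lic
    ]


def trace_batches(records, contaminated_record_ids):
    """Map each contaminated record to the run and origin that landed it."""
    wanted = set(contaminated_record_ids)
    return {
        r["record_id"]: (r["pipeline_run"], r["origin_url"])
        for r in records
        if r["record_id"] in wanted
    }
```

Neither function inspects the payload. Both work purely from provenance columns, which is only possible because those columns were written at ingest.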
For synthetic records, a generator_model provenance field is not decorative; it gates whether that record is legally usable. This information exists only at generation time. If you land the output without recording which model produced it, you cannot selectively exclude restricted outputs later.
What ingestion does NOT do
Ingestion's scope is narrow by design. The ingest job does not:
- Validate payload schema (that is a silver-layer quality gate, lesson 08).
- Detect near-duplicates across records (that is lesson 06's MinHash/LSH step).
- Normalize or clean content (silver transformation, lesson 05).
- Tokenize (gold layer, lesson 07).
- Score quality (lesson 08).
Its only jobs are: land the raw bytes, stamp provenance, compute the content hash, and skip if already seen. Everything else is downstream. This narrow scope is what keeps bronze an honest audit trail — if the ingest job also "fixed" things, you would lose the original signal and your bronze would not be truly raw.