Orchestration & versioning
You've built every stage of the pipeline — ingest, store, transform, dedup, tokenize, quality-gate. Now you have to wire them into a schedulable, retryable, versioned DAG that can fail gracefully, recover from bugs, and prove that a training run can be reproduced months later.
Why an orchestrator — you can't just run stages by hand
Once the pipeline has more than two stages, the "just run it by hand" model breaks. Tasks have dependencies (tokenize can't start until dedup finishes), schedules (nightly, or triggered by new data), and failure modes (a quality-gate job crashes at 3 AM). You need something that encodes the dependency graph, retries failed tasks, skips tasks whose inputs haven't changed, and sends an alert — not a shell script run in a tmux session.
An orchestrator is a system that executes a directed acyclic graph (DAG) of tasks, where each node is a unit of work and each edge is a dependency. It provides:
- Scheduling — run on a cron, on a trigger, or on demand.
- Dependency tracking — task B doesn't start until task A finishes successfully.
- Retries with backoff — transient failures (a flaky API, a network hiccup) are retried automatically, not silently dropped.
- Backfills — re-run a historical date range after fixing a bug without touching future runs.
- Observability — a UI showing which tasks are running, which failed, and why.
Four orchestrators and their flavor
| Tool | Orientation | Sweet spot | Watch out |
|---|---|---|---|
| Airflow | Task-centric, schedule-first. DAG = Python file; tasks are Operators. | Mature ecosystem; every data source has an operator. Standard choice for batch ETL with fixed schedules. | No native data lineage; little awareness of what a task produces. Wiring tasks to data is manual. The scheduler can become a bottleneck on very large DAGs. |
| Dagster | Asset-centric, data-aware. Tasks are assets with typed ins/outs; the graph is over data, not tasks. | Best-in-class data lineage and materialization tracking. First-class support for caching by input hash. Excellent for post-training pipelines where the dataset is the artifact. | Steeper learning curve than Airflow. More opinionated — you model your data, not just your code. |
| Prefect | Pythonic, dynamic flows. Decorate functions with @task and @flow; the graph is inferred from Python call structure. |
Low ceremony for teams that want Airflow's capabilities without the operator/scheduler boilerplate. Good for dynamic fan-out (one task spawns N downstream tasks at runtime). | Dynamic graphs can be hard to visualize statically. Relatively young compared to Airflow. |
| Flyte | Kubernetes-native, strongly typed. Tasks are containers; the type system enforces data contracts between them. | ML-oriented: tasks that run distributed Spark/Ray jobs, GPU training steps, or model evaluations. Strong reproducibility guarantees — inputs and outputs are content-addressed. | Requires a Kubernetes cluster. More infrastructure to stand up than Airflow or Prefect. |
The industry trend is visible in this table: Airflow is task-centric (what code runs), while Dagster is asset/data-aware (what data is produced). For post-training pipelines — where the dataset is the product — the asset model is a better fit. You want the orchestrator to know that the silver deduped Parquet is stale because its upstream bronze JSONL changed, not just that a task named "dedup" failed.
Core concepts
Idempotency (re-run safety)
As lesson 02 established, every task must be idempotent: running it twice on the same inputs produces the same output, with no side-effects from the second run. In practice this means: write to a temp location and atomic-rename into place; use INSERT OVERWRITE not INSERT INTO; hash the output before writing so you can detect "nothing changed." Without idempotency, a retry doubles your data or corrupts it.
Retries with exponential backoff
Transient failures — a GCS 503, a Spark executor that got preempted — are normal at scale. Configure every task with a retry count (typically 2–3) and exponential backoff (wait 1 min, then 4 min, then 16 min). If the task still fails after all retries, fail the DAG run — do not silently skip it and let a half-built dataset reach training.
Backfills — fixing the past
You discover a bug in the quality filter that let through malformed examples for the last two weeks. You fix the bug and need to reprocess those two weeks of data without touching any other runs. This is a backfill: you re-run a DAG for a specific historical range of logical dates. All three properties — idempotency, determinism, and the immutable bronze layer — are what make this safe. You re-read the same raw data, re-apply the corrected logic, and overwrite only the silver/gold output for those dates.
Materialization & caching
If a task's inputs haven't changed — same source data, same code version — there is no reason to re-run it. Content-addressed caching computes a hash over all inputs (file hashes + code hash) and skips the task if the output for that hash already exists. Dagster does this natively for software-defined assets. Flyte content-addresses every input by default. Airflow requires external tooling. The payoff is large: on a re-run after a single-stage bug fix, every stage upstream of the fix is skipped instantly — only the changed stage and everything downstream of it re-executes.
Data versioning & lineage
A training run should pin an exact, reproducible snapshot of the dataset it consumed — not just a path like s3://bucket/silver/ that silently changes under you. Options:
- DVC — Git for data: each dataset version is a content-addressed pointer stored in Git, with the bytes stored in a remote.
git checkout v42 && dvc pullreconstructs the exact dataset. - LakeFS — Git branching semantics over an object store. You can branch, commit, and merge datasets the way you branch code. A training run tags a commit; rollback is instant.
- Delta Lake / Iceberg — Transactional table formats with built-in MVCC. Every write creates a new snapshot; you can
SELECT ... VERSION AS OF 42to query any historical state. - Dataset hashes — The simplest option: hash the manifest (list of files + their checksums) at pipeline completion and record it in your experiment tracker alongside the model checkpoint.
This connects back to lesson 01's principle: the dataset is the product. A model is only reproducible if you can exactly reconstruct the data it trained on. A checkpoint without a pinned dataset version is an artifact you can never fully understand or compare.
s3://silver/ path you both used is not the same data anymore — it has been incrementally updated, backfilled, and patched. You have two model checkpoints and no way to isolate the cause. Dataset versioning is not a nice-to-have; it's the only mechanism that makes a regression diagnosable.
Failure handling in practice
When a task fails inside a DAG run, the orchestrator's behavior should be:
- Retry with backoff. Most failures are transient. Give the task 2–3 attempts before escalating.
- Fail the DAG run if retries are exhausted. Do not skip the failed task and run downstream tasks on incomplete data — this is how half-built datasets reach training silently.
- Alert. Page the on-call engineer. A failed pipeline that nobody knows about is worse than a pipeline that was never built.
- On re-run, skip cached upstream tasks. Once you've fixed the bug, re-triggering the DAG should re-run only the failed task and its downstream dependents — not the entire pipeline. Content-addressed caching makes this automatic.
DAG run: ingest ──▶ clean ──▶ dedup ──▶ tokenize ──▶ pack ──▶ validate ──▶ publish
│ FAIL (retry 1, 2, 3 → all fail)
▼
DAG fails here. Downstream tasks do NOT run.
Upstream tasks (ingest, clean) are cached.
After bug fix + re-run:
ingest ──▶ clean ──(cached)──▶ dedup ──▶ tokenize ──▶ pack ──▶ validate ──▶ publish
(skip) (skip) (re-run) (re-run) (re-run) (re-run) (re-run)
Interactive · DAG run simulator
Watch a 7-task pipeline run, inject a failure at any stage, and see how retries, caching, and partial re-runs work in practice. The kpi-grid tracks how much wall-clock the cache saves.
Takeaway
s3://silver/" is not. Lesson 10 asks: what does orchestration look like when the pipeline runs not nightly but every training step? In RL, the rollout buffer, the verifier, and the trainer are all nodes in a DAG that cycles continuously — and the batch-ETL assumptions above all have to be revisited.