data_engineering / 09 · orchestration lesson 9 / 11

Orchestration & versioning

You've built every stage of the pipeline — ingest, store, transform, dedup, tokenize, quality-gate. Now you have to wire them into a schedulable, retryable, versioned DAG that can fail gracefully, recover from bugs, and prove that a training run can be reproduced months later.

Where we are
Lessons 03–08 built each stage in isolation. This lesson is the glue: the orchestrator that sequences the stages, handles failures, caches unchanged work, and versions the output so a dataset is a first-class artifact — not just a directory of files that happened to be there when training started.

Why an orchestrator — you can't just run stages by hand

Once the pipeline has more than two stages, the "just run it by hand" model breaks. Tasks have dependencies (tokenize can't start until dedup finishes), schedules (nightly, or triggered by new data), and failure modes (a quality-gate job crashes at 3 AM). You need something that encodes the dependency graph, retries failed tasks, skips tasks whose inputs haven't changed, and sends an alert — not a shell script run in a tmux session.

An orchestrator is a system that executes a directed acyclic graph (DAG) of tasks, where each node is a unit of work and each edge is a dependency. It provides:

Four orchestrators and their flavor

ToolOrientationSweet spotWatch out
Airflow Task-centric, schedule-first. DAG = Python file; tasks are Operators. Mature ecosystem; every data source has an operator. Standard choice for batch ETL with fixed schedules. No native data lineage; little awareness of what a task produces. Wiring tasks to data is manual. The scheduler can become a bottleneck on very large DAGs.
Dagster Asset-centric, data-aware. Tasks are assets with typed ins/outs; the graph is over data, not tasks. Best-in-class data lineage and materialization tracking. First-class support for caching by input hash. Excellent for post-training pipelines where the dataset is the artifact. Steeper learning curve than Airflow. More opinionated — you model your data, not just your code.
Prefect Pythonic, dynamic flows. Decorate functions with @task and @flow; the graph is inferred from Python call structure. Low ceremony for teams that want Airflow's capabilities without the operator/scheduler boilerplate. Good for dynamic fan-out (one task spawns N downstream tasks at runtime). Dynamic graphs can be hard to visualize statically. Relatively young compared to Airflow.
Flyte Kubernetes-native, strongly typed. Tasks are containers; the type system enforces data contracts between them. ML-oriented: tasks that run distributed Spark/Ray jobs, GPU training steps, or model evaluations. Strong reproducibility guarantees — inputs and outputs are content-addressed. Requires a Kubernetes cluster. More infrastructure to stand up than Airflow or Prefect.

The industry trend is visible in this table: Airflow is task-centric (what code runs), while Dagster is asset/data-aware (what data is produced). For post-training pipelines — where the dataset is the product — the asset model is a better fit. You want the orchestrator to know that the silver deduped Parquet is stale because its upstream bronze JSONL changed, not just that a task named "dedup" failed.

Core concepts

Idempotency (re-run safety)

As lesson 02 established, every task must be idempotent: running it twice on the same inputs produces the same output, with no side-effects from the second run. In practice this means: write to a temp location and atomic-rename into place; use INSERT OVERWRITE not INSERT INTO; hash the output before writing so you can detect "nothing changed." Without idempotency, a retry doubles your data or corrupts it.

Retries with exponential backoff

Transient failures — a GCS 503, a Spark executor that got preempted — are normal at scale. Configure every task with a retry count (typically 2–3) and exponential backoff (wait 1 min, then 4 min, then 16 min). If the task still fails after all retries, fail the DAG run — do not silently skip it and let a half-built dataset reach training.

Backfills — fixing the past

You discover a bug in the quality filter that let through malformed examples for the last two weeks. You fix the bug and need to reprocess those two weeks of data without touching any other runs. This is a backfill: you re-run a DAG for a specific historical range of logical dates. All three properties — idempotency, determinism, and the immutable bronze layer — are what make this safe. You re-read the same raw data, re-apply the corrected logic, and overwrite only the silver/gold output for those dates.

Materialization & caching

If a task's inputs haven't changed — same source data, same code version — there is no reason to re-run it. Content-addressed caching computes a hash over all inputs (file hashes + code hash) and skips the task if the output for that hash already exists. Dagster does this natively for software-defined assets. Flyte content-addresses every input by default. Airflow requires external tooling. The payoff is large: on a re-run after a single-stage bug fix, every stage upstream of the fix is skipped instantly — only the changed stage and everything downstream of it re-executes.

Data versioning & lineage

A training run should pin an exact, reproducible snapshot of the dataset it consumed — not just a path like s3://bucket/silver/ that silently changes under you. Options:

This connects back to lesson 01's principle: the dataset is the product. A model is only reproducible if you can exactly reconstruct the data it trained on. A checkpoint without a pinned dataset version is an artifact you can never fully understand or compare.

What happens without dataset versioning
A model you shipped last month regresses on a benchmark this month. You want to know: was it the data change or the code change? Without a versioned, pinned dataset, you cannot answer that question. The s3://silver/ path you both used is not the same data anymore — it has been incrementally updated, backfilled, and patched. You have two model checkpoints and no way to isolate the cause. Dataset versioning is not a nice-to-have; it's the only mechanism that makes a regression diagnosable.

Failure handling in practice

When a task fails inside a DAG run, the orchestrator's behavior should be:

  1. Retry with backoff. Most failures are transient. Give the task 2–3 attempts before escalating.
  2. Fail the DAG run if retries are exhausted. Do not skip the failed task and run downstream tasks on incomplete data — this is how half-built datasets reach training silently.
  3. Alert. Page the on-call engineer. A failed pipeline that nobody knows about is worse than a pipeline that was never built.
  4. On re-run, skip cached upstream tasks. Once you've fixed the bug, re-triggering the DAG should re-run only the failed task and its downstream dependents — not the entire pipeline. Content-addressed caching makes this automatic.
DAG run:  ingest ──▶ clean ──▶ dedup ──▶ tokenize ──▶ pack ──▶ validate ──▶ publish
                               │ FAIL (retry 1, 2, 3 → all fail)
                               ▼
                         DAG fails here. Downstream tasks do NOT run.
                         Upstream tasks (ingest, clean) are cached.

After bug fix + re-run:
          ingest ──▶ clean ──(cached)──▶ dedup ──▶ tokenize ──▶ pack ──▶ validate ──▶ publish
          (skip)    (skip)              (re-run) (re-run)   (re-run) (re-run)     (re-run)

Interactive · DAG run simulator

Watch a 7-task pipeline run, inject a failure at any stage, and see how retries, caching, and partial re-runs work in practice. The kpi-grid tracks how much wall-clock the cache saves.

DAG run simulator
Click Run to execute all tasks in order. Use the controls to inject a failure at a chosen task or to simulate a re-run where upstream tasks are already cached.
Tasks run
Tasks cached
Tasks failed
Wall-clock saved

Takeaway

What to carry to lesson 10
An orchestrator turns a collection of pipeline stages into a schedulable, retryable, auditable DAG. Idempotency (lesson 02) is the foundation that makes retries and backfills safe. Content-addressed caching makes re-runs cheap — only changed stages re-execute. Data versioning closes the loop: a training run pinned to an exact dataset version is reproducible; one that just uses "whatever is in s3://silver/" is not. Lesson 10 asks: what does orchestration look like when the pipeline runs not nightly but every training step? In RL, the rollout buffer, the verifier, and the trainer are all nodes in a DAG that cycles continuously — and the batch-ETL assumptions above all have to be revisited.