Scaling, cost & monitoring
The capstone: think in records/sec and $/M-tokens end to end, find and widen the bottleneck, and instrument the pipeline so a silent upstream failure never corrupts a training run.
Think in throughput, not just time
A pipeline has one useful unit: records per second (or GB/s) per stage, and a derived unit: $/M tokens processed end to end. Time matters because it drives cluster cost; the right way to cut cost is almost never "run the same job more slowly" — it is to make each stage faster so the cluster is smaller or runs shorter.
The governing principle is bottleneck (critical-path) reasoning: total wall-clock is dominated by the slowest stage, and no amount of parallelism in the fast stages helps. Add workers where the bottleneck is; everywhere else you are paying for idle cores.
Ttotal = max(Tingest, Ttransform, Tdedup, Ttokenize, Tscore) — the bottleneck stage gates everythingIn practice the stages are not perfectly parallel, but the intuition holds: profile first, widen the single longest bar, repeat.
Bottleneck taxonomy
Four kinds of bottlenecks appear in every post-training pipeline. Each has a characteristic symptom, a root cause, and a fix that maps to a prior lesson.
| Symptom | Bound | Root cause | Fix | Lesson |
|---|---|---|---|---|
| High scan cost, workers idle while reading | I/O-bound | Row format (JSONL), no partition pruning, over-wide columns selected | Columnar Parquet + predicate pushdown + partition pruning; read only the columns you need | 04 |
| Stage takes 10× longer than the map; network saturated | Shuffle-bound | All-to-all data movement in dedup/join; too many partitions or too large a shuffle payload | Filter and project before the shuffle; fewer, larger partitions; MinHash so only hashes move (not full records) | 05, 06 |
| CPU pinned at 100%, GPUs idle, queue grows | CPU-bound | Tokenizer or heuristic quality filters are single-threaded; insufficient workers for the tokenize/pack stage | More workers; parallelize tokenizer across cores; move cheap heuristics before expensive model calls | 07 |
| GPU utilization low despite large batch; queue of un-scored records grows | GPU-bound | Model-in-loop scoring (reward model, LLM filter, RL rollouts) not batched; CPU preprocessing starves the GPU | Batch GPU calls; overlap CPU and GPU stages with Ray Data streaming; right-size GPU instance type | 05, 08, 10 |
Cost levers
Once you know the bottleneck, five levers determine cost in roughly decreasing order of impact:
- Read less data — columnar + pushdown is not a minor optimization; it is often a 10–50× scan reduction (lesson 04). Every byte not read is a byte not paid for and not processed.
- Spot / preemptible instances — most pipeline stages are idempotent (lesson 02) and can checkpoint. Running on spot instances cuts compute cost by 60–80% at the price of occasional restarts, which a well-designed orchestrator (lesson 09) handles transparently.
- Right-size partitions — partitions that are too small create scheduling and file-open overhead; too large and you lose parallelism and create stragglers (lesson 05). Aim for 100–500 MB per partition.
- Cache and materialize intermediate results — if the quality-gate output is expensive to produce (reward-model scored silver layer), materialize it as a versioned Parquet snapshot (lesson 09) rather than recomputing it every training run.
- Autoscale, don't overprovision — a cluster sized for peak ingest sits idle 22 hours a day. Orchestrators with autoscaling spin up workers only when a DAG stage needs them and release them immediately after.
The end-to-end reference architecture
Every post-training pipeline is a variant of the same shape. The batch path runs nightly or on demand; the RL online path runs continuously inside the training loop.
BATCH PATH (SFT / DPO / offline RL)
─────────────────────────────────────────────────────────────────────────────
SOURCES BRONZE SILVER GOLD
annotate ──┐ ┌──────────┐ ETL ┌────────────────────┐ pack ┌────────┐
synth ──┼──▶│ raw │──────▶ │ clean │──────▶│ token- │──▶ TRAINER
logs ──┤ │ JSONL │ Spark │ dedup (MinHash/LSH)│ │ ized │ SFT/DPO
scrape ──┘ │ immut. │ Ray │ decon (n-gram) │ │ packed │
│ proven. │ Daft │ quality gate │ │ Parquet│
└──────────┘ │ (heuristic+model) │ └────────┘
lesson 03 └────────────────────┘ lesson 07
lessons 05, 06, 08
Orchestrated by Airflow / Dagster / Prefect (lesson 09)
Versioned, lineage-tracked, reproducible (lesson 02)
RL ONLINE OVERLAY (lesson 10) — runs inside the training loop every step
─────────────────────────────────────────────────────────────────────────────
PROMPT BANK ──▶ ROLLOUT WORKERS ──▶ VERIFY ──▶ REPLAY BUFFER ──▶ TRAINER
(policy model) (rule/ updates
model) weights
▲ │
└───────────────── weight sync ─────────────────┘
Batch ETL feeds the prompt bank; rollouts re-enter as data every step.
The two paths share the quality-gate and tokenize/pack stages — gold Parquet from the batch path is the same format the RL trainer consumes, which is why getting the batch path right (lessons 04–09) also sets up the RL path (lesson 10) correctly.
Monitoring and data observability
A pipeline that runs without monitoring is not a pipeline — it is a time bomb. Silent data quality degradation is the most common cause of mysterious training regressions. Four signals cover the vast majority of real failures:
- Volume — count records at each stage boundary. If dedup is passing 30% fewer records than last week and no rule changed, something upstream changed. Alert on drops and spikes above a percentage threshold (e.g. ±20%).
- Freshness / lateness — track the max ingestion timestamp of the newest record in each layer. A freshness lag growing beyond the SLA means an upstream source stopped or the ingest job is silently failing partial runs.
- Schema drift — a field changes type, is renamed, or disappears. In a typed system (Parquet with an enforced schema contract) this raises an exception; in a schema-on-read JSON world it propagates as silent nulls or wrong-typed values until a model produces garbage. Validate schema at every stage boundary, not just at ingest.
- Distribution / quality drift — the distribution of scores, token lengths, language mix, or pass-rate through quality gates shifts. This is harder to detect but more dangerous: the schema is fine, the counts are normal, but the data is different enough to send training in a bad direction.
response to completion in their export. The ingest job runs without error because it is schema-on-read. Bronze lands with the new field name. The transform stage reads response, finds nulls for all new records, drops them silently in the quality gate's null-length filter. Volume looks normal because old records still flow. The gold dataset is now half the size it should be — but the training job starts anyway, sees an underfit loss, and the team spends a week debugging the optimizer. The fix is schema contracts enforced at the bronze boundary and volume alerts on quality-gate pass-rate: a sudden drop from 85% to 40% would have flagged the issue within one pipeline run.
Interactive · end-to-end pipeline sizer
Set your dataset size, worker count, and pipeline configuration. The widget computes per-stage time, shows a stacked breakdown, highlights the bottleneck, and reports total wall-clock, effective throughput, and estimated cluster cost. Use it to find the bottleneck — then fix it with the lesson it points to.
Connecting monitoring to the pipeline sizer
The sizer above shows what should happen given your configuration. Monitoring tells you what is happening. When the two diverge — stage throughput drops below the modelled value, quality-gate pass-rate collapses, gold record count falls — you have an incident. The alert should name the stage (from your per-stage volume metrics), and that stage maps directly to the lesson and the fix.
For RL pipelines (lesson 10), monitoring takes on a new urgency: a freshness failure in the replay buffer means the policy is training on stale trajectories that no longer reflect the current weights, which can cause reward hacking or training instability that looks like a hyperparameter problem.