all_lessons / data_intensive_systems / 14 · batch processing lesson 14 / 16 · ~12 min

Batch Processing, Joins, Materialization, and Recompute

Batch systems turn bounded input into derived output. Their strength is deterministic recomputation; their cost is scan, shuffle, materialization, and time.

First principle

Batch processing is the controlled act of deriving one dataset from another when the input is finite enough to scan and the output is valuable enough to materialize.

input files -> map/filter -> shuffle by key -> join/group -> output files narrow ops wide op materialized result The shuffle is where parallel local work becomes distributed coordination over data.

Unix philosophy at cluster scale

Batch processing inherits a simple idea: read immutable input, transform it, write output. Unix tools compose through files and pipes. MapReduce and modern dataflow engines scale the idea across machines: partition input, run local maps, shuffle related records together, reduce or join, and write durable output.

The key property is reproducibility. If the input snapshot and code version are fixed, the output can be regenerated. That is why batch is so important for analytics, training data, backfills, embeddings, search indexes, and offline metrics. A broken derived view can be deleted and rebuilt from the source of truth.

The shuffle is the tax

Narrow operations such as filtering or mapping can run independently on each partition. Wide operations such as group-by and join require records with the same key to meet on the same machine. That all-to-all movement is the shuffle.

Shuffle cost is network bytes, disk spill, serialization, sorting, skew, and stragglers. A single hot key can make one reducer the bottleneck while the rest of the cluster waits. The practical optimizations follow directly: filter early, project only needed columns, pre-aggregate locally, choose join strategy carefully, and handle skew explicitly.

When someone says a Spark job is slow, ask: scan, shuffle, skew, or output? The answer picks the fix.

Materialization and derived data

Batch jobs materialize derived state: a table, an index, features, embeddings, metrics, aggregates. Materialization makes future reads cheap by paying compute now. The trade-off is freshness and storage: the output is a snapshot, stale until recomputed or incrementally updated.

ML pipelines lean heavily on this. Training examples are materialized so experiments are reproducible. Embedding indexes are rebuilt so retrieval is fast. Offline features are snapshotted so training can match event time. Batch is the backbone of correctness because it can rerun from history.

Trade-offs

ChoiceBuysCosts
Full recomputeSimple, deterministic, fixes accumulated driftExpensive and slow for large datasets
Incremental batchCheaper updates and fresher outputsMore state, more edge cases, harder correctness
Materialized viewFast reads over expensive transformationStale until updated; storage and lineage required
Shuffle joinGeneral and scalableNetwork/disk heavy and sensitive to skew
What you can now decide

You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

If derived data cannot be recomputed or traced to inputs, it stops being a cache and becomes a second unexplained source of truth.

Design prompts

  1. For a nightly embedding rebuild, what inputs and versions must be recorded for reproducibility?
  2. How do you diagnose whether a batch job is scan-bound or shuffle-bound?
  3. When is full recompute better than incremental maintenance?