Batch Processing, Joins, Materialization, and Recompute
Batch systems turn bounded input into derived output. Their strength is deterministic recomputation; their cost is scan, shuffle, materialization, and time.
Batch processing is the controlled act of deriving one dataset from another when the input is finite enough to scan and the output is valuable enough to materialize.
Unix philosophy at cluster scale
Batch processing inherits a simple idea: read immutable input, transform it, write output. Unix tools compose through files and pipes. MapReduce and modern dataflow engines scale the idea across machines: partition input, run local maps, shuffle related records together, reduce or join, and write durable output.
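A minimal single-process sketch of the partition, map, shuffle, reduce sequence described above, using word count as the workload. The names `map_phase`, `shuffle`, `run_job`, and `tokenize`, and the CRC32 partitioner, are illustrative assumptions rather than any engine's API. A real engine runs each phase on separate machines and writes the result to durable storage.

```python
import zlib
from collections import defaultdict


def map_phase(partition, mapper):
    """Apply the mapper to every record of one partition, independently of the others."""
    for record in partition:
        yield from mapper(record)


def shuffle(mapped_partitions, num_reducers):
    """Route every (key, value) pair to the reducer that owns the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_partitions:
        for key, value in pairs:
            # CRC32 is stable across runs, unlike Python's salted str hash.
            owner = zlib.crc32(key.encode("utf-8")) % num_reducers
            buckets[owner][key].append(value)
    return buckets


def run_job(partitions, mapper, reducer, num_reducers=2):
    """Map each partition, shuffle by key, then reduce each key's values."""
    mapped = [map_phase(p, mapper) for p in partitions]
    output = {}
    for bucket in shuffle(mapped, num_reducers):
        for key in sorted(bucket):
            output[key] = reducer(key, bucket[key])
    return output


def tokenize(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


partitions = [["a b a"], ["B c"]]  # two input partitions, one line each
counts = run_job(partitions, tokenize, lambda key, values: sum(values))
print(counts)
```

The only step that crosses partition boundaries is `shuffle`; everything before it could run in parallel with no coordination.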
The key property is reproducibility. If the input snapshot and code version are fixed, and the transformation itself is deterministic (no unseeded randomness, no calls to live services), the output can be regenerated. That is why batch is so important for analytics, training data, backfills, embeddings, search indexes, and offline metrics. A broken derived view can be deleted and rebuilt from the source of truth.
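One way to make the reproducibility claim checkable is to record a fingerprint of everything the output depends on. The sketch below is an assumption about how such a record might look: `run_fingerprint`, the input names, the digests, and the version strings are all hypothetical placeholders.

```python
import hashlib
import json


def run_fingerprint(input_digests, code_version, params):
    """Hash the inputs, code version, and parameters into one stable identifier."""
    manifest = {
        "inputs": input_digests,       # e.g. snapshot name -> content digest
        "code_version": code_version,  # e.g. a commit identifier
        "params": params,              # job configuration that affects output
    }
    # sort_keys gives a canonical encoding, so dict ordering does not matter.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values, for illustration only.
fp_a = run_fingerprint({"events/2024-01-01": "abc", "users/v7": "def"}, "git:1a2b3c", {"dim": 128})
fp_b = run_fingerprint({"users/v7": "def", "events/2024-01-01": "abc"}, "git:1a2b3c", {"dim": 128})
fp_c = run_fingerprint({"events/2024-01-01": "abc", "users/v7": "def"}, "git:9f8e7d", {"dim": 128})

print(fp_a == fp_b)  # same inputs in a different order
print(fp_a == fp_c)  # different code version
```

Storing the fingerprint next to the output lets a later run decide whether an existing result can be reused or must be rebuilt.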
The shuffle is the tax
Narrow operations such as filtering or mapping can run independently on each partition. Wide operations such as group-by and join require records with the same key to meet on the same machine. That all-to-all movement is the shuffle.
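To illustrate why a join is a wide operation, here is a small sketch of a shuffle hash join: both sides are partitioned by the same function of the key, so matching records land in the same partition and can be joined locally. The function names and the sample `users` and `orders` data are hypothetical, and the partitions are plain in-memory lists standing in for machines.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    """Stable partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_side(records, num_partitions):
    """Move every (key, value) record to the partition that owns its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def shuffle_hash_join(left, right, num_partitions=3):
    """Inner join on key: shuffle both sides, then hash-join each partition locally."""
    joined = []
    left_parts = shuffle_side(left, num_partitions)
    right_parts = shuffle_side(right, num_partitions)
    for left_part, right_part in zip(left_parts, right_parts):
        table = defaultdict(list)  # build side: right records grouped by key
        for key, value in right_part:
            table[key].append(value)
        for key, left_value in left_part:  # probe side
            for right_value in table.get(key, ()):
                joined.append((key, left_value, right_value))
    return sorted(joined)


users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", 10), ("u1", 20), ("u3", 5)]
result = shuffle_hash_join(users, orders)
print(result)
```

Every record on both sides is moved once before any matching happens, which is the cost the shuffle imposes.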
Shuffle cost shows up as network bytes, disk spill, serialization, and sorting. Skew makes it worse by creating stragglers: a single hot key can make one reducer the bottleneck while the rest of the cluster waits. The practical optimizations follow directly: filter early, project only needed columns, pre-aggregate locally, choose join strategy carefully, and handle skew explicitly.
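Two of these optimizations can be sketched in a few lines, assuming the aggregation is a sum (salting in this form only works for aggregations that can be merged from partial results). `pre_aggregate` and `salted_sum` are hypothetical names, and the deterministic salt is a simplification; real jobs often pick the salt at random.

```python
from collections import Counter


def pre_aggregate(partition):
    """Local combine: emit one partial sum per distinct key in this partition."""
    partial = Counter()
    for key, value in partition:
        partial[key] += value
    return list(partial.items())


def salted_sum(records, salt_buckets=4):
    """Two-stage sum: spread each key over salt_buckets groups, then merge."""
    stage1 = Counter()       # partial sum per (key, salt) group
    group_sizes = Counter()  # number of records each stage-1 group processed
    for i, (key, value) in enumerate(records):
        salt = i % salt_buckets  # deterministic salt, for a repeatable example
        stage1[(key, salt)] += value
        group_sizes[(key, salt)] += 1
    totals = Counter()
    for (key, _salt), partial in stage1.items():
        totals[key] += partial  # stage 2: drop the salt and merge partials
    return dict(totals), max(group_sizes.values())


records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
combined = pre_aggregate(records)
totals, largest_group = salted_sum(records, salt_buckets=4)
print(len(records), len(combined))  # records before and after local combine
print(totals, largest_group)
```

Without salting, the hot key forms a single group of 1000 records; with four salts, the largest stage-1 group in this example holds 250.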
When someone says a Spark job is slow, ask: scan, shuffle, skew, or output? The answer picks the fix.
Materialization and derived data
Batch jobs materialize derived state: a table, an index, features, embeddings, metrics, aggregates. Materialization makes future reads cheap by paying compute now. The trade-off is freshness and storage: the output is a snapshot, stale until recomputed or incrementally updated.
ML pipelines lean heavily on this. Training examples are materialized so experiments are reproducible. Embedding indexes are rebuilt so retrieval is fast. Offline features are snapshotted so training can match event time. Batch is the backbone of correctness because it can rerun from history.
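Matching event time usually means a point-in-time lookup: each training example may only see the feature value that existed when the labeled event happened. A minimal sketch, assuming each entity's feature history is a list of `(timestamp, value)` pairs sorted by timestamp; the names `as_of` and `build_training_rows` and the sample data are hypothetical.

```python
from bisect import bisect_right


def as_of(feature_history, event_time):
    """Return the latest feature value with timestamp <= event_time, else None."""
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_time)
    return feature_history[idx - 1][1] if idx else None


def build_training_rows(labels, features_by_entity):
    """Attach to each label the feature value that was known at event time."""
    rows = []
    for entity, event_time, label in labels:
        value = as_of(features_by_entity.get(entity, []), event_time)
        rows.append((entity, event_time, value, label))
    return rows


features_by_entity = {"u1": [(10, 0.1), (20, 0.5), (30, 0.9)]}
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 25, 0)]
rows = build_training_rows(labels, features_by_entity)
print(rows)
```

The event at time 25 gets the value written at time 20, not the later one at time 30, so no future information leaks into training.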
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| Full recompute | Simple, deterministic, fixes accumulated drift | Expensive and slow for large datasets |
| Incremental batch | Cheaper updates and fresher outputs | More state, more edge cases, harder correctness |
| Materialized view | Fast reads over expensive transformation | Stale until updated; storage and lineage required |
| Shuffle join | General and scalable | Network/disk heavy and sensitive to skew |
For each choice above, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
If derived data cannot be recomputed or traced to inputs, it stops being a cache and becomes a second unexplained source of truth.
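The trade-off between full recompute and incremental maintenance can be sketched with a per-key sum. The incremental path is cheaper, and the full recompute serves as an audit that detects drift. Function names and data are hypothetical, and the sketch assumes append-only events with no late corrections.

```python
from collections import Counter


def full_recompute(events):
    """Rebuild the aggregate from the complete event history."""
    totals = Counter()
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


def apply_increment(materialized, new_events):
    """Update an existing aggregate with only the newly arrived events."""
    updated = dict(materialized)  # copy so the previous snapshot is untouched
    for key, amount in new_events:
        updated[key] = updated.get(key, 0) + amount
    return updated


history = [("a", 5), ("b", 2)]
new_batch = [("a", 1), ("c", 4)]

view = apply_increment(full_recompute(history), new_batch)  # incremental path
audit = full_recompute(history + new_batch)                 # full path

# Keys whose incremental value disagrees with the recomputed value.
drift = {k for k in view.keys() | audit.keys() if view.get(k) != audit.get(k)}
print(view, drift)
```

An empty `drift` set means the incremental view still agrees with the source of truth. A non-empty one is the signal to rebuild from history.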
Design prompts
- For a nightly embedding rebuild, what inputs and versions must be recorded for reproducibility?
- How do you diagnose whether a batch job is scan-bound or shuffle-bound?
- When is full recompute better than incremental maintenance?