The data plane that keeps the GPUs fed
Lesson 07 built a system that burns ~$8M of GPU-time to pretrain a 70B model. If the dataloader stalls 10% of steps, that is $800k of idle H100s — the data plane is not plumbing, it is a first-class performance system. The whole design reduces to one inequality: data delivery throughput must exceed GPU token-consumption throughput, with margin. Miss it and MFU collapses, and every parallelism decision from lesson 07 is wasted.
1 · The throughput target — what the cluster eats per second
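From lesson 07, one synchronized training step processes the global batch across all data-parallel ranks and takes step_time seconds. So the cluster consumes tokens at a constant, relentless rate: tokens/s = global_batch / step_time. At the 2M tokens/s this lesson works with and 2 bytes per token, that comes to about 4 MB/s.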
4 MB/s cluster-wide is nothing — until you remember it must be random (shuffled), disjoint across ranks (no duplicate data), and resumable (exact order after a crash). The bytes are easy; the access pattern is the engineering.
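The arithmetic behind the 4 MB/s figure can be checked in a few lines. This is a minimal sketch: the function name `consumption_rate` is hypothetical, and the 4M-token global batch is an assumed value chosen only because it reproduces the 2M tokens/s and 2 s step time quoted in this lesson.

```python
def consumption_rate(global_batch_tokens: int, step_time_s: float,
                     bytes_per_token: int = 2):
    """Return (tokens/s, bytes/s) consumed by the whole cluster."""
    tokens_per_s = global_batch_tokens / step_time_s
    return tokens_per_s, tokens_per_s * bytes_per_token


# Assumed example values: 4M-token global batch, 2 s per step, 2 B/token.
tokens_per_s, bytes_per_s = consumption_rate(4_000_000, 2.0)
print(f"{tokens_per_s:,.0f} tokens/s, {bytes_per_s / 1e6:.1f} MB/s")
# prints "2,000,000 tokens/s, 4.0 MB/s"
```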
2 · Where the data lives — stream, don't download
A pretraining corpus is petabytes of tokens. It sits in object storage (S3/GCS), not on local NVMe — no node has a petabyte of disk, and copying the corpus to every node is absurd. So each data-parallel rank streams shards from object store on demand, reading only the shards assigned to it (data_engineering 04, 05).
The trick that makes streaming free: double-buffering. While the GPU computes step i (2 s of compute), CPU worker threads prefetch and decode the batches for steps i+1, i+2, … into a host-RAM queue. If the I/O for one batch finishes in well under 2 s, the data wait is fully hidden behind compute and the GPU never sees a stall.
GPU: [ step i compute · 2s ][ step i+1 compute · 2s ][ step i+2 ... ]
CPU: [ prefetch i+1 ][ prefetch i+2 ][ prefetch i+3 ] ← must finish before GPU needs it
└── overlapped: data wait = 0 ──┘
BAD: [ step i ]......WAIT......[ step i+1 ] ← prefetch too slow → GPU idle = $ burned
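The design rule: prefetch depth × per-batch fetch time < pipeline budget, and per-batch fetch time must be < step time so the queue never drains. Prefetch 2–4 batches ahead to absorb jitter in object-store latency: a full queue of d batches keeps the GPU busy through roughly d × step time of stalled I/O, so at 2 s per step a depth of 3 rides out about a 6 s hiccup.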
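The overlap of fetch and compute can be sketched with a bounded queue and one background thread. This is an illustration under stated assumptions, not a production loader: `prefetching_loader`, `fetch_batch`, and `run` are hypothetical names, `time.sleep` stands in for both the object-store read and the GPU step, and a real loader would use several workers and surface fetch errors instead of ending the stream.

```python
import queue
import threading
import time


def prefetching_loader(fetch_batch, num_batches, depth=3):
    """Yield batches in order while a background thread fetches ahead.

    fetch_batch(i) is a hypothetical callable that reads and decodes batch i.
    The bounded queue caps host RAM at `depth` ready batches.
    """
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        try:
            for i in range(num_batches):
                q.put(fetch_batch(i))  # blocks while the queue is full
        finally:
            q.put(sentinel)            # always signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()                 # blocks only if prefetch fell behind
        if item is sentinel:
            return
        yield item


def run(fetch_s, step_s, num_batches=5):
    """Return (batches seen, total seconds the consumer waited on data)."""
    def fetch_batch(i):
        time.sleep(fetch_s)            # stand-in for object-store read + decode
        return i

    loader = prefetching_loader(fetch_batch, num_batches)
    seen, waited = [], 0.0
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = next(loader)
        waited += time.perf_counter() - t0
        seen.append(batch)
        time.sleep(step_s)             # stand-in for GPU compute
    return seen, waited


if __name__ == "__main__":
    print("fetch < step:", run(fetch_s=0.02, step_s=0.06)[1])
    print("fetch > step:", run(fetch_s=0.06, step_s=0.02)[1])
```

When fetch time is below step time, the only wait is the cold start for the first batch. When fetch time exceeds step time, the wait grows with every step, which is the BAD row of the timeline.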
3 · Sharding & shuffle — global shuffle is impossible at PB scale
SGD wants i.i.d. samples, which ideally means a uniform random permutation of the whole corpus each epoch. You cannot globally shuffle a petabyte — it would require random reads scattered across all of object storage, destroying the sequential-read throughput that makes streaming cheap. The standard approximation is a shuffle buffer:
- Pre-shuffle shard order once (cheap — just permute the shard list).
- Fill a buffer of K samples (e.g. K = 10,000) by reading shards sequentially.
- Each step, draw randomly from the buffer; refill the drawn slot from the stream.
This gives near-random sampling within a window of K samples while keeping reads sequential. Bigger K = better mixing, more host RAM.
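The three steps above can be sketched as a pair of small generators. This is a minimal single-rank illustration: `shuffle_buffer` and `shuffled_stream` are hypothetical names, and shards are modeled as in-memory lists where a real loader would read them sequentially from object storage.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], k: int, seed: int) -> Iterator[T]:
    """Approximate a shuffle using a buffer of at most k samples."""
    rng = random.Random(seed)
    buf: List[T] = []
    for sample in stream:
        if len(buf) < k:
            buf.append(sample)        # fill phase: no output yet
            continue
        j = rng.randrange(k)
        out, buf[j] = buf[j], sample  # emit a random slot, refill it from the stream
        yield out
    rng.shuffle(buf)                  # stream exhausted: drain the remainder
    yield from buf


def shuffled_stream(shards: Sequence[Sequence[T]], k: int, seed: int) -> Iterator[T]:
    """Permute the shard list once, then read shards sequentially through the buffer."""
    order = list(range(len(shards)))
    random.Random(seed).shuffle(order)

    def sequential() -> Iterator[T]:
        for idx in order:
            yield from shards[idx]    # sequential read within each shard

    return shuffle_buffer(sequential(), k, seed + 1)


if __name__ == "__main__":
    shards = [list(range(i * 5, i * 5 + 5)) for i in range(4)]
    print(list(shuffled_stream(shards, k=6, seed=0)))
```

Every sample comes out exactly once, and the same seed always gives the same order. With k = 1 the buffer degenerates to the unshuffled stream.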
4 · Tokenization & packing — don't waste FLOPs on padding
Offline vs online tokenization. Tokenize once, offline, and store token IDs (data_engineering 07): the data path then streams 2 bytes/token (uint16 IDs, which cover vocabularies of up to 65,536 entries) with zero per-step tokenization cost. Tokenizing online, that is, running the BPE tokenizer on raw text every step, moves that CPU cost onto the training host, where it competes with the dataloader's decode threads. At 2M tokens/s a slow tokenizer can become the bottleneck that starves the GPUs, the exact failure §6 hunts for. Offline is the default; go online only when the corpus changes faster than you can re-tokenize it.
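Packing. Documents have varying lengths; the model trains on fixed-length sequences (say 8,192 tokens). The naive approach pads each short document up to the max length, and padding tokens cost the same FLOPs as real tokens: the 6N rule from lesson 02 doesn't know a token is padding. Packing instead concatenates documents, separated by an end-of-document token, into full-length sequences so that nearly every position carries a real token.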
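A common way to avoid that waste is to concatenate tokenized documents and cut the result into fixed-length sequences. The sketch below is one simple variant, not necessarily the scheme this course uses: `EOS_ID`, `pack`, and `padding_fraction` are hypothetical names, the trailing partial sequence is dropped, and attention masking across document boundaries is not addressed.

```python
from typing import Iterable, Iterator, List, Sequence

EOS_ID = 0  # hypothetical end-of-document token id


def pack(docs: Iterable[Sequence[int]], seq_len: int) -> Iterator[List[int]]:
    """Concatenate documents (each followed by EOS) and cut into seq_len chunks."""
    buf: List[int] = []
    for doc in docs:
        buf.extend(doc)
        buf.append(EOS_ID)
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
    # The trailing partial sequence is dropped here; a real pipeline
    # might pad it or carry it into the next shard.


def padding_fraction(doc_lengths: Sequence[int], seq_len: int) -> float:
    """Fraction of positions that are padding when each doc gets its own sequence."""
    real = sum(min(n, seq_len) for n in doc_lengths)
    return 1 - real / (len(doc_lengths) * seq_len)


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(list(pack(docs, seq_len=4)))
    # prints [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
    print(padding_fraction([len(d) for d in docs], seq_len=4))
    # prints 0.25
```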
5 · Deterministic resumption — the dataloader is part of the checkpoint
Lesson 07 established that at 1,000-GPU scale a node dies every few hours, so the run checkpoints and restarts constantly. A run that dies at step 40,000 must resume the exact data order it would have seen with no crash. Otherwise you re-show data (overfitting a slice) or skip data (a hole in the epoch) — both silently corrupt training and are nearly impossible to debug after the fact.
So the dataloader state is checkpoint state, saved alongside the 16N model/optimizer bytes (data_engineering 09):
- shard cursor per rank — which shards consumed, position within the current shard,
- shuffle seed + the buffer's RNG state — so the same samples come out in the same order,
- epoch / global step — so the shard permutation for the next epoch is reproducible.
Get this right and a restart is invisible to the loss curve. Get it wrong and you discover the corruption only when an eval number looks off, weeks later.
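What "dataloader state is checkpoint state" means in practice can be shown with a toy single-rank loader. This is a sketch under simplifying assumptions: the class `ResumableLoader` and its `state_dict` / `load_state_dict` methods are hypothetical names, shards are in-memory lists, samples are assumed never to be `None`, and only a single epoch is modeled.

```python
import random


class ResumableLoader:
    """Toy shuffle-buffer loader whose full state can be saved and restored."""

    def __init__(self, shards, k, seed):
        self.shards = shards
        self.k = k
        self.rng = random.Random(seed)
        self.order = list(range(len(shards)))
        self.rng.shuffle(self.order)   # shard permutation for this epoch
        self.shard_pos = 0             # index into self.order
        self.offset = 0                # position within the current shard
        self.buf = []

    def _read_next(self):
        """Sequentially read the next sample, or return None at end of data."""
        while self.shard_pos < len(self.order):
            shard = self.shards[self.order[self.shard_pos]]
            if self.offset < len(shard):
                self.offset += 1
                return shard[self.offset - 1]
            self.shard_pos += 1
            self.offset = 0
        return None

    def __iter__(self):
        return self

    def __next__(self):
        while len(self.buf) < self.k:  # top up the shuffle buffer
            sample = self._read_next()
            if sample is None:
                break
            self.buf.append(sample)
        if not self.buf:
            raise StopIteration
        j = self.rng.randrange(len(self.buf))
        self.buf[j], self.buf[-1] = self.buf[-1], self.buf[j]
        return self.buf.pop()

    def state_dict(self):
        """Everything needed to continue the exact sample order."""
        return {
            "order": list(self.order),
            "shard_pos": self.shard_pos,
            "offset": self.offset,
            "buf": list(self.buf),
            "rng": self.rng.getstate(),
        }

    def load_state_dict(self, state):
        self.order = list(state["order"])
        self.shard_pos = state["shard_pos"]
        self.offset = state["offset"]
        self.buf = list(state["buf"])
        self.rng.setstate(state["rng"])


if __name__ == "__main__":
    shards = [list(range(i * 10, i * 10 + 10)) for i in range(3)]
    a = ResumableLoader(shards, k=8, seed=123)
    first = [next(a) for _ in range(7)]
    ckpt = a.state_dict()              # "crash" happens here
    b = ResumableLoader(shards, k=8, seed=999)
    b.load_state_dict(ckpt)
    print(list(a) == list(b))          # prints True
```

Restoring the seed alone would not be enough: the buffer contents and the RNG state after seven draws both shape what comes next, which is why all three items in the list above are saved.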
6 · Proving it's not the bottleneck — the diagnostic
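Lesson 00's design loop ends by finding the bottleneck step. Applied to the data plane: measure data-wait time per step, the GPU idle time between the end of one step's compute and the start of the next. If MFU is low, the question is whether the cause is comms (lesson 07) or data. The clean experiment is a synthetic-data A/B: rerun the same job with the dataloader replaced by batches generated in memory, so data delivery costs nothing. If step time and MFU recover, the data plane was the bottleneck; if they do not, the data plane is cleared and the cause lies elsewhere, such as comms.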
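Data wait can be measured by timing how long each step blocks on the loader, then repeating the run with a loader that serves one in-memory batch forever. The sketch below assumes a hypothetical `train_step(batch)` callable that returns only once the step's compute has finished; on a GPU this means synchronizing before reading the clock (for example with `torch.cuda.synchronize()`), otherwise asynchronous kernel launches hide the compute time.

```python
import time


def measure_data_wait(loader, train_step, num_steps):
    """Return (mean data-wait seconds, mean compute seconds) per step."""
    it = iter(loader)
    wait = compute = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(it)       # time blocked on the data plane
        t1 = time.perf_counter()
        train_step(batch)      # must block until the step's compute is done
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait / num_steps, compute / num_steps


def synthetic_loader(batch):
    """Serve the same in-memory batch forever: a data plane with no I/O."""
    while True:
        yield batch


if __name__ == "__main__":
    def slow_loader():
        while True:
            time.sleep(0.02)   # stand-in for a dataloader that cannot keep up
            yield [0] * 8

    def train_step(batch):
        time.sleep(0.01)       # stand-in for GPU compute

    print("real     :", measure_data_wait(slow_loader(), train_step, 5))
    print("synthetic:", measure_data_wait(synthetic_loader([0] * 8), train_step, 5))
```

If the mean wait is near zero with the real loader, the data plane is already keeping up and the A/B will show no difference.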
7 · Co-design — format trades network bytes against host CPU
The storage format is a knob on the producer–consumer race, and which way to turn it depends on which resource is scarce on your data path:
| Format | Compression (network) | Decode cost (host CPU) | Pick when… |
|---|---|---|---|
| Parquet (columnar, compressed) | high — fewest bytes off object store | higher — decompress + decode | network-bound: limited egress / many ranks |
| Arrow (columnar, in-memory) | medium | low — zero-copy reads | CPU-bound: cheap bandwidth, scarce cores |
| WebDataset (tar shards) | tunable | low — sequential, streaming-friendly | pure sequential streaming at scale |
More compression means fewer bytes/s off the network but more host CPU to decode — and host CPU is also what online tokenization and shuffle-buffer management consume. If you are network-bound, compress harder. If you are CPU-bound (e.g. already tokenizing online), reach for a cheaper-to-decode format and pre-tokenize. The §6 synthetic-data test plus a CPU-utilization read tells you which regime you're in.
Interactive · will the data plane keep up?
Set the consumer (global batch, step time, dp ranks) and the producer (per-rank read bandwidth, whether you tokenize online). The widget computes the required bytes/s per rank against what the rank can actually read, and shows the MFU ceiling if the GPUs end up data-starved.
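The same producer versus consumer comparison can be approximated statically. This is a back-of-the-envelope sketch, not the interactive widget's actual model: `data_plane_check` and all of its parameters are hypothetical, the global batch is assumed to split evenly across ranks, online tokenization is modeled only as a per-rank tokens/s cap, and a data-starved step is assumed to slow down in proportion to the shortfall.

```python
def data_plane_check(global_batch_tokens, step_time_s, dp_ranks,
                     read_bw_bytes_per_s, bytes_per_token=2.0,
                     tokenizer_tokens_per_s=None):
    """Compare what one rank must consume with what it can read.

    Returns the required bytes/s per rank and the fraction of the time the
    GPUs can be kept fed (1.0 means the data plane keeps up).
    """
    required_tokens = global_batch_tokens / step_time_s / dp_ranks
    required_bytes = required_tokens * bytes_per_token
    supply_tokens = read_bw_bytes_per_s / bytes_per_token
    if tokenizer_tokens_per_s is not None:
        # Online tokenization caps supply at the tokenizer's throughput.
        supply_tokens = min(supply_tokens, tokenizer_tokens_per_s)
    fed_fraction = min(1.0, supply_tokens / required_tokens)
    return {"required_bytes_per_s": required_bytes, "fed_fraction": fed_fraction}


if __name__ == "__main__":
    # Assumed example: 4M-token batch, 2 s step, 64 ranks, 50 MB/s per rank.
    print(data_plane_check(4_000_000, 2.0, 64, 50e6))
    # Same setup with a hypothetical online tokenizer at 20k tokens/s per rank.
    print(data_plane_check(4_000_000, 2.0, 64, 50e6, tokenizer_tokens_per_s=20_000))
```

Under these assumptions the MFU ceiling is `fed_fraction` times whatever MFU the job reaches when fully fed. When tokenizing online, `bytes_per_token` should reflect raw text bytes per token rather than 2.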
What carries forward
- The data plane is a performance system, not plumbing. A 10% dataloader stall on an $8M run is $800k of idle H100s — design it to win the producer–consumer race with margin.
- The target: tokens/s = global_batch / step_time must arrive shuffled and disjoint, every second, forever. In bytes it is tiny (~4 MB/s at 2 B/token) — the access pattern, not the bandwidth, is the work.
- Stream from object store, double-buffered. Prefetch N batches so I/O hides behind the step; per-batch fetch must beat step time or the queue drains.
- Shuffle-buffer approximates a global shuffle; shards stay disjoint and deterministic. Skew on one rank stalls all ranks at the all-reduce.
- Tokenize offline; pack documents. 30% padding = 30% wasted FLOPs (~$2.4M); packing recovers it. Online tokenization can become the host-CPU bottleneck.
- Dataloader state (cursor, seed, buffer) is checkpoint state — exact resumption or a corrupted epoch.
- Diagnose with the synthetic-data A/B; co-design the format to whichever you're bound by — network (compress) or CPU (decode cheap).