The data plane that keeps the GPUs fed

Lesson 07 built a system that burns ~$8M of GPU-time to pretrain a 70B model. If the dataloader stalls 10% of steps, that is $800k of idle H100s — the data plane is not plumbing, it is a first-class performance system. The whole design reduces to one inequality: data delivery throughput must exceed GPU token-consumption throughput, with margin. Miss it and MFU collapses, and every parallelism decision from lesson 07 is wasted.

The framing: a producer–consumer race the GPU must win by losing

Lesson 07 made the GPUs the expensive consumer: at 40% MFU on 1,024 H100s they eat tokens at a fixed rate, set by the step time. The data plane is the producer. The only acceptable design point is producer throughput > consumer throughput, so the consumer never waits. Anything less and the GPUs idle at the top of every step — paying full price to do nothing.

1 · The throughput target — what the cluster eats per second

From lesson 07, one synchronized training step processes the global batch across all data-parallel ranks, then takes step_time seconds. So the cluster consumes tokens at a constant, relentless rate:

tokens_consumed/s = global_batch_tokens / step_time

Worked: the rate you must feed, forever

Global batch = 4M tokens, step time = 2 s ⇒ 4e6 / 2 = 2M tokens/s must arrive — pre-tokenized and shuffled — every second, for the 180-day run. Converted to bytes: token IDs stored as int16/uint16 are 2 bytes/token, so 2e6 · 2 = 4 MB/s of token IDs cluster-wide. That sounds trivial — and it is, if you tokenized offline. If you stream raw text and tokenize online, you read ~4–5 bytes/token of UTF-8 plus pay CPU to tokenize: the same 2M tokens/s becomes ~8–10 MB/s of text reads plus a tokenization bill that competes with the dataloader for host CPU. Same token rate, very different data path.

4 MB/s cluster-wide is nothing — until you remember it must be random (shuffled), disjoint across ranks (no duplicate data), and resumable (exact order after a crash). The bytes are easy; the access pattern is the engineering.

2 · Where the data lives — stream, don't download

A pretraining corpus is petabytes of tokens. It sits in object storage (S3/GCS), not on local NVMe — no node has a petabyte of disk, and copying the corpus to every node is absurd. So each data-parallel rank streams shards from object store on demand, reading only the shards assigned to it (data_engineering 04, 05).

The trick that makes streaming free: double-buffering. While the GPU computes step i (2 s of compute), CPU worker threads prefetch and decode the batches for steps i+1, i+2, … into a host-RAM queue. If the I/O for one batch finishes in well under 2 s, the data wait is fully hidden behind compute and the GPU never sees a stall.

  GPU:   [ step i compute · 2s ][ step i+1 compute · 2s ][ step i+2 ... ]
  CPU:   [ prefetch i+1 ][ prefetch i+2 ][ prefetch i+3 ]   ← must finish before GPU needs it
         └── overlapped: data wait = 0 ──┘
  BAD:   [ step i ]......WAIT......[ step i+1 ]   ← prefetch too slow → GPU idle = $ burned

The design rule: prefetch depth × per-batch fetch time < pipeline budget, and per-batch fetch time must be < step time so the queue never drains. Prefetch 2–4 batches ahead to absorb jitter in object-store latency.

3 · Sharding & shuffle — global shuffle is impossible at PB scale

SGD wants i.i.d. samples, which ideally means a uniform random permutation of the whole corpus each epoch. You cannot globally shuffle a petabyte — it would require random reads scattered across all of object storage, destroying the sequential-read throughput that makes streaming cheap. The standard approximation is a shuffle buffer:

Pre-shuffle shard order once (cheap — just permute the shard list).
Fill a buffer of K samples (e.g. K = 10,000) by reading shards sequentially.
Each step, draw randomly from the buffer; refill the drawn slot from the stream.

This gives near-random sampling within a window of K samples while keeping reads sequential. Bigger K = better mixing, more host RAM.

Skew is a synchronized-step killer — a callback to lesson 07's collectives

Shard assignment must be disjoint across dp ranks (a token shown twice is a silent epoch bug) and deterministic (for resumption, §5). But the subtle trap is skew: lesson 07's step ends in an all-reduce, a barrier where every rank waits for the slowest. If one rank draws an oversized shard, or its object-store read hiccups, that one rank stalls and all 1,024 GPUs wait at the collective. A straggling dataloader on a single rank taxes the entire cluster. Balance shard sizes; cap per-batch fetch time; treat the slowest rank as your real throughput.

4 · Tokenization & packing — don't waste FLOPs on padding

Offline vs online tokenization. Tokenize once, offline, and store token IDs (data_engineering 07): the data path then streams 2 bytes/token with zero per-step CPU cost. Tokenizing online — running the BPE tokenizer on raw text every step — moves that CPU cost onto the training host, where it competes with the dataloader's decode threads. At 2M tokens/s a slow tokenizer can become the bottleneck that starves the GPUs, the exact failure §6 hunts for. Offline is the default; online only when the corpus changes faster than you can re-tokenize it.

Packing. Documents have varying lengths; the model trains on fixed-length sequences (say 8,192 tokens). The naive approach pads each short document up to the max length — and padding tokens cost the same FLOPs as real tokens. The 6N rule from lesson 02 doesn't know a token is padding.

30% padding = 30% of your $8M run spent on nothing

If the average sequence is 30% padding, then 30% of every 6N-FLOP step computes attention over fake tokens. On the lesson 02 / 07 run that is ~$2.4M of GPU-time producing zero gradient signal — worse than a 10% dataloader stall. Packing concatenates multiple documents into one fixed-length sequence (with attention masks or boundary resets so they don't attend across each other), driving padding toward 0% and recovering that compute. Packing is not a data-quality nicety; it is a first-order MFU lever.

5 · Deterministic resumption — the dataloader is part of the checkpoint

Lesson 07 established that at 1,000-GPU scale a node dies every few hours, so the run checkpoints and restarts constantly. A run that dies at step 40,000 must resume the exact data order it would have seen with no crash. Otherwise you re-show data (overfitting a slice) or skip data (a hole in the epoch) — both silently corrupt training and are nearly impossible to debug after the fact.

So the dataloader state is checkpoint state, saved alongside the 16N model/optimizer bytes (data_engineering 09):

shard cursor per rank — which shards consumed, position within the current shard,
shuffle seed + the buffer's RNG state — so the same samples come out in the same order,
epoch / global step — so the shard permutation for the next epoch is reproducible.

Get this right and a restart is invisible to the loss curve. Get it wrong and you discover the corruption only when an eval number looks off, weeks later.

6 · Proving it's not the bottleneck — the diagnostic

Lesson 00's design loop ends by finding the bottleneck step. Applied to the data plane: measure data-wait time per step — GPU idle between the end of one step's compute and the start of the next. If MFU is low, the question is comms (lesson 07) or data? The clean experiment:

The synthetic-data test

Replace the real dataloader with a cached or synthetic batch served from host RAM at zero I/O cost, and re-measure step time. If step time drops, the data plane was the bottleneck — fix the pipeline (more prefetch workers, faster format, offline tokenization, balanced shards). If step time is unchanged, data was already hidden behind compute and the MFU loss is comms or kernels — go back to lesson 07. This one A/B isolates the data plane from everything else and tells you exactly where to spend effort.

7 · Co-design — format trades network bytes against host CPU

The storage format is a knob on the producer–consumer race, and which way to turn it depends on which resource is scarce on your data path:

Format	Compression (network)	Decode cost (host CPU)	Pick when…
Parquet (columnar, compressed)	high — fewest bytes off object store	higher — decompress + decode	network-bound: limited egress / many ranks
Arrow (columnar, in-memory)	medium	low — zero-copy reads	CPU-bound: cheap bandwidth, scarce cores
WebDataset (tar shards)	tunable	low — sequential, streaming-friendly	pure sequential streaming at scale

More compression means fewer bytes/s off the network but more host CPU to decode — and host CPU is also what online tokenization and shuffle-buffer management consume. If you are network-bound, compress harder. If you are CPU-bound (e.g. already tokenizing online), reach for a cheaper-to-decode format and pre-tokenize. The §6 synthetic-data test plus a CPU-utilization read tells you which regime you're in.

Interactive · will the data plane keep up?

Set the consumer (global batch, step time, dp ranks) and the producer (per-rank read bandwidth, whether you tokenize online). The widget computes the required bytes/s per rank against what the rank can actually read, and shows the MFU ceiling if the GPUs end up data-starved.

Data-plane feasibility / will it keep up?

Assumptions: token IDs at 2 bytes/token; online tokenization reads raw text at ~5× the byte rate and is modeled as a 5× inflation of required bytes/s; required bytes are split evenly across dp ranks; "achieved" is the sustained sequential read each rank gets from object store. If achieved < required the GPUs starve, effective step time inflates by required/achieved, and the MFU ceiling falls by that ratio (lesson 07's 40% MFU baseline). Idle-cost uses 1,024 H100s × dp/512 scaling at ~$2/GPU-hr. Order-of-magnitude, per lesson 02's ±30% contract.

global batch (M tok) 4 step time (s) 2 dp ranks 128 per-rank read (MB/s) 200 online tokenize off

required MB/s / rank

–

achieved MB/s / rank

–

status

–

eff. MFU ceiling

–

What carries forward

The data plane is a performance system, not plumbing. A 10% dataloader stall on an $8M run is $800k of idle H100s — design it to win the producer–consumer race with margin.
The target: tokens/s = global_batch / step_time must arrive shuffled and disjoint, every second, forever. In bytes it is tiny (~4 MB/s at 2 B/token) — the access pattern, not the bandwidth, is the work.
Stream from object store, double-buffered. Prefetch N batches so I/O hides behind the step; per-batch fetch must beat step time or the queue drains.
Shuffle-buffer approximates a global shuffle; shards stay disjoint and deterministic. Skew on one rank stalls all ranks at the all-reduce.
Tokenize offline; pack documents. 30% padding = 30% wasted FLOPs (~$2.4M); packing recovers it. Online tokenization can become the host-CPU bottleneck.
Dataloader state (cursor, seed, buffer) is checkpoint state — exact resumption or a corrupted epoch.
Diagnose with the synthetic-data A/B; co-design the format to whichever you're bound by — network (compress) or CPU (decode cheap).