system_ml / 15 · caching allocator lesson 15 / 19

CUDA memory & the caching allocator

Why cudaMalloc is too slow to call per-op, what PyTorch does instead, and why your perfectly-fitting model still OOMs sometimes.

The two-line problem

cudaMalloc(N) takes roughly 30–200 microseconds per call — it goes into the driver, possibly synchronises the device, may even talk to the OS for backing pages. Per-op allocation would dominate the framework overhead from lesson 13, and gradient/activation tensors are allocated and freed every step. If every aten::add's output tensor required a cudaMalloc, eager-mode PyTorch would be slower than CPU NumPy.

So PyTorch (and JAX and TensorFlow and friends) implements a caching allocator: a big pool of GPU memory owned by the framework, sub-allocated from in microseconds by a custom data structure that lives in the host process. Every time you "free" a tensor, it goes back into the pool — not back to CUDA — and is reused for the next tensor of compatible size.

What the allocator actually looks like

PyTorch's CUDACachingAllocator maintains, per device and per stream:

When you do torch.empty(N):

  1. Round N up by the configured rounding rule (minimum 512 B for small allocations; large allocations are rounded by power-of-2 divisions configured via PYTORCH_CUDA_ALLOC_CONF).
  2. Search the free set for a block of at least that size. If found, take it.
  3. Otherwise, take a large free block and split it.
  4. Otherwise, ask CUDA for a new segment via cudaMalloc.

When you "free" a tensor (its Python refcount drops to zero):

  1. The block returns to its free list.
  2. If it's adjacent to other free blocks in the same segment, they coalesce into one larger free block.
  3. The segment is not released to CUDA — it stays in PyTorch's pool for next time.

This is why the cost in the typical case is "find a free block in a size-ordered set" — a few hundred nanoseconds. The driver doesn't see it.

Animated · the memory strip, alloc and free events

A single 128 MB segment laid out horizontally. Each button enqueues an alloc or free event; the timeline animates the block being inserted, split off a free block, or coalesced with its neighbours on free. Watch fragmentation build up as you alternate small and large allocations, then disappear when adjacent blocks merge.

Memory pool · alloc / free events on one segment
Colored blocks = allocations (each color = one tensor). Dark gaps = free space. Click an alloc button to fire a request; the simulator finds best-fit, splits the free block, animates the insertion. "free random" releases one live block — adjacent free blocks coalesce.
live MB
0
free MB
128
largest free block
128 MB
fragmentation
0%

The fragmentation trap

The caching allocator's worst-case is what a textbook OS allocator's worst-case is: external fragmentation. A segment can have lots of free bytes but no single free block of the size you need. The 60 GB of allocator-reserved memory can be "full" of 1 MB free holes when you're trying to fit a 2 GB tensor.

Real-world cause #1: varying batch sizes / sequence lengths. A training step allocates a giant activation tensor of shape (B, T, d). Next step uses B' > B — the old activation block is now too small; allocate a new one, splitting a different segment. Several iterations of this and the pool is full of dead blocks of the wrong sizes.

Real-world cause #2: retained tensors. A debug print, a metric being kept, an unintentional reference in a closure. Those tensors don't free, the surrounding blocks can't coalesce, and the segment is "pinned" with holes.

Symptom: training has been running fine for hours and suddenly OOMs. The "reserved memory" in torch.cuda.memory_summary() is large, but "active memory" is much smaller — most of the reserve is fragmented.

The two-fix toolkit
  1. Empty the cache (rarely): torch.cuda.empty_cache(). Returns all unused segments to CUDA. Cheap to do, but if your allocator is full of used blocks it does nothing — and the next op pays a fresh cudaMalloc.
  2. Switch to expandable segments (often): set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A modern allocator mode (introduced in PyTorch 2.1) where segments are virtual ranges that grow / shrink via cuMemMap instead of fixed-size cudaMalloc'd slabs. Drastically reduces fragmentation at the cost of using CUDA's virtual-memory API. Recommended by default for variable-shape training.

2D · fixed-pool vs expandable_segments, side by side

The same allocation trace, run twice. Left: traditional fixed-size segments — when a request can't fit in any existing segment, a new fixed segment is allocated from CUDA, wasting its tail. Right: expandable segments — one virtual range that grows by mapping new physical pages on demand, with no fixed pool boundaries. The right-hand pool tracks the high-water mark of live memory; the left's reserved memory drifts upward.

Fixed segments vs expandable_segments
Each square = 1 MB of address space. Bright color = in use, dim = free-but-reserved, dark = unmapped. Click step to issue the next alloc/free in the trace. The two simulators see identical events; only the allocator differs.
step
0
fixed reserved
0
expandable reserved
0
savings

Streams and the per-stream pool

Every CUDA op runs on a stream — an ordered queue of work for the GPU. The default stream is implicit; torch.cuda.Stream objects let you create new ones. Two ops on the same stream are ordered (op B starts after op A completes); two ops on different streams run concurrently if the device has capacity (e.g. compute kernel on one, NCCL collective on another — this is exactly the DDP overlap from lesson 04).

The allocator is stream-aware: when a tensor is freed, it can only be reused on the same stream that last touched it, until the GPU has signalled that any in-flight work on that stream has finished. This is to prevent the case where stream A frees a tensor at the host while a kernel on stream A is still using its bytes; if stream B grabbed the same bytes and wrote to them, you'd race.

Practically: cross-stream sharing of tensors requires explicit synchronisation (record_stream), and "free" doesn't mean "available" until the device says so. Forgetting this is a frequent source of subtle race conditions in custom multi-stream code.

2D · per-stream pool visualiser

Three streams, each with its own pool. Allocations on a stream return blocks from that stream's pool; freed blocks stay assigned to the stream that last used them until a kernel completes there. The bottom event log shows the in-flight kernels per stream — a freed block on a busy stream is highlighted yellow ("held"), unable to be reused even by its own stream until the stream signals.

Per-stream pools · cross-stream sharing has rules
Click "alloc on Sₙ" to put a tensor in stream n's pool. "free + record_stream(other)" lets a different stream pick up the free block once the original stream syncs. Without record_stream, you'd risk a use-after-free across streams.
S0 live / pool
0 / 0
S1 live / pool
0 / 0
S2 live / pool
0 / 0
blocks held (pending)
0

The big picture of HBM accounting

For an H100 with 80 GB HBM, a typical training run uses memory like this:

  ┌────────────────────────────────────────────────────── 80 GB
  │ CUDA context        ~ 0.5–1.5 GB  (driver, kernels, workspace)
  │ Model state         ~ depends    (lesson 01: ~16P bytes / shard)
  │ Optimizer state     ~ included above
  │ Activations         ~ varies    (lesson 01: BT·L·d · k bytes)
  │ NCCL buffers        ~ 100–500 MB
  │ cuDNN workspace     ~ 100–500 MB
  │ Allocator headroom  ~ should be small if well-tuned
  └────────────────────────────────────────────────────── 0

The numbers add up tightly. The first OOM in a training run usually means one of three things:

Diagnostic tooling — what to look at

ToolWhat it tells you
torch.cuda.memory_summary()Reserved (allocator total), active (in-use), and a per-bucket breakdown
torch.cuda.memory_snapshot()Every segment and block with its lifetime — feed to memviz
torch.cuda.memory._record_memory_history()Record allocations with stack traces — for finding what's retained
nvidia-smiDriver's view of total process memory — allocator-reserved + CUDA context + cuDNN workspace + NCCL buffers. Always larger than PyTorch's "reserved"; the gap is the non-PyTorch pieces.
NCCL_DEBUG=WARNNCCL's own buffer allocations and errors

A real debugging session for an "I'm OOMing at step 200" bug is roughly: turn on _record_memory_history, run until OOM, save a snapshot, open it in memviz (a Chrome trace), look at what segments are pinned by what.

Why pinned memory exists

Host RAM that's been "pinned" (page-locked) can be transferred to the GPU via DMA without staging through a pinned bounce buffer first — meaningfully faster H2D copies. DataLoader(pin_memory=True) ensures inputs go through this fast path. The cost is host RAM (pinned pages can't be paged out), so don't pin everything — only the staging buffer for the next batch.

Interactive · the fragmentation simulator

Simulate a sequence of alloc / free operations with varying sizes, and watch the segment fill, fragment, and OOM. Toggle expandable-segments mode to see fragmentation go away (at the cost of more virtual-address-space gymnastics). Try the "vary batch size" workload — it OOMs without expandable segments much earlier than its total live bytes would suggest.

Caching allocator · live bytes vs reserved bytes
Each "step" allocates a few tensors of varying size and frees the previous step's. Grey = free block, blue = active block. When a new alloc can't find a fitting free block, a fresh segment is added (or with expandable segments, the existing one is extended).
step
0
active MB
0
reserved MB
0
fragmentation
0%
Takeaway
The caching allocator turns "thousands of cudaMallocs per second" into a near-zero-overhead pool operation. The price is fragmentation, which appears any time block sizes vary across steps. expandable_segments is the modern workaround. Cross-stream tensor sharing needs record_stream or you get races. Master these three and most OOMs go from mysterious to mechanical.