Tensor cores — mma and fragments
A single hardware instruction that does a small matrix multiply in one cycle. The reason an H100 has 1 PFLOP of bf16 throughput is that every SM has 4 of these, each doing 16×8×16 multiplies per cycle. Every meaningful matmul on a modern GPU lands here.
The numbers, again
From lesson 20: an H100 SM has 4 tensor cores. Each one can do one "small matmul" per cycle. The "size" of that small matmul depends on the data type:
| Data type | Per-instruction shape (m × n × k) | FLOPs/cycle (per tensor core) |
|---|---|---|
| fp16 / bf16 | 16 × 8 × 16 | 2 × 2048 = 4096 |
| fp8 (E4M3 / E5M2) | 16 × 8 × 32 | 2 × 4096 = 8192 |
| int8 | 16 × 8 × 32 | 2 × 4096 = 8192 |
| tf32 | 16 × 8 × 8 | 2 × 1024 = 2048 |
| fp64 | 8 × 8 × 4 | 2 × 256 = 512 |
4 tensor cores × 4096 FLOPs/cycle (bf16) × ~1.755 GHz × 132 SMs ≈ ~990 TFLOPS dense bf16 — matching NVIDIA's marketed H100 SXM5 peak. The "tensor cores deliver ~1 PFLOPS" number you see is exactly this product. (With Hopper's structured-sparsity feature, you can claim ~2× of that on the data-dependent path, but most ML workloads don't rely on it.)
Compare to the CUDA-core ALU path: 128 fp32 ALUs/SM × 2 FLOPs/cycle (FMA) × 1.755 GHz × 132 SMs ≈ 67 TFLOPS. Tensor cores are ~15× faster than CUDA cores on bf16 matmul. Not using them is leaving most of the GPU on the table.
Animated · one mma instruction, fragment by fragment
The canonical visualisation: a 16×16 fragment of A multiplies a 16×16 fragment of B and accumulates into a 16×16 fragment of C, all in a single hardware instruction. Step through to watch one row of A "rake" through B, contributing to the corresponding column of C. The colours show which lanes of the warp contributed to each output element.
The instruction — mma.sync
The PTX instruction that drives a tensor core is called mma.sync ("matrix multiply accumulate, synchronous"). Its shape (m, n, k) reflects the dimensions of the multiply: an m×k matrix times a k×n matrix, accumulating into an m×n matrix.
// 16 × 8 × 16 bf16 multiply, fp32 accumulate
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 d, a, b, c;
The operands a, b, c, d are not register arrays — they're distributed across the threads of the warp. Each thread holds a small slice of each operand. This is the "fragment" concept.
Fragments — how a 16×16 matrix lives across 32 threads
The fundamental abstraction: a matrix-multiply operand is split into fragments, with each thread of the warp holding a specific subset of the elements. The exact assignment depends on the instruction shape and dtype, but for the canonical 16×16 bf16 matrix:
- 32 threads, 16×16 = 256 elements, so 8 elements per thread.
- Each thread holds a specific subset of the matrix's elements in its registers, in a specific layout the hardware expects.
The PTX ldmatrix instruction is the canonical way to load a fragment from SMEM into the right register layout — it does the "shuffle" needed to assign the right elements to the right threads.
2D · how 32 threads load one 16×16 fragment
The hardware expects a specific lane-to-element mapping. Each cell below is one of 256 elements in a 16×16 matrix; click a thread (0..31) to see which 8 elements it loads. This is the layout produced by ldmatrix — it does the cross-lane shuffle so each thread ends up with the right registers.
Two interfaces — WMMA (legacy) and mma (modern)
NVIDIA exposes tensor cores via two C++ APIs:
WMMA — the high-level wrapper
#include <mma.h>
using namespace nvcuda;
wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
wmma::fill_fragment(c_frag, 0.0f);
wmma::load_matrix_sync(a_frag, a_smem, 16);
wmma::load_matrix_sync(b_frag, b_smem, 16);
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
wmma::store_matrix_sync(c_global, c_frag, N, wmma::mem_row_major);
The "fragment" type is opaque; you don't see the per-thread layout. WMMA is portable across architectures but only supports a fixed set of shapes (16×16×16, 32×8×16, 8×32×16 on Ampere; more shapes on Hopper). The compiler may not produce the most efficient instructions for shapes that don't perfectly match the architecture.
Direct mma via inline PTX
For maximum performance, CUTLASS and FlashAttention drop down to inline PTX:
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
"{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
: "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
: "r"(a0), "r"(a1), "r"(a2), "r"(a3),
"r"(b0), "r"(b1),
"f"(c0), "f"(c1), "f"(c2), "f"(c3)
);
Ugly, but exposes every shape and accumulator type the hardware supports. CUTLASS wraps these into nicer C++ templates; user code rarely writes raw PTX.
The pipeline — a tiled matmul with tensor cores
Combine lesson 24's tiling with tensor cores: the inner loop's acc += a[k] * b[k] is replaced by an mma.sync. Per warp:
- Load:
ldmatrixfrom SMEM into A-fragment and B-fragment. - Compute:
mma.syncupdates the C-fragment. - Iterate: walk K dimension, loading new fragments and accumulating.
- Store: write C-fragment from registers back to global memory (or to SMEM for an epilogue).
One warp does a 16×8 output tile per mma.sync. A block with 4 warps doing 4 mma's per cycle does 4 × 16 × 8 = 512 output elements per cycle — multiply by the K depth and you have the throughput.
Async copies — the Hopper trick
For tensor cores to stay fed, you need a continuous stream of A and B fragments. On Ampere+, the cp.async PTX instruction issues an asynchronous SMEM load that runs in the background while the SM does math. You can "prefetch" the next tile while computing on the current one — double buffering at the kernel level.
On Hopper, this gets even better: the TMA (Tensor Memory Accelerator) is a dedicated hardware unit that does the async copy without occupying CUDA cores. The warp issues "TMA, please bring tile X from HBM to SMEM" and goes back to crunching the current tile.
For tensor-core-dense kernels (FlashAttention-3, CUTLASS-3, modern cuBLAS), the loop is roughly:
repeat:
TMA.issue(next_A_tile, next_B_tile) // async, no waiting
mma.sync(C, A_cur, B_cur) // tensor-core compute
TMA.wait() // wait for next tile
swap(A_cur, A_next); swap(B_cur, B_next)
If TMA finishes before the mma's, you're compute-bound. If mma's finish first, you're HBM-bound. The art of writing peak-throughput kernels is balancing those two so neither idles.
Precision — what flavours actually exist
| Architecture | Tensor core formats |
|---|---|
| Volta (V100) | fp16 only |
| Turing (T4) | fp16, int8, int4, int1 |
| Ampere (A100) | + bf16, tf32, fp64 |
| Hopper (H100) | + fp8 (E4M3, E5M2) |
| Blackwell (B100/B200) | + fp4 (E2M1), fp6 |
The ladder from fp32 → bf16 → fp8 → fp4 roughly halves the bytes per operand and doubles the FLOPs per cycle each step. Marketing numbers (dense; structured-sparsity doubles these): A100 ~312 TF bf16 → H100 ~990 TF bf16, ~1980 TF fp8 → B200 ~1.1 PF bf16, ~2.25 PF fp8, ~4.5 PF fp4. The often-quoted "2.25/4.5/9 PF" Blackwell numbers include 2× sparsity on top.
The accumulator is typically fp32 (for bf16/fp8/fp4 inputs it's fp32; for fp16 inputs both fp16 and fp32 accumulators exist, and the fp16 accumulator is a legacy path). The multiply happens in low precision, the running sum stays in high precision. This is the "accumulate-in-fp32 rule" from lesson 14, made hardware-explicit.
3D · the precision ladder, stacked isometric
Each tier is one PTX mma.sync shape: bf16 (m16n8k16), fp16 (same), tf32 (m16n8k8), fp8 (m16n8k32), fp4 (m16n8k64). The volume of each tier represents the FLOPs delivered per cycle. The taller (deeper-k) tiers do more work; the narrower (lower-precision) tiers pay less per operand. Hover or click a tier to see the actual throughput on H100/B200. (Note: the high-level WMMA C++ API exposes a logical 16×16 fragment view, but the underlying hardware instruction is m16n8k.)
Where you'll see this — and where you won't
| Layer | Tensor cores? |
|---|---|
| Linear / matmul | Yes — cuBLAS gemm, Triton tl.dot |
| Attention | Yes — QK^T and softmax(QK^T)·V both |
| Convolution | Yes — cuDNN gemm-based or implicit-gemm |
| LayerNorm | No — reduction, no matmul |
| Activation (GELU, ReLU) | No — elementwise |
| Embedding lookup | No — gather |
So a transformer block is roughly: tensor-core (QKV proj) → tensor-core (attention) → tensor-core (output proj) → CUDA-core (residual + norm + activation) → tensor-core (MLP up) → CUDA-core (activation) → tensor-core (MLP down). Most of the FLOPs in tensor cores; most of the bytes still moving through HBM (lesson 21's roofline).
What you actually need to know — even if you never write PTX
You're unlikely to write mma.sync by hand unless you're working on a serving stack or a kernel library. But knowing what's there matters:
- Use cuBLAS / Triton
tl.dotfor matmuls. They emit the right tensor-core instructions. - Choose dtype with tensor-core support. A bf16 matmul is 15× faster than the fp32 version on H100. If your numerics tolerate it, use it.
- Check tensor-core utilisation in Nsight. The metric
sm__inst_executed_pipe_tensortells you how much of compute used tensor cores. If your matmul-heavy kernel shows low tensor-core activity, something is wrong (wrong dtype, wrong shape, fallback to CUDA cores). - Match shapes to architecture. A 16-aligned matmul size keeps the tensor cores fed; a size-15 matmul wastes the last fragment. Pad if needed.
The closing loop — the whole stack from lesson 01
Look back at the path we've walked:
- Lessons 01–10: we built a 70B model and sharded it across 512 GPUs.
- Lessons 11–12: we served it from a continuous-batching, KV-cached, prefill/decode-disaggregated stack.
- Lessons 13–19: we walked the per-GPU framework — PyTorch dispatcher, allocators, fusion, compilers.
- Lessons 20–27: we walked the CUDA execution model — warps, memory, coalescing, tiling, divergence, occupancy.
- Lesson 28 (here): we ended where modern matmul lives — at the tensor core, where 4 fp32 operands per cycle become 4096.
The 70B model from lesson 01 hits tensor cores billions of times per training step. Every layer above this lesson is, at some level, in service of keeping those tensor cores fed — coalesced HBM reads to bring tiles to SMEM, SMEM tiles to fragments, fragments to the mma.sync. The whole stack is one optimisation problem.
Interactive · tensor-core throughput by dtype
Pick a matrix size and a dtype. The widget computes the theoretical peak time (assuming 100% tensor-core utilisation), the achievable bandwidth-roofline time, and the resulting compute-vs-memory regime.
mma.sync. Each instruction is a small matrix multiply, with operands distributed as fragments across a warp. cuBLAS, cuDNN, CUTLASS, and Triton's tl.dot all emit these. The dtype ladder (fp32 → bf16 → fp8 → fp4) doubles throughput at each step, with the running accumulator staying in fp32. The whole rest of the system — sharding, allocators, compilers, kernels, memory layouts — exists to keep these tensor cores fed.
mma.sync." The companion series vllm/lessons/ covers serving-time specifics (KV cache, PagedAttention, continuous batching), and RL/lessons/ covers RL post-training architecture. Same system, different cross-sections.