gpu_kernel_serving / 09 · tensor cores lesson 9 / 17

Tensor cores — mma and fragments

A single hardware instruction that does a small matrix multiply in one cycle. The reason an H100 has 1 PFLOP of bf16 throughput is that every SM has 4 of these, each doing 16×8×16 multiplies per cycle. Every meaningful matmul on a modern GPU lands here.

The numbers, again

From lesson 20: an H100 SM has 4 tensor cores. Each one can do one "small matmul" per cycle. The "size" of that small matmul depends on the data type:

Data typePer-instruction shape (m × n × k)FLOPs/cycle (per tensor core)
fp16 / bf1616 × 8 × 162 × 2048 = 4096
fp8 (E4M3 / E5M2)16 × 8 × 322 × 4096 = 8192
int816 × 8 × 322 × 4096 = 8192
tf3216 × 8 × 82 × 1024 = 2048
fp648 × 8 × 42 × 256 = 512

4 tensor cores × 4096 FLOPs/cycle (bf16) × ~1.755 GHz × 132 SMs ≈ ~990 TFLOPS dense bf16 — matching NVIDIA's marketed H100 SXM5 peak. The "tensor cores deliver ~1 PFLOPS" number you see is exactly this product. (With Hopper's structured-sparsity feature, you can claim ~2× of that on the data-dependent path, but most ML workloads don't rely on it.)

Compare to the CUDA-core ALU path: 128 fp32 ALUs/SM × 2 FLOPs/cycle (FMA) × 1.755 GHz × 132 SMs ≈ 67 TFLOPS. Tensor cores are ~15× faster than CUDA cores on bf16 matmul. Not using them is leaving most of the GPU on the table.

Animated · one mma instruction, fragment by fragment

The canonical visualisation: a 16×16 fragment of A multiplies a 16×16 fragment of B and accumulates into a 16×16 fragment of C, all in a single hardware instruction. Step through to watch one row of A "rake" through B, contributing to the corresponding column of C. The colours show which lanes of the warp contributed to each output element.

mma · 16×16 fragment multiply, one instruction
A · B → C, with each output C[i,j] = Σ_k A[i,k] · B[k,j]. The animation walks the k dimension; the heatmap shows the running accumulator C.
k slice
FMAs this step
cumulative FMAs
issued instructions

The instruction — mma.sync

The PTX instruction that drives a tensor core is called mma.sync ("matrix multiply accumulate, synchronous"). Its shape (m, n, k) reflects the dimensions of the multiply: an m×k matrix times a k×n matrix, accumulating into an m×n matrix.

// 16 × 8 × 16 bf16 multiply, fp32 accumulate
mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32  d, a, b, c;

The operands a, b, c, d are not register arrays — they're distributed across the threads of the warp. Each thread holds a small slice of each operand. This is the "fragment" concept.

Fragments — how a 16×16 matrix lives across 32 threads

The fundamental abstraction: a matrix-multiply operand is split into fragments, with each thread of the warp holding a specific subset of the elements. The exact assignment depends on the instruction shape and dtype, but for the canonical 16×16 bf16 matrix:

The PTX ldmatrix instruction is the canonical way to load a fragment from SMEM into the right register layout — it does the "shuffle" needed to assign the right elements to the right threads.

2D · how 32 threads load one 16×16 fragment

The hardware expects a specific lane-to-element mapping. Each cell below is one of 256 elements in a 16×16 matrix; click a thread (0..31) to see which 8 elements it loads. This is the layout produced by ldmatrix — it does the cross-lane shuffle so each thread ends up with the right registers.

Fragment load layout · which thread owns which element
Each thread holds 8 elements of the 256-element fragment. Click a thread chip to highlight its elements. The colour gradient shows the canonical row-major bf16 A-fragment layout.
selected thread
T0
elements held
register count
total per warp
256

Two interfaces — WMMA (legacy) and mma (modern)

NVIDIA exposes tensor cores via two C++ APIs:

WMMA — the high-level wrapper

#include <mma.h>
using namespace nvcuda;

wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

wmma::fill_fragment(c_frag, 0.0f);
wmma::load_matrix_sync(a_frag, a_smem, 16);
wmma::load_matrix_sync(b_frag, b_smem, 16);
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
wmma::store_matrix_sync(c_global, c_frag, N, wmma::mem_row_major);

The "fragment" type is opaque; you don't see the per-thread layout. WMMA is portable across architectures but only supports a fixed set of shapes (16×16×16, 32×8×16, 8×32×16 on Ampere; more shapes on Hopper). The compiler may not produce the most efficient instructions for shapes that don't perfectly match the architecture.

Direct mma via inline PTX

For maximum performance, CUTLASS and FlashAttention drop down to inline PTX:

asm volatile(
    "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
    : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
    : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
      "r"(b0), "r"(b1),
      "f"(c0), "f"(c1), "f"(c2), "f"(c3)
);

Ugly, but exposes every shape and accumulator type the hardware supports. CUTLASS wraps these into nicer C++ templates; user code rarely writes raw PTX.

The pipeline — a tiled matmul with tensor cores

Combine lesson 24's tiling with tensor cores: the inner loop's acc += a[k] * b[k] is replaced by an mma.sync. Per warp:

  1. Load: ldmatrix from SMEM into A-fragment and B-fragment.
  2. Compute: mma.sync updates the C-fragment.
  3. Iterate: walk K dimension, loading new fragments and accumulating.
  4. Store: write C-fragment from registers back to global memory (or to SMEM for an epilogue).

One warp does a 16×8 output tile per mma.sync. A block with 4 warps doing 4 mma's per cycle does 4 × 16 × 8 = 512 output elements per cycle — multiply by the K depth and you have the throughput.

Async copies — the Hopper trick

For tensor cores to stay fed, you need a continuous stream of A and B fragments. On Ampere+, the cp.async PTX instruction issues an asynchronous SMEM load that runs in the background while the SM does math. You can "prefetch" the next tile while computing on the current one — double buffering at the kernel level.

On Hopper, this gets even better: the TMA (Tensor Memory Accelerator) is a dedicated hardware unit that does the async copy without occupying CUDA cores. The warp issues "TMA, please bring tile X from HBM to SMEM" and goes back to crunching the current tile.

For tensor-core-dense kernels (FlashAttention-3, CUTLASS-3, modern cuBLAS), the loop is roughly:

repeat:
    TMA.issue(next_A_tile, next_B_tile)  // async, no waiting
    mma.sync(C, A_cur, B_cur)             // tensor-core compute
    TMA.wait()                             // wait for next tile
    swap(A_cur, A_next); swap(B_cur, B_next)

If TMA finishes before the mma's, you're compute-bound. If mma's finish first, you're HBM-bound. The art of writing peak-throughput kernels is balancing those two so neither idles.

Precision — what flavours actually exist

ArchitectureTensor core formats
Volta (V100)fp16 only
Turing (T4)fp16, int8, int4, int1
Ampere (A100)+ bf16, tf32, fp64
Hopper (H100)+ fp8 (E4M3, E5M2)
Blackwell (B100/B200)+ fp4 (E2M1), fp6

The ladder from fp32 → bf16 → fp8 → fp4 roughly halves the bytes per operand and doubles the FLOPs per cycle each step. Marketing numbers (dense; structured-sparsity doubles these): A100 ~312 TF bf16 → H100 ~990 TF bf16, ~1980 TF fp8 → B200 ~1.1 PF bf16, ~2.25 PF fp8, ~4.5 PF fp4. The often-quoted "2.25/4.5/9 PF" Blackwell numbers include 2× sparsity on top.

The accumulator is typically fp32 (for bf16/fp8/fp4 inputs it's fp32; for fp16 inputs both fp16 and fp32 accumulators exist, and the fp16 accumulator is a legacy path). The multiply happens in low precision, the running sum stays in high precision. This is the "accumulate-in-fp32 rule" from lesson 14, made hardware-explicit.

3D · the precision ladder, stacked isometric

Each tier is one PTX mma.sync shape: bf16 (m16n8k16), fp16 (same), tf32 (m16n8k8), fp8 (m16n8k32), fp4 (m16n8k64). The volume of each tier represents the FLOPs delivered per cycle. The taller (deeper-k) tiers do more work; the narrower (lower-precision) tiers pay less per operand. Hover or click a tier to see the actual throughput on H100/B200. (Note: the high-level WMMA C++ API exposes a logical 16×16 fragment view, but the underlying hardware instruction is m16n8k.)

Precision ladder · isometric tiers · each doubles the throughput
From top: fp32 (CUDA cores only) → tf32 → fp16/bf16 → fp8 → fp4. Each step down the ladder doubles peak by halving the operand bytes. Accumulator stays fp32.
selected dtype
instruction shape
peak (dense)
×fp32 speedup

Where you'll see this — and where you won't

LayerTensor cores?
Linear / matmulYes — cuBLAS gemm, Triton tl.dot
AttentionYes — QK^T and softmax(QK^T)·V both
ConvolutionYes — cuDNN gemm-based or implicit-gemm
LayerNormNo — reduction, no matmul
Activation (GELU, ReLU)No — elementwise
Embedding lookupNo — gather

So a transformer block is roughly: tensor-core (QKV proj) → tensor-core (attention) → tensor-core (output proj) → CUDA-core (residual + norm + activation) → tensor-core (MLP up) → CUDA-core (activation) → tensor-core (MLP down). Most of the FLOPs in tensor cores; most of the bytes still moving through HBM (lesson 21's roofline).

What you actually need to know — even if you never write PTX

You're unlikely to write mma.sync by hand unless you're working on a serving stack or a kernel library. But knowing what's there matters:

The closing loop — the whole stack from lesson 01

Look back at the path we've walked:

  1. Lessons 01–10: we built a 70B model and sharded it across 512 GPUs.
  2. Lessons 11–12: we served it from a continuous-batching, KV-cached, prefill/decode-disaggregated stack.
  3. Lessons 13–19: we walked the per-GPU framework — PyTorch dispatcher, allocators, fusion, compilers.
  4. Lessons 20–27: we walked the CUDA execution model — warps, memory, coalescing, tiling, divergence, occupancy.
  5. Lesson 28 (here): we ended where modern matmul lives — at the tensor core, where 4 fp32 operands per cycle become 4096.

The 70B model from lesson 01 hits tensor cores billions of times per training step. Every layer above this lesson is, at some level, in service of keeping those tensor cores fed — coalesced HBM reads to bring tiles to SMEM, SMEM tiles to fragments, fragments to the mma.sync. The whole stack is one optimisation problem.

Interactive · tensor-core throughput by dtype

Pick a matrix size and a dtype. The widget computes the theoretical peak time (assuming 100% tensor-core utilisation), the achievable bandwidth-roofline time, and the resulting compute-vs-memory regime.

Tensor-core peak vs HBM roofline
For a square matmul of size N. Tensor-core peak is the FLOPs-bound time. HBM roofline is the memory-bound time. The kernel takes the larger of the two.
compute time
HBM time
bottleneck
arithmetic intensity
Takeaway
Every matmul-class op on a modern GPU eventually lands on the tensor cores via mma.sync. Each instruction is a small matrix multiply, with operands distributed as fragments across a warp. cuBLAS, cuDNN, CUTLASS, and Triton's tl.dot all emit these. The dtype ladder (fp32 → bf16 → fp8 → fp4) doubles throughput at each step, with the running accumulator staying in fp32. The whole rest of the system — sharding, allocators, compilers, kernels, memory layouts — exists to keep these tensor cores fed.
Where this connects
Part V ends here. The full 28-lesson stack — distributed parallelism (01–10) → inference serving (11–12) → framework/kernels/compilers (13–19) → CUDA fundamentals (20–28) — is one continuous descent from "1024 GPUs training a 70B model" to "32 threads launching one mma.sync." The companion series vllm/lessons/ covers serving-time specifics (KV cache, PagedAttention, continuous batching), and RL/lessons/ covers RL post-training architecture. Same system, different cross-sections.