Why Triton?

PyTorch can't fuse the operation you want. CUDA takes a week to write and breaks when the shape changes. Triton is the lane in between. This lesson makes the case for it: why the lane exists, what it gives up, and the one-line decision rule for when it's the right tool.

The problem in one example

You profiled a transformer block and one chain of ops is eating 18% of the step:

z = x @ w           # GEMM, calls cuBLAS
z = z + b           # elementwise, separate kernel
z = gelu(z)         # elementwise, separate kernel
y = z * mask        # elementwise, separate kernel

Four kernels, four round-trips to HBM. The GEMM is fine — cuBLAS is hand-tuned. But the three trailing elementwise ops each read all of z from HBM, do a trivial amount of math per byte, and write all of z back. That's pure memory bandwidth waste:

data moved (4 ops): ~7×
data moved (fused): ~2×
launches: 4 → 1
arithmetic intensity: 3× higher

If you fuse the four ops into one kernel that streams z through registers, you save 5/7 of the HBM traffic and 3/4 of the launch overhead. Because the trailing ops are bandwidth-bound, on an H100 with ~3 TB/s HBM that works out to roughly a 2–4× speedup on this region.
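
The savings above can be checked with a few lines of arithmetic. The sketch below takes the pass counts from this lesson (7 passes over z unfused, 2 fused) and the ~3 TB/s figure as given. The tensor shape, the bf16 element size and the helper name hbm_bytes are assumptions chosen for illustration, and the times are bandwidth-only lower bounds, not measurements.

```python
from fractions import Fraction


def hbm_bytes(rows, cols, bytes_per_elem, passes):
    """Bytes crossing HBM when a (rows, cols) tensor is read or written `passes` times."""
    return rows * cols * bytes_per_elem * passes


# Hypothetical activation shape and dtype (assumptions, not taken from the profile).
ROWS, COLS = 8192, 4096
BF16_BYTES = 2

# Pass counts and bandwidth as quoted in the lesson.
UNFUSED_PASSES = 7
FUSED_PASSES = 2
HBM_BYTES_PER_S = 3e12  # ~3 TB/s

unfused = hbm_bytes(ROWS, COLS, BF16_BYTES, UNFUSED_PASSES)
fused = hbm_bytes(ROWS, COLS, BF16_BYTES, FUSED_PASSES)

traffic_saved = Fraction(unfused - fused, unfused)  # fraction of HBM bytes avoided
launches_saved = Fraction(4 - 1, 4)                 # four launches collapse into one

# Bandwidth-only lower bounds on the time spent moving data, in microseconds.
unfused_us = unfused / HBM_BYTES_PER_S * 1e6
fused_us = fused / HBM_BYTES_PER_S * 1e6

print(f"traffic saved:  {traffic_saved}")
print(f"launches saved: {launches_saved}")
print(f"unfused >= {unfused_us:.1f} us, fused >= {fused_us:.1f} us")
```

The ratio of the two bounds is 7/2 = 3.5 regardless of the shape chosen, which sits inside the 2–4× range quoted above.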

Why this is hard without Triton

You have three options and they all hurt in different ways:

Option | Pros | Cons
torch.compile | Zero kernel code. Often fuses elementwise chains automatically. | Opaque when it doesn't fuse. Black-box debugging. Some patterns it just refuses.
Hand-written CUDA | Maximum control. Can approach peak throughput when you know what you're doing. | ~1000 lines of C++ for a fused matmul. Shape-fragile. Build system. A week or two of work.
Library call (CUTLASS, FlashInfer, …) | If a library covers it, this is best. | Libraries don't cover novel fusions. The custom op you needed last week isn't there.

Triton is option 4: ~100 lines of Python that compile to a fused kernel, autotune across shapes, and need no build system.

The defining trade-off

CUDA: you address threads

__global__ void add(...) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

"each of the 32 threads in my warp does one element"
mma sequences, swizzle, async copy choreography: all yours.

Triton: you address tiles

@triton.jit
def add(a, b, c, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid*BLOCK + tl.arange(0,BLOCK)
    m = offs < N
    tl.store(c+offs, tl.load(a+offs,m)+...)

"one program handles a tile of BLOCK elements; the compiler picks warps"
mma, swizzle, async copy: all the compiler's.

Triton hides the warp. You never write threadIdx, never touch __syncthreads, never schedule async copies by hand. The compiler does that for you, and gets it right ~80–95% of the time on modern hardware.

The one trade-off, said precisely
In CUDA, you control which thread does which operation, when warps synchronise, how shared memory is swizzled, and when async copies issue. In Triton, you write code as if the tile were a single fat vector and the compiler chooses how to schedule it onto warps. Both compile to the same hardware; the difference is where the choices are made — your code, or the compiler.
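
To make the "single fat vector" picture concrete, here is a minimal pure-Python emulation of the tile model. It is not Triton and runs on the CPU. The names cdiv, add_program and launch are hypothetical helpers invented for this sketch. The list of offsets stands in for pid*BLOCK + tl.arange(0, BLOCK), and the mask plays the role of offs < N.

```python
def cdiv(n, block):
    """Ceiling division: how many tiles of size `block` are needed to cover n elements."""
    return (n + block - 1) // block


def add_program(pid, a, b, c, n, block):
    """One 'program': owns the tile of `block` elements that starts at pid * block."""
    offs = [pid * block + i for i in range(block)]  # offsets covered by this tile
    mask = [o < n for o in offs]                    # guards the ragged last tile
    for o, in_bounds in zip(offs, mask):
        if in_bounds:                               # masked load and masked store
            c[o] = a[o] + b[o]


def launch(a, b, block=4):
    """Emulated grid launch: programs are independent, so they run here one by one."""
    n = len(a)
    c = [0.0] * n
    for pid in range(cdiv(n, block)):
        add_program(pid, a, b, c, n, block)
    return c


a = [float(i) for i in range(10)]
b = [10.0 * i for i in range(10)]
c = launch(a, b, block=4)  # 10 elements, tiles of 4: three programs, the last one partial
print(c)
```

Nothing in add_program says which thread handles which element. That omission is the point: in real Triton the compiler makes that assignment, while in CUDA you would write it yourself.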

What you give up

What you gain

The one-line decision rule

Reach for Triton when all three of these are true:

  1. No library covers the operation you need.
  2. The operation is bandwidth-bound (fusion would save HBM bytes) or custom enough that no fused library kernel exists.
  3. You want autotuning across multiple shapes and don't have time to hand-tune CUDA per shape.

If those don't all hold, the answer is usually:

Situation | Use
Vanilla bf16/fp16 matmul | torch.matmul → cuBLAS
Attention | FlashAttention v3 or FlashInfer
Elementwise chain in a model | Try torch.compile first; Triton if it doesn't fuse
Quantised GEMM (W4/W8) | Marlin / Machete via vLLM
The op is <1% of the step | Don't bother
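
The rule and the table can be folded into one small function. This is a sketch of the decision logic only. The function name, the parameter names and the returned strings are hypothetical, and the final fallback (try torch.compile first when no library applies but the three conditions don't all hold) is an assumption extrapolated from the elementwise-chain row of the table.

```python
def recommend(step_fraction, library_covers_op, fusion_pays_or_novel, many_shapes):
    """Return a coarse recommendation for one candidate region of the step.

    step_fraction        -- share of step time spent in the region (0.0 to 1.0)
    library_covers_op    -- True if cuBLAS, FlashAttention, Marlin, etc. already do this
    fusion_pays_or_novel -- True if the op is bandwidth-bound or too custom for a library
    many_shapes          -- True if you want autotuning across multiple shapes
    """
    if step_fraction < 0.01:
        return "skip"                 # table row: the op is <1% of the step
    if library_covers_op:
        return "library"              # condition 1 fails: a library already covers it
    if fusion_pays_or_novel and many_shapes:
        return "triton"               # all three conditions hold
    return "torch.compile first"      # assumed fallback, see the note above


# The profiled region from the opening example: 18% of the step, a novel fusion.
print(recommend(0.18, library_covers_op=False, fusion_pays_or_novel=True, many_shapes=True))
```

The order of the checks reflects cost: confirming that the region is too small to matter, or that a library already exists, is cheaper than writing any kernel.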

Triton's sweet spot: novel fusion, bandwidth-bound, multiple shapes. Outside it, prefer a library or skip.

What's next

You're sold on the lane. The question now is: what does "tile-level code" actually feel like? Lesson 02 introduces the central abstraction — programs and tiles — with a live widget that shows what each warp lane is doing when you write tl.arange(0, BLOCK).