Why Triton?

PyTorch can't fuse the operation you want. CUDA takes a week to write and breaks when the shape changes. Triton is the lane in between. This lesson makes the case: why the lane exists, what it gives up, and the one-line decision rule for when it's the right tool.

The problem in one example

You profiled a transformer block and one chain of ops is eating 18% of the step:

z = x @ w           # GEMM, calls cuBLAS
z = z + b           # elementwise, separate kernel
z = gelu(z)         # elementwise, separate kernel
y = z * mask        # elementwise, separate kernel

Four kernels, four round-trips to HBM. The GEMM is fine — cuBLAS is hand-tuned. But the three trailing elementwise ops each read all of z from HBM, do a trivial amount of math per byte, and write all of z back. That's pure memory bandwidth waste:

Metric	Value
Data moved (4 ops)	~7×
Data moved (fused)	~2×
Launches	4 → 1
Arithmetic intensity	3× higher

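If you fuse the four ops into one kernel that streams z through registers, HBM traffic drops from ~7× to ~2×, a saving of 5/7, and launches drop from four to one, a saving of 3/4 of the launch overhead. The elementwise tail is bandwidth-bound, so on an H100 with ~3 TB/s of HBM bandwidth that is a 2–4× speedup on this region.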
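
To make the arithmetic concrete, here is a back-of-envelope sketch. It takes the lesson's pass counts (~7 unfused, ~2 fused) as given rather than deriving them. The tensor shape, the bf16 element size, and the helper name `hbm_time_us` are illustrative assumptions, and the model covers bandwidth only: it ignores launch latency, caches, and the GEMM's own compute time.

```python
# Back-of-envelope HBM traffic model for the op chain.
# Assumptions (not from the lesson): z has shape (8192, 8192) in bf16,
# and each "pass" moves one full copy of z at the quoted ~3 TB/s.

BYTES_PER_ELEM = 2          # bf16
ROWS, COLS = 8192, 8192     # hypothetical shape of z
HBM_BYTES_PER_S = 3e12      # ~3 TB/s, the H100 figure quoted in the text


def hbm_time_us(passes: float) -> float:
    """Microseconds needed to move `passes` full copies of z through HBM."""
    z_bytes = ROWS * COLS * BYTES_PER_ELEM
    return passes * z_bytes / HBM_BYTES_PER_S * 1e6


unfused_passes = 7  # lesson's estimate for four separate kernels
fused_passes = 2    # lesson's estimate after fusion

saved_fraction = 1 - fused_passes / unfused_passes   # 5/7
traffic_ratio = unfused_passes / fused_passes        # 3.5

print(f"unfused: {hbm_time_us(unfused_passes):.0f} us")
print(f"fused:   {hbm_time_us(fused_passes):.0f} us")
print(f"traffic saved: {saved_fraction:.0%}, ratio: {traffic_ratio:.1f}x")
```

Under these assumptions the traffic ratio is 3.5×, which sits inside the 2–4× range quoted for the region. The real speedup depends on how much of the region's time the GEMM takes.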

Why this is hard without Triton

You have three options and they all hurt in different ways:

Option	Pros	Cons
`torch.compile`	Zero kernel code. Often fuses elementwise chains automatically.	Opaque when it doesn't fuse. Black-box debugging. Some patterns it just refuses.
Hand-written CUDA	Maximum control. Hit 100% of peak when you know what you're doing.	~1000 lines of C++ for a fused matmul. Shape-fragile. Build system. Two weeks.
Library call (CUTLASS, FlashInfer, …)	If a library covers it, this is best.	Libraries don't cover novel fusions. The custom op you needed last week isn't there.

Triton is option 4: ~100 lines of Python that compile to a fused kernel, autotune across shapes, and need no build system.

The defining trade-off

Triton hides the warp. You never write `threadIdx`, never touch `__syncthreads`, never schedule async copies by hand. The compiler does that for you, and gets it right ~80–95% of the time on modern hardware.

The one trade-off, said precisely

In CUDA, you control which thread does which operation, when warps synchronise, how shared memory is swizzled, and when async copies issue. In Triton, you write code as if the tile were a single fat vector and the compiler chooses how to schedule it onto warps. Both compile to the same hardware; the difference is where the choices are made — your code, or the compiler.
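
The "single fat vector" idea can be sketched without a GPU. The plain-Python emulation below is not Triton and makes no claim about Triton's API. The names `fused_tile_program`, `launch`, and `BLOCK` are hypothetical, and the data is flattened to 1-D for brevity. It only shows the shape of tile-level code: each program instance owns one tile of offsets, guards the ragged tail, and applies bias, GELU, and mask in one pass with no per-thread indexing.

```python
import math

BLOCK = 4  # tile width; hypothetical, kept tiny so the partial tile is visible


def gelu(v: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def fused_tile_program(pid, z, b, mask, out):
    """One program instance: handles the tile [pid*BLOCK, (pid+1)*BLOCK)."""
    offsets = [pid * BLOCK + i for i in range(BLOCK)]  # indices this tile owns
    in_bounds = [o < len(z) for o in offsets]          # guard the ragged tail
    # Read each element once, apply all three ops, write it once.
    for o, ok in zip(offsets, in_bounds):
        if ok:
            out[o] = gelu(z[o] + b[o]) * mask[o]


def launch(z, b, mask):
    """Run one program per tile, the way a grid launch would."""
    out = [0.0] * len(z)
    num_programs = (len(z) + BLOCK - 1) // BLOCK       # ceiling division
    for pid in range(num_programs):
        fused_tile_program(pid, z, b, mask, out)
    return out


z = [-1.0, 0.0, 1.0, 2.0, 3.0, -2.0]   # 6 elements: the second tile is partial
b = [0.5] * 6
mask = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
y = launch(z, b, mask)
```

Nothing in `fused_tile_program` says which thread handles which element. In this emulation a Python loop decides; in the model described above, that mapping is the compiler's job.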

What you give up

Last-mile mma scheduling. CUTLASS hand-orders `mma.sync` instructions to keep the tensor cores fed every cycle. Triton's scheduler is good but not surgical. Result: cuBLAS / CUTLASS beat Triton GEMM by 5–15% on vanilla bf16 shapes.
Warp specialisation. Hopper/Blackwell kernels split warps into producer (TMA loads) and consumer (mma compute) roles. Triton has some support, but the canonical FlashAttention-3 / FlashInfer kernels are still hand-written.
Communication primitives. Triton has no NCCL, no NVLink intrinsics, no inter-SM atomic shortcuts. Cross-GPU work belongs in higher layers.

What you gain

~10× less code. A tiled bf16 matmul is ~80 lines in Triton vs ~800 in raw CUDA (the CUDA count shrinks if you build on CUTLASS templates, but then you inherit their abstractions).
Autotune for free. One decorator (`@triton.autotune`) and Triton sweeps tile sizes, `num_warps`, and `num_stages`, then caches the winner per shape key.
Python everywhere. Your test harness, the kernel, and the wrapper are one file. No CMake, no .cu compilation, no PTXAS flags.
Portability. The same Triton source runs on NVIDIA (PTX), AMD (ROCm), and emerging backends. CUDA does not.
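
The "sweep, then cache the winner per shape key" behaviour can be illustrated with a toy decorator. This is a sketch of the idea only, not Triton's implementation or its API: `toy_autotune`, `made_up_cost`, and `scale` are hypothetical names, and the cost function is synthetic so the result is deterministic. A real autotuner times the kernel on the device.

```python
import math


def toy_autotune(configs, key, bench):
    """Toy analogue of an autotuner: on the first call for each new key,
    score every config with `bench`, then cache and reuse the cheapest."""
    def decorator(fn):
        cache = {}

        def wrapper(**args):
            shape_key = tuple(args[name] for name in key)
            if shape_key not in cache:
                cache[shape_key] = min(configs, key=lambda cfg: bench(args, cfg))
            return fn(**args, **cache[shape_key])

        wrapper.cache = cache  # exposed so the chosen configs can be inspected
        return wrapper
    return decorator


def made_up_cost(args, cfg):
    """Synthetic cost: programs launched x (tile width + fixed overhead)."""
    programs = math.ceil(args["n"] / cfg["BLOCK"])
    return programs * (cfg["BLOCK"] + 32)


CONFIGS = [{"BLOCK": 64}, {"BLOCK": 256}, {"BLOCK": 1024}]


@toy_autotune(configs=CONFIGS, key=["n"], bench=made_up_cost)
def scale(n, BLOCK):
    """Placeholder 'kernel': just reports the tile size it was run with."""
    return BLOCK


small = scale(n=100)    # first call with n=100 triggers a sweep
large = scale(n=1000)   # new key, so a second sweep
again = scale(n=100)    # cache hit, no sweep
```

With this made-up cost model the small problem settles on the smallest tile and the large problem on the largest. The point is that the choice is made per key and paid for only once.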

The one-line decision rule

Reach for Triton when all three of these are true:

No library covers the operation you need.
The operation is bandwidth-bound (fusion would save HBM bytes) or custom enough that no fused library kernel exists.
You want autotuning across multiple shapes and don't have time to hand-tune CUDA per shape.

If those don't all hold, the answer is usually:

Situation	Use
Vanilla bf16/fp16 matmul	`torch.matmul` → cuBLAS
Attention	FlashAttention-3 or FlashInfer
Elementwise chain in a model	Try `torch.compile` first; Triton if it doesn't fuse
Quantised GEMM (W4/W8)	Marlin / Machete via vLLM
The op is <1% of the step	Don't bother
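
The rule and the fallback table can be folded into one small function. This is a hedged sketch: the function name `recommend`, its parameters, and the returned strings are hypothetical, and the final fallback generalises the table's elementwise-chain row to every remaining case, which is an assumption rather than something the lesson states.

```python
def recommend(step_fraction, library_covers, fusion_or_custom, many_shapes):
    """Suggest a tool for a candidate region.

    step_fraction:    share of step time spent in the region (0.0 to 1.0)
    library_covers:   True if cuBLAS, FlashInfer, Marlin, etc. already do it
    fusion_or_custom: True if bandwidth-bound or too custom for a library
    many_shapes:      True if the kernel must be tuned across several shapes
    """
    if step_fraction < 0.01:       # "<1% of the step": don't bother
        return "skip"
    if library_covers:             # a library call beats a custom kernel
        return "library"
    if fusion_or_custom and many_shapes:
        return "triton"            # all three conditions of the rule hold
    # Assumed fallback, extrapolated from the elementwise-chain row.
    return "try torch.compile first"


# The motivating example: 18% of the step, no library, fusable, many shapes.
choice = recommend(0.18, library_covers=False, fusion_or_custom=True, many_shapes=True)
print(choice)
```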

What's next

You're sold on the lane. The question now is: what does "tile-level code" actually feel like? Lesson 02 introduces the central abstraction, programs and tiles, with a live widget that shows what each warp lane is doing when you write `tl.arange(0, BLOCK)`.