Triton — Writing GPU Kernels in Python

A linear tour of OpenAI Triton, the kernel DSL, built so you understand the tile programming model first and then build kernels (vector add → matmul → softmax → flash attention) from those primitives. The forward kernels are the front half; Part VII then writes their backward kernels, the half a general engineer needs to train a model, not just serve one.

This series of twenty-one interactive lessons unwraps Triton from scratch. Part I (lessons 01–03) covers the execution model: why Triton exists, what a "program" is, and how Python source becomes PTX. Part II (lessons 04–06) covers the DSL: pointers, masks, tl.dot, and reductions — every primitive you'll touch. Part III (lessons 07–11) walks you through five real kernels, each one a one-step elaboration on the previous: vector add, fused linear+activation, tiled matmul, online softmax, fused norm. Part IV (lesson 12) is the flagship: Flash Attention as a synthesis of every primitive. Part V (lessons 13–14) is performance and production: autotune, pipelining, profiling, backward passes, and the decision tree of when not to write Triton. Part VI (lessons 15–17) drills the same DSL under interview conditions: tile-program operations and launch flow, optimized snippets, and algorithm/data-structure patterns. Part VII (lessons 18–21) is training kernels — the backward pass: wiring forward+backward with torch.autograd.Function, the two-GEMM matmul backward, fused cross-entropy, and the norm-backward dγ/dβ reduction. Each lesson has at least one interactive widget so you can grab a knob and feel the consequence.

Who this is for

You can read Python, you've used PyTorch, and you have a rough mental model of a GPU (warps, SMs, HBM vs SRAM). You don't need to have written CUDA — Triton is what most people learn instead. If you want the CUDA-side background, the GPU Kernels for ML Engineers series (lessons 01–09) covers it from first principles.

New to GPU programming? Start here

Read 00 · Orientation first — a 4-minute map of what Triton is, the one trade-off (tiles instead of threads) that defines the language, and how the 21 lessons that follow fit together. Then dive into lesson 01.

The model you're learning

Triton is a Python-embedded DSL with one central abstraction: a program handles one tile of work. You write tile-level code; the compiler maps tiles onto warps, schedules loads against compute, and picks layouts.

Part I · The model (lessons 01–03 · why tiles, not threads)

Why Triton?

The gap between PyTorch ops and hand-written CUDA. Why most "I need a fused kernel" tasks land in this gap. The autotuner as a labor-saver. The one-line decision rule.

The tile programming model

CUDA gives you threads; Triton gives you tiles. Why hiding warps is the trade that defines the language. SIMT vs SPMD-on-tiles intuition with a live "what does each lane do" widget.
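A minimal CPU sketch of the tile idea, in plain Python rather than Triton so it runs without a GPU. The helpers `cdiv`, `vector_add_program`, and `launch` are hypothetical names for illustration; the comments mark which real Triton constructs (`tl.program_id`, `tl.arange`, masked `tl.load`/`tl.store`) each step stands in for.

```python
def cdiv(a, b):
    """Ceiling division: how many tiles of size b cover a elements."""
    return (a + b - 1) // b


def vector_add_program(pid, x, y, out, n, BLOCK):
    """One 'program': handles the tile of BLOCK elements selected by pid."""
    # Stands in for: offsets = pid * BLOCK + tl.arange(0, BLOCK)
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    # Stands in for: mask = offsets < n
    mask = [off < n for off in offsets]
    for off, ok in zip(offsets, mask):
        if ok:  # masked-off lanes neither load nor store
            out[off] = x[off] + y[off]


def launch(x, y, BLOCK=4):
    n = len(x)
    out = [0.0] * n
    grid = cdiv(n, BLOCK)        # the grid is the problem decomposition
    for pid in range(grid):      # on a GPU these programs run in parallel
        vector_add_program(pid, x, y, out, n, BLOCK)
    return out, grid


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
out, grid = launch(x, y, BLOCK=4)
print(grid, out)
```

With 10 elements and a tile of 4, the last program owns offsets 8 to 11 and the mask switches off the two lanes past the end. Nothing in the program body mentions a thread or a warp; that is the trade the lesson is about.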

The execution model

Python → Triton IR → TritonGPU IR → LLVM → PTX. What @triton.jit actually does. What tl.constexpr controls. The grid is your problem decomposition.

Part II · The DSL (lessons 04–06 · every primitive you'll touch)

Triton has a small DSL. By the end of these three lessons you'll know every op you need to write 90% of kernels — and what each one compiles to.

Pointers, masks, and boundaries

tl.load and tl.store with predicates. Why every tile needs a mask and what happens when it doesn't. Strides as the address calculator. Live "coalesce-or-not" address visualiser.
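A small sketch of "strides as the address calculator", again in plain Python as a stand-in for pointer arithmetic. `tile_offsets` and `masked_load` are hypothetical helpers; `masked_load` mimics the behaviour of `tl.load(ptr + offsets, mask=mask, other=other)` on a flat list.

```python
def tile_offsets(row0, col0, BLOCK_M, BLOCK_N, stride_m, stride_n, M, N):
    """Flat offsets and in-bounds mask for one BLOCK_M x BLOCK_N tile."""
    offsets, mask = [], []
    for i in range(BLOCK_M):
        r = row0 + i
        offsets.append([r * stride_m + (col0 + j) * stride_n
                        for j in range(BLOCK_N)])
        mask.append([r < M and (col0 + j) < N for j in range(BLOCK_N)])
    return offsets, mask


def masked_load(buf, offsets, mask, other=0.0):
    """Read buf at each offset where mask is True; use `other` elsewhere."""
    return [[buf[o] if ok else other for o, ok in zip(ro, rm)]
            for ro, rm in zip(offsets, mask)]


M, N = 3, 5
buf = [float(i) for i in range(M * N)]   # row-major 3x5, strides (N, 1)

# Tile hanging off the bottom-right corner: only element (2, 4) is in bounds.
offs, mask = tile_offsets(2, 4, 2, 2, N, 1, M, N)
corner = masked_load(buf, offs, mask)

# Same buffer viewed as its 5x3 transpose: swap the strides, move no data.
offs_t, mask_t = tile_offsets(0, 0, 2, 2, 1, N, N, M)
top_left_t = masked_load(buf, offs_t, mask_t)

print(offs, corner, top_left_t)
```

Three of the four corner offsets (15, 19, 20) point past the end of a 15-element buffer, which is exactly what an unmasked load would touch. In the row-major case a unit `stride_n` also means neighbouring lanes in a row read neighbouring addresses, the situation coalescing favours.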

tl.dot and tensor cores

When tl.dot lowers to mma/wgmma and when it falls back to FMA. Accumulation dtype rules: bf16 in, fp32 accumulate. The shape constraints that decide whether you hit tensor cores.
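To see why the rule is "bf16 in, fp32 accumulate", here is a CPU sketch that emulates bfloat16 rounding with the standard library. `to_bf16` is a hypothetical helper (round-to-nearest-even, NaN/inf handling omitted), and a Python float (fp64) stands in for the fp32 accumulator.

```python
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    Sketch only: NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)      # ties go to the even mantissa
    bits = (bits + bias) & 0xFFFF0000       # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


ones = [to_bf16(1.0)] * 1000    # bf16 inputs, all exactly 1.0

acc_bf16 = 0.0                  # accumulator rounded to bf16 after every add
acc_wide = 0.0                  # wide accumulator (stand-in for fp32)
for v in ones:
    acc_bf16 = to_bf16(acc_bf16 + v)
    acc_wide = acc_wide + v

print(acc_bf16, acc_wide)
```

bfloat16 carries 8 significant bits, so once the running sum reaches 256 the next representable value is 258 and adding 1.0 rounds straight back to 256. The narrow accumulator stalls there while the wide one reaches 1000.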

Reductions and online algorithms

tl.sum, tl.max, tl.cumsum. How a tile reduction lowers to warp shuffles + SMEM. The online softmax recurrence (Milakov-Gimelshein) — your first taste of why Flash Attention works.
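A scalar sketch of the online softmax recurrence, in plain Python. It keeps a running max `m` and a running normalizer `l`, rescaling `l` whenever the max moves. The function names are hypothetical, and the edge case of a row that is entirely `-inf` is not handled.

```python
import math


def online_softmax_stats(xs):
    """Single pass over xs: returns (max, sum of exp(x - max))."""
    m = float("-inf")
    l = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        l = l * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, l


def softmax_online(xs):
    m, l = online_softmax_stats(xs)
    return [math.exp(x - m) / l for x in xs]


def softmax_reference(xs):
    """Separate max pass, then sum pass, then normalize."""
    m = max(xs)
    l = sum(math.exp(x - m) for x in xs)
    return [math.exp(x - m) / l for x in xs]


xs = [1.0, 3.0, -2.0, 1000.0, 999.0]
p_online = softmax_online(xs)
p_ref = softmax_reference(xs)
print(p_online)
```

The factor `math.exp(m - m_new)` is the whole trick: every term already in `l` was computed against the old max, so one multiply brings them all up to date. The large inputs are there to show that no exponent ever sees a positive argument.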

Part III · Building real kernels (lessons 07–11 · five canonical examples in order of complexity)

Five kernels you'd ship in a production stack. Each is a one-step elaboration on the previous — read in order and the last one (RMSNorm) is straightforward; skip the order and it isn't.

    1D, 1 op                 epilogue fusion             2D + K-loop
    vector_add  ───────▶  fused_linear_act  ───────▶  tiled_matmul
                                                            │
                                                            │  online reduction
                                                            ▼
                                                       softmax
                                                            │
                                                            │  stat + scale fused
                                                            ▼
                                                       rms_norm

Vector add — your first kernel

End to end: @triton.jit, grid, launch, mask the tail, autotune one config. Verify against PyTorch. Benchmark with do_bench. The minimum a Triton kernel can be.

Fused linear + bias + GELU

3 launches collapse to 1. Why saving one round-trip to HBM is worth more than any math optimisation at this scale. The epilogue-fusion pattern that shows up in every transformer kernel.
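A back-of-envelope sketch of the round-trip argument. It is a deliberately crude model with loud assumptions: the three launches are matmul, bias add, and GELU as separate kernels; only traffic on the [M, N] activation is counted (reads of X, W, and the bias vector are ignored); and nothing stays in cache between launches. `activation_traffic_bytes` is a hypothetical helper.

```python
def activation_traffic_bytes(M, N, bytes_per_elem=2):
    """HBM bytes moved for the [M, N] activation, unfused vs fused."""
    tile = M * N * bytes_per_elem
    # Unfused: matmul writes it, bias add reads + writes it,
    # GELU reads + writes it  ->  5 trips over the activation.
    unfused = 5 * tile
    # Fused: bias and GELU run on the accumulator before the single write.
    fused = 1 * tile
    return unfused, fused


unfused, fused = activation_traffic_bytes(4096, 4096)   # bf16 activation
print(unfused, fused, unfused // fused)
```

Under these assumptions the fused kernel moves one fifth of the activation bytes, and the elementwise math itself is essentially free next to that traffic.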

Tiled matmul — the canonical GEMM

The K-loop. Output tile in registers. Accumulating in fp32. Boundary masks per axis. Why cuBLAS still wins for vanilla bf16 — and exactly where Triton catches up.
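The loop structure of a tiled GEMM, sketched on the CPU with nested lists. The two outer loops play the role of the 2D launch grid, `acc` is the output tile that would live in registers, and the `if` is the per-axis boundary mask. `tiled_matmul` and the tile sizes are illustrative, not a Triton kernel.

```python
def tiled_matmul(A, B, BM=2, BN=2, BK=2):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, BM):              # program id along M
        for n0 in range(0, N, BN):          # program id along N
            acc = [[0.0] * BN for _ in range(BM)]   # output tile accumulator
            for k0 in range(0, K, BK):      # the K-loop
                for i in range(BM):
                    for j in range(BN):
                        for k in range(BK):
                            r, c, kk = m0 + i, n0 + j, k0 + k
                            # Boundary masks on all three axes.
                            if r < M and c < N and kk < K:
                                acc[i][j] += A[r][kk] * B[kk][c]
            for i in range(BM):             # one masked store per output tile
                for j in range(BN):
                    if m0 + i < M and n0 + j < N:
                        C[m0 + i][n0 + j] = acc[i][j]
    return C


A = [[float(i + j) for j in range(5)] for i in range(3)]       # 3 x 5
B = [[float(i * j + 1) for j in range(3)] for i in range(5)]   # 5 x 3
C = tiled_matmul(A, B)
print(C)
```

None of M, N, or K is a multiple of the tile size here, so every axis exercises its mask. The output tile is written once, after the K-loop finishes.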

Softmax — the online reduction

3-pass naive → 2-pass numerically stable → 1-pass online (Milakov-Gimelshein). The recurrence you'll reuse in lesson 12. Live "watch the running max chase the true max" widget.

RMSNorm — fused stat + scale

Fuse the mean-of-squares reduction with the rescale and the weight multiply. Roughly halves HBM traffic vs eager PyTorch. The pattern generalises to LayerNorm, group norm, anything stat+scale.
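A one-row reference for what the fused kernel computes, in plain Python. RMSNorm's row statistic is the mean of squares with no mean subtraction. The name `rms_norm_row` and the `eps` default are assumptions for illustration.

```python
import math


def rms_norm_row(x, weight, eps=1e-6):
    """y_j = x_j / sqrt(mean(x^2) + eps) * weight_j for a single row."""
    mean_sq = sum(v * v for v in x) / len(x)    # the reduction (stat)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Rescale and weight multiply in the same pass over the row (scale).
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [3.0, 4.0, 0.0, -5.0]
y_unit = rms_norm_row(x, [1.0] * 4, eps=0.0)
y_weighted = rms_norm_row(x, [2.0, 2.0, 2.0, 2.0], eps=0.0)
print(y_unit)
```

In a fused kernel the row is loaded once, reduced, and scaled while it is still on chip. With unit weights and `eps=0.0` the output has a mean square of 1, which makes a handy sanity check.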

Part IV · The flagship (lesson 12 · everything composes)

Flash Attention — block-tiled attention

The O(N²) materialisation that kills HBM bandwidth. Block the Q rows, stream the K/V columns, keep the running softmax (m, ℓ) in registers, apply the one-line correction. The synthesis of every primitive from Parts II–III. Live "watch the tile sweep" widget.
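A sketch of the streaming step for a single query row, in plain Python. It walks K/V in blocks, keeps the running `(m, l)` pair plus an output accumulator, and applies the correction factor whenever the max changes. The function names and block size are hypothetical, and a real kernel processes a block of query rows per program rather than one.

```python
import math


def attention_row_streamed(q, K, V, block=2):
    scale = 1.0 / math.sqrt(len(q))
    m, l = float("-inf"), 0.0
    acc = [0.0] * len(V[0])                  # unnormalized output row
    for start in range(0, len(K), block):    # stream the K/V blocks
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(s))
        alpha = math.exp(m - m_new)          # the one-line correction
        p = [math.exp(si - m_new) for si in s]
        l = l * alpha + sum(p)
        acc = [a * alpha + sum(pj * v[c] for pj, v in zip(p, Vb))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


def attention_row_reference(q, K, V):
    """Materializes the full score row, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    p = [math.exp(si - m) for si in s]
    l = sum(p)
    return [sum(pj * v[c] for pj, v in zip(p, V)) / l for c in range(len(V[0]))]


q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [4.0, -4.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [3.0, 3.0, 3.0],
     [-1.0, 0.5, 0.0], [2.0, -2.0, 1.0]]
o_stream = attention_row_streamed(q, K, V, block=2)
o_ref = attention_row_reference(q, K, V)
print(o_stream)
```

The streamed version never holds more than `block` scores at a time, yet matches the reference to rounding error. The same `alpha` that rescales `l` also rescales `acc`, which is the only addition over the plain online softmax.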

Part V · Performance & production (lessons 13–14 · shipping it)

Autotune, num_stages, and pitfalls

The autotuner: key, configs, cache. num_warps vs num_stages demystified — software pipelining is what makes tl.dot-heavy kernels fast. The full pitfall checklist: register spill, bf16 accum, missing mask, stale cache.

Production — backward, profile, when NOT

torch.autograd.Function with explicit Triton forward + backward. Profiling: do_bench, Nsight Compute, dumping TTGIR/PTX. The decision tree: Triton vs CUDA vs torch.compile vs library.

Part VI · Interview conditions (lessons 15–17 · the same DSL, drilled)

The same tile model from Parts I–V, now the way a kernel interview asks for it: what you write from memory, the snippets worth knowing cold, and how to reason out a Triton kernel on the spot.

Common operations & launch flow

The tile-program surface you should write without looking it up: tl.program_id, tl.arange, masked tl.load/tl.store, strides, tl.constexpr, num_warps/num_stages, and the launch grid.

Highly optimized Triton snippets

The snippets worth memorizing: fused bias+GELU, the row-softmax and RMSNorm tiles, tl.dot matmul with an fp32 accumulator, and the autotune key that makes them fast at real shapes.

Algorithms, data structures & the line of thinking

Reasoning out a kernel live: top-k, segmented reductions, block-sparse metadata, and online softmax — one program per row, the tile as the data structure, the mask as correctness.

Part VII · Training kernels — the backward pass (lessons 18–21 · write the gradient kernels)

Parts I–VI built forward kernels. A model that learns needs their backward twins. These four lessons write them in Triton: the autograd wiring, the two-GEMM matmul backward, the fused cross-entropy that keeps the giant logits off HBM, and the norm-backward reduction. For the deepest conceptual treatment of each (and of FlashAttention backward, the hardest case), see the training Part of the companion GPU Kernels series (25–30), which derives the engineering reasoning; these lessons are the runnable Triton.

Backward in Triton — autograd.Function

A forward kernel and a separate backward kernel, wired with torch.autograd.Function and ctx.save_for_backward. The backward grid, save-vs-recompute, fp32 grad accumulation, and the order you must return gradients.

Matmul backward — two tl.dot kernels

dX = dY·Wᵀ and dW = Xᵀ·dY as two tl.dot kernels. The transpose done by swapped strides, not a transpose kernel. Why dW contracts the batch dim → split-K with tl.atomic_add or a reduction kernel.
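The two formulas can be checked on the CPU with nested lists, assuming the forward pass is Y = X·W. Because the probe loss `sum(Y * dY)` is linear in X and in W, bumping one entry by 1.0 changes the loss by exactly the corresponding gradient entry. All helper names here are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul_backward(X, W, dY):
    dX = matmul(dY, transpose(W))    # [M, N] x [N, K] -> [M, K]
    dW = matmul(transpose(X), dY)    # [K, M] x [M, N] -> [K, N], contracts M
    return dX, dW


def loss(X, W, dY):
    """Probe loss whose gradient with respect to Y is exactly dY."""
    Y = matmul(X, W)
    return sum(y * g for yr, gr in zip(Y, dY) for y, g in zip(yr, gr))


def bump(A, i, j):
    B = [row[:] for row in A]
    B[i][j] += 1.0
    return B


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # M=2, K=3
W = [[1.0, 0.0], [2.0, -1.0], [0.0, 3.0]]     # K=3, N=2
dY = [[1.0, 2.0], [3.0, 4.0]]                 # M=2, N=2

dX, dW = matmul_backward(X, W, dY)
fd_dx = loss(bump(X, 1, 2), W, dY) - loss(X, W, dY)
fd_dw = loss(X, bump(W, 2, 0), dY) - loss(X, W, dY)
print(dX, dW, fd_dx, fd_dw)
```

The dW product sums over the M (batch) axis, which is the long one in training. That is the contraction split-K parallelizes. This sketch uses an explicit `transpose` for clarity; a kernel would get the same view by swapping strides.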

Fused cross-entropy — the memory kernel

Online-softmax the logits per row, write dlogits = (p − onehot)/N in place, and chunk the lm-head matmul so the [B·T, V] logits never materialize. The biggest training-memory win, in Triton.
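A reference for the gradient formula only, in plain Python, checked against a central finite difference. It assumes a mean-reduced loss over N rows, returns a new list rather than writing in place, and does not attempt the chunked lm-head matmul. `cross_entropy_fwd_bwd` and `finite_diff` are hypothetical names.

```python
import math


def cross_entropy_fwd_bwd(logits, targets):
    """Mean cross-entropy and dlogits = (p - onehot) / N."""
    N = len(logits)
    total, dlogits = 0.0, []
    for row, t in zip(logits, targets):
        m = max(row)
        l = sum(math.exp(v - m) for v in row)
        total += -(row[t] - m - math.log(l))        # -log softmax(row)[t]
        p = [math.exp(v - m) / l for v in row]
        dlogits.append([(pj - (1.0 if j == t else 0.0)) / N
                        for j, pj in enumerate(p)])
    return total / N, dlogits


def finite_diff(logits, targets, i, j, h=1e-5):
    up = [row[:] for row in logits]
    dn = [row[:] for row in logits]
    up[i][j] += h
    dn[i][j] -= h
    return (cross_entropy_fwd_bwd(up, targets)[0]
            - cross_entropy_fwd_bwd(dn, targets)[0]) / (2 * h)


logits = [[2.0, 1.0, 0.1], [0.5, 2.5, -1.0]]
targets = [0, 1]
loss_value, dlogits = cross_entropy_fwd_bwd(logits, targets)
fd_target = finite_diff(logits, targets, 0, 0)
fd_other = finite_diff(logits, targets, 1, 2)
print(loss_value, dlogits)
```

Each gradient row sums to zero, since the probabilities sum to one and the one-hot removes exactly one. The gradient has the same shape as the logits, which is why a fused kernel can overwrite them in place.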

Norm backward — the dγ/dβ reduction

dx recomputes the row statistic; dγ/dβ sum across every token. The classic Triton partial-buffer + reduce (or lock / atomic) pattern, and the determinism trade. Activation derivative fused into the epilogue.
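A CPU sketch of the partial-buffer + reduce pattern for dγ, assuming the RMSNorm forward y = x · inv_rms · γ so that dγ_j is the sum over rows of dy · x · inv_rms. Each simulated program owns a group of rows and writes one row of `partial`; a second step reduces the buffer. The names and the `rows_per_program` knob are hypothetical.

```python
import math


def dgamma_partial_reduce(X, dY, rows_per_program=2, eps=1e-6):
    n_rows, n_cols = len(X), len(X[0])
    n_programs = (n_rows + rows_per_program - 1) // rows_per_program
    partial = [[0.0] * n_cols for _ in range(n_programs)]

    # Stage 1: each program accumulates its own rows into its own slot.
    for pid in range(n_programs):
        lo = pid * rows_per_program
        hi = min(lo + rows_per_program, n_rows)
        for r in range(lo, hi):
            # Recompute the row statistic instead of loading a saved one.
            inv_rms = 1.0 / math.sqrt(sum(v * v for v in X[r]) / n_cols + eps)
            for j in range(n_cols):
                partial[pid][j] += dY[r][j] * X[r][j] * inv_rms

    # Stage 2: reduce the partial buffer in a fixed order.
    dgamma = [sum(partial[p][j] for p in range(n_programs))
              for j in range(n_cols)]
    return dgamma, partial


X = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [4.0, 0.0, -2.0],
     [1.5, 1.5, 1.5], [-3.0, 2.0, 1.0]]
dY = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5], [0.0, 0.4, 0.2],
      [0.3, 0.3, 0.3], [-0.5, 0.1, 0.7]]
dgamma, partial = dgamma_partial_reduce(X, dY, rows_per_program=2)
print(dgamma)
```

Because stage 2 always sums the slots in the same order, the result is reproducible run to run. Atomic adds into a single buffer skip the second stage, but the floating-point summation order then depends on scheduling, which is the determinism trade.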

How to use this

Linearly. Each lesson assumes the previous. Lesson 12 (Flash Attention) uses every primitive from lessons 04–10; skip them and it won't make sense.
Run every kernel. The lessons include complete, runnable code. Paste it into a Colab with a T4 or better and time it. The point is the wall-clock surprise.
Touch every knob. Every widget has a setting that makes the kernel wrong or slow. Find it. The bugs are the lesson.

Companion lessons

Triton sits on top of CUDA. If you hit a concept here that needs the GPU-internals view, lessons 01–09 of GPU Kernels cover SMs, warps, memory hierarchy, occupancy, and tensor cores from first principles. Lesson 22 of that series is the elevator-pitch version of this whole track.