all_lessons / Triton kernels / lessons / index 14 lessons + orientation · ~5h read

Triton — Writing GPU Kernels in Python

A linearized tour of OpenAI Triton, the kernel DSL — built so you understand the tile programming model first, then build kernels (vector add → matmul → softmax → flash attention) from those primitives.

This series of fourteen interactive lessons unwraps Triton from scratch. Part I (lessons 01–03) covers the execution model: why Triton exists, what a "program" is, and how Python source becomes PTX. Part II (lessons 04–06) covers the DSL: pointers, masks, tl.dot, and reductions — every primitive you'll touch. Part III (lessons 07–11) walks you through five real kernels, each one a one-step elaboration on the previous: vector add, fused linear+activation, tiled matmul, online softmax, fused norm. Part IV (lesson 12) is the flagship: Flash Attention as a synthesis of every primitive. Part V (lessons 13–14) is performance and production: autotune, pipelining, profiling, backward passes, and the decision tree of when not to write Triton. Each lesson has at least one interactive widget so you can grab a knob and feel the consequence.

Who this is for
You can read Python, you've used PyTorch, and you have a rough mental model of a GPU (warps, SMs, HBM vs SRAM). You don't need to have written CUDA — Triton is what most people learn instead. If you want the CUDA-side background, the GPU Kernels for LLM Serving series (lessons 01–09) covers it from first principles.
New to GPU programming? Start here
Read 00 · Orientation first — a 4-minute map of what Triton is, the one trade-off (tiles instead of threads) that defines the language, and how the 14 lessons that follow fit together. Then dive into lesson 01.

The model you're learning

Triton is a Python-embedded DSL with one central abstraction: a program handles one tile of work. You write tile-level code; the compiler maps tiles onto warps, schedules loads against compute, and picks layouts. Hover a stage to see its job.

PYTHON @jit staged at compile time Triton IR tile-level ops TritonGPU IR layouts · warps · pipeline LLVM → PTX cached binary per shape grid of programs tl.program_id(axis) one tile per program BLOCK_M × BLOCK_N autotune per shape configs cached by key=[...] tile → warps → tensor cores → HBM (the compiler decides)

Part I · The model (lessons 01–03 · why tiles, not threads)

01
Why Triton?
The gap between PyTorch ops and hand-written CUDA. Why most "I need a fused kernel" tasks land in this gap. The autotuner as a labor-saver. The one-line decision rule.
02
The tile programming model
CUDA gives you threads; Triton gives you tiles. Why hiding warps is the trade that defines the language. SIMT vs SPMD-on-tiles intuition with a live "what does each lane do" widget.
03
The execution model
Python → Triton IR → TritonGPU IR → LLVM → PTX. What @triton.jit actually does. What tl.constexpr controls. The grid is your problem decomposition.

Part II · The DSL (lessons 04–06 · every primitive you'll touch)

Triton has a small DSL. By the end of these three lessons you'll know every op you need to write 90% of kernels — and what each one compiles to.

04
Pointers, masks, and boundaries
tl.load and tl.store with predicates. Why every tile needs a mask and what happens when it doesn't. Strides as the address calculator. Live "coalesce-or-not" address visualiser.
05
tl.dot and tensor cores
When tl.dot lowers to mma/wgmma and when it falls back to FMA. Accumulation dtype rules: bf16 in, fp32 accumulate. The shape constraints that decide whether you hit tensor cores.
06
Reductions and online algorithms
tl.sum, tl.max, tl.cumsum. How a tile reduction lowers to warp shuffles + SMEM. The online softmax recurrence (Milakov-Gimelshein) — your first taste of why Flash Attention works.

Part III · Building real kernels (lessons 07–11 · five canonical examples in order of complexity)

Five kernels you'd ship in a production stack. Each is a one-step elaboration on the previous — read in order and the last one (RMSNorm) is straightforward; skip the order and it isn't.

    1D, 1 op                 epilogue fusion             2D + K-loop
    vector_add  ───────▶  fused_linear_act  ───────▶  tiled_matmul
                                                            │
                                                            │  online reduction
                                                            ▼
                                                       softmax
                                                            │
                                                            │  stat + scale fused
                                                            ▼
                                                       rms_norm
07
Vector add — your first kernel
End to end: @jit, grid, launch, mask the tail, autotune one config. Verify against PyTorch. Benchmark with do_bench. The minimum a Triton kernel can be.
08
Fused linear + bias + GELU
3 launches collapse to 1. Why saving one round-trip to HBM is worth more than any math optimisation at this scale. The epilogue-fusion pattern that shows up in every transformer kernel.
09
Tiled matmul — the canonical GEMM
The K-loop. Output tile in registers. Accumulating in fp32. Boundary masks per axis. Why cuBLAS still wins for vanilla bf16 — and exactly where Triton catches up.
10
Softmax — the online reduction
3-pass naive → 2-pass numerically stable → 1-pass online (Milakov-Gimelshein). The recurrence you'll reuse in lesson 12. Live "watch the running max chase the true max" widget.
11
RMSNorm — fused stat + scale
Fuse the variance reduction with the rescale and the weight multiply. Halves bandwidth vs eager PyTorch. The pattern generalises to LayerNorm, group norm, anything stat+scale.

Part IV · The flagship (lesson 12 · everything composes)

12
Flash Attention — block-tiled attention
The O(N²) materialisation that kills HBM bandwidth. Block the Q rows, stream the K/V columns, keep the running softmax (m, ℓ) in registers, apply the one-line correction. The synthesis of every primitive from Parts II–III. Live "watch the tile sweep" widget.

Part V · Performance & production (lessons 13–14 · shipping it)

13
Autotune, num_stages, and pitfalls
The autotuner: key, configs, cache. num_warps vs num_stages demystified — software pipelining is what makes tl.dot-heavy kernels fast. The full pitfall checklist: register spill, bf16 accum, missing mask, stale cache.
14
Production — backward, profile, when NOT
torch.autograd.Function with explicit Triton forward + backward. Profiling: do_bench, Nsight Compute, dumping TTGIR/PTX. The decision tree: Triton vs CUDA vs torch.compile vs library.

How to use this

  1. Linearly. Each lesson assumes the previous. Lesson 12 (Flash Attention) literally calls every primitive from lessons 04–10; skip them and it won't read.
  2. Run every kernel. The lessons include complete, runnable code. Paste it into a Colab with a T4 or better and time it. The point is the wall-clock surprise.
  3. Touch every knob. Every widget has a setting that makes the kernel wrong or slow. Find it. The bugs are the lesson.
Companion lessons
Triton sits on top of CUDA. If you hit a concept here that needs the GPU-internals view, lessons 01–09 of GPU Kernels cover SMs, warps, memory hierarchy, occupancy, and tensor cores from first principles. Lesson 22 of that series is the elevator-pitch version of this whole track.