Vector add — your first kernel

The "hello world" of GPU programming. We use it to walk the full Triton workflow end to end: jit, grid, launch, autotune, verify against PyTorch, benchmark with do_bench. By the end you have a kernel you can paste into a Colab.

The complete kernel — 13 lines

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(a_ptr, b_ptr, c_ptr, N, BLOCK: tl.constexpr):
    pid  = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    m    = offs < N
    x    = tl.load(a_ptr + offs, mask=m, other=0.0)
    y    = tl.load(b_ptr + offs, mask=m, other=0.0)
    tl.store(c_ptr + offs, x + y, mask=m)

That's the whole kernel. Every primitive from lesson 04 is here: program_id, arange, mask, tl.load with other, tl.store with mask.

The PyTorch-facing wrapper

def vector_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.is_cuda and b.is_cuda and a.shape == b.shape
    N = a.numel()
    c = torch.empty_like(a)
    BLOCK = 1024
    grid  = (triton.cdiv(N, BLOCK),)
    add_kernel[grid](a, b, c, N, BLOCK=BLOCK)
    return c

Two notable things:

The launch syntax add_kernel[grid](...) uses Python's __getitem__: the grid goes in [], the kernel args in (). Triton overloads this on the @jit object.
BLOCK is passed as a keyword. Because it's marked tl.constexpr in the kernel, this triggers compilation.

Verify against PyTorch — always

a = torch.randn(1_000_000, device='cuda', dtype=torch.float32)
b = torch.randn(1_000_000, device='cuda', dtype=torch.float32)

c_triton = vector_add(a, b)
c_torch  = a + b

torch.testing.assert_close(c_triton, c_torch)        # exact for fp32 add

For bf16/fp16 you'd use assert_close(... atol=1e-3, rtol=1e-2) — same as you'd use for any GPU reduction comparison.

Always test at "ugly" sizes

Test N = 1_000_000 (not a power of two) and N = 1024 (exact tile size) and N = 1 (tiny). The first checks the boundary mask. The second checks the no-boundary case. The third checks you didn't accidentally rely on a tile size larger than your data.

Benchmark with do_bench

Triton ships a microbenchmarker that handles warmup and the noise floor:

ms = triton.testing.do_bench(lambda: vector_add(a, b), warmup=25, rep=100)
N_bytes = 3 * a.numel() * a.element_size()   # 2 loads + 1 store
bw_gbps = N_bytes / (ms * 1e-3) / 1e9
print(f"{ms:.3f} ms  ·  {bw_gbps:.0f} GB/s")

For a 1M-element fp32 add on an H100 (peak HBM ~3 TB/s) you should see ~2700 GB/s — close to peak because the kernel is purely bandwidth-bound. If you see <50% of peak, the kernel is launch-bound (try a bigger N) or your BLOCK is too small.

Adding autotune — picking BLOCK for you

Different N's like different BLOCK sizes. Rather than hardcode 1024, let the autotuner choose:

@triton.autotune(
    configs=[
        triton.Config({'BLOCK':  256}, num_warps=2),
        triton.Config({'BLOCK':  512}, num_warps=4),
        triton.Config({'BLOCK': 1024}, num_warps=4),
        triton.Config({'BLOCK': 2048}, num_warps=8),
    ],
    key=['N'],
)
@triton.jit
def add_kernel(a_ptr, b_ptr, c_ptr, N, BLOCK: tl.constexpr):
    # ... same body ...

def vector_add(a, b):
    N = a.numel()
    c = torch.empty_like(a)
    grid = lambda meta: (triton.cdiv(N, meta['BLOCK']),)
    add_kernel[grid](a, b, c, N)               # no BLOCK arg — autotune fills it
    return c

The first call for each N in the key triggers a benchmarking pass — Triton compiles all four configs, runs each ~5 times, and picks the fastest. Subsequent calls with the same N hit the cache.

N=1_000_000 → BLOCK=1024, num_warps=4. Subsequent calls with the same N: instant lookup. Different N (e.g. 16M): fresh sweep, picks whatever is best for that size.

What can go wrong on a vector add

Symptom	Cause	Fix
Wrong output at N = 1_000_000 but right at N = 1024	Forgot the mask	Add `mask=offs < N, other=0.` to every load/store.
Segfault at large N	Mask missing OR you passed a non-contiguous tensor	Mask, and `a = a.contiguous()` before the call.
Only ~10% of peak bandwidth	Launch-bound (kernel runs too fast) or BLOCK too small	Increase BLOCK (try 4096) or batch more work per kernel.
First call takes 200 ms	Autotune sweep on first hit	Run a warmup call before timing.
`RuntimeError: index out of bounds` on launch	Grid uses an old BLOCK after autotune changed it	Use `grid = lambda meta: (cdiv(N, meta['BLOCK']),)` so grid reads BLOCK from the meta dict.

Reading the roofline — is this kernel fast?

A vector add does 1 floating-point op per 3 memory accesses (2 loads, 1 store). That's an arithmetic intensity of 1/12 flop per byte for fp32 — far below the roofline knee. The kernel is bandwidth-bound, and the right ceiling is HBM peak, not flops peak.

flops/byte (fp32)

~0.08

H100 HBM peak

~3.0 TB/s

target wall-clock at N=4M fp32

~16 µs

if you measure 30 µs

launch-bound; increase N

Interactive · tune your add kernel

Pick BLOCK and num_warps; see the predicted bandwidth on a 4-million-fp32 add and where each setting hurts.

What's next

Vector add is bandwidth-bound and trivial to write. Lesson 08 picks a more typical workload — a fused linear + bias + GELU — that only wins through fusion. You'll see eager PyTorch's 3 launches collapse into 1 Triton kernel, and feel the bandwidth saving in your fingertips.