all_lessons / Triton kernels / lessons / 07 · vector add lesson 07 / 14

Vector add — your first kernel

The "hello world" of GPU programming. We use it to walk the full Triton workflow end to end: jit, grid, launch, autotune, verify against PyTorch, benchmark with do_bench. By the end you have a kernel you can paste into a Colab.

The complete kernel — 13 lines

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(a_ptr, b_ptr, c_ptr, N, BLOCK: tl.constexpr):
    pid  = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    m    = offs < N
    x    = tl.load(a_ptr + offs, mask=m, other=0.0)
    y    = tl.load(b_ptr + offs, mask=m, other=0.0)
    tl.store(c_ptr + offs, x + y, mask=m)

That's the whole kernel. Every primitive from lesson 04 is here: program_id, arange, mask, tl.load with other, tl.store with mask.

The PyTorch-facing wrapper

def vector_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.is_cuda and b.is_cuda and a.shape == b.shape
    N = a.numel()
    c = torch.empty_like(a)
    BLOCK = 1024
    grid  = (triton.cdiv(N, BLOCK),)
    add_kernel[grid](a, b, c, N, BLOCK=BLOCK)
    return c

Two notable things:

Verify against PyTorch — always

a = torch.randn(1_000_000, device='cuda', dtype=torch.float32)
b = torch.randn(1_000_000, device='cuda', dtype=torch.float32)

c_triton = vector_add(a, b)
c_torch  = a + b

torch.testing.assert_close(c_triton, c_torch)        # exact for fp32 add

For bf16/fp16 you'd use assert_close(... atol=1e-3, rtol=1e-2) — same as you'd use for any GPU reduction comparison.

Always test at "ugly" sizes
Test N = 1_000_000 (not a power of two) and N = 1024 (exact tile size) and N = 1 (tiny). The first checks the boundary mask. The second checks the no-boundary case. The third checks you didn't accidentally rely on a tile size larger than your data.

Benchmark with do_bench

Triton ships a microbenchmarker that handles warmup and the noise floor:

ms = triton.testing.do_bench(lambda: vector_add(a, b), warmup=25, rep=100)
N_bytes = 3 * a.numel() * a.element_size()   # 2 loads + 1 store
bw_gbps = N_bytes / (ms * 1e-3) / 1e9
print(f"{ms:.3f} ms  ·  {bw_gbps:.0f} GB/s")

For a 1M-element fp32 add on an H100 (peak HBM ~3 TB/s) you should see ~2700 GB/s — close to peak because the kernel is purely bandwidth-bound. If you see <50% of peak, the kernel is launch-bound (try a bigger N) or your BLOCK is too small.

Adding autotune — picking BLOCK for you

Different N's like different BLOCK sizes. Rather than hardcode 1024, let the autotuner choose:

@triton.autotune(
    configs=[
        triton.Config({'BLOCK':  256}, num_warps=2),
        triton.Config({'BLOCK':  512}, num_warps=4),
        triton.Config({'BLOCK': 1024}, num_warps=4),
        triton.Config({'BLOCK': 2048}, num_warps=8),
    ],
    key=['N'],
)
@triton.jit
def add_kernel(a_ptr, b_ptr, c_ptr, N, BLOCK: tl.constexpr):
    # ... same body ...

def vector_add(a, b):
    N = a.numel()
    c = torch.empty_like(a)
    grid = lambda meta: (triton.cdiv(N, meta['BLOCK']),)
    add_kernel[grid](a, b, c, N)               # no BLOCK arg — autotune fills it
    return c

The first call for each N in the key triggers a benchmarking pass — Triton compiles all four configs, runs each ~5 times, and picks the fastest. Subsequent calls with the same N hit the cache.

First call with N=1_000_000 → autotuner sweep BLOCK=256 0.42 ms BLOCK=512 0.31 ms BLOCK=1024 ✓ 0.27 ms (best) BLOCK=2048 0.28 ms Result cached as N=1_000_000 → BLOCK=1024, num_warps=4. Subsequent calls with the same N: instant lookup. Different N (e.g. 16M): fresh sweep, picks whatever is best for that size.

What can go wrong on a vector add

SymptomCauseFix
Wrong output at N = 1_000_000 but right at N = 1024Forgot the maskAdd mask=offs < N, other=0. to every load/store.
Segfault at large NMask missing OR you passed a non-contiguous tensorMask, and a = a.contiguous() before the call.
Only ~10% of peak bandwidthLaunch-bound (kernel runs too fast) or BLOCK too smallIncrease BLOCK (try 4096) or batch more work per kernel.
First call takes 200 msAutotune sweep on first hitRun a warmup call before timing.
RuntimeError: index out of bounds on launchGrid uses an old BLOCK after autotune changed itUse grid = lambda meta: (cdiv(N, meta['BLOCK']),) so grid reads BLOCK from the meta dict.

Reading the roofline — is this kernel fast?

A vector add does 1 floating-point op per 3 memory accesses (2 loads, 1 store). That's an arithmetic intensity of 1/12 flop per byte for fp32 — far below the roofline knee. The kernel is bandwidth-bound, and the right ceiling is HBM peak, not flops peak.

flops/byte (fp32)
~0.08
H100 HBM peak
~3.0 TB/s
target wall-clock at N=4M fp32
~16 µs
if you measure 30 µs
launch-bound; increase N

Interactive · tune your add kernel

Pick BLOCK and num_warps; see the predicted bandwidth on a 4-million-fp32 add and where each setting hurts.

Vector add bandwidth predictor

Model: bandwidth ≈ HBM_peak · efficiency(BLOCK) · 1 − launch_overhead(BLOCK). The sweet spot moves with N.

What's next

Vector add is bandwidth-bound and trivial to write. Lesson 08 picks a more typical workload — a fused linear + bias + GELU — that only wins through fusion. You'll see eager PyTorch's 3 launches collapse into 1 Triton kernel, and feel the bandwidth saving in your fingertips.