GPU Kernels for LLM Serving
Part I (01–09): CUDA from first principles.
Part II (10–17): LLM serving built on that foundation: FlashAttention; paged KV cache; prefix reuse; continuous batching and CUDA graphs; quantized GEMM, MoE, and sampling; framework synthesis.
Part III (18–24): the optimization loop: PyTorch dispatch; profiling; reading the roofline; Triton; torch.compile; performant patterns.