Lessons · a linearized library
11 series · 160 lessons · ~54 hours

A linearized lesson library

Read one series end-to-end and you can hold the system in your head.

Each track is a strict-linear chain: every lesson assumes only the lessons before it, every concept is justified from first principles, and every trade-off comes with quantitative reasoning. The library covers post-training RL, GPU kernels for LLM serving, distributed training, generative modelling, and the durable ML interview canon.

160 lesson pages
11 series
~54 h read time
3 interview tracks

Where to start

Pick the goal that fits. Each path threads through one or two series in order.

flagship

GPU Kernels for LLM Serving

Part I (01–09): CUDA from first principles. Part II (10–17): serving on top — FlashAttention, paged KV, prefix reuse, continuous batching + CUDA graphs, quantized GEMM & MoE & sampling, framework synthesis. Part III (18–24): the optimization loop — PyTorch dispatch, profiling, reading the roofline, Triton, torch.compile, performant patterns.

CUDA · FlashAttention · paged KV · radix cache · Triton · torch.compile
24 lessons · ~10 h

new

Triton (OpenAI) — Writing GPU Kernels in Python

The kernel DSL (OpenAI Triton), not the NVIDIA inference server. Part I (01–03): the tile programming model, the execution pipeline, why hiding warps is the trade. Part II (04–06): the DSL — pointers/masks, tl.dot + tensor cores, reductions and the online softmax recurrence. Part III (07–11): five canonical kernels — vector add → fused linear+activation → tiled matmul → softmax → RMSNorm. Part IV (12): Flash Attention as the synthesis. Part V (13–14): autotune, software pipelining, pitfalls, backward passes, profiling, and the decision tree of when to write Triton.

tile model · tl.dot · online softmax · Flash Attention · autotune · num_stages
14 lessons · ~5 h

RL Post-Training

The system roles, algorithms, and production recipes behind RLHF, RLVR, PPO, GRPO, RLOO, DAPO, and related reasoning-model training loops.

PPO · GRPO · RLHF · DPO · rollouts · verifiers
24 lessons · ~10 h

System ML

Distributed training and inference: collectives, interconnect, DDP/FSDP/TP/PP/SP/EP, 3D parallelism, PyTorch internals, mixed precision, caching allocator, kernel fusion, Triton, torch.compile, CUDA Graphs & TensorRT. CUDA primitives moved to the GPU Kernels track.

FSDP · tensor parallel · pipeline parallel · mixed precision · torch.compile
19 lessons · ~8 h

Mini GPT

A compact path through GPT architecture, pretraining, supervised tuning, chain-of-thought data, DPO, and RLVR.

architecture · pretraining · SFT · DPO · RLVR
6 lessons · ~3 h

vLLM

Serving from first principles: KV cache math, PagedAttention, FlashAttention, continuous batching, prefill/decode splits, GQA/MQA, and Multi-LoRA.

PagedAttention · FlashAttention · continuous batching · Multi-LoRA
12 lessons · ~5 h

SGLang

A serving framework whose unit of work is the program, not the single call. RadixAttention turns prefix sharing into a tree-shaped cache; cache-aware scheduling turns that capability into hit rate; compressed-FSM + xgrammar make constrained outputs free; FlashInfer + DP-attention + EP carry the kernel and parallelism load.

RadixAttention · cache-aware sched · xgrammar · FlashInfer · DP attention · EAGLE
11 lessons · ~3 h

Generative Continuous

Diffusion, flow matching, DiT, tokenizers, discrete generation, unified-token models, and hybrid reasoning/image pipelines.

DDPM · flow matching · DiT · VQ-VAE · unified tokens
15 lessons · ~6 h

Traditional ML Interviews

The durable interview canon: bias-variance, linear and logistic models, trees, boosting, kernels, Bayes, unsupervised methods, evaluation, and interpretability.

bias-variance · GLMs · GBDT · SVMs · evaluation
12 lessons · ~4 h

new

Deep Learning Foundations

The math you get drilled on once the interviewer stops asking about XGBoost. Backprop, optimizers, normalization, attention, positional encodings, tokenization, scaling laws, calibration, A/B testing, init, MoE — derived from first principles, with the trade-offs an interviewer probes.

backprop · Adam · attention · RoPE · scaling laws · MoE
12 lessons · ~5 h

Search Ads & Recsys Interviews

Retrieval, ranking, embeddings, losses, calibration, negative sampling, bias, evaluation, auctions, pacing, and system design.

retrieval · ranking · ANN · calibration · auctions
11 lessons · ~4 h