A fundamentals library

The fundamentals of modern ML and software systems, explained from first principles.

A reference library, not a course to grind. Each topic is built from the ground up — every concept justified, every trade-off made explicit, every track readable end-to-end. Browse by area below, or filter for a specific idea.

~988 lesson pages

16 areas

~360 h read time

49 tracks

A first-principles continuation of the CV track into three dimensions. Frames all of 3D vision as one inverse-rendering loop, then derives it: representations (points, voxels, meshes, SDFs, radiance fields, Gaussians), rotations & SE(3), depth sensing, registration/ICP, surfaces & marching cubes, NeRF, Gaussian Splatting, 3D deep learning & detection, learned depth/MVS, and generative 3D + systems.

representations · SE(3) · point clouds · NeRF · splatting · generative 3D

13 lessons ~8 h Open series →

new

Computer Graphics · 计算机图形学

A first-principles Computer Graphics track: the forward-rendering mirror of the CV tracks (scene → image). Bilingual in English and Chinese, with a per-page language toggle. Framed around the rendering equation as the north star: the forward problem and two visibility strategies, transforms, geometry, rasterization, interpolation/textures, sampling/anti-aliasing, radiometry, shading & PBR, materials, ray tracing & BVH, Monte Carlo path tracing, real-time GI, animation, color/HDR/tone mapping, and the GPU frame + neural rendering.

rendering equation · rasterization · PBR · path tracing · real-time GI · GPU

15 lessons ~9 h Open series →

Generative models

3 tracks

Building samplers and language models: diffusion, flow matching, and tokenizers; a GPT from pretrain to RLVR; and the full from-scratch LLM course (CS336).

Generative Continuous

Diffusion, flow matching, DiT, tokenizers, discrete generation, unified-token models, and hybrid reasoning/image pipelines.

DDPM · flow matching · DiT · VQ-VAE · unified tokens · DDIM/DPM-Solver · CFG · latent diffusion · CLIP · Stable Diffusion · LoRA/ControlNet

23 lessons ~7 h Open series →

Mini GPT

new

Modern GPU Programming — Hopper & Blackwell

The advanced sequel to GPU Kernels: what happens when the classic synchronous loop breaks down on Hopper/Blackwell and the kernel becomes an asynchronous supply chain. The engine of the whole track: the tensor core got so fast that feeding it is the entire kernel — a B200 sustains ≈2 PFLOP/s of fp16 but HBM delivers only ≈8 TB/s (roofline ridge ≈250 FLOP/byte). Part I (00–02): the forcing function — roofline, why occupancy gave way to explicit async overlap, and the shape·stride·swizzle layout algebra. Part II (03–07): the five primitives, each a forced move — tensor-core generations (mma→wgmma→tcgen05), TMA, TMEM, mbarriers + the phase model, and clusters/DSMEM/CLC. Part III (08–10): assemble a GEMM from the smallest correct tile to a warp-specialized, clustered, cuBLAS-parity kernel (70 ms → 0.094 ms, ~744×). Part IV (11–14): Flash Attention 4 (two MMAs with a softmax wedged between), debugging warp-specialized kernels, and where it all lives — CUTLASS/CuTe, cuDNN, Triton. Framed in CUTLASS/CuTe + PTX; distilled from MLC.ai's Modern GPU Programming for MLSys.

tcgen05 · TMA · TMEM · mbarriers · warp specialization · Flash Attention 4

15 lessons ~6 h Open series →
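
The roofline figures quoted in this track's description can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the peak and bandwidth values are the approximate ones quoted above, and the GEMM intensity formula assumes each matrix is read or written exactly once (an idealized lower bound on traffic).

```python
# Roofline sketch: approximate figures taken from the track description above.
PEAK_FLOPS = 2e15      # ~2 PFLOP/s of fp16 (approximate)
HBM_BANDWIDTH = 8e12   # ~8 TB/s (approximate)


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) where compute and memory limits meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline model: performance is capped by the lower of the two limits."""
    return min(peak_flops, intensity * bandwidth)


def gemm_intensity(m, n, k, bytes_per_elem=2):
    """Idealized intensity of C = A @ B, assuming A, B, C each move once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
small = gemm_intensity(512, 512, 512)
large = gemm_intensity(4096, 4096, 4096)
print(f"ridge = {ridge:.0f} FLOP/byte")
print(f"512^3 GEMM intensity  = {small:.1f} FLOP/byte")
print(f"4096^3 GEMM intensity = {large:.1f} FLOP/byte")
```

Under this idealized model a small square GEMM sits below the ridge (memory-bound) and a large one sits above it, which is one way to read the claim that feeding the tensor core is the whole kernel.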

new

Triton (OpenAI) — Writing GPU Kernels in Python

The kernel DSL (OpenAI Triton), not the NVIDIA inference server. Part I (01–03): the tile programming model, the execution pipeline, why hiding warps is the trade. Part II (04–06): the DSL — pointers/masks, tl.dot + tensor cores, reductions and the online softmax recurrence. Part III (07–11): five canonical kernels — vector add → fused linear+activation → tiled matmul → softmax → RMSNorm. Part IV (12): Flash Attention as the synthesis. Part V (13–14): autotune, software pipelining, pitfalls, backward passes, profiling, and the decision tree of when to write Triton. Part VI (15–17): the same DSL under interview conditions — operations and launch flow, optimized snippets, and algorithm/data-structure patterns. Part VII (18–21): training kernels — the backward pass: torch.autograd.Function wiring, the two-GEMM matmul backward, fused cross-entropy, and the norm-backward dγ/dβ reduction.

tile model · tl.dot · online softmax · Flash Attention · autotune · num_stages

21 lessons ~7 h Open series →
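
The online softmax recurrence named in Part II can be illustrated outside Triton. The following is a plain-Python sketch, not a Triton kernel and not taken from the series itself: the function name and the `block` parameter are hypothetical. It processes the input in blocks, keeping a running maximum and a running sum that is rescaled whenever the maximum changes.

```python
import math


def online_softmax(xs, block=4):
    """Softmax over xs, accumulating the normalizer one block at a time."""
    m = -math.inf  # running maximum seen so far
    s = 0.0        # running sum of exp(x - m) over the elements seen so far
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        m_new = max(m, max(chunk))
        # Rescale the old sum to the new maximum, then add the new block.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
    # Second pass: normalize with the final maximum and sum.
    return [math.exp(x - m) / s for x in xs]


def reference_softmax(xs):
    """Ordinary max-subtracted softmax, used only for comparison."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The point of the recurrence is that the normalizer never needs the whole row at once, which is what lets attention kernels stream over blocks of keys.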

new

AI Compilers — From Graph to Kernel

How torch.compile, XLA, TVM, and TensorRT turn the tensor graph you wrote into a handful of fused, hardware-specialized kernels — derived as one forced chain where each pass answers an inefficiency the last left behind. Part I (00–01): the semantic gap and the eager baseline you must beat. Part II (02–03): the front end — graph capture (tracing vs Dynamo, guards, graph breaks) and the IR (SSA, multi-level lowering). Part III (04–07): the middle end — rewrites/canonicalization, operator fusion (the central win), memory planning, and layout assignment. Part IV (08–10): the back end — scheduling (the loop nest), codegen (LLVM/PTX vs Triton), and autotuning. Part V (11–13): the ML-specific hard parts — dynamic shapes, autodiff as a pass, and distributed compilation. Part VI (14–17): shipping the compiled model — operator coverage & decomposition (the long tail), quantization & precision lowering, the runtime/executor (streams, CUDA graphs, the allocator), and debugging compiled models. Part VII (18–19): the real stacks as one Rosetta stone, then a capstone tracing one block source-to-kernel. The "how the kernels get generated" sibling of the GPU & Triton tracks; a hub that links out for kernel-level depth.

torch.compile · fusion · IR & lowering · scheduling · codegen · autotuning

20 lessons ~7.5 h Open series →

vLLM

Serving from first principles: KV cache math, PagedAttention, FlashAttention, continuous batching, prefill/decode splits, GQA/MQA, and Multi-LoRA.

PagedAttention · FlashAttention · continuous batching · Multi-LoRA

12 lessons ~5 h Open series →
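
As a rough illustration of the KV cache math this track starts from, the sketch below counts bytes for a standard decoder-only Transformer cache (one key and one value vector per layer, per KV head, per token). The function name and the model shape in the example are hypothetical and are not taken from vLLM or any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values (factor 2) for a batch."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem)


# Hypothetical shape: 32 layers, head_dim 128, fp16 (2 bytes per element).
mha_per_token = kv_cache_bytes(32, 32, 128, seq_len=1)  # 32 KV heads
gqa_per_token = kv_cache_bytes(32, 8, 128, seq_len=1)   # 8 KV heads (GQA)
print(mha_per_token // 1024, "KiB per token with 32 KV heads")  # 512
print(gqa_per_token // 1024, "KiB per token with 8 KV heads")   # 128
```

Cutting the number of KV heads shrinks the cache proportionally, which is why GQA/MQA appear alongside PagedAttention in the list of topics.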

SGLang

A serving framework whose unit of work is the program, not the single call. RadixAttention turns prefix sharing into a tree-shaped cache; cache-aware scheduling turns that capability into hit rate; compressed-FSM + xgrammar make constrained outputs free; FlashInfer + DP-attention + EP carry the kernel and parallelism load.

RadixAttention · cache-aware sched · xgrammar · FlashInfer · DP attention · EAGLE

11 lessons ~3 h Open series →

System ML

Distributed training and inference: collectives, interconnect, DDP/FSDP/TP/PP/SP/EP, 3D parallelism, PyTorch internals, mixed precision, caching allocator, kernel fusion, Triton, torch.compile, CUDA Graphs & TensorRT. CUDA primitives moved to the GPU Kernels track.

FSDP · tensor parallel · pipeline parallel · mixed precision · torch.compile

26 lessons ~11 h Open series →

new

Kubernetes from First Principles

The substrate, derived: read this before the GenAI track. A linear-thinking derivation of Kubernetes itself: start with one process you want to keep alive for the world, and at every step name the exact failure the previous tool leaves behind, then derive the one primitive that fixes it. The one idea is to declare desired state and let a control loop reconcile it. The track then builds up container, pod, node, control plane, controllers & labels, deployments, networking & services, DNS & ingress, config & storage, stateful/batch workloads, the scheduler, health & autoscaling, security & RBAC, and CRDs/operators, closing with a capstone that traces one apply and one request through the whole machine and bridges to the GenAI track.

reconciliation · pods · controllers · services · scheduler · RBAC · operators

14 lessons ~6 h Open series →
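
The "declare desired state, let a control loop reconcile it" idea can be sketched in a few lines. This is a toy model only: the names `reconcile`, `apply_actions`, and `control_loop` are hypothetical, state is just a dict of replica counts, and nothing here is the Kubernetes API.

```python
def reconcile(desired, actual):
    """One pass: list the actions that move `actual` toward `desired`."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that is no longer declared gets stopped.
    for name, have in actual.items():
        if name not in desired and have > 0:
            actions.append(("stop", name, have))
    return actions


def apply_actions(actual, actions):
    """Return a new state with the actions applied (no real side effects)."""
    new_state = dict(actual)
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        new_state[name] = new_state.get(name, 0) + delta
    return new_state


def control_loop(desired, actual, max_iters=10):
    """Repeat observe -> diff -> act until nothing is left to do."""
    for _ in range(max_iters):
        actions = reconcile(desired, actual)
        if not actions:
            break
        actual = apply_actions(actual, actions)
    return actual


print(control_loop({"web": 3}, {"web": 1, "old": 2}))
```

The loop never stores a plan; it recomputes the difference from whatever it observes, so it converges again after any disturbance.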

new

Generative AI on Kubernetes

The production platform layer for LLMs on Kubernetes. A book-linear operational track: manual model deployment, model servers and controllers, model artifact delivery, GPU node setup, multi-GPU topology, benchmarking, autoscaling, AI-aware routing, disaggregated serving, LLM observability, guardrails, customization, fine-tuning jobs, batch scheduling, RAG, agents, and a capstone covering GitOps release gates, security, FinOps, and incident response.

KServe · model data · GPU scheduling · AI gateways · observability · RAG · agents · FinOps

18 lessons ~7 h Open series →

The pipelines that feed post-training. Part I (01–03): the data units across SFT / preference / RL, the ETL/ELT skeleton (medallion bronze/silver/gold, idempotency, determinism), ingestion & provenance. Part II (04–09): the pipeline one stage per lesson — storage formats (JSONL/Parquet/Arrow), distributed transformation (Spark/Ray/Daft and the shuffle), dedup & decontamination (MinHash + LSH), tokenization & packing, quality & validation, orchestration & versioning. Part III (10–11): the RL online dataplane (in-loop rollout→verify→buffer) and the end-to-end cost / throughput / monitoring model. Pairs with the RL data-curation lesson — that one is which data; this series is how to build the pipeline.

ETL / ELT · medallion · Parquet · Spark / Ray · MinHash + LSH · packing · orchestration · RL online dataplane

11 lessons ~4 h Open series →

Search, ads & recommender systems

1 track · 3 paths

Production ranking systems from first principles, as three linearized reading paths over one folder: Search — query understanding, the inverted index & BM25, dense & hybrid retrieval, learning-to-rank, semantic reranking, autocomplete, evaluation, and serving; Recommender systems — the retrieve→rank→rerank cascade, objectives, bias correction, sequences, multimodal, and ops; and Ads — auctions, bidding & pacing, and autobidding. Search and recsys share ~90% of the machine; the paths cross-link rather than repeat it.

Search, Ads & Recommender Systems

One stand-alone series, three linear tracks. Search (40–49): query understanding, inverted index & BM25, dense & hybrid retrieval, learning-to-rank, semantic reranking, autocomplete, relevance evaluation, and search system design. Recommender systems (01–08, 12–39): the retrieve→rank→rerank cascade, objectives, bias correction, sequences, multimodal, evaluation, serving, and responsible constraints. Ads (09–10, 25): auctions, bidding & pacing, autobidding. Start anywhere; the tracks cross-link the shared machine instead of repeating it.

query understanding · BM25 · hybrid retrieval · learning-to-rank · ranking · A/B & relevance eval · auctions

49 lessons ~10 h Open series →

General education · Science & humanities

20 tracks

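Chinese-language popular science and general education for everyone. Each discipline is told as a chain of causal reasoning rather than a list of terms to memorize: every lesson is forced by the question the previous one left unanswered, and every lesson comes with a small interactive experiment. One thread runs through it all: mathematics 🔢 → the universe 🌌 → matter 🧪 → the Earth 🌍 → humanity and civilization 📜 → mind 🧠 → thought 💭 → politics 🏛️ → art (music 🎵 · film 🎬) and everyday life (cooking 🍳 · travel ✈️). Covers mathematics (history of ideas), astrophysics and cosmology, chemistry, geography, history (a brief history of humankind · world civilizations · anthropology), psychology (including the advanced "The Computational Mind": cognitive science × machine learning), philosophy, politics, music, film, cooking and food, wine, spirits and tasting, civil aviation and travel hotels, plus investigation and counter-investigation (an interdisciplinary "science of information conflict": from signal detection and military deception to mass surveillance and forensics) and criminal psychology (translating "evil" into behavior that can be studied, following one case through its life cycle of detection, trial, and correction).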

new

数学的逻辑 · The Logic of Mathematics

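A Chinese-language line of mathematical reasoning that runs from "what is 3 − 5?" to "mathematics can never finish proving itself", then turns at Gödel and Turing and builds all the way to neural networks and diffusion models. All 44 lessons use linear thinking: each lesson is forced by the question the previous one left unanswered, and the engine is "crisis → invention → new crisis". The first half is a history of mathematical ideas plus three modern foundations: the ladder of number systems forces out ℤ/ℚ/ℝ/ℂ (00–04) → Euclidean and non-Euclidean geometry, where axioms turn from truths into choices (05) → limits/derivatives/integrals/ε–δ (06–09) → probability and linear algebra (10–11) → mathematics looks back at itself: groups, Cantor, Russell, Gödel (12–15) → modern foundations: topology, measure, Turing (16–18). The second half assembles this mathematics into machines that learn and generate: representation and geometry (inner products and projection, SVD/PCA, the curse of dimensionality, 19–24) → learning is optimization (gradients, backpropagation, convex optimization, 25–28) → probability, statistics, and information (distributions, the multivariate Gaussian, MLE/MAP, Bayes, information theory, Monte Carlo, 29–34) → assembling the classic learners: regression, PCA/GMM, SVM (35–37) → from a single neuron to modern AI: deep networks and universal approximation, generalization, convolution, attention and the Transformer, VAE/diffusion, closing with "from counting to modern artificial intelligence" (38–43). Every lesson comes with a small interactive experiment.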

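A Chinese-language line of reasoning that runs from "how do you translate 'incomprehensible evil' into 'behavior that can be studied'?" to "where are the limits and the ethics of a science that decides freedom and guilt?". All 16 lessons use linear thinking: no sensationalism and no peddling of the "deranged genius". Instead it follows one case through its life cycle and answers five escalating questions: Why did it happen? Who did it? Why can't the human mind be trusted? Is the offender responsible? Will the offender reoffend, and can they change? One tension runs throughout: understanding (psychology) ↔ accountability (law). It also hits the brakes before every popular myth (profiling as mind reading, the polygraph, repressed memories, the insanity defense, the CSI effect, the born criminal). Part I (00–03): why people commit crimes (definitions / root causes / psychopathy); Part II (04–07): who did it (behavioral signature / profiling / interrogation / false confessions); Part III (08–10): why the mind can't be trusted (memory / lie detection / investigator bias); Part IV (11–12): is the offender responsible (criminal responsibility / the jury); Part V (13–14): will they reoffend, can they change (risk assessment / correction); Part VI (15): the finale. It is the deep dive into the human mind behind the "human intelligence / deception / forensics" part of Investigation and Counter-Investigation. Every lesson comes with a small interactive experiment.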

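translating evil into behavior · understanding ↔ accountability · roots of crime & anti-determinism · psychopathy (destigmatized) · modus operandi vs signature · profiling = Bayes/Barnum · interrogation & false confessions · memory reconstruction & wrongful convictions · the lie-detection myth · tunnel vision & confirmation bias · insanity defense · risk assessment & base rates · correction & recidivism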

16 lessons ~6 h Open series →

new

训练的逻辑 · The Logic of Training

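A Chinese-language popular-science line of reasoning about training that runs from "the body is an adaptation machine" to "evidence-based training". All 16 lessons use linear thinking: training = giving the body a stimulus it cannot currently handle; during recovery the body supercompensates and grows a little stronger than before. Every training variable (intensity / volume / frequency / recovery / nutrition) merely tunes the "stimulus → recovery → adaptation" loop, and progressive overload is the master principle. Part I (00–03): the core loop (supercompensation / progressive overload / specificity); Part II (04–07): which knobs to turn (intensity vs volume / hypertrophy / strength / energy systems); Part III (08–10): recovery is where muscle actually grows (sleep / nutrition / overtraining and deloading); Part IV (11–13): organizing the knobs into a plan (periodization / individual differences / evidence-based training); Part V (14–15): recipes by goal and the finale. Honest and evidence-based: it punctures myths such as "soreness = muscle growth", "muscle confusion", "exercises sculpt shape", and "supplement myths". It is the prequel to the "Logic of Sport and Competition" series (how athletes are made strong), and a sibling of The Logic of Cooking, Why Matter Changes, and Psychology. Every lesson comes with a small interactive experiment.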

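stimulus → recovery → adaptation · progressive overload · hypertrophy & strength · energy systems & endurance · recovery, nutrition & deloading · periodization · evidence vs folklore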

16 lessons ~5.5 h Open series →

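A Chinese-language line of reasoning about snowboard technique (the CASI system) that runs from "edge × pressure" to park freestyle. All 14 lessons use linear thinking: a snowboard does only one thing, which is to use the body's basic movements to manage "edge × pressure" and tame gravity's pull down the fall line into a controlled arc. Everything in CASI (stance and balance + the four categories of basic movements + the three phases of a turn) is this engine, and from carving to park jumps every advanced move is just a recombination of it. Part I (00–03): CASI fundamentals; Part II (04–07): advanced turns (skid→carve, sidecut physics, dynamic edge changes, switch); Part III (08–13): park freestyle (safe line choice / ollie / jib / jumps / spins / grabs and the finale). Like The Logic of Golf, it is an individual technical sport in which you contend with gravity, terrain, and yourself. Every lesson comes with a small canvas experiment.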

edge × pressure · CASI four basic movement categories · three phases of a turn · carving · dynamic turns & switch · park freestyle

14 lessons ~5 h Open series →

Interview prep

1 track

Condensed, exam-oriented digests. Kernel-interview coding now lives inside its track — CUDA in GPU Kernels for ML Engineers (Part V) and Triton in the Triton series (Part VI). This is the coding-round companion.

Python & Coding Patterns

The coding-round companion: Python essentials and the standard algorithm/data-structure patterns — two pointers, binary search, stacks & queues, trees, graphs, DP, greedy, union-find — each as a reusable template.

two pointers · binary search · trees · graphs · DP

18 lessons ~6 h Open series →