Data-Intensive Systems, From First Principles
A linear, mechanism-first track on the data systems beneath reliable ML products: data models, storage engines, schema evolution, replication, partitioning, transactions, distributed failure, consistency, consensus, batch, streams, and derived data.
This track is an original educational synthesis inspired by Martin Kleppmann's Designing Data-Intensive Applications. It does not reproduce the book's prose or figures; it uses the book's conceptual arc as a launchpad for site-native, first-principles lessons with ML-infrastructure examples.
Start with one truthful copy of the data on one machine. Then add one pressure at a time: bigger data, more readers, more writers, more failures, more versions of code, more regions, and more derived views. Every mechanism in the track is an answer to one of those pressures, and every answer charges a cost somewhere else.
The map
Syllabus
How this differs from the neighboring tracks
| Track | Focus | This track's role |
|---|---|---|
| Distributed Systems Design | Interview patterns and backend architecture moves | Goes deeper on the data-system mechanisms behind those moves |
| Data Engineering for Post-Training | ML training-data pipelines and lakehouse workflows | Explains the storage, schema, batch, stream, and correctness substrate |
| ML Systems Design | Designing model-serving, training, evaluation, and product systems | Provides the database/log/index/derived-state vocabulary those systems depend on |