all_lessons / data_intensive_systems 17 lessons · ~7h read

Data-Intensive Systems, From First Principles

A linear, mechanism-first track on the data systems beneath reliable ML products: data models, storage engines, schema evolution, replication, partitioning, transactions, distributed failure, consistency, consensus, batch, streams, and derived data.

Source note

This track is an original educational synthesis inspired by Martin Kleppmann's Designing Data-Intensive Applications. It does not reproduce the book's prose or figures; it uses the book's conceptual arc as a launchpad for site-native, first-principles lessons with ML-infrastructure examples.

The linearized idea

Start with one truthful copy on one machine. Then add one pressure at a time: bigger data, more readers, more writers, more failures, more versions of code, more regions, and more derived views. Every mechanism in the track is an answer to one of those pressures, and every answer charges a cost somewhere else.

The map

single truthful copy -> data model and query language -> physical layout and schema evolution -> replication and partitioning -> transactions and partial failure -> consistency and consensus -> batch, streams, and derived data

Syllabus

00
Orientation
A linear map of data-intensive systems: start with one machine and one truthful copy, then add scale, failure, change, and derived views one constraint at a time.
01
Reliability, Scalability, and Maintainability
Before storage engines or consensus, define what the system must keep true while load grows, components fail, and humans keep changing the software.
02
Data Models: Relational, Document, and Graph
A data model is the shape of thought a system makes cheap: tables for joins and constraints, documents for aggregate reads, graphs for relationships that keep moving.
03
Query Languages and Access-Pattern Design
Declarative languages say what you want and let the engine choose a plan; access-pattern design says the plan is part of the contract.
04
Encoding, Schemas, and Evolution
Data outlives code. Encoding is how objects become bytes; schema evolution is how old and new code keep understanding those bytes during change.
05
Storage Engines I: Logs, Hash Indexes, and B-Trees
A database write becomes durable by entering a log; a read becomes fast by using an index. The first storage fork is whether you overwrite pages or append history.
06
Storage Engines II: LSM Trees and Columnar Storage
LSM trees make writes sequential and push cleanup into compaction; column stores make analytical reads cheap by arranging bytes by question, not by object.
07
Replication I: Leaders, Followers, Lag, and Failover
Copies buy availability, read scale, and lower latency. The price is that copies can disagree, and the gap between them is where user-visible anomalies live.
08
Replication II: Multi-Leader, Leaderless, Quorums, and Conflicts
Once more than one node can accept writes, availability rises and conflict resolution becomes part of the application contract.
09
Partitioning: Hash, Range, Hot Keys, and Rebalancing
Partitioning splits data and load across machines. The shard key is a performance contract: it decides locality, balance, and which queries become scatter-gather.
10
Transactions and Isolation Anomalies
Transactions let the application pretend certain concurrency and crash failures did not happen. Isolation level decides how convincing that illusion is.
11
Partial Failures, Timeouts, Clocks, and Fencing
Distributed systems are not just slow local systems. Some parts can fail, pause, lie by omission, or come back from the dead after everyone replaced them.
12
Consistency, Causality, Ordering, and Linearizability
Consistency models define what reads are allowed to see. The strongest models simplify reasoning by making distributed state look single-copy, but they charge latency and availability.
13
Consensus and Coordination
Consensus is the machinery for making several unreliable machines decide one thing irrevocably. Use it where the invariant truly needs one winner.
14
Batch Processing, Joins, Materialization, and Recompute
Batch systems turn bounded input into derived output. Their strength is deterministic recomputation; their cost is scan, shuffle, materialization, and time.
15
Stream Processing: Logs, CDC, Event Time, Windows, and Fault Tolerance
Streams are batch processing without an end-of-file. The hard parts are ordering, time, replay, duplicates, state, and deciding when a window is complete enough.
16
Derived Data Capstone: Caches, Indexes, Features, RAG, and Correctness
Modern applications compose specialized systems. The design problem is not choosing one perfect database; it is keeping derived state useful, explainable, and repairable.

How this differs from the neighboring tracks

TrackFocusThis track's role
Distributed Systems DesignInterview patterns and backend architecture movesGoes deeper on the data-system mechanisms behind those moves
Data Engineering for Post-TrainingML training-data pipelines and lakehouse workflowsExplains the storage, schema, batch, stream, and correctness substrate
ML Systems DesignDesigning model-serving, training, evaluation, and product systemsProvides the database/log/index/derived-state vocabulary those systems depend on