Data-Intensive Systems, From First Principles

A linear, mechanism-first track on the data systems beneath reliable ML products: data models, storage engines, schema evolution, replication, partitioning, transactions, distributed failure, consistency, consensus, batch, streams, and derived data.

Source note

This track is an original educational synthesis inspired by Martin Kleppmann's Designing Data-Intensive Applications. It does not reproduce the book's prose or figures; it uses the book's conceptual arc as a launchpad for site-native, first-principles lessons with ML-infrastructure examples.

The linearized idea

Start with one truthful copy on one machine. Then add one pressure at a time: bigger data, more readers, more writers, more failures, more versions of code, more regions, and more derived views. Every mechanism in the track is an answer to one of those pressures, and every answer charges a cost somewhere else.

The map

single truthful copy -> data model and query language -> physical layout and schema evolution -> replication and partitioning -> transactions and partial failure -> consistency and consensus -> batch, streams, and derived data

Syllabus

Orientation

A linear map of data-intensive systems: start with one machine and one truthful copy, then add scale, failure, change, and derived views one constraint at a time.

Reliability, Scalability, and Maintainability

Before storage engines or consensus, define what the system must keep true while load grows, components fail, and humans keep changing the software.

Data Models: Relational, Document, and Graph

A data model is the shape of thought a system makes cheap: tables for joins and constraints, documents for aggregate reads, graphs for relationships that keep moving.

Query Languages and Access-Pattern Design

Declarative languages say what you want and let the engine choose a plan; access-pattern design says the plan is part of the contract.

Encoding, Schemas, and Evolution

Data outlives code. Encoding is how objects become bytes; schema evolution is how old and new code keep understanding those bytes during change.

Storage Engines I: Logs, Hash Indexes, and B-Trees

A database write becomes durable by entering a log; a read becomes fast by using an index. The first storage fork is whether you overwrite pages or append history.

Storage Engines II: LSM Trees and Columnar Storage

LSM trees make writes sequential and push cleanup into compaction; column stores make analytical reads cheap by arranging bytes by question, not by object.

Replication I: Leaders, Followers, Lag, and Failover

Copies buy availability, read scale, and lower latency. The price is that copies can disagree, and the gap between them is where user-visible anomalies live.

Replication II: Multi-Leader, Leaderless, Quorums, and Conflicts

Once more than one node can accept writes, availability rises and conflict resolution becomes part of the application contract.

Partitioning: Hash, Range, Hot Keys, and Rebalancing

Partitioning splits data and load across machines. The shard key is a performance contract: it decides locality, balance, and which queries become scatter-gather.

Transactions and Isolation Anomalies

Transactions let the application pretend certain concurrency and crash failures did not happen. Isolation level decides how convincing that illusion is.

Partial Failures, Timeouts, Clocks, and Fencing

Distributed systems are not just slow local systems. Some parts can fail, pause, lie by omission, or come back from the dead after everyone replaced them.

Consistency, Causality, Ordering, and Linearizability

Consistency models define what reads are allowed to see. The strongest models simplify reasoning by making distributed state look single-copy, but they charge latency and availability.

Consensus and Coordination

Consensus is the machinery for making several unreliable machines decide one thing irrevocably. Use it where the invariant truly needs one winner.

Batch Processing, Joins, Materialization, and Recompute

Batch systems turn bounded input into derived output. Their strength is deterministic recomputation; their cost is scan, shuffle, materialization, and time.

Stream Processing: Logs, CDC, Event Time, Windows, and Fault Tolerance

Streams are batch processing without an end-of-file. The hard parts are ordering, time, replay, duplicates, state, and deciding when a window is complete enough.

Derived Data Capstone: Caches, Indexes, Features, RAG, and Correctness

Modern applications compose specialized systems. The design problem is not choosing one perfect database; it is keeping derived state useful, explainable, and repairable.

How this differs from the neighboring tracks

Track	Focus	This track's role
Distributed Systems Design	Interview patterns and backend architecture moves	Goes deeper on the data-system mechanisms behind those moves
Data Engineering for Post-Training	ML training-data pipelines and lakehouse workflows	Explains the storage, schema, batch, stream, and correctness substrate
ML Systems Design	Designing model-serving, training, evaluation, and product systems	Provides the database/log/index/derived-state vocabulary those systems depend on