Orientation

A linear map of data-intensive systems: start with one machine and one truthful copy, then add scale, failure, change, and derived views one constraint at a time.

First principle

A data-intensive application is one whose hardest engineering problem is not raw computation but the movement, durability, interpretation, and agreement of data across time and machines.

USER ACTION
     |
     v
SYSTEM OF RECORD --change log--> DERIVED VIEWS
     |                            |   |   |
     |                            |   |   +--> ML features / embeddings
     |                            |   +------> search index / cache
     |                            +----------> analytics table
     v
contract: the source of truth must survive failure, evolve safely, and explain every copy.
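As a minimal sketch of the picture above, the following Python models a system of record that appends every change to a log, and a search index that is rebuilt purely by replaying that log. The names `SystemOfRecord` and `build_search_index` are hypothetical, and everything lives in memory; a real system would persist the log and apply it incrementally.

```python
from collections import defaultdict


class SystemOfRecord:
    """Authoritative store plus an append-only change log (illustrative only)."""

    def __init__(self):
        self.rows = {}
        self.log = []  # entries are (offset, key, value); value None means delete

    def put(self, key, value):
        self.rows[key] = value
        self.log.append((len(self.log), key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append((len(self.log), key, None))


def build_search_index(log):
    """Derived view: word -> set of keys, computed only from the change log."""
    docs = {}
    for _offset, key, value in log:
        if value is None:
            docs.pop(key, None)   # a delete removes the document
        else:
            docs[key] = value     # a later write overrides an earlier one
    index = defaultdict(set)
    for key, text in docs.items():
        for word in text.lower().split():
            index[word].add(key)
    return dict(index)


sor = SystemOfRecord()
sor.put("doc1", "replication lag")
sor.put("doc2", "partition lag")
sor.put("doc1", "replication log")  # update to an existing key
sor.delete("doc2")

index = build_search_index(sor.log)
```

Because the index is a pure function of the log, it can be dropped and rebuilt at any time, which is what lets the source of truth "explain every copy".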

Why this track exists

The site already has tracks for backend system design, post-training data pipelines, and ML systems design. This track sits underneath them. It answers the lower-level question those tracks often assume: what are the data systems actually doing when they store, index, replicate, partition, transact, stream, and recompute?

We will not treat databases, logs, queues, caches, feature stores, search indexes, and vector indexes as unrelated products. We will treat them as different positions in one design space: each one chooses which facts are authoritative, which reads are cheap, which writes are cheap, how much coordination is required, and what happens when machines disagree.

The linearized path

The sequence is intentionally linear. First we define the target properties: reliability, scalability, maintainability. Then we choose how data is shaped and queried. Then we open the box: storage engines, indexes, encoding, and schema evolution. Only after a single node makes sense do we distribute it: replication, partitioning, transactions, failure, consistency, and consensus. Finally we turn one dataset into many: batch, streams, CDC, event logs, and derived state.

The central habit is: name the constraint before naming the tool. If the constraint is write throughput, an LSM tree may follow. If it is range scans, a B-tree or sorted columnar layout may follow. If it is cross-system correctness, idempotency or an outbox may follow. If it is global uniqueness, consensus may follow. Tools are conclusions, not opening moves.
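To make one of these pairings concrete, here is a small sketch of idempotency as the answer to a cross-system correctness constraint: a sender that times out and retries must not apply the same change twice. The class `IdempotentLedger` and the request IDs are assumptions made for illustration, not part of any particular library.

```python
class IdempotentLedger:
    """Applies each request at most once, keyed by a caller-supplied request ID."""

    def __init__(self):
        self.balance = 0
        self.seen = set()  # request IDs that have already been applied

    def apply(self, request_id, amount):
        if request_id in self.seen:
            return False  # duplicate delivery: ignore it, state is unchanged
        self.seen.add(request_id)
        self.balance += amount
        return True


ledger = IdempotentLedger()
# "r1" is delivered twice, as happens when a sender retries after a timeout.
deliveries = [("r1", 100), ("r2", -30), ("r1", 100)]
results = [ledger.apply(rid, amt) for rid, amt in deliveries]
```

The constraint (retries must be safe) comes first; the deduplication set is the consequence. In a durable system the set of seen IDs would need to be stored atomically with the balance, which this in-memory sketch does not attempt.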

How to read each lesson

Every lesson has the same spine: first principle, mechanism, trade-off, failure mode, and design prompt. The goal is not to memorize terminology from Designing Data-Intensive Applications (DDIA). The goal is to be able to re-derive the design: why the log exists, why replication lags, why a shard key creates hot spots, why serializable isolation is expensive, why event time is not processing time, and why every derived view must be treated as a cache unless you can recompute or verify it.
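One of these claims, that event time is not processing time, is easy to see with a few lines of code. The sketch below counts the same events in 60-second tumbling windows twice, once by when each event happened and once by when it arrived. The timestamps and the helper `window_start` are made up for illustration.

```python
from collections import Counter

WINDOW = 60  # tumbling window size in seconds


def window_start(ts):
    """Start of the window that contains timestamp ts."""
    return ts - ts % WINDOW


# (event_time, processing_time) in seconds.
# The third event happened at t=58 but arrived late, at t=125.
events = [(10, 12), (50, 55), (58, 125), (70, 72)]

by_event_time = Counter(window_start(e) for e, _p in events)
by_processing_time = Counter(window_start(p) for _e, p in events)
```

Grouping by event time puts three events in the first window; grouping by processing time puts only two there and moves the late one to the window starting at 120. Neither count is a bug in the code. They answer different questions, and a pipeline has to choose which one it means.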

Examples will keep returning to ML infrastructure because it stresses the same abstractions: feature freshness, vector indexes, RAG ingestion, model registries, training-data lineage, online ranking, and RL rollout logs. But the mechanics are general. A fraud system, a search system, and a recommendation system all live on the same substrate.

Trade-offs

Choice	Buys	Costs
One database	Simple, one source of truth, easy transactions	Limited scale and availability; one tool must serve every access pattern
Many specialized systems	Search, cache, analytics, ML features each get the right layout	Data integration, lag, duplicate state, and correctness become the main problem
Strong coordination	Clear invariants and simple mental model	Higher latency, lower availability during partitions, and operational complexity
Asynchronous derivation	Fast writes, decoupled systems, recomputation-friendly	Readers may see stale or incomplete derived state
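The last row of the table can be illustrated with a small sketch: a write is acknowledged as soon as it is appended to the log, while the derived view applies log entries later. The class `AsyncView` and its `catch_up` and `lag` methods are hypothetical names, and the log is a plain in-memory list.

```python
class AsyncView:
    """Derived view that applies log entries some time after they are written."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # number of log entries already applied

    def catch_up(self, log):
        """Apply every log entry the view has not seen yet."""
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)

    def lag(self, log):
        """How many acknowledged writes the view has not applied."""
        return len(log) - self.applied


log = []
view = AsyncView()

log.append(("user:1", "free"))
view.catch_up(log)

log.append(("user:1", "paid"))      # write is acknowledged; view not yet updated
stale_read = view.state["user:1"]   # reader still sees the old value
lag_before = view.lag(log)

view.catch_up(log)
fresh_read = view.state["user:1"]
lag_after = view.lag(log)
```

The write path never waits for the view, which is what makes writes fast; the price is the window in which `lag` is nonzero and readers see the old value. Tracking the applied offset at least makes that staleness measurable.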

What you can now decide

For any mechanism in this track, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

If you skip the substrate and jump straight to products, every decision sounds like taste: Postgres vs Cassandra, Kafka vs RabbitMQ, cache vs materialized view. The linear method turns those into consequences of workload, failure model, and correctness requirements.

Design prompts

For a RAG product, which data is the system of record, and which data is derived?
For a feature store, which reads can be stale and which writes must be durable before returning?
Name one invariant that needs coordination and one derived view that can be recomputed asynchronously.