Orientation
A linear map of data-intensive systems: start with one machine and one truthful copy, then add scale, failure, change, and derived views one constraint at a time.
A data-intensive application is one whose hardest engineering problem is not raw computation but the movement, durability, interpretation, and agreement of data across time and machines.
Why this track exists
The site already has tracks for backend system design, post-training data pipelines, and ML systems design. This track sits underneath them. It answers the lower-level question those tracks often assume: what are the data systems actually doing when they store, index, replicate, partition, transact, stream, and recompute?
We will not treat databases, logs, queues, caches, feature stores, search indexes, and vector indexes as unrelated products. We will treat them as different positions in one design space: each one chooses which facts are authoritative, which reads are cheap, which writes are cheap, how much coordination is required, and what happens when machines disagree.
The linearized path
The sequence is intentionally linear. First we define the target properties: reliability, scalability, maintainability. Then we choose how data is shaped and queried. Then we open the box: storage engines, indexes, encoding, and schema evolution. Only after a single node makes sense do we distribute it: replication, partitioning, transactions, failure, consistency, and consensus. Finally we turn one dataset into many: batch, streams, CDC, event logs, and derived state.
The central habit is: name the constraint before naming the tool. If the constraint is write throughput, an LSM tree may follow. If it is range scans, a B-tree or sorted columnar layout may follow. If it is cross-system correctness, idempotency or an outbox may follow. If it is global uniqueness, consensus may follow. Tools are conclusions, not opening moves.
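One of the pairings above, cross-system correctness leading to idempotency, can be sketched in a few lines. This is a minimal illustration, not a production design: the `PaymentLedger` class, the `apply` method, and the request keys are hypothetical names invented for the example, and the sketch assumes a single process with in-memory state.

```python
class PaymentLedger:
    """Hypothetical receiver that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.balance = 0
        # idempotency key -> result returned the first time the key was applied
        self._seen: dict[str, int] = {}

    def apply(self, idempotency_key: str, amount: int) -> int:
        # A retried request carries the same key, so it returns the original
        # result instead of producing a second effect.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        self._seen[idempotency_key] = self.balance
        return self.balance


ledger = PaymentLedger()
first = ledger.apply("req-42", 100)
retry = ledger.apply("req-42", 100)  # sender retried after a timeout
other = ledger.apply("req-43", 50)   # a genuinely new request

print(first, retry, other, ledger.balance)  # prints: 100 100 150 150
```

The constraint came first: a sender that cannot tell whether its request was applied will retry, and the retry must not double the effect. In a real system the key table and the state change would need to be committed together, which this in-memory sketch does not model.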
How to read each lesson
Every lesson has the same spine: first principle, mechanism, trade-off, failure mode, and design prompt. The goal is not to memorize terminology from Designing Data-Intensive Applications (DDIA). The goal is to be able to re-derive the design: why the log exists, why replication lags, why a shard key creates hot spots, why serializable isolation is expensive, why event time is not processing time, and why every derived view must be treated as a cache unless you can recompute or verify it.
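To make one of these concrete, here is a small sketch of why event time is not processing time. The event tuples, the 10-second window size, and the `bucket` helper are assumptions made up for the illustration; real stream processors add watermarks and late-data policies that are not shown here.

```python
from collections import defaultdict

# Hypothetical events: (event_time_s, processing_time_s, payload).
# Event time is when the thing happened; processing time is when we saw it.
events = [
    (5, 6, "a"),
    (12, 13, "b"),
    (8, 21, "c"),  # happened early but arrived late, e.g. a client was offline
]

WINDOW_S = 10  # assumed tumbling-window size in seconds


def bucket(rows, time_index):
    """Group payloads into tumbling windows keyed by window start time."""
    windows = defaultdict(list)
    for row in rows:
        start = (row[time_index] // WINDOW_S) * WINDOW_S
        windows[start].append(row[2])
    return dict(windows)


by_event_time = bucket(events, 0)
by_processing_time = bucket(events, 1)

print(by_event_time)       # {0: ['a', 'c'], 10: ['b']}
print(by_processing_time)  # {0: ['a'], 10: ['b'], 20: ['c']}
```

The same three events produce different windows depending on which clock is used. Event "c" belongs to the first window by event time, but a system that groups by arrival places it two windows later.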
Examples will keep returning to ML infrastructure because it stresses the same abstractions: feature freshness, vector indexes, RAG ingestion, model registries, training-data lineage, online ranking, and RL rollout logs. But the mechanics are general. A fraud system, a search system, and a recommendation system all live on the same substrate.
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| One database | Simple, one source of truth, easy transactions | Limited scale and availability; one tool must serve every access pattern |
| Many specialized systems | Search, cache, analytics, ML features each get the right layout | Data integration, lag, duplicate state, and correctness become the main problem |
| Strong coordination | Clear invariants and simple mental model | Latency, lower availability during partitions, and operational complexity |
| Asynchronous derivation | Fast writes, decoupled systems, recomputation-friendly | Readers may see stale or incomplete derived state |
For each mechanism, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
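The asynchronous-derivation trade-off can be sketched as a primary store with a change log and a derived view that applies the log later. This is an illustrative toy under stated assumptions: `Primary`, `DerivedView`, `catch_up`, and `lag` are hypothetical names, everything runs in one process, and the "asynchrony" is simulated by calling `catch_up` explicitly.

```python
class Primary:
    """System of record: current rows plus an append-only change log."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}
        self.changelog: list[tuple[str, str]] = []

    def write(self, key: str, value: str) -> None:
        # The write returns without waiting for any derived view: fast writes.
        self.rows[key] = value
        self.changelog.append((key, value))


class DerivedView:
    """Derived state that replays the change log from its own offset."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}
        self.offset = 0  # number of log entries already applied

    def catch_up(self, changelog: list[tuple[str, str]]) -> None:
        for key, value in changelog[self.offset:]:
            self.rows[key] = value
        self.offset = len(changelog)

    def lag(self, changelog: list[tuple[str, str]]) -> int:
        return len(changelog) - self.offset


primary = Primary()
view = DerivedView()

primary.write("doc:1", "v1")
view.catch_up(primary.changelog)

primary.write("doc:1", "v2")            # the view has not seen this yet
stale_read = view.rows["doc:1"]         # still "v1"
lag_before = view.lag(primary.changelog)

view.catch_up(primary.changelog)
fresh_read = view.rows["doc:1"]         # now "v2"

print(stale_read, lag_before, fresh_read)  # prints: v1 1 v2
```

The contract is that the view eventually matches the log; the bill is freshness, paid by any reader that arrives between the write and the catch-up. Because the view tracks only an offset into the log, it can also be discarded and rebuilt by replaying from zero.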
If you skip the substrate and jump straight to products, every decision sounds like taste: Postgres vs Cassandra, Kafka vs RabbitMQ, cache vs materialized view. The linear method turns those into consequences of workload, failure model, and correctness requirements.
Design prompts
- For a RAG product, which data is the system of record, and which data is derived?
- For a feature store, which reads can be stale and which writes must be durable before returning?
- Name one invariant that needs coordination and one derived view that can be recomputed asynchronously.