Reliability, Scalability, and Maintainability
Before storage engines or consensus, define what the system must keep true while load grows, components fail, and humans keep changing the software.
The first design object is not the database. It is the promise: what must remain correct, fast enough, and changeable when hardware fails, traffic shifts, and the code evolves.
Reliability: correctness despite faults
A fault is one component misbehaving: a disk dies, a process pauses, a packet is lost, a deploy ships a bug, an operator clicks the wrong thing. A failure is when the system as a whole stops meeting its promise. Reliability work is the art of preventing faults from turning into failures.
That distinction matters. A reliable data system assumes faults are normal. It keeps durable logs before acknowledging writes, replicates state before a node disappears, uses idempotency so retries do not double-apply work, and exposes enough observability that humans can see partial failure before it becomes data corruption. The goal is not perfection. The goal is bounded blast radius and recoverability.
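The idempotency point can be made concrete with a small sketch. It assumes each client request carries a unique ID that is reused on retry. `Ledger`, `credit`, and the request IDs are hypothetical names for illustration, and the in-memory dictionaries stand in for state that a real system would write durably, in the same transaction as the balance update.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory account store that deduplicates writes by request ID."""

    balances: dict = field(default_factory=dict)  # account -> balance
    applied: dict = field(default_factory=dict)   # request_id -> resulting balance

    def credit(self, request_id: str, account: str, amount: int) -> int:
        # A retry carries the same request_id, so return the recorded result
        # instead of applying the credit a second time.
        if request_id in self.applied:
            return self.applied[request_id]
        new_balance = self.balances.get(account, 0) + amount
        self.balances[account] = new_balance
        self.applied[request_id] = new_balance
        return new_balance


ledger = Ledger()
first = ledger.credit("req-1", "alice", 50)
retry = ledger.credit("req-1", "alice", 50)   # client timed out and retried
second = ledger.credit("req-2", "alice", 25)  # a genuinely new request
```

The retry returns the same balance as the first attempt and changes nothing; only the new request ID moves the balance. The fault (a lost acknowledgement) still happens, but it does not become a failure (a double credit).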
In ML infrastructure, the same idea appears as dataset lineage and model registry state. A failed transform should not silently produce a half-empty training set. A model promotion should either happen or not happen; it should not leave serving pointing at weights whose feature schema is from yesterday.
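One way to get "happen or not happen" is to keep weights and feature schema version in a single record and swap that record as one unit. The sketch below is an assumption-laden illustration, not a prescribed registry design: `Release`, `Registry`, `promote`, and the URIs are hypothetical, and a process-local lock stands in for whatever atomic compare-and-swap the real metadata store provides.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Weights and the feature schema they expect, promoted together."""

    weights_uri: str
    feature_schema_version: int


class Registry:
    def __init__(self, initial: Release):
        self._lock = threading.Lock()
        self._current = initial

    def current(self) -> Release:
        with self._lock:
            return self._current

    def promote(self, expected: Release, candidate: Release,
                serving_schema_version: int) -> bool:
        # Reject before changing anything if serving cannot produce the
        # features the candidate weights were trained on.
        if candidate.feature_schema_version != serving_schema_version:
            return False
        with self._lock:
            # Compare-and-swap: promote only if the pointer is still what
            # the caller saw when it decided to promote.
            if self._current != expected:
                return False
            self._current = candidate
            return True


v1 = Release("models/ranker/v1", 3)  # hypothetical URIs
v2 = Release("models/ranker/v2", 4)
registry = Registry(v1)

rejected = registry.promote(v1, v2, serving_schema_version=3)
still_v1 = registry.current()
accepted = registry.promote(v1, v2, serving_schema_version=4)
```

Readers of `current()` see either the old release or the new one, never new weights paired with an old schema version.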
Scalability: describe load before solving it
Scalability is meaningless until load is described. Use concrete quantities: requests per second, writes per second, active users, records per day, bytes per record, fan-out per request, model invocations per event, tokens per second, and freshness SLA. The same QPS can be trivial or brutal depending on whether a request touches one row, 500 followees, or a billion-vector ANN index.
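A rough sketch of how the same request rate turns into very different backend load. `LoadProfile` and all the numbers are illustrative assumptions, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadProfile:
    """A few of the load parameters named above, with derived quantities."""

    requests_per_sec: int
    fan_out_per_request: int  # backend reads triggered by one request
    records_per_day: int
    bytes_per_record: int

    def backend_reads_per_sec(self) -> int:
        return self.requests_per_sec * self.fan_out_per_request

    def ingest_bytes_per_day(self) -> int:
        return self.records_per_day * self.bytes_per_record


# Same front-door QPS, different fan-out (illustrative values).
point_lookup = LoadProfile(1_000, 1, 10_000_000, 200)
feed_read = LoadProfile(1_000, 500, 10_000_000, 200)

print(point_lookup.backend_reads_per_sec())  # 1000
print(feed_read.backend_reads_per_sec())     # 500000
print(feed_read.ingest_bytes_per_day())      # 2000000000
```

Both profiles report 1,000 requests per second at the front door, but the feed read asks the storage layer for 500 times as much work.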
Then describe performance in percentiles, not averages. A recommender feed with an 80 ms median but a 5-second p99 feels broken. A stream processor with good average lag but hour-long tail lag creates stale features for the users who matter most. Tail latency is where fan-out, queues, GC pauses, and overloaded partitions reveal themselves.
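A small sketch of why the average hides the tail. The latency samples are synthetic, chosen to match the feed example (98 requests at 80 ms, 2 at 5 seconds), and `percentile` is a hand-written nearest-rank helper rather than a library function.

```python
import math


def percentile(samples_ms, p: int):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic data: 98 fast requests and 2 slow ones.
latencies_ms = [80] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(p50_ms)  # 80
print(p99_ms)  # 5000
```

The mean comes out near 178 ms, a latency no request in the sample actually experienced. The median says most requests are fine; the p99 says one user in fifty waits five seconds.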
Queueing is the first place scalability becomes nonlinear. As utilization approaches 100%, waiting time explodes because each new request has fewer idle gaps to land in. That is why a system can look fine at 50% load and fall apart at 90% without any single request getting more expensive. SLO design therefore names both the percentile target and the operating headroom: p99 under 200 ms at 60% normal utilization is a different promise from p99 under 200 ms while saturated.
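The shape of that curve can be sketched with the simplest textbook queueing model. This assumes an M/M/1 queue (one server, Poisson arrivals, exponentially distributed service times), which the text above does not name and which real systems only approximate. Under that model the mean response time is the service time divided by (1 - utilization). The function name and the 10 ms service time are illustrative.

```python
def mm1_mean_response_ms(service_ms: float, utilization: float) -> float:
    """Mean time in system (waiting plus service) for an M/M/1 queue.

    Assumes a single server, Poisson arrivals, and exponential service
    times. utilization is arrival rate divided by service rate.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_ms / (1.0 - utilization)


# Each request costs 10 ms of service time at every load level.
for rho in (0.5, 0.6, 0.9, 0.99):
    print(f"utilization {rho:.2f}: "
          f"mean response {mm1_mean_response_ms(10.0, rho):.0f} ms")
```

With a fixed 10 ms service time, the model gives roughly 20 ms at 50% utilization, 100 ms at 90%, and 1,000 ms at 99%. The per-request cost never changes; only the waiting does, which is why headroom belongs in the SLO.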
Only after load and performance are named should mechanisms appear: caches for repeated reads, indexes for selective reads, partitions for data volume, replicas for read scale and availability, queues for burst absorption, batch processing for bounded recomputation, streams for freshness.
Maintainability: the cost humans pay
Maintainability is not softer than scalability. It is the property that decides whether the system can keep changing without collapsing under its own special cases. Three subproperties matter: operability, simplicity, and evolvability.
Operability means people can understand what the system is doing: dashboards, logs, runbooks, backfills, replay, and clear ownership. Simplicity means the design has few enough states that the team can reason about all of them. A distributed transaction protocol may protect an invariant, but it also adds states, such as a stuck coordinator, that someone must detect and resolve. Evolvability means schemas, APIs, encodings, and derived views can change while old and new code coexist.
The recurring trade-off: a design can buy scale by adding copies and asynchronous workflows, but every copy is another thing that can be stale, broken, or forgotten during migration. Maintainability asks whether the team can operate the design on a bad Tuesday.
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| Single-node relational database | Simple operations, strong local transactions | Scale and availability eventually hit one-machine limits |
| Cache | Lowers read latency and database load | Creates a second copy that can be stale or invalidated incorrectly |
| Queue / log | Absorbs bursts and decouples producers from consumers | Moves failure to lag, duplicates, replays, and poison messages |
| Microservices with independent stores | Team autonomy and per-service scaling | Cross-service correctness, schema evolution, and observability get harder |
For each mechanism, you should be able to name the contract it offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
If you optimize for scalability without reliability, you get fast corruption. If you optimize for reliability without maintainability, you get a system nobody dares change. If you optimize for maintainability without measuring load, you may beautifully operate the wrong bottleneck.
Design prompts
- Describe load for a feature store serving 20k requests/sec. What is the unit of load?
- A cache lowers p50 latency but p99 gets worse. Name two possible causes.
- What is the smallest reliable design for a model registry promotion workflow?