Replication I: Leaders, Followers, Lag, and Failover

Copies buy availability, read scale, and lower latency. The price is that copies can disagree, and the gap between them is where user-visible anomalies live.

First principle

Replication is not making data magically safer. It is running a protocol that decides where writes go, how copies catch up, and what readers are allowed to see while they are behind.

client write | v leader --replication log--> follower A | follower B +--> ack policy: before or after followers receive? sync replication: safer, slower async replication: faster, stale reads and possible data loss on failover

Why replicate?

Replication has several jobs. It keeps the system available when a node fails. It lets reads happen near users or on read replicas. It can reduce latency by placing copies geographically close. It can increase read throughput when the leader would otherwise be overloaded.

The abstraction sounds simple: keep copies of the same data. The difficulty is that writes happen over time and networks are unreliable. At any instant, some replicas may have seen a write and others may not. Replication is therefore a question about time: who knows what, when?

Single-leader replication

The common pattern is single leader. All writes go to one leader. The leader appends changes to its log and sends them to followers. Followers apply the log in order. Reads may go to the leader or followers depending on freshness requirements.

Synchronous replication waits for follower acknowledgement before confirming the write. It reduces data loss risk but adds latency and can reduce availability if followers are slow. Asynchronous replication returns after the leader commits locally, then followers catch up later. It is fast and available, but followers lag.

Lag creates anomalies: a user writes a profile update, then reads from a follower and does not see it. A timeline shows comments out of order. A permission change has not propagated, so an old replica allows an action that should be denied. The fix is not always "make everything synchronous"; it is to choose consistency per operation.

Failover is where hidden assumptions surface

When the leader fails, the system must choose a new leader. If replication was asynchronous, the chosen follower may be missing writes that the old leader acknowledged. If the old leader comes back, it may believe it is still leader. If two leaders accept writes, you have split brain.

Failover needs leader election, fencing, and careful client routing. Many systems use consensus or an external coordinator to decide leadership. Others accept small windows of data loss for lower latency. The point is to name the contract: can an acknowledged write be lost? Can reads be stale? How quickly must failover happen?

Trade-offs

Choice	Buys	Costs
Synchronous follower	Lower risk of acknowledged data loss	Higher write latency; follower slowness affects availability
Asynchronous follower	Low write latency and tolerant of slow replicas	Stale reads and data-loss window on failover
Read from followers	Scales reads and reduces geographic latency	Read-your-writes may break
Read from leader	Fresher and simpler semantics	Leader becomes read bottleneck

What you can now decide

You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

Without a clear replication contract, product behavior depends on which replica a request randomly hits. Users experience ghosts: writes that disappear, permissions that lag, and timelines that reorder themselves.

Design prompts

How would you guarantee read-your-writes for profile edits while still using read replicas?
What data loss window exists with async replication?
For model registry state, would you allow follower reads? Why?