Consensus and Coordination
Consensus is the machinery for making several unreliable machines decide one thing irrevocably. Use it where the invariant truly needs one winner.
Consensus turns a distributed decision into a durable, agreed log entry. Its power is agreement; its cost is coordination on the write path.
What consensus solves
Consensus lets nodes agree on a value even if some crash or messages are delayed. In practice, systems use consensus to maintain a replicated log: each committed entry is a decision, and all nodes apply decisions in the same order.
This solves leader election, metadata changes, distributed locks, membership, configuration, shard ownership, and control-plane state. It is not usually the right tool for every user write or every event. It is the right tool when the system must pick one winner and every participant must respect that choice.
Raft-shaped intuition
Raft makes the core idea approachable. Nodes elect a leader for a term. Clients send proposals to the leader. The leader appends entries to its log and replicates them to followers. Once a majority stores an entry, it is committed. Future leaders must contain committed entries, so committed history is preserved across failover.
The majority is the trick: any two majorities overlap. That overlap carries knowledge from one decision to the next. You do not need every node alive; you need enough nodes that two decisions cannot be made by disjoint groups.
The price is that writes require quorum round trips. If the network partitions, the minority side cannot commit. That is the availability cost of agreement.
Coordination boundaries
The most important design question is not "can consensus solve this?" It usually can. The question is "should this be on the consensus path?" Consensus is perfect for metadata: who owns shard 12, which model version is production, which schema version is active. It is often too expensive for high-volume data-plane events.
A common architecture separates control plane and data plane. Consensus protects small, critical decisions. High-throughput data flows through logs, partitions, and idempotent processors. The control plane decides ownership; the data plane moves bytes.
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| Consensus metadata | Strong ownership and clear failover | Quorum dependency and operational complexity |
| No coordination | Fast, available, simple data plane | Cannot enforce single-winner invariants |
| Distributed lock | Mutual exclusion for critical sections | Requires fencing to protect resources from old holders |
| 2PC atomic commit | All participants commit or abort together | Blocking and coordinator failure complexity |
You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
Without consensus or an equivalent coordination mechanism, split brain is not a rare bug; it is the expected outcome during partitions when two sides can both believe they are allowed to decide.
Design prompts
- Which parts of a distributed training scheduler belong in a consensus-backed control plane?
- Why does majority quorum preserve committed log entries across leader changes?
- Why is a lock service insufficient without fencing tokens?