Consensus and Coordination

Consensus is the machinery for making several unreliable machines decide one thing irrevocably. Use it where the invariant truly needs one winner.

First principle

Consensus turns a distributed decision into a durable, agreed log entry. Its power is agreement; its cost is coordination on the write path.

client proposal
      |
      v
leader proposes log entry
      |
      +--> majority accepts
              |
              v
      entry committed; all correct nodes eventually learn same decision

What consensus solves

Consensus lets nodes agree on a value even if some crash or messages are delayed. In practice, systems use consensus to maintain a replicated log: each committed entry is a decision, and all nodes apply decisions in the same order.
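To make the "same decisions, same order" idea concrete, here is a minimal sketch. The `Replica` class and the example log entries are hypothetical, and the committed log is simply handed to each replica; a real system would obtain it through the consensus protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that applies committed log entries in order."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply_committed(self, log):
        # Apply every committed entry not yet applied, strictly in log order.
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)


# Each committed entry is one decision (illustrative values only).
committed_log = [("shard-12", "node-a"), ("schema", "v7"), ("shard-12", "node-c")]

fast, slow = Replica(), Replica()
fast.apply_committed(committed_log)
slow.apply_committed(committed_log[:1])  # a lagging replica sees only a prefix
slow.apply_committed(committed_log)      # and catches up later

# Both replicas end in the same state because they applied the same
# entries in the same order, even though they did so at different times.
assert fast.state == slow.state
```

The lagging replica is temporarily behind but never sees a different history, which is the property the replicated log is meant to provide.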

A replicated log of this kind underpins leader election, metadata changes, distributed locks, membership, configuration, shard ownership, and control-plane state. It is not usually the right tool for every user write or every event. It is the right tool when the system must pick one winner and every participant must respect that choice.

Raft-shaped intuition

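Raft makes the core idea approachable. Nodes elect a leader for a term. Clients send proposals to the leader. The leader appends entries to its log and replicates them to followers. Once a majority of nodes has stored an entry from the leader's current term, that entry is committed. A candidate can win an election only if its log is at least as up to date as the logs of the majority that votes for it, so every future leader already holds all committed entries and committed history is preserved across failover.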
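The commit rule can be sketched as a small calculation. The function below is an illustration, not Raft's full rule: `match_index` is an assumed list holding, for each node including the leader, the highest log index that node has stored, and the sketch leaves out the additional requirement that the entry belong to the leader's current term.

```python
def commit_index(match_index):
    """Return the highest log index stored on a majority of nodes.

    match_index: one entry per node (leader included) giving the highest
    log index that node has durably stored. Hypothetical input shape.
    """
    majority = len(match_index) // 2 + 1
    # Sort descending; the value at position (majority - 1) is stored on
    # at least `majority` nodes, and no higher index is.
    ranked = sorted(match_index, reverse=True)
    return ranked[majority - 1]


# Five nodes: the leader and one follower are at index 7, others lag.
print(commit_index([7, 7, 5, 4, 2]))  # 5
```

Index 7 is on only two of five nodes, so it is not yet committed; index 5 is on three, so everything up to 5 is.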

The majority is the trick: any two majorities overlap. That overlap carries knowledge from one decision to the next. You do not need every node alive; you need enough nodes that two decisions cannot be made by disjoint groups.
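The overlap claim is easy to check by brute force for a small cluster. This sketch enumerates every majority of a five-node cluster (node names are placeholders) and confirms that no two of them are disjoint.

```python
from itertools import combinations


def majorities(nodes):
    """All subsets of `nodes` that contain more than half of them."""
    need = len(nodes) // 2 + 1
    return [
        set(group)
        for size in range(need, len(nodes) + 1)
        for group in combinations(nodes, size)
    ]


def all_majorities_overlap(nodes):
    """True if every pair of majorities shares at least one node."""
    quorums = majorities(nodes)
    return all(a & b for a in quorums for b in quorums)


cluster = ["a", "b", "c", "d", "e"]
assert all_majorities_overlap(cluster)
```

The general argument is a counting one: two groups that each hold more than half of the nodes would need more nodes than exist if they shared none.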

The price is that every write waits for at least one round trip to a quorum before it commits. If the network partitions, any side that does not hold a majority cannot commit. That is the availability cost of agreement.
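A short sketch of the availability cost, assuming a hypothetical `can_commit` helper that counts how many nodes (including itself) a would-be leader can reach:

```python
def can_commit(reachable, cluster_size):
    """True if `reachable` nodes (self included) form a majority."""
    return reachable >= cluster_size // 2 + 1


# A 5-node cluster split 3 / 2: only the larger side keeps committing.
assert can_commit(3, 5)
assert not can_commit(2, 5)

# A 4-node cluster split 2 / 2: neither side has a majority,
# so the whole cluster stops accepting writes until the partition heals.
assert not can_commit(2, 4)
```

The even split is one reason clusters are usually sized with an odd number of nodes: a fourth node adds replication work without letting the cluster tolerate any more failures than three nodes would.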

Coordination boundaries

The most important design question is not "can consensus solve this?" It usually can. The question is "should this be on the consensus path?" Consensus is well suited to small, rarely changing metadata: who owns shard 12, which model version is production, which schema version is active. It is often too expensive for high-volume data-plane events.

A common architecture separates control plane and data plane. Consensus protects small, critical decisions. High-throughput data flows through logs, partitions, and idempotent processors. The control plane decides ownership; the data plane moves bytes.
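The split can be sketched as two small classes. Everything here is hypothetical: `ControlPlane` stands in for a consensus-backed metadata store (its `assign` method would go through the replicated log in a real system), and `DataPlaneWorker` reads ownership directly, where a real worker would more likely use a cached or leased copy.

```python
class ControlPlane:
    """Stand-in for a consensus-backed metadata store (hypothetical)."""

    def __init__(self):
        self.owners = {}    # shard id -> owning node name
        self.decisions = 0  # how many coordinated writes have happened

    def assign(self, shard, node):
        # In a real system this write would be a committed log entry.
        self.owners[shard] = node
        self.decisions += 1


class DataPlaneWorker:
    """Processes records for shards it owns, without per-record coordination."""

    def __init__(self, name, control_plane):
        self.name = name
        self.control_plane = control_plane
        self.processed = []

    def handle(self, shard, records):
        # One ownership check per batch; the records themselves never
        # touch the consensus path.
        if self.control_plane.owners.get(shard) != self.name:
            return 0
        self.processed.extend(records)
        return len(records)


cp = ControlPlane()
cp.assign(12, "node-a")

worker_a = DataPlaneWorker("node-a", cp)
worker_b = DataPlaneWorker("node-b", cp)

handled_by_a = worker_a.handle(12, list(range(1000)))
handled_by_b = worker_b.handle(12, list(range(1000)))
```

In this sketch one coordinated decision (the shard assignment) governs a thousand uncoordinated record writes, which is the ratio the control-plane and data-plane split is meant to achieve.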

Trade-offs

| Choice | Buys | Costs |
|---|---|---|
| Consensus metadata | Strong ownership and clear failover | Quorum dependency and operational complexity |
| No coordination | Fast, available, simple data plane | Cannot enforce single-winner invariants |
| Distributed lock | Mutual exclusion for critical sections | Requires fencing to protect resources from old holders |
| 2PC atomic commit | All participants commit or abort together | Blocking and coordinator failure complexity |

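The fencing requirement in the distributed-lock row can be illustrated with a minimal sketch. `LockService` and `FencedStore` are hypothetical names; the assumption is that the lock service hands out a strictly increasing token with each grant and that the protected resource checks the token on every write.

```python
class LockService:
    """Hands out a strictly increasing token with each lock grant."""

    def __init__(self):
        self.token = 0

    def acquire(self, holder):
        # Lease expiry is not modeled; each call represents a new grant.
        self.token += 1
        return self.token


class FencedStore:
    """Resource that rejects writes carrying an older token than it has seen."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale holder: its lock was superseded
        self.highest_token = token
        self.value = value
        return True


locks, store = LockService(), FencedStore()

token_a = locks.acquire("client-a")  # A gets the lock, then stalls (e.g. a long pause)
token_b = locks.acquire("client-b")  # A's lease expires; B is granted the lock

b_ok = store.write(token_b, "from-b")
a_ok = store.write(token_a, "from-a")  # A wakes up still believing it holds the lock
```

Here `a_ok` is `False`: the lock alone could not stop the old holder from writing, but the token check at the resource can.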
What you can now decide

You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

Without consensus or an equivalent coordination mechanism, split brain is not a rare bug; it is the expected outcome during partitions when two sides can both believe they are allowed to decide.

Design prompts

  1. Which parts of a distributed training scheduler belong in a consensus-backed control plane?
  2. Why does majority quorum preserve committed log entries across leader changes?
  3. Why is a lock service insufficient without fencing tokens?