Partial Failures, Timeouts, Clocks, and Fencing
Distributed systems are not just slow local systems. Some parts can fail, pause, lie by omission, or come back from the dead after everyone replaced them.
The defining feature of distributed systems is partial failure: one component can be broken or unreachable while the rest continues making decisions.
Timeouts are guesses
When a node does not respond, you do not know whether the request was lost, the response was lost, the node is slow, the node crashed, or the network is partitioned. A timeout converts uncertainty into a decision. It is necessary, but it is not truth.
Short timeouts detect problems quickly but create false suspicions during pauses and tail latency spikes. Long timeouts avoid false positives but slow failover and hold resources. Retry too aggressively and you amplify load exactly when the system is struggling. Retry without idempotency and you duplicate work.
Design every remote operation as maybe executed, maybe not, maybe executed twice. That one sentence explains idempotency keys, deduplication, request ids, leases, and fencing tokens.
Clocks are useful, dangerous, and different
Time-of-day clocks can jump forward or backward when synchronized. Monotonic clocks only move forward and are suitable for measuring durations. Logical clocks capture ordering without pretending to know wall time. Mixing these up creates subtle bugs.
Clock-based ordering is especially dangerous for conflict resolution. Last-write-wins depends on comparable timestamps, but real clocks have skew. A write from a machine with a bad clock can erase newer intent. Use physical time for observability and expiration with margins; use logical ordering or coordination when correctness depends on order.
In stream processing, event time and processing time differ. In leases, clock drift decides whether two owners believe they hold the lease. In model monitoring, timestamp bugs can make fresh features look stale or stale features look fresh.
Process pauses and fencing
A process can pause for garbage collection, overload, page faults, scheduler delays, or stop-the-world runtime behavior. Other nodes may declare it dead and elect a replacement. Then the paused process resumes, still believing it owns the resource.
Fencing solves this by giving each lease or leadership term a monotonically increasing token. Downstream systems reject operations with old tokens. The old leader can wake up and send writes, but storage refuses them because a newer owner has fenced it off.
This is the pattern behind safe distributed locks, model deployment ownership, shard leaders, and workflow executors. The lock service alone is not enough; the resource being protected must check the token.
Trade-offs
| Choice | Buys | Costs |
|---|---|---|
| Short timeout | Fast detection and failover | False positives and retry storms |
| Long timeout | Fewer false suspicions | Slow recovery and resource buildup |
| Wall-clock ordering | Simple and observable | Clock skew can corrupt causality |
| Fencing tokens | Protects against zombie leaders | Requires downstream enforcement |
You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.
If timeout equals truth in your mental model, retries duplicate side effects, old leaders keep writing, and clock skew becomes a data-loss mechanism.
Design prompts
- A request timed out. What are the possible states of the server-side operation?
- How would you protect a model registry from an old deployment worker resuming after pause?
- When is a monotonic clock preferable to a wall clock?