all_lessons / data_intensive_systems / 11 · partial failures lesson 11 / 16 · ~12 min

Partial Failures, Timeouts, Clocks, and Fencing

Distributed systems are not just slow local systems. Some parts can fail, pause, lie by omission, or come back from the dead after everyone replaced them.

First principle

The defining feature of distributed systems is partial failure: one component can be broken or unreachable while the rest continues making decisions.

client -> node A -> node B timeout means: no response before deadline timeout does not mean: request did not run timeout does not mean: node is dead timeout does not mean: retry is safe

Timeouts are guesses

When a node does not respond, you do not know whether the request was lost, the response was lost, the node is slow, the node crashed, or the network is partitioned. A timeout converts uncertainty into a decision. It is necessary, but it is not truth.

Short timeouts detect problems quickly but create false suspicions during pauses and tail latency spikes. Long timeouts avoid false positives but slow failover and hold resources. Retry too aggressively and you amplify load exactly when the system is struggling. Retry without idempotency and you duplicate work.

Design every remote operation as maybe executed, maybe not, maybe executed twice. That one sentence explains idempotency keys, deduplication, request ids, leases, and fencing tokens.

Clocks are useful, dangerous, and different

Time-of-day clocks can jump forward or backward when synchronized. Monotonic clocks only move forward and are suitable for measuring durations. Logical clocks capture ordering without pretending to know wall time. Mixing these up creates subtle bugs.

Clock-based ordering is especially dangerous for conflict resolution. Last-write-wins depends on comparable timestamps, but real clocks have skew. A write from a machine with a bad clock can erase newer intent. Use physical time for observability and expiration with margins; use logical ordering or coordination when correctness depends on order.

In stream processing, event time and processing time differ. In leases, clock drift decides whether two owners believe they hold the lease. In model monitoring, timestamp bugs can make fresh features look stale or stale features look fresh.

Process pauses and fencing

A process can pause for garbage collection, overload, page faults, scheduler delays, or stop-the-world runtime behavior. Other nodes may declare it dead and elect a replacement. Then the paused process resumes, still believing it owns the resource.

Fencing solves this by giving each lease or leadership term a monotonically increasing token. Downstream systems reject operations with old tokens. The old leader can wake up and send writes, but storage refuses them because a newer owner has fenced it off.

This is the pattern behind safe distributed locks, model deployment ownership, shard leaders, and workflow executors. The lock service alone is not enough; the resource being protected must check the token.

Trade-offs

ChoiceBuysCosts
Short timeoutFast detection and failoverFalse positives and retry storms
Long timeoutFewer false suspicionsSlow recovery and resource buildup
Wall-clock orderingSimple and observableClock skew can corrupt causality
Fencing tokensProtects against zombie leadersRequires downstream enforcement
What you can now decide

You should be able to name the contract this mechanism offers, the workload or invariant that justifies it, and the bill it sends somewhere else: read latency, write latency, storage, availability, freshness, or operational complexity.

What breaks if you skip this?

If timeout equals truth in your mental model, retries duplicate side effects, old leaders keep writing, and clock skew becomes a data-loss mechanism.

Design prompts

  1. A request timed out. What are the possible states of the server-side operation?
  2. How would you protect a model registry from an old deployment worker resuming after pause?
  3. When is a monotonic clock preferable to a wall clock?