concept

Partition Mode Design (3-state distributed systems)

created 2026-05-25 distributed-systems · cap · partition · design-pattern · brewer

Brewer’s 2012 design pattern: instead of two states (working / broken), a distributed system should have three explicit states, each with deliberately designed behavior.

The three states

| State | What’s happening | What you design |
| --- | --- | --- |
| Normal mode | Everything connected | Optimize for common case: load balancers, replicas, caches, health checks |
| Partition mode | A node is unreachable, detected deliberately (not via a generic error handler) | Per-operation rules: what to allow, block, queue, bound |
| Recovery mode | Partition healed, but nodes ran independently, so their states have diverged | Merge protocol: reconcile, compensate, notify |

Most teams design for normal and broken. Recovery mode in particular gets treated as “back to normal,” which is wrong — there are two histories that need merging, and some of the operations have already been shown to users (you can’t quietly undo them).
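
As a rough illustration of why recovery is its own state, the three modes can be sketched as a small state machine. The event names (`peer_unreachable`, `peer_reachable`, `merge_complete`) are hypothetical, and the rule that a new failure during recovery drops back to partition mode is an assumption of this sketch, not something the pattern prescribes.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    PARTITION = "partition"
    RECOVERY = "recovery"


# Hypothetical event names. There is deliberately no direct
# PARTITION -> NORMAL edge: a healed partition must pass through RECOVERY.
TRANSITIONS = {
    (Mode.NORMAL, "peer_unreachable"): Mode.PARTITION,
    (Mode.PARTITION, "peer_reachable"): Mode.RECOVERY,
    (Mode.RECOVERY, "merge_complete"): Mode.NORMAL,
    # Assumption: losing the peer again mid-merge returns to partition mode.
    (Mode.RECOVERY, "peer_unreachable"): Mode.PARTITION,
}


def next_mode(mode: Mode, event: str) -> Mode:
    """Return the mode after `event`; events with no entry leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)
```

With this table, a reconnect alone never yields `Mode.NORMAL`; only a completed merge does.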

How to design partition mode

Walk through your operations one by one. For each, ask: what should happen if this runs without access to the rest of the system?

| Operation | Partition-mode decision |
| --- | --- |
| Read user profile | Serve from local cache → allow |
| Register new email (must be globally unique) | Can’t verify → block |
| ATM withdrawal | Balance could go negative → allow up to a bounded limit |
| Charge credit card | External, irreversible action → queue for after recovery |
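
One way to make these decisions explicit in code is a lookup table consulted whenever the system is in partition mode. This is a minimal sketch: the operation keys and the `decide` helper are hypothetical names, and blocking any unlisted operation (failing closed) is an assumption of the sketch rather than part of the pattern.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    BOUND = "bound"   # allow, but only up to a fixed limit
    QUEUE = "queue"   # record now, execute after recovery


# Hypothetical operation names mirroring the examples listed above.
PARTITION_POLICY = {
    "read_user_profile": Decision.ALLOW,
    "register_email": Decision.BLOCK,
    "atm_withdrawal": Decision.BOUND,
    "charge_credit_card": Decision.QUEUE,
}


def decide(operation: str) -> Decision:
    """Look up the partition-mode rule for an operation."""
    # Assumption: an operation nobody has reviewed is blocked (fail closed).
    return PARTITION_POLICY.get(operation, Decision.BLOCK)
```

Keeping the table in one place means an operation without an entry is visible as a gap in the design instead of falling through to a generic error handler.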

The exercise itself is more valuable than the document. It surfaces decisions your team has been making implicitly for years. Doing it on paper is uncomfortable; discovering the answers at 3 AM during an incident is worse.

The ATM as canonical example

When was the last time an ATM refused your withdrawal because it couldn’t reach the bank? Probably never.

That’s not luck — it’s an explicit partition-mode design. ATMs have one core invariant: balance should never go below zero. Strict consistency would refuse the transaction when offline. Instead, ATMs enter stand-in mode:

  1. Allow withdrawals up to a bounded limit (e.g. $200/day offline).
  2. Record every transaction locally.
  3. On reconnect: upload the log, bank reconciles.
  4. If balance went negative during the window → overdraft fee.
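
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not how any real ATM network is implemented: the class and function names are hypothetical, amounts are whole dollars, the $200 limit is the example figure from step 1, the $35 fee is invented, and the daily reset of the limit is omitted.

```python
from dataclasses import dataclass, field

OFFLINE_DAILY_LIMIT = 200  # example figure from step 1
OVERDRAFT_FEE = 35         # hypothetical fee amount


@dataclass
class StandInATM:
    """ATM side: bounded withdrawals plus a local log (steps 1 and 2)."""
    withdrawn: dict = field(default_factory=dict)  # account -> total taken offline
    log: list = field(default_factory=list)        # (account, amount) entries

    def withdraw(self, account: str, amount: int) -> bool:
        taken = self.withdrawn.get(account, 0)
        if amount <= 0 or taken + amount > OFFLINE_DAILY_LIMIT:
            return False  # would exceed the bound, so refuse
        self.withdrawn[account] = taken + amount
        self.log.append((account, amount))
        return True


def reconcile(balances: dict, log: list) -> dict:
    """Bank side: apply the uploaded log, then compensate (steps 3 and 4)."""
    updated = dict(balances)
    for account, amount in log:
        updated[account] -= amount
    for account in {acct for acct, _ in log}:
        if updated[account] < 0:
            updated[account] -= OVERDRAFT_FEE  # invariant was violated: charge the fee
    return updated
```

In this sketch the bound caps the worst case: however many withdrawals are attempted offline, one account can go at most `OFFLINE_DAILY_LIMIT` below its starting balance before reconciliation.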

The bank doesn’t prevent every mistake. It bounds the maximum mistake, detects the violation on reconciliation, and compensates afterward. The overdraft fee is the answer to “what does your system do when it can’t have all three?” (consistency, availability, and partition tolerance). The same shape appears in:

  • Ride-sharing apps that confirm bookings before driver acceptance.
  • E-commerce checkouts that accept payment before inventory is confirmed reserved.
  • Airlines that issue boarding passes before downstream systems are caught up.
  • Food delivery apps that confirm an order before the restaurant accepts.

All of these follow the same sequence: move forward → bound the damage → reconcile afterward. When you formalize this into a workflow primitive, you get the saga-pattern.
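
A minimal sketch of that primitive, assuming each step is paired with a compensating action and that a failure triggers the compensations of the completed steps in reverse order. `run_saga` and the `Step` alias are hypothetical names; real saga implementations typically also persist progress and retry, which is omitted here.

```python
from typing import Callable

# Each step pairs a forward action with the action that compensates for it.
Step = tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    compensations = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(compensate)
    return True
```

For example, a checkout that charges a card and then fails to reserve inventory would end with the charge refunded:

```python
events = []

def reserve_inventory():
    raise RuntimeError("out of stock")

ok = run_saga([
    (lambda: events.append("charge"), lambda: events.append("refund")),
    (reserve_inventory, lambda: events.append("release")),
])
```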

See also

  • cap-theorem — the framework this pattern operationalizes
  • pacelc — the latency dimension that shapes partition-vs-normal decisions
  • saga-pattern — accept-bound-log-compensate as a named primitive
  • crdt — for data where the merge can be made automatic
  • local-first-software — partition mode as the default state