Partition Mode Design

Brewer’s 2012 design pattern: instead of two states (working / broken), distributed systems should have three explicit states with deliberately-designed behavior for each.

The three states

State	What’s happening	What you design
Normal mode	Everything connected	Optimize for common case: load balancers, replicas, caches, health checks
Partition mode	A node is unreachable, detected deliberately (not via a generic error handler)	Per-operation rules: what to allow, block, queue, bound
Recovery mode	Partition healed, but nodes ran independently — their states have diverged	Merge protocol: reconcile, compensate, notify

Most teams design for normal and broken. Recovery mode in particular gets treated as “back to normal,” which is wrong — there are two histories that need merging, and some of the operations have already been shown to users (you can’t quietly undo them).
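One way to keep recovery from collapsing into "back to normal" is to make the three states an explicit state machine in which normal mode is only reachable through recovery. The sketch below is a minimal illustration under that assumption; the names Mode, Node, and the on_* handlers are hypothetical and do not come from Brewer's article.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    PARTITION = "partition"
    RECOVERY = "recovery"


# Allowed transitions. PARTITION can only lead to RECOVERY, so a healed
# partition cannot jump straight back to NORMAL without a merge step.
TRANSITIONS = {
    Mode.NORMAL: {Mode.PARTITION},
    Mode.PARTITION: {Mode.RECOVERY},
    Mode.RECOVERY: {Mode.NORMAL, Mode.PARTITION},  # a new partition can hit mid-merge
}


class Node:
    def __init__(self) -> None:
        self.mode = Mode.NORMAL

    def transition(self, target: Mode) -> None:
        if target not in TRANSITIONS[self.mode]:
            raise ValueError(f"illegal transition {self.mode.name} -> {target.name}")
        self.mode = target

    def on_peer_unreachable(self) -> None:
        # Called by deliberate partition detection, not a generic error handler.
        if self.mode is not Mode.PARTITION:
            self.transition(Mode.PARTITION)

    def on_peer_reachable(self) -> None:
        # Connectivity is back, but histories may have diverged: merge first.
        if self.mode is Mode.PARTITION:
            self.transition(Mode.RECOVERY)

    def on_merge_complete(self) -> None:
        if self.mode is Mode.RECOVERY:
            self.transition(Mode.NORMAL)
```

The point of the transition table is that skipping the merge becomes a programming error that raises, rather than a silent default.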

How to design partition mode

Walk through your operations one by one. For each, ask: what should happen if this runs without access to the rest of the system?

Operation	Partition-mode decision
Read user profile	Serve from local cache — allow
Register new email (must be globally unique)	Can’t verify — block
ATM withdrawal	Balance could go negative — allow up to a bounded limit
Charge credit card	External, irreversible action — queue for after recovery
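The table above can be written down as data so that the decisions are reviewable and testable. This is a rough sketch, not a prescribed format: the operation names, the Policy type, and the fail-closed default for unclassified operations are assumptions made for illustration, and the 200 limit is applied per request here for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    BOUND = "bound"
    QUEUE = "queue"


@dataclass(frozen=True)
class Policy:
    decision: Decision
    limit: Optional[float] = None  # only meaningful for BOUND


# One entry per operation, mirroring the table (names are hypothetical).
PARTITION_POLICY = {
    "read_user_profile": Policy(Decision.ALLOW),          # serve from local cache
    "register_email": Policy(Decision.BLOCK),             # uniqueness can't be verified
    "atm_withdrawal": Policy(Decision.BOUND, limit=200.0),
    "charge_credit_card": Policy(Decision.QUEUE),         # run after recovery
}


def decide(operation: str, amount: float = 0.0) -> Decision:
    """Return what to do with `operation` while in partition mode."""
    policy = PARTITION_POLICY.get(operation)
    if policy is None:
        # Assumption: operations nobody has classified yet fail closed.
        return Decision.BLOCK
    if policy.decision is Decision.BOUND:
        return Decision.ALLOW if amount <= policy.limit else Decision.BLOCK
    return policy.decision
```

Keeping the rules in a single mapping means a new operation shows up as a missing entry in review, which is one way to force the decision to be made explicitly.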

The exercise itself is more valuable than the document. It surfaces decisions your team has been making implicitly for years. Doing it on paper is uncomfortable; discovering the answers at 3 AM during an incident is worse.

The ATM as canonical example

When was the last time an ATM refused your withdrawal because it couldn’t reach the bank? Probably never.

That’s not luck — it’s an explicit partition-mode design. ATMs have one core invariant: balance should never go below zero. Strict consistency would refuse the transaction when offline. Instead, ATMs enter stand-in mode:

Allow withdrawals up to a bounded limit (e.g. $200/day offline).
Record every transaction locally.
On reconnect: upload the log, bank reconciles.
If balance went negative during the window → overdraft fee.
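The four steps above can be sketched as a small simulation. It is a toy model under stated assumptions, not how any real ATM network is implemented: StandInATM, reconcile, and the fee amount are hypothetical, amounts are whole dollars, and the daily reset of the offline limit is left out.

```python
from dataclasses import dataclass, field

OFFLINE_DAILY_LIMIT = 200  # the example limit from the text
OVERDRAFT_FEE = 35         # hypothetical fee, chosen for illustration


@dataclass
class StandInATM:
    # account -> total dispensed while offline (daily reset omitted)
    withdrawn_offline: dict = field(default_factory=dict)
    # local record of every offline transaction, uploaded on reconnect
    log: list = field(default_factory=list)

    def withdraw(self, account: str, amount: int) -> bool:
        already = self.withdrawn_offline.get(account, 0)
        if already + amount > OFFLINE_DAILY_LIMIT:
            return False  # bound the maximum mistake
        self.withdrawn_offline[account] = already + amount
        self.log.append((account, amount))
        return True


def reconcile(balances: dict, log: list) -> dict:
    """Apply an uploaded ATM log to bank balances; return fees charged per account."""
    for account, amount in log:
        balances[account] -= amount
    fees = {}
    for account in {acct for acct, _ in log}:
        if balances[account] < 0:
            # Invariant was violated during the partition: compensate after the fact.
            balances[account] -= OVERDRAFT_FEE
            fees[account] = OVERDRAFT_FEE
    return fees


atm = StandInATM()
atm.withdraw("alice", 150)      # allowed: within the offline limit
atm.withdraw("alice", 100)      # refused: would exceed 200 offline
bank = {"alice": 120}
fees = reconcile(bank, atm.log)  # alice ends at 120 - 150 - 35 = -65
```

Note that the ATM never checks the balance; it only enforces the bound. The balance invariant is checked later, by the bank, during reconciliation.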

The bank doesn’t prevent every mistake. It bounds the maximum mistake, detects the violation on reconciliation, and compensates afterward. The overdraft fee is the answer to “what does your system do when it can’t have all three?” (consistency, availability, and partition tolerance, in CAP terms). The same shape appears in:

Ride-sharing apps that confirm bookings before driver acceptance.
E-commerce checkouts that accept payment before inventory is confirmed reserved.
Airlines that issue boarding passes before downstream systems are caught up.
Food delivery apps that confirm an order before the restaurant accepts.

All of these are: move forward → bound the damage → reconcile afterward. When you formalize this into a workflow primitive, you get the saga pattern.
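To make the "workflow primitive" concrete, here is a minimal sketch of a saga runner: each step pairs a forward action with a compensating action, and a failure triggers the compensations of the completed steps in reverse order. The run_saga function and the step names are hypothetical; production saga frameworks also deal with persistence, retries, and compensations that themselves fail, none of which is shown here.

```python
from typing import Callable, List, Tuple

# (name, forward action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    events: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            events.append(f"failed:{name}")
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                events.append(f"compensated:{done_name}")
            return events
        completed.append((name, compensate))
        events.append(f"done:{name}")
    return events


def charge_payment() -> None:
    pass  # stand-in for a call that succeeds


def refund_payment() -> None:
    pass  # compensation: the charge was visible, so it is refunded, not erased


def reserve_inventory() -> None:
    raise RuntimeError("out of stock")  # simulated failure


def release_inventory() -> None:
    pass


checkout_events = run_saga([
    ("charge_payment", charge_payment, refund_payment),
    ("reserve_inventory", reserve_inventory, release_inventory),
])
```

The compensation is a new, visible operation (a refund) rather than a rollback, which matches the earlier point that operations already shown to users cannot be quietly undone.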
