concept
Saga Pattern
A formal pattern from a 1987 paper by Hector Garcia-Molina and Kenneth Salem for breaking long, multi-service transactions into a sequence of small local transactions, each committed immediately, with compensating actions that undo the completed steps if a later step fails. This is the structure Brewer summarized as accept · bound · log · compensate.
The problem it solves
Traditional database transactions assume you can hold a lock on everything involved until the whole operation commits or rolls back. One database, one machine — fine. Across multiple services over a network, distributed locks create a cascade:
- Service A holds a lock waiting on B.
- B waits on C.
- C is slow.
- Everything stalls.
Under production traffic, this is a reliability problem before it’s a correctness problem.
How sagas work
Instead of one transaction spanning everything, the saga is a sequence of small local transactions. Each commits immediately. When step N fails, the saga runs a compensation path backward through the completed steps:
[charge card] → [reserve inventory] → [create shipment]
      ✓                  ✓                 ✗ fails
      ↓                  ↓
  [refund]      [release inventory]
The correctness guarantee is not “all or nothing” — it’s that the net effect, including compensations, leaves the system in an acceptable state. A refund is not the same as never having been charged; for the business, it’s equivalent enough.
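The flow in the diagram above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: `Step`, `run_saga`, and `demo` are hypothetical names introduced here, and the "services" are plain functions that mutate a dictionary. It assumes compensations always succeed, which a real system cannot assume.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # local transaction; takes effect immediately
    compensate: Callable[[], None]  # semantic undo for that transaction


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"failed: {step.name}")
            # Walk backward through the steps that already committed.
            for done in reversed(completed):
                done.compensate()
                log.append(f"compensated: {done.name}")
            return False
        completed.append(step)
        log.append(f"done: {step.name}")
    return True


def demo() -> Tuple[bool, Dict[str, bool], List[str]]:
    """Checkout where the third step fails (hypothetical stand-in services)."""
    state = {"charged": False, "reserved": False}
    log: List[str] = []

    def charge() -> None:
        state["charged"] = True

    def refund() -> None:
        state["charged"] = False

    def reserve() -> None:
        state["reserved"] = True

    def release() -> None:
        state["reserved"] = False

    def create_shipment() -> None:
        raise RuntimeError("carrier unavailable")

    def cancel_shipment() -> None:
        pass  # never reached here: the shipment step did not complete

    steps = [
        Step("charge card", charge, refund),
        Step("reserve inventory", reserve, release),
        Step("create shipment", create_shipment, cancel_shipment),
    ]
    ok = run_saga(steps, log)
    return ok, state, log
```

Calling `demo()` releases the inventory first and then refunds the card, mirroring the backward compensation path. In practice the log would be durable and each compensation would need to be retryable and idempotent, since the coordinator itself can crash partway through.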
Examples in the wild
- Every e-commerce checkout is a saga.
- Every loan approval that touches multiple services is a saga.
- Every subscription sign-up that needs to create accounts, charge cards, and provision access is a saga.
- An ATM’s stand-in mode (allow bounded withdrawal offline → reconcile + overdraft fee on reconnect) is a saga.
The pattern predates microservices by decades because the problem predates microservices by decades. We just have better tooling now (Temporal, AWS Step Functions, custom orchestrators).
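The ATM stand-in example maps onto accept · bound · log · compensate fairly directly. The sketch below is only an illustration under assumed rules: `StandInATM`, `reconcile`, the per-card offline limit, and the flat overdraft fee are all hypothetical, and amounts are integers in an arbitrary currency unit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StandInATM:
    offline_limit: int  # bound: max total offline withdrawal per card (assumed rule)
    log: List[Tuple[str, int]] = field(default_factory=list)

    def withdraw_offline(self, card: str, amount: int) -> bool:
        """Accept the withdrawal if it stays within the bound, and log it."""
        already = sum(a for c, a in self.log if c == card)
        if already + amount > self.offline_limit:
            return False  # outside the bound: refuse while offline
        self.log.append((card, amount))
        return True


def reconcile(log: List[Tuple[str, int]], balances: Dict[str, int],
              overdraft_fee: int) -> Dict[str, int]:
    """On reconnect, replay the log; compensate overdrafts with a fee."""
    for card, amount in log:
        balances[card] -= amount
        if balances[card] < 0:
            # Compensation: the withdrawal cannot be undone, so charge a fee.
            balances[card] -= overdraft_fee
    return balances
```

For example, with an offline limit of 200, withdrawals of 150 and 50 are accepted while a 100 in between is refused; reconciling against a balance of 180 with a fee of 35 leaves the account at -55. The fee is the compensation: the cash is already gone, so the system settles for an acceptable state rather than a rollback.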
Implementation flavors
- Choreography — each service emits events; the next service in the chain subscribes and acts. No central coordinator. Compensations are reverse-event chains. Simple at small scale, hard to reason about at scale.
- Orchestration — a saga coordinator (state machine) calls each step explicitly and triggers compensations on failure. Easier to reason about, observe, and replay. Temporal, AWS Step Functions, and BPMN engines are this flavor.
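To make the choreography flavor concrete, here is a small in-memory sketch where no component knows the whole workflow: each service reacts to one event and emits the next, and the compensations form a reverse-event chain. `EventBus`, `wire_services`, and the event names are hypothetical; a real deployment would use a message broker and durable subscriptions rather than a Python queue.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List


class EventBus:
    """Toy synchronous bus; stands in for a real message broker."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[["EventBus"], None]]] = defaultdict(list)
        self.queue: deque = deque()
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Callable[["EventBus"], None]) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str) -> None:
        self.queue.append(event)

    def run(self) -> None:
        # Deliver events in FIFO order until no service emits anything new.
        while self.queue:
            event = self.queue.popleft()
            self.history.append(event)
            for handler in self.handlers[event]:
                handler(self)


def wire_services(bus: EventBus, state: Dict[str, bool], shipping_ok: bool) -> None:
    def charge(b: EventBus) -> None:
        state["charged"] = True
        b.publish("CardCharged")

    def reserve(b: EventBus) -> None:
        state["reserved"] = True
        b.publish("InventoryReserved")

    def ship(b: EventBus) -> None:
        if shipping_ok:
            state["shipped"] = True
            b.publish("ShipmentCreated")
        else:
            b.publish("ShipmentFailed")

    def release(b: EventBus) -> None:  # compensation for reserve
        state["reserved"] = False
        b.publish("InventoryReleased")

    def refund(b: EventBus) -> None:   # compensation for charge
        state["charged"] = False
        b.publish("CardRefunded")

    # Forward chain.
    bus.subscribe("OrderPlaced", charge)
    bus.subscribe("CardCharged", reserve)
    bus.subscribe("InventoryReserved", ship)
    # Reverse-event chain for compensation.
    bus.subscribe("ShipmentFailed", release)
    bus.subscribe("InventoryReleased", refund)
```

Publishing `OrderPlaced` with `shipping_ok=False` and calling `bus.run()` produces the history OrderPlaced, CardCharged, InventoryReserved, ShipmentFailed, InventoryReleased, CardRefunded. The workflow exists only implicitly in the subscription table, which is why this flavor gets harder to follow as services are added, and why an orchestrator that holds the sequence in one place is easier to observe and replay.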
See also
- cap-theorem — sagas answer the “what do we do when a partition forces us off CA” question for cross-service workflows
- partition-mode-design — the saga is how you implement an operation in partition mode for ops that span services
- crdt — the data-side answer; sagas are the workflow-side answer