concept
Dead Letter Queue (DLQ) + replay path
Dead Letter Queue (DLQ)
A separate holding area for messages that have failed processing more than N times. Without one, a single malformed message (“poison message”) can freeze an entire pipeline. With one — but without a replay path — your DLQ is a graveyard.
The poison message
One malformed message arrives. The consumer throws. The message goes back to the queue. Gets retried. Throws again. Retried again. Meanwhile, every message behind it waits. In an ordered log such as a Kafka partition, the consumer cannot move past the failing record without skipping it, so the entire partition is frozen.
One poison message can take down a pipeline handling millions of good messages per minute. The industry term is — fittingly — a “poison message” or “poison pill.”
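A minimal simulation of that stall, assuming an in-order queue where a failed message is redelivered at the head. The names `process` and `run_without_dlq` are hypothetical, and "malformed" here simply means a string that is not a number.

```python
from collections import deque


def process(message: str) -> None:
    """Hypothetical handler: raises ValueError on anything that is not an integer."""
    int(message)


def run_without_dlq(messages: list, max_ticks: int) -> tuple:
    """Consume in order; a failed message stays at the head and is retried forever."""
    queue = deque(messages)
    processed = []
    for _ in range(max_ticks):
        if not queue:
            break
        head = queue[0]
        try:
            process(head)
        except ValueError:
            continue  # redelivered at the head; nothing behind it can move
        queue.popleft()
        processed.append(head)
    return processed, list(queue)


# "oops" is the poison message; "3" and "4" are perfectly good but never run.
processed, still_waiting = run_without_dlq(["1", "2", "oops", "3", "4"], max_ticks=1000)
```

After 1000 ticks, `processed` still holds only the two messages that arrived before the bad one, and `still_waiting` holds the poison message plus everything behind it.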
The fix (incomplete version)
After a configured number of attempts, the bad message gets moved to a separate holding area — the DLQ. The main pipeline resumes. The poison message is preserved for inspection. Not retried into oblivion, not silently dropped.
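A rough sketch of the retry cap, assuming an in-memory queue and a cap of 3 attempts. `Envelope`, `MAX_ATTEMPTS`, and `run_with_dlq` are illustrative names, not any broker's API; real brokers track the attempt count for you or expect your consumer to.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumed cap; tune per pipeline


@dataclass
class Envelope:
    body: str
    attempts: int = 0


def process(body: str) -> None:
    """Hypothetical handler: raises ValueError on anything that is not an integer."""
    int(body)


def run_with_dlq(bodies: list) -> tuple:
    """Consume in order, moving a message to the DLQ once it has failed MAX_ATTEMPTS times."""
    queue = deque(Envelope(b) for b in bodies)
    processed = []
    dlq = []
    while queue:
        env = queue[0]
        env.attempts += 1
        try:
            process(env.body)
        except ValueError:
            if env.attempts >= MAX_ATTEMPTS:
                dlq.append(queue.popleft())  # preserved for inspection, not dropped
            continue  # otherwise retry the same message
        queue.popleft()
        processed.append(env.body)
    return processed, dlq


processed, dlq = run_with_dlq(["1", "2", "oops", "3", "4"])
```

Here the good messages all complete, and the DLQ ends up holding a single envelope for `"oops"` with `attempts == 3`.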
Most teams stop here. Tick the DLQ box on the architecture review. Go home.
The fix (complete version)
A DLQ without a replay path is a graveyard.
You set up the DLQ → messages land in it → you fix the bug that caused them → and then what?
If there’s no tooling to push those messages back into the main queue, you have three choices:
- Write a one-off script under time pressure.
- Replay them by hand.
- Admit they’re lost.
Most teams quietly pick the third. The replay path is the whole point of having a DLQ. Everything else is just setup.
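What a replay path might look like, sketched against an in-memory DLQ. Everything here is an assumption for illustration: the `DeadLetter` shape, the `publish` callback (standing in for whatever sends to your main queue), and the `reason` / `limit` / `dry_run` options.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DeadLetter:
    message_id: str
    body: str
    reason: str  # e.g. the exception class recorded when the message was dead-lettered


def replay(
    dlq: List[DeadLetter],
    publish: Callable[[str, str], None],
    reason: Optional[str] = None,
    limit: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Push matching dead letters back to the main queue; return the IDs selected."""
    selected = []
    for dead in list(dlq):  # iterate over a copy because we mutate dlq below
        if reason is not None and dead.reason != reason:
            continue
        if limit is not None and len(selected) >= limit:
            break
        if not dry_run:
            publish(dead.message_id, dead.body)
            dlq.remove(dead)  # remove only after the publish call returned
        selected.append(dead.message_id)
    return selected


main_queue = []
dlq = [
    DeadLetter("m1", "42", "SchemaError"),
    DeadLetter("m2", "x", "Timeout"),
    DeadLetter("m3", "7", "SchemaError"),
]
# The schema bug is fixed, so replay only the messages that failed for that reason.
moved = replay(dlq, lambda mid, body: main_queue.append((mid, body)), reason="SchemaError")
```

Removing a message from the DLQ only after `publish` returns means a crash mid-replay can duplicate a message but not lose one, which is why the consumers need to be idempotent.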
The full DLQ checklist
- Cap retries: usually 3 to 5 attempts before the message moves to the DLQ.
- Replay tooling: a CLI, admin UI, or cron job that can push a single message (or a batch, filtered by failure reason) back into the main queue.
- Idempotent consumers: see idempotency. Replay re-delivers messages that may already have been partially processed, so treat it as a guaranteed source of duplicates.
- DLQ depth alerting: page when depth exceeds a threshold, not the moment the first message lands.
- Failure metadata: attach the failure reason, stack trace, retry count, and original ingress timestamp to every DLQ message. Future you will thank present you.
- Retention period: long enough that you have time to debug and replay before the message expires. This is not the same as the visibility timeout, which only hides an in-flight message from other consumers while one of them works on it.
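Two of the checklist items, failure metadata and idempotent consumers, can be sketched as follows. The field names, `to_dead_letter`, and `IdempotentConsumer` are hypothetical, and the in-memory `set` stands in for a durable store that a real deployment would need.

```python
import json
import traceback
from datetime import datetime, timezone


def to_dead_letter(body: str, exc: BaseException, retry_count: int, ingress_ts: str) -> str:
    """Wrap a failed message with the context needed to debug and replay it later."""
    record = {
        "body": body,
        "failure_reason": f"{type(exc).__name__}: {exc}",
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "retry_count": retry_count,
        "original_ingress_ts": ingress_ts,
        "dead_lettered_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)


class IdempotentConsumer:
    """Skips message IDs that were already handled successfully."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: replace with a durable store in production

    def handle(self, message_id: str, body: str) -> bool:
        if message_id in self._seen:
            return False  # duplicate, e.g. from a DLQ replay
        self._handler(body)
        self._seen.add(message_id)  # record only after the handler succeeded
        return True


try:
    int("oops")
except ValueError as exc:
    dead = json.loads(
        to_dead_letter("oops", exc, retry_count=3, ingress_ts="2024-01-01T00:00:00+00:00")
    )

results = []
consumer = IdempotentConsumer(results.append)
first = consumer.handle("m1", "42")
second = consumer.handle("m1", "42")  # replayed duplicate is ignored
```

Recording the ID only after the handler succeeds keeps a failed attempt retryable; the trade-off is that a crash between the handler and the bookkeeping can still produce one duplicate.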
Where it lives
| Broker | DLQ support |
|---|---|
| sqs | First-class. Set a redrive policy (maxReceiveCount plus the DLQ's ARN) on the source queue and AWS handles routing. |
| rabbitmq | Via the x-dead-letter-exchange argument (or an equivalent policy) on the source queue. |
| kafka | Application-level. Your consumer code writes failed messages to a *.dlq topic. Less out-of-the-box than SQS or RabbitMQ. |
| bullmq | Built-in “failed” set plus manual replay APIs. |
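For a feel of what the broker-side configuration looks like, here is a small sketch that only builds the settings as plain data and does not call any client library. The helper names, the ARN, and the exchange name are placeholders; `RedrivePolicy`, `deadLetterTargetArn`, `maxReceiveCount`, `x-dead-letter-exchange`, and `x-dead-letter-routing-key` are the brokers' own setting names.

```python
import json
from typing import Optional


def sqs_source_queue_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Attributes for the SQS *source* queue; SQS moves messages once the count is exceeded."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


def rabbitmq_queue_arguments(dead_letter_exchange: str, routing_key: Optional[str] = None) -> dict:
    """Arguments to pass when declaring the RabbitMQ source queue."""
    arguments = {"x-dead-letter-exchange": dead_letter_exchange}
    if routing_key is not None:
        arguments["x-dead-letter-routing-key"] = routing_key
    return arguments


def kafka_dlq_topic(source_topic: str) -> str:
    """Naming convention only; your consumer still has to produce to this topic itself."""
    return f"{source_topic}.dlq"


# Placeholder identifiers for illustration.
sqs_attrs = sqs_source_queue_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5)
rabbit_args = rabbitmq_queue_arguments("orders.dlx", routing_key="orders.dead")
kafka_topic = kafka_dlq_topic("orders")
```

These dictionaries would be handed to whatever client you use when creating or declaring the source queue; the Kafka helper is just a naming convention, since the routing logic lives in your consumer.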
See also
- idempotency — required for safe replay
- delivery-guarantees — the broader retry picture
- back-pressure — what happens when retries cause overload
- bullmq — concrete replay pattern in Node.js