Dead Letter Queue (DLQ)

A separate holding area for messages that have failed processing more than N times. Without one, a single malformed message (“poison message”) can freeze an entire pipeline. With one — but without a replay path — your DLQ is a graveyard.

The poison message

One malformed message arrives. The consumer throws. The message goes back to the queue. Gets retried. Throws again. Retried again. And meanwhile, every message behind it waits. On an ordered stream such as a Kafka partition, the consumer cannot move past the failing offset, so the entire partition is frozen.

One poison message can take down a pipeline handling millions of good messages per minute. The industry term is — fittingly — a “poison message” or “poison pill.”

The fix (incomplete version)

After a configured number of attempts, the bad message gets moved to a separate holding area — the DLQ. The main pipeline resumes. The poison message is preserved for inspection. Not retried into oblivion, not silently dropped.
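
A minimal in-memory sketch of that retry cap, not a production consumer. The names `MAX_ATTEMPTS`, `handle`, and `drain`, and the message shape, are assumptions made up for illustration; a real broker tracks the attempt count for you or exposes it as a header.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry cap for this sketch


def handle(message: dict) -> None:
    """Hypothetical handler: rejects any message without an integer 'amount'."""
    if not isinstance(message.get("amount"), int):
        raise ValueError("amount must be an integer")


def drain(queue: deque, dlq: list, processed: list) -> None:
    """Process messages in order, parking repeat failures in the DLQ."""
    while queue:
        message = queue.popleft()
        attempts = message.get("attempts", 0) + 1
        try:
            handle(message)
        except Exception as exc:
            if attempts >= MAX_ATTEMPTS:
                # Retry budget spent: preserve the message and move on.
                dlq.append({**message, "attempts": attempts, "reason": str(exc)})
            else:
                # Put it back at the head, as an ordered queue would, and retry.
                queue.appendleft({**message, "attempts": attempts})
            continue
        processed.append(message["id"])


queue = deque([
    {"id": 1, "amount": 10},
    {"id": 2, "amount": "ten"},  # the poison message
    {"id": 3, "amount": 30},
])
dlq, processed = [], []
drain(queue, dlq, processed)
```

Without the `attempts >= MAX_ATTEMPTS` branch, message 2 would be retried forever and message 3 would never run. With it, messages 1 and 3 are processed and message 2 sits in `dlq` with its failure reason attached.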

Most teams stop here. Tick the DLQ box on the architecture review. Go home.

The fix (complete version)

A DLQ without a replay path is a graveyard.

You set up the DLQ → messages land in it → you fix the bug that caused them → and then what?

If there’s no tooling to push those messages back into the main queue, you have three choices:

1. Write a one-off script under time pressure.
2. Replay them by hand.
3. Admit they’re lost.

Most teams quietly pick option 3. The replay path is the whole point of having a DLQ. Everything else is just setup.
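
A replay path does not have to be elaborate. The sketch below assumes each DLQ entry is a dict with `payload` and `reason` keys and that `publish` is whatever function puts a message on the main queue; both are hypothetical stand-ins for your broker's client, and the lists stand in for real queues.

```python
from typing import Callable, Optional


def replay(dlq: list, publish: Callable[[dict], None],
           reason: Optional[str] = None, limit: Optional[int] = None) -> int:
    """Re-publish matching DLQ entries to the main queue; return the count."""
    replayed = 0
    for entry in list(dlq):  # iterate over a copy because dlq is mutated below
        if limit is not None and replayed >= limit:
            break
        if reason is not None and entry["reason"] != reason:
            continue
        publish(entry["payload"])  # publish first...
        dlq.remove(entry)          # ...then delete, so a crash duplicates rather than loses
        replayed += 1
    return replayed


main_queue: list = []
dead_letters = [
    {"payload": {"id": 2}, "reason": "schema_error"},
    {"payload": {"id": 7}, "reason": "timeout"},
    {"payload": {"id": 9}, "reason": "schema_error"},
]
count = replay(dead_letters, main_queue.append, reason="schema_error")
```

Filtering by reason lets you replay only the messages affected by the bug you just fixed and leave the rest in place. Publishing before deleting trades possible duplicates for no lost messages, which is why replay and idempotent consumers go together.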

The full DLQ checklist

Cap retries: usually 3 to 5 attempts before the message moves to the DLQ.
Replay tooling: a CLI, admin UI, or cron job that can push a single message (or a batch, filtered by reason) back into the main queue.
Idempotent consumers: see idempotency. DLQ replay is a guaranteed source of duplicates.
DLQ depth alerting: page when depth exceeds a threshold, not on the first message that lands.
Failure metadata: attach the failure reason, stack trace, retry count, and original ingress timestamp to every DLQ message. Future you will thank present you.
Retention period: long enough that you have time to debug before the message expires. (Visibility timeout is a different setting: it only hides an in-flight message from other consumers.)
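
Two of these items are easier to see in code: the failure metadata and the idempotent consumer. This is a sketch under assumptions, not a reference design. `DeadLetter`, `to_dead_letter`, and `IdempotentConsumer` are hypothetical names, and the in-memory `seen` set stands in for a durable store such as a database table keyed by message ID.

```python
import traceback
from dataclasses import dataclass


@dataclass
class DeadLetter:
    """Hypothetical DLQ envelope carrying the checklist's failure metadata."""
    message_id: str
    payload: dict
    reason: str
    stack_trace: str
    retry_count: int
    ingress_ts: float  # when the message first entered the pipeline


def to_dead_letter(message_id: str, payload: dict, exc: BaseException,
                   retry_count: int, ingress_ts: float) -> DeadLetter:
    """Wrap a failed message with everything needed to debug it later."""
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return DeadLetter(message_id, payload, f"{type(exc).__name__}: {exc}",
                      trace, retry_count, ingress_ts)


class IdempotentConsumer:
    """Applies each message ID at most once, so replays are harmless."""

    def __init__(self) -> None:
        self.seen: set = set()  # assumption: a durable store in real systems
        self.total = 0

    def consume(self, message_id: str, payload: dict) -> bool:
        if message_id in self.seen:
            return False  # duplicate from a replay: skip the side effect
        self.total += payload["amount"]
        self.seen.add(message_id)
        return True


try:
    int("ten")  # stand-in for a handler choking on a malformed field
except ValueError as exc:
    letter = to_dead_letter("m-2", {"amount": "ten"}, exc,
                            retry_count=3, ingress_ts=1700000000.0)

consumer = IdempotentConsumer()
first = consumer.consume("m-1", {"amount": 10})
second = consumer.consume("m-1", {"amount": 10})  # replayed duplicate
```

The second `consume` call returns `False` and leaves `total` unchanged, which is the property that makes replaying from the DLQ safe.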

Where it lives

Broker	DLQ support
Amazon SQS	First-class: configure a redrive policy on the source queue and AWS handles routing.
RabbitMQ	Via the `x-dead-letter-exchange` argument on the source queue.
Kafka	Application-level: your consumer code writes failed messages to a `*.dlq` topic. Less out of the box than SQS or RabbitMQ.
BullMQ	Built-in “failed” set plus manual replay APIs.
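
For the two brokers configured on the source queue, the settings look roughly like this. The ARN, exchange name, and routing key are placeholders invented for the example, and the block only builds the values; it does not call AWS or RabbitMQ.

```python
import json

# Hypothetical ARN of the queue that will act as the DLQ.
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"

# SQS: the source queue's RedrivePolicy attribute is a JSON string.
sqs_attributes = {
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": DLQ_ARN,
        "maxReceiveCount": 5,  # receives allowed before SQS moves the message
    })
}

# RabbitMQ: optional arguments supplied when declaring the source queue.
rabbitmq_arguments = {
    "x-dead-letter-exchange": "orders.dlx",       # hypothetical exchange name
    "x-dead-letter-routing-key": "orders.dead",   # optional override
}
```

With boto3 the first dict would typically be passed as `Attributes` when creating the queue or setting its attributes, and with a client such as pika the second would be passed as the `arguments` of the queue declaration.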