Dead Letter Queue (DLQ)

Quarantine for poison messages so they don't take down healthy traffic.

Why it matters

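A DLQ is an ordinary SQS queue that receives messages a consumer has failed to process after a configured number of receives (maxReceiveCount). Moving these poison-pill messages out of the main queue stops consumers from retrying them over and over, so capacity stays available for healthy messages.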

What it is

A dead-letter queue is an SQS queue that other queues (source queues) can target for messages that are not processed successfully. DLQs are useful for debugging because you can isolate unconsumed messages and determine why processing failed.

  • Keep the source queue and its DLQ in the same AWS account and Region for optimal performance.
  • FIFO source queues require a FIFO DLQ; standard source queues require a standard DLQ.
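
The pairing rules above can be checked before attaching a DLQ. The following is a minimal sketch, not part of any AWS SDK: `parseSqsArn` and `checkDlqPairing` are hypothetical helper names, and the check assumes that FIFO queue names end in `.fifo` and that SQS ARNs have the form `arn:<partition>:sqs:<region>:<account-id>:<queue-name>`.

```javascript
// Split an SQS ARN (arn:<partition>:sqs:<region>:<account-id>:<queue-name>) into its parts.
function parseSqsArn(arn) {
  const parts = arn.split(":");
  if (parts.length !== 6 || parts[0] !== "arn" || parts[2] !== "sqs") {
    throw new Error(`Not an SQS ARN: ${arn}`);
  }
  const [, , , region, accountId, queueName] = parts;
  // FIFO queue names carry a ".fifo" suffix.
  return { region, accountId, queueName, isFifo: queueName.endsWith(".fifo") };
}

// Hypothetical helper: list the ways a source/DLQ pairing departs from the guidance above.
function checkDlqPairing(sourceArn, dlqArn) {
  const source = parseSqsArn(sourceArn);
  const dlq = parseSqsArn(dlqArn);
  const problems = [];
  if (source.region !== dlq.region) problems.push("different Region");
  if (source.accountId !== dlq.accountId) problems.push("different account");
  if (source.isFifo !== dlq.isFifo) problems.push("queue types differ (FIFO vs standard)");
  return problems;
}

// A FIFO source queue paired with a standard DLQ reports the type mismatch.
console.log(checkDlqPairing(
  "arn:aws:sqs:us-east-1:123456789012:orders.fifo",
  "arn:aws:sqs:us-east-1:123456789012:orders-dlq"
));
```

An empty result means the pairing passes these three checks. It does not confirm that either queue exists.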

Setup

Create a second queue, then attach it to the source queue using a redrive policy that specifies the DLQ ARN and maxReceiveCount. Optionally configure a redrive allow policy on the DLQ to control which source queues may use it.

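// Assumes the AWS SDK for JavaScript v3 in an ES module (top-level await);
// sourceQueueUrl and dlqArn are defined elsewhere.
import { SQSClient, SetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

await sqs.send(new SetQueueAttributesCommand({
  QueueUrl: sourceQueueUrl,
  Attributes: {
    // Attribute values are strings, so the policy is JSON-encoded.
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: 5 // moved to the DLQ once the receive count exceeds 5
    })
  }
}));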
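
The optional redrive allow policy is set on the DLQ itself rather than on the source queue. As a sketch, the helper below builds the attribute payload for that call. `buildRedriveAllowAttributes` is a hypothetical name. The `RedriveAllowPolicy` attribute, its `redrivePermission: "byQueue"` value, and the limit of 10 source queue ARNs reflect the SQS API as I understand it and should be checked against the current API reference.

```javascript
// Hypothetical helper: build the Attributes object for a SetQueueAttributes call on the DLQ.
function buildRedriveAllowAttributes(sourceQueueArns) {
  // Assumption: "byQueue" accepts between 1 and 10 source queue ARNs.
  if (sourceQueueArns.length === 0 || sourceQueueArns.length > 10) {
    throw new Error("byQueue expects between 1 and 10 source queue ARNs");
  }
  return {
    // Like RedrivePolicy, the value is a JSON-encoded string.
    RedriveAllowPolicy: JSON.stringify({
      redrivePermission: "byQueue",
      sourceQueueArns
    })
  };
}

const attributes = buildRedriveAllowAttributes([
  "arn:aws:sqs:us-east-1:123456789012:orders"
]);
console.log(attributes.RedriveAllowPolicy);
```

Pass the returned object as `Attributes` in a SetQueueAttributes call whose `QueueUrl` is the DLQ's URL.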

What goes in the DLQ

The DLQ receives messages that have been received more than maxReceiveCount times without being deleted. Once a message is in the DLQ, you can examine logs, analyze the message body, and check whether consumers had enough time to process it (for example, whether the visibility timeout was too short).
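
To make the receive-count rule concrete, here is a toy in-memory model. It is not the SQS API: `simulateDelivery` is a hypothetical function that hands a message to a handler repeatedly, and a thrown error stands in for "received but not deleted".

```javascript
// Toy model of a source queue with a redrive policy. Not the SQS API.
function simulateDelivery(maxReceiveCount, handler, message) {
  let receiveCount = 0;
  while (true) {
    receiveCount += 1;
    // Once the receive count exceeds maxReceiveCount, the message goes to the DLQ
    // instead of being delivered to the consumer again.
    if (receiveCount > maxReceiveCount) {
      return { outcome: "moved-to-dlq", deliveries: receiveCount - 1 };
    }
    try {
      handler(message);
      // Success: the consumer deletes the message.
      return { outcome: "deleted", deliveries: receiveCount };
    } catch {
      // Failure: the message is not deleted and becomes visible again later.
    }
  }
}

// A poison message fails on every attempt.
const poison = simulateDelivery(5, () => { throw new Error("bad payload"); }, "poison");
console.log(poison); // outcome "moved-to-dlq" after 5 deliveries

// A transient failure that clears on the third attempt never reaches the DLQ.
let attempts = 0;
const flaky = simulateDelivery(5, () => { if (++attempts < 3) throw new Error("timeout"); }, "ok");
console.log(flaky); // outcome "deleted" after 3 deliveries
```

The second case shows why maxReceiveCount should leave room for transient failures: a value of 1 would send the flaky message to the DLQ after a single failed attempt.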

Retention and age metrics

For standard queues, expiration is based on the original enqueue timestamp — moving to a DLQ does not reset it. Set DLQ retention longer than the source queue so messages aren't deleted before you can investigate.

  • Standard: ApproximateAgeOfOldestMessage in DLQ reflects time since original send.
  • FIFO: enqueue timestamp resets when moved to DLQ — age reflects time in DLQ.
  • Retention range: 60 seconds to 14 days (default 4 days).
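
The arithmetic behind these rules is simple to sketch. `remainingDlqLifetime` below is a hypothetical helper: it assumes you know how old the message was when it moved to the DLQ, and applies the standard and FIFO behavior described above.

```javascript
const DAY = 24 * 60 * 60; // seconds in a day

// Hypothetical helper: seconds a message can sit in the DLQ before it expires.
function remainingDlqLifetime({ fifo, dlqRetentionSeconds, ageAtMoveSeconds }) {
  if (dlqRetentionSeconds < 60 || dlqRetentionSeconds > 14 * DAY) {
    throw new RangeError("retention must be between 60 seconds and 14 days");
  }
  // Standard queues keep the original enqueue timestamp; FIFO queues reset it on the move.
  const alreadyElapsed = fifo ? 0 : ageAtMoveSeconds;
  return Math.max(0, dlqRetentionSeconds - alreadyElapsed);
}

// Standard queue, message moved after 3 days, DLQ left at the 4-day default: 1 day remains.
console.log(remainingDlqLifetime({ fifo: false, dlqRetentionSeconds: 4 * DAY, ageAtMoveSeconds: 3 * DAY }) / DAY); // 1

// Same message with a 14-day DLQ retention: 11 days remain.
console.log(remainingDlqLifetime({ fifo: false, dlqRetentionSeconds: 14 * DAY, ageAtMoveSeconds: 3 * DAY }) / DAY); // 11

// FIFO queue: the clock restarts, so the full 14 days remain.
console.log(remainingDlqLifetime({ fifo: true, dlqRetentionSeconds: 14 * DAY, ageAtMoveSeconds: 3 * DAY }) / DAY); // 14
```

The first case is the one to avoid: with matching 4-day retention on both queues, a message that fails late in its life can expire from the DLQ almost as soon as it arrives.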

Operational practices

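Configure a CloudWatch alarm on the DLQ's message count (the ApproximateNumberOfMessagesVisible metric). Use DLQ redrive to move messages back to the source queue after fixing the bug. Don't share one DLQ across unrelated services, because mixed failures are hard to attribute to a consumer.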

  • Examine messages with ReceiveMessage; nothing consumes from a DLQ automatically.
  • Use the StartMessageMoveTask API or the console's 'Start DLQ redrive' action to replay messages.
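
As a sketch of the two practices above, the helpers below only build request payloads; they make no AWS calls. `buildDlqAlarmParams` and `buildRedriveParams` are hypothetical names, `alarmTopicArn` is assumed to be an SNS topic you already have, and the 300-second period and `Maximum` statistic are illustrative choices rather than recommendations from the source.

```javascript
// Input for CloudWatch PutMetricAlarm: alarm as soon as any message is visible in the DLQ.
function buildDlqAlarmParams(dlqName, alarmTopicArn) {
  return {
    AlarmName: `${dlqName}-not-empty`,
    Namespace: "AWS/SQS",
    MetricName: "ApproximateNumberOfMessagesVisible",
    Dimensions: [{ Name: "QueueName", Value: dlqName }],
    Statistic: "Maximum",
    Period: 300,              // seconds; illustrative
    EvaluationPeriods: 1,
    Threshold: 0,
    ComparisonOperator: "GreaterThanThreshold",
    TreatMissingData: "notBreaching",
    AlarmActions: [alarmTopicArn]
  };
}

// Input for SQS StartMessageMoveTask: SourceArn is the DLQ being drained.
function buildRedriveParams(dlqArn, { destinationArn, maxPerSecond } = {}) {
  const params = { SourceArn: dlqArn };
  // Without DestinationArn, messages return to the queues they came from.
  if (destinationArn) params.DestinationArn = destinationArn;
  // Optional throttle so a large replay does not overwhelm consumers.
  if (maxPerSecond) params.MaxNumberOfMessagesPerSecond = maxPerSecond;
  return params;
}

console.log(buildDlqAlarmParams("orders-dlq", "arn:aws:sns:us-east-1:123456789012:oncall"));
console.log(buildRedriveParams("arn:aws:sqs:us-east-1:123456789012:orders-dlq", { maxPerSecond: 50 }));
```

With the AWS SDK for JavaScript v3, these objects would be passed to `PutMetricAlarmCommand` (from `@aws-sdk/client-cloudwatch`) and `StartMessageMoveTaskCommand` (from `@aws-sdk/client-sqs`) respectively.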

Gotchas

  • Alarm on DLQ message count: a DLQ that fills up silently is one of the most common production incidents.
  • Set DLQ retention longer than the source queue's (up to 14 days) so you have time to debug.
  • Don't use the same DLQ for many unrelated queues: debugging becomes a nightmare.
  • Avoid using a DLQ with a FIFO queue when the exact order of all operations must be preserved (e.g. video edit decision lists).