Retries & maxReceiveCount

How SQS retries a message until it succeeds — or gives up.

Why it matters

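Each time a consumer receives a message, its receive count increments. If the consumer does not delete the message before the visibility timeout expires, the message becomes visible and is delivered again. Once it has been received maxReceiveCount times without being deleted, SQS moves it to the DLQ (if one is configured).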

How retries work

SQS does not implement application-level retry logic. Instead, if a consumer does not delete a message before the visibility timeout expires (because of an error, a crash, or slow processing), the message becomes visible again and can be received by any consumer. Each ReceiveMessage call that returns the message increments its ApproximateReceiveCount.
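
The redelivery cycle described above can be illustrated with a toy in-memory model. This is only a sketch: ToyQueue and ToyMessage are hypothetical names, time is passed in explicitly as `now`, and nothing here calls the real SQS API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyMessage:
    body: str
    receive_count: int = 0   # stands in for ApproximateReceiveCount
    visible_at: float = 0.0  # earliest time the message can be received again
    deleted: bool = False


class ToyQueue:
    """Toy model of visibility timeouts (hypothetical, not the SQS API)."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self.messages: List[ToyMessage] = []

    def send(self, body: str) -> None:
        self.messages.append(ToyMessage(body))

    def receive(self, now: float) -> Optional[ToyMessage]:
        # Return the first visible, undeleted message and hide it again.
        for msg in self.messages:
            if not msg.deleted and msg.visible_at <= now:
                msg.receive_count += 1
                msg.visible_at = now + self.visibility_timeout
                return msg
        return None

    def delete(self, msg: ToyMessage) -> None:
        msg.deleted = True


queue = ToyQueue(visibility_timeout=30)
queue.send("order-42")
first = queue.receive(now=0)    # consumer crashes and never deletes
hidden = queue.receive(now=10)  # None: still inside the visibility timeout
second = queue.receive(now=30)  # timeout expired, same message is redelivered
```

In this model the "retry" is nothing more than the message becoming visible again; the consumer that receives it the second time sees a receive count of 2.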

maxReceiveCount

maxReceiveCount is set in the source queue's redrive policy, alongside the ARN of the dead-letter queue. A message can be received up to maxReceiveCount times; when a further receive would push ApproximateReceiveCount past maxReceiveCount, SQS moves the message to the dead-letter queue instead of delivering it again.

  • Set high enough to allow transient failures (typically 3–5).
  • A maxReceiveCount of 1 sends a message to the DLQ after a single failed attempt.

Example redrive policy on the source queue:

{
  "deadLetterTargetArn": "arn:aws:sqs:region:account:my-dlq",
  "maxReceiveCount": 5
}
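
As a rough sketch of how such a policy might be applied from code: SQS expects the redrive policy as a JSON string under the RedrivePolicy queue attribute. The helper name redrive_policy_attributes and the ARN below (region and account ID included) are placeholders chosen for illustration.

```python
import json


def redrive_policy_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Build the Attributes mapping for a source queue's redrive policy."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # The policy itself is serialized to a JSON string.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN; substitute the ARN of your own dead-letter queue.
attributes = redrive_policy_attributes(
    "arn:aws:sqs:us-east-1:123456789012:my-dlq", 5
)

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("sqs").set_queue_attributes(
#       QueueUrl=source_queue_url, Attributes=attributes)
```

Keeping the policy construction in a pure function makes the maxReceiveCount value easy to validate and test before it reaches a real queue.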

Standard queue receive-count behavior

For standard queues with maxReceiveCount greater than 3, if a message is received 3 or more times without being deleted, SQS may move it to the back of the queue. The ApproximateAgeOfOldestMessage metric then reflects the age of the next message that hasn't exceeded this threshold.

Backoff strategies

SQS does not provide built-in exponential backoff. To implement backoff, use ChangeMessageVisibility on each retry to delay when the message becomes visible again.

  • Increase visibility timeout after each failure (e.g. 30s → 60s → 120s).
  • Combine with DLQ for messages that exceed maxReceiveCount.
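
One way to sketch this backoff, assuming the consumer requested the ApproximateReceiveCount attribute when receiving the message. The function names, the 30-second base, and the 900-second cap are illustrative choices; the cap only needs to stay below the 12-hour maximum visibility timeout. The client is expected to behave like a boto3 SQS client.

```python
def backoff_seconds(receive_count: int, base: int = 30, cap: int = 900) -> int:
    """Double the delay on every receive (30s, 60s, 120s, ...) up to cap."""
    attempts = max(receive_count, 1)
    return min(base * 2 ** (attempts - 1), cap)


def delay_retry(sqs_client, queue_url: str, message: dict) -> int:
    """After a processing failure, push back when the message reappears.

    `message` is one entry from a ReceiveMessage response that includes
    the ApproximateReceiveCount attribute (returned as a string).
    """
    receive_count = int(message["Attributes"]["ApproximateReceiveCount"])
    timeout = backoff_seconds(receive_count)
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=timeout,
    )
    return timeout
```

Because the delay is derived from the receive count that SQS already tracks, the consumer does not need to store any retry state of its own.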

Monitoring

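When debugging, request the ApproximateReceiveCount attribute in ReceiveMessage to see how many times an individual message has been delivered. Alarm on DLQ depth (the DLQ's ApproximateNumberOfMessagesVisible metric) and on ApproximateAgeOfOldestMessage on the source queue to detect retry storms early.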
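
The two alarms could be described along these lines. This is a sketch that only builds the parameters for CloudWatch's put_metric_alarm call; the helper name sqs_alarm, the queue names, the alarm names, and the thresholds and periods are all assumptions to adjust for a real workload.

```python
def sqs_alarm(queue_name: str, metric_name: str, threshold: float,
              alarm_suffix: str) -> dict:
    """Build keyword arguments for a CloudWatch alarm on one SQS metric."""
    return {
        "AlarmName": f"{queue_name}-{alarm_suffix}",
        "Namespace": "AWS/SQS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # seconds per datapoint (illustrative)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Fire as soon as anything lands in the DLQ (hypothetical queue name).
dlq_alarm = sqs_alarm("my-dlq", "ApproximateNumberOfMessagesVisible", 0,
                      "not-empty")

# Fire when the oldest message on the source queue is older than 10 minutes.
age_alarm = sqs_alarm("my-queue", "ApproximateAgeOfOldestMessage", 600,
                      "oldest-message-too-old")

# With boto3 installed and credentials configured, each could be created as:
#   boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm)
```

A threshold of 0 on the DLQ treats every dead-lettered message as worth a look; a noisier system may prefer a higher value.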

Gotchas

  • maxReceiveCount of 1 is almost never what you want: a single network blip will dead-letter healthy messages.
  • Typical production values are 3–5 for transient failures, higher for long-running jobs.
  • Moving a message to the DLQ does not fix the root cause. Investigate why processing failed before redriving.