Fixing AWS SQS Message Visibility Timeouts That Cause Duplicate Processing

June 06, 2026

Your SQS consumer is processing the same message twice, and now you have duplicate records in your database, duplicate emails in your users' inboxes, or duplicate charges on a payment ledger. The queue looks fine on the surface, but something keeps re-delivering messages your worker already handled. Nine times out of ten, the culprit is a misconfigured visibility timeout.

This article walks you through exactly what visibility timeouts are, why they go wrong, and how to configure your way out of duplicate processing for good.

What you'll learn

  • How SQS visibility timeouts work and why they cause duplicates
  • How to calculate a safe timeout value for your workload
  • How to extend the timeout mid-processing using ChangeMessageVisibility
  • How to use dead-letter queues as a safety net
  • Common configuration mistakes and how to avoid them

How Visibility Timeouts Actually Work

When a consumer calls ReceiveMessage, SQS hides that message from every other consumer for a fixed window of time. That window is the visibility timeout. The assumption is that your consumer will finish processing and call DeleteMessage before the window closes.

If the window closes before you delete the message, SQS assumes your consumer crashed or stalled. It makes the message visible again so another consumer can pick it up. This is the mechanism that causes duplicates: two consumers end up working on the same message, or the same consumer processes it a second time after a restart.

The default visibility timeout is 30 seconds. That is fine for a simple job that runs in milliseconds, but it is dangerously short for anything that calls an external API, writes to a database under load, or does any meaningful computation.
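To make the mechanism concrete, here is a toy in-memory model of a single message. The ToyMessage class and its methods are invented for illustration and are not the SQS API. Time is passed in explicitly so the sketch runs without waiting.

```python
class ToyMessage:
    """In-memory stand-in for one queued message; NOT the real SQS API."""

    def __init__(self, body, visibility_timeout):
        self.body = body
        self.visibility_timeout = visibility_timeout
        self.invisible_until = 0.0  # visible whenever now >= this value
        self.deleted = False
        self.receive_count = 0

    def receive(self, now):
        """Return the body if the message is visible at time `now`, else None."""
        if self.deleted or now < self.invisible_until:
            return None
        # Receiving hides the message for one visibility window.
        self.invisible_until = now + self.visibility_timeout
        self.receive_count += 1
        return self.body

    def delete(self):
        self.deleted = True


message = ToyMessage("order-42", visibility_timeout=30)

first = message.receive(now=0)    # consumer A receives at t=0
hidden = message.receive(now=10)  # consumer B polls at t=10 and gets nothing
# Consumer A is still working at t=45 and has not deleted the message.
second = message.receive(now=45)  # the window closed at t=30, so B receives it
message.delete()                  # A finishes late; B is already processing

print(first, hidden, second, message.receive_count)
```

Both consumers end up holding the same body, which is the duplicate the rest of this article is about.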

Diagnosing Whether Timeouts Are Your Problem

Before you start tuning, confirm the root cause. Open the AWS Console and navigate to your SQS queue. Check the ApproximateNumberOfMessagesNotVisible metric in CloudWatch. A steadily climbing value means messages are being received but not deleted; they are sitting in the invisible state, waiting to time out and re-appear.

You can also compare NumberOfMessagesReceived with NumberOfMessagesDeleted over the same time window. If receives consistently outnumber deletes, some messages are being delivered more than once. Cross-reference that with your application logs. If you see the same message ID appearing in your logs more than once, a visibility timeout expiring mid-processing is the most likely cause.
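The log check can be scripted. The sketch below assumes a hypothetical log format in which each processing attempt writes a line containing message_id=<id>; adjust the pattern to whatever your consumer actually logs.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "... message_id=<hex-and-dash id> ..."
MESSAGE_ID_PATTERN = re.compile(r"message_id=([0-9a-fA-F-]+)")


def find_duplicate_message_ids(log_lines):
    """Return {message_id: count} for every ID logged more than once."""
    counts = Counter()
    for line in log_lines:
        match = MESSAGE_ID_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {message_id: n for message_id, n in counts.items() if n > 1}


sample_logs = [
    "2026-06-06T10:00:01 INFO processing message_id=a1b2c3d4-0001",
    "2026-06-06T10:00:02 INFO processing message_id=a1b2c3d4-0002",
    "2026-06-06T10:00:40 INFO processing message_id=a1b2c3d4-0001",
    "2026-06-06T10:00:41 INFO heartbeat ok",
]

duplicates = find_duplicate_message_ids(sample_logs)
print(duplicates)
```

If the gap between the repeated timestamps is close to your visibility timeout, that is a strong hint the window expired while the first attempt was still running.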

Calculating the Right Timeout Value

The right timeout is the maximum realistic time your consumer needs to process a single message, plus a comfortable buffer. Do not use the average processing time. Use the 99th percentile, or better yet, the worst case you have ever observed in production.

A simple formula: timeout = (P99 processing time) × 1.5. The 1.5 multiplier absorbs GC pauses, slow database queries, and brief network hiccups without being so generous that a genuinely stuck consumer holds up the queue for a long time.
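As a rough sketch, the formula can be applied to processing durations you have already measured. The function name and the sample data are made up for illustration, and the percentile uses the simple nearest-rank method.

```python
import math

MAX_VISIBILITY_TIMEOUT = 43200  # SQS upper limit: 12 hours, in seconds


def suggest_visibility_timeout(durations_seconds, percentile=99, multiplier=1.5):
    """Return ceil(percentile-th duration * multiplier), capped at the SQS maximum."""
    if not durations_seconds:
        raise ValueError("need at least one duration sample")
    ordered = sorted(durations_seconds)
    # Nearest-rank percentile using integer arithmetic: ceil(percentile * n / 100).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    p_value = ordered[rank - 1]
    return min(math.ceil(p_value * multiplier), MAX_VISIBILITY_TIMEOUT)


# 100 illustrative samples: mostly fast, with a slow tail.
samples = [2.1] * 95 + [8.0, 9.5, 12.0, 40.0, 80.0]
suggested = suggest_visibility_timeout(samples)
print(suggested)
```

Note that the single 80-second outlier sits above the P99 and would still time out at this setting, which is why the article suggests considering the worst observed case as well.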

You can set the visibility timeout at the queue level in the Console, via the CLI, or in infrastructure-as-code. Here is how to set it with the AWS CLI:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes VisibilityTimeout=120

That sets a 120-second timeout on the queue. Any ReceiveMessage call that does not pass an explicit override will inherit this value.

Extending the Timeout Mid-Processing

A fixed timeout is a bet that your processing will always finish within that window. For unpredictable workloads, that bet loses eventually. The safer approach is to extend the timeout while your consumer is still working, using the ChangeMessageVisibility API.

The pattern is: receive the message, start a background heartbeat that periodically calls ChangeMessageVisibility, and cancel the heartbeat once you call DeleteMessage. Here is a Python example using boto3:

import boto3
import threading
import time

sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'
EXTENSION_SECONDS = 30
HEARTBEAT_INTERVAL = 20  # extend before the current window expires

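def extend_visibility(receipt_handle, stop_event):
    # Re-arm the visibility window every HEARTBEAT_INTERVAL seconds.
    # Event.wait() returns False on timeout and True once stop_event is set,
    # so the loop ends as soon as the worker signals that it is done.
    while not stop_event.wait(HEARTBEAT_INTERVAL):
        try:
            sqs.change_message_visibility(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=EXTENSION_SECONDS
            )
        except Exception as e:
            # Usually means the message is no longer in flight; stop extending.
            print(f'Heartbeat failed: {e}')
            break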

def process_message(message):
    receipt_handle = message['ReceiptHandle']
    stop_event = threading.Event()

    heartbeat = threading.Thread(
        target=extend_visibility,
        args=(receipt_handle, stop_event),
        daemon=True
    )
    heartbeat.start()

    try:
        # Your actual processing logic goes here
        do_work(message['Body'])

        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=receipt_handle
        )
    finally:
        stop_event.set()

def do_work(body):
    # Simulate slow processing
    time.sleep(45)
    print(f'Processed: {body}')

# Receive and process
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
for msg in response.get('Messages', []):
    process_message(msg)

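The key detail here is that HEARTBEAT_INTERVAL must be less than EXTENSION_SECONDS. If your heartbeat fires every 20 seconds and resets the window to 30 seconds, there is roughly 10 seconds of buffer (less any API latency) before the message becomes visible again. The first heartbeat also has to fire before the initial timeout expires, so HEARTBEAT_INTERVAL must be shorter than the queue-level or per-receive timeout too.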

Making Your Consumers Idempotent

Visibility timeout tuning reduces duplicates but does not eliminate them entirely. SQS is an at-least-once delivery system by design. Network partitions, consumer crashes at exactly the wrong moment, and AWS-side edge cases mean a message can still be delivered more than once, no matter how carefully you tune.

The production-grade solution is to make your consumer idempotent. Processing the same message twice should produce the same outcome as processing it once. Common approaches include:

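  • Deduplication keys: Store the message ID in a fast store like Redis or a database table with a unique constraint. Claim the ID with a single atomic write (an insert that fails if the ID already exists) rather than a separate check followed by a write, because two consumers can both pass the check at the same time. If the ID is already present, skip the work and delete the message.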
  • Conditional writes: Use database-level constraints (unique indexes, conditional updates) so a duplicate insert fails gracefully rather than creating a duplicate record.
  • Idempotent external calls: Many APIs support idempotency keys in their request headers. Pass the SQS message ID as the idempotency key when calling payment processors or notification services.
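Here is a minimal sketch of the deduplication-key approach using a unique constraint. It uses an in-memory SQLite database purely so the example is self-contained; the table name, the process_once helper, and the handler callback are all illustrative assumptions, not part of any SQS library.

```python
import sqlite3


def make_store():
    """Create an in-memory store with one row per processed message ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
    return conn


def process_once(conn, message_id, body, handler):
    """Run handler(body) at most once per message_id. Returns True if it ran."""
    try:
        with conn:  # commits on success, rolls back if anything below raises
            # The primary key makes this insert fail for an already-seen ID.
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,),
            )
            handler(body)
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: skip the work, then delete the message
    return True


conn = make_store()
handled = []
first_run = process_once(conn, "msg-1", "hello", handled.append)
second_run = process_once(conn, "msg-1", "hello", handled.append)
print(first_run, second_run, handled)
```

Because the insert and the handler share one transaction, a handler that raises rolls the claim back and the message can be retried. That only protects work done in the same database; side effects such as emails or external API calls still need their own idempotency keys.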

Using Dead-Letter Queues as a Safety Net

A dead-letter queue (DLQ) catches messages that fail processing repeatedly. You configure a maxReceiveCount on your source queue's redrive policy: if a message is received more than that many times without being deleted, SQS moves it to the DLQ automatically.

This prevents a poison message from cycling through your queue indefinitely. Set up a CloudWatch alarm on the DLQ's ApproximateNumberOfMessagesVisible metric so you get alerted when anything lands there. Here is how to configure the redrive policy with the CLI:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-queue-dlq\",\"maxReceiveCount\":\"5\"}"
  }'

A maxReceiveCount of 5 means a message can be received five times. If it still has not been deleted, SQS moves it to the DLQ instead of delivering it a sixth time. Tune this number based on how many transient retries make sense for your workload.
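If you configure queues from Python rather than the CLI, the same attribute can be built with json.dumps instead of hand-escaped quotes. This is a sketch: build_redrive_attributes is a hypothetical helper, and the ARN is the placeholder used in the CLI example.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes dict for a queue's redrive policy."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy itself as a JSON string inside the attribute map.
    return {"RedrivePolicy": json.dumps(policy)}


attributes = build_redrive_attributes(
    "arn:aws:sqs:us-east-1:123456789012:my-queue-dlq"
)
print(attributes)
```

With boto3, the result would be passed as sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes=attributes), where sqs and QUEUE_URL are the client and queue URL from the heartbeat example.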

FIFO Queues and Exactly-Once Processing

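If your use case absolutely cannot tolerate duplicates even during edge cases, consider switching to an SQS FIFO queue. FIFO queues support content-based deduplication and have a built-in deduplication window of 5 minutes: if you send two messages with the same MessageDeduplicationId within that window, only one is delivered. That deduplication applies to sends. A received message that is not deleted before its visibility timeout expires is still redelivered, so the timeout and heartbeat advice above applies to FIFO queues as well.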

FIFO queues come with trade-offs. Throughput is capped, ordering is enforced per message group, and they cost slightly more per API call than standard queues. Use them when message ordering and deduplication are hard requirements, not just as a fix for a misconfigured visibility timeout.
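For a sense of what sending to a FIFO queue involves, the sketch below builds the parameters for a send. build_fifo_send_params is a hypothetical helper. When no explicit ID is given, it hashes the body with SHA-256, which mirrors what content-based deduplication does on the SQS side; a business key such as an order ID is often a better choice.

```python
import hashlib


def build_fifo_send_params(queue_url, body, group_id, dedup_id=None):
    """Build keyword arguments for sending one message to a FIFO queue."""
    if dedup_id is None:
        # Assumption: fall back to a SHA-256 hash of the body as the dedup ID.
        dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": group_id,  # ordering is enforced within this group
        "MessageDeduplicationId": dedup_id,
    }


FIFO_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo"
params_a = build_fifo_send_params(FIFO_URL, '{"order": 42}', group_id="customer-7")
params_b = build_fifo_send_params(FIFO_URL, '{"order": 42}', group_id="customer-7")
print(params_a["MessageDeduplicationId"] == params_b["MessageDeduplicationId"])
```

With boto3 these would be passed as sqs.send_message(**params_a). The queue URL is a placeholder in the same style as the earlier examples; FIFO queue names end in .fifo.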

Common Pitfalls

  • Setting the timeout too high: A 12-hour visibility timeout means a crashed consumer holds a message invisible for 12 hours. Other consumers cannot retry it, and your queue depth metric looks misleadingly empty. Keep timeouts reasonable and rely on the heartbeat pattern for genuinely long jobs.
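  • Forgetting when the clock starts: The visibility clock starts ticking as soon as SQS hands the message to your consumer, not when your code begins working on it. Long polling (WaitTimeSeconds) can block for up to 20 seconds, but it returns as soon as a message is available, so the wait itself costs little. The bigger risk is batching: if you receive up to 10 messages in one call and process them one after another, the last message has already spent most of its window waiting in your worker. Factor this in when calculating your timeout.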
  • Using the queue-level timeout without overrides: The queue-level timeout is just a default. Individual ReceiveMessage calls can pass a VisibilityTimeout parameter that overrides it. If different consumers process the same queue at different speeds, set the override per call rather than relying on a single queue-level value.
  • Not handling receipt handle expiry in the heartbeat: A receipt handle becomes invalid once the visibility timeout expires and the message becomes visible to other consumers. At that point, ChangeMessageVisibility will return an error. Catch that error in your heartbeat thread, log it, and treat it as a sign that another consumer may now hold the message, rather than letting the thread die silently.
  • Skipping the DLQ on development queues: Development queues feel low-stakes until a buggy deploy sends a malformed message into an infinite retry loop. Configure a DLQ everywhere.
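The per-call override mentioned in the list above can be sketched as follows. receive_batch is a hypothetical helper that takes a boto3 SQS client as an argument and sizes the timeout for a worker that processes a whole batch sequentially; the per-message budget is something you would measure yourself.

```python
MAX_VISIBILITY_TIMEOUT = 43200  # SQS upper limit: 12 hours, in seconds


def receive_batch(sqs_client, queue_url, per_message_seconds,
                  batch_size=10, wait_seconds=20):
    """Receive up to batch_size messages with a timeout sized for the batch."""
    # Every message in the batch starts its clock at receive time, so a
    # sequential worker needs enough time for all of them, not just one.
    timeout = min(per_message_seconds * batch_size, MAX_VISIBILITY_TIMEOUT)
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=batch_size,  # SQS allows 1 to 10
        WaitTimeSeconds=wait_seconds,    # long polling, up to 20 seconds
        VisibilityTimeout=timeout,       # overrides the queue-level default
    )
    return response.get("Messages", [])
```

A call such as receive_batch(sqs, QUEUE_URL, per_message_seconds=12, batch_size=5) would request a 60-second window for that receive only, leaving the queue-level default untouched for other consumers.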

Wrapping Up

Duplicate processing from SQS is a solved problem once you understand what is actually happening under the hood. Start here:

  1. Check CloudWatch for ApproximateNumberOfMessagesNotVisible climbing unexpectedly and confirm duplicate message IDs in your logs.
  2. Set a queue-level visibility timeout based on your P99 processing time multiplied by 1.5.
  3. Implement a heartbeat using ChangeMessageVisibility for any consumer whose processing time is variable or can exceed a few seconds.
  4. Add idempotency to your consumer logic using deduplication keys or conditional database writes, so accidental duplicates are harmless.
  5. Configure a dead-letter queue with a redrive policy and set a CloudWatch alarm on it so you catch poison messages before they become a crisis.


Sign in to save

Comments (0)

No comments yet. Be the first!

Leave a Comment

Sign in to comment with your profile.

πŸ“¬ Weekly Newsletter

Stay ahead of the curve

Get the best programming tutorials, data analytics tips, and tool reviews delivered to your inbox every week.

No spam. Unsubscribe anytime.