Designing idempotent event-driven systems, The Cloud Ledger

The bug report read: "Customer charged three times for one order." The order service had published one event. SQS had, as it is allowed to, delivered it more than once, and the payment consumer happily processed each copy. Nothing was broken according to the AWS docs. The system was broken according to the customer.

At-least-once delivery is the default across most of the AWS messaging stack, which means duplicates are not an edge case to handle later. They are a guarantee you design around from the start. Idempotency is how you do that.

Where duplicates come from

It helps to know your enemy. Duplicates appear at several layers:

SQS standard queues deliver at least once by design; a consumer that crashes before deleting a message will see it again after the visibility timeout.
SNS can deliver the same notification more than once.
Lambda retries on error, and asynchronous invocations retry twice by default.
EventBridge retries delivery with backoff for up to 24 hours.
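
The first item in that list is the easiest to reproduce on paper. Below is a minimal sketch of the visibility-timeout mechanism, using an in-memory stand-in rather than the SQS API; ToyQueue and simulate_crash_then_retry are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    """In-memory stand-in for a standard queue (hypothetical, not the SQS API)."""

    visibility_timeout: float
    # message body -> time at which the message becomes visible again
    _visible_at: dict = field(default_factory=dict)

    def send(self, body: str) -> None:
        self._visible_at[body] = 0.0

    def receive(self, now: float):
        """Return one visible message and hide it for visibility_timeout."""
        for body, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[body] = now + self.visibility_timeout
                return body
        return None

    def delete(self, body: str) -> None:
        self._visible_at.pop(body, None)


def simulate_crash_then_retry() -> list:
    queue = ToyQueue(visibility_timeout=30.0)
    queue.send("order-123")
    charges = []

    # First delivery: the consumer charges, then crashes before delete().
    charges.append(queue.receive(now=0.0))

    # While the message is hidden, nobody can receive it.
    assert queue.receive(now=10.0) is None

    # After the timeout the same message is delivered again.
    msg = queue.receive(now=31.0)
    charges.append(msg)
    queue.delete(msg)
    return charges
```

A consumer that is not idempotent records two charges for the one message, which is the bug report from the top of this post in miniature.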

FIFO queues and SNS FIFO offer exactly-once processing within a 5-minute deduplication window, but that window is short and FIFO throughput is capped. For most systems the durable answer is consumer-side idempotency.

The idempotency key pattern

The core idea: every message carries a stable identifier, and the consumer records "I have processed this" atomically with doing the work. The natural home for that record is a DynamoDB table with a conditional write.

The key must come from the business event, not be generated by the consumer. A retry has to produce the same key, so use the order ID or a producer-set idempotency token, never a timestamp or a UUID minted at receive time.
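
One way to derive such a key is sketched below. The field names idempotency_token, event_type, and order_id are assumptions about the event shape, not something the producer in this story is known to send; adapt them to whatever your events actually carry.

```python
import hashlib


def idempotency_key(event: dict) -> str:
    """Derive a stable key from producer-set fields only (field names are assumed)."""
    # Prefer an explicit producer-set token when the event carries one.
    token = event.get("idempotency_token")
    if token:
        return str(token)
    # Otherwise hash the business identifiers. Receive-time data such as
    # timestamps or receipt handles is deliberately left out.
    raw = f"{event['event_type']}:{event['order_id']}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Two deliveries of the same event produce the same key no matter when they arrive, because nothing in the key depends on the delivery.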

import time

import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")

def already_processed(idempotency_key: str) -> bool:
    """Claim the key; return True if an earlier delivery already claimed it."""
    try:
        ddb.put_item(
            TableName="processed_events",
            Item={
                "event_id": {"S": idempotency_key},
                # Expiry as epoch seconds, 7 days from now
                "ttl": {"N": str(int(time.time()) + 7 * 86400)},
            },
            # Fail the write if a record with this key already exists
            ConditionExpression="attribute_not_exists(event_id)",
        )
        return False  # we won the race; safe to process
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # someone already processed this
        raise

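The conditional put_item is the whole trick. It succeeds only if the key is new, and it is atomic, so two concurrent deliveries cannot both win. The TTL keeps the table from growing forever, provided TTL is enabled on the table for the ttl attribute and the expiry is longer than the longest window in which a duplicate can still arrive.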
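
To see why atomicity matters, here is a small sketch that races several deliveries of one message against an in-memory stand-in for the conditional write. InMemoryClaims and race are hypothetical names, and a lock plays the role that DynamoDB's condition check plays in the real table.

```python
import threading


class InMemoryClaims:
    """Stand-in for the conditional put: claim() succeeds once per key."""

    def __init__(self) -> None:
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, key: str) -> bool:
        # The lock makes check-and-insert a single atomic step.
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


def race(key: str, deliveries: int) -> int:
    """Run concurrent deliveries of one message; return how many won the claim."""
    claims = InMemoryClaims()
    results = []
    barrier = threading.Barrier(deliveries)

    def consumer() -> None:
        barrier.wait()  # line the threads up so they claim at the same time
        results.append(claims.claim(key))

    threads = [threading.Thread(target=consumer) for _ in range(deliveries)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

However many deliveries race, exactly one claim succeeds. A separate read followed by a write would not give that guarantee, which is why the check and the insert have to be one operation.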

Ordering the write and the side effect

Recording the key is not enough if you crash between the side effect and the record. Two safe orderings:

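Claim first. Write the key, then do the work. If you crash after the work but before acking the message, the redelivery sees the key and skips. If you crash after the claim but before the work finishes, the redelivery also skips and the work is lost, so delete the claim when the work fails and make the work itself idempotent (e.g., an upsert) so that a retry is harmless.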
Transactional outbox. Make the state change and the dedup record one DynamoDB transaction via TransactWriteItems, so they commit together or not at all.
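
The second ordering might be sketched as follows. The function only builds the request for TransactWriteItems, so it can be inspected without touching AWS. The orders table, its order_id key, and the status attribute are assumptions made for illustration; processed_events matches the table used earlier.

```python
import time


def build_transaction(idempotency_key: str, order_id: str, ttl_days: int = 7) -> dict:
    """Build keyword arguments for a DynamoDB TransactWriteItems call."""
    expires_at = int(time.time()) + ttl_days * 86400
    return {
        "TransactItems": [
            {
                # Dedup record: cancels the whole transaction if the key exists.
                "Put": {
                    "TableName": "processed_events",
                    "Item": {
                        "event_id": {"S": idempotency_key},
                        "ttl": {"N": str(expires_at)},
                    },
                    "ConditionExpression": "attribute_not_exists(event_id)",
                }
            },
            {
                # State change: hypothetical "orders" table and attribute names.
                # "status" is a DynamoDB reserved word, hence the #s placeholder.
                "Update": {
                    "TableName": "orders",
                    "Key": {"order_id": {"S": order_id}},
                    "UpdateExpression": "SET #s = :paid",
                    "ExpressionAttributeNames": {"#s": "status"},
                    "ExpressionAttributeValues": {":paid": {"S": "PAID"}},
                }
            },
        ]
    }
```

With a boto3 client this would be sent as ddb.transact_write_items(**build_transaction(key, order_id)). A duplicate delivery fails the condition on the first item, the call raises a ClientError with code TransactionCanceledException, and the order update is not applied.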

For payments, where the side effect is an external API call, I add an idempotency token on that call too. Stripe and most payment APIs accept one, so even a duplicated downstream request collapses to a single charge.
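
The effect of a downstream idempotency token can be shown with a toy gateway. FakePaymentGateway is a hypothetical stand-in written for this post, not Stripe's client library; it only models the behavior of returning the first response for a repeated token.

```python
class FakePaymentGateway:
    """Hypothetical stand-in for a payment API that honors idempotency tokens."""

    def __init__(self) -> None:
        self._responses = {}  # token -> response returned for the first request
        self.charges_made = 0

    def charge(self, amount_cents: int, idempotency_token: str) -> dict:
        # A repeated token returns the stored response instead of charging again.
        if idempotency_token in self._responses:
            return self._responses[idempotency_token]
        self.charges_made += 1
        response = {
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._responses[idempotency_token] = response
        return response
```

Calling charge twice with the same token yields one charge and two identical responses. With Stripe the token travels in the Idempotency-Key request header, and it should be derived from the same business key the consumer uses for deduplication.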

Don't forget the poison messages

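Idempotency protects against duplicates of valid messages. It does nothing for a message that always fails. Without a dead-letter queue, that message is redelivered until the queue's retention period expires, burning consumer capacity on every attempt, and on a FIFO queue it blocks every message behind it in the same message group. Configure a DLQ with a sane maxReceiveCount: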

resource "aws_sqs_queue" "orders" {
  name                       = "orders"
  visibility_timeout_seconds = 60
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-dlq"
  message_retention_seconds = 1209600  # 14 days
}
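
What the redrive policy does to a poison message can be modeled in a few lines. This is a toy model of the receive-count rule, not the SQS API; deliver_until_redrive and its handler argument are hypothetical.

```python
def deliver_until_redrive(handler, message: str, max_receive_count: int = 5) -> dict:
    """Toy model of a redrive policy: retry until the receive count is used up."""
    receives = 0
    while receives < max_receive_count:
        receives += 1
        try:
            handler(message)
            return {"receives": receives, "dead_lettered": False}
        except Exception:
            # The handler failed, so the message is not deleted and will
            # become visible again for another attempt.
            continue
    # Receive count exhausted: the message moves to the dead-letter queue.
    return {"receives": receives, "dead_lettered": True}
```

A handler that always raises is attempted five times and then set aside, while a healthy handler returns after the first receive. The consumer never has to decide for itself when to give up.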

Takeaways

Treat at-least-once delivery as the default and make every consumer idempotent rather than hoping for exactly-once.
Derive the idempotency key from the business event so retries reproduce it; never mint it at receive time.
Use a DynamoDB conditional write to claim the key atomically, or TransactWriteItems when the claim and the state change must commit together.
Pair idempotency with a dead-letter queue so poison messages exit the retry loop instead of blocking it.
