Tags: idempotency, distributed-systems, resilience, messaging

Designing for Idempotency: The Pattern Every Distributed System Needs

January 20, 2024·13 min read·by Bishwambhar Sen
A sequence diagram showing a producer retrying a message three times to a consumer that uses an idempotency key store to deduplicate and execute the operation only once

Concept

An operation is idempotent if applying it multiple times produces the same result as applying it once. In mathematics, f(f(x)) = f(x). In distributed systems, it means the payment is charged once even though the charge request was sent three times because of network retries.

Idempotency is a design constraint, not an optimisation. The reason it is mandatory in distributed systems is simple: at-least-once delivery is the default guarantee of most message brokers, HTTP with retries, and event-driven systems. Networks drop responses. Consumers crash after processing but before acknowledging. Load balancers reroute mid-request. In every case, the sender cannot determine whether the operation completed. It must retry — and the system must be safe when it does.

The distinction between idempotent operations and operations made idempotent by the implementation matters:

  • Naturally idempotent: UPDATE accounts SET account_status = 'suspended' WHERE id = ?. Running this SQL twice produces the same state as running it once.
  • Not naturally idempotent: INSERT INTO transactions VALUES (...). Running this twice inserts two rows. Made idempotent by: adding a unique constraint on an idempotency key and handling the unique violation as a success.
  • Not naturally idempotent: UPDATE balance SET amount = amount - 100 WHERE id = ?. Running this twice debits 200. Made idempotent by: associating the debit with a unique transaction ID and checking whether that transaction ID has already been applied.
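
As a minimal sketch of the third case, the class below makes a debit idempotent by remembering which transaction IDs it has already applied. The Account type and its in-memory HashSet are assumptions for illustration only; in a real system the set of applied IDs would be persisted in the same database transaction as the balance.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory account used only to illustrate the technique.
public sealed class Account
{
    private readonly HashSet<Guid> _appliedTransactionIds = new();

    public decimal Balance { get; private set; }

    public Account(decimal openingBalance) => Balance = openingBalance;

    // Returns true if the debit was applied, or false if this transaction ID
    // was already applied (a retry), in which case the balance is unchanged.
    public bool Debit(Guid transactionId, decimal amount)
    {
        // HashSet.Add returns false when the ID is already present.
        if (!_appliedTransactionIds.Add(transactionId))
            return false;

        Balance -= amount;
        return true;
    }
}
```

Calling Debit twice with the same transaction ID and an amount of 100 on an opening balance of 500 leaves the balance at 400: the second call is recognised as a retry and does nothing.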

The architecture of idempotency in a distributed system consists of three components: the idempotency key (a client-generated unique identifier that travels with the request), the idempotency store (a persistent registry of processed keys and their outcomes), and the retry-safe boundary (the operation that, once checked against the store, is guaranteed to produce the same effect).

Constraints

Idempotency Key Generation and Scope

The idempotency key must be:

  • Client-generated: The server cannot generate it, because the key must exist in the retry request identically to the original. The server generating a key per request means retries always appear as new requests.
  • Unique per logical operation: The key scopes the deduplication. If a customer places two separate orders, each must have a different key. If the same order placement is retried 3 times due to network failure, all 3 retries carry the same key.
  • Bounded in TTL: Idempotency keys should have a retention period (typically 24–72 hours). Once a key has expired, a late retry carrying it is treated as a new operation and executes again, so the retention period must exceed the longest window in which clients retry. Accidental reuse of a key for a different operation is statistically improbable with UUIDs, but the expiry behaviour is worth defining explicitly.
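
The client-side half of these rules can be sketched as follows: the caller generates one key per logical operation and the helper sends that same key on every attempt. The helper name PostWithRetryAsync, the retry policy, and the backoff values are assumptions for illustration; the Idempotency-Key header name follows the Stripe convention mentioned in this post.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class IdempotentClient
{
    // Hypothetical helper: posts jsonBody to url, retrying on transport failures
    // and 5xx responses, and sending the SAME idempotency key on every attempt.
    // Timeouts (TaskCanceledException) are not retried in this sketch.
    public static async Task<HttpResponseMessage> PostWithRetryAsync(
        HttpClient http, string url, string jsonBody, Guid idempotencyKey,
        int maxAttempts = 3, CancellationToken ct = default)
    {
        for (var attempt = 1; ; attempt++)
        {
            // An HttpRequestMessage cannot be sent twice, so build one per attempt.
            using var request = new HttpRequestMessage(HttpMethod.Post, url)
            {
                Content = new StringContent(jsonBody, Encoding.UTF8, "application/json")
            };
            request.Headers.Add("Idempotency-Key", idempotencyKey.ToString());

            try
            {
                var response = await http.SendAsync(request, ct);
                if ((int)response.StatusCode < 500 || attempt == maxAttempts)
                    return response;
                response.Dispose();
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Connection failed or the response was lost: the outcome is unknown,
                // so the only safe move is to retry with the same key.
            }

            // Exponential backoff between attempts: 200 ms, 400 ms, ...
            await Task.Delay(TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)), ct);
        }
    }
}
```

The key is created once by the caller (for example with Guid.NewGuid() when the user submits the order) and passed in, so a retry can never be mistaken for a new operation. A second order would get its own key.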

Stripe's API accepts idempotency keys in the Idempotency-Key HTTP header. AWS SQS and SNS use MessageDeduplicationId for FIFO queues and topics. Azure Service Bus uses MessageId when duplicate detection is enabled. The broker identifiers are infrastructure-level idempotency keys: they stop the broker from accepting the same message twice from a producer. They do not prevent duplicate processing at the application layer if your consumer crashes after processing but before acknowledging.

Application-Layer vs. Broker-Layer Idempotency

Broker-layer idempotency (SQS FIFO deduplication, Service Bus duplicate detection) stops the broker from accepting a second copy of a message with the same ID within its deduplication window (a fixed 5 minutes for SQS FIFO; configurable for Service Bus). This does not protect you from:

  • A consumer that processes the message, crashes before acknowledging, and receives the same message again after the broker's visibility timeout expires.
  • A consumer that applies part of its work, fails before finishing (for example, one of several writes fails outside a single transaction), leaves partial state, and is retried.
  • A message that is resent outside the broker's deduplication window (after 5 minutes for SQS FIFO).

Application-layer idempotency — where the consumer itself checks a deduplication store before processing — is the only defence against all of these. Broker-layer deduplication is a performance optimisation (reduces unnecessary processing), not a correctness guarantee.

The Deduplication Store Consistency Challenge

The deduplication store check and the actual operation must be atomic from the perspective of the business invariant. The correct pattern is: within a single database transaction, (1) attempt to insert the idempotency key into the deduplication table with a unique constraint, and (2) apply the business operation. If the idempotency key already exists (unique constraint violation), the business operation is not applied, and the previous result (stored alongside the key) is returned. If the insert succeeds, the business operation is applied in the same transaction.

This requires the deduplication store to be in the same database as the business data — or transactionally coordinated with it. A deduplication store in Redis combined with a business operation in PostgreSQL cannot be made atomic without a distributed transaction (2PC), which introduces the availability trade-offs discussed in the Saga pattern post. For most systems, co-locating the deduplication table with the business data in the same relational database is the correct default.
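
A minimal sketch of the co-located pattern, using only the ADO.NET base classes: the key claim and the debit run in one transaction on one connection. The idempotency_keys table, its columns, and the IdempotentDebit helper are assumptions for illustration; the SQL uses PostgreSQL's ON CONFLICT syntax, and the connection is assumed to be open already.

```csharp
using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;

public static class IdempotentDebit
{
    // Returns true if the debit was applied, false if the key was already claimed.
    public static async Task<bool> ApplyAsync(
        DbConnection db, Guid idempotencyKey, long accountId, decimal amount,
        CancellationToken ct = default)
    {
        await using var tx = await db.BeginTransactionAsync(ct);

        // (1) Claim the key. With ON CONFLICT DO NOTHING a duplicate key reports
        // zero affected rows instead of raising a unique-violation error.
        await using var claim = db.CreateCommand();
        claim.Transaction = tx;
        claim.CommandText =
            "INSERT INTO idempotency_keys (idempotency_key, created_at) " +
            "VALUES (@key, now()) ON CONFLICT (idempotency_key) DO NOTHING";
        AddParameter(claim, "@key", idempotencyKey);

        if (await claim.ExecuteNonQueryAsync(ct) == 0)
        {
            // Already processed: skip the business operation entirely.
            await tx.RollbackAsync(CancellationToken.None);
            return false;
        }

        // (2) Apply the business operation in the SAME transaction.
        await using var debit = db.CreateCommand();
        debit.Transaction = tx;
        debit.CommandText = "UPDATE balance SET amount = amount - @amount WHERE id = @id";
        AddParameter(debit, "@amount", amount);
        AddParameter(debit, "@id", accountId);
        await debit.ExecuteNonQueryAsync(ct);

        // Key and debit become visible together, or not at all.
        await tx.CommitAsync(ct);
        return true;
    }

    private static void AddParameter(DbCommand cmd, string name, object value)
    {
        var p = cmd.CreateParameter();
        p.ParameterName = name;
        p.Value = value;
        cmd.Parameters.Add(p);
    }
}
```

This sketch only reports whether the key was new. Returning the previous result to a retry would additionally require storing that result alongside the key and reading it back when the claim affects zero rows.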

Safe Retry Boundaries

Not all operations can be made idempotent. Side effects that cross your service boundary — sending an email, publishing an event to an external broker, calling a third-party payment API — have their own idempotency requirements that you cannot enforce unilaterally.

For external payment APIs, use the idempotency key in the outbound request (Stripe, Adyen, and most modern payment processors support this). For email sending, track sent emails by idempotency key and skip resending if already sent. For event publishing, use the transactional outbox pattern: write the event to a local database table in the same transaction as the business operation, and have a background publisher relay it to the broker.

Trade-offs

Deduplication Store Size and Retention

A deduplication table that retains keys indefinitely grows without bound. A system processing 1,000 operations per second writes 86.4 million keys per day; at 10 bytes per key that is about 864 MB per day of raw key data, before stored results and indexing overhead. Retention must be bounded. The correct retention period is the maximum retry window your clients operate within, plus a safety margin. If clients retry for up to 24 hours after a failure, retain keys for 48 hours. Expire old keys automatically: PostgreSQL has no native TTL index, so partition the table by timestamp range (pg_partman can create partitions and drop expired ones) or run a scheduled delete; Redis has built-in key TTL.
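
The sizing and retention arithmetic can be worked through with a small helper. The DedupSizing class and the record sizes below are assumptions for illustration; the real per-record size depends on the key format and on how much of the result is stored with it.

```csharp
using System;

public static class DedupSizing
{
    private const long SecondsPerDay = 86_400;

    // Raw bytes written to the deduplication table per day, before indexes.
    public static long BytesPerDay(long operationsPerSecond, long bytesPerRecord)
        => operationsPerSecond * SecondsPerDay * bytesPerRecord;

    // Retention = the longest window in which clients retry, plus a safety margin.
    public static TimeSpan Retention(TimeSpan maxRetryWindow, TimeSpan safetyMargin)
        => maxRetryWindow + safetyMargin;
}
```

At 1,000 operations per second, BytesPerDay(1000, 10) gives 864,000,000 bytes (about 864 MB) per day, and BytesPerDay(1000, 1000) gives 86,400,000,000 bytes (about 86.4 GB) per day once each record carries roughly 1 KB of stored result. Retention(TimeSpan.FromHours(24), TimeSpan.FromHours(24)) gives the 48 hours used in the example above.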

Performance Impact of the Idempotency Check

Every idempotent operation requires at least one database read (check if key exists) before executing. For high-throughput APIs (10,000 RPS), this adds a database read per request. Mitigations:

  • Index the deduplication table on (idempotency_key, created_at) so the lookup is an index scan, not a table scan.
  • Use an in-memory cache (Redis) as a fast-path check, with the database as the authoritative source. A key present in Redis can be short-circuited without a database read. A key absent from Redis requires a database check (the key may have been processed on another instance or after a cache flush).
  • Shard the deduplication table by key prefix or client ID to distribute write load.
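
The fast-path rule from the second mitigation can be sketched like this: a cache hit is trusted, a cache miss always falls through to the authoritative store. The CachedKeyLookup class is hypothetical, the in-process dictionary stands in for Redis, and the loadFromDatabase delegate stands in for the real query against the deduplication table.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class CachedKeyLookup
{
    // Stand-in for Redis: maps idempotency key -> stored result.
    private readonly ConcurrentDictionary<string, string> _cache = new();

    // Stand-in for the authoritative database lookup; returns null if the key is unknown.
    private readonly Func<string, Task<string?>> _loadFromDatabase;

    public CachedKeyLookup(Func<string, Task<string?>> loadFromDatabase)
        => _loadFromDatabase = loadFromDatabase;

    // Returns the stored result for an already-processed key, or null if the
    // key has never been processed.
    public async Task<string?> FindResultAsync(string idempotencyKey)
    {
        // Present in the cache: short-circuit without touching the database.
        if (_cache.TryGetValue(idempotencyKey, out var cached))
            return cached;

        // Absent from the cache proves nothing: another instance may have
        // processed the key, or the cache may have been flushed.
        var stored = await _loadFromDatabase(idempotencyKey);
        if (stored is not null)
            _cache[idempotencyKey] = stored;

        return stored;
    }
}
```

Only positive results are cached. Caching "not processed" would let a stale entry hide a key that another instance has since claimed, which is exactly the case the database check exists to catch.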

Code

The following C# implements an idempotent command handler that combines a deduplication store check with the business operation in a single database transaction:

using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// IIdempotentCommand<TResult>, ICommandHandler<TCommand, TResult>, and IIdempotencyStore are
// assumed to be application-defined abstractions. The store and the inner handler must run
// on the transaction opened below (for example, by sharing this connection); otherwise the
// claim and the business operation are not atomic.
public class IdempotentCommandHandler<TCommand, TResult>
    where TCommand : IIdempotentCommand<TResult>
{
    private readonly ICommandHandler<TCommand, TResult> _inner;
    private readonly IIdempotencyStore _idempotencyStore;
    // DbConnection rather than IDbConnection: only the abstract base class exposes
    // BeginTransactionAsync, CommitAsync, and RollbackAsync.
    private readonly DbConnection _db;
    private readonly ILogger<IdempotentCommandHandler<TCommand, TResult>> _logger;

    public IdempotentCommandHandler(
        ICommandHandler<TCommand, TResult> inner,
        IIdempotencyStore idempotencyStore,
        DbConnection db,
        ILogger<IdempotentCommandHandler<TCommand, TResult>> logger)
    {
        _inner = inner;
        _idempotencyStore = idempotencyStore;
        _db = db;
        _logger = logger;
    }

    public async Task<TResult> HandleAsync(TCommand command, CancellationToken ct)
    {
        var idempotencyKey = command.IdempotencyKey;
        var commandType = typeof(TCommand).Name;

        // Fast-path check: already processed?
        var existingResult = await _idempotencyStore.GetResultAsync<TResult>(
            idempotencyKey, commandType, ct);

        if (existingResult is not null)
        {
            _logger.LogInformation(
                "Idempotent replay for key={Key}, command={Command}. Returning stored result.",
                idempotencyKey, commandType);
            return existingResult.Value;
        }

        // Open a transaction that covers both the business operation and the idempotency record
        await using var transaction = await _db.BeginTransactionAsync(ct);

        try
        {
            // Attempt to claim the idempotency key — this will throw on duplicate key violation
            // if a concurrent request is processing the same key
            await _idempotencyStore.ClaimAsync(
                idempotencyKey,
                commandType,
                expiresAt: DateTimeOffset.UtcNow.AddHours(48),
                ct);

            // Execute the actual business operation within the same transaction
            var result = await _inner.HandleAsync(command, ct);

            // Store the result so retries can return it without re-executing
            await _idempotencyStore.RecordResultAsync(idempotencyKey, commandType, result, ct);

            await transaction.CommitAsync(ct);

            _logger.LogInformation(
                "Command {Command} executed successfully for key={Key}",
                commandType, idempotencyKey);

            return result;
        }
        catch (IdempotencyKeyConflictException)
        {
            // Another concurrent request is processing this exact key — wait and return their result.
            // Roll back without the caller's token so cleanup still runs if the request was cancelled.
            await transaction.RollbackAsync(CancellationToken.None);

            _logger.LogWarning(
                "Concurrent idempotency key conflict for key={Key}. Waiting for concurrent execution to complete.",
                idempotencyKey);

            // Poll with backoff for the concurrent operation to complete and store its result
            return await WaitForConcurrentResultAsync(idempotencyKey, commandType, ct);
        }
        catch (Exception ex)
        {
            await transaction.RollbackAsync(CancellationToken.None);
            // Do NOT record a result on failure — the key claim is rolled back too, allowing retry
            _logger.LogError(ex, "Command {Command} failed for key={Key}", commandType, idempotencyKey);
            throw;
        }
    }

    // Uses the class-level TResult; a separate type parameter here would not compile,
    // because the stored value could not be returned as TResult.
    private async Task<TResult> WaitForConcurrentResultAsync(
        string idempotencyKey,
        string commandType,
        CancellationToken ct)
    {
        var backoff = TimeSpan.FromMilliseconds(50);
        const int maxAttempts = 10;

        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            await Task.Delay(backoff, ct);
            // Double the delay each attempt, capped at 2 seconds
            backoff = TimeSpan.FromMilliseconds(Math.Min(backoff.TotalMilliseconds * 2, 2000));

            var storedResult = await _idempotencyStore.GetResultAsync<TResult>(idempotencyKey, commandType, ct);
            if (storedResult is not null)
                return storedResult.Value;
        }

        throw new InvalidOperationException(
            $"Concurrent execution for idempotency key {idempotencyKey} did not complete within the wait window.");
    }
}

The transactional outbox pattern below solves the at-least-once event publishing problem: every committed state change has its event published to the message broker at least once, and no event is published for a state change that was rolled back. Publishing happens asynchronously after the commit, and an event may be published more than once:

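using System;
using System.Data;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// IMessageBroker, IDomainEvent, OutboxEntry, OutboxStatus, and the outbox methods called on
// _db are assumed to be application-defined (for example, extension methods on IDbConnection).
public class TransactionalOutboxPublisher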
{
    private readonly IDbConnection _db;
    private readonly IMessageBroker _broker;
    private readonly ILogger<TransactionalOutboxPublisher> _logger;

    public TransactionalOutboxPublisher(
        IDbConnection db,
        IMessageBroker broker,
        ILogger<TransactionalOutboxPublisher> logger)
    {
        _db = db;
        _broker = broker;
        _logger = logger;
    }

    // Step 1: Write event to outbox table in the SAME transaction as the business operation.
    // This guarantees that either both the business state and the outbox entry are committed,
    // or neither is — no orphaned events, no missing events.
    public async Task EnqueueEventAsync<TEvent>(
        TEvent domainEvent,
        IDbTransaction transaction,
        CancellationToken ct)
        where TEvent : IDomainEvent
    {
        var outboxEntry = new OutboxEntry
        {
            Id = Guid.NewGuid(),
            EventType = typeof(TEvent).FullName!,
            Payload = JsonSerializer.Serialize(domainEvent),
            IdempotencyKey = domainEvent.EventId.ToString(), // event ID is the idempotency key
            CreatedAt = DateTimeOffset.UtcNow,
            Status = OutboxStatus.Pending
        };

        await _db.InsertOutboxEntryAsync(outboxEntry, transaction, ct);
    }

    // Step 2: Background relay process — reads pending outbox entries and publishes to broker.
    // This process runs independently and is retried until the event is acknowledged by the broker.
    public async Task RelayPendingEventsAsync(CancellationToken ct)
    {
        var pendingEntries = await _db.GetPendingOutboxEntriesAsync(batchSize: 50, ct);

        foreach (var entry in pendingEntries)
        {
            try
            {
                await _broker.PublishAsync(
                    eventType: entry.EventType,
                    payload: entry.Payload,
                    messageId: entry.IdempotencyKey, // broker uses this for its own deduplication
                    ct: ct);

                await _db.MarkOutboxEntryRelayedAsync(entry.Id, ct);

                _logger.LogDebug(
                    "Relayed outbox entry {EntryId} (type={Type})", entry.Id, entry.EventType);
            }
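            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Reached when publishing fails, or when publishing succeeds but marking the
                // entry as relayed fails. In the second case the entry is published again on
                // the next cycle, which is why delivery is at-least-once. Cancellation is
                // excluded so shutdown propagates instead of counting as a failed attempt.
                _logger.LogWarning(ex,
                    "Failed to relay outbox entry {EntryId}. Will retry on next relay cycle.",
                    entry.Id);

                await _db.IncrementOutboxEntryRetryCountAsync(entry.Id, ct);
            }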
        }
    }
}

The outbox pattern guarantees at-least-once delivery from your service to the broker. Combined with consumer-side idempotency (the first pattern above), you achieve exactly-once business effect with at-least-once infrastructure.

Further Reading