Retrying a failed HTTP request is one of the first things you learn as a backend engineer. It's also one of the most dangerous things you can do in a payment system. I learned this the expensive way — $34,000 expensive — when our "simple" retry logic turned a 12-minute provider outage into a week-long cleanup across 200+ merchant accounts.
The root cause wasn't complicated. Our payment service retried failed charge requests every second, up to five times, with no idempotency keys and no backoff. When the provider came back online, it processed every single retry as a new charge. Customers got billed two, three, sometimes four times for the same order.
Payments Are Not GET Requests
Most retry advice assumes you're retrying something safe — a read operation, a status check, a file download. Payments are fundamentally different. When you retry a charge, you're asking a provider to move money. If the original request actually succeeded but you got a timeout, your retry creates a second charge. The customer pays twice. You've now got a support ticket, a potential chargeback, and a trust problem.
This is why every payment retry strategy starts with one non-negotiable requirement: idempotency keys.
Never retry a payment API call without an idempotency key. If your provider supports idempotency (Stripe, Adyen, Braintree all do), generate a unique key per payment intent and attach it to every request, including retries. Without it, every retry is a new charge from the provider's perspective. This is the single most common cause of duplicate charges in production.
Idempotency Keys as the Foundation
An idempotency key tells the payment provider: "If you've already processed a request with this key, return the original result instead of creating a new charge." It's your safety net. Even if your retry logic has bugs, even if you accidentally fire off ten requests, the provider deduplicates them.
We generate ours from a combination of the merchant ID, order ID, and attempt purpose — not the retry count. The key stays the same across all retries for the same payment intent:
func idempotencyKey(merchantID, orderID string) string {
	// Hash of merchant + order: stable across all retries of the same intent.
	// Requires "crypto/sha256" and "encoding/hex".
	h := sha256.Sum256([]byte(merchantID + ":" + orderID))
	return hex.EncodeToString(h[:16])
}
The Retry Decision Tree
Not every failure deserves a retry. Retrying a "card declined" response is pointless; it will be declined again. Retrying a network timeout makes sense. Here's how we classify errors in practice:
- Retry immediately: Nothing. Never retry immediately.
- Retry with backoff: Network timeouts, TCP connection resets, HTTP 502/503/504, provider-specific "try again" codes.
- Fail fast: HTTP 400 (bad request), 401/403 (auth), 404, card declined, insufficient funds, invalid card number — any 4xx that indicates a client-side or permanent problem.
- Fail and alert: HTTP 500 from the provider that persists. This might indicate a provider-side bug, not a transient issue.
func isRetryable(statusCode int, errCode string) bool {
	// Never retry client errors or permanent declines.
	if statusCode >= 400 && statusCode < 500 {
		return false
	}
	// Retry on gateway errors.
	if statusCode == 502 || statusCode == 503 || statusCode == 504 {
		return true
	}
	// Retry on network-level failures.
	if errCode == "timeout" || errCode == "connection_reset" {
		return true
	}
	return false
}
Exponential Backoff with Jitter
Fixed-interval retries are a trap. If your payment service retries every second and you have 500 concurrent payment requests fail at the same time (say, during a provider blip), you get 500 retries hitting the provider simultaneously one second later. Then again one second after that. This is the thundering herd problem, and it's exactly what turns a minor outage into a major one.
Exponential backoff spaces out retries — 1s, 2s, 4s, 8s, 16s. But pure exponential backoff still synchronizes retries from requests that failed at the same time. Adding jitter (randomness) breaks that synchronization:
| Attempt | Fixed 1s | Exponential | Exp + Jitter |
|---|---|---|---|
| 1 | 1.0s | 1.0s | 0.25–1.0s |
| 2 | 1.0s | 2.0s | 0.5–2.0s |
| 3 | 1.0s | 4.0s | 1.0–4.0s |
| 4 | 1.0s | 8.0s | 2.0–8.0s |
| 5 | 1.0s | 16.0s | 4.0–16.0s |
func backoffWithJitter(attempt int, base time.Duration) time.Duration {
	// Exponential ceiling: 1s, 2s, 4s, 8s... for attempts 1, 2, 3, 4.
	ceiling := base * (1 << (attempt - 1))
	// Random delay in [ceiling/4, ceiling): at least 25% of the ceiling,
	// never more than it. Requires "math/rand" and "time".
	return ceiling/4 + time.Duration(rand.Int63n(int64(3*ceiling/4)))
}
Circuit Breakers: Knowing When to Stop
Retries with backoff handle transient failures. But what about sustained outages? If a provider is down for 30 minutes, you don't want thousands of payment requests queuing up retries. That's where circuit breakers come in.
We use a simple three-state circuit breaker per payment provider. After 10 consecutive failures within a 60-second window, the circuit opens. All new payment attempts to that provider fail immediately with a "provider unavailable" error — no retry, no backoff, no wasted time. Every 30 seconds, we let a single probe request through. If it succeeds, the circuit closes and normal traffic resumes.
During the $34K incident, we had no circuit breaker. Our service kept hammering the provider with retries for the entire 12-minute outage. When the provider recovered, it processed a backlog of ~2,400 retry requests — most of which were duplicates of charges that had actually succeeded before the outage began (the timeouts masked successful charges).
Dead Letter Queues for Exhausted Retries
When a payment exhausts all retry attempts, it can't just disappear. We push it to a dead letter queue (DLQ) with the full context: merchant ID, order ID, amount, idempotency key, every error response received, and timestamps. An on-call engineer reviews the DLQ daily. Some payments get manually retried after a provider confirms the issue is resolved. Others get flagged for merchant notification.
The DLQ also serves as an audit trail. During incident reviews, it tells us exactly how many payments were affected and what error patterns emerged.
Retry Budgets: Fleet-Wide Protection
Individual backoff and circuit breakers protect against per-request and per-provider failures. But we also needed a global safety valve. A retry budget caps the total number of retries across the entire fleet as a percentage of total requests.
We set ours at 10% — if retries exceed 10% of total outgoing payment requests in a rolling 5-minute window, all retries are paused fleet-wide. This prevents a scenario where hundreds of services independently decide to retry, each within their own limits, but collectively overwhelming the provider.
type RetryBudget struct {
	mu       sync.Mutex
	requests int64   // total outgoing requests in the current window
	retries  int64   // retries issued in the current window
	limit    float64 // e.g., 0.10 for 10%
}

// Record counts an outgoing request; retry marks it as a retry.
// (In production the counters reset on a rolling 5-minute window.)
func (rb *RetryBudget) Record(retry bool) {
	rb.mu.Lock()
	defer rb.mu.Unlock()
	rb.requests++
	if retry {
		rb.retries++
	}
}

func (rb *RetryBudget) CanRetry() bool {
	rb.mu.Lock()
	defer rb.mu.Unlock()
	if rb.requests == 0 {
		return true
	}
	return float64(rb.retries)/float64(rb.requests) < rb.limit
}
What We Changed After the Incident
The $34K incident forced a complete rewrite of our retry layer. The fix wasn't any single technique — it was layering all of them together. Idempotency keys on every request. Exponential backoff with full jitter. Per-provider circuit breakers. A fleet-wide retry budget. And a DLQ with alerting for anything that falls through.
The refund process took a week. We had to reconcile every duplicate charge across 200+ merchants, issue refunds, and send apology emails. Some merchants had already issued their own refunds, creating a double-refund problem. The total engineering time spent on cleanup was roughly 3x the time it would have taken to build proper retry logic from the start.
If you're building payment integrations, treat retry logic as critical infrastructure — not an afterthought. The cost of getting it wrong isn't a failed request. It's real money moving in the wrong direction.
References
- Stripe — Error Handling and Retries
- Stripe — Idempotent Requests
- AWS — Error Retries and Exponential Backoff
- Google Cloud — Retry Strategy Best Practices
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.