April 5, 2026 8 min read

Webhook Reliability Patterns: How to Never Miss a Payment Notification

After integrating 14+ payment gateways at StraitsX, I've seen every way a webhook can fail. Here's the playbook I wish I had on day one — the patterns that took us from "why did we miss that settlement?" to a 99.97% delivery rate.

Why Webhooks Fail (And They Will)

Let's get this out of the way: webhooks are unreliable by design. They're HTTP requests fired across the open internet. Things go wrong. In my experience, the failure modes fall into a few predictable buckets:

None of these are exotic. They happen weekly at scale. The question isn't if your webhook pipeline will drop events — it's whether your system recovers gracefully when it does.

Delivery rate target: 99.97%
Max response time: ≤ 5s
Retry window: 72 hr

Receive Fast, Process Later

This is the single most important pattern. Your webhook endpoint should do three things: validate the signature, persist the raw payload, and return 200 OK. That's it. All the heavy lifting — updating balances, sending emails, reconciling ledgers — happens asynchronously.

The reason is simple: payment service providers (PSPs) have short timeout windows (typically 5–30 seconds). If your endpoint is busy talking to a database, calling a third-party API, or running business logic, you're gambling with that window. One slow query and the PSP marks the delivery as failed.

Here's what our Go handler looks like in practice:

import (
    "io"
    "log/slog"
    "net/http"
)

// verifyHMAC, enqueueWebhook, and webhookSecret are defined elsewhere.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(r.Body)
    if err != nil {
        w.WriteHeader(http.StatusBadRequest)
        return
    }

    // Verify signature BEFORE anything else
    sig := r.Header.Get("X-Signature-256")
    if !verifyHMAC(body, sig, webhookSecret) {
        w.WriteHeader(http.StatusUnauthorized)
        return
    }

    // Persist raw payload, return immediately
    if err := enqueueWebhook(r.Context(), body); err != nil {
        slog.Error("failed to enqueue webhook", "err", err)
        // A 5xx tells the PSP to retry the delivery later
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusOK)
}

The entire handler runs in under 50ms. A worker then picks up the event from the queue and does the real work. If the worker crashes mid-processing, the message stays in the queue, provided the queue removes messages only after the worker acknowledges them. Nothing is lost.
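To make the "message stays in the queue" guarantee concrete, here is a minimal sketch of the ack-on-success pattern. It uses an in-memory slice as a stand-in for a durable queue, so the `memQueue` type, its methods, and `processNext` are hypothetical names for illustration only, not our production code. A real deployment would back this with a database table or a message broker.

```go
package main

import (
	"errors"
	"sync"
)

// Message is one raw webhook payload waiting to be processed.
type Message struct {
	ID   int
	Body []byte
}

// memQueue is an in-memory stand-in for a durable queue (hypothetical).
type memQueue struct {
	mu      sync.Mutex
	nextID  int
	pending []Message
}

var errEmpty = errors.New("queue is empty")

// Enqueue stores the payload and returns its queue ID.
func (q *memQueue) Enqueue(body []byte) int {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.nextID++
	q.pending = append(q.pending, Message{ID: q.nextID, Body: body})
	return q.nextID
}

// Peek returns the oldest message without removing it.
func (q *memQueue) Peek() (Message, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return Message{}, false
	}
	return q.pending[0], true
}

// Ack removes a message once it has been processed successfully.
func (q *memQueue) Ack(id int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for i, m := range q.pending {
		if m.ID == id {
			q.pending = append(q.pending[:i], q.pending[i+1:]...)
			return
		}
	}
}

// Len reports how many messages are still waiting.
func (q *memQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending)
}

// processNext handles the oldest message and acks it only on success.
func processNext(q *memQueue, handle func(Message) error) error {
	msg, ok := q.Peek()
	if !ok {
		return errEmpty
	}
	if err := handle(msg); err != nil {
		return err // not acked: the message stays queued for a retry
	}
	q.Ack(msg.ID)
	return nil
}
```

The key design choice is that removal happens after the handler returns, never before. A worker that dies between `Peek` and `Ack` leaves the message where it was.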

Signature Verification: Don't Skip This

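Most PSPs sign their webhooks, typically with HMAC-SHA256. The flow is straightforward: they compute a keyed hash of the raw request body using a shared secret, attach it as a header, and you recompute it on your end over the exact bytes you received. If the two values match, the payload is authentic and untampered. The header name and the encoding (hex or base64) vary by provider, so check their docs.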

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
)

// verifyHMAC assumes the PSP sends the signature as lowercase hex.
func verifyHMAC(payload []byte, signature, secret string) bool {
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write(payload)
    expected := hex.EncodeToString(mac.Sum(nil))

    // Constant-time comparison prevents timing attacks
    return hmac.Equal([]byte(expected), []byte(signature))
}

Why constant-time comparison matters: A naive string comparison (==) leaks timing information. An attacker can brute-force the signature byte-by-byte by measuring response times. hmac.Equal runs in constant time regardless of where the strings differ.

I've seen teams skip verification in staging and forget to enable it in production. Don't. Treat an unverified webhook like an unauthenticated API call — reject it at the door.
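A quick way to convince yourself that verification is wired up correctly is to sign a payload the way a PSP would and check both the happy path and a tampered body. The sketch below is a variant that decodes the hex header and compares raw MAC bytes, which also tolerates uppercase hex. The names `sign` and `verify` are hypothetical, and hex encoding is an assumption that depends on your provider.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// sign returns the lowercase hex HMAC-SHA256 of payload, mimicking the sender.
func sign(payload []byte, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify decodes the hex signature and compares raw MAC bytes in constant time.
// Malformed hex is rejected outright.
func verify(payload []byte, signatureHex, secret string) bool {
	got, err := hex.DecodeString(signatureHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(payload)
	return hmac.Equal(mac.Sum(nil), got)
}
```

Calling `verify(body, sign(body, secret), secret)` should return true, while changing a single byte of `body` or using a different secret should return false. It is worth keeping those three cases as a unit test so nobody can silently disable the check.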

Idempotency: The Non-Negotiable

PSPs will send you the same webhook more than once. Stripe says so in their docs. So does Adyen. So does every gateway I've integrated. Your system needs to handle duplicates without double-crediting an account or sending two confirmation emails.

The pattern is simple: every webhook carries a unique event ID. Before processing, check if you've seen it before.

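import (
    "context"
    "fmt"
    "log/slog"
)

func processWebhook(ctx context.Context, event WebhookEvent) error {
    // Atomic check-and-insert backed by a unique constraint on event_id
    inserted, err := db.InsertIfNotExists(ctx,
        "processed_webhooks",
        "event_id", event.ID,
    )
    if err != nil {
        return fmt.Errorf("idempotency check failed: %w", err)
    }
    if !inserted {
        slog.Info("duplicate webhook, skipping", "event_id", event.ID)
        return nil // Already processed
    }

    // Safe to process. Run this in the same transaction as the insert above
    // (or remove the marker on failure); otherwise a failed attempt stays
    // marked as processed and the retry is skipped as a duplicate.
    return handlePaymentEvent(ctx, event)
}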

Watch out for duplicate deliveries. In one incident, a PSP retry storm sent us 23 copies of the same payment.captured event in 4 minutes. Without idempotency keys, that would have been 23 credits to the same merchant account. The unique constraint on event_id caught every single duplicate. This isn't a nice-to-have — it's the difference between a normal Tuesday and an incident review.
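Here is a small, self-contained sketch of how the duplicate check interacts with retries. A map guarded by a mutex stands in for the table with the unique constraint, and a failed handler releases its claim so the retry is not mistaken for a duplicate. The names `seenStore`, `claim`, `release`, and `processOnce` are hypothetical. With a real database, wrapping the insert and the handler in one transaction gives the same effect.

```go
package main

import (
	"errors"
	"sync"
)

// seenStore is an in-memory stand-in for a table with a unique
// constraint on event_id (hypothetical, for illustration).
type seenStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newSeenStore() *seenStore {
	return &seenStore{seen: make(map[string]bool)}
}

// claim records the event ID and reports whether this call inserted it.
func (s *seenStore) claim(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[id] {
		return false
	}
	s.seen[id] = true
	return true
}

// release undoes a claim so a later retry can run.
func (s *seenStore) release(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.seen, id)
}

var errDownstream = errors.New("downstream unavailable")

// processOnce runs handle unless the event ID was already claimed.
// It reports whether handle ran and returns handle's error, if any.
func processOnce(s *seenStore, id string, handle func() error) (bool, error) {
	if !s.claim(id) {
		return false, nil // duplicate delivery: skip
	}
	if err := handle(); err != nil {
		s.release(id) // let the retry through
		return true, err
	}
	return true, nil
}
```

Feeding the same event ID through `processOnce` 23 times runs the handler once. Feeding an ID whose handler fails, then retrying it, runs the handler again on the retry.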

Retry Strategy: Exponential Backoff with Jitter

When your worker fails to process a webhook (maybe the downstream API is down, maybe the database is overloaded), you need a retry strategy that doesn't make things worse. Exponential backoff is the standard: wait 1 second, then 2, then 4, then 8, and so on.

But pure exponential backoff has a thundering herd problem. If 500 webhooks all fail at the same time, they'll all retry at the same time too. Adding jitter — a random offset — spreads the retries out:

import (
    "math"
    "math/rand"
    "time"
)

func backoffWithJitter(attempt int) time.Duration {
    base := math.Pow(2, float64(attempt)) // 1, 2, 4, 8... seconds (attempt starts at 0)
    capped := math.Min(base, 60)          // Cap at 60 seconds
    jitter := rand.Float64() * capped     // Random [0, capped)
    return time.Duration(jitter * float64(time.Second))
}

After exhausting retries (we use 8 attempts over roughly 72 hours, which implies a much larger base delay and cap than the 60-second cap in the simplified function above), failed events land in a dead letter queue (DLQ). An on-call engineer reviews them manually. In practice, DLQ events are rare, maybe 2–3 per month, and they're almost always caused by a bug in our processing logic, not a transient failure.
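The retry loop and the dead letter handoff fit together roughly as follows. This is a sketch under assumptions, not our production worker: `retryOrDeadLetter`, `fullJitter`, and `sleepRecorder` are hypothetical names, the 1-second base and 60-second cap are taken from the simplified function earlier in this section, and the sleep function is injected so the loop can be tested without waiting.

```go
package main

import (
	"errors"
	"math"
	"math/rand"
	"time"
)

const (
	baseBackoff = time.Second      // assumed base delay
	maxBackoff  = 60 * time.Second // assumed cap
)

var errTransient = errors.New("transient failure")

// fullJitter returns a random delay in [0, min(maxBackoff, baseBackoff*2^attempt)).
func fullJitter(attempt int) time.Duration {
	ceiling := math.Min(float64(maxBackoff), float64(baseBackoff)*math.Pow(2, float64(attempt)))
	return time.Duration(rand.Float64() * ceiling)
}

// sleepRecorder is a test double that records delays instead of sleeping.
type sleepRecorder struct {
	calls   int
	longest time.Duration
}

func (r *sleepRecorder) Sleep(d time.Duration) {
	r.calls++
	if d > r.longest {
		r.longest = d
	}
}

// retryOrDeadLetter calls process up to maxAttempts times, sleeping between
// failures, and hands the last error to deadLetter if every attempt fails.
// It returns the number of attempts made.
func retryOrDeadLetter(
	maxAttempts int,
	process func() error,
	sleep func(time.Duration),
	deadLetter func(lastErr error),
) (attempts int) {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		attempts++
		if err = process(); err == nil {
			return attempts
		}
		if attempt < maxAttempts-1 {
			sleep(fullJitter(attempt)) // no sleep after the final attempt
		}
	}
	deadLetter(err)
	return attempts
}
```

In production you would pass `time.Sleep` (or, better, reschedule the message on the queue with a delay) and a `deadLetter` function that writes to the DLQ. Passing `(&sleepRecorder{}).Sleep` in tests lets you assert on the number and size of the delays.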

Ordering: Expect the Unexpected

Here's a fun one: payment.succeeded arrives before payment.created. Sounds impossible, but it happens. Different webhook types might be dispatched from different services inside the PSP, and network routing isn't deterministic.

The fix depends on your domain. For payment state machines, we use a simple rule: only allow forward transitions. If the current state is succeeded and a created event arrives, we ignore it. The state machine enforces the invariant, not the arrival order.

For cases where ordering truly matters, we attach a sequence_number or use the event's created_at timestamp. The worker checks: "Is this event newer than what I've already processed?" If not, skip it.
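The "is this event newer?" check can be as small as the sketch below. It assumes each event carries a payment ID and a monotonically increasing sequence number, and the `sequenceGate` type is a hypothetical name. The map is not safe for concurrent use, so a real worker would keep the last applied sequence in the database row for the payment.

```go
package main

// sequenceGate remembers the highest sequence number applied per payment.
type sequenceGate struct {
	last map[string]int64
}

func newSequenceGate() *sequenceGate {
	return &sequenceGate{last: make(map[string]int64)}
}

// accept reports whether the event is newer than anything already applied
// for this payment, and records its sequence number if so.
func (g *sequenceGate) accept(paymentID string, seq int64) bool {
	if prev, ok := g.last[paymentID]; ok && seq <= prev {
		return false // stale or repeated event: skip
	}
	g.last[paymentID] = seq
	return true
}
```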

State machine transitions we enforce

  1. created → processing → succeeded / failed
  2. Backward transitions are logged but never applied
  3. Unknown states trigger an alert for manual review
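The three rules above can be sketched as a rank table plus one comparison. The state names come from the list. The `State`, `Outcome`, and `transition` identifiers are hypothetical, and treating succeeded and failed as equally terminal is an assumption of this sketch.

```go
package main

// State is a payment state.
type State string

const (
	Created    State = "created"
	Processing State = "processing"
	Succeeded  State = "succeeded"
	Failed     State = "failed"
)

// rank orders states so that only forward moves are allowed.
// succeeded and failed share the top rank because both are terminal.
var rank = map[State]int{Created: 0, Processing: 1, Succeeded: 2, Failed: 2}

// Outcome tells the caller what to do with an incoming event.
type Outcome int

const (
	Applied Outcome = iota // state advanced
	Ignored                // backward or repeated transition: log and skip
	Unknown                // unrecognized state: alert for manual review
)

// transition returns the state to store and what happened.
func transition(current, incoming State) (State, Outcome) {
	cur, okCur := rank[current]
	next, okNext := rank[incoming]
	if !okCur || !okNext {
		return current, Unknown
	}
	if next <= cur {
		return current, Ignored
	}
	return incoming, Applied
}
```

Because the comparison is on rank and not on adjacency, a succeeded event that arrives while the payment is still created is applied directly, and the late created or processing events are then ignored.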

Monitoring: What to Track

You can't fix what you can't see. We track three key metrics on every webhook pipeline:

We also run a synthetic monitor that sends a test webhook every 5 minutes and verifies end-to-end processing. If the test event doesn't complete within 60 seconds, PagerDuty fires. More than once, it has caught issues before real traffic was affected.

Pro tip: Log the raw webhook payload before any processing. When something goes wrong at 3 AM, you'll want to replay the exact payload that caused the issue. We store raw payloads for 90 days — it's saved us during more than a few post-mortems.

Putting It All Together

Webhook reliability isn't one clever trick. It's a stack of boring, well-understood patterns applied consistently: verify signatures, persist immediately, process asynchronously, handle duplicates, retry with backoff, and monitor everything. Each pattern is simple on its own. Together, they're the difference between a payment system that works and one that works reliably.

After running this stack across 14+ gateway integrations at StraitsX, the numbers speak for themselves: 99.97% first-attempt delivery rate, sub-second p95 processing time, and zero missed settlements in the last 18 months. The patterns aren't glamorous, but they work.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Technical specifications are subject to change — always verify with official documentation.