Why Webhooks Fail (And They Will)
Let's get this out of the way: webhooks are unreliable by design. They're HTTP requests fired across the open internet. Things go wrong. In my experience, the failure modes fall into a few predictable buckets:
- Network timeouts. Your server took 31 seconds to respond. The PSP gave up at 30. You processed the payment internally but never acknowledged it, so the PSP thinks it failed.
- Server restarts. You deployed at 2:47 PM. A webhook arrived at 2:47 PM. Your container was recycling. Nobody was home.
- Queue overflows. A batch settlement dropped 4,000 webhooks on your endpoint in 12 seconds. Your worker pool choked.
- Duplicate deliveries. The PSP's retry logic fired because your 200 response arrived 1ms after their timeout. Now you've got the same webhook twice.
None of these are exotic. They happen weekly at scale. The question isn't if your webhook pipeline will drop events — it's whether your system recovers gracefully when it does.
Receive Fast, Process Later
This is the single most important pattern. Your webhook endpoint should do three things: validate the signature, persist the raw payload, and return 200 OK. That's it. All the heavy lifting — updating balances, sending emails, reconciling ledgers — happens asynchronously.
The reason is simple: PSPs have short timeout windows (typically 5–30 seconds). If your endpoint is busy talking to a database, calling a third-party API, or running business logic, you're gambling with that window. One slow query and the PSP marks the delivery as failed.
Here's what our Go handler looks like in practice:
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	// Cap the body size so an oversized request cannot exhaust memory
	// (the 1 MiB limit is an assumed value; tune it to your PSP's payloads)
	body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20))
	if err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}
	// Verify signature BEFORE anything else
	sig := r.Header.Get("X-Signature-256")
	if !verifyHMAC(body, sig, webhookSecret) {
		w.WriteHeader(http.StatusUnauthorized)
		return
	}
	// Persist raw payload, return immediately
	if err := enqueueWebhook(r.Context(), body); err != nil {
		log.Error("failed to enqueue webhook", "err", err)
		// A 5xx tells the PSP to retry, so the event is not silently dropped
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}
The entire handler runs in under 50ms. The worker picks up the event from the queue and does the real work. If the worker crashes mid-processing, the message has not been acknowledged yet, so it stays in the queue and gets picked up again. Nothing is lost.
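The "nothing is lost" guarantee depends on the worker acknowledging a message only after processing succeeds. Here is a minimal sketch of that contract. The `memQueue` type and its `Reserve`, `Ack`, and `Nack` methods are hypothetical in-memory stand-ins, not the queue we run in production; a real deployment would use a durable broker or a database table.

```go
package main

import (
	"errors"
	"sync"
)

// message is a raw webhook payload plus a delivery counter.
type message struct {
	ID       int
	Payload  []byte
	Attempts int
}

// memQueue is a hypothetical in-memory stand-in for a durable queue.
// Reserved messages stay in the inflight map until they are acked.
type memQueue struct {
	mu       sync.Mutex
	nextID   int
	ready    []*message
	inflight map[int]*message
}

func newMemQueue() *memQueue {
	return &memQueue{inflight: make(map[int]*message)}
}

// Enqueue stores the payload and returns immediately.
func (q *memQueue) Enqueue(payload []byte) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.nextID++
	q.ready = append(q.ready, &message{ID: q.nextID, Payload: payload})
}

// Reserve hands out the oldest ready message without deleting it.
func (q *memQueue) Reserve() (*message, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.ready) == 0 {
		return nil, false
	}
	m := q.ready[0]
	q.ready = q.ready[1:]
	m.Attempts++
	q.inflight[m.ID] = m
	return m, true
}

// Ack removes the message for good; call it only after processing succeeded.
func (q *memQueue) Ack(id int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.inflight, id)
}

// Nack returns a reserved message to the ready list for another attempt.
func (q *memQueue) Nack(id int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if m, ok := q.inflight[id]; ok {
		delete(q.inflight, id)
		q.ready = append(q.ready, m)
	}
}

// Pending counts messages that have not been acked yet.
func (q *memQueue) Pending() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.ready) + len(q.inflight)
}

// workOnce processes one message and acks only on success.
// The first return value reports whether a message was available.
func workOnce(q *memQueue, process func([]byte) error) (bool, error) {
	m, ok := q.Reserve()
	if !ok {
		return false, nil
	}
	if err := process(m.Payload); err != nil {
		q.Nack(m.ID)
		return true, err
	}
	q.Ack(m.ID)
	return true, nil
}

// errDownstream is a sample error for simulating a failed dependency.
var errDownstream = errors.New("downstream unavailable")
```

A worker that crashes between `Reserve` and `Ack` leaves the message in the inflight set. Durable queues typically redeliver such messages after a visibility timeout, which this sketch leaves out.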
Signature Verification: Don't Skip This
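Most reputable PSPs sign their webhooks, and HMAC-SHA256 is the most common scheme. The flow is straightforward: they compute an HMAC over the raw request body with a shared secret, attach the result as a header, and you recompute it on your end. If the two values match, the payload is authentic and untampered. Compute it over the exact bytes you received, not over re-serialized JSON, or the values will not match.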
func verifyHMAC(payload []byte, signature, secret string) bool {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(payload)
	expected := hex.EncodeToString(mac.Sum(nil))
	// Constant-time comparison prevents timing attacks
	return hmac.Equal([]byte(expected), []byte(signature))
}
Why constant-time comparison matters: A naive string comparison (==) leaks timing information. An attacker can brute-force the signature byte-by-byte by measuring response times. hmac.Equal runs in constant time regardless of where the strings differ.
I've seen teams skip verification in staging and forget to enable it in production. Don't. Treat an unverified webhook like an unauthenticated API call — reject it at the door.
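To see the check work end to end, here is a small self-contained sketch that signs a payload the way a PSP would and then verifies it. The `signPayload` helper, the sample secret, and the hex encoding are assumptions for illustration; check your PSP's documentation for the actual header format and encoding.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
)

// signPayload mimics the sender side: HMAC-SHA256 over the raw body,
// hex-encoded. The hex encoding is an assumption, not a universal rule.
func signPayload(payload []byte, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifySignature recomputes the HMAC and compares it in constant time.
func verifySignature(payload []byte, signature, secret string) bool {
	expected := signPayload(payload, secret)
	return hmac.Equal([]byte(expected), []byte(signature))
}
```

Changing a single byte of the body, using the wrong secret, or sending an empty signature all make `verifySignature` return false. This is a handy staging test for confirming that verification is actually switched on.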
Idempotency: The Non-Negotiable
PSPs will send you the same webhook more than once. Stripe says so in their docs. So does Adyen. So does every gateway I've integrated. Your system needs to handle duplicates without double-crediting an account or sending two confirmation emails.
The pattern is simple: every webhook carries a unique event ID. Before processing, check if you've seen it before.
func processWebhook(ctx context.Context, event WebhookEvent) error {
	// Atomic check-and-insert using unique constraint
	inserted, err := db.InsertIfNotExists(ctx,
		"processed_webhooks",
		"event_id", event.ID,
	)
	if err != nil {
		return fmt.Errorf("idempotency check failed: %w", err)
	}
	if !inserted {
		log.Info("duplicate webhook, skipping", "event_id", event.ID)
		return nil // Already processed
	}
	// Safe to process. Caveat: the marker insert and the handler's writes
	// should commit in the same transaction. Otherwise a failure here leaves
	// the event marked as processed, and the retry is skipped as a duplicate.
	return handlePaymentEvent(ctx, event)
}
Watch out for duplicate deliveries. In one incident, a PSP retry storm sent us 23 copies of the same payment.captured event in 4 minutes. Without idempotency keys, that would have been 23 credits to the same merchant account. The unique constraint on event_id caught every single duplicate. This isn't a nice-to-have — it's the difference between a normal Tuesday and an incident review.
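The duplicate-handling rule can be sketched without a database. The `dedupeStore` type below is a hypothetical in-memory stand-in for the processed_webhooks table, with a map playing the role of the unique constraint. It also shows one subtlety: the marker is kept only if the handler succeeds, so a failed attempt can still be retried. Holding one lock across the handler is a simplification that imitates a single transaction; a real system would scope that to the row or event.

```go
package main

import (
	"errors"
	"sync"
)

// dedupeStore is a hypothetical in-memory stand-in for the
// processed_webhooks table. The map acts as the unique constraint.
type dedupeStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newDedupeStore() *dedupeStore {
	return &dedupeStore{seen: make(map[string]bool)}
}

// processOnce runs handle at most once per successfully processed event ID.
// It returns true only when the handler ran and succeeded on this call.
func (s *dedupeStore) processOnce(eventID string, handle func() error) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[eventID] {
		return false, nil // duplicate, skip
	}
	if err := handle(); err != nil {
		return false, err // no marker written, so a retry can run again
	}
	s.seen[eventID] = true
	return true, nil
}

// errLedgerDown is a sample error for simulating a failed dependency.
var errLedgerDown = errors.New("ledger unavailable")
```

Feeding the same event ID through `processOnce` 23 times runs the handler once. An event whose handler fails is not marked, so the next delivery processes it normally.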
Retry Strategy: Exponential Backoff with Jitter
When your worker fails to process a webhook (maybe the downstream API is down, maybe the database is overloaded), you need a retry strategy that doesn't make things worse. Exponential backoff is the standard: wait 1 second, then 2, then 4, then 8, and so on.
But pure exponential backoff has a thundering herd problem. If 500 webhooks all fail at the same time, they'll all retry at the same time too. Adding jitter — a random offset — spreads the retries out:
func backoffWithJitter(attempt int) time.Duration {
	base := math.Pow(2, float64(attempt)) // 1, 2, 4, 8...
	maxDelay := math.Min(base, 60)        // Cap at 60 seconds
	jitter := rand.Float64() * maxDelay   // Random [0, maxDelay)
	return time.Duration(jitter * float64(time.Second))
}
After exhausting retries (we use 8 attempts over roughly 72 hours, which takes a much larger base delay and cap than the 1-second base and 60-second cap in the snippet above), failed events land in a dead letter queue. An on-call engineer reviews them manually. In practice, DLQ events are rare (maybe 2–3 per month), and they're almost always caused by a bug in our processing logic, not a transient failure.
Ordering: Expect the Unexpected
Here's a fun one: payment.succeeded arrives before payment.created. Sounds impossible, but it happens. Different webhook types might be dispatched from different services inside the PSP, and network routing isn't deterministic.
The fix depends on your domain. For payment state machines, we use a simple rule: only allow forward transitions. If the current state is succeeded and a created event arrives, we ignore it. The state machine enforces the invariant, not the arrival order.
For cases where ordering truly matters, we attach a sequence_number or use the event's created_at timestamp. The worker checks: "Is this event newer than what I've already processed?" If not, skip it.
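The "is this event newer?" check can be sketched as a small guard keyed by payment ID. The `sequenceGuard` type and its field are hypothetical; in a real system the last applied sequence number would live next to the payment record and be updated in the same transaction as the state change.

```go
package main

// sequenceGuard remembers the highest sequence number applied per payment.
// It is a hypothetical in-memory stand-in for a column on the payment row.
type sequenceGuard struct {
	last map[string]int64
}

func newSequenceGuard() *sequenceGuard {
	return &sequenceGuard{last: make(map[string]int64)}
}

// shouldApply reports whether the event is newer than anything already
// applied for that payment, and records it if so.
func (g *sequenceGuard) shouldApply(paymentID string, seq int64) bool {
	if prev, ok := g.last[paymentID]; ok && seq <= prev {
		return false // stale or duplicate, skip it
	}
	g.last[paymentID] = seq
	return true
}
```

Sequence numbers are the safer key when the PSP provides them. Timestamps can tie when two events share the same created_at value, in which case the `<=` comparison would drop the second one.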
State machine transitions we enforce:
- created → processing → succeeded/failed
- Backward transitions are logged but never applied
- Unknown states trigger an alert for manual review
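The forward-only rule can be sketched with a rank per state. The type and constant names below are hypothetical, and so is the choice to give the two terminal states the same rank, which means a terminal state is never overwritten by the other one.

```go
package main

// PaymentState is the lifecycle state carried by a webhook event.
type PaymentState string

const (
	StateCreated    PaymentState = "created"
	StateProcessing PaymentState = "processing"
	StateSucceeded  PaymentState = "succeeded"
	StateFailed     PaymentState = "failed"
)

// stateRank orders states along the lifecycle. Terminal states share a rank.
var stateRank = map[PaymentState]int{
	StateCreated:    0,
	StateProcessing: 1,
	StateSucceeded:  2,
	StateFailed:     2,
}

// TransitionResult tells the caller what to do besides updating state.
type TransitionResult int

const (
	Applied           TransitionResult = iota // state moved forward
	IgnoredNotForward                         // log it, do not apply
	UnknownState                              // alert for manual review
)

// transition returns the state to store and how the event was handled.
func transition(current, next PaymentState) (PaymentState, TransitionResult) {
	cur, okCur := stateRank[current]
	nxt, okNext := stateRank[next]
	if !okCur || !okNext {
		return current, UnknownState
	}
	if nxt <= cur {
		return current, IgnoredNotForward
	}
	return next, Applied
}
```

If succeeded arrives first and created shows up later, `transition` keeps succeeded and reports `IgnoredNotForward`, so arrival order no longer matters.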
Monitoring: What to Track
You can't fix what you can't see. We track three key metrics on every webhook pipeline:
- Webhook lag: time between the PSP's created_at timestamp and when our worker finishes processing. If this creeps above 30 seconds, something is backing up.
- Failure rate: percentage of webhooks that fail processing on the first attempt. Baseline is under 0.1%. A spike usually means a downstream dependency is degraded.
- Queue depth: how many unprocessed webhooks are sitting in the queue. This is the early warning system. If depth is growing faster than workers can drain it, you need to scale up or investigate.
We also run a synthetic monitor that sends a test webhook every 5 minutes and verifies end-to-end processing. If the test event doesn't complete within 60 seconds, PagerDuty fires. More than once, it has caught issues before real traffic was affected.
Pro tip: Log the raw webhook payload before any processing. When something goes wrong at 3 AM, you'll want to replay the exact payload that caused the issue. We store raw payloads for 90 days — it's saved us during more than a few post-mortems.
Putting It All Together
Webhook reliability isn't one clever trick. It's a stack of boring, well-understood patterns applied consistently: verify signatures, persist immediately, process asynchronously, handle duplicates, retry with backoff, and monitor everything. Each pattern is simple on its own. Together, they're the difference between a payment system that works and one that works reliably.
After running this stack across 14+ gateway integrations at StraitsX, the numbers speak for themselves: 99.97% first-attempt delivery rate, sub-second p95 processing time, and zero missed settlements in the last 18 months. The patterns aren't glamorous, but they work.
References
- Stripe Webhook Documentation — best-in-class reference for webhook integration patterns
- Adyen Webhooks Guide — detailed coverage of HMAC verification and retry behavior
- AWS: Implementing Idempotent Lambda Functions — idempotency patterns applicable beyond Lambda
- Exponential Backoff and Jitter (AWS) — the definitive write-up on retry strategies
- Brandur Leach: Webhooks — excellent deep dive on webhook design from a Stripe engineer
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Technical specifications are subject to change — always verify with official documentation.