If you've ever built a payment system that sends webhooks to merchants, you know the feeling. Everything works fine in staging. You ship it. Then at 2 AM on a Saturday, you discover that a downstream partner changed their TLS certificate and your retry logic has been silently dropping events for six hours. Nobody got paged because your monitoring only checked HTTP 200s, not actual delivery confirmation.
That was us about two years ago. We were processing around 800K payment events per day, and our "notification system" was essentially a goroutine that fired HTTP POST requests inline with the payment flow. It worked until it didn't. What followed was a six-month rebuild that taught me more about distributed systems than any textbook.
The Problem with Inline Webhook Delivery
The naive approach — fire a webhook inside your payment transaction handler — has a few failure modes that compound fast:
- The merchant endpoint is slow or down, and your payment processing latency spikes because you're waiting on their server.
- Your process crashes after committing the payment but before sending the webhook. The event is lost forever.
- You retry on failure, but the merchant's endpoint is flapping, so you end up delivering the same event 14 times.
- Events arrive out of order because retries for event A complete after the first attempt for event B.
Each of these is solvable individually. But solving all of them together without introducing new failure modes — that's the actual engineering challenge.
The Outbox Pattern: Decouple or Die
The single most impactful change we made was adopting the transactional outbox pattern. Instead of sending webhooks inline, we write the event to an outbox table in the same database transaction that records the payment state change. A separate worker polls the outbox and handles delivery.
The key insight: the outbox write and the payment state change are in the same transaction. Either both happen or neither does. No more lost events.
```go
func (s *PaymentService) CompletePayment(ctx context.Context, tx *sql.Tx, p *Payment) error {
	// Update payment status
	if err := s.repo.UpdateStatus(ctx, tx, p.ID, StatusCompleted); err != nil {
		return fmt.Errorf("update status: %w", err)
	}
	// Write to outbox in the SAME transaction
	event := OutboxEvent{
		ID:          uuid.New().String(),
		AggregateID: p.ID,
		EventType:   "payment.completed",
		Payload:     marshalPaymentEvent(p),
		CreatedAt:   time.Now().UTC(),
	}
	if err := s.outbox.Insert(ctx, tx, &event); err != nil {
		return fmt.Errorf("outbox insert: %w", err)
	}
	return tx.Commit()
}
```
Tip: Keep your outbox table lean. We store only the event ID, type, payload, status, and timestamps. Anything else belongs in your domain tables. A bloated outbox table will slow down your polling queries as it grows.
Exponential Backoff with Jitter
When a delivery attempt fails, you need to retry — but not immediately, and definitely not all at once. We use exponential backoff with full jitter, which spreads retry attempts across time and prevents thundering herd problems when a merchant endpoint recovers.
The formula is straightforward: delay = min(base * 2^attempt, maxDelay), then apply random jitter between 0 and that value. Without jitter, all failed events for the same merchant retry at exactly the same instant, which is basically a self-inflicted DDoS.
```go
func retryDelay(attempt int) time.Duration {
	base := 1 * time.Second
	maxDelay := 5 * time.Minute
	// Clamp the shift so a runaway attempt counter can't overflow the
	// duration before the maxDelay check below.
	if attempt > 20 {
		attempt = 20
	}
	delay := base * time.Duration(1<<uint(attempt))
	if delay > maxDelay {
		delay = maxDelay
	}
	// Full jitter: uniform random duration in [0, delay)
	return time.Duration(rand.Int63n(int64(delay)))
}
```
After five failed attempts, we route the event to a dead letter queue. At that point, something is genuinely wrong — the merchant's endpoint is misconfigured, their server is down for an extended period, or there's a payload issue. An engineer needs to look at it.
Idempotent Consumers: The Merchant's Side
Even with perfect retry logic on your end, merchants will receive duplicate events. Network timeouts, load balancer retries, and consumer restarts all contribute. Every webhook consumer needs to be idempotent.
We include an X-Event-ID header with every delivery. Merchants should store processed event IDs and skip duplicates. On our side, we document this clearly and provide SDKs that handle deduplication automatically.
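A merchant-side consumer might implement the deduplication like this. The in-memory map is only a sketch: a real consumer would persist processed IDs (for example, in a unique-keyed table) so a restart doesn't reprocess events.

```go
import "sync"

// Dedup skips webhook deliveries whose X-Event-ID has already been
// processed. IDs are only marked seen after the handler succeeds, so a
// failed handler can be retried safely.
type Dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDedup() *Dedup {
	return &Dedup{seen: make(map[string]bool)}
}

// Process runs handle at most once per event ID.
func (d *Dedup) Process(eventID string, handle func() error) error {
	d.mu.Lock()
	if d.seen[eventID] {
		d.mu.Unlock()
		return nil // duplicate delivery: already handled
	}
	d.mu.Unlock()
	if err := handle(); err != nil {
		return err // not marked seen, so a retry can reprocess
	}
	d.mu.Lock()
	d.seen[eventID] = true
	d.mu.Unlock()
	return nil
}
```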
Warning: Don't rely on the merchant implementing idempotency correctly. Design your retry logic to minimize duplicates in the first place. We track delivery confirmations (HTTP 2xx responses) and only retry on genuine failures — never on ambiguous timeouts where the merchant might have processed the event.
Signature Verification
Every webhook payload must be signed. We use HMAC-SHA256 with a per-merchant secret, and include a timestamp in the signed content to prevent replay attacks. The merchant verifies the signature before processing.
```go
func signPayload(secret []byte, timestamp int64, body []byte) string {
	msg := fmt.Sprintf("%d.%s", timestamp, body)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(msg))
	return hex.EncodeToString(mac.Sum(nil))
}

// Webhook headers sent with every delivery:
//   X-Signature: sha256=<hex-encoded HMAC>
//   X-Timestamp: <unix seconds>
//   X-Event-ID:  <uuid>
```
Signatures whose timestamp is more than five minutes old are treated as invalid. This is a simple but effective guard against replay attacks; Stripe's webhook signatures use the same window, and for good reason: it works.
Monitoring What Actually Matters
After the rebuild, we instrumented everything. But the metrics that actually saved us from incidents were surprisingly few:
- Delivery success rate per merchant — a sudden drop for one merchant means their endpoint is down. A gradual drop across all merchants means we broke something.
- Outbox lag — the time between event creation and first delivery attempt. If this grows, your workers are falling behind.
- DLQ depth — should be near zero. Any sustained growth needs immediate attention.
- p95/p99 delivery latency — we alert if p95 exceeds 2 seconds, which usually indicates network issues or slow merchant endpoints.
We pipe all of this into Prometheus and Grafana, with PagerDuty alerts on the critical thresholds. The outbox lag metric alone has caught three potential incidents before they became customer-facing.
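The per-merchant success-rate tracking can be sketched with stdlib `expvar` counters standing in for the Prometheus ones (the metric and function names below are illustrative):

```go
import "expvar"

// Per-merchant delivery counters, published via expvar. The success rate
// is derived from the raw counts rather than stored, so it can't drift.
var (
	delivered = expvar.NewMap("webhook_delivered_total")
	failed    = expvar.NewMap("webhook_failed_total")
)

func recordDelivery(merchantID string, ok bool) {
	if ok {
		delivered.Add(merchantID, 1)
	} else {
		failed.Add(merchantID, 1)
	}
}

// successRate returns delivered / (delivered + failed) for one merchant,
// or 0 if no deliveries have been recorded yet.
func successRate(merchantID string) float64 {
	get := func(m *expvar.Map) float64 {
		if v := m.Get(merchantID); v != nil {
			return float64(v.(*expvar.Int).Value())
		}
		return 0
	}
	d, f := get(delivered), get(failed)
	if d+f == 0 {
		return 0
	}
	return d / (d + f)
}
```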
Tip: Build a webhook delivery dashboard that your support team can access. When a merchant reports missing notifications, the first thing you want is a per-merchant delivery log showing every attempt, response code, and latency. This cuts incident resolution time dramatically.
Lessons from Production
A few things I wish I'd known before starting this rebuild:
- Set aggressive timeouts on outbound requests. We use a 5-second connect timeout and 10-second read timeout. A merchant endpoint that takes 30 seconds to respond is effectively down — don't let it drag your worker pool down with it.
- Partition your delivery workers by merchant. One merchant with a broken endpoint shouldn't block deliveries to everyone else. We use consistent hashing to assign merchants to worker partitions.
- Version your webhook payloads from day one. We didn't, and migrating 4,000 merchants to a new payload format was a six-month project involving a compatibility layer that still haunts our codebase.
- Test with chaos. We run monthly game days where we inject random failures — dropped connections, slow responses, certificate errors. Every time, we find something new.
The notification system is one of those things that nobody thinks about when it works and everyone notices when it doesn't. Getting it right takes deliberate engineering, not just bolting on a retry loop. The outbox pattern, proper backoff, idempotency, and honest monitoring — together, they turned our "webhook hell" into infrastructure we actually trust.
References
- Stripe Webhook Documentation — Best practices for webhook delivery and signature verification
- AWS SQS Dead Letter Queues — Official guide to configuring DLQs for message processing failures
- Exponential Backoff and Jitter — AWS Builders' Library — Marc Brooker's definitive article on retry strategies
- Transactional Outbox Pattern — Microservices.io — Pattern description and implementation guidance
- AWS Well-Architected Reliability Pillar — Graceful degradation and failure handling strategies
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.