April 9, 2026 · 10 min read

Building Real-Time Payment Notification Systems — From Webhook Hell to Reliable Delivery

After losing a weekend to a silent webhook failure that left 12,000 merchants without payment confirmations, I rebuilt our entire notification pipeline from scratch. Here's what I learned about making payment event delivery actually reliable.

If you've ever built a payment system that sends webhooks to merchants, you know the feeling. Everything works fine in staging. You ship it. Then at 2 AM on a Saturday, you discover that a downstream partner changed their TLS certificate and your retry logic has been silently dropping events for six hours. Nobody got paged because your monitoring only checked HTTP 200s, not actual delivery confirmation.

That was us about two years ago. We were processing around 800K payment events per day, and our "notification system" was essentially a goroutine that fired HTTP POST requests inline with the payment flow. It worked until it didn't. What followed was a six-month rebuild that taught me more about distributed systems than any textbook.

99.97% delivery rate (after rebuild) · < 500ms p95 first-attempt latency · 3.2B events delivered per month

The Problem with Inline Webhook Delivery

The naive approach — fire a webhook inside your payment transaction handler — has a few failure modes that compound fast:

  1. A slow or unresponsive merchant endpoint adds its latency directly to the payment path, and a hung connection can stall the flow entirely.
  2. A failed POST means the event is simply gone; nothing durable records that it was ever owed.
  3. Failures are silent. If your monitoring only checks HTTP status codes, you learn about dropped events from angry merchants, not from alerts.

Each of these is solvable individually. But solving all of them together without introducing new failure modes — that's the actual engineering challenge.

The Outbox Pattern: Decouple or Die

The single most impactful change we made was adopting the transactional outbox pattern. Instead of sending webhooks inline, we write the event to an outbox table in the same database transaction that records the payment state change. A separate worker polls the outbox and handles delivery.

Payment Service → Outbox Table → Delivery Worker → HTTP Dispatch → Retry Queue → DLQ

Payment events flow from transaction commit through outbox polling to delivery, with failed attempts routed to the retry queue and eventually the dead letter queue.

The key insight: the outbox write and the payment state change are in the same transaction. Either both happen or neither does. No more lost events.

func (s *PaymentService) CompletePayment(ctx context.Context, tx *sql.Tx, p *Payment) error {
    // Update payment status
    if err := s.repo.UpdateStatus(ctx, tx, p.ID, StatusCompleted); err != nil {
        return fmt.Errorf("update status: %w", err)
    }

    // Write to outbox in the SAME transaction
    event := OutboxEvent{
        ID:          uuid.New().String(),
        AggregateID: p.ID,
        EventType:   "payment.completed",
        Payload:     marshalPaymentEvent(p),
        CreatedAt:   time.Now().UTC(),
    }
    if err := s.outbox.Insert(ctx, tx, &event); err != nil {
        return fmt.Errorf("outbox insert: %w", err)
    }

    return tx.Commit()
}

Tip: Keep your outbox table lean. We store only the event ID, type, payload, status, and timestamps. Anything else belongs in your domain tables. A bloated outbox table will slow down your polling queries as it grows.
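
For the delivery side, here's a minimal sketch of the polling worker, not our production code. FetchPending, MarkDelivered, ScheduleRetry, MoveToDLQ, dispatch, and the Attempts field are hypothetical names standing in for whatever your outbox store and HTTP client actually expose, and retryDelay is the backoff function shown in the next section.

const maxAttempts = 5 // after this we give up and route the event to the DLQ

func (w *DeliveryWorker) Run(ctx context.Context) {
    ticker := time.NewTicker(200 * time.Millisecond)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            // Grab a small batch of undelivered events, oldest first.
            events, err := w.outbox.FetchPending(ctx, 100)
            if err != nil {
                log.Printf("fetch pending: %v", err)
                continue
            }
            for _, ev := range events {
                err := w.dispatch(ctx, ev) // HTTP POST to the merchant endpoint
                switch {
                case err == nil:
                    w.outbox.MarkDelivered(ctx, ev.ID)
                case ev.Attempts+1 >= maxAttempts:
                    // Retries exhausted: park the event for manual review.
                    w.outbox.MoveToDLQ(ctx, ev.ID)
                default:
                    // Schedule the next attempt with jittered backoff.
                    w.outbox.ScheduleRetry(ctx, ev.ID, retryDelay(ev.Attempts))
                }
            }
        }
    }
}

The batch size and polling interval are tuning knobs; small batches on a short interval keep outbox lag low without hammering the database.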

Exponential Backoff with Jitter

When a delivery attempt fails, you need to retry — but not immediately, and definitely not all at once. We use exponential backoff with full jitter, which spreads retry attempts across time and prevents thundering herd problems when a merchant endpoint recovers.

Retry schedule: attempt 1 at t+0s (initial), attempt 2 at t+1s (~1s wait), attempt 3 at t+5s (~4s wait), attempt 4 at t+21s (~16s wait), attempt 5 at t+85s (~64s wait), then DLQ for manual review.

The formula is straightforward: delay = min(base * 2^attempt, maxDelay), then apply random jitter between 0 and that value. Without jitter, all failed events for the same merchant retry at exactly the same instant, which is basically a self-inflicted DDoS.

func retryDelay(attempt int) time.Duration {
    base := 1 * time.Second
    maxDelay := 5 * time.Minute

    delay := base * time.Duration(1<<uint(attempt))
    if delay > maxDelay {
        delay = maxDelay
    }

    // Full jitter: uniform random between 0 and calculated delay
    jitter := time.Duration(rand.Int63n(int64(delay)))
    return jitter
}

After five failed attempts, we route the event to a dead letter queue. At that point, something is genuinely wrong — the merchant's endpoint is misconfigured, their server is down for an extended period, or there's a payload issue. An engineer needs to look at it.

Idempotent Consumers: The Merchant's Side

Even with perfect retry logic on your end, merchants will receive duplicate events. Network timeouts, load balancer retries, and consumer restarts all contribute. Every webhook consumer needs to be idempotent.

We include an X-Event-ID header with every delivery. Merchants should store processed event IDs and skip duplicates. On our side, we document this clearly and provide SDKs that handle deduplication automatically.
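
For illustration, the merchant-side dedup check can be as small as a lookup before processing. This is a sketch only, assuming an in-memory seen-set and a hypothetical processPaymentEvent; a real consumer should persist processed IDs (a uniquely keyed table works well) so a restart doesn't wipe its memory.

var seen sync.Map // processed event IDs; use durable storage in production

func webhookHandler(w http.ResponseWriter, r *http.Request) {
    eventID := r.Header.Get("X-Event-ID")
    if eventID == "" {
        http.Error(w, "missing event id", http.StatusBadRequest)
        return
    }

    // LoadOrStore reports loaded=true if this event ID was already processed.
    if _, loaded := seen.LoadOrStore(eventID, struct{}{}); loaded {
        // Duplicate delivery: acknowledge without reprocessing.
        w.WriteHeader(http.StatusOK)
        return
    }

    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "read body", http.StatusBadRequest)
        return
    }
    processPaymentEvent(body) // hypothetical business logic

    w.WriteHeader(http.StatusOK)
}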

Warning: Don't rely on the merchant implementing idempotency correctly. Design your retry logic to minimize duplicates in the first place. We track delivery confirmations (HTTP 2xx responses) and only retry on genuine failures — never on ambiguous timeouts where the merchant might have processed the event.

Signature Verification

Every webhook payload must be signed. We use HMAC-SHA256 with a per-merchant secret, and include a timestamp in the signed content to prevent replay attacks. The merchant verifies the signature before processing.

// signPayload computes an HMAC-SHA256 over "<timestamp>.<body>" using the
// merchant's secret; the hex digest goes into the X-Signature header.
func signPayload(secret []byte, timestamp int64, body []byte) string {
    msg := fmt.Sprintf("%d.%s", timestamp, body)
    mac := hmac.New(sha256.New, secret)
    mac.Write([]byte(msg))
    return hex.EncodeToString(mac.Sum(nil))
}

// Webhook headers sent with every delivery:
// X-Signature: sha256=<hex-encoded HMAC>
// X-Timestamp: <unix seconds>
// X-Event-ID: <uuid>

We reject any verification request where the timestamp is older than five minutes. This is a simple but effective guard against replay attacks. Stripe does something very similar, and for good reason — it works.
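
On the receiving end, verification looks roughly like the sketch below, reusing signPayload from above. The header names and five-minute window match what we send, but error handling and clock-skew tolerance are deliberately pared down.

func verifyWebhook(secret []byte, r *http.Request, body []byte) error {
    ts, err := strconv.ParseInt(r.Header.Get("X-Timestamp"), 10, 64)
    if err != nil {
        return fmt.Errorf("parse timestamp: %w", err)
    }
    // Reject stale timestamps to guard against replay.
    if time.Since(time.Unix(ts, 0)) > 5*time.Minute {
        return errors.New("timestamp too old")
    }

    expected := signPayload(secret, ts, body)
    got := strings.TrimPrefix(r.Header.Get("X-Signature"), "sha256=")

    // Constant-time comparison so the check doesn't leak signature bytes.
    if !hmac.Equal([]byte(got), []byte(expected)) {
        return errors.New("signature mismatch")
    }
    return nil
}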

Monitoring What Actually Matters

After the rebuild, we instrumented everything. But the metrics that actually saved us from incidents were surprisingly few:

  1. Delivery success rate per merchant — a sudden drop for one merchant means their endpoint is down. A gradual drop across all merchants means we broke something.
  2. Outbox lag — the time between event creation and first delivery attempt. If this grows, your workers are falling behind.
  3. DLQ depth — should be near zero. Any sustained growth needs immediate attention.
  4. p95/p99 delivery latency — we alert if p95 exceeds 2 seconds, which usually indicates network issues or slow merchant endpoints.

We pipe all of this into Prometheus and Grafana, with PagerDuty alerts on the critical thresholds. The outbox lag metric alone has caught three potential incidents before they became customer-facing.
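
If you're also on the Prometheus Go client (the prometheus and promauto packages from client_golang), the core instruments might look something like this sketch; metric and label names are illustrative rather than our exact production schema.

var (
    // Delivery attempts by merchant and outcome (success, failure, timeout).
    deliveryTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "webhook_delivery_total",
        Help: "Delivery attempts by merchant and outcome.",
    }, []string{"merchant", "outcome"})

    // Age of the oldest undelivered outbox event; growth means workers lag.
    outboxLag = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "webhook_outbox_lag_seconds",
        Help: "Age of the oldest undelivered outbox event.",
    })

    // Should sit near zero; sustained growth needs immediate attention.
    dlqDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "webhook_dlq_depth",
        Help: "Events currently parked in the dead letter queue.",
    })

    // Per-attempt delivery latency, for the p95/p99 alerts.
    deliveryLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "webhook_delivery_latency_seconds",
        Help:    "Per-attempt delivery latency.",
        Buckets: prometheus.DefBuckets,
    })
)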

Tip: Build a webhook delivery dashboard that your support team can access. When a merchant reports missing notifications, the first thing you want is a per-merchant delivery log showing every attempt, response code, and latency. This cuts incident resolution time dramatically.

Lessons from Production

A few things I wish I'd known before starting this rebuild:

  1. Decoupling delivery from the payment path matters more than any individual retry trick.
  2. Merchants will receive duplicates no matter how careful your retry logic is, so design for them from day one.
  3. The handful of metrics that actually predict incidents is far smaller than the dashboard you'll build first.

The notification system is one of those things that nobody thinks about when it works and everyone notices when it doesn't. Getting it right takes deliberate engineering, not just bolting on a retry loop. The outbox pattern, proper backoff, idempotency, and honest monitoring — together, they turned our "webhook hell" into infrastructure we actually trust.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.