April 10, 2026 · 10 min read

Chaos Engineering for Payment Systems — Breaking Things on Purpose So Production Doesn't Break You

Our first chaos experiment was embarrassingly simple: we killed a single payment service pod during off-peak hours. The service was supposed to self-heal in under 10 seconds. It took 4 minutes. That gap between what we assumed and what actually happened is exactly why chaos engineering exists.

Why Payment Systems Need This More Than Most

Every system has bugs that only surface under failure conditions. The difference with payment systems is the cost of discovering them in production. A bug in a social media feed means someone sees a stale post. A bug in a payment pipeline means someone gets charged twice, a settlement file goes missing, or a merchant doesn't get paid for three days.

We found 3 critical bugs in our first quarterly GameDay that had been lurking in production for months. One of them: our Redis failover logic had a race condition that caused a 45-second window where every cache miss triggered a direct database query. Under normal load, this was invisible. Under the thundering herd of a cache flush, it would have taken down our primary database.

The cost of finding that bug in a controlled experiment: zero. The cost of finding it during Black Friday: I don't want to think about it.

Chaos Experiment Lifecycle

  1. Define Steady State
  2. Hypothesize
  3. Inject Fault
  4. Observe
  5. Analyze & Fix

Five Experiments Every Payment Team Should Run

1. Payment Provider Timeout

What happens when Stripe takes 30 seconds to respond instead of the usual 200ms? Most teams assume their timeout is set correctly. We found ours was set to 60 seconds — the Go HTTP client default. That meant a single slow Stripe response held a goroutine and a database connection for a full minute. Multiply by 100 concurrent requests and you've exhausted your connection pool.

The fix was obvious once we saw it: set explicit timeouts of 5 seconds for authorization, 10 seconds for capture, and wire up our circuit breaker to trip after 3 consecutive timeouts.

2. Database Failover

Kill your primary database and see if your app reconnects to the replica. Sounds basic, right? Our app reconnected fine — but the connection pool held stale connections for 30 seconds before recycling them. During those 30 seconds, every query failed with "connection reset by peer." We added a cheap liveness query (SELECT 1) before reusing pooled connections and shortened the pool's ConnMaxIdleTime so stale connections get recycled sooner, and the failover became seamless.
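A sketch of those pool settings using the standard database/sql pool. The specific numbers are illustrative, not our production values; tune them against your own failover timing.

```go
package main

import (
	"database/sql"
	"time"
)

// configurePool applies the settings discussed above.
func configurePool(db *sql.DB) {
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(10)
	// Recycle idle connections quickly so stale ones pointing at a
	// dead primary don't linger for a 30-second window.
	db.SetConnMaxIdleTime(10 * time.Second)
	db.SetConnMaxLifetime(5 * time.Minute)
}

// healthy runs the cheap liveness query before handing a connection
// to request code; on failure the caller retries, which forces the
// pool to dial a fresh connection to the promoted replica.
func healthy(db *sql.DB) error {
	var one int
	return db.QueryRow("SELECT 1").Scan(&one)
}

func main() {}
```

ConnMaxLifetime also matters here: without it, long-lived connections can survive a failover indefinitely even when idle recycling is aggressive.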

3. Redis Cache Eviction

Flush your entire Redis cache and watch what happens. If every cache miss hits your database simultaneously, you've got a thundering herd problem. We added a probabilistic early expiration (jittered TTLs) and a single-flight pattern using singleflight.Group so only one goroutine fetches a missing key while others wait.

4. Certificate Expiry Simulation

mTLS connections between services break silently when certificates expire. We rotate certs every 90 days, but our chaos experiment revealed that one internal service had a hardcoded cert path that wasn't part of the rotation. It would have failed silently in 47 days.

5. Network Partition Between Services

Use tc (traffic control) to add 500ms latency between your payment service and your ledger service. We discovered that our synchronous ledger write had no timeout, so a slow ledger caused payment confirmations to hang indefinitely. We made the ledger write async with a reconciliation job to catch any missed entries.

Safe practice: Always run chaos experiments during off-peak hours first. Start with staging. When you move to production, limit blast radius to a single availability zone or a percentage of traffic. Never run your first experiment on a Friday afternoon.

Blast Radius Control

The scariest part of chaos engineering in payment systems is the "what if we break something for real" factor. We limit the damage the same way every time: staging first, off-peak windows, faults scoped to a single availability zone or a small percentage of traffic, and a feature flag that switches the fault injection off instantly.

A Chaos Middleware in Go

Here's a simplified version of the chaos middleware we use in staging. It randomly injects latency or errors for a configurable percentage of requests:

import (
    "encoding/json"
    "math/rand"
    "net/http"
    "time"
)

type ChaosConfig struct {
    Enabled     bool
    LatencyMs   int     // injected latency in milliseconds
    ErrorRate   float64 // 0.0 to 1.0
    AffectedPct float64 // fraction of requests affected, 0.0 to 1.0
}

func ChaosMiddleware(cfg *ChaosConfig) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !cfg.Enabled {
                next.ServeHTTP(w, r)
                return
            }

            // Only affect a percentage of requests
            if rand.Float64() > cfg.AffectedPct {
                next.ServeHTTP(w, r)
                return
            }

            // Inject latency
            if cfg.LatencyMs > 0 {
                // +1 keeps rand.Intn from panicking when LatencyMs < 2
                jitter := rand.Intn(cfg.LatencyMs/2 + 1)
                time.Sleep(time.Duration(cfg.LatencyMs+jitter) * time.Millisecond)
            }

            // Inject errors
            if rand.Float64() < cfg.ErrorRate {
                // headers must be set before WriteHeader, or they're dropped
                w.Header().Set("Content-Type", "application/json")
                w.WriteHeader(http.StatusServiceUnavailable)
                json.NewEncoder(w).Encode(map[string]string{
                    "error":   "chaos_injection",
                    "message": "simulated service unavailable",
                })
                return
            }

            next.ServeHTTP(w, r)
        })
    }
}

Warning: Never deploy chaos middleware to production with Enabled: true by default. Use a feature flag service (LaunchDarkly, Unleash, or even a simple config endpoint) to toggle it. One accidental deploy with chaos enabled cost a team I know about $8K in failed transactions before someone noticed.

GameDay: The Quarterly Exercise

Every quarter, we run a full GameDay. The whole payment engineering team participates — backend, frontend, SRE, and one person from the finance team (they care about this stuff more than you'd think).

The format:

  1. Morning: plan the scenarios. We pick 3-4 failure scenarios based on recent incidents, new infrastructure changes, or areas we haven't tested. Each scenario has a written hypothesis: "If X fails, we expect Y to happen within Z seconds."
  2. Afternoon: execute. One person runs the experiments. Everyone else monitors dashboards, logs, and alerts. We record everything — screen recordings of Grafana, Slack timestamps, who noticed what and when.
  3. End of day: retro. For each experiment, did reality match our hypothesis? If not, why? What do we need to fix? Each finding becomes a ticket with a severity and an owner.

Our last GameDay found that our alerting for payment provider degradation had a 7-minute delay because the alert was based on a 5-minute rolling average with a 2-minute evaluation interval. By the time the alert fired, the circuit breaker had already tripped and recovered. The alert was useless. We switched to a 1-minute window with instant evaluation.

Tool                 Best For                               Complexity   Cost
LitmusChaos          Kubernetes-native experiments          Medium       Free (OSS)
Gremlin              Enterprise teams, compliance-friendly  Low          $$$ (SaaS)
Chaos Monkey         Random instance termination            Low          Free (OSS)
DIY (tc + iptables)  Network-level faults, full control     High         Free

We use a mix: LitmusChaos for Kubernetes pod failures, custom Go middleware for application-level faults, and plain tc commands for network simulation. Gremlin is great if you have the budget and need audit trails for compliance.

Getting Started

You don't need a fancy tool to start. Pick one thing that scares you about your payment infrastructure — "what if Redis goes down?" — and test it in staging. Write down what you expect to happen. Then make it happen and see if you were right. That gap between expectation and reality is where the value lives.

Start small, run it during off-peak, have a rollback plan, and make sure someone is watching the dashboards. The first experiment is always the hardest. After that, it becomes part of how your team thinks about reliability.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Always get proper authorization before running chaos experiments, especially in production environments handling financial transactions.