Why Payment Systems Need This More Than Most
Every system has bugs that only surface under failure conditions. The difference with payment systems is the cost of discovering them in production. A bug in a social media feed means someone sees a stale post. A bug in a payment pipeline means someone gets charged twice, a settlement file goes missing, or a merchant doesn't get paid for three days.
Our first quarterly GameDay surfaced three critical bugs that had been lurking in production for months. One of them: our Redis failover logic had a race condition that caused a 45-second window where every cache miss triggered a direct database query. Under normal load, this was invisible. Under the thundering herd of a cache flush, it would have taken down our primary database.
The cost of finding that bug in a controlled experiment: zero. The cost of finding it during Black Friday: I don't want to think about it.
Five Experiments Every Payment Team Should Run
1. Payment Provider Timeout
What happens when Stripe takes 30 seconds to respond instead of the usual 200ms? Most teams assume their timeout is set correctly. We found ours was set to 60 seconds — and out of the box Go is even worse, because http.Client applies no timeout at all unless you set one. That meant a single slow Stripe response held a goroutine and a database connection for a full minute. Multiply by 100 concurrent requests and you've exhausted your connection pool.
The fix was obvious once we saw it: set explicit timeouts of 5 seconds for authorization, 10 seconds for capture, and wire up our circuit breaker to trip after 3 consecutive timeouts.
2. Database Failover
Kill your primary database and see if your app reconnects to the replica. Sounds basic, right? Our app reconnected fine — but the connection pool held stale connections for 30 seconds before recycling them. During those 30 seconds, every query failed with "connection reset by peer." We shortened the pool's ConnMaxIdleTime so stale connections get recycled quickly and added a lightweight health check query (SELECT 1) before reusing a connection, and the failover became seamless.
3. Redis Cache Eviction
Flush your entire Redis cache and watch what happens. If every cache miss hits your database simultaneously, you've got a thundering herd problem. We added a probabilistic early expiration (jittered TTLs) and a single-flight pattern using singleflight.Group so only one goroutine fetches a missing key while others wait.
4. Certificate Expiry Simulation
mTLS connections between services break silently when certificates expire. We rotate certs every 90 days, but our chaos experiment revealed that one internal service had a hardcoded cert path that wasn't part of the rotation. It would have failed silently in 47 days.
5. Network Partition Between Services
Use tc (traffic control) to add 500ms latency between your payment service and your ledger service. We discovered that our synchronous ledger write had no timeout, so a slow ledger caused payment confirmations to hang indefinitely. We made the ledger write async with a reconciliation job to catch any missed entries.
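Roughly what the async version looks like — `asyncLedger`, the buffer size, and the 100ms enqueue timeout are illustrative, and the reconciliation job itself is out of scope here:

```go
package main

import (
	"fmt"
	"time"
)

// LedgerEntry is a placeholder for whatever your ledger records.
type LedgerEntry struct {
	PaymentID string
	Amount    int64 // minor units
}

// asyncLedger decouples payment confirmation from the ledger write:
// the confirmation path only enqueues, a background worker drains the
// queue, and a separate reconciliation job (not shown) backfills
// anything that was dropped.
type asyncLedger struct {
	queue chan LedgerEntry
}

func newAsyncLedger(buffer int, write func(LedgerEntry) error) *asyncLedger {
	l := &asyncLedger{queue: make(chan LedgerEntry, buffer)}
	go func() {
		for e := range l.queue {
			if err := write(e); err != nil {
				// In production this would land in a dead-letter
				// store for the reconciliation job to replay.
				fmt.Println("ledger write failed:", err)
			}
		}
	}()
	return l
}

// Record never blocks the payment path for more than 100ms.
func (l *asyncLedger) Record(e LedgerEntry) bool {
	select {
	case l.queue <- e:
		return true
	case <-time.After(100 * time.Millisecond):
		return false // queue full; reconciliation will catch it
	}
}

func main() {
	ledger := newAsyncLedger(100, func(e LedgerEntry) error {
		fmt.Println("ledger write:", e.PaymentID, e.Amount)
		return nil
	})
	ledger.Record(LedgerEntry{PaymentID: "pay_123", Amount: 4200})
	time.Sleep(50 * time.Millisecond) // demo only: let the worker drain
}
```

The bounded buffer plus enqueue timeout is the key trade-off: a slow ledger can no longer hang confirmations, and the reconciliation job owns the durability guarantee instead.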
Safe practice: Always run chaos experiments during off-peak hours first. Start with staging. When you move to production, limit blast radius to a single availability zone or a percentage of traffic. Never run your first experiment on a Friday afternoon.
Blast Radius Control
The scariest part of chaos engineering in payment systems is the "what if we break something for real" factor. Here's how we limit the damage:
- Feature flags — our chaos middleware is behind a flag that targets specific merchant IDs or a percentage of traffic. We start at 0.1% and ramp up.
- Kill switch — a single API call disables all active experiments immediately. Response time: under 500ms to propagate.
- Monitoring gates — experiments auto-abort if error rate exceeds 5% or p99 latency exceeds 3x baseline.
- Time-boxed — every experiment has a maximum duration. If nobody explicitly extends it, it stops.
A Chaos Middleware in Go
Here's a simplified version of the chaos middleware we use in staging. It randomly injects latency or errors for a configurable percentage of requests:
```go
import (
	"encoding/json"
	"math/rand"
	"net/http"
	"time"
)

type ChaosConfig struct {
	Enabled     bool
	LatencyMs   int     // injected latency in milliseconds
	ErrorRate   float64 // probability of injecting an error, 0.0 to 1.0
	AffectedPct float64 // fraction of requests affected, 0.0 to 1.0
}

func ChaosMiddleware(cfg *ChaosConfig) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if !cfg.Enabled {
				next.ServeHTTP(w, r)
				return
			}
			// Only affect the configured fraction of requests
			if rand.Float64() > cfg.AffectedPct {
				next.ServeHTTP(w, r)
				return
			}
			// Inject latency with up to +50% jitter; the guard avoids
			// rand.Intn(0), which panics when LatencyMs < 2
			if cfg.LatencyMs > 0 {
				jitter := 0
				if cfg.LatencyMs >= 2 {
					jitter = rand.Intn(cfg.LatencyMs / 2)
				}
				time.Sleep(time.Duration(cfg.LatencyMs+jitter) * time.Millisecond)
			}
			// Inject errors
			if rand.Float64() < cfg.ErrorRate {
				w.Header().Set("Content-Type", "application/json")
				w.WriteHeader(http.StatusServiceUnavailable)
				json.NewEncoder(w).Encode(map[string]string{
					"error":   "chaos_injection",
					"message": "simulated service unavailable",
				})
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```
Warning: Never deploy chaos middleware to production with Enabled: true by default. Use a feature flag service (LaunchDarkly, Unleash, or even a simple config endpoint) to toggle it. One accidental deploy with chaos enabled cost a team I know about $8K in failed transactions before someone noticed.
GameDay: The Quarterly Exercise
Every quarter, we run a full GameDay. The whole payment engineering team participates — backend, frontend, SRE, and one person from the finance team (they care about this stuff more than you'd think).
The format:
- Morning: plan the scenarios. We pick 3-4 failure scenarios based on recent incidents, new infrastructure changes, or areas we haven't tested. Each scenario has a written hypothesis: "If X fails, we expect Y to happen within Z seconds."
- Afternoon: execute. One person runs the experiments. Everyone else monitors dashboards, logs, and alerts. We record everything — screen recordings of Grafana, Slack timestamps, who noticed what and when.
- End of day: retro. For each experiment, did reality match our hypothesis? If not, why? What do we need to fix? Each finding becomes a ticket with a severity and an owner.
Our last GameDay found that our alerting for payment provider degradation had a 7-minute delay because the alert was based on a 5-minute rolling average with a 2-minute evaluation interval. By the time the alert fired, the circuit breaker had already tripped and recovered. The alert was useless. We switched to a 1-minute window with instant evaluation.
We use a mix: LitmusChaos for Kubernetes pod failures, custom Go middleware for application-level faults, and plain tc commands for network simulation. Gremlin is great if you have the budget and need audit trails for compliance.
Getting Started
You don't need a fancy tool to start. Pick one thing that scares you about your payment infrastructure — "what if Redis goes down?" — and test it in staging. Write down what you expect to happen. Then make it happen and see if you were right. That gap between expectation and reality is where the value lives.
Start small, run it during off-peak, have a rollback plan, and make sure someone is watching the dashboards. The first experiment is always the hardest. After that, it becomes part of how your team thinks about reliability.
References
- Principles of Chaos Engineering
- Netflix Tech Blog — Chaos Engineering
- Gremlin Documentation
- LitmusChaos Documentation
- Chaos Engineering — Casey Rosenthal & Nora Jones (O'Reilly)
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Always get proper authorization before running chaos experiments, especially in production environments handling financial transactions.