Why Payment Gateways Need Circuit Breakers
Most payment systems integrate with multiple external gateways — Stripe for cards, PayPal for wallets, maybe a local acquirer for domestic transactions. Each of these is a remote dependency you don't control. They go down, they get slow, they rate-limit you. Without circuit breakers, a single degraded gateway poisons your entire checkout flow.
The problem is subtle. When a gateway starts timing out instead of failing fast, your HTTP client threads pile up waiting. Your connection pool fills. Requests to healthy gateways start queuing behind the stuck ones. Within seconds, your entire payment service looks down to the outside world — even though only one provider is having issues.
A circuit breaker fixes this by detecting the failure pattern early and short-circuiting requests to the degraded gateway. Instead of waiting 30 seconds for a timeout, you fail in under a millisecond and route to a fallback. The healthy gateways keep processing normally.
The Three States
The circuit breaker pattern borrows its name from electrical engineering. It has three states, and understanding the transitions between them is the whole game.
[State diagram: Closed (requests flow normally) → Open (fail fast) → Half-Open (a trial request is sent to test recovery) → Closed.]
Closed is the normal operating state. Every request goes through to the gateway. The breaker tracks failures — consecutive errors, error rate over a window, whatever metric you choose. As long as failures stay below the threshold, nothing changes.
Open means the breaker has tripped. No requests reach the gateway. Instead, callers get an immediate error (or a fallback response). This protects your system from wasting resources on a gateway that's clearly broken. The breaker stays open for a configurable timeout period.
Half-Open is the recovery probe. After the timeout expires, the breaker lets a small number of trial requests through to test whether the gateway has recovered. If enough of them succeed, the breaker resets to Closed. If any fails, it's back to Open for another timeout cycle.
Key insight for payments: the half-open probe should use a lightweight operation like a health check or a zero-amount authorization — not a real charge. You don't want to test gateway recovery by charging a customer's card and hoping it works.
Implementation in Go
Here's a circuit breaker I've used in production. It's intentionally simple — about 80 lines of actual logic. I've found that rolling your own for payment-critical paths gives you more control than a generic library, especially around what counts as a "failure" (hint: a declined card is not a gateway failure).
import (
	"errors"
	"sync"
	"time"
)

// ErrCircuitOpen is returned when the breaker rejects a call outright.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	mu              sync.Mutex
	state           State
	failureCount    int
	successCount    int
	lastFailureTime time.Time

	// Configuration
	maxFailures int
	timeout     time.Duration
	halfOpenMax int
}

func NewCircuitBreaker(maxFailures int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		state:       StateClosed,
		maxFailures: maxFailures,
		timeout:     timeout,
		halfOpenMax: 3,
	}
}
func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	state := cb.currentState()
	switch state {
	case StateOpen:
		cb.mu.Unlock()
		return ErrCircuitOpen
	case StateHalfOpen:
		// Cap trial traffic while probing recovery.
		if cb.successCount >= cb.halfOpenMax {
			cb.mu.Unlock()
			return ErrCircuitOpen
		}
	}
	cb.mu.Unlock()

	// Execute the actual call outside the lock.
	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.recordFailure()
		return err
	}
	cb.recordSuccess()
	return nil
}
// currentState performs the lazy Open-to-HalfOpen transition.
// Callers must hold cb.mu.
func (cb *CircuitBreaker) currentState() State {
	if cb.state == StateOpen {
		if time.Since(cb.lastFailureTime) > cb.timeout {
			cb.state = StateHalfOpen
			cb.successCount = 0
			return StateHalfOpen
		}
	}
	return cb.state
}

func (cb *CircuitBreaker) recordFailure() {
	cb.failureCount++
	cb.lastFailureTime = time.Now()
	// Any failure in HalfOpen, or too many in Closed, trips the breaker.
	if cb.state == StateHalfOpen || cb.failureCount >= cb.maxFailures {
		cb.state = StateOpen
	}
}

func (cb *CircuitBreaker) recordSuccess() {
	if cb.state == StateHalfOpen {
		cb.successCount++
		if cb.successCount >= cb.halfOpenMax {
			cb.state = StateClosed
			cb.failureCount = 0
		}
		return
	}
	cb.failureCount = 0
}
A few things worth noting. The currentState() method handles the Open-to-HalfOpen transition lazily — it checks the timeout on every call rather than using a timer goroutine. This avoids the complexity of managing timer lifecycle and is perfectly fine when you're already making calls frequently. The halfOpenMax field requires multiple consecutive successes before closing the circuit, which prevents a single lucky request from declaring the gateway healthy.
Wrapping a Payment Gateway
Here's how you'd wire this into an actual gateway client:
type ResilientGateway struct {
	client  PaymentGateway
	breaker *CircuitBreaker
}
func (g *ResilientGateway) Charge(ctx context.Context, req ChargeRequest) (ChargeResponse, error) {
	var resp ChargeResponse
	var bizErr error
	err := g.breaker.Execute(func() error {
		var callErr error
		resp, callErr = g.client.Charge(ctx, req)
		// Only count infrastructure failures, not business errors.
		if callErr != nil && isInfrastructureError(callErr) {
			return callErr
		}
		// Hold on to business errors (declines, validation failures) so
		// the breaker ignores them but the caller still sees them.
		bizErr = callErr
		return nil
	})
	if errors.Is(err, ErrCircuitOpen) {
		return ChargeResponse{}, fmt.Errorf("gateway %s unavailable: circuit open", g.client.Name())
	}
	if err != nil {
		return resp, err
	}
	return resp, bizErr
}
func isInfrastructureError(err error) bool {
	// Timeouts, connection refused, 5xx responses = infrastructure.
	// Declined cards, invalid amounts, auth failures = NOT infrastructure.
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true
	}
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 500
	}
	return false
}
Critical: distinguish infrastructure failures from business errors. A declined card (HTTP 402) means the gateway is working fine — it processed your request and said no. If you count declines as failures, a batch of stolen cards will trip your circuit breaker and block legitimate transactions. Only count timeouts, connection errors, and 5xx responses.
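The HTTPError type the classifier matches on isn't shown above; here is one minimal sketch of what it could look like, with a trimmed version of the classifier (the net.Error check is omitted) to show the 402-versus-503 distinction in action. Field and type names are assumptions, not a prescribed API.

```go
package main

import (
	"errors"
	"fmt"
)

// HTTPError is a sketch of the error type the classifier matches on; your
// HTTP client wrapper would populate it from non-2xx gateway responses.
type HTTPError struct {
	StatusCode int
	Body       string
}

func (e *HTTPError) Error() string {
	return fmt.Sprintf("gateway returned HTTP %d: %s", e.StatusCode, e.Body)
}

// isInfrastructureError here is trimmed to the HTTP case only.
func isInfrastructureError(err error) bool {
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 500 // 5xx = gateway broken
	}
	// Declines (402) and other 4xx are business outcomes, not failures.
	return false
}

func main() {
	decline := &HTTPError{StatusCode: 402, Body: "card_declined"}
	outage := &HTTPError{StatusCode: 503, Body: "service unavailable"}
	fmt.Println(isInfrastructureError(decline)) // false: breaker ignores it
	fmt.Println(isInfrastructureError(outage))  // true: breaker counts it
}
```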
Configuration Tuning
The default values you pick for your circuit breaker will either save you during an outage or cause false trips during normal traffic spikes. I've tuned these numbers across three different payment platforms, and here's where I've landed.
Failure threshold: I use 5 consecutive failures for low-traffic gateways and a 50% error rate over a 10-second sliding window for high-traffic ones. Consecutive counts are simpler but fragile — a brief network blip that causes exactly 5 timeouts will trip the breaker even if the next 1,000 requests would succeed. Percentage-based thresholds with a minimum sample size (say, at least 20 requests in the window) are more robust.
Open timeout: 30 seconds is my starting point. Too short and you hammer a recovering gateway with probe requests. Too long and you're routing around a gateway that recovered 25 seconds ago. For payment gateways specifically, 30–60 seconds works well because most gateway incidents either resolve quickly (transient network issue) or last long enough that 30 seconds doesn't matter.
Half-open success threshold: I require 3 consecutive successes before closing the circuit. One success isn't enough — I've seen gateways that respond to the first request after a timeout but fail again immediately under load. Three successes gives you reasonable confidence without being overly cautious.
On library choice: for payment gateways, I lean toward hand-rolled or gobreaker. Breaking at the proxy layer via a service mesh sounds appealing, but a proxy can't distinguish a declined card from a gateway outage, and that's a dealbreaker. You need application-level awareness of what constitutes a real failure.
Fallback Strategies
When the circuit opens, you have a few options. The right one depends on your business context.
Gateway failover is the most common pattern. If your Stripe circuit opens, route to Adyen. This requires maintaining multiple gateway integrations, which is work, but it's the only strategy that keeps revenue flowing during an outage. We keep a priority list of gateways per payment method and currency, and the circuit breaker state determines which one gets traffic.
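A sketch of the failover walk: try gateways in priority order and use the first one whose breaker admits traffic. The gateway struct is a hypothetical stand-in where a healthy flag represents "circuit closed"; real code would check breaker state and call Charge on the selected client.

```go
package main

import (
	"errors"
	"fmt"
)

// gateway is a hypothetical stand-in; healthy represents "circuit closed".
type gateway struct {
	name    string
	healthy bool
}

var errAllGatewaysDown = errors.New("all gateways unavailable")

// chargeWithFailover walks a priority-ordered gateway list and selects the
// first one whose breaker admits traffic, returning its name.
func chargeWithFailover(priority []gateway) (string, error) {
	for _, gw := range priority {
		if gw.healthy {
			return gw.name, nil // real code would call gw.Charge(...) here
		}
	}
	return "", errAllGatewaysDown
}

func main() {
	gateways := []gateway{
		{name: "stripe", healthy: false}, // circuit open
		{name: "adyen", healthy: true},
	}
	used, err := chargeWithFailover(gateways)
	fmt.Println(used, err) // adyen <nil>
}
```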
Queued retry works for non-real-time flows. If a payout gateway is down, push the payout into a persistent queue and process it when the circuit closes. The customer sees "payout pending" instead of "payout failed." This only works when the operation isn't time-sensitive.
Graceful degradation means accepting the order without charging immediately. This is risky — you're extending credit to the customer — but for high-value merchants with low fraud rates, it can be worth it. Capture the payment details, confirm the order, and charge when the gateway recovers. You need solid reconciliation to make this work.
Fallback tip: always log the reason a fallback was triggered, including which circuit opened and the failure count at the time. Without this, debugging why transactions routed to a secondary gateway becomes a guessing game during post-incident review.
Monitoring Your Breakers
A circuit breaker that trips silently is almost worse than not having one. You need to know the moment a circuit opens, how long it stays open, and how many requests were affected.
At minimum, emit these metrics:
- circuit_breaker_state — gauge per gateway (0=closed, 1=open, 2=half-open). Alert when any breaker enters Open.
- circuit_breaker_trip_total — counter of how many times each breaker has tripped. A breaker that trips 10 times a day is telling you something about that gateway's reliability.
- circuit_breaker_rejected_total — counter of requests rejected by an open circuit. This is your "revenue at risk" metric.
- circuit_breaker_fallback_total — counter of successful fallback executions. If this is zero when the circuit is open, your fallback isn't working.
// recordFailure, extended with metric emission. This assumes the struct
// has grown name and lastError fields, alongside Prometheus-style
// circuitStateGauge and circuitTripCounter collectors.
func (cb *CircuitBreaker) recordFailure() {
	cb.failureCount++
	cb.lastFailureTime = time.Now()
	if cb.state == StateHalfOpen || cb.failureCount >= cb.maxFailures {
		cb.state = StateOpen
		// Emit metrics on state transition
		circuitStateGauge.WithLabelValues(cb.name).Set(1)
		circuitTripCounter.WithLabelValues(cb.name).Inc()
		log.Warn("circuit breaker opened",
			"gateway", cb.name,
			"failures", cb.failureCount,
			"last_error", cb.lastError,
		)
	}
}
I also recommend setting up a dashboard that shows circuit breaker state alongside gateway latency percentiles and error rates. When you see p99 latency spike on a gateway, you should see the breaker trip shortly after. If it doesn't, your thresholds are too lenient.
Lessons from Production
A few things I've learned the hard way that aren't in the textbooks:
- Test your circuit breaker with chaos engineering. Inject gateway failures in staging and verify the breaker trips, the fallback activates, and the breaker recovers. We run this monthly. The first time we did it, we discovered our fallback gateway credentials had expired three months earlier.
- Don't share a circuit breaker across unrelated operations. A single breaker for "Stripe" that covers both charges and refunds means a refund API outage blocks charges too. Use separate breakers per operation type.
- Watch out for the thundering herd. When a circuit closes after an outage, all queued requests hit the gateway at once. Add a short ramp-up period in your half-open state — let through 1 request, then 5, then 20, before fully closing.
- Circuit breakers don't replace retries — they complement them. Retry transient errors within the closed state. The circuit breaker catches the pattern when retries consistently fail.
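The ramp-up idea from the thundering-herd point above can be sketched as a permit schedule: admit more concurrent trial requests as half-open successes accumulate. The 1 → 5 → 20 steps follow the suggestion in the text; the function and thresholds are illustrative, and you'd tune them to your traffic.

```go
package main

import "fmt"

// rampPermits returns how many concurrent trial requests to admit in the
// half-open state, given how many probes have already succeeded.
func rampPermits(successes int) int {
	switch {
	case successes == 0:
		return 1 // single probe first
	case successes < 5:
		return 5 // small trickle once the probe passes
	default:
		return 20 // near-full traffic before closing completely
	}
}

func main() {
	for _, s := range []int{0, 1, 4, 5, 10} {
		fmt.Printf("successes=%d permits=%d\n", s, rampPermits(s))
	}
}
```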
References
- Go Standard Library — sync Package Documentation
- Go Standard Library — context Package Documentation
- Go Standard Library — net Package Documentation
- sony/gobreaker — Circuit Breaker Library for Go
- Microsoft Azure — Circuit Breaker Pattern (Cloud Design Patterns)
- Martin Fowler — Circuit Breaker
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always review and adapt for your specific use case and security requirements. This is not financial or legal advice.