Why Payment Gateways Need Circuit Breakers
Most payment systems integrate with multiple external gateways — Stripe for cards, PayPal for wallets, maybe a local acquirer for domestic transactions. Each of these is a remote dependency you don't control. They go down, they get slow, they rate-limit you. Without circuit breakers, a single degraded gateway poisons your entire checkout flow.
The problem is subtle. When a gateway starts timing out instead of failing fast, your HTTP client threads pile up waiting. Your connection pool fills. Requests to healthy gateways start queuing behind the stuck ones. Within seconds, your entire payment service looks down to the outside world — even though only one provider is having issues.
A circuit breaker fixes this by detecting the failure pattern early and short-circuiting requests to the degraded gateway. Instead of waiting 30 seconds for a timeout, you fail in under a millisecond and route to a fallback. The healthy gateways keep processing normally.
The Three States
The circuit breaker pattern borrows its name from electrical engineering. It has three states, and understanding the transitions between them is the whole game.
[State diagram: Closed (requests flow normally) → Open (fail fast) → Half-Open (a trial request is sent to test recovery) → Closed.]
Closed is the normal operating state. Every request goes through to the gateway. The breaker tracks failures — consecutive errors, error rate over a window, whatever metric you choose. As long as failures stay below the threshold, nothing changes.
Open means the breaker has tripped. No requests reach the gateway. Instead, callers get an immediate error (or a fallback response). This protects your system from wasting resources on a gateway that's clearly broken. The breaker stays open for a configurable timeout period.
Half-Open is the recovery probe. After the timeout expires, the breaker lets a small number of trial requests through to test whether the gateway has recovered. If enough of them succeed, the breaker resets to Closed. If any fails, it's back to Open for another timeout cycle.
Key insight for payments: the half-open probe should use a lightweight operation like a health check or a zero-amount authorization — not a real charge. You don't want to test gateway recovery by charging a customer's card and hoping it works.
Implementation in Go
Here's a circuit breaker I've used in production. It's intentionally simple — about 80 lines of actual logic. I've found that rolling your own for payment-critical paths gives you more control than a generic library, especially around what counts as a "failure" (hint: a declined card is not a gateway failure).
import (
	"errors"
	"sync"
	"time"
)

// ErrCircuitOpen is returned when the breaker rejects a call outright.
var ErrCircuitOpen = errors.New("circuit breaker is open")

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	mu              sync.Mutex
	state           State
	failureCount    int
	successCount    int
	lastFailureTime time.Time

	// Configuration
	maxFailures int
	timeout     time.Duration
	halfOpenMax int
}

func NewCircuitBreaker(maxFailures int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		state:       StateClosed,
		maxFailures: maxFailures,
		timeout:     timeout,
		halfOpenMax: 3,
	}
}
func (cb *CircuitBreaker) Execute(fn func() error) error {
	cb.mu.Lock()
	state := cb.currentState()
	switch state {
	case StateOpen:
		cb.mu.Unlock()
		return ErrCircuitOpen
	case StateHalfOpen:
		// Cap trial traffic while probing recovery.
		if cb.successCount >= cb.halfOpenMax {
			cb.mu.Unlock()
			return ErrCircuitOpen
		}
	}
	cb.mu.Unlock()

	// Execute the actual call outside the lock.
	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.recordFailure()
		return err
	}
	cb.recordSuccess()
	return nil
}
// currentState performs the lazy Open-to-HalfOpen transition.
// Callers must hold cb.mu.
func (cb *CircuitBreaker) currentState() State {
	if cb.state == StateOpen {
		if time.Since(cb.lastFailureTime) > cb.timeout {
			cb.state = StateHalfOpen
			cb.successCount = 0
			return StateHalfOpen
		}
	}
	return cb.state
}

func (cb *CircuitBreaker) recordFailure() {
	cb.failureCount++
	cb.lastFailureTime = time.Now()
	// Any failure in HalfOpen, or too many in Closed, trips the breaker.
	if cb.state == StateHalfOpen || cb.failureCount >= cb.maxFailures {
		cb.state = StateOpen
	}
}

func (cb *CircuitBreaker) recordSuccess() {
	if cb.state == StateHalfOpen {
		cb.successCount++
		if cb.successCount >= cb.halfOpenMax {
			cb.state = StateClosed
			cb.failureCount = 0
		}
		return
	}
	cb.failureCount = 0
}
A few things worth noting. The currentState() method handles the Open-to-HalfOpen transition lazily — it checks the timeout on every call rather than using a timer goroutine. This avoids the complexity of managing timer lifecycle and is perfectly fine when you're already making calls frequently. The halfOpenMax field requires multiple consecutive successes before closing the circuit, which prevents a single lucky request from declaring the gateway healthy.
Wrapping a Payment Gateway
Here's how you'd wire this into an actual gateway client:
type ResilientGateway struct {
	client  PaymentGateway
	breaker *CircuitBreaker
}
func (g *ResilientGateway) Charge(ctx context.Context, req ChargeRequest) (ChargeResponse, error) {
	var resp ChargeResponse
	var bizErr error
	err := g.breaker.Execute(func() error {
		var callErr error
		resp, callErr = g.client.Charge(ctx, req)
		// Only count infrastructure failures, not business errors.
		if callErr != nil && isInfrastructureError(callErr) {
			return callErr
		}
		// Hold on to business errors (declines, validation failures) so
		// the breaker ignores them but the caller still sees them.
		bizErr = callErr
		return nil
	})
	if errors.Is(err, ErrCircuitOpen) {
		return ChargeResponse{}, fmt.Errorf("gateway %s unavailable: circuit open", g.client.Name())
	}
	if err != nil {
		return resp, err
	}
	return resp, bizErr
}
func isInfrastructureError(err error) bool {
	// Timeouts, connection refused, 5xx responses = infrastructure.
	// Declined cards, invalid amounts, auth failures = NOT infrastructure.
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true
	}
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 500
	}
	return false
}
Critical: distinguish infrastructure failures from business errors. A declined card (HTTP 402) means the gateway is working fine — it processed your request and said no. If you count declines as failures, a batch of stolen cards will trip your circuit breaker and block legitimate transactions. Only count timeouts, connection errors, and 5xx responses.
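The HTTPError type the classifier matches on isn't shown above; here is one minimal sketch of what it could look like, with a trimmed version of the classifier (the net.Error check is omitted) to show the 402-versus-503 distinction in action. Field and type names are assumptions, not a prescribed API.

```go
package main

import (
	"errors"
	"fmt"
)

// HTTPError is a sketch of the error type the classifier matches on; your
// HTTP client wrapper would populate it from non-2xx gateway responses.
type HTTPError struct {
	StatusCode int
	Body       string
}

func (e *HTTPError) Error() string {
	return fmt.Sprintf("gateway returned HTTP %d: %s", e.StatusCode, e.Body)
}

// isInfrastructureError here is trimmed to the HTTP case only.
func isInfrastructureError(err error) bool {
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 500 // 5xx = gateway broken
	}
	// Declines (402) and other 4xx are business outcomes, not failures.
	return false
}

func main() {
	decline := &HTTPError{StatusCode: 402, Body: "card_declined"}
	outage := &HTTPError{StatusCode: 503, Body: "service unavailable"}
	fmt.Println(isInfrastructureError(decline)) // false: breaker ignores it
	fmt.Println(isInfrastructureError(outage))  // true: breaker counts it
}
```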
Configuration Tuning
The default values you pick for your circuit breaker will either save you during an outage or cause false trips during normal traffic spikes. I've tuned these numbers across three different payment platforms, and here's where I've landed.
Failure threshold: I use 5 consecutive failures for low-traffic gateways and a 50% error rate over a 10-second sliding window for high-traffic ones. Consecutive counts are simpler but fragile — a brief network blip that causes exactly 5 timeouts will trip the breaker even if the next 1,000 requests would succeed. Percentage-based thresholds with a minimum sample size (say, at least 20 requests in the window) are more robust.
Open timeout: 30 seconds is my starting point. Too short and you hammer a recovering gateway with probe requests. Too long and you're routing around a gateway that recovered 25 seconds ago. For payment gateways specifically, 30–60 seconds works well because most gateway incidents either resolve quickly (transient network issue) or last long enough that 30 seconds doesn't matter.
Half-open success threshold: I require 3 consecutive successes before closing the circuit. One success isn't enough — I've seen gateways that respond to the first request after a timeout but fail again immediately under load. Three successes gives you reasonable confidence without being overly cautious.
On library choice: for payment gateways, I lean toward hand-rolled or gobreaker. Breaking at the proxy layer via a service mesh sounds appealing, but a proxy can't distinguish a declined card from a gateway outage, and that's a dealbreaker. You need application-level awareness of what constitutes a real failure.
Fallback Strategies
When the circuit opens, you have a few options. The right one depends on your business context.
Gateway failover is the most common pattern. If your Stripe circuit opens, route to Adyen. This requires maintaining multiple gateway integrations, which is work, but it's the only strategy that keeps revenue flowing during an outage. We keep a priority list of gateways per payment method and currency, and the circuit breaker state determines which one gets traffic.
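A sketch of the failover walk: try gateways in priority order and use the first one whose breaker admits traffic. The gateway struct is a hypothetical stand-in where a healthy flag represents "circuit closed"; real code would check breaker state and call Charge on the selected client.

```go
package main

import (
	"errors"
	"fmt"
)

// gateway is a hypothetical stand-in; healthy represents "circuit closed".
type gateway struct {
	name    string
	healthy bool
}

var errAllGatewaysDown = errors.New("all gateways unavailable")

// chargeWithFailover walks a priority-ordered gateway list and selects the
// first one whose breaker admits traffic, returning its name.
func chargeWithFailover(priority []gateway) (string, error) {
	for _, gw := range priority {
		if gw.healthy {
			return gw.name, nil // real code would call gw.Charge(...) here
		}
	}
	return "", errAllGatewaysDown
}

func main() {
	gateways := []gateway{
		{name: "stripe", healthy: false}, // circuit open
		{name: "adyen", healthy: true},
	}
	used, err := chargeWithFailover(gateways)
	fmt.Println(used, err) // adyen <nil>
}
```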
Queued retry works for non-real-time flows. If a payout gateway is down, push the payout into a persistent queue and process it when the circuit closes. The customer sees "payout pending" instead of "payout failed." This only works when the operation isn't time-sensitive.
Graceful degradation means accepting the order without charging immediately. This is risky — you're extending credit to the customer — but for high-value merchants with low fraud rates, it can be worth it. Capture the payment details, confirm the order, and charge when the gateway recovers. You need solid reconciliation to make this work.
Fallback tip: always log the reason a fallback was triggered, including which circuit opened and the failure count at the time. Without this, debugging why transactions routed to a secondary gateway becomes a guessing game during post-incident review.
Monitoring Your Breakers
A circuit breaker that trips silently is almost worse than not having one. You need to know the moment a circuit opens, how long it stays open, and how many requests were affected.
At minimum, emit these metrics:
- circuit_breaker_state — gauge per gateway (0=closed, 1=open, 2=half-open). Alert when any breaker enters Open.
- circuit_breaker_trip_total — counter of how many times each breaker has tripped. A breaker that trips 10 times a day is telling you something about that gateway's reliability.
- circuit_breaker_rejected_total — counter of requests rejected by an open circuit. This is your "revenue at risk" metric.
- circuit_breaker_fallback_total — counter of successful fallback executions. If this is zero when the circuit is open, your fallback isn't working.
// recordFailure, extended with metric emission. This assumes the struct
// has grown name and lastError fields, alongside Prometheus-style
// circuitStateGauge and circuitTripCounter collectors.
func (cb *CircuitBreaker) recordFailure() {
	cb.failureCount++
	cb.lastFailureTime = time.Now()
	if cb.state == StateHalfOpen || cb.failureCount >= cb.maxFailures {
		cb.state = StateOpen
		// Emit metrics on state transition
		circuitStateGauge.WithLabelValues(cb.name).Set(1)
		circuitTripCounter.WithLabelValues(cb.name).Inc()
		log.Warn("circuit breaker opened",
			"gateway", cb.name,
			"failures", cb.failureCount,
			"last_error", cb.lastError,
		)
	}
}
I also recommend setting up a dashboard that shows circuit breaker state alongside gateway latency percentiles and error rates. When you see p99 latency spike on a gateway, you should see the breaker trip shortly after. If it doesn't, your thresholds are too lenient.
Lessons from Production
A few things I've learned the hard way that aren't in the textbooks:
- Test your circuit breaker with chaos engineering. Inject gateway failures in staging and verify the breaker trips, the fallback activates, and the breaker recovers. We run this monthly. The first time we did it, we discovered our fallback gateway credentials had expired three months earlier.
- Don't share a circuit breaker across unrelated operations. A single breaker for "Stripe" that covers both charges and refunds means a refund API outage blocks charges too. Use separate breakers per operation type.
- Watch out for the thundering herd. When a circuit closes after an outage, all queued requests hit the gateway at once. Add a short ramp-up period in your half-open state — let through 1 request, then 5, then 20, before fully closing.
- Circuit breakers don't replace retries — they complement them. Retry transient errors within the closed state. The circuit breaker catches the pattern when retries consistently fail.
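The ramp-up idea from the thundering-herd point above can be sketched as a permit schedule: admit more concurrent trial requests as half-open successes accumulate. The 1 → 5 → 20 steps follow the suggestion in the text; the function and thresholds are illustrative, and you'd tune them to your traffic.

```go
package main

import "fmt"

// rampPermits returns how many concurrent trial requests to admit in the
// half-open state, given how many probes have already succeeded.
func rampPermits(successes int) int {
	switch {
	case successes == 0:
		return 1 // single probe first
	case successes < 5:
		return 5 // small trickle once the probe passes
	default:
		return 20 // near-full traffic before closing completely
	}
}

func main() {
	for _, s := range []int{0, 1, 4, 5, 10} {
		fmt.Printf("successes=%d permits=%d\n", s, rampPermits(s))
	}
}
```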
References
- Go Standard Library — sync Package Documentation
- Go Standard Library — context Package Documentation
- Go Standard Library — net Package Documentation
- sony/gobreaker — Circuit Breaker Library for Go
- Microsoft Azure — Circuit Breaker Pattern (Cloud Design Patterns)
- Martin Fowler — Circuit Breaker
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always review and adapt for your specific use case and security requirements. This is not financial or legal advice.