April 8, 2026 10 min read

Designing a Payment Orchestration Layer — Route, Retry, and Reconcile

After integrating our third payment provider in six months, I knew we were doing it wrong. Every new PSP meant weeks of custom code, duplicated error handling, and a reconciliation nightmare. That's when we built an orchestration layer — and it changed everything about how we process payments.

Why Direct PSP Integration Doesn't Scale

Here's how most companies start: you pick Stripe, integrate their SDK, and move on. It works great — until it doesn't. Maybe Stripe's fees are eating your margins on high-volume European transactions. Maybe your expansion into Southeast Asia needs a local acquirer. Maybe you had a four-hour Stripe outage last quarter that cost you six figures in lost transactions.

So you add Adyen. Now you have two sets of webhook handlers, two reconciliation pipelines, two different error code taxonomies, and a checkout flow that needs to decide which provider to use. Add Checkout.com for UK-optimized routing, and you've got a maintenance nightmare.

A payment orchestration layer sits between your application and your PSPs. It gives you one interface to talk to, and it handles the messy reality of multiple providers underneath. Think of it as an API gateway, but specifically designed for the quirks of payment processing.

[Diagram: your application talks to the orchestration layer — router, retry engine, reconciler, state machine — which fans out to Stripe (primary), Adyen (EU/APAC), and Checkout.com (UK-optimized).]
The Provider Abstraction Pattern

The foundation of any orchestration layer is a unified interface. Every PSP does the same basic things — authorize, capture, refund, void — but they all do it differently. Stripe uses PaymentIntents, Adyen uses /payments with a completely different payload shape, and Checkout.com has its own request format entirely.

I've found the adapter pattern works best here. You define a common interface, then write a thin adapter for each provider:

// The unified interface every provider must implement
type PaymentProvider interface {
    Authorize(ctx context.Context, req AuthRequest) (*AuthResponse, error)
    Capture(ctx context.Context, txnID string, amount Money) (*CaptureResponse, error)
    Refund(ctx context.Context, txnID string, amount Money) (*RefundResponse, error)
    Void(ctx context.Context, txnID string) (*VoidResponse, error)
}

// Each PSP gets its own adapter
type StripeAdapter struct { client *stripe.Client }
type AdyenAdapter struct { client *adyen.Client }
type CheckoutAdapter struct { client *checkout.Client }

The key insight I missed early on: don't just normalize the request/response shapes. You also need to normalize error codes. Stripe's card_declined, Adyen's Refused, and Checkout.com's 20005 all mean the same thing. Build a canonical error taxonomy and map every provider's codes into it. Without this, your retry logic can't make intelligent decisions.
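A minimal sketch of that mapping, assuming a small illustrative taxonomy (the type and function names here — `CanonicalError`, `NormalizeError`, `errorMap` — are mine, and the provider codes shown are examples, not an exhaustive mapping):

```go
package main

import "fmt"

// CanonicalError is a provider-agnostic error taxonomy (illustrative subset).
type CanonicalError string

const (
	ErrCardDeclined      CanonicalError = "card_declined"      // hard decline: do not retry
	ErrInsufficientFunds CanonicalError = "insufficient_funds" // soft decline: retry later
	ErrUnknown           CanonicalError = "unknown"
)

// errorMap translates each provider's native codes into the canonical taxonomy.
var errorMap = map[string]map[string]CanonicalError{
	"stripe":   {"card_declined": ErrCardDeclined, "insufficient_funds": ErrInsufficientFunds},
	"adyen":    {"Refused": ErrCardDeclined, "NotEnoughBalance": ErrInsufficientFunds},
	"checkout": {"20005": ErrCardDeclined, "20051": ErrInsufficientFunds},
}

// NormalizeError maps a provider-specific code to the canonical taxonomy,
// falling back to ErrUnknown so callers never branch on raw provider codes.
func NormalizeError(provider, code string) CanonicalError {
	if m, ok := errorMap[provider]; ok {
		if canonical, ok := m[code]; ok {
			return canonical
		}
	}
	return ErrUnknown
}

func main() {
	fmt.Println(NormalizeError("adyen", "Refused"))    // card_declined
	fmt.Println(NormalizeError("stripe", "weird_123")) // unknown
}
```

The fallback to `ErrUnknown` matters: an unmapped code should degrade to "don't retry, investigate" rather than crash the retry engine.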

Lesson learned: We initially tried to build a "perfect" abstraction that exposed every provider-specific feature. Don't do this. Start with the 80% — authorize, capture, refund, void, and webhooks. Add provider-specific capabilities later through extension points, not by bloating the core interface.

Smart Routing Logic

Once you have multiple providers behind a unified interface, you need to decide where to send each transaction. This is where the real value of orchestration kicks in. We use a scoring system that evaluates three dimensions:

1. Cost-Based Routing

Different providers charge different rates depending on card type, currency, and region. Stripe might charge 2.9% + 30c for a US domestic transaction, but Adyen could be cheaper for European cards through local acquiring. We maintain a fee schedule per provider and factor it into routing decisions. On $2M monthly volume, even a 0.3% difference in processing fees saves $6K/month.
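A fee schedule can be as simple as a lookup keyed by provider and corridor. This is a sketch under assumed, illustrative rates (the `FeeSchedule` type and the numbers in `fees` are mine, not real contract pricing); using basis points and integer cents avoids floating-point rounding in cost comparisons:

```go
package main

import "fmt"

// FeeSchedule holds one provider's pricing for a (currency, region) corridor.
// Rates here are illustrative; real schedules vary by card type and contract.
type FeeSchedule struct {
	Bps        int64 // percentage fee in basis points, e.g. 290 = 2.9%
	FixedCents int64 // fixed fee per transaction, in cents
}

var fees = map[string]map[string]FeeSchedule{
	"stripe": {"USD-US": {290, 30}, "EUR-EU": {290, 25}},
	"adyen":  {"USD-US": {300, 12}, "EUR-EU": {230, 11}},
}

// EffectiveCostCents returns what a provider would charge for this amount,
// computed entirely in integer cents.
func EffectiveCostCents(provider, corridor string, amountCents int64) int64 {
	f := fees[provider][corridor]
	return amountCents*f.Bps/10000 + f.FixedCents
}

func main() {
	// A EUR 50.00 European card: local acquiring via Adyen wins here.
	fmt.Println(EffectiveCostCents("stripe", "EUR-EU", 5000)) // 170
	fmt.Println(EffectiveCostCents("adyen", "EUR-EU", 5000))  // 126
}
```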

2. Success-Rate-Based Routing

We track authorization rates per provider, per BIN range, per country, in sliding 24-hour windows. If Stripe's auth rate for Brazilian cards drops below 85% while Adyen is holding at 92%, the router shifts Brazilian traffic to Adyen. This alone improved our overall auth rate by 3-5% after the first month.
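The sliding window can be approximated with hourly buckets in a ring buffer. This is a simplified, single-key sketch (a real tracker would be keyed per provider, per BIN range, per country, and advanced by wall clock rather than manually):

```go
package main

import "fmt"

// AuthRateWindow tracks authorization outcomes in 24 hourly buckets and
// reports the success rate over the whole window.
type AuthRateWindow struct {
	attempts  [24]int
	successes [24]int
	hour      int // index of the current bucket
}

// Record adds one authorization attempt to the current bucket.
func (w *AuthRateWindow) Record(success bool) {
	w.attempts[w.hour]++
	if success {
		w.successes[w.hour]++
	}
}

// Advance moves to the next hourly bucket, dropping the data that just
// fell out of the 24-hour window.
func (w *AuthRateWindow) Advance() {
	w.hour = (w.hour + 1) % 24
	w.attempts[w.hour] = 0
	w.successes[w.hour] = 0
}

// Rate returns the 24-hour auth rate, or 1.0 when there is no data yet
// (so an unused provider isn't penalized before it has traffic).
func (w *AuthRateWindow) Rate() float64 {
	var a, s int
	for i := 0; i < 24; i++ {
		a += w.attempts[i]
		s += w.successes[i]
	}
	if a == 0 {
		return 1.0
	}
	return float64(s) / float64(a)
}

func main() {
	var w AuthRateWindow
	for i := 0; i < 92; i++ {
		w.Record(true)
	}
	for i := 0; i < 8; i++ {
		w.Record(false)
	}
	fmt.Printf("%.2f\n", w.Rate()) // 0.92
}
```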

3. Geography-Based Routing

Local acquiring almost always beats cross-border processing. A UK-issued card processed through a UK acquirer gets better auth rates and lower interchange fees than routing it through a US acquirer. We map issuing country (from the BIN) to the provider with the best local acquiring relationship in that region.
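The geography lookup itself is small. A sketch, assuming an in-memory map from issuing country to preferred provider (the country-to-provider pairings below are illustrative, not a statement of which PSP actually acquires best where):

```go
package main

import "fmt"

// geoPreference maps an issuing country (derived from the card BIN) to the
// provider assumed to have the best local acquiring there. Illustrative only.
var geoPreference = map[string]string{
	"GB": "checkout", // UK-issued cards through UK acquiring
	"DE": "adyen", "FR": "adyen", "NL": "adyen",
	"US": "stripe",
}

// PreferredProvider returns the locally acquiring provider for a country,
// falling back to a default cross-border route when we have no local option.
func PreferredProvider(issuingCountry string) string {
	if p, ok := geoPreference[issuingCountry]; ok {
		return p
	}
	return "stripe" // default cross-border route
}

func main() {
	fmt.Println(PreferredProvider("GB")) // checkout
	fmt.Println(PreferredProvider("BR")) // stripe (fallback)
}
```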

func (r *Router) SelectProvider(txn *Transaction) (PaymentProvider, error) {
    candidates := r.providers.Active()
    if len(candidates) == 0 {
        return nil, errors.New("no active payment providers")
    }
    scored := make([]ScoredProvider, 0, len(candidates))

    for _, p := range candidates {
        score := 0.0
        score += r.costScore(p, txn) * 0.3        // 30% weight
        score += r.successRateScore(p, txn) * 0.5 // 50% weight
        score += r.geoScore(p, txn) * 0.2         // 20% weight
        scored = append(scored, ScoredProvider{p, score})
    }

    sort.Slice(scored, func(i, j int) bool {
        return scored[i].Score > scored[j].Score
    })
    return scored[0].Provider, nil
}
The headline numbers for us: a 3-5% auth rate improvement with smart routing, 99.7% effective uptime with multi-PSP failover, and under 200ms of p99 latency overhead for the routing decision itself.

Retry and Failover Strategies

Not all failures are equal, and treating them the same is a fast path to either lost revenue or duplicate charges. We classify every failure into one of three buckets: hard declines (the issuer said no — never retry), soft declines (insufficient funds, issuer temporarily unavailable — retry later on the same provider), and infrastructure failures (timeouts, 5xx responses — candidates for failover to another provider).

The critical rule: never retry an authorization on a different provider unless you're certain the first attempt didn't go through. If Stripe times out but actually processed the charge, and you send the same card to Adyen, the customer gets charged twice. We use idempotency keys and a local transaction log to guard against this. Before failing over, we check: did the first provider actually create a charge? If we can't confirm either way, we wait and reconcile rather than risk a duplicate.

The 30-second rule: If a provider doesn't respond within 30 seconds, we mark the transaction as "uncertain" and queue it for async resolution. We don't fail over immediately — we check the provider's transaction API first. This added complexity saved us from dozens of double-charge incidents per month.
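The decision logic above can be sketched as a small guard function. This is a simplified model, assuming the names `Outcome` and `SafeFailover` (mine) and a `lookup` callback standing in for the first provider's transaction API, queried by our idempotency key:

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome of querying the first provider's transaction API after a timeout.
type Outcome int

const (
	NotFound Outcome = iota // charge never landed: safe to fail over
	Charged                 // charge went through: do NOT retry elsewhere
	Unknown                 // API unreachable: park for async reconciliation
)

var ErrParkForReconciliation = errors.New("outcome uncertain: queued for async resolution")

// SafeFailover decides whether a timed-out authorization may be retried on
// another provider. Failover is allowed only when the first provider
// confirms no charge exists for our idempotency key.
func SafeFailover(lookup func(idempotencyKey string) Outcome, key string) (bool, error) {
	switch lookup(key) {
	case NotFound:
		return true, nil // confirmed no charge: failover is safe
	case Charged:
		return false, nil // already charged: surface success, never retry
	default:
		return false, ErrParkForReconciliation // wait and reconcile instead
	}
}

func main() {
	lookup := func(key string) Outcome { return NotFound }
	ok, err := SafeFailover(lookup, "txn-123")
	fmt.Println(ok, err) // true <nil>
}
```

The asymmetry is deliberate: a false "safe to fail over" risks a double charge, while a false "park it" only delays one transaction, so every ambiguous case falls through to reconciliation.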

The Reconciliation Challenge

Reconciliation with a single provider is already tedious. With three providers, it's a full-time job — unless you automate it properly. Every provider sends settlement reports in different formats, at different times, with different levels of detail.

Our reconciliation pipeline runs in three stages:

  1. Ingest — pull settlement files from each provider (Stripe sends webhooks, Adyen drops CSVs on SFTP, Checkout.com has a reporting API). Normalize everything into a common settlement record format.
  2. Match — join settlement records against our internal transaction log using the provider's transaction ID. Flag anything that doesn't match: missing settlements, amount mismatches, unexpected refunds.
  3. Resolve — unmatched records go into a queue for investigation. Most are timing issues (settlement arrived before our webhook processed). True discrepancies get escalated.
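The match stage is essentially a hash join on the provider's transaction ID. A minimal sketch, assuming normalized record types (`Settlement`, `InternalTxn`, `Mismatch` are illustrative names, and real records carry far more fields — currency, fees, capture timestamps):

```go
package main

import "fmt"

// Settlement is one normalized row from a provider's settlement report.
type Settlement struct {
	ProviderTxnID string
	AmountCents   int64
}

// InternalTxn is one row from our own transaction log.
type InternalTxn struct {
	ProviderTxnID string
	AmountCents   int64
}

// Mismatch flags a settlement row that failed to reconcile cleanly.
type Mismatch struct {
	ProviderTxnID string
	Reason        string
}

// Match joins settlement records against the internal transaction log by
// provider transaction ID, flagging missing or amount-mismatched rows.
func Match(settlements []Settlement, ledger []InternalTxn) []Mismatch {
	index := make(map[string]InternalTxn, len(ledger))
	for _, t := range ledger {
		index[t.ProviderTxnID] = t
	}
	var out []Mismatch
	for _, s := range settlements {
		t, ok := index[s.ProviderTxnID]
		switch {
		case !ok:
			out = append(out, Mismatch{s.ProviderTxnID, "no internal record"})
		case t.AmountCents != s.AmountCents:
			out = append(out, Mismatch{s.ProviderTxnID, "amount mismatch"})
		}
	}
	return out
}

func main() {
	settlements := []Settlement{{"ch_1", 1000}, {"ch_2", 2500}, {"ch_3", 400}}
	ledger := []InternalTxn{{"ch_1", 1000}, {"ch_2", 2600}}
	for _, m := range Match(settlements, ledger) {
		fmt.Println(m.ProviderTxnID, m.Reason)
	}
	// ch_2 amount mismatch
	// ch_3 no internal record
}
```

Everything `Match` flags goes to the resolve queue; clean rows need no further work.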

We run this daily and track a "reconciliation rate" — the percentage of transactions that match cleanly within 48 hours. Anything below 99.5% triggers an alert. In practice, we hover around 99.8%, with the remaining 0.2% being edge cases like partial captures and currency conversion rounding.

State Machine Design for Payment Lifecycle

Every payment in our system is modeled as a state machine. This was probably the single best architectural decision we made. Instead of scattered status flags and boolean columns, every transaction has a well-defined state and a set of valid transitions.

Created → Authorized → Captured → Settled → Refunded, with a decline at the authorization step moving the payment to Failed.

The state machine enforces invariants that would otherwise be bugs. You can't capture a payment that wasn't authorized. You can't refund a payment that hasn't settled. You can't authorize a payment that's already been captured. Every state transition is logged with a timestamp, the provider that handled it, and the raw provider response. This audit trail has saved us during disputes more times than I can count.

We store the state machine in PostgreSQL with an events table (event sourcing lite). The current state is derived from the latest event, but we keep the full history. When reconciliation finds a discrepancy, we can replay the event chain and pinpoint exactly where things diverged.

Lessons from Building Orchestration in Production

  1. Start with two providers, not three. The jump from one to two forces you to build the abstraction. The jump from two to three is just adding another adapter. But starting with three means you're designing abstractions while simultaneously learning three different APIs. We added Checkout.com four months after launching with Stripe and Adyen, and the integration took two days instead of two weeks.
  2. Instrument everything from day one. We use Datadog to track auth rates, latency percentiles, and error rates per provider, per card brand, per country. When Adyen had a 15-minute degradation affecting Australian cards, we caught it in under two minutes because the dashboard lit up. Without per-provider observability, you're flying blind.
  3. Don't build your own — unless you have to. Platforms like Spreedly and Primer exist specifically for payment orchestration. If your routing logic is simple (primary/fallback), use one of these. We built custom because our routing rules are deeply tied to our risk engine and pricing model, but I'd estimate 70% of companies would be better served by an off-the-shelf solution.
  4. Webhook deduplication is non-negotiable. When you have three providers sending webhooks, and each has its own retry logic for failed deliveries, you will get duplicate events. Every webhook handler needs to be idempotent, keyed on the provider's event ID. We learned this the hard way when a Stripe webhook retry storm caused 200+ duplicate refund attempts in ten minutes.
  5. Plan for provider migration, not just failover. At some point, you'll want to move traffic permanently from one provider to another — maybe for cost reasons, maybe because a contract ended. Your orchestration layer should support gradual traffic shifting (10% to new provider, then 25%, then 50%) with automatic rollback if error rates spike. We call this "canary routing" and it's saved us from two bad provider deployments.
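The webhook deduplication from lesson 4 can be sketched as a seen-set keyed on provider plus event ID. The `SeenStore` here is an in-memory stand-in for what should really be a unique-constrained database table or a Redis `SETNX`, so duplicates are rejected even across handler restarts:

```go
package main

import (
	"fmt"
	"sync"
)

// SeenStore records processed webhook event IDs. In production this would be
// backed by durable storage with a uniqueness guarantee, not a map.
type SeenStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewSeenStore() *SeenStore {
	return &SeenStore{seen: make(map[string]bool)}
}

// MarkIfNew returns true exactly once per key; later calls are duplicates.
func (s *SeenStore) MarkIfNew(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[key] {
		return false
	}
	s.seen[key] = true
	return true
}

// HandleWebhook processes an event at most once, keyed on provider + event ID.
// Providers retry failed deliveries, so duplicates are expected, not rare.
func HandleWebhook(store *SeenStore, provider, eventID string, process func()) bool {
	if !store.MarkIfNew(provider + ":" + eventID) {
		return false // duplicate delivery: acknowledge and drop
	}
	process()
	return true
}

func main() {
	store := NewSeenStore()
	count := 0
	for i := 0; i < 3; i++ { // simulate a retry storm of the same event
		HandleWebhook(store, "stripe", "evt_123", func() { count++ })
	}
	fmt.Println(count) // 1
}
```

Keying on `provider + ":" + eventID` matters because two providers could, in principle, emit colliding event IDs.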

Final thought: Payment orchestration isn't about building the most sophisticated routing engine. It's about making your payment infrastructure resilient enough that a single provider's bad day doesn't become your bad day. Start simple, measure everything, and let the data tell you where to optimize.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.