Why Active-Passive Falls Short for Payments
Most teams start with active-passive because it's conceptually simple. One region handles traffic, the other sits idle waiting for disaster. On paper it works. In practice, for payment systems, it's a ticking time bomb.
The first problem is cold standby. Your passive region hasn't served real traffic in weeks — maybe months. Connection pools are cold. Caches are empty. Auto-scaling groups are at minimum capacity. When failover actually happens, you're not switching to a warm backup. You're cold-starting an entire payment stack under peak load from panicked retry storms.
Then there's DNS TTL. Even if you set your TTLs aggressively low (say 60 seconds), many resolvers and client libraries cache beyond what you tell them. We measured real-world propagation times of 3–8 minutes for our failover DNS changes. For a payment gateway processing 2,000 transactions per second, that's somewhere between 360K and 960K transactions hitting a dead endpoint.
The final nail: connection pool warm-up. Our Go services maintain persistent gRPC connections to downstream processors. After failover, every service instance needs to re-establish connections, negotiate TLS, and authenticate. Under load, this creates a thundering herd that can take another 5–10 minutes to stabilize. We timed the full recovery at 23 minutes. That's not a failover — that's an outage with extra steps.
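To give a sense of what "warming up" actually involves, here's a minimal Go sketch of staggered gRPC dialing. It is not our production code: `dialProcessor`, the processor address, and the pool size are all illustrative, and a real dialer would use proper TLS credentials.

```go
// Sketch only: staggered ("jittered") warm-up of gRPC connections to a
// downstream processor, so a freshly promoted region doesn't re-dial,
// re-handshake, and re-authenticate everything at the same instant.
// dialProcessor, the address, and the pool size are illustrative.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dialProcessor(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	// Production code would use real TLS credentials; WithBlock makes the
	// dial wait until the connection is actually usable ("warm").
	return grpc.DialContext(ctx, addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
}

func warmPool(ctx context.Context, addr string, size int) ([]*grpc.ClientConn, error) {
	conns := make([]*grpc.ClientConn, 0, size)
	for i := 0; i < size; i++ {
		// Random delay per dial so a fleet of instances doesn't stampede the processor.
		time.Sleep(time.Duration(rand.Intn(200)) * time.Millisecond)
		conn, err := dialProcessor(ctx, addr)
		if err != nil {
			return conns, fmt.Errorf("warm dial %d: %w", i, err)
		}
		conns = append(conns, conn)
	}
	return conns, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	conns, err := warmPool(ctx, "processor.internal:443", 8)
	fmt.Printf("warmed %d connections, err=%v\n", len(conns), err)
}
```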
Key insight: If your "failover" takes longer than your SLA window, you don't have high availability. You have a disaster recovery plan with a marketing problem.
The Architecture We Landed On
After evaluating several topologies, we settled on a three-region active-active setup. Each region independently processes transactions, with a global routing layer distributing traffic based on latency and health. Here's the high-level view:
Each region runs the full stack: API gateway, payment processing workers, fraud detection, and a local database node. There's no single primary. Every region can accept writes. The replication layer handles synchronization under the hood.
The critical design choice: we don't try to make every region handle every merchant's traffic. Each merchant is "homed" to a region based on where most of their customers are. The global load balancer routes accordingly, but any region can serve any merchant if their home region degrades.
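To make the homing rule concrete, here's a simplified sketch of the routing decision. The types, region names, and the `homeRegion` lookup are placeholders; the real decision lives in the global routing layer, not in application code.

```go
// Simplified sketch of the merchant-homing rule: prefer the merchant's home
// region while it's healthy, otherwise any healthy region can serve them.
// Types, region names, and the homeRegion lookup are placeholders.
package main

import "fmt"

type Region string

type RegionHealth map[Region]bool

// In production this mapping comes from the merchant's onboarding record.
var homeRegion = map[string]Region{
	"merchant-eu-123": "eu-central",
	"merchant-us-456": "us-east",
}

func pickRegion(merchantID string, health RegionHealth, fallbacks []Region) Region {
	if home, ok := homeRegion[merchantID]; ok && health[home] {
		return home
	}
	// Home region degraded (or merchant unknown): fall back to any healthy region.
	for _, r := range fallbacks {
		if health[r] {
			return r
		}
	}
	return "" // nothing healthy; caller sheds load
}

func main() {
	health := RegionHealth{"us-east": false, "eu-central": true, "ap-southeast": true}
	// Home region us-east is degraded, so the merchant lands on eu-central.
	fmt.Println(pickRegion("merchant-us-456", health, []Region{"eu-central", "ap-southeast"}))
}
```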
Data Replication — The Hard Part
Let's be honest: the compute layer is the easy part of active-active. Stateless services scale horizontally. The real beast is the data layer. How do you let three regions write to the same logical database without corrupting payment records?
CockroachDB vs. PostgreSQL Logical Replication
We evaluated two paths. PostgreSQL with logical replication is battle-tested and our team knew it well. But logical replication is fundamentally async, and conflict resolution for payment data is not something you want to hand-roll. A double-charge because two regions both approved the same transaction? That's a compliance nightmare, not just a bug.
We went with CockroachDB. Its serializable isolation across regions means we get strong consistency guarantees without building our own conflict resolution. The trade-off is latency — cross-region consensus adds 80–150ms to writes that span regions. For payment authorization, that's acceptable. For real-time balance checks, it's tight.
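One practical consequence of serializable isolation: your clients have to retry on contention. Here's the rough shape of that loop using only `database/sql`. CockroachDB reports retryable conflicts as SQLSTATE 40001; the string check below is a stand-in for proper driver error inspection (pgx and lib/pq expose the code as a typed field).

```go
// Rough sketch of the client-side retry loop that serializable isolation asks
// for. The isRetryable helper string-matches SQLSTATE 40001 purely for
// illustration; real code should inspect the driver's error type.
package main

import (
	"context"
	"database/sql"
	"strings"
)

func isRetryable(err error) bool {
	return err != nil && strings.Contains(err.Error(), "40001")
}

// withRetry runs fn inside a transaction and retries on serialization conflicts.
func withRetry(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	const maxAttempts = 3
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		var tx *sql.Tx
		if tx, err = db.BeginTx(ctx, nil); err != nil {
			return err
		}
		if err = fn(tx); err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		} else {
			tx.Rollback()
		}
		if !isRetryable(err) {
			return err
		}
		// Retryable conflict: loop and run the whole transaction again.
	}
	return err
}
```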
The "Write-Home" Pattern
To minimize cross-region write latency, we use what we call the "write-home" pattern. Every transaction has a home region, determined by the merchant's primary region assignment. Writes for that transaction — authorization, capture, settlement — always go to the home region's leaseholder. Reads can happen anywhere since CockroachDB's follower reads are fast and consistent enough for our needs.
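Here's a stripped-down sketch of how that split looks in code, assuming a hypothetical `txnStore` that holds one `*sql.DB` per region. The table and columns are made up; the load-bearing ideas are "writes go through the home region" and "reads can run locally via follower reads".

```go
// Write-home sketch: writes target the home region's pool, reads run locally
// with CockroachDB follower reads (bounded staleness, no cross-region consensus).
package main

import (
	"context"
	"database/sql"
)

type txnStore struct {
	byRegion    map[string]*sql.DB // one connection pool per region
	localRegion string
}

// Authorize writes through the transaction's home region so the write stays
// close to the range leaseholder.
func (s *txnStore) Authorize(ctx context.Context, homeRegion, txnID string, amountCents int64) error {
	db := s.byRegion[homeRegion]
	_, err := db.ExecContext(ctx,
		`INSERT INTO authorizations (txn_id, amount_cents, status) VALUES ($1, $2, 'approved')`,
		txnID, amountCents)
	return err
}

// Status reads locally. follower_read_timestamp() lets any region answer
// without coordinating across regions, at the cost of slight staleness.
func (s *txnStore) Status(ctx context.Context, txnID string) (string, error) {
	db := s.byRegion[s.localRegion]
	var status string
	err := db.QueryRowContext(ctx,
		`SELECT status FROM authorizations
		   AS OF SYSTEM TIME follower_read_timestamp()
		  WHERE txn_id = $1`,
		txnID).Scan(&status)
	return status, err
}
```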
Warning: If you're considering PostgreSQL logical replication for active-active payments, you need a rock-solid conflict resolution strategy. Last-write-wins is not acceptable for financial data. Look into application-level conflict detection with compensating transactions, or just use a database that handles distributed consensus natively.
DNS and Traffic Routing
Getting traffic to the right region is more nuanced than pointing DNS at three IP addresses. We evaluated three approaches before settling on our current setup:
We use Route53 latency-based routing as the primary mechanism, with health checks hitting a deep /health/ready endpoint that verifies database connectivity, downstream processor reachability, and queue depth. If any of those fail, the region gets pulled from rotation within 10 seconds. We layer GeoDNS on top for merchants with strict data residency requirements — EU merchants always land in Frankfurt first, regardless of latency.
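For illustration, a minimal version of that readiness handler in Go. The individual checks here are stubs; in practice they wrap a database ping, health probes for downstream processors, and a queue-depth threshold.

```go
// Minimal sketch of a deep /health/ready endpoint: the region is "ready" only
// if every dependency a payment actually needs is healthy.
package main

import (
	"context"
	"net/http"
	"time"
)

type check func(context.Context) error

func readyHandler(checks map[string]check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, c := range checks {
			if err := c(ctx); err != nil {
				// Any failing dependency pulls this region out of rotation.
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	http.Handle("/health/ready", readyHandler(map[string]check{
		"database":   func(ctx context.Context) error { return nil }, // e.g. db.PingContext(ctx)
		"processors": func(ctx context.Context) error { return nil }, // e.g. downstream health RPC
		"queue":      func(ctx context.Context) error { return nil }, // e.g. depth below threshold
	}))
	http.ListenAndServe(":8080", nil)
}
```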
One thing we learned the hard way: set your health check threshold to 2 consecutive failures, not 1. A single dropped health check packet caused us to pull a perfectly healthy region out of rotation during a busy Saturday. Thousands of transactions got rerouted for no reason.
What We Got Wrong
I'd love to tell you we nailed this on the first try. We didn't. Here are the three worst production incidents from our migration to active-active:
Split-Brain During Network Partition
Three months in, a submarine cable cut between Singapore and Frankfurt caused a network partition. CockroachDB handled it correctly — the minority side stopped accepting writes. But our application layer didn't handle the resulting errors gracefully. Instead of returning a clean "service unavailable" to merchants, our API returned bare 500s for a subset of requests, which some merchant integrations interpreted as "transaction failed — retry." The retry storm made everything worse. Lesson: your application needs explicit partition-aware error handling, not just database-level consensus.
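Here's a sketch of what that partition-aware handling can look like at the HTTP edge. The `isUnavailable` classifier and the `writeAuthorization` stub are hypothetical; the point is that consensus loss becomes an explicit 503 with Retry-After rather than an ambiguous 500.

```go
// Sketch: map "can't reach consensus" failures to 503 + Retry-After so merchant
// integrations back off instead of treating the error as a declined transaction.
package main

import (
	"context"
	"errors"
	"net/http"
)

func isUnavailable(err error) bool {
	// Real code would also inspect driver SQLSTATEs for ambiguous-result errors.
	return errors.Is(err, context.DeadlineExceeded)
}

func writeAuthorization(ctx context.Context) error { return nil } // stub for the sketch

func authorize(w http.ResponseWriter, r *http.Request) {
	switch err := writeAuthorization(r.Context()); {
	case err == nil:
		w.WriteHeader(http.StatusCreated)
	case isUnavailable(err):
		// Tell clients to back off instead of hammering the healthy regions.
		w.Header().Set("Retry-After", "30")
		http.Error(w, "region temporarily unavailable", http.StatusServiceUnavailable)
	default:
		http.Error(w, "internal error", http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/v1/authorize", authorize)
	http.ListenAndServe(":8080", nil)
}
```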
Clock Skew Breaking Transaction Ordering
We had a subtle bug where settlement batch processing assumed transaction timestamps were globally ordered. They weren't. Even with NTP, clock skew between regions can be 10–50ms. Two transactions hitting different regions within that window could appear out of order in settlement reports. We switched to hybrid logical clocks (which CockroachDB uses internally) for all application-level ordering. Never trust wall clocks in a distributed system.
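For a flavor of the idea, here's a deliberately simplified hybrid logical clock in Go. A real HLC also merges timestamps received from other nodes; this version only guarantees that locally issued timestamps stay monotonic even when the wall clock stalls or steps backwards.

```go
// Simplified hybrid logical clock: a (wall, logical) pair where the logical
// counter breaks ties whenever the physical clock hasn't advanced.
package main

import (
	"fmt"
	"sync"
	"time"
)

type HLC struct {
	mu      sync.Mutex
	wall    int64 // last physical time observed, nanoseconds
	logical int32 // tie-breaker while the wall clock hasn't advanced
}

func (c *HLC) Now() (int64, int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now().UnixNano()
	if now > c.wall {
		c.wall, c.logical = now, 0
	} else {
		c.logical++ // clock stalled or skewed backwards; keep ordering monotonic
	}
	return c.wall, c.logical
}

func main() {
	var clock HLC
	w1, l1 := clock.Now()
	w2, l2 := clock.Now()
	fmt.Println(w1, l1) // the second timestamp always sorts after the first,
	fmt.Println(w2, l2) // regardless of what the wall clock does in between
}
```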
Connection Pool Exhaustion During Failover Surge
When we intentionally drained US-East for maintenance, the traffic shift to Frankfurt and Singapore caused connection pool exhaustion on both receiving regions. We'd sized pools for 150% of normal regional traffic, but the surge hit 280% because of client-side retries stacking up. Now we pre-scale receiving regions before any planned drain, and our circuit breakers shed load more aggressively during traffic shifts.
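Here's a simplified version of the kind of load-shedding middleware we mean, using a plain in-flight counter. The budget of 500 is illustrative; in practice it's derived from connection pool sizing and kept well below what a surge like that 280% spike would demand.

```go
// Load-shedding sketch: once in-flight requests exceed a per-region budget,
// new requests get a fast 503 instead of queueing behind an exhausted pool.
package main

import (
	"net/http"
	"sync/atomic"
)

func shedAbove(limit int64, next http.Handler) http.Handler {
	var inFlight int64
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inFlight, 1) > limit {
			atomic.AddInt64(&inFlight, -1)
			w.Header().Set("Retry-After", "5")
			http.Error(w, "shedding load", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&inFlight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	payments := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	})
	http.ListenAndServe(":8080", shedAbove(500, payments))
}
```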
Tip: Run game days monthly. Simulate region failures, network partitions, and traffic surges. The bugs you find in controlled chaos are infinitely cheaper than the ones that find you at 3 AM on a holiday weekend.
The Numbers
After 14 months of running active-active across three regions, here's where we landed.
The "failover time" metric is the one I'm proudest of. It's effectively zero because there's no failover event anymore. When a region degrades, traffic just flows to the other two. No DNS changes, no cold starts, no human intervention. The system absorbs it.
Was it worth it? The migration took our platform team 8 months and cost roughly 40% more in infrastructure than active-passive. But we went from 99.95% to 99.995% availability, and we eliminated the single scariest entry in our risk register. For a payment system processing billions annually, that math works out pretty clearly.
References
- Google Cloud Architecture Framework — Design for scale and high availability
- AWS Disaster Recovery Architecture — Multi-Site Active/Active
- CockroachDB Architecture Overview
- PostgreSQL Logical Replication Documentation
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.