Why Active-Passive Falls Short for Payments
Most teams start with active-passive because it's conceptually simple. One region handles traffic, the other sits idle waiting for disaster. On paper it works. In practice, for payment systems, it's a ticking time bomb.
The first problem is cold standby. Your passive region hasn't served real traffic in weeks — maybe months. Connection pools are cold. Caches are empty. Auto-scaling groups are at minimum capacity. When failover actually happens, you're not switching to a warm backup. You're cold-starting an entire payment stack under peak load from panicked retry storms.
Then there's DNS TTL. Even if you set your TTLs aggressively low (say 60 seconds), many resolvers and client libraries cache beyond what you tell them. We measured real-world propagation times of 3–8 minutes for our failover DNS changes. For a payment gateway processing 2,000 transactions per second, that's somewhere between 360K and 960K transactions hitting a dead endpoint.
The final nail: connection pool warm-up. Our Go services maintain persistent gRPC connections to downstream processors. After failover, every service instance needs to re-establish connections, negotiate TLS, and authenticate. Under load, this creates a thundering herd that can take another 5–10 minutes to stabilize. We timed the full recovery at 23 minutes. That's not a failover — that's an outage with extra steps.
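To give a sense of what "warming up" actually involves, here's a minimal Go sketch of staggered gRPC dialing. It is not our production code: `dialProcessor`, the processor address, and the pool size are all illustrative, and a real dialer would use proper TLS credentials.

```go
// Sketch only: staggered ("jittered") warm-up of gRPC connections to a
// downstream processor, so a freshly promoted region doesn't re-dial,
// re-handshake, and re-authenticate everything at the same instant.
// dialProcessor, the address, and the pool size are illustrative.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dialProcessor(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	// Production code would use real TLS credentials; WithBlock makes the
	// dial wait until the connection is actually usable ("warm").
	return grpc.DialContext(ctx, addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
}

func warmPool(ctx context.Context, addr string, size int) ([]*grpc.ClientConn, error) {
	conns := make([]*grpc.ClientConn, 0, size)
	for i := 0; i < size; i++ {
		// Random delay per dial so a fleet of instances doesn't stampede the processor.
		time.Sleep(time.Duration(rand.Intn(200)) * time.Millisecond)
		conn, err := dialProcessor(ctx, addr)
		if err != nil {
			return conns, fmt.Errorf("warm dial %d: %w", i, err)
		}
		conns = append(conns, conn)
	}
	return conns, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	conns, err := warmPool(ctx, "processor.internal:443", 8)
	fmt.Printf("warmed %d connections, err=%v\n", len(conns), err)
}
```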
Key insight: If your "failover" takes longer than your SLA window, you don't have high availability. You have a disaster recovery plan with a marketing problem.
The Architecture We Landed On
After evaluating several topologies, we settled on a three-region active-active setup. Each region independently processes transactions, with a global routing layer distributing traffic based on latency and health. Here's the high-level view:
Each region runs the full stack: API gateway, payment processing workers, fraud detection, and a local database node. There's no single primary. Every region can accept writes. The replication layer handles synchronization under the hood.
The critical design choice: we don't try to make every region handle every merchant's traffic. Each merchant is "homed" to a region based on where most of their customers are. The global load balancer routes accordingly, but any region can serve any merchant if their home region degrades.
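To make the homing rule concrete, here's a simplified sketch of the routing decision. The types, region names, and the `homeRegion` lookup are placeholders; the real decision lives in the global routing layer, not in application code.

```go
// Simplified sketch of the merchant-homing rule: prefer the merchant's home
// region while it's healthy, otherwise any healthy region can serve them.
// Types, region names, and the homeRegion lookup are placeholders.
package main

import "fmt"

type Region string

type RegionHealth map[Region]bool

// In production this mapping comes from the merchant's onboarding record.
var homeRegion = map[string]Region{
	"merchant-eu-123": "eu-central",
	"merchant-us-456": "us-east",
}

func pickRegion(merchantID string, health RegionHealth, fallbacks []Region) Region {
	if home, ok := homeRegion[merchantID]; ok && health[home] {
		return home
	}
	// Home region degraded (or merchant unknown): fall back to any healthy region.
	for _, r := range fallbacks {
		if health[r] {
			return r
		}
	}
	return "" // nothing healthy; caller sheds load
}

func main() {
	health := RegionHealth{"us-east": false, "eu-central": true, "ap-southeast": true}
	// Home region us-east is degraded, so the merchant lands on eu-central.
	fmt.Println(pickRegion("merchant-us-456", health, []Region{"eu-central", "ap-southeast"}))
}
```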
Data Replication — The Hard Part
Let's be honest: the compute layer is the easy part of active-active. Stateless services scale horizontally. The real beast is the data layer. How do you let three regions write to the same logical database without corrupting payment records?
CockroachDB vs. PostgreSQL Logical Replication
We evaluated two paths. PostgreSQL with logical replication is battle-tested and our team knew it well. But logical replication is fundamentally async, and conflict resolution for payment data is not something you want to hand-roll. A double-charge because two regions both approved the same transaction? That's a compliance nightmare, not just a bug.
We went with CockroachDB. Its serializable isolation across regions means we get strong consistency guarantees without building our own conflict resolution. The trade-off is latency — cross-region consensus adds 80–150ms to writes that span regions. For payment authorization, that's acceptable. For real-time balance checks, it's tight.
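One practical consequence of serializable isolation: your clients have to retry on contention. Here's the rough shape of that loop using only `database/sql`. CockroachDB reports retryable conflicts as SQLSTATE 40001; the string check below is a stand-in for proper driver error inspection (pgx and lib/pq expose the code as a typed field).

```go
// Rough sketch of the client-side retry loop that serializable isolation asks
// for. The isRetryable helper string-matches SQLSTATE 40001 purely for
// illustration; real code should inspect the driver's error type.
package main

import (
	"context"
	"database/sql"
	"strings"
)

func isRetryable(err error) bool {
	return err != nil && strings.Contains(err.Error(), "40001")
}

// withRetry runs fn inside a transaction and retries on serialization conflicts.
func withRetry(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	const maxAttempts = 3
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		var tx *sql.Tx
		if tx, err = db.BeginTx(ctx, nil); err != nil {
			return err
		}
		if err = fn(tx); err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		} else {
			tx.Rollback()
		}
		if !isRetryable(err) {
			return err
		}
		// Retryable conflict: loop and run the whole transaction again.
	}
	return err
}
```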
The "Write-Home" Pattern
To minimize cross-region write latency, we use what we call the "write-home" pattern. Every transaction has a home region, determined by the merchant's primary region assignment. Writes for that transaction — authorization, capture, settlement — always go to the home region's leaseholder. Reads can happen anywhere since CockroachDB's follower reads are fast and consistent enough for our needs.
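Here's a stripped-down sketch of how that split looks in code, assuming a hypothetical `txnStore` that holds one `*sql.DB` per region. The table and columns are made up; the load-bearing ideas are "writes go through the home region" and "reads can run locally via follower reads".

```go
// Write-home sketch: writes target the home region's pool, reads run locally
// with CockroachDB follower reads (bounded staleness, no cross-region consensus).
package main

import (
	"context"
	"database/sql"
)

type txnStore struct {
	byRegion    map[string]*sql.DB // one connection pool per region
	localRegion string
}

// Authorize writes through the transaction's home region so the write stays
// close to the range leaseholder.
func (s *txnStore) Authorize(ctx context.Context, homeRegion, txnID string, amountCents int64) error {
	db := s.byRegion[homeRegion]
	_, err := db.ExecContext(ctx,
		`INSERT INTO authorizations (txn_id, amount_cents, status) VALUES ($1, $2, 'approved')`,
		txnID, amountCents)
	return err
}

// Status reads locally. follower_read_timestamp() lets any region answer
// without coordinating across regions, at the cost of slight staleness.
func (s *txnStore) Status(ctx context.Context, txnID string) (string, error) {
	db := s.byRegion[s.localRegion]
	var status string
	err := db.QueryRowContext(ctx,
		`SELECT status FROM authorizations
		   AS OF SYSTEM TIME follower_read_timestamp()
		  WHERE txn_id = $1`,
		txnID).Scan(&status)
	return status, err
}
```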
Warning: If you're considering PostgreSQL logical replication for active-active payments, you need a rock-solid conflict resolution strategy. Last-write-wins is not acceptable for financial data. Look into application-level conflict detection with compensating transactions, or just use a database that handles distributed consensus natively.
DNS and Traffic Routing
Getting traffic to the right region is more nuanced than pointing DNS at three IP addresses. We evaluated three approaches before settling on our current setup:
We use Route53 latency-based routing as the primary mechanism, with health checks hitting a deep /health/ready endpoint that verifies database connectivity, downstream processor reachability, and queue depth. If any of those fail, the region gets pulled from rotation within 10 seconds. We layer GeoDNS on top for merchants with strict data residency requirements — EU merchants always land in Frankfurt first, regardless of latency.
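For illustration, a minimal version of that readiness handler in Go. The individual checks here are stubs; in practice they wrap a database ping, health probes for downstream processors, and a queue-depth threshold.

```go
// Minimal sketch of a deep /health/ready endpoint: the region is "ready" only
// if every dependency a payment actually needs is healthy.
package main

import (
	"context"
	"net/http"
	"time"
)

type check func(context.Context) error

func readyHandler(checks map[string]check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, c := range checks {
			if err := c(ctx); err != nil {
				// Any failing dependency pulls this region out of rotation.
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	http.Handle("/health/ready", readyHandler(map[string]check{
		"database":   func(ctx context.Context) error { return nil }, // e.g. db.PingContext(ctx)
		"processors": func(ctx context.Context) error { return nil }, // e.g. downstream health RPC
		"queue":      func(ctx context.Context) error { return nil }, // e.g. depth below threshold
	}))
	http.ListenAndServe(":8080", nil)
}
```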
One thing we learned the hard way: set your health check threshold to 2 consecutive failures, not 1. A single dropped health check packet caused us to pull a perfectly healthy region out of rotation during a busy Saturday. Thousands of transactions got rerouted for no reason.
What We Got Wrong
I'd love to tell you we nailed this on the first try. We didn't. Here are the three worst production incidents from our migration to active-active:
Split-Brain During Network Partition
Three months in, a submarine cable cut between Singapore and Frankfurt caused a network partition. CockroachDB handled it correctly — the minority side stopped accepting writes. But our application layer didn't handle the resulting errors gracefully. Instead of returning a clean "service unavailable" to merchants, our API returned bare 500s for a subset of requests, which some merchant integrations interpreted as "transaction failed — retry." The retry storm made everything worse. Lesson: your application needs explicit partition-aware error handling, not just database-level consensus.
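Here's a sketch of what that partition-aware handling can look like at the HTTP edge. The `isUnavailable` classifier and the `writeAuthorization` stub are hypothetical; the point is that consensus loss becomes an explicit 503 with Retry-After rather than an ambiguous 500.

```go
// Sketch: map "can't reach consensus" failures to 503 + Retry-After so merchant
// integrations back off instead of treating the error as a declined transaction.
package main

import (
	"context"
	"errors"
	"net/http"
)

func isUnavailable(err error) bool {
	// Real code would also inspect driver SQLSTATEs for ambiguous-result errors.
	return errors.Is(err, context.DeadlineExceeded)
}

func writeAuthorization(ctx context.Context) error { return nil } // stub for the sketch

func authorize(w http.ResponseWriter, r *http.Request) {
	switch err := writeAuthorization(r.Context()); {
	case err == nil:
		w.WriteHeader(http.StatusCreated)
	case isUnavailable(err):
		// Tell clients to back off instead of hammering the healthy regions.
		w.Header().Set("Retry-After", "30")
		http.Error(w, "region temporarily unavailable", http.StatusServiceUnavailable)
	default:
		http.Error(w, "internal error", http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/v1/authorize", authorize)
	http.ListenAndServe(":8080", nil)
}
```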
Clock Skew Breaking Transaction Ordering
We had a subtle bug where settlement batch processing assumed transaction timestamps were globally ordered. They weren't. Even with NTP, clock skew between regions can be 10–50ms. Two transactions hitting different regions within that window could appear out of order in settlement reports. We switched to hybrid logical clocks (which CockroachDB uses internally) for all application-level ordering. Never trust wall clocks in a distributed system.
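For a flavor of the idea, here's a deliberately simplified hybrid logical clock in Go. A real HLC also merges timestamps received from other nodes; this version only guarantees that locally issued timestamps stay monotonic even when the wall clock stalls or steps backwards.

```go
// Simplified hybrid logical clock: a (wall, logical) pair where the logical
// counter breaks ties whenever the physical clock hasn't advanced.
package main

import (
	"fmt"
	"sync"
	"time"
)

type HLC struct {
	mu      sync.Mutex
	wall    int64 // last physical time observed, nanoseconds
	logical int32 // tie-breaker while the wall clock hasn't advanced
}

func (c *HLC) Now() (int64, int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now().UnixNano()
	if now > c.wall {
		c.wall, c.logical = now, 0
	} else {
		c.logical++ // clock stalled or skewed backwards; keep ordering monotonic
	}
	return c.wall, c.logical
}

func main() {
	var clock HLC
	w1, l1 := clock.Now()
	w2, l2 := clock.Now()
	fmt.Println(w1, l1) // the second timestamp always sorts after the first,
	fmt.Println(w2, l2) // regardless of what the wall clock does in between
}
```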
Connection Pool Exhaustion During Failover Surge
When we intentionally drained US-East for maintenance, the traffic shift to Frankfurt and Singapore caused connection pool exhaustion on both receiving regions. We'd sized pools for 150% of normal regional traffic, but the surge hit 280% because of client-side retries stacking up. Now we pre-scale receiving regions before any planned drain, and our circuit breakers shed load more aggressively during traffic shifts.
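Here's a simplified version of the kind of load-shedding middleware we mean, using a plain in-flight counter. The budget of 500 is illustrative; in practice it's derived from connection pool sizing and kept well below what a surge like that 280% spike would demand.

```go
// Load-shedding sketch: once in-flight requests exceed a per-region budget,
// new requests get a fast 503 instead of queueing behind an exhausted pool.
package main

import (
	"net/http"
	"sync/atomic"
)

func shedAbove(limit int64, next http.Handler) http.Handler {
	var inFlight int64
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inFlight, 1) > limit {
			atomic.AddInt64(&inFlight, -1)
			w.Header().Set("Retry-After", "5")
			http.Error(w, "shedding load", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&inFlight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	payments := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	})
	http.ListenAndServe(":8080", shedAbove(500, payments))
}
```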
Tip: Run game days monthly. Simulate region failures, network partitions, and traffic surges. The bugs you find in controlled chaos are infinitely cheaper than the ones that find you at 3 AM on a holiday weekend.
The Numbers
After 14 months of running active-active across three regions, here's where we landed.
The "failover time" metric is the one I'm proudest of. It's effectively zero because there's no failover event anymore. When a region degrades, traffic just flows to the other two. No DNS changes, no cold starts, no human intervention. The system absorbs it.
Was it worth it? The migration took our platform team 8 months and cost roughly 40% more in infrastructure than active-passive. But we went from 99.95% to 99.995% availability, and we eliminated the single scariest entry in our risk register. For a payment system processing billions annually, that math works out pretty clearly.
References
- Google Cloud Architecture Framework — Design for scale and high availability
- AWS Disaster Recovery Architecture — Multi-Site Active/Active
- CockroachDB Architecture Overview
- PostgreSQL Logical Replication Documentation
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.