April 11, 2026 · 5 min read

Disaster Recovery Planning for Payment Systems — How We Built a Failover That Actually Works

We learned the hard way that a DR plan sitting in Confluence isn't a DR plan at all. Here's how we rebuilt our failover architecture from scratch after a database promotion took down our payment pipeline for 47 minutes.

The Incident That Changed Everything

It was a Tuesday afternoon — peak transaction volume — when our primary PostgreSQL instance in us-east-1 started throwing connection errors. The monitoring lit up, PagerDuty fired, and we kicked off the failover runbook we'd written eight months earlier. That runbook had never been tested against production traffic.

The replica promotion itself went fine. Took about 90 seconds. But then things fell apart. Our application connection strings were hardcoded in three different config files across two services. The DNS TTL was set to 300 seconds, so even after we updated the CNAME, half our fleet was still hammering the dead primary. One service had a connection pool that cached the resolved IP at startup and never re-resolved. We lost 47 minutes of payment processing. For a system handling a few thousand transactions per hour, that was painful — both financially and in terms of merchant trust.

That incident became our catalyst. We didn't just patch the gaps — we rebuilt the entire DR strategy from the ground up.

Warning: If your DR runbook hasn't been executed against real traffic in the last 90 days, treat it as untested. Configuration drift, new services, and infrastructure changes will silently invalidate your assumptions.

RTO vs RPO — What They Actually Mean for Payments

These two acronyms get thrown around a lot, but in payment systems they carry specific weight. RTO (Recovery Time Objective) is how long you can be down before merchants start losing revenue and your SLA penalties kick in. RPO (Recovery Point Objective) is how much data you can afford to lose — and in payments, the answer is almost always "zero."

We settled on an RTO of 5 minutes and an RPO of 0 (synchronous replication for the transaction ledger, asynchronous for everything else). That RPO target drove most of our architecture decisions. You can't have zero data loss with async replication alone — we had to accept the latency cost of synchronous commits on the critical path.

Strategy            RTO                 RPO           Cost        Complexity
Active-Active       ~0 (near-instant)   0             Very High   Very High
Active-Passive      Minutes             0 – seconds   High        Moderate
Pilot Light         10 – 30 min         Minutes       Medium      Moderate
Backup & Restore    Hours               Hours         Low         Low

We went with Active-Passive. Active-Active sounds great on paper, but the conflict resolution complexity for financial transactions is brutal. When two regions both accept a payment for the same order at the same time, someone has to lose — and in payments, "someone loses" means money is wrong. Active-Passive gave us the RTO/RPO we needed without the distributed consensus headaches.

Our DR Architecture

The setup is multi-region PostgreSQL with streaming replication. Primary runs in us-east-1, standby in us-west-2. The transaction ledger uses synchronous replication (synchronous_commit = remote_apply) so we guarantee zero data loss on the critical path. Reporting tables and audit logs replicate asynchronously to keep write latency reasonable.
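One way to get that split on a single streaming-replication setup is to choose the durability level per transaction. Here's a minimal sketch of what that looks like from the application side, assuming psycopg2 and a standby already listed in synchronous_standby_names; the table names and connection string are illustrative, not our actual schema:

```python
# A sketch of per-transaction durability levels, assuming the standby is
# already listed in synchronous_standby_names on the primary. Table names
# and the connection string are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("host=payments-db.internal.example.com dbname=payments")

# Critical path: the commit does not return until the standby has *applied*
# the WAL, so an acknowledged ledger write survives a failover (RPO = 0).
with conn:
    with conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit TO 'remote_apply';")
        cur.execute(
            "INSERT INTO ledger_entries (order_id, amount_cents) VALUES (%s, %s);",
            ("ord_123", 4999),
        )

# Non-critical path: reporting and audit writes only wait for the local
# flush, keeping write latency reasonable; they still replicate, just async.
with conn:
    with conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit TO 'local';")
        cur.execute(
            "INSERT INTO audit_log (event, detail) VALUES (%s, %s);",
            ("payment_created", "order ord_123"),
        )
```

SET LOCAL scopes the setting to the current transaction, so the latency cost of synchronous commits stays confined to the ledger path.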

On the application side, every service resolves the database endpoint through a Route 53 health-checked CNAME with a 30-second TTL. We also switched our connection pooler (PgBouncer) to re-resolve DNS on every new connection rather than caching the IP. That single change would have saved us 20 minutes during the original incident.

Tip: Set your database DNS TTL to 30 seconds or less. The marginal DNS query cost is negligible compared to the minutes you'll save during a failover. Also configure your connection pooler to respect TTL — most don't by default.
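The same idea sketched at the application layer, for services that open connections directly rather than through PgBouncer: resolve the CNAME every time the pool creates a new connection, not once at startup. A minimal sketch assuming psycopg2; the hostname is a hypothetical placeholder:

```python
# A sketch of resolving DNS at connection-creation time instead of once at
# process startup. The hostname is a hypothetical placeholder.
import socket
import psycopg2

DB_HOST = "payments-db.internal.example.com"  # CNAME with a 30-second TTL


def new_connection():
    # Fresh lookup every time the pool needs a new connection, so a CNAME
    # repoint during failover is picked up as soon as the TTL expires.
    ip = socket.getaddrinfo(DB_HOST, 5432, proto=socket.IPPROTO_TCP)[0][4][0]
    # host= keeps TLS certificate verification working; hostaddr= tells
    # libpq to use the address we just resolved instead of its own lookup.
    return psycopg2.connect(host=DB_HOST, hostaddr=ip, dbname="payments",
                            connect_timeout=5)
```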

The Failover Runbook

After the incident, we formalized a six-step failover sequence. Every step has an owner, a timeout, and a rollback procedure. Here's the flow:

Detect Failure → Verify Outage → Switch DNS → Promote Replica → Validate Data → Resume Traffic
  1. Detect Failure — Automated health checks (Route 53 + custom probes) flag the primary as unhealthy. We require 3 consecutive failures over 30 seconds to avoid false positives.
  2. Verify Outage — On-call engineer confirms the outage isn't a monitoring blip. This is a human gate — we don't auto-failover for payments because a split-brain scenario is worse than a few minutes of downtime.
  3. Switch DNS — Update the Route 53 CNAME to point to the standby region. With a 30-second TTL, propagation completes within a minute.
  4. Promote Replica — Run pg_promote() on the standby. With streaming replication already caught up, this takes under 10 seconds.
  5. Validate Data — Run a quick integrity check: compare the last known transaction ID from the primary's WAL position against the promoted replica. Confirm no gaps. (Steps 3–5 are sketched in the code after this list.)
  6. Resume Traffic — Enable the health check endpoint on the new primary. Load balancers start routing payment traffic within seconds.

End to end, we've gotten this down to under 4 minutes in our last three game day runs.
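Steps 3 through 5 are the most scriptable part of that sequence. Here's a minimal sketch of what the automation can look like, assuming boto3 and psycopg2; the hosted-zone ID, hostnames, and credential handling are hypothetical placeholders rather than our production values:

```python
# failover.py: a minimal sketch of runbook steps 3-5, assuming boto3 and
# psycopg2 are installed. The hosted-zone ID, hostnames, and credential
# handling are hypothetical placeholders, not production values.
import boto3
import psycopg2

HOSTED_ZONE_ID = "Z0000000EXAMPLE"                  # hypothetical
DB_CNAME = "payments-db.internal.example.com"       # hypothetical
STANDBY_ENDPOINT = "pg-standby.us-west-2.internal"  # hypothetical


def switch_dns():
    """Step 3: repoint the database CNAME at the standby region (30 s TTL)."""
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "DR failover: point payments DB at the standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": DB_CNAME,
                    "Type": "CNAME",
                    "TTL": 30,
                    "ResourceRecords": [{"Value": STANDBY_ENDPOINT}],
                },
            }],
        },
    )


def promote_replica():
    """Step 4: promote the streaming replica; pg_promote() waits for completion."""
    conn = psycopg2.connect(host=STANDBY_ENDPOINT, dbname="payments")
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_promote();")  # waits up to 60 s by default
            if not cur.fetchone()[0]:
                raise RuntimeError("pg_promote() did not finish in time")
    finally:
        conn.close()


def validate_no_gap(last_known_primary_lsn):
    """Step 5: confirm the promoted node replayed at least the primary's last LSN."""
    conn = psycopg2.connect(host=STANDBY_ENDPOINT, dbname="payments")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT %s::pg_lsn <= pg_last_wal_replay_lsn();",
                (last_known_primary_lsn,),
            )
            if not cur.fetchone()[0]:
                raise RuntimeError("WAL gap detected, do not resume traffic")
    finally:
        conn.close()
```

The human gate from step 2 stays in place: an on-call engineer runs this, it doesn't trigger itself.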

Testing DR — Quarterly Game Days

The single most valuable thing we did was commit to quarterly game day exercises. Every three months, we simulate a regional failure during business hours (low-traffic window, but real production traffic). The on-call engineer runs the full runbook while the rest of the team observes.

The first game day was humbling. We discovered that a new microservice deployed two months prior had its database connection string hardcoded in a Kubernetes ConfigMap that nobody updated. The runbook didn't mention it because the service didn't exist when the runbook was written. That's the whole point — game days surface the drift that documentation can't keep up with.

Key Takeaway: Your DR plan degrades every single day you don't test it. New services get deployed, configs change, team members rotate. Quarterly game days are the minimum frequency to keep your runbook honest.

Common Mistakes

After going through this process and talking with other teams running payment infrastructure, the same mistakes keep coming up:

  1. Untested runbooks: a failover procedure that has never run against production traffic will fail in ways the document never anticipated.
  2. Hardcoded connection strings: database endpoints scattered across config files and ConfigMaps are the first thing a failover misses.
  3. Long DNS TTLs and cached IPs: connection pools that resolve once at startup keep hammering the dead primary long after the CNAME moves.
  4. Runbook drift: services deployed after the runbook was written simply aren't in it, and only regular game days catch that.
  5. Fully automated failover without a human gate: for payment systems, a split-brain primary is worse than a few minutes of downtime.

Cost vs Risk Tradeoff

Running a hot standby in a second region isn't cheap. For us, the DR infrastructure adds roughly 40% to our database hosting costs. That's a real number. But here's how we justified it: we calculated the cost of a 1-hour outage — lost transaction fees, SLA penalty payouts, merchant churn risk, and the engineering hours for incident response and post-mortem. One significant outage per year would cost more than the annual DR infrastructure spend.
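If you want to run the same comparison for your own system, the arithmetic fits in a few lines. Every number below is a hypothetical placeholder, not our actual figure:

```python
# Back-of-envelope outage-cost comparison. All numbers are hypothetical
# placeholders; substitute your own transaction volume, SLA terms, and spend.
outage_hours = 1
lost_fees_per_hour = 12_000        # hypothetical: fees not collected during the outage
sla_penalties = 8_000              # hypothetical: contractual payouts
incident_engineering_cost = 5_000  # hypothetical: response + post-mortem hours
churn_risk = 15_000                # hypothetical: expected merchant churn cost

one_outage_cost = (outage_hours * lost_fees_per_hour
                   + sla_penalties + incident_engineering_cost + churn_risk)

annual_dr_spend = 30_000           # hypothetical: the extra ~40% on DB hosting per year

print(f"single outage: ${one_outage_cost:,}  vs  annual DR spend: ${annual_dr_spend:,}")
# If one significant outage per year is plausible, the hot standby pays for itself.
```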

The math won't be the same for every team. If you're processing a handful of transactions per day, Backup & Restore with a 1-hour RTO might be perfectly fine. But once you're handling real volume with contractual SLAs, Active-Passive with synchronous replication pays for itself. The key is being honest about your actual risk tolerance rather than defaulting to the cheapest option and hoping for the best.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.