April 11, 2026 · 5 min read

Disaster Recovery Planning for Payment Systems — How We Built a Failover That Actually Works

We learned the hard way that a DR plan sitting in Confluence isn't a DR plan at all. Here's how we rebuilt our failover architecture from scratch after a database promotion took down our payment pipeline for 47 minutes.

The Incident That Changed Everything

It was a Tuesday afternoon — peak transaction volume — when our primary PostgreSQL instance in us-east-1 started throwing connection errors. The monitoring lit up, PagerDuty fired, and we kicked off the failover runbook we'd written eight months earlier. That runbook had never been tested against production traffic.

The replica promotion itself went fine. Took about 90 seconds. But then things fell apart. Our application connection strings were hardcoded in three different config files across two services. The DNS TTL was set to 300 seconds, so even after we updated the CNAME, half our fleet was still hammering the dead primary. One service had a connection pool that cached the resolved IP at startup and never re-resolved. We lost 47 minutes of payment processing. For a system handling a few thousand transactions per hour, that was painful — both financially and in terms of merchant trust.

That incident became our catalyst. We didn't just patch the gaps — we rebuilt the entire DR strategy from the ground up.

Warning: If your DR runbook hasn't been executed against real traffic in the last 90 days, treat it as untested. Configuration drift, new services, and infrastructure changes will silently invalidate your assumptions.

RTO vs RPO — What They Actually Mean for Payments

These two acronyms get thrown around a lot, but in payment systems they carry specific weight. RTO (Recovery Time Objective) is how long you can be down before merchants start losing revenue and your SLA penalties kick in. RPO (Recovery Point Objective) is how much data you can afford to lose — and in payments, the answer is almost always "zero."

We settled on an RTO of 5 minutes and an RPO of 0 (synchronous replication for the transaction ledger, asynchronous for everything else). That RPO target drove most of our architecture decisions. You can't have zero data loss with async replication alone — we had to accept the latency cost of synchronous commits on the critical path.

Strategy            RTO                 RPO           Cost        Complexity
Active-Active       ~0 (near-instant)   0             Very High   Very High
Active-Passive      Minutes             0 – seconds   High        Moderate
Pilot Light         10 – 30 min         Minutes       Medium      Moderate
Backup & Restore    Hours               Hours         Low         Low

We went with Active-Passive. Active-Active sounds great on paper, but the conflict resolution complexity for financial transactions is brutal. When two regions both accept a payment for the same order at the same time, someone has to lose — and in payments, "someone loses" means money is wrong. Active-Passive gave us the RTO/RPO we needed without the distributed consensus headaches.

Our DR Architecture

The setup is multi-region PostgreSQL with streaming replication. Primary runs in us-east-1, standby in us-west-2. The transaction ledger uses synchronous replication (synchronous_commit = remote_apply) so we guarantee zero data loss on the critical path. Reporting tables and audit logs replicate asynchronously to keep write latency reasonable.
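One way to get that split on a single streaming-replication setup is to choose the durability level per transaction. Here's a minimal sketch of what that looks like from the application side, assuming psycopg2 and a standby already listed in synchronous_standby_names; the table names and connection string are illustrative, not our actual schema:

```python
# A sketch of per-transaction durability levels, assuming the standby is
# already listed in synchronous_standby_names on the primary. Table names
# and the connection string are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("host=payments-db.internal.example.com dbname=payments")

# Critical path: the commit does not return until the standby has *applied*
# the WAL, so an acknowledged ledger write survives a failover (RPO = 0).
with conn:
    with conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit TO 'remote_apply';")
        cur.execute(
            "INSERT INTO ledger_entries (order_id, amount_cents) VALUES (%s, %s);",
            ("ord_123", 4999),
        )

# Non-critical path: reporting and audit writes only wait for the local
# flush, keeping write latency reasonable; they still replicate, just async.
with conn:
    with conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit TO 'local';")
        cur.execute(
            "INSERT INTO audit_log (event, detail) VALUES (%s, %s);",
            ("payment_created", "order ord_123"),
        )
```

SET LOCAL scopes the setting to the current transaction, so the latency cost of synchronous commits stays confined to the ledger path.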

On the application side, every service resolves the database endpoint through a Route 53 health-checked CNAME with a 30-second TTL. We also switched our connection pooler (PgBouncer) to re-resolve DNS on every new connection rather than caching the IP. That single change would have saved us 20 minutes during the original incident.

Tip: Set your database DNS TTL to 30 seconds or less. The marginal DNS query cost is negligible compared to the minutes you'll save during a failover. Also configure your connection pooler to respect TTL — most don't by default.
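The same idea sketched at the application layer, for services that open connections directly rather than through PgBouncer: resolve the CNAME every time the pool creates a new connection, not once at startup. A minimal sketch assuming psycopg2; the hostname is a hypothetical placeholder:

```python
# A sketch of resolving DNS at connection-creation time instead of once at
# process startup. The hostname is a hypothetical placeholder.
import socket
import psycopg2

DB_HOST = "payments-db.internal.example.com"  # CNAME with a 30-second TTL


def new_connection():
    # Fresh lookup every time the pool needs a new connection, so a CNAME
    # repoint during failover is picked up as soon as the TTL expires.
    ip = socket.getaddrinfo(DB_HOST, 5432, proto=socket.IPPROTO_TCP)[0][4][0]
    # host= keeps TLS certificate verification working; hostaddr= tells
    # libpq to use the address we just resolved instead of its own lookup.
    return psycopg2.connect(host=DB_HOST, hostaddr=ip, dbname="payments",
                            connect_timeout=5)
```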

The Failover Runbook

After the incident, we formalized a six-step failover sequence. Every step has an owner, a timeout, and a rollback procedure. Here's the flow:

Detect Failure → Verify Outage → Switch DNS → Promote Replica → Validate Data → Resume Traffic
  1. Detect Failure — Automated health checks (Route 53 + custom probes) flag the primary as unhealthy. We require 3 consecutive failures over 30 seconds to avoid false positives.
  2. Verify Outage — On-call engineer confirms the outage isn't a monitoring blip. This is a human gate — we don't auto-failover for payments because a split-brain scenario is worse than a few minutes of downtime.
  3. Switch DNS — Update the Route 53 CNAME to point to the standby region. With a 30-second TTL, propagation completes within a minute.
  4. Promote Replica — Run pg_promote() on the standby. With streaming replication already caught up, this takes under 10 seconds.
  5. Validate Data — Run a quick integrity check: compare the last known transaction ID from the primary's WAL position against the promoted replica. Confirm no gaps. (Steps 3–5 are sketched in the code after this list.)
  6. Resume Traffic — Enable the health check endpoint on the new primary. Load balancers start routing payment traffic within seconds.

End to end, we've gotten this down to under 4 minutes in our last three game day runs.
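Steps 3 through 5 are the most scriptable part of that sequence. Here's a minimal sketch of what the automation can look like, assuming boto3 and psycopg2; the hosted-zone ID, hostnames, and credential handling are hypothetical placeholders rather than our production values:

```python
# failover.py: a minimal sketch of runbook steps 3-5, assuming boto3 and
# psycopg2 are installed. The hosted-zone ID, hostnames, and credential
# handling are hypothetical placeholders, not production values.
import boto3
import psycopg2

HOSTED_ZONE_ID = "Z0000000EXAMPLE"                  # hypothetical
DB_CNAME = "payments-db.internal.example.com"       # hypothetical
STANDBY_ENDPOINT = "pg-standby.us-west-2.internal"  # hypothetical


def switch_dns():
    """Step 3: repoint the database CNAME at the standby region (30 s TTL)."""
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "DR failover: point payments DB at the standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": DB_CNAME,
                    "Type": "CNAME",
                    "TTL": 30,
                    "ResourceRecords": [{"Value": STANDBY_ENDPOINT}],
                },
            }],
        },
    )


def promote_replica():
    """Step 4: promote the streaming replica; pg_promote() waits for completion."""
    conn = psycopg2.connect(host=STANDBY_ENDPOINT, dbname="payments")
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_promote();")  # waits up to 60 s by default
            if not cur.fetchone()[0]:
                raise RuntimeError("pg_promote() did not finish in time")
    finally:
        conn.close()


def validate_no_gap(last_known_primary_lsn):
    """Step 5: confirm the promoted node replayed at least the primary's last LSN."""
    conn = psycopg2.connect(host=STANDBY_ENDPOINT, dbname="payments")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT %s::pg_lsn <= pg_last_wal_replay_lsn();",
                (last_known_primary_lsn,),
            )
            if not cur.fetchone()[0]:
                raise RuntimeError("WAL gap detected, do not resume traffic")
    finally:
        conn.close()
```

The human gate from step 2 stays in place: an on-call engineer runs this, it doesn't trigger itself.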

Testing DR — Quarterly Game Days

The single most valuable thing we did was commit to quarterly game day exercises. Every three months, we simulate a regional failure during business hours (low-traffic window, but real production traffic). The on-call engineer runs the full runbook while the rest of the team observes.

The first game day was humbling. We discovered that a new microservice deployed two months prior had its database connection string hardcoded in a Kubernetes ConfigMap that nobody updated. The runbook didn't mention it because the service didn't exist when the runbook was written. That's the whole point — game days surface the drift that documentation can't keep up with.

Key Takeaway: Your DR plan degrades every single day you don't test it. New services get deployed, configs change, team members rotate. Quarterly game days are the minimum frequency to keep your runbook honest.

Common Mistakes

After going through this process and talking with other teams running payment infrastructure, the same mistakes keep coming up:

  1. Untested runbooks: a failover procedure that has never run against production traffic will fail in ways the document never anticipated.
  2. Hardcoded connection strings: database endpoints scattered across config files and ConfigMaps are the first thing a failover misses.
  3. Long DNS TTLs and cached IPs: connection pools that resolve once at startup keep hammering the dead primary long after the CNAME moves.
  4. Runbook drift: services deployed after the runbook was written simply aren't in it, and only regular game days catch that.
  5. Fully automated failover without a human gate: for payment systems, a split-brain primary is worse than a few minutes of downtime.

Cost vs Risk Tradeoff

Running a hot standby in a second region isn't cheap. For us, the DR infrastructure adds roughly 40% to our database hosting costs. That's a real number. But here's how we justified it: we calculated the cost of a 1-hour outage — lost transaction fees, SLA penalty payouts, merchant churn risk, and the engineering hours for incident response and post-mortem. One significant outage per year would cost more than the annual DR infrastructure spend.
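If you want to run the same comparison for your own system, the arithmetic fits in a few lines. Every number below is a hypothetical placeholder, not our actual figure:

```python
# Back-of-envelope outage-cost comparison. All numbers are hypothetical
# placeholders; substitute your own transaction volume, SLA terms, and spend.
outage_hours = 1
lost_fees_per_hour = 12_000        # hypothetical: fees not collected during the outage
sla_penalties = 8_000              # hypothetical: contractual payouts
incident_engineering_cost = 5_000  # hypothetical: response + post-mortem hours
churn_risk = 15_000                # hypothetical: expected merchant churn cost

one_outage_cost = (outage_hours * lost_fees_per_hour
                   + sla_penalties + incident_engineering_cost + churn_risk)

annual_dr_spend = 30_000           # hypothetical: the extra ~40% on DB hosting per year

print(f"single outage: ${one_outage_cost:,}  vs  annual DR spend: ${annual_dr_spend:,}")
# If one significant outage per year is plausible, the hot standby pays for itself.
```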

The math won't be the same for every team. If you're processing a handful of transactions per day, Backup & Restore with a 1-hour RTO might be perfectly fine. But once you're handling real volume with contractual SLAs, Active-Passive with synchronous replication pays for itself. The key is being honest about your actual risk tolerance rather than defaulting to the cheapest option and hoping for the best.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.