I've been on the wrong end of a 2 AM PagerDuty alert more times than I'd like to admit. The first few times, it was chaos — scrambling through Slack threads, SSHing into the wrong box, and spending 20 minutes just figuring out what was actually broken. After a particularly ugly incident where a settlement batch failed silently for six hours, I decided we needed real playbooks. Not the kind that live in a Confluence page nobody reads, but ones that people actually reach for when things go sideways.
Payment incidents are a different beast from your typical web app outage. A 500 error on a blog post is annoying. A 500 error on a payment capture means real money is in a weird state somewhere between your system and the processor. The stakes change everything about how you respond.
Severity Classification: Not All Fires Are Equal
The single most impactful thing we did was get specific about severity levels. Generic "SEV1 means critical" definitions don't cut it for payments. You need to tie severity directly to financial impact and customer harm.
- SEV1: All payments failing or money moving incorrectly. Settlement halted. Immediate financial loss. All hands on deck — wake everyone up.
- SEV2: Partial payment failures (>5% error rate), single processor down, or delayed settlements. On-call + team lead engaged within 15 minutes.
- SEV3: Degraded performance, elevated latency on payment calls, or non-critical webhook delays. On-call investigates during business hours.
- SEV4: Cosmetic issues, minor logging gaps, or dashboard display bugs. Ticket created, fixed in next sprint. No pages, no drama.
The key insight: tie your severity to transaction volume affected, not just "is the service up." We had an incident where the API was returning 200s but silently dropping decimal precision on amounts. Technically "up," but charging people $1 instead of $1.50. That's a SEV1 in my book, even though every health check was green.
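The severity criteria above can be sketched as a small classifier. This is illustrative only — the thresholds, field names, and the 50% "most payments failing" cutoff are assumptions you'd tune to your own volume, but it captures the point that severity keys off transaction impact, not service uptime:

```python
from dataclasses import dataclass

@dataclass
class PaymentHealth:
    """Snapshot of payment-pipeline health used to pick a severity level."""
    error_rate: float          # fraction of payment attempts failing (0.0-1.0)
    settlement_halted: bool    # is money movement stopped entirely?
    amounts_incorrect: bool    # are we charging/settling wrong amounts?
    processor_down: bool       # is a single processor fully unavailable?
    latency_degraded: bool     # elevated latency on payment calls

def classify(h: PaymentHealth) -> str:
    # Money moving incorrectly is always SEV1, even if health checks are green.
    if h.settlement_halted or h.amounts_incorrect or h.error_rate >= 0.5:
        return "SEV1"
    if h.processor_down or h.error_rate > 0.05:
        return "SEV2"
    if h.latency_degraded or h.error_rate > 0.0:
        return "SEV3"
    return "SEV4"
```

Note that the decimal-precision incident above would trip the `amounts_incorrect` check and land at SEV1 despite a 0% error rate — exactly the case a naive "is the service up" check misses.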
The Response Flow: What Actually Happens When the Pager Goes Off
After iterating through a dozen incidents, we landed on a six-phase response flow: Detect, Triage, Mitigate, Communicate, Resolve, and Review. It sounds formal on paper, but in practice it's just a checklist that keeps you from skipping steps when your brain is running on adrenaline and bad coffee.
The part most teams get wrong is the Communicate step. During a payment incident, you're not just updating your engineering Slack channel. You've got compliance teams who need to know if cardholder data might be affected, finance teams tracking settlement exposure, and sometimes regulators with specific notification windows. We built a communication matrix that auto-triggers based on severity — SEV1 pages the CTO and compliance officer, SEV2 notifies the engineering lead and finance, and so on.
Mitigation Over Root Cause
This is the hardest lesson to internalize: during an active incident, your only job is to stop the bleeding. Don't debug. Don't try to understand why. If transactions are failing through Processor A, flip traffic to Processor B. If a bad deploy is causing amount mismatches, roll back first and ask questions later. I've watched engineers burn 45 minutes trying to find the root cause of a payment failure while transactions kept piling up in a failed state. Those 45 minutes cost real money.
We keep a list of "big red buttons" — pre-approved mitigation actions that any on-call engineer can take without asking permission:
- Kill switch to disable a specific payment processor and route to fallback
- Feature flag to pause non-critical payment operations (refunds, payouts) to reduce load
- Rollback to last known good deployment (one command, no approval needed)
- Circuit breaker override to force-open or force-close processor connections
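The first big red button — kill a processor and route to fallback — can be sketched as a tiny in-process switch. This is a simplification: a real kill switch would live in a shared config store so every instance sees the flip, and the class and method names here are invented for illustration:

```python
import threading

class ProcessorKillSwitch:
    """Pre-approved mitigation: disable a processor, route traffic to a fallback.

    In production this flag would be persisted in a shared config store so
    all instances converge; this sketch keeps it in memory for clarity.
    """
    def __init__(self, primary: str, fallback: str):
        self._primary = primary
        self._fallback = fallback
        self._disabled: set[str] = set()
        self._lock = threading.Lock()

    def disable(self, processor: str) -> None:
        with self._lock:
            self._disabled.add(processor)

    def enable(self, processor: str) -> None:
        with self._lock:
            self._disabled.discard(processor)

    def route(self) -> str:
        """Pick where the next transaction goes."""
        with self._lock:
            if self._primary in self._disabled:
                return self._fallback
            return self._primary

switch = ProcessorKillSwitch(primary="processor_a", fallback="processor_b")
switch.disable("processor_a")  # the one-command mitigation, no approval needed
```

The point is that the mitigation is a single pre-approved call, not a judgment-heavy debugging session at 2 AM.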
Metrics That Actually Matter
You can't improve what you don't measure, but you also can't measure everything without drowning in dashboards. After trying dozens of metrics, the ones that actually drive better incident response for payment systems are mean time to detect (MTTD), mean time to resolve (MTTR), and recurrence rate — how often the same root cause triggers a new incident.
We track these weekly and review trends monthly. The most telling metric is recurrence rate — it's the one that tells you whether your post-mortems are producing action items that actually get done, or just generating documents that collect dust.
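Computing these from incident records is straightforward. A sketch, assuming each incident record carries `started`/`detected` timestamps from monitoring and a `root_cause` tag (field names are illustrative):

```python
from datetime import datetime

def mttd_minutes(incidents: list[dict]) -> float:
    """Mean time to detect: minutes between failure start and first alert.

    Timestamps come from monitoring tools, not memory; 'started' and
    'detected' are assumed field names.
    """
    gaps = [
        (i["detected"] - i["started"]).total_seconds() / 60
        for i in incidents
    ]
    return sum(gaps) / len(gaps)

def recurrence_rate(incidents: list[dict]) -> float:
    """Fraction of incidents whose root cause was seen before."""
    seen: set[str] = set()
    repeats = 0
    for i in incidents:
        if i["root_cause"] in seen:
            repeats += 1
        seen.add(i["root_cause"])
    return repeats / len(incidents)
```

A rising recurrence rate is the signal that post-mortem action items are being written but not done.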
War story: We once had a settlement batch job that ran every night at 2 AM. One Tuesday, it silently failed because a database connection pool was exhausted — but the job's error handling swallowed the exception and marked the batch as "complete." We didn't catch it until Wednesday afternoon when a merchant called asking where their money was. That's 36 hours of undetected failure. The fix wasn't complicated (better health checks, actual validation that settlement files were non-empty), but the incident led us to build synthetic transaction monitoring — a canary payment that runs every 10 minutes through the full pipeline. If the canary dies, we know before any real merchant does. That single change dropped our MTTD from "whenever someone notices" to under 90 seconds.
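A canary like that can be a small function scheduled every 10 minutes. Here `charge`, `refund`, and `alert` are stand-ins for your payment client and paging hook (not real APIs), and amounts are in integer cents to sidestep exactly the precision bug described earlier:

```python
import time

def run_canary(charge, refund, alert, amount_cents: int = 100) -> bool:
    """Send a synthetic payment through the full pipeline and verify it.

    `charge`/`refund`/`alert` are injected stand-ins for the real payment
    client and paging hook; this is a sketch, not a specific vendor API.
    """
    started = time.monotonic()
    try:
        result = charge(amount_cents)
        # Verify the amount round-tripped exactly -- a 200 response is not enough.
        if result["amount_cents"] != amount_cents:
            alert(f"canary amount mismatch: {result['amount_cents']} != {amount_cents}")
            return False
        refund(result["id"])  # clean up the synthetic charge
        return True
    except Exception as exc:
        alert(f"canary failed after {time.monotonic() - started:.1f}s: {exc}")
        return False
```

The amount check matters as much as the success check: it would have caught both the silent settlement failure and the decimal-precision incident, where every response looked healthy.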
On-Call That Doesn't Burn People Out
Payment systems need 24/7 coverage. That's non-negotiable. But burning through your team with brutal on-call rotations is a fast track to attrition. Here's what's worked for us:
- Two-tier rotation: Primary on-call handles the page and initial triage. Secondary is a senior engineer who only gets pulled in for SEV1/SEV2. This means most nights, the secondary sleeps through undisturbed.
- One week on, three weeks off as a minimum ratio. If your team is too small for this, that's a staffing problem, not a process problem.
- Runbooks for every alert. If an alert fires and there's no runbook, that's a bug. Every PagerDuty alert links directly to a runbook with step-by-step instructions. New on-call engineers should be able to handle 80% of pages just by following the runbook.
- Compensatory time off. If someone gets paged at 3 AM and spends two hours on an incident, they start late the next day. No questions asked, no approval needed.
Post-Mortems: The Part Everyone Skips
I know, I know. Nobody likes writing post-mortems. But for payment systems, they're not optional — they're often a compliance requirement. More importantly, they're the mechanism that turns a bad night into a better system.
Our post-mortem template is deliberately short. If it takes more than an hour to write, it's too long and nobody will read it. We cover five things:
- Timeline — what happened, when, with timestamps from monitoring tools (not memory)
- Impact — number of failed transactions, total dollar amount affected, number of merchants/customers impacted
- Root cause — the actual technical cause, not "human error" (that's never a root cause)
- What went well — what parts of the response process worked. This is important for morale and for knowing what to keep doing
- Action items — specific, assigned, with due dates. "Improve monitoring" is not an action item. "Add alert for settlement batch file size dropping below 1KB, assigned to Sarah, due Friday" is
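The settlement-file example action item translates to a check of a few lines. The 1KB floor mirrors the example above and is an assumption — tune it to the smallest legitimate batch you ever expect:

```python
import os

def check_settlement_file(path: str, min_bytes: int = 1024) -> list[str]:
    """Validate that a settlement batch job actually produced output.

    Returns a list of problems; an empty list means the batch looks healthy.
    The 1KB threshold is illustrative.
    """
    problems = []
    if not os.path.exists(path):
        problems.append(f"settlement file missing: {path}")
    elif os.path.getsize(path) < min_bytes:
        problems.append(
            f"settlement file suspiciously small: "
            f"{os.path.getsize(path)} bytes < {min_bytes}"
        )
    return problems
```

Wiring a check like this to an alert is what turns "improve monitoring" from a vague aspiration into a done action item.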
The most important rule: post-mortems are blameless. The moment someone gets blamed for an incident, people start hiding mistakes instead of reporting them. And in payment systems, hidden mistakes compound fast.
Communication Templates Save Lives (Figuratively)
When you're in the middle of a SEV1 at 3 AM, you don't want to be wordsmithing a status update. We pre-wrote templates for every severity level and every audience — internal engineering, customer support, merchant-facing status page, and compliance/legal. The on-call engineer fills in the blanks (what's broken, what's the impact, what's the ETA) and hits send. It takes 30 seconds instead of 15 agonizing minutes of trying to sound professional while your hands are shaking from adrenaline.
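A fill-in-the-blanks template is nothing fancier than a string with named slots. The wording and fields below are invented for illustration — the real templates would exist per audience (engineering, support, status page, compliance):

```python
from string import Template

# Illustrative SEV1 internal-engineering template; field names are assumptions.
SEV1_INTERNAL = Template(
    "[SEV1] Payment incident in progress.\n"
    "What is broken: $what\n"
    "Impact: $impact\n"
    "Mitigation under way: $mitigation\n"
    "Next update by: $next_update"
)

def render_update(what: str, impact: str, mitigation: str, next_update: str) -> str:
    """Fill in the blanks -- 30 seconds instead of 15 minutes of wordsmithing."""
    return SEV1_INTERNAL.substitute(
        what=what, impact=impact, mitigation=mitigation, next_update=next_update
    )
```

Because `substitute` raises on a missing field, a half-filled update can't go out by accident — the on-call engineer either fills every blank or sends nothing.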
Start Small, Iterate Fast
You don't need to build all of this at once. If you're starting from zero, here's the order I'd recommend:
- Define your severity levels with payment-specific criteria
- Set up a single on-call rotation with PagerDuty or Opsgenie
- Write runbooks for your top 5 most common alerts
- Create one communication template for SEV1 incidents
- Run your first blameless post-mortem after the next incident
You can get through that list in a week. Then iterate. Every incident teaches you something, and every post-mortem action item makes the next incident a little less painful. After a year of this, you'll look back and wonder how you ever operated without it.
References
- PagerDuty Incident Response Documentation
- Google SRE Book — Managing Incidents
- Atlassian Incident Management Guide
- AWS Well-Architected Framework — Reliability Pillar
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.