April 6, 2026

Monitoring and Observability for Payment Systems — What to Track and Why It Matters

Your dashboards are green, your health checks pass, and yet merchants are calling to ask why their settlements are missing. I've been there. Here's what I've learned about building observability that actually catches payment failures before your customers do.

Let me tell you about the worst Monday morning of my career. I walked into the office, coffee in hand, feeling pretty good about the weekend deploy. Then our head of finance pinged me: "Hey, settlement files from Friday night look empty. Did something change?" Turns out, our payment processor had started returning a new error code we weren't handling, and our system was silently swallowing failures for 60 hours. Our monitoring? All green. Every single dashboard was happy. That's the day I learned the difference between monitoring and observability — the hard way.

Monitoring vs. Observability: They're Not the Same Thing

I used to use these words interchangeably. Most engineers do. But after running payment infrastructure for a few years, the distinction became painfully clear to me.

Monitoring answers known questions: Is the service up? Is CPU below 80%? Are we getting 5xx errors? You define the questions in advance, set thresholds, and get alerted when something crosses a line. It's reactive by nature — you're watching for problems you've already imagined.

Observability lets you ask questions you didn't know you'd need to ask. When a merchant reports that their refunds are taking 48 hours instead of the usual 4, you need to slice and dice your telemetry data — by merchant ID, by processor, by refund amount range, by time of day — to figure out what's different. You can't pre-build a dashboard for every possible failure mode in a payment system. There are too many moving parts.

The practical difference: Monitoring tells you that something is broken. Observability helps you figure out why it's broken, even when the failure mode is something you've never seen before. For payment systems, you need both — monitoring for the known failure modes, observability for the inevitable surprises.

The Four Golden Signals, Payment Edition

Google's SRE book introduced the four golden signals — latency, traffic, errors, and saturation. They're a solid framework, but for payment systems, each one needs a payment-specific lens. Here's how I think about them:

Signal     | Target         | What to watch
-----------|----------------|--------------------------------------------------------------
Latency    | p99 < 800ms    | End-to-end auth call time. Track per processor: a slow Stripe call vs. a slow Adyen call tells very different stories.
Traffic    | TPS by type    | Transactions per second, split by auth, capture, refund, and payout. A sudden TPS drop is often worse than a spike.
Errors     | < 0.1% hard    | Distinguish soft declines (insufficient funds) from hard errors (system failures). Only hard errors are your problem.
Saturation | < 70% capacity | Connection pools, queue depth, rate-limit headroom. Hit your processor's rate limit and you're dropping real revenue.

The traffic signal is the one that catches people off guard. Everyone alerts on spikes, but a drop in payment volume at 2 PM on a Tuesday is often a bigger red flag than a spike. It might mean your checkout page is broken, your payment form JS failed to load, or a processor is rejecting everything silently. We set up anomaly detection on expected transaction volume by hour-of-day and day-of-week, and it's caught issues that threshold-based alerts completely missed.
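
If you want to build something similar, here's a minimal sketch of that kind of baseline check in plain Python, assuming you keep hourly transaction counts keyed by day-of-week and hour. The 60% drop threshold is an illustrative number, not the one we actually run in production.

```python
# A minimal sketch of volume anomaly detection keyed by (day-of-week, hour).
# The 0.6 drop threshold is an illustrative assumption.
from collections import defaultdict
from datetime import datetime
from statistics import mean


def build_baseline(history: list[tuple[datetime, int]]) -> dict[tuple[int, int], float]:
    """history: (hour timestamp, transactions seen in that hour) pairs from past weeks."""
    buckets: dict[tuple[int, int], list[int]] = defaultdict(list)
    for ts, count in history:
        buckets[(ts.weekday(), ts.hour)].append(count)
    return {key: mean(counts) for key, counts in buckets.items()}


def volume_is_anomalous(now: datetime, current_count: int,
                        baseline: dict[tuple[int, int], float],
                        drop_threshold: float = 0.6) -> bool:
    """Flag when this hour's volume falls below 60% of its historical average."""
    expected = baseline.get((now.weekday(), now.hour))
    if not expected:
        return False  # no history for this bucket; don't alert blindly
    return current_count < expected * drop_threshold
```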

The Metrics That Actually Matter

After years of iterating on dashboards, I've narrowed it down to the metrics I actually look at every morning. Everything else is noise until an incident makes it relevant.

Authorization Rate

This is the single most important metric in payment operations. Your auth rate is the percentage of payment attempts that get approved by the issuing bank. A healthy auth rate for card-not-present transactions sits around 85-92%, depending on your merchant category and geography. When this drops even 2 percentage points, something is wrong — maybe you're sending bad data to the processor, maybe a BIN range is having issues, or maybe your retry logic is hammering a processor that's already struggling.

We track auth rate per processor, per card network, per BIN country, and per merchant. The per-merchant view is critical because a single merchant sending garbage data can tank your aggregate numbers and mask the real picture.
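
To make that concrete, here's a rough sketch of how the underlying counter could look with the Python Prometheus client. The metric and label names are my own placeholders, and a per-merchant label is only viable if your merchant count keeps the label cardinality manageable.

```python
# Sketch: counting auth outcomes with per-dimension labels so auth rate can be
# computed as approved / attempted along any axis. Names are illustrative.
from prometheus_client import Counter

AUTH_ATTEMPTS = Counter(
    "payment_auth_attempts_total",
    "Authorization attempts by outcome",
    ["processor", "network", "bin_country", "merchant_id", "outcome"],
)


def record_auth(processor: str, network: str, bin_country: str,
                merchant_id: str, approved: bool) -> None:
    # Beware label cardinality: merchant_id works for hundreds of merchants,
    # not for millions.
    AUTH_ATTEMPTS.labels(
        processor=processor,
        network=network,
        bin_country=bin_country,
        merchant_id=merchant_id,
        outcome="approved" if approved else "declined",
    ).inc()

# Auth rate per processor, conceptually (PromQL):
#   sum by (processor) (rate(payment_auth_attempts_total{outcome="approved"}[5m]))
# / sum by (processor) (rate(payment_auth_attempts_total[5m]))
```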

Settlement Lag

The time between capture and actual settlement into the merchant's account. This is where money "goes missing" in the eyes of your merchants, even though it's just in transit. We track the delta between expected settlement time (based on processor SLAs) and actual settlement confirmation. When the lag exceeds the SLA by more than 2 hours, an alert fires. I can't tell you how many times this has caught a stuck batch file or a failed SFTP transfer before the merchant noticed.
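
The check itself is simple. Here's a small Python sketch of the idea; the processor names, SLA values, and 2-hour grace window are illustrative placeholders, not anyone's real SLAs.

```python
# Sketch: flag captures that blow past the processor's settlement SLA plus a
# grace window. Processor names, SLAs, and the grace window are examples only.
from datetime import datetime, timedelta

SETTLEMENT_SLA = {
    "processor_a": timedelta(hours=24),
    "processor_b": timedelta(hours=48),
}
ALERT_GRACE = timedelta(hours=2)


def settlement_overdue(processor: str, captured_at: datetime,
                       settled_at: datetime | None, now: datetime) -> bool:
    """True when a captured transaction has exceeded SLA + grace without settling."""
    deadline = captured_at + SETTLEMENT_SLA[processor] + ALERT_GRACE
    if settled_at is not None:
        return settled_at > deadline   # settled, but later than promised
    return now > deadline              # still unsettled and already overdue
```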

Webhook Delivery Rate

If you're receiving webhooks from processors (and you should be), track delivery success rate and processing latency. A webhook backlog means your system's view of transaction states is stale. We had an incident where our webhook endpoint was returning 200s but a downstream queue was full, so events were being acknowledged but never processed. Transactions showed as "pending" for hours. Now we track end-to-end: webhook received, parsed, processed, and state updated — each step independently.
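
A minimal sketch of that per-stage tracking, again using the Python Prometheus client. The stage names mirror the steps above, but the metric schema itself is just an example.

```python
# Sketch: count each webhook stage separately so a full downstream queue shows up
# as "received" climbing while "state_updated" flatlines. Names are illustrative.
from prometheus_client import Counter

WEBHOOK_EVENTS = Counter(
    "webhook_events_total",
    "Webhook events by processing stage",
    ["processor", "stage"],  # stage: received, parsed, processed, state_updated
)


def mark_stage(processor: str, stage: str) -> None:
    WEBHOOK_EVENTS.labels(processor=processor, stage=stage).inc()

# The signal to alert on is a sustained gap between stages, e.g. (PromQL, conceptually):
#   rate(webhook_events_total{stage="received"}[5m])
#     - rate(webhook_events_total{stage="state_updated"}[5m])
```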

Warning: Don't rely solely on webhooks for transaction state. Always implement a polling reconciliation job that runs every 15-30 minutes to catch anything webhooks missed. Webhooks are best-effort delivery — treat them that way. I've seen teams build their entire state machine around webhook events and then wonder why transactions get stuck when a webhook silently fails.
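
Here's the rough shape of such a reconciliation pass in Python. The local_store and processor_client interfaces are hypothetical stand-ins for whatever your persistence layer and processor SDK actually expose.

```python
# Sketch of a reconciliation pass. local_store and processor_client are
# hypothetical interfaces standing in for your persistence layer and processor SDK.
def reconcile_pending(local_store, processor_client) -> int:
    """Poll the processor for every transaction we still believe is in flight."""
    repaired = 0
    for txn in local_store.find_non_terminal(older_than_minutes=15):
        remote = processor_client.get_transaction(txn.id)
        if remote.state != txn.state:
            # Repair state the webhook never delivered (or delivered and we dropped).
            local_store.update_state(txn.id, remote.state)
            repaired += 1
    return repaired

# Run from a scheduler (cron, Celery beat, etc.) every 15-30 minutes.
```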

The Observability Stack: What Goes Where

There's no single tool that does everything well. After trying various combinations, here's the stack that's worked best for us in production payment infrastructure:

Data Sources: Payment API, Processor SDKs, Settlement Jobs, Webhook Workers
Collection & Instrumentation: OpenTelemetry SDK, Prometheus Client, Filebeat / Fluentd
Storage & Query: Prometheus / Thanos, Elasticsearch, Jaeger / Tempo
Visualization & Alerting: Grafana Dashboards, Kibana, Alertmanager → PagerDuty

Prometheus + Grafana handles metrics — auth rates, latency percentiles, TPS, queue depths. We use Thanos on top for long-term retention because you absolutely need to compare this month's settlement patterns against last month's. Payment data is seasonal, and without historical context, you'll chase false positives every holiday weekend.

ELK (Elasticsearch, Logstash, Kibana) handles structured logs. Every payment event gets a structured log entry with transaction ID, processor, amount, currency, response code, and timing. When something goes wrong, I can query Kibana for all transactions from a specific merchant in the last hour that got response code 05 from Visa. That kind of ad-hoc investigation is where observability earns its keep.
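
For illustration, a payment event logged that way might look like the sketch below, using nothing beyond the Python standard library. The field names mirror the ones above; the exact schema is yours to choose.

```python
# Sketch: one structured log line per payment event, using only the standard
# library. Field names are illustrative; pick a schema and keep it consistent.
import json
import logging
import time

logger = logging.getLogger("payments")


def log_payment_event(event: str, txn_id: str, processor: str, amount_minor: int,
                      currency: str, response_code: str, duration_ms: float) -> None:
    logger.info(json.dumps({
        "event": event,                # e.g. "auth_response"
        "transaction_id": txn_id,
        "processor": processor,
        "amount_minor": amount_minor,  # minor units avoid float rounding on money
        "currency": currency,
        "response_code": response_code,
        "duration_ms": round(duration_ms, 1),
        "ts": time.time(),
    }))
```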

Jaeger or Grafana Tempo for distributed tracing. A single payment can touch your API gateway, fraud service, tokenization vault, processor adapter, and webhook handler. Without traces, debugging latency issues across that chain is pure guesswork. We tag every trace with the transaction ID so we can jump from a log entry straight to the full trace.
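
Here's roughly what that tagging looks like with the OpenTelemetry Python API. The tracer name, attribute keys, and the send_to_processor stub are placeholders rather than a real adapter.

```python
# Sketch: wrap the processor call in a span and tag it with the transaction ID so
# a log entry can be joined to its trace. Names and the stub call are placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("payments.processor_adapter")


def send_to_processor(processor: str, request: dict) -> dict:
    """Placeholder for the real processor SDK call."""
    raise NotImplementedError


def authorize(txn_id: str, processor: str, request: dict) -> dict:
    with tracer.start_as_current_span("processor.authorize") as span:
        span.set_attribute("payment.transaction_id", txn_id)
        span.set_attribute("payment.processor", processor)
        response = send_to_processor(processor, request)
        span.set_attribute("payment.response_code", response.get("code", "unknown"))
        return response
```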

Tip: Use OpenTelemetry as your instrumentation layer. It gives you metrics, logs, and traces through a single SDK, and you can swap backends without changing application code. We migrated from Jaeger to Tempo last year and didn't touch a single line of application code — just reconfigured the OTel collector.

Alerting Without the Fatigue

Alert fatigue is real, and it's dangerous. When your on-call engineer gets 50 alerts a day, they stop reading them. Then the one alert that actually matters — the one about auth rates dropping to 40% — gets lost in the noise. I've seen this happen, and it cost us about four hours of degraded payment processing.

Here's the alerting philosophy that finally worked for us:

  1. Page-worthy alerts only go to PagerDuty. If it doesn't require immediate human action, it's not a page. Auth rate below 70%? Page. Single webhook retry failing? Ticket.
  2. Use burn-rate alerting for SLOs. Instead of "error rate > 1%," we alert on "at the current error rate, we'll burn through our monthly error budget in 6 hours." This catches slow-burn degradation that fixed thresholds miss entirely (see the sketch after this list).
  3. Aggregate before alerting. Don't alert on individual transaction failures. Alert on failure rates over a 5-minute window. A single timeout to Stripe is noise. Ten timeouts in 5 minutes is a signal.
  4. Separate business hours from off-hours. A 2% auth rate drop at 3 PM gets a Slack notification. The same drop at 3 AM gets a page. Context matters.
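
For the burn-rate approach in particular, the math is simple enough to sketch. Assuming a 99.9% availability SLO over a 30-day window and a 6-hour page threshold (all example numbers), it looks like this:

```python
# Sketch of the burn-rate math: how quickly the current error rate consumes a
# monthly error budget. The 99.9% SLO and 6-hour page threshold are example values.
def hours_until_budget_exhausted(observed_error_rate: float,
                                 slo: float = 0.999,
                                 window_hours: float = 30 * 24) -> float:
    """Hours until the monthly error budget is gone at the current error rate."""
    error_budget = 1.0 - slo  # e.g. 0.1% of requests may fail over the window
    if observed_error_rate <= 0:
        return float("inf")
    burn_rate = observed_error_rate / error_budget  # 1.0 means exactly on budget
    return window_hours / burn_rate


def should_page(observed_error_rate: float, page_below_hours: float = 6.0) -> bool:
    return hours_until_budget_exhausted(observed_error_rate) < page_below_hours
```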

We also run a weekly "alert review" where we look at every alert that fired in the past 7 days and ask: did this require action? If an alert fires repeatedly without needing human intervention, it gets tuned or deleted. Our alert-to-action ratio went from about 15% to over 70% after three months of this practice.

War story: Early on, we had an alert for "processor response time > 500ms." It fired constantly during peak hours because, well, sometimes processors are slow for a few seconds. The on-call engineer muted it after the third night of false alarms. Two weeks later, a processor had a genuine degradation that crept from 600ms to 3 seconds over an hour. Nobody noticed because the alert was muted. We replaced it with a burn-rate alert on our latency SLO, and it's been reliable ever since — it only fires when the slowness is sustained enough to actually matter.

What I'd Build First

If you're starting from scratch or inheriting a payment system with minimal observability, here's the order I'd tackle things:

  1. Instrument your payment API with OpenTelemetry — get basic request metrics and traces flowing into Prometheus and Jaeger/Tempo
  2. Build a single Grafana dashboard with auth rate, TPS, p99 latency, and error rate — broken down by processor
  3. Set up three alerts: auth rate drop, error rate spike, and settlement lag exceeding SLA
  4. Add structured logging for every payment state transition with transaction IDs that correlate across services
  5. Implement a synthetic canary transaction that runs through your full payment pipeline every 10 minutes (see the sketch below)
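
Step 5 is the one teams tend to skip, so here's a rough sketch of what a canary can look like. The payments_api client, its authorize and void methods, and the test token are placeholders for whatever your own stack exposes.

```python
# Sketch of a canary: a small auth followed by an immediate void through the real
# pipeline, recording the outcome as a metric. payments_api, its authorize/void
# methods, and the test token are placeholders, not a real SDK.
from prometheus_client import Counter

CANARY_RESULTS = Counter("payment_canary_total",
                         "Canary transaction outcomes", ["outcome"])


def run_canary(payments_api) -> None:
    try:
        auth = payments_api.authorize(amount_minor=100, currency="USD",
                                      card_token="tok_canary_test")
        payments_api.void(auth["id"])  # never capture the canary
        CANARY_RESULTS.labels(outcome="success").inc()
    except Exception:
        CANARY_RESULTS.labels(outcome="failure").inc()
        raise  # surface the failure to the scheduler and alerting as well

# Schedule every 10 minutes with cron or your job runner of choice.
```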

You can get steps 1-3 done in a week. Steps 4-5 take another week or two. After that, you'll have better visibility into your payment system than 90% of the teams I've worked with. The rest — custom dashboards, anomaly detection, SLO burn-rate alerts — you build iteratively as incidents teach you what you're missing.

The goal isn't perfect observability. It's making sure that when something goes wrong with money, you find out before your merchants do.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. The metrics, thresholds, and architectural patterns described here are based on specific production environments and may not directly apply to your use case — always validate against your own requirements and consult official documentation.