The Argument That Never Ended
For about a year, our payment platform team was stuck in a loop. Product wanted to ship a new acquirer integration. Engineering wanted to pay down tech debt after a nasty settlement bug. The VP wanted both, yesterday. Every planning session devolved into the same philosophical debate: how reliable is reliable enough?
Nobody had a wrong answer because nobody had a framework. We were arguing about feelings. The on-call engineer who got paged at 3 AM last Tuesday had a very different risk tolerance than the PM who hadn't seen a customer complaint in weeks.
Then we adopted SLOs, and the arguments stopped. Not because everyone suddenly agreed, but because we replaced opinions with arithmetic.
Why "Five Nines" Is the Wrong Conversation
The first mistake teams make is reaching for a vanity number. "We need 99.999% uptime" sounds impressive in a slide deck, but for a payment authorization service processing 50,000 transactions per hour, that gives you roughly 26 seconds of downtime per month. That's less time than it takes to roll back a bad deploy.
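As a sanity check on that arithmetic, here's a small Python sketch (assuming a 30-day window, which is how the figures in this article are computed) that converts availability targets into allowed full-downtime per month:

```python
# Convert an availability target into allowed downtime per 30-day window.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day window

def allowed_downtime_seconds(availability: float) -> float:
    """Seconds of complete downtime permitted per month at this availability."""
    return SECONDS_PER_MONTH * (1 - availability)

for target in (0.999, 0.9995, 0.9999, 0.99999):
    print(f"{target:.5%}: {allowed_downtime_seconds(target):8.1f} s/month")
# five nines (0.99999) works out to ~25.9 seconds per month
```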
More importantly, five nines conflates availability with correctness. A payment service can be "up" and still silently double-charging customers or dropping settlement files. Uptime alone is a terrible SLI for financial systems.
Watch out: Don't copy SLO targets from blog posts (including this one). Your targets must come from your actual user expectations, transaction volume, and business risk tolerance. A 99.95% target that's grounded in reality beats a 99.99% target that nobody believes in.
The real question isn't "how many nines?" — it's "what do our users actually need, and how do we measure whether we're delivering it?"
Choosing the Right SLIs for Payment Systems
An SLI (Service Level Indicator) is the metric you actually measure. Picking the wrong SLI is worse than having no SLO at all, because you'll optimize for the wrong thing. After a lot of trial and error, we settled on three categories of SLIs for our payment platform:
| SLI | Type | SLO Target | How We Measure |
|---|---|---|---|
| Auth Success Rate | Availability | 99.95% | Ratio of non-5xx auth responses to total auth requests, measured at the gateway edge. Excludes issuer declines (those aren't our fault). |
| Auth Latency (p99) | Latency | < 1,200 ms | 99th percentile round-trip time from API ingress to response, measured via histogram buckets in Prometheus. Slow auths cause checkout abandonment. |
| Settlement Accuracy | Correctness | 99.99% | Ratio of settlement records that reconcile correctly against the ledger within 24 hours. Mismatches trigger manual review and potential financial loss. |
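To make the first row concrete, here's a minimal Python sketch of how an availability SLI like ours could be computed from raw counts (the function name and counts are invented for illustration; in practice this lives in Prometheus queries, shown later):

```python
def auth_success_sli(total_requests: int, server_errors_5xx: int) -> float:
    """Availability SLI: fraction of auth requests answered without a 5xx.
    Issuer declines return non-5xx codes, so they naturally count as
    successful responses here rather than as platform failures."""
    if total_requests == 0:
        return 1.0  # no traffic in the window: treat as meeting the SLO
    return (total_requests - server_errors_5xx) / total_requests

# Example: 1,000,000 requests with 300 gateway 5xx responses
ratio = auth_success_sli(1_000_000, 300)
print(f"{ratio:.4%}")  # 99.9700%
```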
Tip: Measure SLIs at the boundary closest to the user. For APIs, that's the load balancer or gateway — not the application process. Internal health checks lie; they'll report healthy while the load balancer is routing traffic into a black hole.
Setting Targets That Actually Mean Something
We set our auth success SLO at 99.95% — not 99.99%. Here's why. We looked at three months of historical data and found our actual success rate hovered around 99.97%. Setting the target at 99.99% would have meant we were already in violation most weeks, which makes the SLO meaningless. Setting it at 99.9% would have been too generous — we'd never burn through the budget, so it would never influence a decision.
The sweet spot is a target that you meet comfortably most of the time, but that a bad deploy or a degraded dependency can realistically threaten. That tension is what makes error budgets useful.
Error Budgets: The Math and the Politics
The error budget is simply the inverse of your SLO. If your SLO is 99.95% over a 30-day window, your error budget is 0.05% of total requests. On a service handling 36 million auth requests per month, that's 18,000 allowed failures, or roughly 21.6 minutes of complete downtime.
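That budget arithmetic is easy to script. A minimal sketch, using the illustrative 36 million requests per month from above:

```python
def error_budget(slo: float, monthly_requests: int, window_days: int = 30):
    """Return (allowed failed requests, equivalent full-downtime minutes)
    for one SLO window."""
    budget_fraction = 1 - slo
    allowed_failures = monthly_requests * budget_fraction
    downtime_minutes = window_days * 24 * 60 * budget_fraction
    return allowed_failures, downtime_minutes

failures, minutes = error_budget(0.9995, 36_000_000)
print(f"{failures:,.0f} failures, {minutes:.1f} min")  # 18,000 failures, 21.6 min
```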
The math is straightforward. The politics are not. The first time we told product management "we've burned 70% of our error budget this month, so we're freezing feature deploys for the remaining 9 days," it did not go well. That conversation only works if you've agreed on the policy before the crisis hits.
The Error Budget Policy Document
We wrote a one-page error budget policy and got sign-off from engineering leadership, product, and the VP. It covers four scenarios:
- Budget > 50% remaining: Ship normally. Feature work proceeds. Deploy at will.
- Budget 20–50% remaining: Increased caution. All deploys require a second reviewer. No risky migrations.
- Budget < 20% remaining: Feature freeze. Engineering focuses exclusively on reliability work — fixing flaky tests, improving rollback speed, addressing the incidents that burned the budget.
- Budget exhausted: Full stop. No changes to production except emergency patches. Post-incident review required before resuming normal operations.
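A policy this mechanical is easy to encode. Here's a sketch of the four tiers as a single decision function (the thresholds are from the one-pager above; the function name and tier labels are ours, chosen for illustration):

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map the remaining error-budget fraction (0.0 to 1.0) to a policy tier."""
    if budget_remaining <= 0.0:
        return "full-stop"       # emergency patches only
    if budget_remaining < 0.20:
        return "feature-freeze"  # reliability work only
    if budget_remaining <= 0.50:
        return "caution"         # second reviewer, no risky migrations
    return "ship-normally"       # deploy at will

print(deploy_policy(0.7))  # ship-normally
```

Wiring this into the deploy pipeline, rather than leaving it as a dashboard number, is what made the policy feel mechanical instead of discretionary.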
The key insight: this policy doesn't punish anyone. It's a mechanical response. When the budget recovers at the start of the next window, you go right back to shipping. That framing made it much easier for product to accept.
Burn Rate Alerts (Why Threshold Alerts Fail)
Before SLOs, we had a simple alert: "if auth success rate drops below 99.9% for 5 minutes, page someone." The problem? A brief blip at 99.85% would page us even though it barely dented the monthly budget. Meanwhile, a slow degradation to 99.93% over three days would never fire the alert — but it would silently eat the entire budget.
Burn rate alerts fix this. Instead of alerting on the raw error rate, you alert on how fast you're consuming your error budget. A burn rate of 1x means you'll exactly exhaust your budget by the end of the window. A burn rate of 6x means you'll burn through it in 5 days.
We use a multi-window approach: a 6x burn rate over a 1-hour window triggers a page (you're hemorrhaging budget), and a 3x burn rate over a 6-hour window triggers a ticket (slow bleed that needs attention before end of week).
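The relationship between burn rate and time-to-exhaustion is just division; a short sketch of the math behind those thresholds:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being consumed relative to a sustainable pace.
    A rate of 1.0 means the budget lasts exactly the full window."""
    return error_ratio / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a sustained burn rate, days until the window's budget is gone."""
    return float("inf") if rate <= 0 else window_days / rate

rate = burn_rate(0.003, 0.9995)  # 0.3% errors against a 99.95% SLO
print(f"{rate:.1f}x, {days_to_exhaustion(rate):.1f} days")  # 6.0x, 5.0 days
```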
Prometheus Recording Rules and Alerts
Here's a simplified version of what our Prometheus config looks like for the auth success SLO:
```yaml
# Recording rules: auth success ratio and burn rates over sliding windows
groups:
  - name: slo_auth_success
    interval: 30s
    rules:
      # Total auth requests in the last 30 days
      - record: slo:auth_requests:total_30d
        expr: sum(increase(payment_auth_total[30d]))

      # Failed auth requests (5xx only) in the last 30 days
      - record: slo:auth_errors:total_30d
        expr: sum(increase(payment_auth_total{status=~"5.."}[30d]))

      # Current error ratio (consumed budget fraction)
      - record: slo:auth_error_ratio:30d
        expr: slo:auth_errors:total_30d / slo:auth_requests:total_30d

      # Burn rate: 1-hour window
      - record: slo:auth_burn_rate:1h
        expr: |
          (
            sum(rate(payment_auth_total{status=~"5.."}[1h]))
            / sum(rate(payment_auth_total[1h]))
          ) / 0.0005  # 0.0005 = 1 - 0.9995 (the error budget fraction)

      # Burn rate: 6-hour window (referenced by the slow-burn alert below)
      - record: slo:auth_burn_rate:6h
        expr: |
          (
            sum(rate(payment_auth_total{status=~"5.."}[6h]))
            / sum(rate(payment_auth_total[6h]))
          ) / 0.0005

  - name: slo_auth_alerts
    rules:
      # Page: burning budget 6x faster than sustainable
      - alert: AuthSLOBurnRateCritical
        expr: slo:auth_burn_rate:1h > 6
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Auth error budget burning at {{ printf \"%.1f\" $value }}x rate"
          description: "Budget is being consumed {{ printf \"%.1f\" $value }}x faster than sustainable; at this pace the 30-day budget will not last the window."

      # Ticket: slow burn that needs investigation
      - alert: AuthSLOBurnRateWarning
        expr: slo:auth_burn_rate:6h > 3
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Auth error budget slow burn at {{ printf \"%.1f\" $value }}x rate"
```
Tip: Use Sloth or Pyrra to generate these recording rules from a declarative SLO spec. Hand-writing PromQL for every SLO doesn't scale, and it's easy to get the math wrong.
What Happens When You Burn Through Your Budget
It happened to us in month two. A card network changed their timeout behavior without notice, and our retry logic amplified the problem. We burned 80% of our auth error budget in 48 hours.
Under the old regime, this would have been a heated post-mortem followed by finger-pointing. Under the error budget policy, the response was mechanical: we entered the feature freeze, the team focused on fixing the retry logic and adding circuit breakers for that acquirer, and we resumed shipping 11 days later when the window rolled forward.
The surprising part? Product was fine with it. They could see the burn rate dashboard. They understood the math. And they knew the freeze was temporary and rule-based — not some engineer's opinion about what was "safe enough."
The Cultural Shift
Six months in, the biggest change wasn't technical. It was that reliability stopped being a separate concern from feature work. When the error budget is healthy, the SRE team actively encourages shipping. When it's tight, product helps prioritize reliability fixes because they understand the policy protects their users too.
We stopped having the velocity-vs-reliability argument entirely. The error budget answers the question for us, every single day, in a way that everyone can see on a shared dashboard. That's the real value of SLOs — not the alerting, not the math, but the shared language.
References
- Google SRE Workbook — Implementing SLOs
- Google SRE Book — Service Level Objectives
- OpenSLO Specification
- Prometheus — Alerting Best Practices
- Sloth — SLO Generation for Prometheus
- Pyrra — SLO Monitoring with Prometheus
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Specific metrics and thresholds mentioned are illustrative — your values will depend on your transaction volume, risk tolerance, and business requirements. Always verify with official documentation.