Every payment service I've worked on started with the same health check: a /health endpoint that returns 200 OK if the process is running. It works fine in staging. Then production happens.
The pod is "healthy" but the database connection pool is exhausted. The pod is "healthy" but the fraud rule engine hasn't finished loading. The pod is "healthy" but it's mid-shutdown and still receiving traffic from the load balancer. Each of these scenarios has caused real incidents on teams I've been part of — and every one was preventable with proper probe design.
## Understanding the Probe Lifecycle
Kubernetes gives you three distinct probes, and they fire in a specific order. Getting this sequence wrong is where most payment service issues start.
The startup probe runs first and blocks the other two until it succeeds. Once it passes, Kubernetes begins running readiness and liveness probes concurrently. This matters enormously for payment services because our cold start times are not trivial.
| Probe | Purpose | On Failure | When to Use |
|---|---|---|---|
| Startup | Gate for slow-starting containers | Pod is killed and restarted | Loading fraud rules, warming caches, running migrations |
| Readiness | Controls whether pod receives traffic | Removed from Service endpoints (no traffic) | Downstream dependency checks, connection pool health |
| Liveness | Detects deadlocked or hung processes | Pod is killed and restarted | Deadlock detection, unrecoverable states only |
The critical distinction: readiness failure removes traffic gracefully, liveness failure kills the pod. For payment services, killing a pod that has in-flight transactions is the worst possible outcome.
Warning: Never check downstream dependencies in your liveness probe. If your database goes down and your liveness probe fails, Kubernetes will restart all your pods simultaneously — turning a database blip into a full service outage with dozens of interrupted transactions.
## Startup Probes: Handling the Cold Start Problem
Our payment gateway takes roughly 47 seconds to fully initialize. It loads fraud detection rules from S3, warms the BIN lookup cache, establishes connection pools to three payment processors, and validates TLS certificates against the HSM. Without a startup probe, Kubernetes would kill the pod before it ever had a chance to serve traffic.
Here's the YAML configuration we settled on after several rounds of tuning:
```yaml
startupProbe:
  httpGet:
    path: /healthz/startup
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 12   # 5 + (5 * 12) = 65s max startup time
  successThreshold: 1
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
```
The `failureThreshold: 12` with `periodSeconds: 5` gives us a 65-second window for startup (5s initial delay plus 12 probes at 5s each). That covers our 47-second average with headroom for slow days.
## Readiness Probes: The Dependency Question
This is where it gets nuanced. Your readiness probe should answer: "Can this pod meaningfully process a payment right now?" That means checking the things that would cause a transaction to fail.
```go
func (s *Server) readinessHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	// Fail readiness during drain so the pod is removed from
	// Service endpoints while in-flight transactions complete
	if s.isShuttingDown.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "shutting_down",
		})
		return
	}

	// Check database connectivity
	if err := s.db.PingContext(ctx); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "database_unreachable",
		})
		return
	}

	// Check primary payment processor connection
	if !s.paymentClient.IsCircuitClosed() {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "payment_processor_circuit_open",
		})
		return
	}

	// Check fraud engine availability
	if !s.fraudEngine.IsLoaded() {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "fraud_rules_not_loaded",
		})
		return
	}

	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
}
```
A few things to note. The 2-second timeout on the context is deliberate: if your readiness check takes longer than your probe's `timeoutSeconds`, Kubernetes treats it as a failure. Keep the check fast. Also, we're checking the circuit breaker state for the payment processor rather than making a live call. Pinging Stripe or Adyen on every readiness check is wasteful and can trigger their rate limits.
## Keep Liveness Simple
The liveness probe should be almost trivially simple. Its job is to catch truly unrecoverable states — goroutine leaks, deadlocks, corrupted internal state. Not transient dependency failures.
```go
func (s *Server) livenessHandler(w http.ResponseWriter, r *http.Request) {
	// Only check if the process itself is functioning.
	// Do NOT check external dependencies here, and do not fail
	// liveness during shutdown either: that signal belongs in the
	// readiness probe, or Kubernetes would kill the pod mid-drain.
	w.WriteHeader(http.StatusOK)
}
```
## Graceful Shutdown: Draining In-Flight Transactions
This is the piece that took us the longest to get right. When Kubernetes sends SIGTERM, there's a race condition: the pod starts shutting down, but the Service endpoints haven't been updated yet. For a brief window, traffic is still being routed to a pod that's trying to die.
The preStop hook buys you time. We use it to stop accepting new transactions while draining the ones already in progress:
```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - "sleep 5 && /app/drain --timeout=25s"
```
The 5-second sleep is intentional. It gives kube-proxy and ingress controllers time to remove the pod from their routing tables. Without it, you'll see a burst of connection resets right after deployment starts.
On the application side, the drain command flips the readiness probe to unhealthy and waits for in-flight requests to complete:
```go
func (s *Server) gracefulShutdown(timeout time.Duration) {
	// Signal the readiness probe to return unhealthy
	s.isShuttingDown.Store(true)

	// Wait for in-flight transactions to complete. Note: s.wg is a
	// thin wrapper around sync.WaitGroup that also tracks a pending
	// count (the standard library type has no Count method).
	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		log.Info("all in-flight transactions completed")
	case <-time.After(timeout):
		log.Warn("shutdown timeout reached, forcing exit",
			"pending_transactions", s.wg.Count())
	}
}
```
Make sure your `terminationGracePeriodSeconds` in the pod spec is longer than your preStop sleep plus your drain timeout. We use `terminationGracePeriodSeconds: 45` to cover the 5-second sleep plus 25-second drain with margin.
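Putting the shutdown pieces together, note that these settings live at different levels of the spec: the grace period is a pod-level field, while the preStop hook belongs to the container. A sketch (the container name is illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 45   # pod-level: > 5s sleep + 25s drain
  containers:
    - name: payment-gateway
      lifecycle:                      # container-level
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5 && /app/drain --timeout=25s"]
```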
## Lessons from Production
After running this configuration across three payment services for over a year, a few patterns have held up:
- Readiness probes should be the most sophisticated of the three. They're your traffic control mechanism and the safest one to fail — no pod restarts, just traffic rerouting.
- Liveness probes should be dumb on purpose. Every dependency check you add is another way to trigger a cascading restart.
- Startup probes are non-negotiable for payment services. If you're loading fraud rules, warming caches, or establishing processor connections, you need the extra initialization window.
- Always test your probes under failure conditions, not just happy path. Kill the database in staging and watch what happens. You want readiness to fail and liveness to pass.
The difference between a 23-minute outage and a seamless failover came down to about 40 lines of probe configuration and handler code. It's not glamorous work, but in payment systems, the boring infrastructure decisions are the ones that matter most.
## References
- Kubernetes — Configure Liveness, Readiness and Startup Probes
- Kubernetes — Container Lifecycle Hooks
- Kubernetes — Pod Lifecycle
- Kubernetes — Feature Gates (Startup Probe)