Every payment service I've worked on started with the same health check: a /health endpoint that returns 200 OK if the process is running. It works fine in staging. Then production happens.
The pod is "healthy" but the database connection pool is exhausted. The pod is "healthy" but the fraud rule engine hasn't finished loading. The pod is "healthy" but it's mid-shutdown and still receiving traffic from the load balancer. Each of these scenarios has caused real incidents on teams I've been part of — and every one was preventable with proper probe design.
## Understanding the Probe Lifecycle
Kubernetes gives you three distinct probes, and they fire in a specific order. Getting this sequence wrong is where most payment service issues start.
The startup probe runs first and blocks the other two until it succeeds. Once it passes, Kubernetes begins running readiness and liveness probes concurrently. This matters enormously for payment services because our cold start times are not trivial.
| Probe | Purpose | On Failure | When to Use |
|---|---|---|---|
| Startup | Gate for slow-starting containers | Pod is killed and restarted | Loading fraud rules, warming caches, running migrations |
| Readiness | Controls whether pod receives traffic | Removed from Service endpoints (no traffic) | Downstream dependency checks, connection pool health |
| Liveness | Detects deadlocked or hung processes | Pod is killed and restarted | Deadlock detection, unrecoverable states only |
The critical distinction: readiness failure removes traffic gracefully, liveness failure kills the pod. For payment services, killing a pod that has in-flight transactions is the worst possible outcome.
Warning: Never check downstream dependencies in your liveness probe. If your database goes down and your liveness probe fails, Kubernetes will restart all your pods simultaneously — turning a database blip into a full service outage with dozens of interrupted transactions.
## Startup Probes: Handling the Cold Start Problem
Our payment gateway takes roughly 47 seconds to fully initialize. It loads fraud detection rules from S3, warms the BIN lookup cache, establishes connection pools to three payment processors, and validates TLS certificates against the HSM. Without a startup probe, Kubernetes would kill the pod before it ever had a chance to serve traffic.
Here's the YAML configuration we settled on after several rounds of tuning:
```yaml
startupProbe:
  httpGet:
    path: /healthz/startup
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 12   # 5 + (5 * 12) = 65s max startup time
  successThreshold: 1
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
```
The `failureThreshold: 12` with `periodSeconds: 5` gives us a 65-second window for startup (5s initial delay plus 12 probes at 5s each). That covers our 47-second average with headroom for slow days.
## Readiness Probes: The Dependency Question
This is where it gets nuanced. Your readiness probe should answer: "Can this pod meaningfully process a payment right now?" That means checking the things that would cause a transaction to fail.
```go
func (s *Server) readinessHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	// Fail readiness during drain so the pod is removed from
	// Service endpoints while in-flight transactions complete
	if s.isShuttingDown.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "shutting_down",
		})
		return
	}

	// Check database connectivity
	if err := s.db.PingContext(ctx); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "database_unreachable",
		})
		return
	}

	// Check primary payment processor connection
	if !s.paymentClient.IsCircuitClosed() {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "payment_processor_circuit_open",
		})
		return
	}

	// Check fraud engine availability
	if !s.fraudEngine.IsLoaded() {
		w.WriteHeader(http.StatusServiceUnavailable)
		json.NewEncoder(w).Encode(map[string]string{
			"status": "not_ready",
			"reason": "fraud_rules_not_loaded",
		})
		return
	}

	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
}
```
A few things to note. The 2-second timeout on the context is deliberate: if your readiness check takes longer than your probe's `timeoutSeconds`, Kubernetes treats it as a failure. Keep the check fast. Also, we're checking the circuit breaker state for the payment processor rather than making a live call. Pinging Stripe or Adyen on every readiness check is wasteful and can trigger their rate limits.
## Keep Liveness Simple
The liveness probe should be almost trivially simple. Its job is to catch truly unrecoverable states — goroutine leaks, deadlocks, corrupted internal state. Not transient dependency failures.
```go
func (s *Server) livenessHandler(w http.ResponseWriter, r *http.Request) {
	// Only check if the process itself is functioning.
	// Do NOT check external dependencies here, and do not fail
	// liveness during shutdown either: that signal belongs in the
	// readiness probe, or Kubernetes would kill the pod mid-drain.
	w.WriteHeader(http.StatusOK)
}
```
## Graceful Shutdown: Draining In-Flight Transactions
This is the piece that took us the longest to get right. When Kubernetes sends SIGTERM, there's a race condition: the pod starts shutting down, but the Service endpoints haven't been updated yet. For a brief window, traffic is still being routed to a pod that's trying to die.
The preStop hook buys you time. We use it to stop accepting new transactions while draining the ones already in progress:
```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - "sleep 5 && /app/drain --timeout=25s"
```
The 5-second sleep is intentional. It gives kube-proxy and ingress controllers time to remove the pod from their routing tables. Without it, you'll see a burst of connection resets right after deployment starts.
On the application side, the drain command flips the readiness probe to unhealthy and waits for in-flight requests to complete:
```go
func (s *Server) gracefulShutdown(timeout time.Duration) {
	// Signal the readiness probe to return unhealthy
	s.isShuttingDown.Store(true)

	// Wait for in-flight transactions to complete. Note: s.wg is a
	// thin wrapper around sync.WaitGroup that also tracks a pending
	// count (the standard library type has no Count method).
	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		log.Info("all in-flight transactions completed")
	case <-time.After(timeout):
		log.Warn("shutdown timeout reached, forcing exit",
			"pending_transactions", s.wg.Count())
	}
}
```
Make sure your `terminationGracePeriodSeconds` in the pod spec is longer than your preStop sleep plus your drain timeout. We use `terminationGracePeriodSeconds: 45` to cover the 5-second sleep plus 25-second drain with margin.
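Putting the shutdown pieces together, note that these settings live at different levels of the spec: the grace period is a pod-level field, while the preStop hook belongs to the container. A sketch (the container name is illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 45   # pod-level: > 5s sleep + 25s drain
  containers:
    - name: payment-gateway
      lifecycle:                      # container-level
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5 && /app/drain --timeout=25s"]
```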
## Lessons from Production
After running this configuration across three payment services for over a year, a few patterns have held up:
- Readiness probes should be the most sophisticated of the three. They're your traffic control mechanism and the safest one to fail — no pod restarts, just traffic rerouting.
- Liveness probes should be dumb on purpose. Every dependency check you add is another way to trigger a cascading restart.
- Startup probes are non-negotiable for payment services. If you're loading fraud rules, warming caches, or establishing processor connections, you need the extra initialization window.
- Always test your probes under failure conditions, not just happy path. Kill the database in staging and watch what happens. You want readiness to fail and liveness to pass.
The difference between a 23-minute outage and a seamless failover came down to about 40 lines of probe configuration and handler code. It's not glamorous work, but in payment systems, the boring infrastructure decisions are the ones that matter most.
## References
- Kubernetes — Configure Liveness, Readiness and Startup Probes
- Kubernetes — Container Lifecycle Hooks
- Kubernetes — Pod Lifecycle
- Kubernetes — Feature Gates (Startup Probe)