The Incident That Changed Everything
It was a Thursday around 2 PM. We pushed a minor config change to our payment processing service — a Go app running on Kubernetes that handles card authorizations and settlement requests. The rolling deploy kicked in, pods got terminated, and within seconds our Slack lit up. Forty-seven transactions failed mid-flight. Customers saw "payment failed" screens. Merchants started filing support tickets. Total damage: $8.2K in dropped transactions and a very uncomfortable post-mortem.
The root cause was embarrassingly simple: our service had no graceful shutdown logic. When Kubernetes sent SIGTERM, the process just died. Any HTTP request being handled, any goroutine talking to a payment gateway, any database write in progress — all killed instantly.
Why Naive Shutdown Destroys Payment Services
Most web services can tolerate a few dropped requests during deploys. A user refreshes the page and retries. But payment services are different. A half-completed authorization can leave money in limbo. A dropped settlement call means the merchant doesn't get paid. And depending on the payment processor, some of these operations aren't safely idempotent — retrying blindly can double-charge a customer.
| Behavior | Naive Shutdown | Graceful Shutdown |
|---|---|---|
| In-flight HTTP requests | Killed immediately, clients get connection reset | Allowed to complete within timeout window |
| Background goroutines | Orphaned mid-execution, no cleanup | Signaled to finish, waited on with WaitGroup |
| Database connections | Dropped, potential partial writes | Flushed and closed after all work completes |
| Data loss risk | High — any active transaction can be lost | Near-zero within timeout budget |
| User impact | Payment failures, double charges on retry | Transparent — users don't notice deploys |
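That retry risk is worth mitigating independently of shutdown handling. Below is a minimal sketch, assuming a hypothetical gateway that deduplicates on an Idempotency-Key header (a convention popularized by Stripe); the ChargeRequest fields and endpoint path are illustrative, not our actual integration.

package payments

import (
	"bytes"
	"context"
	"encoding/json"
	"net/http"
)

// ChargeRequest is an illustrative request body.
type ChargeRequest struct {
	AmountCents int64  `json:"amount_cents"`
	Currency    string `json:"currency"`
	CardToken   string `json:"card_token"`
}

// Charge sends an authorization tagged with a client-generated
// idempotency key. Generate the key once per logical charge (a UUID
// works) and reuse it on every retry, so an ambiguous failure such as
// a timeout or a killed pod can be retried without double-charging.
func Charge(ctx context.Context, gatewayURL string, req ChargeRequest, idempotencyKey string) (*http.Response, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, gatewayURL+"/charges", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	httpReq.Header.Set("Idempotency-Key", idempotencyKey)
	return http.DefaultClient.Do(httpReq)
}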
The Shutdown Sequence That Actually Works
After the incident, we designed a strict shutdown ordering. The sequence matters — get it wrong and you'll drain requests that are still writing to a database connection you already closed.
Signal Handling and Server Shutdown
The foundation is catching SIGTERM (sent by Kubernetes) and SIGINT (for local dev with Ctrl+C). Go's os/signal package makes this straightforward:
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: newRouter(), // application routes, defined elsewhere
	}

	// Start the server in a goroutine so main can block on the signal channel
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for a termination signal
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	sig := <-quit
	log.Printf("received signal %s, starting graceful shutdown", sig)

	// Give in-flight requests 30 seconds to complete
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("server forced to shutdown: %v", err)
	}
	log.Println("server exited cleanly")
}
The key detail: http.Server.Shutdown() closes the listener immediately (no new connections) but waits for active requests to finish. If they don't finish before the context deadline, Shutdown returns the context's error without killing them; it's the subsequent process exit (or an explicit srv.Close()) that cuts off whatever is still running. For payment services, that 30-second window is critical — some payment gateway calls take 10-15 seconds on a slow day.
Draining Background Workers
HTTP handlers are only half the story. We had goroutines processing async payment callbacks, running settlement batches, and retrying failed charges. These need their own shutdown coordination:
import (
	"fmt"
	"sync"
	"time"
)

// PaymentJob and processPayment are application-specific and defined elsewhere.
type PaymentWorkerPool struct {
	wg   sync.WaitGroup
	quit chan struct{}
	jobs chan PaymentJob
}

func (p *PaymentWorkerPool) Start(workers int) {
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for {
				select {
				case job, ok := <-p.jobs:
					if !ok {
						return // channel closed, exit
					}
					p.processPayment(job)
				case <-p.quit:
					return // shutdown signaled, stop picking up new jobs
				}
			}
		}()
	}
}

func (p *PaymentWorkerPool) Shutdown(timeout time.Duration) error {
	close(p.quit) // broadcast the stop signal to all workers

	// Wait for the WaitGroup in a goroutine so the wait can be bounded
	done := make(chan struct{})
	go func() {
		p.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("worker shutdown timed out after %v", timeout)
	}
}
The pattern: close the quit channel to broadcast the stop signal, then use sync.WaitGroup to wait for all workers to finish their current job. We don't close the jobs channel immediately — that would panic if a handler tries to send. Instead, workers check quit between jobs.
Key takeaway: Shutdown ordering is everything. Close the front door first (stop accepting requests), then wait for everyone inside to leave (drain workers), and only then turn off the lights (close DB connections). Reversing any of these steps will cause data loss under load.
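To make that ordering concrete, here's a minimal sketch of main's shutdown path, assuming srv (*http.Server), pool (the PaymentWorkerPool above), and db (*sql.DB) are all in scope; the exact timeout values are discussed in the next sections.

// Phases one and two overlap: signal workers to stop while HTTP drains
workerErr := make(chan error, 1)
go func() {
	workerErr <- pool.Shutdown(20 * time.Second)
}()

// Close the front door: stop the listener and drain in-flight requests
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil {
	log.Printf("http drain incomplete: %v", err)
}

// Wait for everyone inside to leave before touching shared resources
if err := <-workerErr; err != nil {
	log.Printf("worker drain incomplete: %v", err)
}

// Only now turn off the lights: close database connections
if err := db.Close(); err != nil {
	log.Printf("db close: %v", err)
}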
Kubernetes Coordination
Getting the Go code right is only half the battle. Kubernetes has its own shutdown sequence, and if you don't coordinate with it, you'll still drop requests. The problem: when a pod enters Terminating state, Kubernetes removes it from the Service endpoints and sends SIGTERM roughly in parallel. There's a race condition — the load balancer might still route traffic to your pod for a few seconds after you've started shutting down.
The fix is a preStop hook that adds a small delay before your process receives the signal:
# lifecycle is a container-level field:
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]
# terminationGracePeriodSeconds is a pod-level field, a sibling of containers:
terminationGracePeriodSeconds: 45
That 5-second sleep gives kube-proxy and your ingress controller time to update their routing tables. Meanwhile, your readiness probe should start failing as soon as shutdown begins, so no new traffic arrives during the drain window. We set terminationGracePeriodSeconds to 45 — that's 5 seconds for the preStop hook plus 30 seconds for our application's shutdown timeout, with 10 seconds of buffer before Kubernetes sends SIGKILL.
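One way to make the readiness probe fail during the drain window is an atomic flag that the probe handler checks; this is a sketch of the idea, not necessarily how our service implements it.

import (
	"net/http"
	"sync/atomic"
)

// shuttingDown is flipped as soon as the termination signal arrives.
var shuttingDown atomic.Bool

// readyHandler backs the readinessProbe endpoint. Returning 503 during
// the drain window tells Kubernetes to stop routing new traffic to this
// pod while in-flight requests finish.
func readyHandler(w http.ResponseWriter, r *http.Request) {
	if shuttingDown.Load() {
		http.Error(w, "shutting down", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

In main, call shuttingDown.Store(true) immediately after receiving the signal, before srv.Shutdown begins the drain.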
The Timeout Strategy
Choosing the right timeout is a balancing act. Too short and you kill in-flight payments. Too long and deploys take forever, and Kubernetes might SIGKILL you anyway.
Here's what we landed on after profiling our p99 request latencies:
- HTTP server drain: 30 seconds — covers our slowest payment gateway round-trips
- Worker pool drain: 20 seconds — workers process jobs that are already dequeued
- Total application timeout: 35 seconds — less than 30 + 20 because the phases overlap: workers start draining alongside the HTTP server rather than after it
- Kubernetes terminationGracePeriodSeconds: 45 seconds — always higher than your app timeout
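Expressed as Go constants (values copied from the list above, names illustrative, time imported), the budget and the invariant that ties it together look like this:

const (
	preStopSleep     = 5 * time.Second  // Kubernetes preStop hook delay
	httpDrainTimeout = 30 * time.Second // context passed to srv.Shutdown
	workerDrain      = 20 * time.Second // budget passed to pool.Shutdown
	appShutdownMax   = 35 * time.Second // worst case; phases overlap
	k8sGracePeriod   = 45 * time.Second // terminationGracePeriodSeconds
)

// The invariant to preserve when tuning any of these:
//   preStopSleep + appShutdownMax < k8sGracePeriod
// If it breaks, Kubernetes sends SIGKILL mid-drain and you are back
// to the naive-shutdown failure mode.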
If a worker is still stuck after the timeout, we log the state of the in-progress transaction and exit. The transaction will be in an inconsistent state, but our reconciliation job picks it up within 15 minutes and either completes or reverses it. That's the safety net — graceful shutdown handles 99.9% of cases, and reconciliation catches the rest.
After the Fix
We've done over 400 deploys since implementing graceful shutdown. Zero dropped transactions. Deploys during peak hours are a non-event now. The monitoring dashboards don't even blip.
The whole implementation was maybe 150 lines of Go code and a few lines of Kubernetes config. The hard part wasn't the code — it was understanding the shutdown ordering and the Kubernetes timing quirks. If you're running any service that handles money, this isn't optional. It's table stakes.
References
- Go net/http — Server.Shutdown documentation
- Kubernetes Pod Lifecycle — Pod Termination
- Go os/signal package documentation
- Google Cloud — Kubernetes Best Practices: Terminating with Grace