The Incident That Changed Everything
It was a Thursday around 2 PM. We pushed a minor config change to our payment processing service — a Go app running on Kubernetes that handles card authorizations and settlement requests. The rolling deploy kicked in, pods got terminated, and within seconds our Slack lit up. Forty-seven transactions failed mid-flight. Customers saw "payment failed" screens. Merchants started filing support tickets. Total damage: $8.2K in dropped transactions and a very uncomfortable post-mortem.
The root cause was embarrassingly simple: our service had no graceful shutdown logic. When Kubernetes sent SIGTERM, the process just died. Any HTTP request being handled, any goroutine talking to a payment gateway, any database write in progress — all killed instantly.
Why Naive Shutdown Destroys Payment Services
Most web services can tolerate a few dropped requests during deploys. A user refreshes the page and retries. But payment services are different. A half-completed authorization can leave money in limbo. A dropped settlement call means the merchant doesn't get paid. And depending on the payment processor, some of these operations aren't safely idempotent — retrying blindly can double-charge a customer.
| Behavior | Naive Shutdown | Graceful Shutdown |
|---|---|---|
| In-flight HTTP requests | Killed immediately, clients get connection reset | Allowed to complete within timeout window |
| Background goroutines | Orphaned mid-execution, no cleanup | Signaled to finish, waited on with WaitGroup |
| Database connections | Dropped, potential partial writes | Flushed and closed after all work completes |
| Data loss risk | High — any active transaction can be lost | Near-zero within timeout budget |
| User impact | Payment failures, double charges on retry | Transparent — users don't notice deploys |
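That retry risk is worth mitigating independently of shutdown handling. Below is a minimal sketch, assuming a hypothetical gateway that deduplicates on an Idempotency-Key header (a convention popularized by Stripe); the ChargeRequest fields and endpoint path are illustrative, not our actual integration.

package payments

import (
	"bytes"
	"context"
	"encoding/json"
	"net/http"
)

// ChargeRequest is an illustrative request body.
type ChargeRequest struct {
	AmountCents int64  `json:"amount_cents"`
	Currency    string `json:"currency"`
	CardToken   string `json:"card_token"`
}

// Charge sends an authorization tagged with a client-generated
// idempotency key. Generate the key once per logical charge (a UUID
// works) and reuse it on every retry, so an ambiguous failure such as
// a timeout or a killed pod can be retried without double-charging.
func Charge(ctx context.Context, gatewayURL string, req ChargeRequest, idempotencyKey string) (*http.Response, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, gatewayURL+"/charges", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	httpReq.Header.Set("Idempotency-Key", idempotencyKey)
	return http.DefaultClient.Do(httpReq)
}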
The Shutdown Sequence That Actually Works
After the incident, we designed a strict shutdown ordering. The sequence matters — get it wrong and you'll drain requests that are still writing to a database connection you already closed.
Signal Handling and Server Shutdown
The foundation is catching SIGTERM (sent by Kubernetes) and SIGINT (for local dev with Ctrl+C). Go's os/signal package makes this straightforward:
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: newRouter(), // application routes, defined elsewhere
	}

	// Start the server in a goroutine so main can block on the signal channel
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for a termination signal
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	sig := <-quit
	log.Printf("received signal %s, starting graceful shutdown", sig)

	// Give in-flight requests 30 seconds to complete
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("server forced to shutdown: %v", err)
	}
	log.Println("server exited cleanly")
}
The key detail: http.Server.Shutdown() closes the listener immediately (no new connections) but waits for active requests to finish. If they don't finish before the context deadline, Shutdown returns the context's error without killing them; it's the subsequent process exit (or an explicit srv.Close()) that cuts off whatever is still running. For payment services, that 30-second window is critical — some payment gateway calls take 10-15 seconds on a slow day.
Draining Background Workers
HTTP handlers are only half the story. We had goroutines processing async payment callbacks, running settlement batches, and retrying failed charges. These need their own shutdown coordination:
import (
	"fmt"
	"sync"
	"time"
)

// PaymentJob and processPayment are application-specific and defined elsewhere.
type PaymentWorkerPool struct {
	wg   sync.WaitGroup
	quit chan struct{}
	jobs chan PaymentJob
}

func (p *PaymentWorkerPool) Start(workers int) {
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for {
				select {
				case job, ok := <-p.jobs:
					if !ok {
						return // channel closed, exit
					}
					p.processPayment(job)
				case <-p.quit:
					return // shutdown signaled, stop picking up new jobs
				}
			}
		}()
	}
}

func (p *PaymentWorkerPool) Shutdown(timeout time.Duration) error {
	close(p.quit) // broadcast the stop signal to all workers

	// Wait for the WaitGroup in a goroutine so the wait can be bounded
	done := make(chan struct{})
	go func() {
		p.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("worker shutdown timed out after %v", timeout)
	}
}
The pattern: close the quit channel to broadcast the stop signal, then use sync.WaitGroup to wait for all workers to finish their current job. We don't close the jobs channel immediately — that would panic if a handler tries to send. Instead, workers check quit between jobs.
Key takeaway: Shutdown ordering is everything. Close the front door first (stop accepting requests), then wait for everyone inside to leave (drain workers), and only then turn off the lights (close DB connections). Reversing any of these steps will cause data loss under load.
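To make that ordering concrete, here's a minimal sketch of main's shutdown path, assuming srv (*http.Server), pool (the PaymentWorkerPool above), and db (*sql.DB) are all in scope; the exact timeout values are discussed in the next sections.

// Phases one and two overlap: signal workers to stop while HTTP drains
workerErr := make(chan error, 1)
go func() {
	workerErr <- pool.Shutdown(20 * time.Second)
}()

// Close the front door: stop the listener and drain in-flight requests
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil {
	log.Printf("http drain incomplete: %v", err)
}

// Wait for everyone inside to leave before touching shared resources
if err := <-workerErr; err != nil {
	log.Printf("worker drain incomplete: %v", err)
}

// Only now turn off the lights: close database connections
if err := db.Close(); err != nil {
	log.Printf("db close: %v", err)
}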
Kubernetes Coordination
Getting the Go code right is only half the battle. Kubernetes has its own shutdown sequence, and if you don't coordinate with it, you'll still drop requests. The problem: when a pod enters Terminating state, Kubernetes removes it from the Service endpoints and sends SIGTERM roughly in parallel. There's a race condition — the load balancer might still route traffic to your pod for a few seconds after you've started shutting down.
The fix is a preStop hook that adds a small delay before your process receives the signal:
# lifecycle is a container-level field:
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]
# terminationGracePeriodSeconds is a pod-level field, a sibling of containers:
terminationGracePeriodSeconds: 45
That 5-second sleep gives kube-proxy and your ingress controller time to update their routing tables. Meanwhile, your readiness probe should start failing as soon as shutdown begins, so no new traffic arrives during the drain window. We set terminationGracePeriodSeconds to 45 — that's 5 seconds for the preStop hook plus 30 seconds for our application's shutdown timeout, with 10 seconds of buffer before Kubernetes sends SIGKILL.
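One way to make the readiness probe fail during the drain window is an atomic flag that the probe handler checks; this is a sketch of the idea, not necessarily how our service implements it.

import (
	"net/http"
	"sync/atomic"
)

// shuttingDown is flipped as soon as the termination signal arrives.
var shuttingDown atomic.Bool

// readyHandler backs the readinessProbe endpoint. Returning 503 during
// the drain window tells Kubernetes to stop routing new traffic to this
// pod while in-flight requests finish.
func readyHandler(w http.ResponseWriter, r *http.Request) {
	if shuttingDown.Load() {
		http.Error(w, "shutting down", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

In main, call shuttingDown.Store(true) immediately after receiving the signal, before srv.Shutdown begins the drain.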
The Timeout Strategy
Choosing the right timeout is a balancing act. Too short and you kill in-flight payments. Too long and deploys take forever, and Kubernetes might SIGKILL you anyway.
Here's what we landed on after profiling our p99 request latencies:
- HTTP server drain: 30 seconds — covers our slowest payment gateway round-trips
- Worker pool drain: 20 seconds — workers process jobs that are already dequeued
- Total application timeout: 35 seconds — less than 30 + 20 because the phases overlap: workers start draining alongside the HTTP server rather than after it
- Kubernetes terminationGracePeriodSeconds: 45 seconds — always higher than your app timeout
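Expressed as Go constants (values copied from the list above, names illustrative, time imported), the budget and the invariant that ties it together look like this:

const (
	preStopSleep     = 5 * time.Second  // Kubernetes preStop hook delay
	httpDrainTimeout = 30 * time.Second // context passed to srv.Shutdown
	workerDrain      = 20 * time.Second // budget passed to pool.Shutdown
	appShutdownMax   = 35 * time.Second // worst case; phases overlap
	k8sGracePeriod   = 45 * time.Second // terminationGracePeriodSeconds
)

// The invariant to preserve when tuning any of these:
//   preStopSleep + appShutdownMax < k8sGracePeriod
// If it breaks, Kubernetes sends SIGKILL mid-drain and you are back
// to the naive-shutdown failure mode.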
If a worker is still stuck after the timeout, we log the state of the in-progress transaction and exit. The transaction will be in an inconsistent state, but our reconciliation job picks it up within 15 minutes and either completes or reverses it. That's the safety net — graceful shutdown handles 99.9% of cases, and reconciliation catches the rest.
After the Fix
We've done over 400 deploys since implementing graceful shutdown. Zero dropped transactions. Deploys during peak hours are a non-event now. The monitoring dashboards don't even blip.
The whole implementation was maybe 150 lines of Go code and a few lines of Kubernetes config. The hard part wasn't the code — it was understanding the shutdown ordering and the Kubernetes timing quirks. If you're running any service that handles money, this isn't optional. It's table stakes.
References
- Go net/http — Server.Shutdown documentation
- Kubernetes Pod Lifecycle — Pod Termination
- Go os/signal package documentation
- Google Cloud — Kubernetes Best Practices: Terminating with Grace