April 10, 2026

Database Connection Pooling in Go Payment Services — How We Stopped Dropping Transactions at 3 AM

Go's database/sql package gives you a connection pool out of the box. The problem is, the defaults will absolutely wreck you in production. Here's what we learned the hard way running payment services that process millions of transactions a day.

The 3 AM Wake-Up Call

It was a Tuesday. Our settlement batch kicks off at 2:30 AM — roughly 400,000 merchant payouts that need to land before banks open. I got paged at 3:07 AM. The dashboard was a wall of red. Transaction success rate had cratered from 99.97% to 61%. PagerDuty was losing its mind.

The error logs told a simple story: pq: sorry, too many clients already. Our Go services had exhausted every available PostgreSQL connection. The settlement batch — which opens a lot of long-running transactions — had starved the real-time payment API of connections. Incoming card authorizations were timing out waiting for a connection that would never come.

We lost about 12,000 transactions in 23 minutes before the on-call engineer bumped max_connections on the database and restarted the services. The post-mortem was embarrassing. We'd been running with Go's default pool settings — which, for a payment service doing 2,800 TPS at peak, is like showing up to a Formula 1 race in a minivan.

The real cost: Those 12,000 dropped transactions weren't just numbers. They were people trying to pay for groceries, gas, and late-night online orders. Some of those transactions never retried. That's revenue our merchants never recovered.

How Go's Connection Pool Actually Works

Most Go developers I've talked to think sql.Open() creates a connection. It doesn't. It creates a pool. The actual TCP connection to your database happens lazily, the first time you call Query(), Exec(), or Begin(). This is fine for a CRUD app. It's a landmine for payment services.
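You can see the lazy behavior without touching a real database. Here's a minimal sketch using a throwaway in-memory driver (the `failDriver` type and the `lazyOpenDemo` helper are invented for illustration): every dial fails, yet `sql.Open` still succeeds, and the failure only surfaces on the first `Ping`.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"sync"
)

// failDriver simulates an unreachable database: every dial attempt fails.
type failDriver struct{}

func (failDriver) Open(name string) (driver.Conn, error) {
	return nil, errors.New("database unreachable")
}

var registerOnce sync.Once

// lazyOpenDemo returns the error from sql.Open (always nil here, because
// Open never dials) and the error from the first real use of the pool
// (Ping), which is where the connection attempt actually happens.
func lazyOpenDemo() (openErr, pingErr error) {
	registerOnce.Do(func() { sql.Register("fail", failDriver{}) })

	db, openErr := sql.Open("fail", "this-dsn-is-never-dialed")
	if openErr != nil {
		return openErr, nil
	}
	defer db.Close()

	pingErr = db.Ping() // first use → first dial → first failure
	return openErr, pingErr
}

func main() {
	openErr, pingErr := lazyOpenDemo()
	fmt.Println("Open error:", openErr) // Open error: <nil>
	fmt.Println("Ping error:", pingErr) // Ping error: database unreachable
}
```

This is exactly why the startup `PingContext` in our constructor matters: without it, a service can boot "healthy" with a broken DSN and fail on its first customer request.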

Here's the lifecycle of a connection in Go's pool:

  App Request → Pool Check
    → idle conn available?      → reuse it
    → under MaxOpenConns?       → create a new conn
    → otherwise                 → wait (or fail)
  Execute Query → Return conn to Pool

When your code asks for a connection, the pool first checks if there's an idle one sitting around. If yes, great — you get it immediately. If not, and you haven't hit MaxOpenConns, it creates a new one. If you have hit the limit, your goroutine blocks and waits. In a payment service, that wait can cascade fast.

The Four Knobs You Need to Get Right

Go gives you exactly four settings to tune your pool. Every single one matters for payment workloads.

MaxOpenConns

The maximum number of connections (both in-use and idle) the pool will hold open. The default is 0 — unlimited. That sounds generous until your settlement batch opens 300 connections and your database falls over. We set ours to 25 per service instance. With 8 instances, that's 200 total connections against a PostgreSQL server configured for 300. Leaves headroom for migrations, monitoring, and the occasional psql session from an engineer debugging at 3 AM.
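The arithmetic behind that number is worth making explicit. This is a sizing sketch under our assumptions (a 300-connection PostgreSQL server, roughly 100 connections reserved as headroom, 8 instances); the `perInstanceCap` helper is invented here, and your own budget will differ:

```go
package main

import "fmt"

// perInstanceCap budgets MaxOpenConns for one service instance:
// take the database's connection limit, reserve headroom for
// migrations, monitoring, and ad-hoc sessions, then split the
// remainder evenly across instances.
func perInstanceCap(dbMaxConns, headroom, instances int) int {
	return (dbMaxConns - headroom) / instances
}

func main() {
	// 300-conn PostgreSQL, 100 reserved, 8 instances → 25 each.
	fmt.Println(perInstanceCap(300, 100, 8)) // 25
}
```

Remember this budget is per database server, not per service: if several services share the same PostgreSQL instance, their caps have to fit the same limit together.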

MaxIdleConns

How many connections the pool keeps warm when they're not being used. The default is 2. Two. For a service handling thousands of requests per second, that means you're constantly creating and destroying TCP connections, doing TLS handshakes, and running PostgreSQL's authentication flow. We set this equal to MaxOpenConns. If you're willing to have 25 connections open, you should be willing to keep 25 idle. The memory cost is negligible.

Rule of thumb: Set MaxIdleConns equal to MaxOpenConns. There's almost never a good reason to have them differ in a high-throughput service. If you set idle lower, you're paying the connection setup cost repeatedly for no benefit.

ConnMaxLifetime

The maximum amount of time a connection can be reused. After this duration, the connection is closed and removed from the pool. This matters because PostgreSQL connections accumulate memory over time, and connection poolers or proxies in front of the database (PgBouncer, HAProxy) may silently drop long-lived connections. We use 5 minutes: short enough to rotate through connections regularly, long enough that we're not churning during peak traffic.

ConnMaxIdleTime

How long a connection can sit idle before it gets closed. This is your cleanup mechanism for traffic spikes. If you had a burst that opened 25 connections but now only need 5, ConnMaxIdleTime reclaims the extras. We set 30 seconds. Aggressive, but our traffic patterns are spiky around settlement windows and we want those connections back quickly.

The Configuration That Saved Us

Here's what our pool setup looks like now. Nothing fancy — just deliberate choices for every parameter:

func NewPaymentDB(dsn string) (*sql.DB, error) {
    db, err := sql.Open("pgx", dsn)
    if err != nil {
        return nil, fmt.Errorf("opening database: %w", err)
    }

    // Hard cap: 25 connections per instance.
    // With 8 instances = 200 total against a 300-conn PG server.
    db.SetMaxOpenConns(25)

    // Keep all 25 warm. The memory cost is ~50KB per idle conn.
    // The cost of NOT doing this is ~3ms per new connection.
    db.SetMaxIdleConns(25)

    // Rotate connections every 5 minutes.
    // Prevents stale connections behind load balancers
    // and limits per-connection memory growth in PG.
    db.SetConnMaxLifetime(5 * time.Minute)

    // Reclaim idle connections after 30s.
    // Cleans up after settlement batch spikes.
    db.SetConnMaxIdleTime(30 * time.Second)

    // Verify the connection is actually reachable.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    if err := db.PingContext(ctx); err != nil {
        db.Close()
        return nil, fmt.Errorf("pinging database: %w", err)
    }

    return db, nil
}

Before and After

Here's what changed after we tuned the pool and deployed. These numbers are from a two-week window, comparing the period before and after the fix:

Headline stats:

- MaxOpenConns per instance: 47 → 25
- Avg pool acquire time: < 2ms
- Dropped transactions (post-fix): 0

  Metric                         Before              After
  Connection errors / day        ~340                0
  p99 query latency              820ms               45ms
  3 AM incidents (per quarter)   3                   0
  Avg pool acquire time          180ms               1.2ms
  Peak open connections          Unbounded (~310)    25 per instance
  Connection churn rate          ~900/min            ~8/min

The p99 drop is the one that still surprises people. Most of that 820ms wasn't query execution — it was goroutines waiting in line for a connection. Once we capped the pool properly and kept idle connections warm, the wait time essentially vanished.

Detecting Connection Leaks

Pool tuning only works if you're not leaking connections. A leaked connection is one your code acquired but never returned — usually a rows.Close() that never gets called, or a transaction that's never committed or rolled back. In a payment service, a single leaked connection per request will drain your pool in seconds.

Watch out: The most common leak pattern we've seen is early returns in error handling that skip defer rows.Close(). Always call defer rows.Close() immediately after checking the error from Query(). Not later. Not after you've done some validation. Immediately.

// WRONG — leak on error between Query and Close
rows, err := db.QueryContext(ctx, query, args...)
if err != nil {
    return err
}
// ... some validation that might return early ...
defer rows.Close()  // too late if we returned above

// RIGHT — defer immediately after error check
rows, err := db.QueryContext(ctx, query, args...)
if err != nil {
    return err
}
defer rows.Close()  // first thing after the error check

Monitoring Your Pool with sql.DBStats

Go gives you db.Stats() which returns an sql.DBStats struct. Export these to your metrics system. We push them to Prometheus every 10 seconds. The fields that matter most:

func reportPoolMetrics(db *sql.DB, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop() // only matters if the loop ever exits, but costs nothing
    for range ticker.C {
        stats := db.Stats()
        metrics.Gauge("db.pool.open_connections", float64(stats.OpenConnections))
        metrics.Gauge("db.pool.in_use", float64(stats.InUse))
        metrics.Gauge("db.pool.idle", float64(stats.Idle))
        metrics.Counter("db.pool.wait_count", float64(stats.WaitCount))
        metrics.Timer("db.pool.wait_duration", stats.WaitDuration)
        metrics.Counter("db.pool.max_idle_closed", float64(stats.MaxIdleClosed))
        metrics.Counter("db.pool.max_lifetime_closed", float64(stats.MaxLifetimeClosed))
    }
}

Set alerts on WaitCount increasing and on InUse approaching MaxOpenConns. Those two signals would have caught our 3 AM incident before it became an incident.

Why the Defaults Are Terrible for Payment Services

Go's defaults are designed for the general case: a web app with moderate traffic and forgiving latency requirements. Payment services are neither of those things. Here's the mismatch:

  1. Unlimited MaxOpenConns means a traffic spike can open hundreds of connections, overwhelming your database and any connection pooler in front of it.
  2. MaxIdleConns of 2 means you're paying the connection setup cost on nearly every request during normal traffic. For PostgreSQL with TLS, that's 3-5ms per connection — an eternity when your SLA is sub-100ms.
  3. No ConnMaxLifetime means connections live forever. Behind a PgBouncer or after a failover, you'll be talking to a closed socket and won't know until a query fails.
  4. No ConnMaxIdleTime means after a traffic spike, those extra connections sit around consuming database resources indefinitely.

Bottom line: If you're running a payment service in Go and you haven't explicitly set all four pool parameters, you have a production incident waiting to happen. It's not a question of if — it's a question of when your settlement batch and real-time traffic collide.


Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Specific metrics and thresholds mentioned are illustrative — your values will depend on your transaction volume, risk tolerance, and business requirements. Always verify with official documentation.