April 9, 2026

Go Worker Pools for Payment Batch Processing — Why Unbounded Goroutines Are a Production Time Bomb

We were processing 2 million settlement records nightly. Then one Friday, the batch job OOM-killed itself at 3 AM. The fix wasn't more memory — it was a properly bounded worker pool. Here's the pattern we landed on after three iterations.

The Night Everything Broke

Our settlement service had been running fine for months. It read settlement files from the card networks, parsed each record, matched it against our internal transactions, and updated the ledger. Simple enough. The original author had written it the way most Go tutorials teach concurrency:

// The code that looked fine until it wasn't
for _, record := range settlementRecords {
    go func(r SettlementRecord) {
        if err := processRecord(ctx, r); err != nil {
            log.Printf("failed to process record %s: %v", r.ID, err)
        }
    }(record)
}

For 500K records, this worked. The pod had 4GB of memory, each goroutine was cheap, and the database connection pool handled the load. Then our merchant base grew, and the nightly file hit 2.1 million records. Two million goroutines launched simultaneously, each holding a parsed record in memory, each trying to grab a database connection from a pool of 50. The connection pool backed up, goroutines piled up waiting, memory climbed to 12GB, and the OOM killer did its job.

The on-call engineer bumped the pod memory to 16GB. It bought us a week. Then the file grew again. We needed a real fix.

The Worker Pool Pattern

The idea is simple: instead of launching one goroutine per record, you launch a fixed number of workers that pull jobs from a shared channel. The channel acts as a queue with natural backpressure — when all workers are busy, the producer blocks until one frees up. No unbounded growth, no connection pool exhaustion, predictable memory usage.

Here's how the data flows:

Worker Pool Architecture:

    File Parser → Buffered Channel → Worker 1 … Worker N → Results Channel → Result Collector

The producer reads the settlement file and pushes records into a buffered channel. A fixed pool of N workers pulls from that channel, processes each record (database lookup, ledger update, status write), and pushes results into a results channel. The collector aggregates everything into a settlement report. At no point do we have more than N records being processed concurrently.

The Implementation

Here's the core worker pool we use in production. It handles 2M+ settlement records nightly across three card networks:

func ProcessSettlementFile(ctx context.Context, records []SettlementRecord, workers int) (*BatchResult, error) {
    jobs := make(chan SettlementRecord, workers*2)
    results := make(chan RecordResult, workers*2)

    var wg sync.WaitGroup

    // Start fixed worker pool
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func(workerID int) {
            defer wg.Done()
            for record := range jobs {
                res := processRecord(ctx, workerID, record)
                results <- res
            }
        }(i)
    }

    // Producer: feed records into the jobs channel
    go func() {
        defer close(jobs)
        for _, r := range records {
            select {
            case jobs <- r:
            case <-ctx.Done():
                return
            }
        }
    }()

    // Close results channel once all workers finish
    go func() {
        wg.Wait()
        close(results)
    }()

    // Collector: aggregate results
    batch := &BatchResult{}
    for res := range results {
        if res.Err != nil {
            batch.Failed = append(batch.Failed, res)
        } else {
            batch.Succeeded = append(batch.Succeeded, res)
        }
    }
    return batch, ctx.Err()
}

A few things to notice. The jobs channel buffer is workers*2 — just enough to keep workers fed without the producer blocking on every send, but not so large that we're buffering millions of records in memory. The select on ctx.Done() in the producer is critical: if the context gets cancelled (shutdown signal, timeout), the producer stops feeding new work immediately instead of blocking on a full channel forever.

Why workers*2 for the buffer? It's a sweet spot. A buffer of 1 means the producer blocks after every send until a worker picks it up. A buffer of 1,000,000 means you're back to holding everything in memory. 2x the worker count gives each worker one record to process and one queued up, so there's no idle time between jobs. We benchmarked this — going from 1x to 2x buffer cut total processing time by 8%. Going from 2x to 10x made no measurable difference.
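The exact numbers will differ per workload, but the experiment is easy to rerun. Here's a minimal, self-contained timing harness; the 100µs sleep and the sizes are stand-ins for real per-record work, not our production code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timePool pushes n fake records through a pool with the given worker
// count and buffer size, returning the wall-clock duration. The sleep
// simulates per-record work (DB lookup plus ledger write).
func timePool(n, workers, buffer int) time.Duration {
	start := time.Now()
	jobs := make(chan int, buffer)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				time.Sleep(100 * time.Microsecond)
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return time.Since(start)
}

func main() {
	for _, buf := range []int{50, 100, 500} { // 1x, 2x, 10x of 50 workers
		fmt.Printf("buffer=%d: %v\n", buf, timePool(10_000, 50, buf))
	}
}
```

Plug in a sleep that matches your real per-record latency before trusting any particular multiplier.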

Unbounded goroutines vs. 50-worker pool, on the same nightly file:

    Peak memory        12 GB (unbounded)   →   800 MB (worker pool)
    Processing time    45 min (unbounded)  →   12 min (worker pool)
    Goroutines         2.1M (unbounded)    →   50 (worker pool)

Yes, the processing time actually went down with fewer goroutines. Counterintuitive, but it makes sense: 2 million goroutines fighting over 50 database connections means massive contention. 50 workers each with a dedicated connection means zero contention. The database stopped being the bottleneck and throughput tripled.

Graceful Shutdown

Settlement jobs run at 2 AM. Deployments happen whenever CI is green. Those two things will eventually collide. When they do, you need the batch job to stop accepting new work, finish what's in progress, and report what it completed — not crash halfway through and leave the ledger in an inconsistent state.

We wire up OS signals to context cancellation:

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Listen for SIGTERM/SIGINT
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    go func() {
        sig := <-sigCh
        log.Printf("received %v, initiating graceful shutdown", sig)
        cancel()
    }()

    result, err := ProcessSettlementFile(ctx, records, 50)
    if err != nil {
        log.Printf("batch interrupted: processed %d/%d records",
            len(result.Succeeded)+len(result.Failed), len(records))
        // Persist partial results for reconciliation
        persistPartialResult(result)
        os.Exit(1)
    }
    log.Printf("batch complete: %d succeeded, %d failed",
        len(result.Succeeded), len(result.Failed))
}

When SIGTERM arrives, the context cancels. The producer stops sending new jobs. Workers finish their current record (they check ctx.Done() between steps, not mid-transaction). The collector gathers whatever results came through. We persist the partial result so the next run can pick up where we left off.

Never kill a worker mid-transaction. If a worker is halfway through updating a ledger entry, cancelling it can leave your books in an inconsistent state. Check ctx.Done() between logical steps (parse, match, update), not inside them. Let the current step finish, then bail.

Error Collection and Partial Failures

In payment batch processing, a single bad record should never kill the entire batch. If record #847,293 has a malformed amount, the other 1,999,999 records still need to be processed. This is fundamentally different from request-level processing where you can just return a 500.

Our processRecord function returns errors as values, not panics:

func processRecord(ctx context.Context, workerID int, r SettlementRecord) RecordResult {
    // Step 1: Find matching internal transaction
    txn, err := findTransaction(ctx, r.NetworkRef)
    if err != nil {
        return RecordResult{Record: r, Err: fmt.Errorf("match: %w", err)}
    }

    // Step 2: Validate amounts
    if r.SettledAmount != txn.AuthorizedAmount {
        return RecordResult{
            Record: r,
            Err:    fmt.Errorf("amount mismatch: settled=%d authorized=%d",
                r.SettledAmount, txn.AuthorizedAmount),
        }
    }

    // Check for cancellation between steps, never mid-step: bail out
    // before touching the ledger rather than in the middle of an update
    if err := ctx.Err(); err != nil {
        return RecordResult{Record: r, Err: err}
    }

    // Step 3: Update ledger
    if err := updateLedger(ctx, txn.ID, r); err != nil {
        return RecordResult{Record: r, Err: fmt.Errorf("ledger: %w", err)}
    }

    return RecordResult{Record: r, TxnID: txn.ID}
}

The collector separates successes from failures. After the batch completes, we generate a discrepancy report for the failures and feed it into our reconciliation pipeline. The ops team reviews it in the morning. No manual database fixes at 3 AM.

Choosing Your Approach

There are three common ways to handle concurrent batch processing in Go. Here's how they compare for payment workloads:

Unbounded goroutines: memory is unbounded and grows with the input; there is no backpressure; error handling is manual and errors are easy to lose; tracking which records failed is hard. Best for prototypes and small datasets.

Channel worker pool: memory is fixed, proportional to the worker count; the channel buffer provides natural backpressure; the results channel collects every outcome, and each result carries its own error. Best for batch processing that must tolerate partial failure.

errgroup.SetLimit: memory is fixed, proportional to the limit; g.Go() blocks when the limit is reached; the first error cancels the context and, by default, stops the run. Best for request-level parallelism and all-or-nothing operations.

We use the channel-based worker pool for settlement processing because we need partial failure tolerance — one bad record can't stop the batch. For pre-authorization validation (fraud check + card check + limit check in parallel), we use errgroup because there we want to fail fast on the first error.

Production Monitoring

A worker pool without observability is a black box. We export three metrics per batch run that have caught issues before they became incidents:

// Per-worker metrics
var (
    recordsProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "settlement_records_processed_total",
    }, []string{"worker_id", "status"})

    processingDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "settlement_record_duration_seconds",
        Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5},
    }, []string{"worker_id"})

    queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "settlement_job_queue_depth",
    })
)

The queue depth metric is the most useful. If it's consistently at the buffer max, your workers are the bottleneck — consider adding more or optimizing the per-record processing. If it's consistently at zero, your producer is the bottleneck (probably I/O-bound on file reading). We graph these in Grafana and set an alert if queue depth stays maxed for more than 5 minutes — that usually means a worker is stuck on a slow database query.

Watch for worker starvation. If one worker consistently processes 3x more records than the others, you might have a hot-partition problem in your database. We saw this when settlement records for one large merchant all hit the same database shard, and the worker processing those records was 4x slower than the rest. The fix was to shuffle the input before feeding it to the pool.

What I'd Do Differently

Three iterations later, here's what I wish I'd known from the start:


Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always review and adapt for your specific use case and security requirements. This is not financial or legal advice.