April 9, 2026

Go Worker Pools for Payment Batch Processing — Why Unbounded Goroutines Are a Production Time Bomb

We were processing 2 million settlement records nightly. Then one Friday, the batch job OOM-killed itself at 3 AM. The fix wasn't more memory — it was a properly bounded worker pool. Here's the pattern we landed on after three iterations.

The Night Everything Broke

Our settlement service had been running fine for months. It read settlement files from the card networks, parsed each record, matched it against our internal transactions, and updated the ledger. Simple enough. The original author had written it the way most Go tutorials teach concurrency:

// The code that looked fine until it wasn't
for _, record := range settlementRecords {
    go func(r SettlementRecord) {
        if err := processRecord(ctx, r); err != nil {
            log.Printf("failed to process record %s: %v", r.ID, err)
        }
    }(record)
}

For 500K records, this worked. The pod had 4GB of memory, each goroutine was cheap, and the database connection pool handled the load. Then our merchant base grew, and the nightly file hit 2.1 million records. Two million goroutines launched simultaneously, each holding a parsed record in memory, each trying to grab a database connection from a pool of 50. The connection pool backed up, goroutines piled up waiting, memory climbed to 12GB, and the OOM killer did its job.

The on-call engineer bumped the pod memory to 16GB. It bought us a week. Then the file grew again. We needed a real fix.

The Worker Pool Pattern

The idea is simple: instead of launching one goroutine per record, you launch a fixed number of workers that pull jobs from a shared channel. The channel acts as a queue with natural backpressure — when all workers are busy, the producer blocks until one frees up. No unbounded growth, no connection pool exhaustion, predictable memory usage.

Here's how the data flows:

Worker Pool Architecture:

    File Parser → Buffered Channel → Worker 1 … Worker N → Results Channel → Result Collector

The producer reads the settlement file and pushes records into a buffered channel. A fixed pool of N workers pulls from that channel, processes each record (database lookup, ledger update, status write), and pushes results into a results channel. The collector aggregates everything into a settlement report. At no point do we have more than N records being processed concurrently.

The Implementation

Here's the core worker pool we use in production. It handles 2M+ settlement records nightly across three card networks:

func ProcessSettlementFile(ctx context.Context, records []SettlementRecord, workers int) (*BatchResult, error) {
    jobs := make(chan SettlementRecord, workers*2)
    results := make(chan RecordResult, workers*2)

    var wg sync.WaitGroup

    // Start fixed worker pool
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func(workerID int) {
            defer wg.Done()
            for record := range jobs {
                res := processRecord(ctx, workerID, record)
                results <- res
            }
        }(i)
    }

    // Producer: feed records into the jobs channel
    go func() {
        defer close(jobs)
        for _, r := range records {
            select {
            case jobs <- r:
            case <-ctx.Done():
                return
            }
        }
    }()

    // Close results channel once all workers finish
    go func() {
        wg.Wait()
        close(results)
    }()

    // Collector: aggregate results
    batch := &BatchResult{}
    for res := range results {
        if res.Err != nil {
            batch.Failed = append(batch.Failed, res)
        } else {
            batch.Succeeded = append(batch.Succeeded, res)
        }
    }
    return batch, ctx.Err()
}

A few things to notice. The jobs channel buffer is workers*2 — just enough to keep workers fed without the producer blocking on every send, but not so large that we're buffering millions of records in memory. The select on ctx.Done() in the producer is critical: if the context gets cancelled (shutdown signal, timeout), the producer stops feeding new work immediately instead of blocking on a full channel forever.

Why workers*2 for the buffer? It's a sweet spot. A buffer of 1 means the producer blocks after every send until a worker picks it up. A buffer of 1,000,000 means you're back to holding everything in memory. 2x the worker count gives each worker one record to process and one queued up, so there's no idle time between jobs. We benchmarked this — going from 1x to 2x buffer cut total processing time by 8%. Going from 2x to 10x made no measurable difference.
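The exact numbers will differ per workload, but the experiment is easy to rerun. Here's a minimal, self-contained timing harness; the 100µs sleep and the sizes are stand-ins for real per-record work, not our production code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timePool pushes n fake records through a pool with the given worker
// count and buffer size, returning the wall-clock duration. The sleep
// simulates per-record work (DB lookup plus ledger write).
func timePool(n, workers, buffer int) time.Duration {
	start := time.Now()
	jobs := make(chan int, buffer)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				time.Sleep(100 * time.Microsecond)
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return time.Since(start)
}

func main() {
	for _, buf := range []int{50, 100, 500} { // 1x, 2x, 10x of 50 workers
		fmt.Printf("buffer=%d: %v\n", buf, timePool(10_000, 50, buf))
	}
}
```

Plug in a sleep that matches your real per-record latency before trusting any particular multiplier.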

Unbounded goroutines vs. 50-worker pool, on the same nightly file:

    Peak memory        12 GB (unbounded)   →   800 MB (worker pool)
    Processing time    45 min (unbounded)  →   12 min (worker pool)
    Goroutines         2.1M (unbounded)    →   50 (worker pool)

Yes, the processing time actually went down with fewer goroutines. Counterintuitive, but it makes sense: 2 million goroutines fighting over 50 database connections means massive contention. 50 workers each with a dedicated connection means zero contention. The database stopped being the bottleneck and throughput tripled.

Graceful Shutdown

Settlement jobs run at 2 AM. Deployments happen whenever CI is green. Those two things will eventually collide. When they do, you need the batch job to stop accepting new work, finish what's in progress, and report what it completed — not crash halfway through and leave the ledger in an inconsistent state.

We wire up OS signals to context cancellation:

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Listen for SIGTERM/SIGINT
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    go func() {
        sig := <-sigCh
        log.Printf("received %v, initiating graceful shutdown", sig)
        cancel()
    }()

    result, err := ProcessSettlementFile(ctx, records, 50)
    if err != nil {
        log.Printf("batch interrupted: processed %d/%d records",
            len(result.Succeeded)+len(result.Failed), len(records))
        // Persist partial results for reconciliation
        persistPartialResult(result)
        os.Exit(1)
    }
    log.Printf("batch complete: %d succeeded, %d failed",
        len(result.Succeeded), len(result.Failed))
}

When SIGTERM arrives, the context cancels. The producer stops sending new jobs. Workers finish their current record (they check ctx.Done() between steps, not mid-transaction). The collector gathers whatever results came through. We persist the partial result so the next run can pick up where we left off.

Never kill a worker mid-transaction. If a worker is halfway through updating a ledger entry, cancelling it can leave your books in an inconsistent state. Check ctx.Done() between logical steps (parse, match, update), not inside them. Let the current step finish, then bail.

Error Collection and Partial Failures

In payment batch processing, a single bad record should never kill the entire batch. If record #847,293 has a malformed amount, the other 1,999,999 records still need to be processed. This is fundamentally different from request-level processing where you can just return a 500.

Our processRecord function returns errors as values, not panics:

func processRecord(ctx context.Context, workerID int, r SettlementRecord) RecordResult {
    // Step 1: Find matching internal transaction
    txn, err := findTransaction(ctx, r.NetworkRef)
    if err != nil {
        return RecordResult{Record: r, Err: fmt.Errorf("match: %w", err)}
    }

    // Step 2: Validate amounts
    if r.SettledAmount != txn.AuthorizedAmount {
        return RecordResult{
            Record: r,
            Err:    fmt.Errorf("amount mismatch: settled=%d authorized=%d",
                r.SettledAmount, txn.AuthorizedAmount),
        }
    }

    // Check for cancellation between steps, never mid-step: bail out
    // before touching the ledger rather than in the middle of an update
    if err := ctx.Err(); err != nil {
        return RecordResult{Record: r, Err: err}
    }

    // Step 3: Update ledger
    if err := updateLedger(ctx, txn.ID, r); err != nil {
        return RecordResult{Record: r, Err: fmt.Errorf("ledger: %w", err)}
    }

    return RecordResult{Record: r, TxnID: txn.ID}
}

The collector separates successes from failures. After the batch completes, we generate a discrepancy report for the failures and feed it into our reconciliation pipeline. The ops team reviews it in the morning. No manual database fixes at 3 AM.

Choosing Your Approach

There are three common ways to handle concurrent batch processing in Go. Here's how they compare for payment workloads:

Unbounded goroutines: memory is unbounded and grows with the input; there is no backpressure; error handling is manual and errors are easy to lose; tracking which records failed is hard. Best for prototypes and small datasets.

Channel worker pool: memory is fixed, proportional to the worker count; the channel buffer provides natural backpressure; the results channel collects every outcome, and each result carries its own error. Best for batch processing that must tolerate partial failure.

errgroup.SetLimit: memory is fixed, proportional to the limit; g.Go() blocks when the limit is reached; the first error cancels the context and, by default, stops the run. Best for request-level parallelism and all-or-nothing operations.

We use the channel-based worker pool for settlement processing because we need partial failure tolerance — one bad record can't stop the batch. For pre-authorization validation (fraud check + card check + limit check in parallel), we use errgroup because there we want to fail fast on the first error.

Production Monitoring

A worker pool without observability is a black box. We export three metrics per batch run that have caught issues before they became incidents:

// Per-worker metrics
var (
    recordsProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "settlement_records_processed_total",
    }, []string{"worker_id", "status"})

    processingDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "settlement_record_duration_seconds",
        Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5},
    }, []string{"worker_id"})

    queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "settlement_job_queue_depth",
    })
)

The queue depth metric is the most useful. If it's consistently at the buffer max, your workers are the bottleneck — consider adding more or optimizing the per-record processing. If it's consistently at zero, your producer is the bottleneck (probably I/O-bound on file reading). We graph these in Grafana and set an alert if queue depth stays maxed for more than 5 minutes — that usually means a worker is stuck on a slow database query.

Watch for worker starvation. If one worker consistently processes 3x more records than the others, you might have a hot-partition problem in your database. We saw this when settlement records for one large merchant all hit the same database shard, and the worker processing those records was 4x slower than the rest. The fix was to shuffle the input before feeding it to the pool.

What I'd Do Differently

Three iterations later, here's what I wish I'd known from the start:


Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always review and adapt for your specific use case and security requirements. This is not financial or legal advice.