Why Memory Leaks Hit Payment Services Harder
Most web services can tolerate a slow memory leak for a while. A blog backend that drifts from 200MB to 400MB over a week? Annoying, but nobody loses money. Payment services are different. They run hot, they run long, and when they go down, transactions fail. A card authorization that times out because your pod got OOM-killed is a lost sale and a frustrated customer. Worse, if your service restarts mid-settlement, you might end up with partial batches that need manual reconciliation.
Go's garbage collector is good, but it can't save you from yourself. If you hold references to objects that should have been released, the GC dutifully keeps them alive. In payment services, the usual suspects are connection pools that grow but never shrink, cached tokens that accumulate forever, and goroutines that block on channels nobody's reading from.
The pprof Basics You Actually Need
Go ships with net/http/pprof, and it's the single most underused tool in the ecosystem. Adding it to your service takes just a few lines:
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // side-effect import: registers the debug handlers on http.DefaultServeMux
)

func main() {
    go func() {
        // Separate port so it's not exposed through your API gateway
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... rest of your service
}
That gives you heap profiles, goroutine dumps, CPU profiles, and more — all accessible over HTTP. In production, I bind this to localhost and access it through a kubectl port-forward or SSH tunnel. Never expose pprof to the internet. It's a debugging goldmine for attackers too.
Production tip: Run pprof on a separate port behind your internal network. The overhead of having the endpoints registered is near zero — they only do real work when you actually hit them. There's no reason not to have this in every service from day one.
Heap vs Goroutine Profiling
These are the two profiles I reach for first, and they answer different questions. The heap profile tells you what's allocated and which call stacks allocated it; the goroutine profile tells you how many goroutines exist and where each one is parked or blocked.
My workflow: start with the heap profile to see what's eating memory. If the heap looks reasonable but RSS keeps climbing, switch to the goroutine profile — you probably have goroutine leaks, and each one carries its own stack (typically 2-8KB that the heap profile won't attribute clearly).
The Profiling Workflow That Actually Works
After chasing memory issues across a few services, I settled on a repeatable workflow. The sequence I follow every time: capture a baseline heap profile, apply load (or wait for real traffic), capture a second profile, diff the two, and drill into whatever grew.
The diff step is the key. A single heap profile tells you what's allocated right now, but the diff between two profiles tells you what's growing. That's the leak.
# Capture baseline
curl -o base.prof http://localhost:6060/debug/pprof/heap
# Wait 10 minutes (or run your load test)
# Capture after load
curl -o after.prof http://localhost:6060/debug/pprof/heap
# Diff them — this is where the magic happens
go tool pprof -inuse_space -base=base.prof after.prof
(pprof) top 10
The -inuse_space flag shows you memory that's still held, not just allocated. In payment services, I care about inuse over alloc because allocations that get freed quickly are fine — it's the ones that stick around that kill you.
Common Leak Patterns in Payment Services
After profiling a dozen payment services, the same patterns keep showing up. Here are the three I see most often:
1. Unbounded Token/Fingerprint Maps
This was our 2GB leak. We cached card fingerprints in a map[string]CardMeta to avoid redundant tokenization calls. The map had a Set() but no eviction. Over weeks of production traffic, it accumulated millions of entries. The fix was embarrassingly simple — switch to a TTL-based cache with a size cap:
// Before: unbounded map that grows forever
var tokenCache = make(map[string]CardMeta)

// After: LRU cache with TTL and max size
// (using hashicorp/golang-lru/v2/expirable; nil is the optional eviction callback)
cache := expirable.NewLRU[string, CardMeta](50000, nil, 15*time.Minute)
2. Connection Pool Leaks
HTTP clients talking to payment gateways are another classic. If you create a new http.Client per request (or per goroutine), each one spins up its own connection pool. Those idle connections sit in memory waiting for reuse that never comes. Always share a single http.Client with sensible pool settings:
var gatewayClient = &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 20,
IdleConnTimeout: 90 * time.Second,
},
}
And always close response bodies. I've seen resp.Body.Close() missing in error paths more times than I'd like to admit. A deferred close right after the nil check is the safest pattern.
3. Goroutine Leaks from Webhook Retries
Payment services send a lot of webhooks — payment confirmations, refund notifications, settlement reports. A common pattern is to retry failed webhooks in a goroutine with exponential backoff. If the destination is permanently down, those goroutines pile up. I've seen services with 50,000+ leaked goroutines, each holding onto the webhook payload and HTTP request objects.
# Check goroutine count in production
curl "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -1
# goroutine profile: total 51234 <-- that's a problem
The fix: use a bounded worker pool for webhook delivery (sound familiar?) and persist failed webhooks to a database or queue for retry, not in-memory goroutines.
Production Profiling with Minimal Overhead
The number one objection I hear: "We can't run profiling in production, it'll slow things down." This is mostly a myth for Go's memory profiler. The heap profiler samples allocations — by default, one sample per 512KB allocated. The overhead is negligible. CPU profiling is a different story (it does add measurable overhead), but heap and goroutine profiles are essentially free to have available.
For continuous monitoring, I export a few key metrics to Prometheus and set alerts:
// Export to Prometheus for continuous monitoring
import (
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	heapInUse = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "payment_svc_heap_inuse_bytes",
	})
	goroutineCount = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "payment_svc_goroutine_count",
	})
)

func recordMemMetrics() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	heapInUse.Set(float64(m.HeapInuse))
	goroutineCount.Set(float64(runtime.NumGoroutine()))
}
I call recordMemMetrics() every 30 seconds from a background ticker. The alert fires if heap usage crosses 500MB or goroutine count exceeds 10,000. Both of those have caught real issues before they became incidents.
Watch out for runtime.ReadMemStats in hot paths. It triggers a stop-the-world pause to get consistent numbers. Every 30 seconds is fine. Every request is not. I've seen a well-intentioned middleware that called ReadMemStats on every API call add 2ms of latency to every payment request.
Putting It Into Practice
Memory profiling isn't something you do once and forget. For payment services, I treat it as part of the release checklist: run a load test, capture before/after heap profiles, diff them, and verify that nothing is growing unbounded. It takes 20 minutes and has saved us from at least three production incidents in the past year.
The tools are already built into Go. You don't need a vendor, you don't need a SaaS platform, you don't need to instrument every allocation. Just net/http/pprof, a couple of curl commands, and the discipline to look at the numbers before your customers feel the impact.
References
- Go Standard Library — net/http/pprof Package Documentation
- Go Blog — Profiling Go Programs
- Go Standard Library — runtime.MemStats Documentation
- Go Documentation — Diagnostics
- Google pprof — Visualization and Analysis Tool
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always review and adapt for your specific use case and security requirements. This is not financial or legal advice.