The Bug That Didn't Show Up in Logs
Last year, our payment reconciliation team flagged something odd: a handful of merchant accounts showed balance discrepancies of a few cents. Not every day — maybe once or twice a week. No errors in the logs. No failed transactions. No panics. Just... wrong numbers.
We spent two days combing through transaction logs before someone on the team suggested running our test suite with -race. Within seconds, the terminal lit up with data race warnings. Three of them. All in code that had been running in production for months.
That was the moment I became a zealot for Go's race detector. Let me walk you through what we found and how you can avoid the same mistakes.
How a Race Condition Causes a Double-Charge
Before diving into the code, let's look at what was actually happening. Two goroutines processing transactions for the same merchant hit our balance map at the same time.
Both goroutines read the same starting balance, compute their own result, and write it back. The last write wins, and one deduction vanishes. This is a textbook read-modify-write race, and it's terrifyingly easy to introduce in Go when you're using plain maps across goroutines.
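To make that interleaving concrete, here is a minimal sketch that replays it step by step on a single goroutine, so the outcome is deterministic. The function name, merchant ID, and amounts are illustrative assumptions, not values from our production system.

```go
package main

import "fmt"

// lostUpdate replays the read-modify-write interleaving on one goroutine.
// It returns the balance the interleaving produces and the balance that
// two correctly serialized debits would have produced.
func lostUpdate(start, debitA, debitB int64) (got, want int64) {
	balances := map[string]int64{"merchant-1": start}

	// Step 1: both "goroutines" read the same starting balance.
	readA := balances["merchant-1"]
	readB := balances["merchant-1"]

	// Step 2: A writes back its result.
	balances["merchant-1"] = readA - debitA

	// Step 3: B writes back a result computed from its stale read,
	// overwriting A's deduction.
	balances["merchant-1"] = readB - debitB

	return balances["merchant-1"], start - debitA - debitB
}

func main() {
	got, want := lostUpdate(1000, 100, 200)
	fmt.Printf("final balance = %d, correct balance = %d\n", got, want)
	// prints: final balance = 800, correct balance = 700
}
```

The 100-cent deduction from A is gone, and nothing in the program failed or logged an error along the way.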
The Racy Code
Here's a simplified version of what our balance service looked like. I've stripped out the domain noise, but the structure is faithful to what we had in production:
type BalanceService struct {
	balances map[string]int64
}

func (s *BalanceService) Debit(
	merchantID string,
	amount int64,
) error {
	bal := s.balances[merchantID]
	if bal < amount {
		return ErrInsufficient
	}
	// Another goroutine can read
	// the stale balance here!
	s.balances[merchantID] = bal - amount
	return nil
}
And the fixed version:

type BalanceService struct {
	mu       sync.Mutex
	balances map[string]int64
}

func (s *BalanceService) Debit(
	merchantID string,
	amount int64,
) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	bal := s.balances[merchantID]
	if bal < amount {
		return ErrInsufficient
	}
	s.balances[merchantID] = bal - amount
	return nil
}
The fix is almost embarrassingly simple — a sync.Mutex guarding the read-modify-write sequence. But here's the thing: the racy version passed every unit test we had. It passed integration tests. It ran in production for months. The race only manifested under real concurrent load, and even then, it was intermittent enough to look like a rounding issue.
Why not sync.Map? We considered it, but sync.Map only protects individual read/write operations. Our bug was a read-modify-write — we needed the entire sequence to be atomic. A mutex around the whole operation is the right tool here. Use sync.Map when you have independent key-value access patterns with no compound operations.
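A small sketch of why sync.Map alone would not have saved us, under the assumption that Debit were rewritten as a Load followed by a Store. As in the earlier walkthrough, the interleaving is replayed on one goroutine to keep it deterministic; the function name and amounts are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// syncMapLostUpdate shows that individually safe Load and Store calls
// do not make the Load-compute-Store sequence atomic.
func syncMapLostUpdate() int64 {
	var balances sync.Map
	balances.Store("merchant-1", int64(1000))

	// Both debits load the same starting balance.
	a, _ := balances.Load("merchant-1")
	b, _ := balances.Load("merchant-1")

	// Each stores its own result; the second Store overwrites the first.
	balances.Store("merchant-1", a.(int64)-100)
	balances.Store("merchant-1", b.(int64)-200)

	final, _ := balances.Load("merchant-1")
	return final.(int64)
}

func main() {
	fmt.Println(syncMapLostUpdate()) // prints 800, not the correct 700
}
```

Every individual operation here is synchronized, so there is no unsynchronized memory access left to report, yet the balance is still wrong. That is the difference between a data race and a broader race condition, and it is why the lock has to cover the whole sequence.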
What the Race Detector Actually Does
Go's race detector isn't magic, but it's close. Under the hood, it uses ThreadSanitizer (TSan), originally developed at Google for C/C++. When you compile with -race, the compiler instruments every memory access in your code. At runtime, it tracks which goroutine accessed which memory address and whether proper synchronization (channels, mutexes, atomic operations) happened between accesses.
If two goroutines access the same memory location, at least one of them is a write, and there's no synchronization between them — that's a data race, and the detector reports it with full stack traces for both accesses.
The key thing to understand: it's a dynamic detector, not a static analyzer. It only finds races that actually occur during execution. This means your test coverage directly determines how many races it can catch. If a code path isn't exercised, its races stay hidden.
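For reference, here is a rough sketch of the three kinds of synchronization mentioned above (mutex, atomic operation, channel), each applied to the same toy problem of counting from many goroutines. The function names are hypothetical. Any of the three should give the detector the ordering it looks for between accesses.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countWithMutex guards the shared counter with a mutex.
func countWithMutex(n int) int64 {
	var (
		mu    sync.Mutex
		total int64
		wg    sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			total++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

// countWithAtomic uses an atomic add instead of a lock.
func countWithAtomic(n int) int64 {
	var total int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&total, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&total)
}

// countWithChannel sends increments to a single goroutine that owns the
// counter, so the counter itself is never shared.
func countWithChannel(n int) int64 {
	increments := make(chan int64)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			increments <- 1
		}()
	}
	// Close the channel once every sender has finished.
	go func() {
		wg.Wait()
		close(increments)
	}()
	var total int64
	for v := range increments {
		total += v
	}
	return total
}

func main() {
	fmt.Println(countWithMutex(100), countWithAtomic(100), countWithChannel(100))
	// prints: 100 100 100
}
```

Remove the lock from countWithMutex and the program still compiles and usually prints a plausible number, which is exactly the kind of bug the detector exists for.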
Running It: From Local to CI
The simplest way to start is running your tests with the flag:
$ go test -race ./...
==================
WARNING: DATA RACE
Read at 0x00c0000a4060 by goroutine 12:
  payment/balance.(*BalanceService).Debit()
      /app/balance/service.go:24 +0x6c

Previous write at 0x00c0000a4060 by goroutine 11:
  payment/balance.(*BalanceService).Debit()
      /app/balance/service.go:28 +0x94

Goroutine 12 (running) created at:
  payment/balance.TestConcurrentDebits()
      /app/balance/service_test.go:47 +0x118
==================
FAIL
That output is gold. It tells you exactly which two goroutines raced, which lines of code were involved, and where those goroutines were spawned. No guesswork.
The Test That Caught It
You need concurrent test scenarios to trigger races. Here's the pattern we now use for every service that handles shared state:
func TestConcurrentDebits(t *testing.T) {
	svc := NewBalanceService()
	svc.Credit("merchant-1", 10000) // seed $100.00

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = svc.Debit("merchant-1", 100) // $1.00
		}()
	}
	wg.Wait()

	bal := svc.GetBalance("merchant-1")
	if bal != 5000 { // $50.00 expected
		t.Errorf("balance = %d, want 5000", bal)
	}
}
Without -race, this test might pass even with the racy code — the scheduler might happen to serialize the goroutines. With -race, it reliably catches the unsynchronized access.
CI Integration
We added the race detector to our CI pipeline as a dedicated stage. Here's the key consideration: don't just slap -race onto your existing test command and call it done. Run it as a separate step.
# In your CI config (GitHub Actions, GitLab CI, etc.)
test-race:
  stage: test
  script:
    - go test -race -count=1 -timeout=10m ./...
  # Give it more memory: TSan needs ~5-10x more
  variables:
    GORACE: "halt_on_error=1"
Setting halt_on_error=1 in the GORACE environment variable tells the race detector to exit the program after reporting the first race instead of continuing. In CI, you want fast failure. The -count=1 flag bypasses the test cache, so every run actually executes the tests and a race can't hide behind a cached pass.
-race Performance: Can You Run It in Production?
Short answer: probably not, and you shouldn't need to. The Go documentation puts the typical cost at 5-10x memory usage and 2-20x execution time, which makes it impractical for production payment services where latency matters. We tried it briefly in a staging environment that mirrored production traffic, and p99 latency jumped from 12ms to 45ms.
The right approach is comprehensive concurrent tests in CI. If your tests exercise the concurrent paths your production code takes, the race detector will find the bugs before they reach production. We now require every PR that touches shared state to include a concurrent test — it's part of our code review checklist.
The Three Races We Found
For the curious, here's what our first -race CI run uncovered:
- Balance map race: the one described above. Concurrent debits to the same merchant account without synchronization. Fixed with a sync.Mutex.
- Transaction cache invalidation: a goroutine writing to a cache map while another goroutine iterated over it to expire old entries. Go maps are not safe for concurrent read/write. Fixed by switching to sync.RWMutex with read locks for iteration and write locks for mutation.
- Config hot-reload race: our fee configuration was reloaded from a config service every 30 seconds, writing to a struct field while request handlers read from it. Fixed with atomic.Value to swap the entire config struct atomically.
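For the config hot-reload fix, the shape of the change is roughly the following. This is a sketch under assumptions: FeeConfig, its fields, and ConfigHolder are invented for illustration and are not our real types.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FeeConfig is a hypothetical stand-in for the fee configuration.
type FeeConfig struct {
	PercentBps int64 // percentage fee in basis points
	FixedCents int64 // fixed fee in cents
}

// ConfigHolder publishes immutable config snapshots.
type ConfigHolder struct {
	current atomic.Value // always holds a *FeeConfig
}

func NewConfigHolder(initial FeeConfig) *ConfigHolder {
	h := &ConfigHolder{}
	h.current.Store(&initial)
	return h
}

// Get returns the current snapshot. Callers must treat it as read-only.
func (h *ConfigHolder) Get() *FeeConfig {
	return h.current.Load().(*FeeConfig)
}

// Reload swaps in a whole new snapshot instead of mutating fields in place.
func (h *ConfigHolder) Reload(next FeeConfig) {
	h.current.Store(&next)
}

func main() {
	holder := NewConfigHolder(FeeConfig{PercentBps: 290, FixedCents: 30})
	before := holder.Get()
	holder.Reload(FeeConfig{PercentBps: 250, FixedCents: 25})
	after := holder.Get()
	fmt.Println(before.PercentBps, after.PercentBps) // prints: 290 250
}
```

A handler that grabbed a snapshot before the reload keeps a consistent view of the old config for the rest of its request, which is usually what you want for fee calculations.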
None of these had caused a production incident — yet. But the balance race was actively causing the reconciliation discrepancies, and the cache race was a ticking time bomb that could have caused a panic under higher load (concurrent map read/write in Go is a fatal runtime error, not just incorrect data).
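The cache fix looks roughly like this. Again a sketch: TxCache, its methods, and the expiry representation are assumptions made for the example, not our production cache.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TxCache is a hypothetical transaction cache keyed by transaction ID.
type TxCache struct {
	mu      sync.RWMutex
	entries map[string]time.Time // transaction ID -> expiry time
}

func NewTxCache() *TxCache {
	return &TxCache{entries: make(map[string]time.Time)}
}

// Put mutates the map, so it takes the write lock.
func (c *TxCache) Put(id string, expiresAt time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[id] = expiresAt
}

// expiredIDs only iterates, so a read lock is enough.
func (c *TxCache) expiredIDs(now time.Time) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var ids []string
	for id, exp := range c.entries {
		if !exp.After(now) {
			ids = append(ids, id)
		}
	}
	return ids
}

// EvictExpired deletes expired entries under the write lock. It re-checks
// each entry because it may have been refreshed between the two locks.
func (c *TxCache) EvictExpired(now time.Time) int {
	ids := c.expiredIDs(now)
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for _, id := range ids {
		if exp, ok := c.entries[id]; ok && !exp.After(now) {
			delete(c.entries, id)
			removed++
		}
	}
	return removed
}

// Len reports the number of cached entries.
func (c *TxCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.entries)
}

// demoEviction seeds one expired and one live entry, then evicts.
func demoEviction() (removed, remaining int) {
	c := NewTxCache()
	now := time.Unix(1000, 0)
	c.Put("tx-old", time.Unix(500, 0))
	c.Put("tx-new", time.Unix(2000, 0))
	removed = c.EvictExpired(now)
	return removed, c.Len()
}

func main() {
	fmt.Println(demoEviction()) // prints: 1 1
}
```

Scanning under the read lock keeps other readers unblocked during the sweep; the write lock is held only for the deletions.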
Lessons Learned
After this experience, we adopted a few rules that I'd recommend to any team building concurrent Go services, especially in payments:
- Run -race in CI from day one. Don't wait until you have a mystery bug. The cost is a few extra minutes of CI time. The payoff is catching races before they corrupt financial data.
- Write concurrent tests for every shared-state operation. The race detector can only find races that happen during execution. No concurrent test coverage means no race detection.
- Prefer sync.Mutex over clever lock-free patterns for business logic. Mutexes are boring, readable, and correct. In payment code, correctness beats cleverness every time.
- Use the GORACE environment variable to tune behavior: halt_on_error=1 for CI, log_path for writing reports to files, and history_size for trading memory usage against how much per-goroutine access history is kept in large test suites.
- Don't run it in production. Invest in test coverage instead. If you need production-level race detection, consider canary deployments with the race detector enabled on a single instance handling a fraction of traffic.
One more thing: Go's race detector has zero false positives. If it reports a race, you have a race. Don't dismiss warnings or add them to an ignore list. Every single one is a real bug waiting to manifest.