Why Not Just Use Goroutines and WaitGroup
The naive approach to parallel API calls in Go looks like this: spawn goroutines, use a sync.WaitGroup, collect results through channels. It works, but it has three problems in payment contexts:
- Error propagation is manual. If one provider call fails, you need to decide whether to cancel the others. With raw goroutines, you're wiring up context cancellation yourself.
- Panics in goroutines crash the process. A nil pointer from a malformed provider response takes down your entire payment service, not just that one request.
- No concurrency limits. During a settlement batch, you might fan out 10,000 reconciliation calls. Without a limiter, you'll exhaust file descriptors or get rate-limited by the provider.
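golang.org/x/sync/errgroup addresses the first and third problems directly. It's a small package, but it encodes the right patterns for concurrent work with shared error handling. It does not turn a panic into an error, though: a panic inside a function passed to g.Go() still takes down the process unless you recover it yourself.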
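On the panic problem, one common approach is to wrap each function so that a panic is recovered and reported as an ordinary error. The following is a minimal sketch, assuming a hypothetical helper named recoverToError; it is not part of errgroup or the standard library.

```go
package main

import "fmt"

// recoverToError is a hypothetical helper: it wraps fn so that a panic inside
// fn is recovered and returned as an error instead of crashing the process.
func recoverToError(fn func() error) func() error {
	return func() (err error) {
		defer func() {
			if r := recover(); r != nil {
				// Overwrite the named return value with the recovered panic.
				err = fmt.Errorf("recovered panic: %v", r)
			}
		}()
		return fn()
	}
}
```

The wrapped function keeps the func() error shape that g.Go() accepts, so it could be used as g.Go(recoverToError(func() error { ... })). Whether swallowing a panic is acceptable depends on the call; for some failures crashing loudly may be the safer choice.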
errgroup Basics for Payment Fan-Out
The core pattern: create a group with a context, launch goroutines with g.Go(), and wait for all of them. If any goroutine returns an error, the context is cancelled and g.Wait() returns that error.
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error {
	return callFraudService(ctx, txn)
})
g.Go(func() error {
	return callRiskEngine(ctx, txn)
})
if err := g.Wait(); err != nil {
	// At least one call failed, so ctx was cancelled
	// and the other call received a cancellation signal too.
	return fmt.Errorf("pre-auth checks failed: %w", err)
}
The key insight: when errgroup.WithContext creates the group, it derives a child context. The first error from any goroutine cancels that child context. Other goroutines receive the cancellation through ctx.Done(), but only if they're checking it. Make sure your HTTP calls actually carry the context: the standard library's http.Client honors cancellation only for requests created with http.NewRequestWithContext (or given a context via Request.WithContext).
Important: errgroup only returns the first error. In payment systems, you often need all errors — "Stripe timed out AND Adyen returned 503." We'll cover multi-error collection below.
Parallel Provider Health Checks
Before routing a transaction, our orchestration layer checks which providers are healthy. Doing this sequentially adds latency: three providers at 200ms each means 600ms before you even start the authorization. With errgroup, the checks run in parallel, so the total is roughly the latency of the slowest provider:
type HealthResult struct {
	Provider string
	Healthy  bool
	Latency  time.Duration
}

func checkProviders(ctx context.Context, providers []Provider) ([]HealthResult, error) {
	results := make([]HealthResult, len(providers))
	g, ctx := errgroup.WithContext(ctx)
	for i, p := range providers {
		i, p := i, p // per-iteration copies; required before Go 1.22
		g.Go(func() error {
			start := time.Now()
			err := p.Ping(ctx)
			// Each goroutine writes only to its own slot in results.
			results[i] = HealthResult{
				Provider: p.Name(),
				Healthy:  err == nil,
				Latency:  time.Since(start),
			}
			return nil // Don't fail the group on unhealthy provider
		})
	}
	return results, g.Wait()
}
Notice that each goroutine writes to its own index in the results slice, so no mutex is needed, and g.Wait() ensures those writes are complete before the slice is read. We return nil from each goroutine because an unhealthy provider isn't an error in the group; it's data we use for routing decisions.
Multi-Acquirer Routing with First Success
Sometimes you want to try multiple acquirers simultaneously and take the first successful authorization. This is common in high-value transactions where you want to maximize approval rates. errgroup alone doesn't support "first success" — it waits for all goroutines. But you can combine it with a channel:
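func authorizeWithFallback(ctx context.Context, txn Transaction, acquirers []Acquirer) (*AuthResult, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	// Buffered so a late approval never blocks its goroutine.
	resultCh := make(chan *AuthResult, len(acquirers))

	// Zero-value Group instead of WithContext: one acquirer's error
	// must not cancel the attempts that are still in flight.
	var g errgroup.Group
	for _, acq := range acquirers {
		acq := acq // per-iteration copy; required before Go 1.22
		g.Go(func() error {
			res, err := acq.Authorize(ctx, txn)
			if err != nil {
				return err
			}
			if res.Approved {
				resultCh <- res
				cancel() // Signal others to stop
			}
			return nil
		})
	}

	// Wait once in the background and hand its result to the fallback path.
	errCh := make(chan error, 1)
	go func() {
		errCh <- g.Wait()
		close(resultCh)
	}()

	if res, ok := <-resultCh; ok {
		return res, nil
	}
	if err := <-errCh; err != nil {
		return nil, err // No approval: return the first error
	}
	// Every acquirer answered without error, but none approved.
	return nil, errors.New("no acquirer approved the transaction")
}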
Warning: Sending the same transaction to multiple acquirers simultaneously can result in double charges if more than one approves before cancellation propagates. Use this pattern only for idempotent operations or when your providers support void-on-duplicate.
Bounded Concurrency for Batch Operations
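Settlement reconciliation might involve checking thousands of transactions against a provider's API. Unbounded concurrency will get you rate-limited or worse. errgroup's SetLimit method handles this. It ships with the golang.org/x/sync module rather than with a particular Go release, so you need a reasonably recent version of that module: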
func reconcileBatch(ctx context.Context, txns []Transaction) []ReconcileResult {
	results := make([]ReconcileResult, len(txns))
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(20) // Max 20 concurrent API calls
	for i, txn := range txns {
		i, txn := i, txn // per-iteration copies; required before Go 1.22
		g.Go(func() error {
			res, err := reconcileOne(ctx, txn)
			results[i] = ReconcileResult{TxnID: txn.ID, Result: res, Err: err}
			return nil // Collect errors in results, don't fail the group
		})
	}
	_ = g.Wait() // Always nil here: every goroutine returns nil
	return results
}
The SetLimit(20) call means at most 20 goroutines run concurrently. When one finishes, the next g.Go() call unblocks. This is cleaner than managing a semaphore channel yourself, and it integrates with errgroup's error handling.
Collecting Errors Without Losing Context
errgroup returns only the first error. In payment systems, you need all of them — for logging, for deciding whether to retry, and for incident response. Here's the pattern we use:
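// MultiError collects errors from concurrent goroutines.
type MultiError struct {
	mu   sync.Mutex
	errs []error // named errs to avoid confusion with the errors package
}

// Add records one error; safe to call from multiple goroutines.
func (me *MultiError) Add(err error) {
	me.mu.Lock()
	me.errs = append(me.errs, err)
	me.mu.Unlock()
}

// Err returns nil if nothing was recorded, otherwise one combined error.
func (me *MultiError) Err() error {
	me.mu.Lock()
	defer me.mu.Unlock()
	if len(me.errs) == 0 {
		return nil
	}
	// errors.Join requires Go 1.20 or later.
	return fmt.Errorf("%d provider errors: %w", len(me.errs), errors.Join(me.errs...))
}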
Use it alongside errgroup — each goroutine appends to the MultiError and returns nil to the group, so all goroutines run to completion. Then check MultiError.Err() after g.Wait().
| Pattern | Error Behavior | Use Case |
|---|---|---|
| errgroup.WithContext | First error cancels the shared context | Pre-auth checks (all must pass) |
| errgroup + nil returns | All run to completion | Health checks, batch reconciliation |
| errgroup + MultiError | Collect all, decide after | Multi-provider settlement |
| errgroup + channel | First success wins | Multi-acquirer authorization |
Production Lessons
- Always set a context timeout. errgroup inherits the parent context, but if that context has no deadline, a hung provider call blocks the group forever. We wrap every payment fan-out with a 5-second timeout.
- Log which goroutine failed. errgroup's first-error-wins behavior means you lose context about which provider caused the failure. Wrap errors with the provider name: fmt.Errorf("stripe: %w", err).
- Don't share mutable state between goroutines. The index-per-goroutine pattern (results[i]) is safe because each goroutine writes to a unique index. But if you're building a shared map, you need a mutex.
- Use SetLimit for external APIs. Even if your service can handle 1,000 concurrent goroutines, the provider's API probably can't. We've been rate-limited by every major PSP at least once during batch operations.
- Test with -race. Fan-out patterns are where data races hide. Run go test -race on every CI build. We caught a shared-buffer bug in our reconciliation fan-out that only manifested under load.
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always add proper error handling and testing for production use.