Why Financial Systems Need Better Error Handling
I've been writing Go for payment microservices for a while now, and if there's one thing I've learned the hard way, it's that if err != nil { return err } will eventually cost you money. Literally.
In most applications, an error is an error. You log it, maybe show the user a friendly message, and move on. In payment systems, the type of error changes everything. A timeout from the payment gateway means the charge might have gone through — retry blindly and you double-charge the customer. A "card declined" response is permanent — retrying it ten times won't change the outcome, but it will get you rate-limited by the gateway and flagged for suspicious behavior.
The stakes are different here. Every unclassified error is a potential dispute, a compliance finding, or a customer who never comes back. So we need error handling that does more than just propagate failures — it needs to classify them, carry context for audit trails, and inform retry decisions.
A real incident that shaped my thinking: we once had a service that retried on all errors from a gateway. A network blip caused 340 duplicate charges in 12 minutes. The refund process took three days. After that, we built the patterns I'm sharing here.
Typed Errors for Payment Domains
The foundation of everything else is a well-structured error type. Go's error interface is minimal by design, but for payments, we need errors that carry domain-specific information. Here's the core type I use across our payment services:
type PaymentError struct {
	Code            string
	Message         string
	Retryable       bool
	GatewayResponse string
	TransactionID   string
	HTTPStatus      int
}

func (e *PaymentError) Error() string {
	return fmt.Sprintf("payment error %s: %s (retryable: %t)",
		e.Code, e.Message, e.Retryable)
}

// Sentinel errors for common cases
var (
	ErrInsufficientFunds = &PaymentError{Code: "insufficient_funds", Retryable: false}
	ErrGatewayTimeout    = &PaymentError{Code: "gateway_timeout", Retryable: true}
	ErrRateLimited       = &PaymentError{Code: "rate_limited", Retryable: true}
	ErrInvalidCard       = &PaymentError{Code: "invalid_card", Retryable: false}
	ErrFraudSuspected    = &PaymentError{Code: "fraud_suspected", Retryable: false}
)
The key fields are Code for programmatic classification, Retryable for retry logic, and GatewayResponse for debugging. With Go 1.13+ error wrapping, we can check these anywhere in the call stack:
func handleChargeResult(err error) {
	var payErr *PaymentError
	if errors.As(err, &payErr) {
		if payErr.Retryable {
			// enqueue for retry
			retryQueue.Push(payErr.TransactionID)
			return
		}
		// permanent failure: notify the customer
		notifyDecline(payErr.TransactionID, payErr.Code)
		return
	}
	// unknown error: flag for manual review
	escalate(err)
}
The errors.As call unwraps through any number of wrapping layers to find our PaymentError. This means intermediate services can add context without destroying the classification.
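To make the unwrapping behavior concrete, here is a minimal sketch with two wrapping layers. The function names (gatewayCall, serviceCall, handlerCall, flattened, Classify) are hypothetical and exist only for this illustration; PaymentError is trimmed to three fields. It also shows the one way intermediate code can destroy the classification: formatting the error with %v instead of %w.

```go
package main

import (
	"errors"
	"fmt"
)

// PaymentError mirrors the article's type, trimmed to the fields used here.
type PaymentError struct {
	Code      string
	Message   string
	Retryable bool
}

func (e *PaymentError) Error() string {
	return fmt.Sprintf("payment error %s: %s (retryable: %t)", e.Code, e.Message, e.Retryable)
}

// gatewayCall is a hypothetical lowest layer that always times out.
func gatewayCall() error {
	return &PaymentError{Code: "gateway_timeout", Message: "no response", Retryable: true}
}

// serviceCall adds context with %w, which keeps the chain intact.
func serviceCall() error {
	if err := gatewayCall(); err != nil {
		return fmt.Errorf("charge service: %w", err)
	}
	return nil
}

// handlerCall adds a second layer of context, again with %w.
func handlerCall() error {
	if err := serviceCall(); err != nil {
		return fmt.Errorf("checkout handler: %w", err)
	}
	return nil
}

// flattened formats the error with %v, which copies the text but drops the chain.
func flattened() error {
	return fmt.Errorf("checkout handler: %v", gatewayCall())
}

// Classify reports the code and retryability of a PaymentError found anywhere in the chain.
func Classify(err error) (code string, retryable bool, ok bool) {
	var payErr *PaymentError
	if errors.As(err, &payErr) {
		return payErr.Code, payErr.Retryable, true
	}
	return "", false, false
}

func main() {
	fmt.Println(Classify(handlerCall())) // classification survives both layers
	fmt.Println(Classify(flattened()))   // classification is lost
}
```

The second call is the failure mode worth watching for in review: the message looks identical, but the caller can no longer tell a retryable timeout from anything else.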
The Error Classification Pattern
Raw gateway responses are messy. Every payment processor returns errors differently — Stripe gives you structured error codes, some legacy gateways give you four-digit numeric codes, and others return free-text messages. I normalize everything through a classifier:
type ErrorCategory int

const (
	CategoryRetryable ErrorCategory = iota
	CategoryPermanent
	CategoryRequiresReview
)

func ClassifyGatewayError(httpStatus int, gatewayCode string) (*PaymentError, ErrorCategory) {
	// Network-level classification
	if httpStatus >= 500 {
		return &PaymentError{
			Code:       "gateway_error",
			Message:    "Gateway returned server error",
			Retryable:  true,
			HTTPStatus: httpStatus,
		}, CategoryRetryable
	}

	// Application-level classification
	switch gatewayCode {
	case "insufficient_funds", "card_declined", "expired_card":
		return &PaymentError{
			Code:      gatewayCode,
			Message:   "Card was declined",
			Retryable: false,
		}, CategoryPermanent
	case "rate_limit", "try_again_later":
		return &PaymentError{
			Code:      gatewayCode,
			Message:   "Gateway rate limited",
			Retryable: true,
		}, CategoryRetryable
	case "invalid_card_number", "invalid_cvv":
		return &PaymentError{
			Code:      gatewayCode,
			Message:   "Invalid card details",
			Retryable: false,
		}, CategoryPermanent
	case "fraud_warning", "risk_threshold":
		return &PaymentError{
			Code:      gatewayCode,
			Message:   "Flagged for fraud review",
			Retryable: false,
		}, CategoryRequiresReview
	default:
		return &PaymentError{
			Code:      "unknown_" + gatewayCode,
			Message:   "Unrecognized gateway response",
			Retryable: false,
		}, CategoryRequiresReview
	}
}
The important detail: unknown errors default to RequiresReview, not Retryable. In payments, when you don't know what happened, the safest thing is to stop and let a human look at it. Retrying an unknown error is how you get duplicate charges.
| Error Type | Retryable? | Action | Example |
|---|---|---|---|
| Timeout | Yes | Retry with idempotency key | Gateway didn't respond within 30s |
| Decline | No | Notify customer, stop retries | Insufficient funds, expired card |
| Rate Limit | Yes | Backoff and retry after delay | HTTP 429 from gateway API |
| Network Error | Yes | Retry with circuit breaker | DNS resolution failure, TCP reset |
| Fraud Block | No | Escalate to fraud review queue | Risk score exceeded threshold |
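The table reduces to three outcomes, one per ErrorCategory. As a rough sketch of how a caller might dispatch on the category, the helper below maps each one to an action name. NextAction and the action strings are assumptions made up for this example, not part of the services described here.

```go
package main

import "fmt"

type ErrorCategory int

const (
	CategoryRetryable ErrorCategory = iota
	CategoryPermanent
	CategoryRequiresReview
)

// NextAction is a hypothetical helper that turns a category into the
// operational step from the table. The default branch covers
// CategoryRequiresReview and any value added later, so a category this
// code does not know about is escalated rather than retried.
func NextAction(cat ErrorCategory) string {
	switch cat {
	case CategoryRetryable:
		return "retry_with_backoff"
	case CategoryPermanent:
		return "notify_customer"
	default:
		return "escalate_for_review"
	}
}

func main() {
	for _, cat := range []ErrorCategory{CategoryRetryable, CategoryPermanent, CategoryRequiresReview} {
		fmt.Println(cat, NextAction(cat))
	}
}
```

Putting the review path in the default branch keeps the fail-safe behavior of the classifier: forgetting to handle a new category can never silently turn into a retry.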
Wrapping Errors Without Losing Context
Go's fmt.Errorf with %w is great for adding context, but in financial systems you have to be deliberate about what context you add. Transaction IDs, gateway names, amounts — all useful for debugging. Card numbers, CVVs, account details — absolutely not. PCI DSS is very clear on this: sensitive authentication data must never appear in logs.
Here's the pattern I follow:
func chargeCard(ctx context.Context, req ChargeRequest) error {
	resp, err := gateway.Charge(ctx, req)
	if err != nil {
		// Safe context: transaction ID, gateway name, amount
		// Never include: card number, CVV, full account number
		return fmt.Errorf(
			"charge failed for txn %s via %s (amount: %d %s): %w",
			req.TransactionID,
			req.GatewayName,
			req.Amount,
			req.Currency,
			classifyError(err, resp),
		)
	}
	return nil
}

// What NOT to do:
// return fmt.Errorf("charge failed for card %s: %w", req.CardNumber, err)
// This puts PAN data in your logs. Don't do it.
The wrapped error preserves the full chain. Any caller can still use errors.As to extract the PaymentError and check the Retryable flag, but now the error message also carries the operational context you need when you're debugging at 2 AM.
One thing I've found useful: create a helper that masks sensitive data if it accidentally gets passed in. It's a safety net, not a primary defense, but it's caught real mistakes in code review.
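A masking helper of that kind could look roughly like the sketch below. MaskPAN and the panLike pattern are assumptions for illustration, not the helper from our codebase. It treats any standalone run of 13 to 19 digits as a possible card number and keeps only the last four digits, a common masking convention.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// panLike matches standalone runs of 13 to 19 digits, the usual length range
// of card numbers. It does not catch numbers split by spaces or dashes.
var panLike = regexp.MustCompile(`\b\d{13,19}\b`)

// MaskPAN replaces every match with asterisks, keeping the last four digits.
func MaskPAN(s string) string {
	return panLike.ReplaceAllStringFunc(s, func(m string) string {
		return strings.Repeat("*", len(m)-4) + m[len(m)-4:]
	})
}

func main() {
	fmt.Println(MaskPAN("charge failed for card 4111111111111111"))
	fmt.Println(MaskPAN("charge failed for txn txn_12345 (amount: 1999 USD)"))
}
```

The pattern is deliberately crude. It will also mask a long all-numeric transaction ID, and it misses formatted numbers such as "4111 1111 1111 1111", which is why it belongs behind code review and not in place of it.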
Error Handling in Retry Logic
Retry logic in payment systems has to be smarter than a simple loop. You need to respect the Retryable flag, implement exponential backoff to avoid hammering a struggling gateway, and set a hard ceiling on attempts. Here's the retry function we use:
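func RetryPayment(ctx context.Context, txnID string, fn func() error) error {
	maxRetries := 3
	baseDelay := 500 * time.Millisecond
	var lastErr error

	for attempt := 0; attempt <= maxRetries; attempt++ {
		lastErr = fn()
		if lastErr == nil {
			return nil
		}

		var payErr *PaymentError
		if !errors.As(lastErr, &payErr) || !payErr.Retryable {
			// Permanent or unclassified failure: retrying won't help,
			// and retrying an unknown error risks a duplicate charge
			return fmt.Errorf(
				"non-retryable failure on attempt %d for txn %s: %w",
				attempt+1, txnID, lastErr,
			)
		}

		if attempt < maxRetries {
			// Exponential backoff: 500ms, 1s, 2s
			delay := baseDelay * time.Duration(1<<attempt)
			select {
			case <-ctx.Done():
				// Caller cancelled or timed out: stop retrying immediately
				return fmt.Errorf("retry cancelled for txn %s: %w", txnID, ctx.Err())
			case <-time.After(delay):
			}
		}
	}
	return fmt.Errorf("retries exhausted for txn %s: %w", txnID, lastErr)
}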
A few things to note. The function respects context cancellation: if the upstream caller times out or cancels, we stop retrying immediately instead of burning through attempts. The backoff is exponential and capped at three retries (four attempts in total), because in payments, if a charge still fails after three retries, something is genuinely wrong and you need human eyes on it.
Always pair retry logic with idempotency keys. If you're retrying a charge, the gateway needs to know it's the same charge attempt, not a new one. Without idempotency keys, retry logic is just a double-charge generator.
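Here is a small sketch of why the key matters, using a made-up in-memory gateway. fakeGateway, ChargeWithRetry, and the key format are all hypothetical, and backoff is left out to keep the focus on the key. The first call applies the charge but loses the response, which is exactly the ambiguous timeout case. The retry reuses the same key, so the gateway recognizes the attempt and does not charge again.

```go
package main

import (
	"errors"
	"fmt"
)

var errTimeout = errors.New("gateway timeout")

// fakeGateway is a hypothetical in-memory gateway that deduplicates by idempotency key.
type fakeGateway struct {
	charges  map[string]int64 // idempotency key -> amount charged
	failNext bool             // simulate one lost response after the charge was applied
}

// Charge applies the charge at most once per key, then optionally "times out".
func (g *fakeGateway) Charge(key string, amount int64) error {
	if _, seen := g.charges[key]; !seen {
		g.charges[key] = amount
	}
	if g.failNext {
		g.failNext = false
		return errTimeout // the charge went through, but the caller cannot know that
	}
	return nil
}

// ChargeWithRetry takes the key from the caller, so every attempt sends the same one.
// Generating a fresh key inside the loop would defeat the purpose.
func ChargeWithRetry(g *fakeGateway, key string, amount int64, maxAttempts int) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = g.Charge(key, amount); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	g := &fakeGateway{charges: map[string]int64{}, failNext: true}
	err := ChargeWithRetry(g, "txn-1001-charge", 1999, 3)
	fmt.Println(err, len(g.charges)) // prints "<nil> 1": two calls, one charge
}
```

Real gateways differ in how they accept the key (usually a request header or field) and how long they remember it, so check the processor's documentation before relying on this behavior.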
Audit Trail Errors
In financial systems, every error is a potential compliance event. Regulators and auditors want to know what happened, when, and what the system did about it. But PCI DSS means you can't just dump everything into your logs. You need structured logging that captures operational context while scrubbing sensitive data.
Here's the pattern I use with structured logging (we use slog from the standard library, but this works with zerolog or zap too):
func LogPaymentError(logger *slog.Logger, err error, txnID string) {
	var payErr *PaymentError
	if errors.As(err, &payErr) {
		logger.Error("payment processing failed",
			// Operational context: safe for logs
			slog.String("transaction_id", txnID),
			slog.String("error_code", payErr.Code),
			slog.Bool("retryable", payErr.Retryable),
			slog.Int("http_status", payErr.HTTPStatus),
			slog.String("gateway_response", payErr.GatewayResponse),
			slog.Time("timestamp", time.Now().UTC()),
			// Classification for alerting
			slog.String("category", classifyForAlert(payErr)),
			// Never log these:
			// slog.String("card_number", req.CardNumber),   // PCI violation
			// slog.String("cvv", req.CVV),                  // PCI violation
			// slog.String("account_number", req.Account),   // PII risk
		)
		return
	}

	// Unclassified error: log and escalate
	logger.Error("unclassified payment error",
		slog.String("transaction_id", txnID),
		slog.String("error", err.Error()),
		slog.String("category", "requires_review"),
		slog.Time("timestamp", time.Now().UTC()),
	)
}

func classifyForAlert(err *PaymentError) string {
	if err.Retryable {
		return "retryable"
	}
	// Cover the sentinel code, the gateway codes the classifier maps to
	// review, and the classifier's "unknown_" fallback
	if err.Code == "fraud_suspected" || err.Code == "fraud_warning" ||
		err.Code == "risk_threshold" || strings.HasPrefix(err.Code, "unknown_") {
		return "requires_review"
	}
	return "permanent"
}
The structured fields make these errors searchable. When compliance asks "show me all fraud-flagged transactions from last Tuesday," you can query category=requires_review AND error_code=fraud_suspected instead of grepping through unstructured log lines. That's the difference between a 30-second query and a two-hour investigation.
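Leaving sensitive fields out of the call is the primary defense. As a second layer, slog lets you scrub attributes at the handler through the ReplaceAttr option. The sketch below is one possible setup, not our production configuration: sensitiveKeys, redact, NewScrubbedLogger, and logToString are names invented for this example, and the key list is an assumption you would adapt to your own field names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
)

// sensitiveKeys lists attribute names that must never reach the log output.
var sensitiveKeys = map[string]bool{
	"card_number":    true,
	"cvv":            true,
	"account_number": true,
}

// redact is a ReplaceAttr hook: it swaps the value of any sensitive key for a placeholder.
func redact(_ []string, a slog.Attr) slog.Attr {
	if sensitiveKeys[a.Key] {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

// NewScrubbedLogger builds a JSON logger that runs redact on every attribute.
func NewScrubbedLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{ReplaceAttr: redact}))
}

// logToString is a demo helper: it logs one key/value pair and returns the JSON line.
func logToString(key, value string) string {
	var buf bytes.Buffer
	NewScrubbedLogger(&buf).Error("payment processing failed", slog.String(key, value))
	return buf.String()
}

func main() {
	fmt.Print(logToString("transaction_id", "txn-1001"))
	fmt.Print(logToString("card_number", "4111111111111111"))
}
```

This only catches sensitive data logged under a known key. A card number embedded in a free-text message or in gateway_response passes straight through, so it complements the logging discipline above and does not replace it.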
I also route different categories to different alerting channels. Retryable errors go to a dashboard for trend monitoring — a spike in timeouts might mean the gateway is having issues. Permanent declines are normal business flow. But requires_review errors page the on-call engineer, because those are the ones that can turn into real problems if they sit unattended.
Putting It All Together
These patterns aren't complex individually, but they compound. Typed errors feed the classifier. The classifier informs retry logic. Retry logic generates structured audit logs. Each layer builds on the one below it, and the result is a system where you can confidently answer "what happened to this transaction?" at any point in the pipeline.
If you're building payment systems in Go, start with the error type. Get that right, and the rest follows naturally. The PaymentError struct I showed above has been stable in our codebase for over two years — the classification logic changes as we integrate new gateways, but the core type hasn't needed modification. That's a good sign that the abstraction is at the right level.
References
The code examples in this article are simplified for clarity and do not represent production-ready implementations. Always conduct thorough testing, security review, and compliance validation before deploying error handling logic in financial systems. PCI DSS requirements vary by merchant level — consult your QSA for specific guidance on logging and data handling.