Why Refunds Are Harder Than Charges
Charges are optimistic. You send a request, the gateway says yes or no, you record the result. The happy path is linear. Refunds are the opposite — they're inherently adversarial to your own system. You're unwinding something that already settled, already hit the ledger, already got reported to finance, and possibly already triggered a commission payout to a sales partner.
I've built refund systems at two different payment companies now, and the pattern is always the same. The charge flow gets months of careful design. The refund flow gets a single endpoint that calls gateway.Refund(chargeID, amount) and hopes for the best. Then three months in, someone does a partial refund on a cross-currency transaction during the settlement window, and the whole thing falls apart.
The core problem is that refunds are not the inverse of charges. A charge creates one ledger entry. A refund might need to reverse that entry, create a new one, adjust fees, recalculate net settlement, and notify three downstream systems — all while handling the possibility that the original charge hasn't even settled yet.
The Refund State Machine
If your refund has two states — "pending" and "done" — you're going to have a bad time. Refunds need at least five states to handle the real world, and each transition has rules about what can trigger it and what side effects it produces.
The partially_refunded state lives on the original charge, not on the refund itself. A charge can have multiple child refunds, each with their own state lifecycle. This parent-child relationship is where most teams get tripped up — they model refund status on the charge record instead of giving each refund its own row.
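One way to keep refund status off the charge record is to derive the charge-level status from its child refund rows whenever it is needed. The sketch below is a minimal illustration under assumptions: the type names, the status strings, and the policy of counting only settled refunds are mine, not a prescribed design.

```go
package main

import "fmt"

// All names here are illustrative; they are not taken from a real library.
type RefundState string

const (
	RefundInitiated  RefundState = "initiated"
	RefundProcessing RefundState = "processing"
	RefundSettled    RefundState = "settled"
	RefundFailed     RefundState = "failed"
)

// Refund is one child row of a charge; amounts are in minor units (cents).
type Refund struct {
	ID          string
	ChargeID    string
	AmountMinor int64
	State       RefundState
}

// ChargeRefundStatus is computed from the children rather than stored on its own.
type ChargeRefundStatus string

const (
	ChargeNotRefunded       ChargeRefundStatus = "not_refunded"
	ChargePartiallyRefunded ChargeRefundStatus = "partially_refunded"
	ChargeFullyRefunded     ChargeRefundStatus = "fully_refunded"
)

// DeriveChargeStatus sums settled refunds only (an assumed policy);
// failed and in-flight refunds do not count toward the total.
func DeriveChargeStatus(chargeAmountMinor int64, refunds []Refund) ChargeRefundStatus {
	var settled int64
	for _, r := range refunds {
		if r.State == RefundSettled {
			settled += r.AmountMinor
		}
	}
	switch {
	case settled == 0:
		return ChargeNotRefunded
	case settled >= chargeAmountMinor:
		return ChargeFullyRefunded
	default:
		return ChargePartiallyRefunded
	}
}

func main() {
	refunds := []Refund{
		{ID: "rf_1", ChargeID: "ch_1", AmountMinor: 2000, State: RefundSettled},
		{ID: "rf_2", ChargeID: "ch_1", AmountMinor: 1500, State: RefundFailed},
	}
	fmt.Println(DeriveChargeStatus(5000, refunds)) // partially_refunded
}
```

Because each refund keeps its own state, a failed child never leaves the parent stuck in a misleading status.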
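// Go: Refund state machine with explicit transition validation
import (
	"fmt"
	"time"
)

type RefundState string

const (
	RefundInitiated  RefundState = "initiated"
	RefundProcessing RefundState = "processing"
	RefundSettled    RefundState = "settled"
	RefundFailed     RefundState = "failed"
)

// Refund shows only the fields the state machine touches;
// ID, ChargeID, Amount, and Currency are omitted here.
type Refund struct {
	State         RefundState
	PreviousState RefundState
	UpdatedAt     time.Time
}

// validTransitions maps each state to the states it may move to.
// RefundSettled has no entry, so it is terminal.
var validTransitions = map[RefundState][]RefundState{
	RefundInitiated:  {RefundProcessing, RefundFailed},
	RefundProcessing: {RefundSettled, RefundFailed},
	RefundFailed:     {RefundInitiated}, // allow retry
}

// TransitionTo moves the refund to next, or returns an error if the move is not allowed.
func (r *Refund) TransitionTo(next RefundState) error {
	allowed, ok := validTransitions[r.State]
	if !ok {
		return fmt.Errorf("no transitions from state %s", r.State)
	}
	for _, s := range allowed {
		if s == next {
			r.PreviousState = r.State
			r.State = next
			r.UpdatedAt = time.Now().UTC()
			return nil
		}
	}
	return fmt.Errorf("invalid transition: %s -> %s", r.State, next)
}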
Design tip: Store every state transition in an audit log table, not just the current state. When finance asks "why did this refund take 4 days?" you need the full timeline: initiated at T0, sent to gateway at T0+2min, gateway timeout at T0+2min30s, retried at T0+1hr, settled at T0+96hr. Without the log, you're guessing.
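A timeline like the one in the tip above falls out naturally if every transition is appended as its own record. This is a minimal in-memory sketch; TransitionRecord, AuditLog, and the example refund "rf_1" are hypothetical names, and a real system would back this with an append-only database table.

```go
package main

import (
	"fmt"
	"time"
)

// TransitionRecord is a hypothetical audit-log row: one per state change.
type TransitionRecord struct {
	RefundID string
	From     string
	To       string
	Reason   string
	At       time.Time
}

// AuditLog is an in-memory stand-in for an append-only table.
type AuditLog struct {
	records []TransitionRecord
}

// Append adds a record; existing records are never updated or deleted.
func (l *AuditLog) Append(rec TransitionRecord) {
	l.records = append(l.records, rec)
}

// Timeline returns the records for one refund in insertion order.
func (l *AuditLog) Timeline(refundID string) []TransitionRecord {
	var out []TransitionRecord
	for _, r := range l.records {
		if r.RefundID == refundID {
			out = append(out, r)
		}
	}
	return out
}

// Elapsed reports the time between the first and last recorded transition.
func Elapsed(timeline []TransitionRecord) time.Duration {
	if len(timeline) < 2 {
		return 0
	}
	return timeline[len(timeline)-1].At.Sub(timeline[0].At)
}

// demoLog builds an example history for a made-up refund "rf_1".
func demoLog() *AuditLog {
	t0 := time.Date(2024, 1, 8, 9, 0, 0, 0, time.UTC)
	steps := []struct {
		from, to, reason string
		offset           time.Duration
	}{
		{"", "initiated", "agent request", 0},
		{"initiated", "processing", "sent to gateway", 2 * time.Minute},
		{"processing", "failed", "gateway timeout", 2*time.Minute + 30*time.Second},
		{"failed", "initiated", "retry", time.Hour},
		{"initiated", "processing", "sent to gateway", time.Hour + time.Minute},
		{"processing", "settled", "gateway confirmation", 96 * time.Hour},
	}
	audit := &AuditLog{}
	for _, s := range steps {
		audit.Append(TransitionRecord{
			RefundID: "rf_1", From: s.from, To: s.to, Reason: s.reason, At: t0.Add(s.offset),
		})
	}
	return audit
}

func main() {
	timeline := demoLog().Timeline("rf_1")
	for _, rec := range timeline {
		fmt.Printf("%s  %-10s -> %-10s  %s\n", rec.At.Format(time.RFC3339), rec.From, rec.To, rec.Reason)
	}
	fmt.Println("total:", Elapsed(timeline)) // total: 96h0m0s
}
```

Storing the reason alongside each transition is what turns "it took four days" into an answer someone in finance can act on.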
Partial Refunds: The Accounting Nightmare
Full refunds are straightforward. Partial refunds are where the real complexity lives. You need to track the cumulative refunded amount against the original charge and enforce that it never exceeds the original. Sounds trivial until you consider concurrency.
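Picture this: a customer service agent clicks "refund $20" at the exact moment an automated system triggers a "$15 partial refund" for a returned item. Both requests read the current refunded total as $0 and both validate their amount against the full $50 charge. Without coordination, the second write to the refunded total can overwrite the first, so your records show $15 or $20 refunded while $35 actually went back to the customer. Or worse, both succeed at the gateway and you've refunded $35 when the business only intended $20. Scale the amounts up to $30 and $25 and the same race refunds more than the original charge.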
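// Go: Atomic partial refund using a row lock plus a version check
// Assumes imports: context, database/sql, fmt, github.com/shopspring/decimal.
// Charge, RefundService, and generateRefundID are defined elsewhere.
func (s *RefundService) CreatePartialRefund(ctx context.Context, chargeID string, amount decimal.Decimal) (*Refund, error) {
	if amount.Sign() <= 0 {
		return nil, fmt.Errorf("refund amount must be positive, got %s", amount)
	}
	tx, err := s.db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
	if err != nil {
		return nil, fmt.Errorf("begin tx: %w", err)
	}
	defer tx.Rollback() // harmless after a successful Commit

	// Lock the charge row and get current refund total
	var charge Charge
	err = tx.QueryRowContext(ctx,
		`SELECT id, amount, currency, refunded_total, version
		 FROM charges WHERE id = $1 FOR UPDATE`, chargeID,
	).Scan(&charge.ID, &charge.Amount, &charge.Currency, &charge.RefundedTotal, &charge.Version)
	if err != nil {
		return nil, fmt.Errorf("lock charge: %w", err)
	}

	remaining := charge.Amount.Sub(charge.RefundedTotal)
	if amount.GreaterThan(remaining) {
		return nil, fmt.Errorf("refund %s exceeds remaining %s", amount, remaining)
	}

	refund := &Refund{
		ID:       generateRefundID(),
		ChargeID: chargeID,
		Amount:   amount,
		Currency: charge.Currency,
		State:    RefundInitiated,
	}

	// Insert the refund row (the refunds table and its columns are an assumed schema)
	_, err = tx.ExecContext(ctx,
		`INSERT INTO refunds (id, charge_id, amount, currency, state)
		 VALUES ($1, $2, $3, $4, $5)`,
		refund.ID, refund.ChargeID, refund.Amount, refund.Currency, string(refund.State))
	if err != nil {
		return nil, fmt.Errorf("insert refund: %w", err)
	}

	// Update the charge in the same transaction; the version check catches a stale read
	res, err := tx.ExecContext(ctx,
		`UPDATE charges SET refunded_total = refunded_total + $1, version = version + 1
		 WHERE id = $2 AND version = $3`, amount, chargeID, charge.Version)
	if err != nil {
		return nil, fmt.Errorf("update charge: %w", err)
	}
	if n, err := res.RowsAffected(); err != nil {
		return nil, fmt.Errorf("rows affected: %w", err)
	} else if n != 1 {
		return nil, fmt.Errorf("charge %s was modified concurrently", chargeID)
	}

	if err := tx.Commit(); err != nil {
		return nil, fmt.Errorf("commit: %w", err)
	}
	return refund, nil
}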
The FOR UPDATE lock and version check are non-negotiable. I've seen teams try to solve this with application-level mutexes, and it works until you have two API server instances. The database is the only reliable coordination point.
Ledger Entries for Refunds
This is the part that trips up engineers who haven't worked in fintech before. A refund isn't just "subtract money from the merchant." It's a double-entry bookkeeping event, and the entries look different depending on whether the original charge has settled or not.
Refund Before Settlement vs. After Settlement
When a refund happens before the original charge has settled with the acquirer, you can often void the transaction entirely — no money actually moves. But once settlement has occurred, the funds have already been transferred, and the refund becomes a new money movement in the opposite direction.
| Aspect | Pre-Settlement (Void) | Post-Settlement (Refund) |
|---|---|---|
| Money movement | None — authorization released | New transfer back to cardholder |
| Ledger entries | Reverse the pending entries | New debit/credit pair |
| Processing fees | Usually not charged | Original fee often not returned |
| Timeline | Instant to a few hours | 3-10 business days |
| Gateway API call | void(authorizationID) | refund(chargeID, amount) |
| Reconciliation impact | Transaction disappears from settlement | Appears as separate line item |
The ledger implications are significant. For a pre-settlement void, you reverse the original journal entries — debit and credit swap. For a post-settlement refund, you create entirely new entries: debit the merchant's settlement account, credit the customer's receivable. The original charge entries stay untouched because that money actually moved.
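To make the two cases concrete, here is a minimal double-entry sketch. The Entry type, the account names, and the shape of the original charge postings are assumptions for illustration; your chart of accounts will differ.

```go
package main

import "fmt"

// Entry is one line of a journal posting; amounts are in minor units.
// The type and account names are illustrative assumptions.
type Entry struct {
	Account     string
	DebitMinor  int64
	CreditMinor int64
	Memo        string
}

// ReverseEntries builds void postings: same accounts, debit and credit swapped.
func ReverseEntries(original []Entry) []Entry {
	out := make([]Entry, 0, len(original))
	for _, e := range original {
		out = append(out, Entry{
			Account:     e.Account,
			DebitMinor:  e.CreditMinor,
			CreditMinor: e.DebitMinor,
			Memo:        "void: " + e.Memo,
		})
	}
	return out
}

// RefundEntries builds a new pair for a post-settlement refund:
// debit the merchant's settlement account, credit the customer's receivable.
func RefundEntries(refundID string, amountMinor int64) []Entry {
	return []Entry{
		{Account: "merchant_settlement", DebitMinor: amountMinor, Memo: "refund " + refundID},
		{Account: "customer_receivable", CreditMinor: amountMinor, Memo: "refund " + refundID},
	}
}

// Balanced reports whether total debits equal total credits.
func Balanced(entries []Entry) bool {
	var debits, credits int64
	for _, e := range entries {
		debits += e.DebitMinor
		credits += e.CreditMinor
	}
	return debits == credits
}

func main() {
	// Assumed original postings for a 50.00 charge.
	charge := []Entry{
		{Account: "customer_receivable", DebitMinor: 5000, Memo: "charge ch_1"},
		{Account: "merchant_settlement", CreditMinor: 5000, Memo: "charge ch_1"},
	}
	void := ReverseEntries(charge)
	refund := RefundEntries("rf_1", 2000)
	fmt.Println(Balanced(void), Balanced(refund)) // true true
	for _, e := range append(void, refund...) {
		fmt.Printf("%-20s debit=%d credit=%d (%s)\n", e.Account, e.DebitMinor, e.CreditMinor, e.Memo)
	}
}
```

Note that the refund path never touches the charge entries; it only appends new ones, which keeps the history of money that actually moved intact.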
Watch out: Some gateways silently convert a refund request into a void if the charge hasn't settled yet. This is usually fine operationally, but it means your ledger logic needs to handle both outcomes from a single API call. Always check the gateway response type, not just the HTTP status code.
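A small sketch of what checking the response type might look like. GatewayResponse and its Outcome field are hypothetical; real gateways expose this under different names, so treat the shape as an assumption and map your gateway's field onto it.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome is a hypothetical normalized result type for a refund request.
type Outcome string

const (
	OutcomeVoided   Outcome = "voided"
	OutcomeRefunded Outcome = "refunded"
)

// GatewayResponse is an assumed, simplified view of a gateway reply.
type GatewayResponse struct {
	HTTPStatus int
	Outcome    Outcome
}

// LedgerAction names which ledger path to run.
type LedgerAction string

const (
	ReversePendingEntries LedgerAction = "reverse_pending_entries"
	PostRefundEntries     LedgerAction = "post_refund_entries"
)

var ErrUnknownOutcome = errors.New("unknown gateway outcome")

// LedgerActionFor picks the ledger path from the response type,
// not from the HTTP status alone.
func LedgerActionFor(resp GatewayResponse) (LedgerAction, error) {
	if resp.HTTPStatus < 200 || resp.HTTPStatus >= 300 {
		return "", fmt.Errorf("gateway returned HTTP %d", resp.HTTPStatus)
	}
	switch resp.Outcome {
	case OutcomeVoided:
		return ReversePendingEntries, nil
	case OutcomeRefunded:
		return PostRefundEntries, nil
	default:
		// A 2xx with an unrecognized outcome should stop, not guess.
		return "", fmt.Errorf("%w: %q", ErrUnknownOutcome, resp.Outcome)
	}
}

func main() {
	for _, resp := range []GatewayResponse{
		{HTTPStatus: 200, Outcome: OutcomeVoided},
		{HTTPStatus: 200, Outcome: OutcomeRefunded},
		{HTTPStatus: 200, Outcome: "pending_review"},
	} {
		action, err := LedgerActionFor(resp)
		fmt.Println(action, err)
	}
}
```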
Timing Edge Cases
Timing is where refund systems go from "works in staging" to "on-call nightmare." Here are the three scenarios that have burned me:
Refund During the Settlement Window
You submit a refund at 11:58 PM. The acquirer's settlement batch cuts at midnight. Did your refund make it into tonight's batch, or will it appear in tomorrow's? You genuinely don't know, and the gateway might not tell you for 24 hours. Your reconciliation pipeline needs to handle the refund appearing in either batch without double-counting it.
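One way to stay indifferent to which batch a refund lands in is to reconcile by refund reference rather than by batch date. The sketch below is a simplified, in-memory illustration; Reconciler, SettlementLine, and the batch IDs are hypothetical names, not any acquirer's file format.

```go
package main

import (
	"fmt"
	"sort"
)

// SettlementLine is an assumed, simplified row from an acquirer settlement file.
type SettlementLine struct {
	BatchID     string
	RefundRef   string
	AmountMinor int64
}

type MatchResult string

const (
	Matched        MatchResult = "matched"
	Duplicate      MatchResult = "duplicate"
	Unknown        MatchResult = "unknown"
	AmountMismatch MatchResult = "amount_mismatch"
)

// Reconciler matches by refund reference, so the batch a refund lands in is irrelevant.
type Reconciler struct {
	expected  map[string]int64  // refund reference -> amount we submitted
	matchedIn map[string]string // refund reference -> batch where it was first matched
}

func NewReconciler(expected map[string]int64) *Reconciler {
	return &Reconciler{expected: expected, matchedIn: make(map[string]string)}
}

// Apply counts each refund at most once, whichever batch reports it first.
func (r *Reconciler) Apply(line SettlementLine) MatchResult {
	want, ok := r.expected[line.RefundRef]
	if !ok {
		return Unknown
	}
	if _, seen := r.matchedIn[line.RefundRef]; seen {
		return Duplicate
	}
	if want != line.AmountMinor {
		return AmountMismatch
	}
	r.matchedIn[line.RefundRef] = line.BatchID
	return Matched
}

// Outstanding lists submitted refunds that no batch has reported yet.
func (r *Reconciler) Outstanding() []string {
	var out []string
	for ref := range r.expected {
		if _, ok := r.matchedIn[ref]; !ok {
			out = append(out, ref)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	rec := NewReconciler(map[string]int64{"rf_1": 2000, "rf_2": 1500})
	lines := []SettlementLine{
		{BatchID: "2024-01-08", RefundRef: "rf_2", AmountMinor: 1500},
		{BatchID: "2024-01-09", RefundRef: "rf_1", AmountMinor: 2000}, // missed the midnight cut
		{BatchID: "2024-01-09", RefundRef: "rf_2", AmountMinor: 1500}, // reported again
	}
	for _, l := range lines {
		fmt.Println(l.BatchID, l.RefundRef, rec.Apply(l))
	}
	fmt.Println("outstanding:", rec.Outstanding())
}
```

Anything still in Outstanding after a couple of settlement cycles is a candidate for an alert rather than a silent assumption.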
Refund After a Chargeback
A customer files a chargeback on Monday. On Tuesday, before you've processed the chargeback notification, a support agent issues a refund for the same transaction. Now the customer gets their money back twice — once from the refund and once from the chargeback. Your system needs to check for open disputes before allowing a refund, and it needs to handle the race condition where the chargeback webhook arrives between your check and your refund submission.
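The check and the race can be sketched together: do the dispute check and the "refund in flight" registration under one lock, and have the chargeback handler flag any charge that already has a refund in flight. This in-memory version uses a mutex purely for illustration; ChargeGuard and its methods are hypothetical, and across multiple server instances the same logic would need to live in the database.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrOpenDispute = errors.New("charge has an open dispute")

// ChargeGuard is a hypothetical in-memory coordinator for refunds and disputes.
type ChargeGuard struct {
	mu          sync.Mutex
	disputed    map[string]bool
	inFlight    map[string]int
	needsReview map[string]bool
}

func NewChargeGuard() *ChargeGuard {
	return &ChargeGuard{
		disputed:    make(map[string]bool),
		inFlight:    make(map[string]int),
		needsReview: make(map[string]bool),
	}
}

// BeginRefund checks for an open dispute and registers the refund as in flight
// in one critical section, so a chargeback cannot slip in between the two steps.
func (g *ChargeGuard) BeginRefund(chargeID string) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.disputed[chargeID] {
		return ErrOpenDispute
	}
	g.inFlight[chargeID]++
	return nil
}

// EndRefund marks one in-flight refund as finished (settled or failed).
func (g *ChargeGuard) EndRefund(chargeID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight[chargeID] > 0 {
		g.inFlight[chargeID]--
	}
}

// RecordChargeback marks the charge as disputed. It returns true when a refund
// was already in flight, meaning the customer may be paid twice and a human
// should review the case.
func (g *ChargeGuard) RecordChargeback(chargeID string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.disputed[chargeID] = true
	if g.inFlight[chargeID] > 0 {
		g.needsReview[chargeID] = true
		return true
	}
	return false
}

func main() {
	guard := NewChargeGuard()
	fmt.Println(guard.BeginRefund("ch_1"))      // <nil>
	fmt.Println(guard.RecordChargeback("ch_1")) // true
	fmt.Println(guard.BeginRefund("ch_1"))      // charge has an open dispute
}
```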
Cross-Day Refunds and Reporting
The charge was on March 31. The refund is on April 2. These land in different reporting periods. If your finance team closes the books monthly, that March revenue number just changed retroactively. Your reporting system needs to decide: does the refund reduce March revenue or April revenue? There's no universally correct answer — it depends on your accounting policy — but your code needs to support whichever choice the business makes.
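Supporting either choice can be as small as making the attribution policy an explicit parameter. A minimal sketch, assuming monthly periods and UTC dates; AttributionPolicy and ReportingPeriod are illustrative names.

```go
package main

import (
	"fmt"
	"time"
)

// AttributionPolicy is a hypothetical switch for the accounting decision.
type AttributionPolicy int

const (
	AttributeToRefundDate AttributionPolicy = iota // refund reduces the period it was issued in
	AttributeToChargeDate                          // refund restates the original charge's period
)

// ReportingPeriod returns the monthly period ("YYYY-MM") a refund counts against.
// Using UTC is an assumption; use the business's reporting time zone in practice.
func ReportingPeriod(policy AttributionPolicy, chargeAt, refundAt time.Time) string {
	at := refundAt
	if policy == AttributeToChargeDate {
		at = chargeAt
	}
	return at.UTC().Format("2006-01")
}

// day is a small helper for building UTC dates in examples.
func day(year, month, d int) time.Time {
	return time.Date(year, time.Month(month), d, 0, 0, 0, 0, time.UTC)
}

func main() {
	chargeAt := day(2024, 3, 31)
	refundAt := day(2024, 4, 2)
	fmt.Println(ReportingPeriod(AttributeToChargeDate, chargeAt, refundAt)) // 2024-03
	fmt.Println(ReportingPeriod(AttributeToRefundDate, chargeAt, refundAt)) // 2024-04
}
```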
The Gateway Abstraction Problem
Every gateway handles refunds slightly differently, and the differences are in the details that matter. Stripe lets you issue multiple partial refunds up to the original amount with a simple API call. Adyen requires you to reference the original pspReference and handles partial refunds through modification requests. Direct acquirer integrations often require you to submit refunds in batch files and poll for results hours later.
The temptation is to build a unified RefundProvider interface and pretend the differences don't exist. That works for the happy path. It falls apart when you need to handle gateway-specific error codes, retry semantics, and the fact that some gateways are synchronous (you know immediately if the refund succeeded) while others are asynchronous (you submit and wait for a webhook).
The approach I've found workable: a thin interface for the common operations, with gateway-specific adapters that expose the full capability set. The refund orchestrator calls the interface for standard refunds but can reach through to the adapter when it needs gateway-specific behavior like checking void eligibility or querying refund status.
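In Go, this "thin interface plus reach-through" shape maps onto optional interfaces and a type assertion. The sketch below is one possible arrangement; RefundProvider, VoidChecker, and the two fake gateways are hypothetical and do not mirror any real SDK.

```go
package main

import (
	"context"
	"fmt"
)

// RefundProvider is the thin interface every gateway adapter implements.
type RefundProvider interface {
	Refund(ctx context.Context, chargeRef string, amountMinor int64) (refundRef string, err error)
}

// VoidChecker is an optional capability only some adapters expose.
type VoidChecker interface {
	CanVoid(ctx context.Context, chargeRef string) (bool, error)
}

// cardGateway is a fake adapter that supports void-eligibility checks.
type cardGateway struct{ unsettled map[string]bool }

func (g cardGateway) Refund(ctx context.Context, chargeRef string, amountMinor int64) (string, error) {
	return "rf_" + chargeRef, nil
}

func (g cardGateway) CanVoid(ctx context.Context, chargeRef string) (bool, error) {
	return g.unsettled[chargeRef], nil
}

// batchAcquirer is a fake adapter that only supports plain refunds.
type batchAcquirer struct{}

func (batchAcquirer) Refund(ctx context.Context, chargeRef string, amountMinor int64) (string, error) {
	return "batch_" + chargeRef, nil
}

// PlanRefund uses the common interface by default and reaches through to the
// adapter only when it advertises the extra capability.
func PlanRefund(ctx context.Context, p RefundProvider, chargeRef string) (string, error) {
	if vc, ok := p.(VoidChecker); ok {
		can, err := vc.CanVoid(ctx, chargeRef)
		if err != nil {
			return "", fmt.Errorf("check void eligibility: %w", err)
		}
		if can {
			return "void", nil
		}
	}
	return "refund", nil
}

// demoPlans runs the same orchestration against both fake adapters.
func demoPlans() (string, string) {
	ctx := context.Background()
	card := cardGateway{unsettled: map[string]bool{"ch_1": true}}
	a, _ := PlanRefund(ctx, card, "ch_1")
	b, _ := PlanRefund(ctx, batchAcquirer{}, "ch_1")
	return a, b
}

func main() {
	a, b := demoPlans()
	fmt.Println(a, b) // void refund
}
```

The orchestrator never imports a gateway package directly, yet gateway-specific behavior stays reachable without widening the common interface.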
Production Lessons
After running refund systems in production across card payments, bank transfers, and e-wallet providers, these are the gotchas that don't show up in documentation:
- Currency rounding will bite you. A $10.00 charge split into three partial refunds of $3.33 each leaves $0.01 unrefundable. Decide upfront whether the last refund gets rounded up or whether you allow a 1-cent residual. Document the policy. Your support team will thank you.
- Refund reference IDs must be globally unique and idempotent. If a network timeout causes a retry, the gateway needs to recognize it as a duplicate. Generate a deterministic refund ID from the charge ID + refund sequence number, not a random UUID. Some gateways deduplicate on your reference; others don't — know which is which.
- Webhook ordering is not guaranteed. You might receive the refund.settled webhook before refund.created. Your handler needs to be order-independent. I use an upsert pattern: if the refund row doesn't exist when a settlement webhook arrives, create it in the settled state and backfill the details.
- Idempotency keys expire. Stripe's idempotency keys last 24 hours. If your retry logic spans longer than that (say, a weekend outage), the same key will create a second refund. Track gateway-side refund IDs independently of your idempotency mechanism.
- Test with real settlement cycles. Sandbox environments usually settle refunds instantly. Production doesn't. The bugs that matter — race conditions with settlement batches, reconciliation mismatches, reporting period boundaries — only surface when there's a real 48-72 hour delay between submission and settlement.
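Two of the points above, the rounding residual and deterministic refund references, are easy to pin down in code. This is a minimal sketch assuming amounts in minor units and a "last refund absorbs the remainder" policy; RefundReference, SplitWithResidual, and the reference format are illustrative choices, not a gateway requirement.

```go
package main

import "fmt"

// RefundReference derives a stable reference from the charge ID and the refund's
// sequence number, so a retry of the same refund reuses the same reference.
// The "-rNNN" format is an arbitrary illustrative choice.
func RefundReference(chargeID string, seq int) string {
	return fmt.Sprintf("%s-r%03d", chargeID, seq)
}

// SplitWithResidual divides totalMinor into equal parts and lets the last part
// absorb whatever remainder integer division leaves behind.
func SplitWithResidual(totalMinor int64, parts int) ([]int64, error) {
	if parts <= 0 || totalMinor < 0 {
		return nil, fmt.Errorf("invalid split: total=%d parts=%d", totalMinor, parts)
	}
	base := totalMinor / int64(parts)
	out := make([]int64, parts)
	for i := range out {
		out[i] = base
	}
	out[parts-1] += totalMinor - base*int64(parts)
	return out, nil
}

func main() {
	amounts, err := SplitWithResidual(1000, 3) // a 10.00 charge in three refunds
	if err != nil {
		panic(err)
	}
	for i, a := range amounts {
		fmt.Printf("%s %d.%02d\n", RefundReference("ch_1", i+1), a/100, a%100)
	}
	// Prints:
	// ch_1-r001 3.33
	// ch_1-r002 3.33
	// ch_1-r003 3.34
}
```

If the policy is instead to allow a residual, the same function can return the remainder separately rather than folding it into the last refund.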
References
- Stripe Refunds — API reference and refund lifecycle documentation
- Adyen Refund — Modification requests and partial refund handling
- PCI DSS v4.0.1 — Payment Card Industry Data Security Standard
- shopspring/decimal — Arbitrary-precision fixed-point decimals in Go
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Technical specifications are subject to change — always verify with official documentation.