Payment Refund Engineering — State Machines, Partial Refunds, and the Ledger Entries Nobody Talks About

Why Refunds Are Harder Than Charges

Charges are optimistic. You send a request, the gateway says yes or no, you record the result. The happy path is linear. Refunds are the opposite — they're inherently adversarial to your own system. You're unwinding something that already settled, already hit the ledger, already got reported to finance, and possibly already triggered a commission payout to a sales partner.

I've built refund systems at two different payment companies now, and the pattern is always the same. The charge flow gets months of careful design. The refund flow gets a single endpoint that calls gateway.Refund(chargeID, amount) and hopes for the best. Then three months in, someone does a partial refund on a cross-currency transaction during the settlement window, and the whole thing falls apart.

The core problem is that refunds are not the inverse of charges. A charge creates one journal entry. A refund might need to reverse that entry, create a new one, adjust fees, recalculate net settlement, and notify three downstream systems, all while handling the possibility that the original charge hasn't even settled yet.

Typical refund settlement: 72 hrs
Minimum refund state machine: 5 states
Typical refund rate by volume: 3-8%

The Refund State Machine

If your refund has two states — "pending" and "done" — you're going to have a bad time. Refunds need at least five states to handle the real world, and each transition has rules about what can trigger it and what side effects it produces.

Initiated: request received
Processing: sent to gateway
Settled: funds returned
Failed: gateway rejected

The partially_refunded state lives on the original charge, not on the refund itself. A charge can have multiple child refunds, each with its own state lifecycle. This parent-child relationship is where most teams get tripped up: they model refund status on the charge record instead of giving each refund its own row.
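One way to picture the parent-child split is to derive the charge's status from its child refund rows instead of storing it independently. This is a minimal sketch: the status names "not_refunded" and "refunded", the integer-cents fields, and the choice to count only settled refunds are assumptions for illustration, not something the design above prescribes.

```go
package main

// RefundState mirrors the refund states used in this article.
type RefundState string

const (
	RefundInitiated  RefundState = "initiated"
	RefundProcessing RefundState = "processing"
	RefundSettled    RefundState = "settled"
	RefundFailed     RefundState = "failed"
)

// Refund is one child row. Amounts are in minor units (cents).
type Refund struct {
	ID          string
	ChargeID    string
	AmountCents int64
	State       RefundState
}

// ChargeRefundStatus derives the parent charge's status from its children.
// Assumption: only settled refunds count toward the refunded total.
func ChargeRefundStatus(chargeAmountCents int64, refunds []Refund) string {
	var settled int64
	for _, r := range refunds {
		if r.State == RefundSettled {
			settled += r.AmountCents
		}
	}
	switch {
	case settled == 0:
		return "not_refunded"
	case settled < chargeAmountCents:
		return "partially_refunded"
	default:
		return "refunded"
	}
}
```

A failed or in-flight refund leaves the charge status unchanged here, which is one reason each refund needs its own row and lifecycle.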

// Go: Refund state machine with explicit transition validation
type RefundState string

const (
    RefundInitiated  RefundState = "initiated"
    RefundProcessing RefundState = "processing"
    RefundSettled    RefundState = "settled"
    RefundFailed     RefundState = "failed"
)

// Allowed transitions. RefundSettled has no entry, so it is terminal.
var validTransitions = map[RefundState][]RefundState{
    RefundInitiated:  {RefundProcessing, RefundFailed},
    RefundProcessing: {RefundSettled, RefundFailed},
    RefundFailed:     {RefundInitiated}, // allow retry
}

func (r *Refund) TransitionTo(next RefundState) error {
    allowed, ok := validTransitions[r.State]
    if !ok {
        return fmt.Errorf("no transitions from state %s", r.State)
    }
    for _, s := range allowed {
        if s == next {
            r.PreviousState = r.State
            r.State = next
            r.UpdatedAt = time.Now().UTC()
            return nil
        }
    }
    return fmt.Errorf("invalid transition: %s -> %s", r.State, next)
}

Design tip: Store every state transition in an audit log table, not just the current state. When finance asks "why did this refund take 4 days?" you need the full timeline: initiated at T0, sent to gateway at T0+2min, gateway timeout 30 seconds later, retried at T0+1hr, settled at T0+96hr. Without the log, you're guessing.
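The audit log idea can be sketched as an append-only list of transitions. The Transition fields, the AuditLog type, and the in-memory slice are all hypothetical stand-ins for a real table; the point is only that rows are appended and never updated.

```go
package main

import "time"

// Transition is one row of a hypothetical refund_transitions table.
type Transition struct {
	RefundID string
	From     string
	To       string
	Reason   string // free text for operators, e.g. "gateway timeout"
	At       time.Time
}

// AuditLog is an in-memory stand-in for an append-only table.
type AuditLog struct {
	rows []Transition
}

// Record appends a transition. Rows are never updated or deleted.
func (l *AuditLog) Record(t Transition) {
	l.rows = append(l.rows, t)
}

// Timeline returns the transitions for one refund in insertion order.
func (l *AuditLog) Timeline(refundID string) []Transition {
	var out []Transition
	for _, t := range l.rows {
		if t.RefundID == refundID {
			out = append(out, t)
		}
	}
	return out
}

// Elapsed reports the time between the first and last recorded
// transition for a refund, or zero if fewer than two exist.
func (l *AuditLog) Elapsed(refundID string) time.Duration {
	tl := l.Timeline(refundID)
	if len(tl) < 2 {
		return 0
	}
	return tl[len(tl)-1].At.Sub(tl[0].At)
}
```

Calling Record from the same code path that performs TransitionTo keeps the current state and the history from drifting apart.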

Partial Refunds: The Accounting Nightmare

Full refunds are straightforward. Partial refunds are where the real complexity lives. You need to track the cumulative refunded amount against the original charge and enforce that it never exceeds the original. Sounds trivial until you consider concurrency.

Picture this: a customer service agent clicks "refund $20" at the exact moment an automated system triggers a "$15 partial refund" for a returned item. Both requests read the current refunded total as $0, and both validate that their amount is under the $50 charge. Both succeed at the gateway, and you've refunded $35 when the business only intended $20. On a $30 charge the same race is worse: each request passes its own check, and together they refund more than the customer paid.

// Go: Atomic partial refund with a row lock (FOR UPDATE) plus a version check
func (s *RefundService) CreatePartialRefund(ctx context.Context, chargeID string, amount decimal.Decimal) (*Refund, error) {
    tx, err := s.db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
    if err != nil {
        return nil, fmt.Errorf("begin tx: %w", err)
    }
    defer tx.Rollback()

    // Lock the charge row and get current refund total
    var charge Charge
    err = tx.QueryRowContext(ctx,
        `SELECT id, amount, currency, refunded_total, version
         FROM charges WHERE id = $1 FOR UPDATE`, chargeID,
    ).Scan(&charge.ID, &charge.Amount, &charge.Currency, &charge.RefundedTotal, &charge.Version)
    if err != nil {
        return nil, fmt.Errorf("lock charge: %w", err)
    }

    remaining := charge.Amount.Sub(charge.RefundedTotal)
    if amount.GreaterThan(remaining) {
        return nil, fmt.Errorf("refund %s exceeds remaining %s", amount, remaining)
    }

    refund := &Refund{
        ID:       generateRefundID(),
        ChargeID: chargeID,
        Amount:   amount,
        Currency: charge.Currency,
        State:    RefundInitiated,
    }

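    // Record the refund row. The refunds table and its columns are assumed here.
    _, err = tx.ExecContext(ctx,
        `INSERT INTO refunds (id, charge_id, amount, currency, state)
         VALUES ($1, $2, $3, $4, $5)`,
        refund.ID, refund.ChargeID, refund.Amount, refund.Currency, string(refund.State))
    if err != nil {
        return nil, fmt.Errorf("insert refund: %w", err)
    }

    // Bump the running total; the version predicate guards against a stale read.
    res, err := tx.ExecContext(ctx,
        `UPDATE charges SET refunded_total = refunded_total + $1, version = version + 1
         WHERE id = $2 AND version = $3`, amount, chargeID, charge.Version)
    if err != nil {
        return nil, fmt.Errorf("update charge: %w", err)
    }
    if n, err := res.RowsAffected(); err != nil {
        return nil, fmt.Errorf("rows affected: %w", err)
    } else if n != 1 {
        return nil, fmt.Errorf("charge %s changed concurrently", chargeID)
    }

    // Only hand back the refund once the transaction has actually committed.
    if err := tx.Commit(); err != nil {
        return nil, fmt.Errorf("commit: %w", err)
    }
    return refund, nil
}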

The FOR UPDATE lock and version check are non-negotiable. I've seen teams try to solve this with application-level mutexes, and it works until you have two API server instances. The database is the only reliable coordination point.

Ledger Entries for Refunds

This is the part that trips up engineers who haven't worked in fintech before. A refund isn't just "subtract money from the merchant." It's a double-entry bookkeeping event, and the entries look different depending on whether the original charge has settled or not.

Refund Before Settlement vs. After Settlement

When a refund happens before the original charge has settled with the acquirer, you can often void the transaction entirely — no money actually moves. But once settlement has occurred, the funds have already been transferred, and the refund becomes a new money movement in the opposite direction.

Aspect	Pre-Settlement (Void)	Post-Settlement (Refund)
Money movement	None — authorization released	New transfer back to cardholder
Ledger entries	Reverse the pending entries	New debit/credit pair
Processing fees	Usually not charged	Original fee often not returned
Timeline	Instant to a few hours	3-10 business days
Gateway API call	`void(authorizationID)`	`refund(chargeID, amount)`
Reconciliation impact	Transaction disappears from settlement	Appears as separate line item

The ledger implications are significant. For a pre-settlement void, you reverse the original journal entries — debit and credit swap. For a post-settlement refund, you create entirely new entries: debit the merchant's settlement account, credit the customer's receivable. The original charge entries stay untouched because that money actually moved.

Watch out: Some gateways silently convert a refund request into a void if the charge hasn't settled yet. This is usually fine operationally, but it means your ledger logic needs to handle both outcomes from a single API call. Always check the gateway response type, not just the HTTP status code.
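The void-versus-refund split can be sketched as a function that picks ledger postings from what the gateway reports it did. The OutcomeKind values, the Entry shape, and the account names are illustrative assumptions; real gateways use their own response fields, and your chart of accounts will differ.

```go
package main

import "fmt"

// OutcomeKind is what the gateway actually did, read from the response
// body. The values are illustrative, not any gateway's real field.
type OutcomeKind string

const (
	OutcomeVoid   OutcomeKind = "void"
	OutcomeRefund OutcomeKind = "refund"
)

// Entry is one line of a double-entry posting, in minor units.
type Entry struct {
	Account     string
	DebitCents  int64
	CreditCents int64
}

// EntriesFor returns the postings for a gateway outcome.
// original holds the entries that were posted for the charge.
func EntriesFor(kind OutcomeKind, original []Entry, amountCents int64) ([]Entry, error) {
	switch kind {
	case OutcomeVoid:
		// Reverse the pending entries: swap debit and credit on each line.
		out := make([]Entry, 0, len(original))
		for _, e := range original {
			out = append(out, Entry{
				Account:     e.Account,
				DebitCents:  e.CreditCents,
				CreditCents: e.DebitCents,
			})
		}
		return out, nil
	case OutcomeRefund:
		// A new debit/credit pair; the charge's own entries stay untouched.
		return []Entry{
			{Account: "merchant_settlement", DebitCents: amountCents},
			{Account: "customer_receivable", CreditCents: amountCents},
		}, nil
	default:
		return nil, fmt.Errorf("unknown gateway outcome %q", kind)
	}
}
```

Branching on the reported outcome, instead of on which API call you made, is what lets one refund request end up as either a void or a refund without corrupting the ledger.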

Timing Edge Cases

Timing is where refund systems go from "works in staging" to "on-call nightmare." Here are the three scenarios that have burned me:

Refund During the Settlement Window

You submit a refund at 11:58 PM. The acquirer's settlement batch cuts at midnight. Did your refund make it into tonight's batch, or will it appear in tomorrow's? You genuinely don't know, and the gateway might not tell you for 24 hours. Your reconciliation pipeline needs to handle the refund appearing in either batch without double-counting it.
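A reconciliation step that tolerates either batch can key on the refund ID and count each refund once. This is a sketch under the assumption that settlement lines carry your refund reference; the SettlementLine fields and the Reconciler type are hypothetical.

```go
package main

// SettlementLine is one refund line from an acquirer settlement file.
// Field names are assumptions for illustration.
type SettlementLine struct {
	RefundID    string
	BatchDate   string // e.g. "2024-04-01"
	AmountCents int64
}

// Reconciler matches settlement lines to refunds exactly once,
// regardless of which batch the line shows up in.
type Reconciler struct {
	matched map[string]string // refund ID -> batch date where first seen
}

// NewReconciler returns an empty reconciler.
func NewReconciler() *Reconciler {
	return &Reconciler{matched: make(map[string]string)}
}

// Apply records the line and reports whether it was counted.
// A refund ID already matched in any batch is skipped.
func (r *Reconciler) Apply(line SettlementLine) bool {
	if _, seen := r.matched[line.RefundID]; seen {
		return false
	}
	r.matched[line.RefundID] = line.BatchDate
	return true
}

// BatchFor returns the batch a refund settled in, if it is known yet.
func (r *Reconciler) BatchFor(refundID string) (string, bool) {
	b, ok := r.matched[refundID]
	return b, ok
}
```

In a real pipeline the matched map would be a unique index on the refund reference, so a line that reappears in a later batch fails the insert instead of being counted twice.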

Refund After a Chargeback

A customer files a chargeback on Monday. On Tuesday, before you've processed the chargeback notification, a support agent issues a refund for the same transaction. Now the customer gets their money back twice — once from the refund and once from the chargeback. Your system needs to check for open disputes before allowing a refund, and it needs to handle the race condition where the chargeback webhook arrives between your check and your refund submission.
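The dispute check and its race can be sketched as a guard that checks before submitting and again afterwards. The function parameters below are hypothetical stand-ins for a dispute lookup and a gateway call. The second check does not prevent the race; it only surfaces it for review.

```go
package main

import "errors"

// ErrOpenDispute is returned when a refund is blocked by a known dispute.
var ErrOpenDispute = errors.New("charge has an open dispute")

// RefundResult tells the caller whether follow-up is needed.
type RefundResult struct {
	Submitted   bool
	NeedsReview bool // a dispute appeared while the refund was in flight
}

// GuardedRefund refuses to refund a disputed charge and re-checks after
// submission. hasOpenDispute and submit are hypothetical dependencies.
func GuardedRefund(
	hasOpenDispute func(chargeID string) bool,
	submit func(chargeID string, amountCents int64) error,
	chargeID string,
	amountCents int64,
) (RefundResult, error) {
	if hasOpenDispute(chargeID) {
		return RefundResult{}, ErrOpenDispute
	}
	if err := submit(chargeID, amountCents); err != nil {
		return RefundResult{}, err
	}
	// A chargeback webhook may have landed between the check and the
	// submission. Flag it instead of assuming it cannot happen.
	return RefundResult{
		Submitted:   true,
		NeedsReview: hasOpenDispute(chargeID),
	}, nil
}
```

What happens to a flagged refund (contest the chargeback, alert finance, or both) is a business decision this sketch leaves open.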

Cross-Day Refunds and Reporting

The charge was on March 31. The refund is on April 2. These land in different reporting periods. If your finance team closes the books monthly, that March revenue number just changed retroactively. Your reporting system needs to decide: does the refund reduce March revenue or April revenue? There's no universally correct answer — it depends on your accounting policy — but your code needs to support whichever choice the business makes.
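Supporting either accounting choice can be as small as a policy value passed into the reporting code. In this sketch the policy names, the RefundEvent type, and the decision to bucket by UTC calendar month are all assumptions; your finance team may close books on a different calendar or time zone.

```go
package main

import "time"

// RefundPeriodPolicy selects which reporting period a refund lands in.
type RefundPeriodPolicy int

const (
	// AttributeToChargePeriod restates the period of the original charge.
	AttributeToChargePeriod RefundPeriodPolicy = iota
	// AttributeToRefundPeriod books the refund in the period it occurred.
	AttributeToRefundPeriod
)

// RefundEvent carries the two timestamps the policy chooses between.
type RefundEvent struct {
	ChargedAt  time.Time
	RefundedAt time.Time
}

// ReportingPeriod returns the "YYYY-MM" period the refund is booked in.
// Assumption: periods are UTC calendar months.
func ReportingPeriod(p RefundPeriodPolicy, ev RefundEvent) string {
	t := ev.RefundedAt
	if p == AttributeToChargePeriod {
		t = ev.ChargedAt
	}
	return t.UTC().Format("2006-01")
}
```

For the March 31 charge and April 2 refund above, the first policy yields the March period and the second yields April.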

The Gateway Abstraction Problem

Every gateway handles refunds slightly differently, and the differences are in the details that matter. Stripe lets you issue multiple partial refunds up to the original amount with a simple API call. Adyen requires you to reference the original pspReference and handles partial refunds through modification requests. Direct acquirer integrations often require you to submit refunds in batch files and poll for results hours later.

The temptation is to build a unified RefundProvider interface and pretend the differences don't exist. That works for the happy path. It falls apart when you need to handle gateway-specific error codes, retry semantics, and the fact that some gateways are synchronous (you know immediately if the refund succeeded) while others are asynchronous (you submit and wait for a webhook).

The approach I've found workable: a thin interface for the common operations, with gateway-specific adapters that expose the full capability set. The refund orchestrator calls the interface for standard refunds but can reach through to the adapter when it needs gateway-specific behavior like checking void eligibility or querying refund status.
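In Go, the thin-interface-plus-adapter approach maps naturally onto a small required interface and an optional one discovered by type assertion. Everything named here (RefundProvider's method set, Voider, the two stand-in adapters) is a hypothetical sketch, not any gateway SDK's API.

```go
package main

// RefundProvider is the thin interface every adapter implements.
type RefundProvider interface {
	Refund(chargeID string, amountCents int64) (refundID string, err error)
}

// Voider is an optional capability; only some adapters implement it.
type Voider interface {
	CanVoid(chargeID string) bool
	Void(chargeID string) error
}

// BasicAdapter stands in for a gateway that only supports refunds.
type BasicAdapter struct{}

func (BasicAdapter) Refund(chargeID string, amountCents int64) (string, error) {
	return "rf_" + chargeID, nil
}

// VoidCapableAdapter stands in for a gateway that can also void
// charges that have not settled yet.
type VoidCapableAdapter struct {
	BasicAdapter
	Unsettled map[string]bool // charge IDs still awaiting settlement
}

func (a VoidCapableAdapter) CanVoid(chargeID string) bool { return a.Unsettled[chargeID] }
func (a VoidCapableAdapter) Void(chargeID string) error   { return nil }

// RefundOrVoid voids when the adapter supports it, the charge is still
// voidable, and the full amount is being returned. Otherwise it refunds.
// It reports which action was taken.
func RefundOrVoid(p RefundProvider, chargeID string, amountCents int64, fullAmount bool) (string, error) {
	if v, ok := p.(Voider); ok && fullAmount && v.CanVoid(chargeID) {
		return "void", v.Void(chargeID)
	}
	if _, err := p.Refund(chargeID, amountCents); err != nil {
		return "", err
	}
	return "refund", nil
}
```

The orchestrator depends only on RefundProvider, and gateway-specific behavior stays opt-in: an adapter that never implements Voider simply always takes the refund path.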

Production Lessons

After running refund systems in production across card payments, bank transfers, and e-wallet providers, these are the gotchas that don't show up in documentation:

Currency rounding will bite you. A $10.00 charge split into three partial refunds of $3.33 each leaves $0.01 unrefunded. Decide upfront whether the last refund gets rounded up to $3.34 or whether you allow a 1-cent residual. Document the policy. Your support team will thank you.
Refund reference IDs must be globally unique and idempotent. If a network timeout causes a retry, the gateway needs to recognize it as a duplicate. Generate a deterministic refund ID from the charge ID + refund sequence number, not a random UUID. Some gateways deduplicate on your reference; others don't — know which is which.
Webhook ordering is not guaranteed. You might receive the refund.settled webhook before refund.created. Your handler needs to be order-independent. I use an upsert pattern: if the refund row doesn't exist when a settlement webhook arrives, create it in the settled state and backfill the details.
Idempotency keys expire. Stripe can prune idempotency keys once they are at least 24 hours old. If your retry logic spans longer than that (say, a weekend outage), the same key may create a second refund. Track gateway-side refund IDs independently of your idempotency mechanism.
Test with real settlement cycles. Sandbox environments usually settle refunds instantly. Production doesn't. The bugs that matter — race conditions with settlement batches, reconciliation mismatches, reporting period boundaries — only surface when there's a real 48-72 hour delay between submission and settlement.
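Two of the lessons above reduce to a few lines of arithmetic and string formatting. This sketch assumes amounts in non-negative minor units, picks the "last refund absorbs the remainder" rounding policy, and uses a made-up reference format; none of these choices is the only valid one.

```go
package main

import "fmt"

// SplitCents divides total into n parts in minor units and gives the
// remainder to the last part, so the parts always sum to total.
// Assumption: total is non-negative.
func SplitCents(total int64, n int) []int64 {
	if n <= 0 {
		return nil
	}
	base := total / int64(n)
	parts := make([]int64, n)
	for i := range parts {
		parts[i] = base
	}
	parts[n-1] += total % int64(n)
	return parts
}

// RefundReference builds a deterministic reference from the charge ID
// and the refund's sequence number, so a retry reuses the same value.
// The "-r%03d" format is a hypothetical convention.
func RefundReference(chargeID string, seq int) string {
	return fmt.Sprintf("%s-r%03d", chargeID, seq)
}
```

With this policy, SplitCents(1000, 3) yields 333, 333, and 334 cents, and the second refund on charge "ch_123" always gets the reference "ch_123-r002" no matter how many times the request is retried.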

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Technical specifications are subject to change — always verify with official documentation.
