April 12, 2026 9 min read

Payment Refund Engineering — State Machines, Partial Refunds, and the Ledger Entries Nobody Talks About

Every payment team builds the charge flow first and treats refunds as a weekend project. Then production happens, and you discover that reversing money touches more systems, has more edge cases, and breaks in more creative ways than moving it forward ever did.

Why Refunds Are Harder Than Charges

Charges are optimistic. You send a request, the gateway says yes or no, you record the result. The happy path is linear. Refunds are the opposite — they're inherently adversarial to your own system. You're unwinding something that already settled, already hit the ledger, already got reported to finance, and possibly already triggered a commission payout to a sales partner.

I've built refund systems at two different payment companies now, and the pattern is always the same. The charge flow gets months of careful design. The refund flow gets a single endpoint that calls gateway.Refund(chargeID, amount) and hopes for the best. Then three months in, someone does a partial refund on a cross-currency transaction during the settlement window, and the whole thing falls apart.

The core problem is that refunds are not the inverse of charges. A charge creates one ledger entry. A refund might need to reverse that entry, create a new one, adjust fees, recalculate net settlement, and notify three downstream systems — all while handling the possibility that the original charge hasn't even settled yet.

72 hrs: typical refund settlement
5 states: minimum refund state machine
3-8%: typical refund rate by volume

The Refund State Machine

If your refund has two states, "pending" and "done", you're going to have a bad time. Refunds need at least five states to handle the real world: four on each refund (initiated, processing, settled, failed) plus partially_refunded on the parent charge. Each transition has rules about what can trigger it and what side effects it produces.

Initiated: request received
Processing: sent to gateway
Settled: funds returned
Failed: gateway rejected

The partially_refunded state lives on the original charge, not on the refund itself. A charge can have multiple child refunds, each with its own state lifecycle. This parent-child relationship is where most teams get tripped up: they model refund status on the charge record instead of giving each refund its own row.
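One way to keep that parent-child split honest is to never hand-edit a refund status on the charge and instead derive it from the child rows. This is a minimal illustration, not a production model: amounts are integer cents, the `ChargeRefundStatus` values and `deriveChargeStatus` helper are hypothetical names, and it assumes only settled refunds count toward the charge's status (your policy may also count in-flight refunds).

```go
package main

import "fmt"

// RefundState is the per-refund lifecycle state.
type RefundState string

const (
	RefundInitiated  RefundState = "initiated"
	RefundProcessing RefundState = "processing"
	RefundSettled    RefundState = "settled"
	RefundFailed     RefundState = "failed"
)

// Refund is one child row of a charge; amounts are in cents.
type Refund struct {
	ID     string
	Amount int64
	State  RefundState
}

// ChargeRefundStatus lives on the parent charge and is derived, never set by hand.
type ChargeRefundStatus string

const (
	NotRefunded       ChargeRefundStatus = "not_refunded"
	PartiallyRefunded ChargeRefundStatus = "partially_refunded"
	FullyRefunded     ChargeRefundStatus = "refunded"
)

// deriveChargeStatus sums settled child refunds and compares them to the charge amount.
// Assumption: initiated, processing, and failed refunds do not count.
func deriveChargeStatus(chargeAmount int64, refunds []Refund) ChargeRefundStatus {
	var settled int64
	for _, r := range refunds {
		if r.State == RefundSettled {
			settled += r.Amount
		}
	}
	switch {
	case settled == 0:
		return NotRefunded
	case settled < chargeAmount:
		return PartiallyRefunded
	default:
		return FullyRefunded
	}
}

func main() {
	refunds := []Refund{
		{ID: "re_1", Amount: 2000, State: RefundSettled},
		{ID: "re_2", Amount: 1500, State: RefundProcessing},
	}
	fmt.Println(deriveChargeStatus(5000, refunds)) // partially_refunded
}
```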

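// Go: Refund state machine with explicit transition validation
import (
    "fmt"
    "time"
)

type RefundState string

const (
    RefundInitiated  RefundState = "initiated"
    RefundProcessing RefundState = "processing"
    RefundSettled    RefundState = "settled"
    RefundFailed     RefundState = "failed"
)

// Refund shows only the fields the transition logic touches;
// ChargeID, Amount, Currency and friends are omitted here.
type Refund struct {
    ID            string
    State         RefundState
    PreviousState RefundState
    UpdatedAt     time.Time
}

// validTransitions maps each state to the states it may move to.
// RefundSettled has no entry because it is terminal.
var validTransitions = map[RefundState][]RefundState{
    RefundInitiated:  {RefundProcessing, RefundFailed},
    RefundProcessing: {RefundSettled, RefundFailed},
    RefundFailed:     {RefundInitiated}, // allow retry
}

// TransitionTo applies next only if it is a legal move from the current state.
func (r *Refund) TransitionTo(next RefundState) error {
    allowed, ok := validTransitions[r.State]
    if !ok {
        return fmt.Errorf("no transitions from state %s", r.State)
    }
    for _, s := range allowed {
        if s == next {
            r.PreviousState = r.State
            r.State = next
            r.UpdatedAt = time.Now().UTC()
            return nil
        }
    }
    return fmt.Errorf("invalid transition: %s -> %s", r.State, next)
}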

Design tip: Store every state transition in an audit log table, not just the current state. When finance asks "why did this refund take 4 days?" you need the full timeline: initiated at T0, sent to gateway at T0+2min, gateway timeout at T0+2min30s, retried at T0+1hr, settled at T0+96hr. Without the log, you're guessing.

Partial Refunds: The Accounting Nightmare

Full refunds are straightforward. Partial refunds are where the real complexity lives. You need to track the cumulative refunded amount against the original charge and enforce that it never exceeds the original. Sounds trivial until you consider concurrency.

Picture this: a customer service agent clicks "refund $20" at the exact moment an automated system triggers a "$15 partial refund" for a returned item on a $30 charge. Both requests read the current refunded total as $0, both validate that their amount fits under the $30 charge, and both succeed at the gateway. You've now refunded $35 against a $30 charge, and if both requests were reacting to the same return, the business only ever intended to refund $20.
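The race is easy to reproduce deterministically once you separate the read from the write. This sketch models the charge row in memory with a hypothetical $30 charge, so the two refunds ($20 and $15) together exceed it. `chargeRow`, `applyUnguarded`, and `applyVersioned` are illustrative names; the versioned path mimics a `WHERE version = ?` guard plus a rows-affected check rather than any real database.

```go
package main

import (
	"errors"
	"fmt"
)

// chargeRow is a hypothetical in-memory stand-in for one row of the charges table.
// Amounts are integer cents.
type chargeRow struct {
	Amount        int64
	RefundedTotal int64
	Version       int64
}

// snapshot is what a single request saw when it read the row.
type snapshot struct {
	RefundedTotal int64
	Version       int64
}

var (
	errExceeds  = errors.New("refund exceeds remaining amount")
	errConflict = errors.New("row changed since read; re-read and retry")
)

func read(c *chargeRow) snapshot { return snapshot{c.RefundedTotal, c.Version} }

// applyUnguarded validates against a possibly stale snapshot, then increments the total.
func applyUnguarded(c *chargeRow, seen snapshot, amountCents int64) error {
	if seen.RefundedTotal+amountCents > c.Amount {
		return errExceeds
	}
	c.RefundedTotal += amountCents
	return nil
}

// applyVersioned writes only if the row is unchanged since the read.
func applyVersioned(c *chargeRow, seen snapshot, amountCents int64) error {
	if c.Version != seen.Version {
		return errConflict
	}
	if seen.RefundedTotal+amountCents > c.Amount {
		return errExceeds
	}
	c.RefundedTotal += amountCents
	c.Version++
	return nil
}

func main() {
	// Both requests read before either writes.
	a := &chargeRow{Amount: 3000}
	agent, auto := read(a), read(a)
	err1 := applyUnguarded(a, agent, 2000)
	err2 := applyUnguarded(a, auto, 1500)
	fmt.Println(err1, err2, a.RefundedTotal) // <nil> <nil> 3500

	b := &chargeRow{Amount: 3000}
	agent, auto = read(b), read(b)
	err1 = applyVersioned(b, agent, 2000)
	err2 = applyVersioned(b, auto, 1500)
	fmt.Println(err1, err2) // <nil> row changed since read; re-read and retry

	err2 = applyVersioned(b, read(b), 1500) // the retry sees the $20 already refunded
	fmt.Println(err2, b.RefundedTotal)      // refund exceeds remaining amount 2000
}
```

The versioned path rejects the stale write, and the retry then fails the cap check on fresh data, which is exactly the outcome you want.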

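// Go: Atomic partial refund using a row lock (FOR UPDATE) plus a version check
import (
    "context"
    "database/sql"
    "fmt"

    "github.com/shopspring/decimal"
)

func (s *RefundService) CreatePartialRefund(ctx context.Context, chargeID string, amount decimal.Decimal) (*Refund, error) {
    if !amount.IsPositive() {
        return nil, fmt.Errorf("refund amount must be positive, got %s", amount)
    }

    tx, err := s.db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
    if err != nil {
        return nil, fmt.Errorf("begin tx: %w", err)
    }
    defer tx.Rollback() // no-op once Commit has succeeded

    // Lock the charge row so concurrent refunds on the same charge queue up here
    var charge Charge
    err = tx.QueryRowContext(ctx,
        `SELECT id, amount, currency, refunded_total, version
         FROM charges WHERE id = $1 FOR UPDATE`, chargeID,
    ).Scan(&charge.ID, &charge.Amount, &charge.Currency, &charge.RefundedTotal, &charge.Version)
    if err != nil {
        return nil, fmt.Errorf("lock charge: %w", err)
    }

    remaining := charge.Amount.Sub(charge.RefundedTotal)
    if amount.GreaterThan(remaining) {
        return nil, fmt.Errorf("refund %s exceeds remaining %s", amount, remaining)
    }

    refund := &Refund{
        ID:       generateRefundID(),
        ChargeID: chargeID,
        Amount:   amount,
        Currency: charge.Currency,
        State:    RefundInitiated,
    }

    // Insert the refund row (assumes a refunds table with these columns)...
    if _, err = tx.ExecContext(ctx,
        `INSERT INTO refunds (id, charge_id, amount, currency, state)
         VALUES ($1, $2, $3, $4, $5)`,
        refund.ID, refund.ChargeID, refund.Amount, refund.Currency, refund.State,
    ); err != nil {
        return nil, fmt.Errorf("insert refund: %w", err)
    }

    // ...and bump the charge's running total in the same transaction
    res, err := tx.ExecContext(ctx,
        `UPDATE charges SET refunded_total = refunded_total + $1, version = version + 1
         WHERE id = $2 AND version = $3`, amount, chargeID, charge.Version)
    if err != nil {
        return nil, fmt.Errorf("update charge: %w", err)
    }
    n, err := res.RowsAffected()
    if err != nil {
        return nil, fmt.Errorf("rows affected: %w", err)
    }
    if n != 1 {
        // The version moved under us: fail loudly instead of silently skipping the update
        return nil, fmt.Errorf("charge %s changed concurrently; retry", chargeID)
    }

    if err := tx.Commit(); err != nil {
        return nil, fmt.Errorf("commit: %w", err)
    }
    return refund, nil
}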

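The FOR UPDATE lock and version check are non-negotiable: the lock serializes concurrent refunds on the same charge, and the version check (together with the RowsAffected test) makes a stale read fail loudly instead of silently over-refunding. I've seen teams try to solve this with application-level mutexes, and it works until you have two API server instances. The database is the only reliable coordination point.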

Ledger Entries for Refunds

This is the part that trips up engineers who haven't worked in fintech before. A refund isn't just "subtract money from the merchant." It's a double-entry bookkeeping event, and the entries look different depending on whether the original charge has settled or not.

Refund Before Settlement vs. After Settlement

When a refund happens before the original charge has settled with the acquirer, you can often void the transaction entirely — no money actually moves. But once settlement has occurred, the funds have already been transferred, and the refund becomes a new money movement in the opposite direction.

| Aspect | Pre-Settlement (Void) | Post-Settlement (Refund) |
| --- | --- | --- |
| Money movement | None; authorization released | New transfer back to cardholder |
| Ledger entries | Reverse the pending entries | New debit/credit pair |
| Processing fees | Usually not charged | Original fee often not returned |
| Timeline | Instant to a few hours | 3-10 business days |
| Gateway API call | void(authorizationID) | refund(chargeID, amount) |
| Reconciliation impact | Transaction disappears from settlement | Appears as separate line item |

The ledger implications are significant. For a pre-settlement void, you reverse the original journal entries — debit and credit swap. For a post-settlement refund, you create entirely new entries: debit the merchant's settlement account, credit the customer's receivable. The original charge entries stay untouched because that money actually moved.
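The two ledger paths can be sketched as plain journal postings. This assumes integer cents and uses hypothetical account names (`customer_receivable`, `merchant_pending`, `merchant_settlement`) that loosely follow the accounts named above; real charts of accounts differ. What it illustrates is structural: a void mirrors existing entries, while a post-settlement refund appends a new balanced pair and leaves the originals alone.

```go
package main

import "fmt"

// Entry is one leg of a journal posting. Amounts are positive cents;
// exactly one of Debit or Credit is non-zero.
type Entry struct {
	Account string
	Debit   int64
	Credit  int64
}

// balanced reports whether total debits equal total credits.
func balanced(entries []Entry) bool {
	var debits, credits int64
	for _, e := range entries {
		debits += e.Debit
		credits += e.Credit
	}
	return debits == credits
}

// voidEntries reverses pending entries by swapping debit and credit on each leg.
// It returns new entries; the originals are not modified.
func voidEntries(pending []Entry) []Entry {
	out := make([]Entry, len(pending))
	for i, e := range pending {
		out[i] = Entry{Account: e.Account, Debit: e.Credit, Credit: e.Debit}
	}
	return out
}

// refundEntries is a new, independent posting for a post-settlement refund.
// The original charge entries stay as they are because that money really moved.
func refundEntries(amountCents int64) []Entry {
	return []Entry{
		{Account: "merchant_settlement", Debit: amountCents},
		{Account: "customer_receivable", Credit: amountCents},
	}
}

func main() {
	pending := []Entry{
		{Account: "customer_receivable", Debit: 5000},
		{Account: "merchant_pending", Credit: 5000},
	}
	void := voidEntries(pending)
	fmt.Println(void, balanced(void))

	refund := refundEntries(2000)
	fmt.Println(refund, balanced(refund))
}
```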

Watch out: Some gateways silently convert a refund request into a void if the charge hasn't settled yet. This is usually fine operationally, but it means your ledger logic needs to handle both outcomes from a single API call. Always check the gateway response type, not just the HTTP status code.
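A sketch of branching on what the gateway actually did rather than on HTTP success. `GatewayResult`, its `Kind` field, and the `LedgerAction` values are hypothetical normalized types; each real gateway reports void versus refund differently, so an adapter would have to map its payload into something like this. Unknown kinds are held for review instead of guessed.

```go
package main

import (
	"errors"
	"fmt"
)

// ResultKind is what the gateway actually did, which may differ from what was requested.
type ResultKind string

const (
	KindRefund ResultKind = "refund"
	KindVoid   ResultKind = "void"
)

// GatewayResult is a hypothetical normalized response produced by an adapter.
type GatewayResult struct {
	HTTPStatus int
	Kind       ResultKind
	Reference  string
}

// LedgerAction names which posting path the ledger should take.
type LedgerAction string

const (
	ReversePendingEntries LedgerAction = "reverse_pending_entries"
	PostRefundEntries     LedgerAction = "post_refund_entries"
)

var errUnknownKind = errors.New("unrecognized gateway result kind; hold for manual review")

// ledgerActionFor checks HTTP success first, then branches on the result kind.
func ledgerActionFor(res GatewayResult) (LedgerAction, error) {
	if res.HTTPStatus < 200 || res.HTTPStatus > 299 {
		return "", fmt.Errorf("gateway returned HTTP %d", res.HTTPStatus)
	}
	switch res.Kind {
	case KindVoid:
		return ReversePendingEntries, nil
	case KindRefund:
		return PostRefundEntries, nil
	default:
		return "", errUnknownKind
	}
}

func main() {
	// A refund request that the gateway silently turned into a void.
	action, err := ledgerActionFor(GatewayResult{HTTPStatus: 200, Kind: KindVoid, Reference: "gw_123"})
	fmt.Println(action, err) // reverse_pending_entries <nil>
}
```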

Timing Edge Cases

Timing is where refund systems go from "works in staging" to "on-call nightmare." Here are the three scenarios that have burned me:

Refund During the Settlement Window

You submit a refund at 11:58 PM. The acquirer's settlement batch cuts at midnight. Did your refund make it into tonight's batch, or will it appear in tomorrow's? You genuinely don't know, and the gateway might not tell you for 24 hours. Your reconciliation pipeline needs to handle the refund appearing in either batch without double-counting it.
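One way to make reconciliation indifferent to which batch a refund lands in is to match settlement lines against submitted refunds by your own reference, never by expected batch date. A minimal in-memory sketch, assuming each settlement line carries the reference you sent (not every acquirer guarantees that) and using hypothetical `Reconciler` and `BatchLine` types:

```go
package main

import "fmt"

// BatchLine is one refund line from an acquirer settlement file (hypothetical shape).
type BatchLine struct {
	RefundRef   string // the reference attached when the refund was submitted
	AmountCents int64
}

// Reconciler matches settlement lines to submitted refunds by reference,
// without assuming which nightly batch a refund will land in.
type Reconciler struct {
	pending map[string]int64  // submitted but not yet seen in any batch
	settled map[string]string // refund ref -> batch date it actually appeared in
}

func NewReconciler() *Reconciler {
	return &Reconciler{pending: map[string]int64{}, settled: map[string]string{}}
}

// Submit records a refund that was sent to the gateway.
func (r *Reconciler) Submit(ref string, amountCents int64) { r.pending[ref] = amountCents }

// Ingest consumes one batch. Each pending refund is matched at most once, so a refund
// that lands in tomorrow's batch instead of tonight's is still counted exactly once.
// It returns refs needing review: unknown, amount mismatch, or already settled.
func (r *Reconciler) Ingest(batchDate string, lines []BatchLine) []string {
	var exceptions []string
	for _, l := range lines {
		if amt, ok := r.pending[l.RefundRef]; ok && amt == l.AmountCents {
			delete(r.pending, l.RefundRef)
			r.settled[l.RefundRef] = batchDate
			continue
		}
		exceptions = append(exceptions, l.RefundRef)
	}
	return exceptions
}

func main() {
	rec := NewReconciler()
	rec.Submit("re_2358", 2000) // submitted at 11:58 PM

	fmt.Println(rec.Ingest("2026-04-01", nil)) // missed tonight's cutoff
	fmt.Println(rec.Ingest("2026-04-02", []BatchLine{{RefundRef: "re_2358", AmountCents: 2000}}))
	fmt.Println(rec.settled["re_2358"], len(rec.pending))
}
```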

Refund After a Chargeback

A customer files a chargeback on Monday. On Tuesday, before you've processed the chargeback notification, a support agent issues a refund for the same transaction. Now the customer gets their money back twice — once from the refund and once from the chargeback. Your system needs to check for open disputes before allowing a refund, and it needs to handle the race condition where the chargeback webhook arrives between your check and your refund submission.
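The race goes away if the dispute check and the refund reservation happen under the same lock, and the chargeback path inspects reservations under that lock too: whichever arrives second sees the first. In this sketch a `sync.Mutex` stands in for the charge row lock, since, as the partial refund section notes, an in-process mutex alone won't coordinate multiple API instances. `chargeGuard`, `BeginRefund`, and `RecordChargeback` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errOpenDispute = errors.New("charge has an open dispute; refund blocked")

// chargeGuard holds per-charge dispute and refund state. The mutex is an
// in-process stand-in for locking the charge row (SELECT ... FOR UPDATE).
type chargeGuard struct {
	mu            sync.Mutex
	openDispute   bool
	reservedCents int64 // refunds reserved or submitted so far
}

// BeginRefund checks for a dispute and reserves the refund under the same lock,
// so a chargeback cannot be recorded between the check and the reservation.
// The gateway call happens only after this returns nil.
func (g *chargeGuard) BeginRefund(amountCents int64) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.openDispute {
		return errOpenDispute
	}
	g.reservedCents += amountCents
	return nil
}

// RecordChargeback is called from the dispute webhook handler. It blocks further
// refunds and reports how much was already reserved, so the dispute workflow
// can flag the overlap for review.
func (g *chargeGuard) RecordChargeback() (alreadyRefundedCents int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.openDispute = true
	return g.reservedCents
}

func main() {
	g := &chargeGuard{}
	_ = g.BeginRefund(2000)         // support agent refunds first
	overlap := g.RecordChargeback() // the chargeback webhook is processed afterwards
	err := g.BeginRefund(500)       // any further refund is now blocked
	fmt.Println(overlap, err)
}
```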

Cross-Day Refunds and Reporting

The charge was on March 31. The refund is on April 2. These land in different reporting periods. If your finance team closes the books monthly, that March revenue number just changed retroactively. Your reporting system needs to decide: does the refund reduce March revenue or April revenue? There's no universally correct answer — it depends on your accounting policy — but your code needs to support whichever choice the business makes.

The Gateway Abstraction Problem

Every gateway handles refunds slightly differently, and the differences are in the details that matter. Stripe lets you issue multiple partial refunds up to the original amount with a simple API call. Adyen requires you to reference the original pspReference and handles partial refunds through modification requests. Direct acquirer integrations often require you to submit refunds in batch files and poll for results hours later.

The temptation is to build a unified RefundProvider interface and pretend the differences don't exist. That works for the happy path. It falls apart when you need to handle gateway-specific error codes, retry semantics, and the fact that some gateways are synchronous (you know immediately if the refund succeeded) while others are asynchronous (you submit and wait for a webhook).

The approach I've found workable: a thin interface for the common operations, with gateway-specific adapters that expose the full capability set. The refund orchestrator calls the interface for standard refunds but can reach through to the adapter when it needs gateway-specific behavior like checking void eligibility or querying refund status.

Production Lessons

After running refund systems in production across card payments, bank transfers, and e-wallet providers, these are the gotchas that don't show up in documentation:


Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Technical specifications are subject to change — always verify with official documentation.