April 13, 2026

Engineering Payment State Machines That Don’t Lose Money

Every payment system has a lifecycle — created, authorized, captured, settled. The question is whether you model that lifecycle explicitly or let it emerge from scattered if statements and a mutable status column. I’ve seen both approaches in production, and the implicit one always ends the same way: money in limbo, angry merchants, and an on-call engineer staring at a database row that makes no sense.

Why Naive Status Columns Fail

The most common pattern I see in early-stage payment systems is a single status column on the payments table. It starts as a VARCHAR — maybe an enum if someone was feeling disciplined — and the application code updates it with plain UPDATE statements as the payment progresses. It works fine for the first few months.

Then reality hits. A webhook from Stripe arrives twice or out of order, and your payment jumps from authorized to settled, skipping captured entirely. A race condition between your capture job and a customer-initiated refund leaves a payment marked as refunded that was never actually captured. Your finance team exports the ledger and finds transactions that don’t add up because a payment was marked failed after money had already moved.

The root cause is always the same: there’s no enforcement of which transitions are valid. The status column is just a string, and any code path can set it to any value at any time. You’re relying on every developer, in every service, to know the rules — and to never make a mistake under pressure at 2 AM.

I’ve debugged enough of these incidents to know the pattern by heart. The fix isn’t better code review or more documentation. It’s making invalid transitions impossible to apply, by routing every state change through a single check that rejects anything not explicitly allowed.

Designing a Proper State Machine

A finite state machine for payments has three components: a set of states, a set of transitions between those states, and guards that determine whether a transition is allowed. The key insight is that the set of valid transitions is small relative to all possible state changes. A payment in created can move to pending or failed — but never directly to settled or refunded. Making that constraint explicit is what separates a robust payment system from a fragile one.

Here are the states I’ve found cover most card payment flows:

Figure: Payment State Lifecycle. States shown: created, pending, authorized, captured, settled, failed, refunded, disputed. Dashed lines mark failure paths from any early state. Each arrow represents an explicitly defined, valid transition; anything else is rejected.

Enforcing Valid Transitions in Code

The state machine needs to live in code, not in documentation. I define the transition table as a map — if a transition isn’t in the map, it doesn’t exist. Here’s the pattern I use in Go:

type PaymentState string

const (
    StateCreated    PaymentState = "created"
    StatePending    PaymentState = "pending"
    StateAuthorized PaymentState = "authorized"
    StateCaptured   PaymentState = "captured"
    StateSettled    PaymentState = "settled"
    StateFailed     PaymentState = "failed"
    StateRefunded   PaymentState = "refunded"
    StateDisputed   PaymentState = "disputed"
)

var validTransitions = map[PaymentState][]PaymentState{
    StateCreated:    {StatePending, StateFailed},
    StatePending:    {StateAuthorized, StateFailed},
    StateAuthorized: {StateCaptured, StateFailed},
    StateCaptured:   {StateSettled, StateRefunded},
    StateSettled:    {StateRefunded, StateDisputed},
    StateRefunded:   {},
    StateFailed:     {},
    StateDisputed:   {},
}

func (p *Payment) TransitionTo(next PaymentState) error {
    allowed, ok := validTransitions[p.State]
    if !ok {
        return fmt.Errorf("unknown state: %s", p.State)
    }
    for _, s := range allowed {
        if s == next {
            p.PreviousState = p.State
            p.State = next
            p.UpdatedAt = time.Now().UTC()
            return nil
        }
    }
    return fmt.Errorf(
        "invalid transition: %s -> %s (payment %s)",
        p.State, next, p.ID,
    )
}

This is deliberately simple. The validTransitions map is the single source of truth for what’s allowed. When a new developer joins the team, they don’t need to read a wiki page about payment flows — they read the map. When a product manager asks “can a settled payment be refunded?” you check the map. When a bug report says a payment jumped from created to settled, you know immediately that something bypassed the state machine, because that transition isn’t in the map.

The error message includes the payment ID and both states. This matters more than you’d think — when you’re tailing logs during an incident, you want to grep for the payment ID and immediately see what went wrong.
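The map covers states and transitions, but not the third component listed earlier: guards. Here is a minimal sketch of how guards could be layered on top of the map. The Guard type, the CheckGuard helper, and the CapturedCents field are all hypothetical names introduced for illustration; they are not part of the code above.

```go
package main

import (
	"errors"
	"fmt"
)

type PaymentState string

const (
	StateCaptured PaymentState = "captured"
	StateSettled  PaymentState = "settled"
	StateRefunded PaymentState = "refunded"
)

// Payment carries only the fields this sketch needs. CapturedCents is an
// assumption made for the example, not a field from the article.
type Payment struct {
	ID            string
	State         PaymentState
	CapturedCents int64
}

// transition identifies one edge in the state graph.
type transition struct{ from, to PaymentState }

// Guard inspects the payment and returns an error to veto a transition.
type Guard func(p *Payment) error

var ErrNothingCaptured = errors.New("nothing captured to refund")

// guards holds extra checks for edges the transition map already allows.
var guards = map[transition]Guard{
	{StateCaptured, StateRefunded}: func(p *Payment) error {
		if p.CapturedCents <= 0 {
			return ErrNothingCaptured
		}
		return nil
	},
}

// CheckGuard runs the guard registered for p.State -> next, if any.
func CheckGuard(p *Payment, next PaymentState) error {
	g, ok := guards[transition{p.State, next}]
	if !ok {
		return nil // no guard registered: the transition map alone decides
	}
	if err := g(p); err != nil {
		return fmt.Errorf("guard rejected %s -> %s (payment %s): %w",
			p.State, next, p.ID, err)
	}
	return nil
}

func main() {
	p := &Payment{ID: "pay_1", State: StateCaptured}
	fmt.Println(CheckGuard(p, StateRefunded)) // rejected: nothing captured yet
	p.CapturedCents = 1500
	fmt.Println(CheckGuard(p, StateRefunded)) // passes
}
```

One way to wire this in would be to call CheckGuard inside TransitionTo, after the map lookup succeeds and before the state is mutated, so the map stays the source of truth for which edges exist and guards only narrow them.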

Handling Concurrent State Transitions

In any real payment system, multiple processes are trying to mutate the same payment simultaneously. Your API server receives a capture request while a webhook from the processor arrives with an authorization update. A timeout job tries to fail a payment at the exact moment the gateway responds with success. If you’re not careful, the last writer wins — and the last writer might be wrong.

The pattern I rely on is optimistic locking with a version column. Every time you transition a payment, you increment the version and include the expected version in your UPDATE’s WHERE clause:

func (r *PaymentRepo) TransitionState(
    ctx context.Context,
    paymentID string,
    from, to PaymentState,
    expectedVersion int,
) error {
    result, err := r.db.ExecContext(ctx, `
        UPDATE payments
        SET state = $1,
            previous_state = $2,
            version = version + 1,
            updated_at = now()
        WHERE id = $3
          AND state = $4
          AND version = $5`,
        to, from, paymentID, from, expectedVersion,
    )
    if err != nil {
        return fmt.Errorf("transition %s -> %s: %w", from, to, err)
    }
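    rows, err := result.RowsAffected()
    if err != nil {
        return fmt.Errorf("transition %s -> %s: rows affected: %w", from, to, err)
    }
    if rows == 0 {
        // No row matched: the state or version moved since the caller read it.
        return ErrStaleState
    }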
    return nil
}

If rows == 0, either the state changed since you last read it, the version was bumped by another process, or the payment ID doesn’t exist. In every case, you reload the payment and re-evaluate. Because no row lock is held between the read and the write, this is cheaper than pessimistic locking and sidesteps the lock-ordering deadlocks that come with holding row locks across several statements. The trade-off is more retries under high contention, but for payments, contention on a single row is rare: most of the time, only one process is acting on a given payment at any moment.
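The reload-and-re-evaluate step can be sketched as a small bounded loop. This is an illustration under assumptions, not the article’s production code: the Store interface, CaptureWithRetry, and the in-memory memStore are hypothetical, and context handling is omitted to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

type PaymentState string

const (
	StateAuthorized PaymentState = "authorized"
	StateCaptured   PaymentState = "captured"
	StateFailed     PaymentState = "failed"
)

var ErrStaleState = errors.New("stale state")

type Payment struct {
	ID      string
	State   PaymentState
	Version int
}

// Store is a hypothetical slice of the repository: a read plus the
// version-checked transition.
type Store interface {
	Get(id string) (Payment, error)
	TransitionState(id string, from, to PaymentState, expectedVersion int) error
}

// CaptureWithRetry reloads and re-evaluates after each stale write and gives
// up after maxAttempts.
func CaptureWithRetry(s Store, id string, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		p, err := s.Get(id)
		if err != nil {
			return err
		}
		if p.State == StateCaptured {
			return nil // another process already captured: nothing to do
		}
		if p.State != StateAuthorized {
			return fmt.Errorf("payment %s is %s, cannot capture", id, p.State)
		}
		err = s.TransitionState(id, p.State, StateCaptured, p.Version)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrStaleState) {
			return err
		}
		// Stale write: loop around, reload, and decide again on fresh data.
	}
	return fmt.Errorf("payment %s: still contended after %d attempts", id, maxAttempts)
}

// memStore is an in-memory stand-in for the payments table.
type memStore struct{ rows map[string]Payment }

func (m *memStore) Get(id string) (Payment, error) {
	p, ok := m.rows[id]
	if !ok {
		return Payment{}, fmt.Errorf("payment %s not found", id)
	}
	return p, nil
}

func (m *memStore) TransitionState(id string, from, to PaymentState, v int) error {
	p, ok := m.rows[id]
	if !ok || p.State != from || p.Version != v {
		return ErrStaleState // mirrors rows == 0 in the SQL version
	}
	p.State, p.Version = to, p.Version+1
	m.rows[id] = p
	return nil
}

func main() {
	s := &memStore{rows: map[string]Payment{
		"pay_1": {ID: "pay_1", State: StateAuthorized, Version: 3},
	}}
	fmt.Println(CaptureWithRetry(s, "pay_1", 3))
	fmt.Println(s.rows["pay_1"].State, s.rows["pay_1"].Version)
}
```

The important part is that the decision is made again after every reload. If the payment has moved to a state where capture no longer makes sense, the loop stops instead of blindly replaying the original write.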

The “Stuck in Pending” Problem

Every payment engineer has dealt with this: a payment moves to pending when you send it to the processor, but the response never comes back. The webhook is lost, the HTTP connection times out, the processor has an outage. Your payment sits in pending forever, and the customer is staring at a spinning loader.

You need a timeout strategy, and it needs to be part of the state machine — not a separate cron job that someone forgets to monitor. Here’s what I’ve found works:

  1. Record the deadline. When a payment enters pending, write a pending_deadline timestamp. I typically set this to 5 minutes for card payments, 30 minutes for bank transfers.
  2. Run a sweeper. A background goroutine queries for payments where state = 'pending' AND pending_deadline < now(). For each one, it attempts a status check against the processor’s API.
  3. Resolve or fail. If the processor says the payment succeeded, transition to authorized. If it says it failed or doesn’t recognize the payment, transition to failed. If the processor is unreachable, extend the deadline and try again, but cap the retries.

func (s *Sweeper) HandleStuckPayments(ctx context.Context) error {
    payments, err := s.repo.FindStuckPending(ctx, time.Now())
    if err != nil {
        return fmt.Errorf("find stuck pending payments: %w", err)
    }
    for _, p := range payments {
        status, err := s.gateway.CheckStatus(ctx, p.GatewayRef)
        if err != nil {
            // Processor unreachable: push the deadline out and retry on a
            // later sweep. The retry cap is not shown here.
            s.repo.ExtendDeadline(ctx, p.ID, 2*time.Minute)
            continue
        }
        // Anything other than an approval is treated as a failure.
        next := StateFailed
        if status == gateway.StatusApproved {
            next = StateAuthorized
        }
        err = s.repo.TransitionState(ctx, p.ID, StatePending, next, p.Version)
        if errors.Is(err, ErrStaleState) {
            continue // a webhook or another worker already moved this payment
        }
        if err != nil {
            log.Printf("sweeper: payment %s -> %s: %v", p.ID, next, err)
        }
    }
    return nil
}

The critical detail: the sweeper uses the same TransitionState function with optimistic locking. If a webhook arrives and transitions the payment while the sweeper is running, the sweeper’s update will harmlessly fail with ErrStaleState. No race condition, no double-processing.
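The sweeper extends the deadline by a fixed two minutes and leaves the retry cap from step 3 unstated. One possible shape for that cap is sketched below. It swaps the fixed extension for a doubling backoff, which is my substitution rather than the article’s approach, and the constants and the NextExtension name are hypothetical policy choices. It also assumes the payment row tracks how many status checks have failed.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical policy values; the article only says retries should be capped.
const (
	maxStatusChecks = 5
	baseExtension   = 2 * time.Minute
	maxExtension    = 15 * time.Minute
)

// NextExtension returns how far to push pending_deadline after the given
// number of failed status checks, or ok=false once the cap is reached.
func NextExtension(failedChecks int) (ext time.Duration, ok bool) {
	if failedChecks >= maxStatusChecks {
		return 0, false
	}
	if failedChecks < 1 {
		failedChecks = 1
	}
	ext = baseExtension << (failedChecks - 1) // 2m, 4m, 8m, ...
	if ext > maxExtension {
		ext = maxExtension
	}
	return ext, true
}

func main() {
	for n := 1; n <= maxStatusChecks; n++ {
		if ext, ok := NextExtension(n); ok {
			fmt.Printf("failed checks=%d: extend by %s\n", n, ext)
		} else {
			fmt.Printf("failed checks=%d: cap reached, escalate\n", n)
		}
	}
}
```

What happens once the cap is reached is a policy decision. When the processor is unreachable, the outcome is unknown rather than failed, so alerting or routing to manual review is likely a safer default than transitioning straight to failed.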

| Aspect | Naive Status Column | Proper State Machine |
| --- | --- | --- |
| Invalid transitions | Silently accepted | Rejected with error |
| Concurrency | Last write wins | Optimistic locking |
| Audit trail | Only current state | Full transition history |
| Debugging | Guess from logs | Query the history table |
| Stuck payments | Manual intervention | Automated sweeper |
| New developer onboarding | Read the wiki (if it exists) | Read the transition map |
| Ledger consistency | Hope for the best | Enforced by constraints |

Database Schema for State History

The state machine enforces correctness going forward, but you also need to know what happened in the past. A separate payment_state_history table gives you a complete audit trail without cluttering the main payments table. Every time a transition succeeds, you append a row:

CREATE TABLE payment_state_history (
    id            BIGSERIAL PRIMARY KEY,
    payment_id    UUID NOT NULL REFERENCES payments(id),
    from_state    VARCHAR(20) NOT NULL,
    to_state      VARCHAR(20) NOT NULL,
    triggered_by  VARCHAR(100) NOT NULL,
    metadata      JSONB DEFAULT '{}',
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_state_history_payment
    ON payment_state_history (payment_id, created_at ASC);

The triggered_by column records what caused the transition — api:capture, webhook:stripe, sweeper:timeout, admin:manual. When you’re investigating an incident, this is the first thing you look at. The metadata JSONB column captures context that varies by transition: gateway response codes, error messages, the operator’s user ID for manual overrides.
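Because the history is an append-only list of edges, it can also be replayed against the transition map to spot a payment that bypassed the state machine. The sketch below is one way that check might look. HistoryRow and AuditHistory are hypothetical names, and the rows are assumed to be loaded oldest first, which matches the (payment_id, created_at ASC) index.

```go
package main

import "fmt"

type PaymentState string

const (
	StateCreated    PaymentState = "created"
	StatePending    PaymentState = "pending"
	StateAuthorized PaymentState = "authorized"
	StateCaptured   PaymentState = "captured"
	StateSettled    PaymentState = "settled"
	StateFailed     PaymentState = "failed"
	StateRefunded   PaymentState = "refunded"
	StateDisputed   PaymentState = "disputed"
)

// Same table as the article's validTransitions map.
var validTransitions = map[PaymentState][]PaymentState{
	StateCreated:    {StatePending, StateFailed},
	StatePending:    {StateAuthorized, StateFailed},
	StateAuthorized: {StateCaptured, StateFailed},
	StateCaptured:   {StateSettled, StateRefunded},
	StateSettled:    {StateRefunded, StateDisputed},
	StateRefunded:   {},
	StateFailed:     {},
	StateDisputed:   {},
}

// HistoryRow mirrors the from_state, to_state, and triggered_by columns.
type HistoryRow struct {
	From, To    PaymentState
	TriggeredBy string
}

func isAllowed(from, to PaymentState) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

// AuditHistory replays rows (oldest first) and reports the first row that
// breaks the chain or uses a transition missing from the map.
func AuditHistory(rows []HistoryRow) error {
	for i, r := range rows {
		if i > 0 && rows[i-1].To != r.From {
			return fmt.Errorf("row %d: starts at %s but previous row ended at %s",
				i, r.From, rows[i-1].To)
		}
		if !isAllowed(r.From, r.To) {
			return fmt.Errorf("row %d: %s -> %s is not a valid transition (triggered by %s)",
				i, r.From, r.To, r.TriggeredBy)
		}
	}
	return nil
}

func main() {
	history := []HistoryRow{
		{StateCreated, StatePending, "api:create"},
		{StatePending, StateAuthorized, "webhook:stripe"},
		{StateAuthorized, StateSettled, "admin:manual"}, // skips captured
	}
	fmt.Println(AuditHistory(history))
}
```

A check like this could run as a periodic job or during an incident review. The triggered_by value in the error points straight at the code path that wrote the bad row.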

In Go, I wrap the transition and the history insert in a single database transaction. If either fails, both roll back:

func (r *PaymentRepo) TransitionWithHistory(
    ctx context.Context,
    paymentID string,
    from, to PaymentState,
    version int,
    triggeredBy string,
    meta map[string]any,
) error {
    tx, err := r.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Transition with optimistic lock
    res, err := tx.ExecContext(ctx, `
        UPDATE payments
        SET state = $1, previous_state = $2,
            version = version + 1, updated_at = now()
        WHERE id = $3 AND state = $4 AND version = $5`,
        to, from, paymentID, from, version,
    )
    if err != nil {
        return err
    }
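    rows, err := res.RowsAffected()
    if err != nil {
        return err
    }
    if rows == 0 {
        return ErrStaleState
    }

    // Record history; a marshal error aborts via the deferred rollback
    metaJSON, err := json.Marshal(meta)
    if err != nil {
        return fmt.Errorf("marshal history metadata: %w", err)
    }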
    _, err = tx.ExecContext(ctx, `
        INSERT INTO payment_state_history
            (payment_id, from_state, to_state, triggered_by, metadata)
        VALUES ($1, $2, $3, $4, $5)`,
        paymentID, from, to, triggeredBy, metaJSON,
    )
    if err != nil {
        return err
    }

    return tx.Commit()
}

This pattern has saved me more times than I can count. During a chargeback dispute, I pulled the full state history for the transaction, showed the exact sequence of events with timestamps and trigger sources, and the dispute was resolved in our favor within a day. Without that history, it would have been our word against the cardholder’s.

Key takeaway: A payment state machine isn’t just a design pattern — it’s a financial control. Every transition you prevent is a potential ledger discrepancy you’ll never have to reconcile. Every transition you record is evidence you’ll never have to reconstruct. Build the state machine first, then build everything else on top of it.
