The $8,000-a-Week Problem
Our payment flow touched four services: the payment provider (Stripe), a wallet service, a ledger service, and a notification service. For months, we treated this like a single operation — charge the card, update the wallet, write the ledger entry, send the receipt. If any step failed, we'd log an error and move on.
The problem? "Move on" meant different things depending on where the failure happened. If the ledger timed out after the card was charged, we had collected money with no accounting record. If the wallet update failed after the ledger write, the customer's balance was wrong. Every week, our reconciliation team was manually fixing about $8,000 in discrepancies.
Key insight: Distributed transactions across microservices can't use traditional database transactions. You need a pattern that handles partial failures explicitly — and that's exactly what the Saga pattern does.
Why Two-Phase Commit Doesn't Work Here
The textbook answer to distributed transactions is 2PC (two-phase commit). In theory, a coordinator asks all participants to prepare, then tells them all to commit. In practice, 2PC is terrible for payment microservices:
- It requires all participants to hold locks during the prepare phase — Stripe's API doesn't support that.
- If the coordinator crashes between prepare and commit, all participants are stuck holding locks indefinitely.
- It's synchronous and blocking. At 3,000 transactions per hour, we can't afford to have services waiting on each other.
- Network partitions between services turn a 2PC into a split-brain nightmare.
The Saga pattern takes a fundamentally different approach: instead of trying to make the distributed operation atomic, it breaks it into a sequence of local transactions, each with a compensating action that can undo it.
Choreography vs. Orchestration
We tried choreography first. Each service published events to Kafka, and downstream services reacted. It worked for about two months, then became impossible to debug. When a payment failed, we had to trace events across four different service logs to figure out what happened. We switched to orchestration and never looked back.
Our Payment Saga in Practice
Each step has a forward action and a compensating action:

1. Charge the card (Stripe) — compensation: refund the charge.
2. Place a wallet hold — compensation: release the hold.
3. Write the ledger entry — compensation: reverse the entry.
4. Send the receipt email — final step, nothing to compensate.

If step 3 (the ledger write) fails, the orchestrator runs compensations in reverse order: release the wallet hold, then refund the card. The receipt email never gets sent.
The Go Orchestrator
Our saga orchestrator is surprisingly simple. Each saga is a slice of steps, and each step has an Execute and a Compensate function:
import (
	"context"
	"fmt"
	"log/slog"
)

type SagaStep struct {
	Name       string
	Execute    func(ctx context.Context, state *SagaState) error
	Compensate func(ctx context.Context, state *SagaState) error
}

type SagaState struct {
	TransactionID  string
	MerchantID     string
	Amount         int64
	Currency       string
	StripeChargeID string
	WalletHoldID   string
	LedgerEntryID  string
	CompletedSteps []string
}
func RunSaga(ctx context.Context, steps []SagaStep, state *SagaState) error {
	for i, step := range steps {
		if err := step.Execute(ctx, state); err != nil {
			slog.ErrorContext(ctx, "saga step failed",
				slog.String("step", step.Name),
				slog.Int("step_index", i),
				slog.String("transaction_id", state.TransactionID),
				slog.String("error", err.Error()),
			)
			// Run compensations in reverse
			for j := i - 1; j >= 0; j-- {
				if compErr := steps[j].Compensate(ctx, state); compErr != nil {
					slog.ErrorContext(ctx, "compensation failed",
						slog.String("step", steps[j].Name),
						slog.String("transaction_id", state.TransactionID),
						slog.String("error", compErr.Error()),
					)
					// Alert on-call — manual intervention needed
					alertOncall(ctx, state.TransactionID, steps[j].Name, compErr)
				}
			}
			return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
		}
		state.CompletedSteps = append(state.CompletedSteps, step.Name)
	}
	return nil
}
Critical detail: Every saga step must be idempotent. If the orchestrator crashes and restarts, it might re-execute a step that already succeeded. We use idempotency keys for Stripe charges and unique constraint checks for ledger entries to make this safe.
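One way to get stable keys across retries is to derive them from the transaction ID, so a re-executed step always sends the same key. The derivation scheme below (SHA-256 over transaction ID plus step name) is a sketch, not necessarily what the article's system uses:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// idempotencyKey derives a stable key from the transaction ID and step
// name. A retried step produces the identical key, so the downstream
// service (e.g. Stripe's idempotency layer) can deduplicate the call.
func idempotencyKey(transactionID, step string) string {
	sum := sha256.Sum256([]byte(transactionID + ":" + step))
	return hex.EncodeToString(sum[:])
}

func main() {
	k1 := idempotencyKey("txn_123", "charge_card")
	k2 := idempotencyKey("txn_123", "charge_card")
	fmt.Println(k1 == k2) // same inputs, same key — retries are safe
}
```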
When Things Go Really Wrong
The scary scenario is when a compensation itself fails. If we can't refund the Stripe charge, we have a real problem — money has been collected and we can't give it back automatically. For these cases, we persist the saga state to PostgreSQL and push the failed compensation to a dead letter queue. An on-call engineer gets paged, and they have all the context they need to resolve it manually.
Saga State Persistence
We store every saga execution in a saga_executions table. Each row tracks the transaction ID, current step, status (running, completed, compensating, failed), and a JSON blob of the saga state. This gives us two things: crash recovery (if the orchestrator dies, another instance picks up incomplete sagas) and a complete audit trail of every payment attempt.
CREATE TABLE saga_executions (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transaction_id TEXT NOT NULL UNIQUE,
    current_step   INT NOT NULL DEFAULT 0,
    status         TEXT NOT NULL DEFAULT 'running',
    state          JSONB NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_saga_status ON saga_executions (status)
    WHERE status IN ('running', 'compensating');
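Crash recovery then reduces to a polling query over that table. A sketch of what a recovery worker might run — the one-minute staleness window and the `FOR UPDATE SKIP LOCKED` claiming strategy are assumptions, not details from the article:

```sql
-- Claim incomplete sagas that look abandoned. SKIP LOCKED lets several
-- orchestrator instances poll concurrently without grabbing the same row.
SELECT transaction_id, current_step, state
FROM saga_executions
WHERE status IN ('running', 'compensating')
  AND updated_at < now() - interval '1 minute'
FOR UPDATE SKIP LOCKED;
```

Whichever instance claims a row resumes the saga from `current_step`, which is exactly why every step has to be idempotent.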
Production Lessons
After more than a year of running this pattern at about 3,000 transactions per hour, here's what I've learned:
- Timeouts are your biggest enemy. A step that times out is ambiguous — did it succeed or fail? We use shorter timeouts (3-5 seconds) and check the actual state before compensating. If the ledger write actually succeeded but we just didn't get the response, we don't want to reverse it.
- Idempotency keys everywhere. Every Stripe charge uses an idempotency key derived from the transaction ID. Every ledger write has a unique constraint on transaction_id. Every wallet operation checks for duplicate holds. This makes retries safe.
- Monitor saga duration. We alert if any saga takes longer than 30 seconds. Long-running sagas usually mean a downstream service is degraded, and we'd rather fail fast and compensate than hang.
- Dead letter queues save lives. Failed compensations go to a DLQ in SQS with a 14-day retention. We process about 2-3 per week manually. Without the DLQ, those would be silent money leaks.
The real takeaway: The Saga pattern isn't complicated — it's just disciplined. Every forward action gets a compensating action. Every step is idempotent. Every failure is handled explicitly. The hard part isn't the code; it's the discipline of thinking through every failure mode before you ship.
References
- Microservices.io — Saga Pattern
- Temporal.io — Detecting Activity Failures in Workflows
- Stripe API — Idempotent Requests
- PostgreSQL Documentation — JSON Types
- AWS SQS — Dead Letter Queues
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.