April 9, 2026 · 11 min read

The Saga Pattern for Distributed Payment Transactions — How We Stopped Losing Money to Partial Failures

We were hemorrhaging about $8,000 a week to partial failures. A card would get charged, but the ledger update would time out, leaving us with money collected and no record of it. The Saga pattern turned our most fragile pipeline into our most reliable one.

The $8,000-a-Week Problem

Our payment flow touched four services: the payment provider (Stripe), a wallet service, a ledger service, and a notification service. For months, we treated this like a single operation — charge the card, update the wallet, write the ledger entry, send the receipt. If any step failed, we'd log an error and move on.

The problem? "Move on" meant different things depending on where the failure happened. If the ledger timed out after the card was charged, we had collected money with no accounting record. If the wallet update failed after the ledger write, the customer's balance was wrong. Every week, our reconciliation team was manually fixing about $8,000 in discrepancies.

Key insight: Distributed transactions across microservices can't use traditional database transactions. You need a pattern that handles partial failures explicitly — and that's exactly what the Saga pattern does.

Why Two-Phase Commit Doesn't Work Here

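The textbook answer to distributed transactions is 2PC (two-phase commit). In theory, a coordinator asks all participants to prepare, then tells them all to commit. In practice, 2PC is terrible for payment microservices: every participant has to implement the prepare/commit protocol and hold resources while it waits for the coordinator, and a third-party payment API such as Stripe takes no part in that protocol.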

The Saga pattern takes a fundamentally different approach: instead of trying to make the distributed operation atomic, it breaks it into a sequence of local transactions, each with a compensating action that can undo it.

Choreography vs. Orchestration

| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Services emit events, others react | Central orchestrator directs each step |
| Coupling | Low: services are independent | Higher: orchestrator knows all steps |
| Visibility | Hard to trace: logic is scattered | Easy: single place to see the flow |
| Error handling | Complex: each service handles its own | Centralized compensation logic |
| Best for | Simple flows, 2-3 services | Complex flows, 4+ services, payments |

We tried choreography first. Each service published events to Kafka, and downstream services reacted. It worked for about two months, then became impossible to debug. When a payment failed, we had to trace events across four different service logs to figure out what happened. We switched to orchestration and never looked back.

Our Payment Saga in Practice

Payment Saga: Happy Path vs. Compensation

| Step | Forward (execute) | Backward (compensate) |
|---|---|---|
| 1 | Charge Card via Stripe | Refund Card |
| 2 | Reserve Wallet Funds | Release Wallet Funds |
| 3 | Write Ledger Entry | Reverse Ledger Entry |
| 4 | Send Receipt Email | Send Failure Notice |

Each step has a forward action and a compensating action. If step 3 (ledger write) fails, the orchestrator runs compensations in reverse order: release wallet funds, then refund the card. The receipt email never gets sent.

The Go Orchestrator

Our saga orchestrator is surprisingly simple. Each saga is a slice of steps, and each step has an Execute and a Compensate function:

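import (
    "context"
    "fmt"
    "log/slog"
)

// SagaStep pairs a forward action with the compensating action that undoes it.
type SagaStep struct {
    Name       string
    Execute    func(ctx context.Context, state *SagaState) error
    Compensate func(ctx context.Context, state *SagaState) error
}

// SagaState carries the data shared by every step of one payment saga,
// including the IDs that forward steps record so compensations can undo them.
type SagaState struct {
    TransactionID  string
    MerchantID     string
    Amount         int64
    Currency       string
    StripeChargeID string
    WalletHoldID   string
    LedgerEntryID  string
    CompletedSteps []string
}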

// RunSaga executes steps in order. If a step fails, it compensates every
// previously completed step in reverse order and returns the original error.
// alertOncall is assumed to be defined elsewhere in the package.
func RunSaga(ctx context.Context, steps []SagaStep, state *SagaState) error {
    for i, step := range steps {
        if err := step.Execute(ctx, state); err != nil {
            slog.ErrorContext(ctx, "saga step failed",
                slog.String("step", step.Name),
                slog.Int("step_index", i),
                slog.String("transaction_id", state.TransactionID),
                slog.String("error", err.Error()),
            )
            // Detach from the caller's cancellation (Go 1.21+): if ctx has
            // already timed out or been canceled, reusing it would make every
            // compensation fail immediately.
            compCtx := context.WithoutCancel(ctx)
            // Run compensations in reverse
            for j := i - 1; j >= 0; j-- {
                if compErr := steps[j].Compensate(compCtx, state); compErr != nil {
                    slog.ErrorContext(compCtx, "compensation failed",
                        slog.String("step", steps[j].Name),
                        slog.String("transaction_id", state.TransactionID),
                        slog.String("error", compErr.Error()),
                    )
                    // Alert on-call: manual intervention needed
                    alertOncall(compCtx, state.TransactionID, steps[j].Name, compErr)
                }
            }
            return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
        }
        state.CompletedSteps = append(state.CompletedSteps, step.Name)
    }
    return nil
}
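
To see the reverse-order compensation in action, here is a minimal sketch that wires up the four payment steps with stub actions and forces the ledger step to fail. It is not our production wiring: `runSaga` is a trimmed copy of the orchestrator without logging or alerting, and the `Trace` field, `record`, `paymentSteps`, and `simulateLedgerFailure` are hypothetical names introduced only for this illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Trimmed stand-ins for the article's types so the sketch compiles on its own.
type SagaState struct {
	TransactionID string
	Trace         []string // hypothetical field: records which actions ran
}

type SagaStep struct {
	Name       string
	Execute    func(ctx context.Context, state *SagaState) error
	Compensate func(ctx context.Context, state *SagaState) error
}

// runSaga mirrors the orchestrator loop without logging or alerting.
func runSaga(ctx context.Context, steps []SagaStep, state *SagaState) error {
	for i, step := range steps {
		if err := step.Execute(ctx, state); err != nil {
			for j := i - 1; j >= 0; j-- {
				_ = steps[j].Compensate(ctx, state) // production code must not ignore this error
			}
			return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
		}
	}
	return nil
}

// record builds a stub action that notes its label in the trace and returns err.
func record(label string, err error) func(context.Context, *SagaState) error {
	return func(_ context.Context, s *SagaState) error {
		s.Trace = append(s.Trace, label)
		return err
	}
}

var errLedgerTimeout = errors.New("ledger timeout")

// paymentSteps lists the four steps in order; ledgerErr simulates a failure at step 3.
func paymentSteps(ledgerErr error) []SagaStep {
	return []SagaStep{
		{Name: "charge_card", Execute: record("charge card", nil), Compensate: record("refund card", nil)},
		{Name: "reserve_wallet", Execute: record("reserve wallet funds", nil), Compensate: record("release wallet funds", nil)},
		{Name: "write_ledger", Execute: record("write ledger entry", ledgerErr), Compensate: record("reverse ledger entry", nil)},
		{Name: "send_receipt", Execute: record("send receipt email", nil), Compensate: record("send failure notice", nil)},
	}
}

// simulateLedgerFailure runs the saga with a failing ledger step and returns the trace.
func simulateLedgerFailure() ([]string, error) {
	state := &SagaState{TransactionID: "txn_demo"}
	err := runSaga(context.Background(), paymentSteps(errLedgerTimeout), state)
	return state.Trace, err
}
```

Calling `simulateLedgerFailure()` from a `main` function or a test yields the trace charge card, reserve wallet funds, write ledger entry, release wallet funds, refund card. The receipt step never runs, and the failed ledger step itself is not compensated, which is one more reason its write has to be idempotent.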

Critical detail: Every saga step must be idempotent. If the orchestrator crashes and restarts, it might re-execute a step that already succeeded. We use idempotency keys for Stripe charges and unique constraint checks for ledger entries to make this safe.
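
As a rough illustration of the idempotency idea, the sketch below derives a stable key per transaction and step, and uses an in-memory map to play the role of a unique constraint. Everything here is hypothetical: `idempotencyKey`, its key format, and `memoryLedger` are stand-ins for illustration, not the Stripe client or our ledger service.

```go
package main

import "sync"

// idempotencyKey derives a stable key from the transaction and step name, so a
// retried step always presents the same key. The format is an assumption.
func idempotencyKey(transactionID, step string) string {
	return transactionID + ":" + step
}

// memoryLedger is a hypothetical in-memory stand-in for the ledger service.
// The map key plays the role of a unique constraint on the idempotency key.
type memoryLedger struct {
	mu      sync.Mutex
	entries map[string]int64
}

func newMemoryLedger() *memoryLedger {
	return &memoryLedger{entries: make(map[string]int64)}
}

// Write records the entry once. A repeat with the same key is a no-op and
// reports created=false, much like INSERT ... ON CONFLICT DO NOTHING.
func (l *memoryLedger) Write(key string, amount int64) (created bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, exists := l.entries[key]; exists {
		return false
	}
	l.entries[key] = amount
	return true
}

// Total sums every recorded amount.
func (l *memoryLedger) Total() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	var sum int64
	for _, amount := range l.entries {
		sum += amount
	}
	return sum
}
```

With this shape, a restarted orchestrator can call `Write` again with the same key and the ledger total stays unchanged, so the step can treat "already exists" as success.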

When Things Go Really Wrong

Partial Failure Scenario — Ledger Timeout
t=0ms Stripe charge $450 ✓ Success
t=340ms Reserve wallet funds ✓ Success
t=5340ms Write ledger entry ✗ Timeout (5s)
t=5341ms COMPENSATE: Release wallet funds
t=5520ms COMPENSATE: Refund Stripe charge
t=5890ms Saga complete — customer notified of failure, no money lost
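
One way to get the per-step timeout shown in the timeline is to wrap each action with `context.WithTimeout`. The sketch below is an assumption about how that could look, not a copy of our code: `action`, `withTimeout`, `slowLedgerWrite`, and `ledgerTimedOut` are hypothetical names, and the stub only simulates latency.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// action is the shape of one step call in this sketch.
type action func(ctx context.Context) error

// withTimeout bounds a single step so a hung dependency cannot stall the saga.
func withTimeout(limit time.Duration, fn action) action {
	return func(ctx context.Context) error {
		ctx, cancel := context.WithTimeout(ctx, limit)
		defer cancel()
		return fn(ctx)
	}
}

// slowLedgerWrite is a hypothetical stub that takes `latency` to respond
// unless the context ends first.
func slowLedgerWrite(latency time.Duration) action {
	return func(ctx context.Context) error {
		select {
		case <-time.After(latency):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// ledgerTimedOut reports whether a write with the given latency exceeds the
// limit. Both arguments are in milliseconds.
func ledgerTimedOut(limitMs, latencyMs int) bool {
	limit := time.Duration(limitMs) * time.Millisecond
	latency := time.Duration(latencyMs) * time.Millisecond
	err := withTimeout(limit, slowLedgerWrite(latency))(context.Background())
	return errors.Is(err, context.DeadlineExceeded)
}
```

A timeout only tells the orchestrator that it stopped waiting, not whether the write landed on the other side. That ambiguity is why the ledger write needs to be idempotent.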

The scary scenario is when a compensation itself fails. If we can't refund the Stripe charge, we have a real problem — money has been collected and we can't give it back automatically. For these cases, we persist the saga state to PostgreSQL and push the failed compensation to a dead letter queue. An on-call engineer gets paged, and they have all the context they need to resolve it manually.
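
To make "all the context they need" concrete, here is a sketch of what a dead letter message for a failed compensation might contain. The `DeadLetter` shape, its field names, and the helper functions are assumptions for illustration, and `SagaState` is a trimmed copy of the real struct. The example amount assumes minor units (cents).

```go
package main

import (
	"encoding/json"
	"errors"
	"time"
)

// SagaState is a trimmed copy of the article's struct.
type SagaState struct {
	TransactionID  string
	Amount         int64
	Currency       string
	StripeChargeID string
}

// DeadLetter is a hypothetical message shape for a compensation that failed.
type DeadLetter struct {
	TransactionID string    `json:"transaction_id"`
	FailedStep    string    `json:"failed_step"`
	Error         string    `json:"error"`
	State         SagaState `json:"state"`
	FailedAt      time.Time `json:"failed_at"`
}

// newDeadLetter serializes everything an on-call engineer needs to finish the
// compensation by hand.
func newDeadLetter(state SagaState, step string, cause error, now time.Time) ([]byte, error) {
	return json.Marshal(DeadLetter{
		TransactionID: state.TransactionID,
		FailedStep:    step,
		Error:         cause.Error(),
		State:         state,
		FailedAt:      now,
	})
}

// decodeDeadLetter parses a message produced by newDeadLetter.
func decodeDeadLetter(raw []byte) (DeadLetter, error) {
	var dl DeadLetter
	err := json.Unmarshal(raw, &dl)
	return dl, err
}

// demoDeadLetter builds and decodes a message for a failed card refund.
func demoDeadLetter() (DeadLetter, error) {
	state := SagaState{
		TransactionID:  "txn_demo",
		Amount:         45000, // $450.00, assuming minor units
		Currency:       "usd",
		StripeChargeID: "ch_demo",
	}
	raw, err := newDeadLetter(state, "refund_card", errors.New("refund request timed out"), time.Now().UTC())
	if err != nil {
		return DeadLetter{}, err
	}
	return decodeDeadLetter(raw)
}
```

Carrying the full saga state in the message means the engineer has the charge ID and amount at hand without first querying the database.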

Saga State Persistence

We store every saga execution in a saga_executions table. Each row tracks the transaction ID, current step, status (running, completed, compensating, failed), and a JSON blob of the saga state. This gives us two things: crash recovery (if the orchestrator dies, another instance picks up incomplete sagas) and a complete audit trail of every payment attempt.

CREATE TABLE saga_executions (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transaction_id  TEXT NOT NULL UNIQUE,
    current_step    INT NOT NULL DEFAULT 0,
    status          TEXT NOT NULL DEFAULT 'running',
    state           JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_saga_status ON saga_executions(status)
    WHERE status IN ('running', 'compensating');
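
For crash recovery, an instance that picks up an incomplete row has to decide what to run next. The sketch below shows one possible decision rule. It assumes `current_step` holds the index of the step that was in flight when the row was last updated, which is an assumption and not something the schema itself dictates. `resumePlan` and `planResume` are hypothetical names.

```go
package main

// resumePlan says which step indexes a recovering orchestrator should run.
type resumePlan struct {
	Compensate bool  // true: run Compensate on each index; false: run Execute
	Steps      []int // step indexes in the order they should run
}

// planResume maps a persisted row to the work that remains. It assumes
// currentStep is the index of the step that was in flight at the last update.
func planResume(status string, currentStep, totalSteps int) resumePlan {
	plan := resumePlan{}
	switch status {
	case "running":
		// Re-run the in-flight step and continue forward. This is safe only
		// because every step is idempotent.
		for i := currentStep; i < totalSteps; i++ {
			plan.Steps = append(plan.Steps, i)
		}
	case "compensating":
		// Undo every step before the failed one, newest first. Compensations
		// that already ran are repeated, so they must be idempotent too.
		plan.Compensate = true
		for i := currentStep - 1; i >= 0; i-- {
			plan.Steps = append(plan.Steps, i)
		}
	}
	// "completed" and "failed" rows need no further work.
	return plan
}
```

For example, `planResume("running", 2, 4)` plans steps 2 and 3 forward, while `planResume("compensating", 2, 4)` plans compensations for steps 1 and 0.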

Production Lessons

After running this pattern for over a year at about 3,000 transactions per hour, here's what we've seen:

99.7%: sagas that complete without needing compensation
~850ms: average saga completion time (4 steps)
$0: weekly reconciliation discrepancy (down from $8K)

The real takeaway: The Saga pattern isn't complicated — it's just disciplined. Every forward action gets a compensating action. Every step is idempotent. Every failure is handled explicitly. The hard part isn't the code; it's the discipline of thinking through every failure mode before you ship.
