The $8,000-a-Week Problem
Our payment flow touched four services: the payment provider (Stripe), a wallet service, a ledger service, and a notification service. For months, we treated this like a single operation — charge the card, update the wallet, write the ledger entry, send the receipt. If any step failed, we'd log an error and move on.
The problem? "Move on" meant different things depending on where the failure happened. If the ledger timed out after the card was charged, we had collected money with no accounting record. If the wallet update failed after the ledger write, the customer's balance was wrong. Every week, our reconciliation team was manually fixing about $8,000 in discrepancies.
Key insight: Distributed transactions across microservices can't use traditional database transactions. You need a pattern that handles partial failures explicitly — and that's exactly what the Saga pattern does.
Why Two-Phase Commit Doesn't Work Here
The textbook answer to distributed transactions is 2PC (two-phase commit). In theory, a coordinator asks all participants to prepare, then tells them all to commit. In practice, 2PC is terrible for payment microservices:
- It requires all participants to hold locks during the prepare phase — Stripe's API doesn't support that.
- If the coordinator crashes between prepare and commit, all participants are stuck holding locks indefinitely.
- It's synchronous and blocking. At 3,000 transactions per hour, we can't afford to have services waiting on each other.
- Network partitions between services turn a 2PC into a split-brain nightmare.
The Saga pattern takes a fundamentally different approach: instead of trying to make the distributed operation atomic, it breaks it into a sequence of local transactions, each with a compensating action that can undo it.
Choreography vs. Orchestration
We tried choreography first. Each service published events to Kafka, and downstream services reacted. It worked for about two months, then became impossible to debug. When a payment failed, we had to trace events across four different service logs to figure out what happened. We switched to orchestration and never looked back.
Our Payment Saga in Practice
Each step has a forward action and a compensating action:

1. Charge the card (Stripe) — compensation: refund the charge.
2. Place a wallet hold — compensation: release the hold.
3. Write the ledger entry — compensation: reverse the entry.
4. Send the receipt email — final step, nothing to compensate.

If step 3 (the ledger write) fails, the orchestrator runs compensations in reverse order: release the wallet hold, then refund the card. The receipt email never gets sent.
The Go Orchestrator
Our saga orchestrator is surprisingly simple. Each saga is a slice of steps, and each step has an Execute and a Compensate function:
import (
	"context"
	"fmt"
	"log/slog"
)

type SagaStep struct {
	Name       string
	Execute    func(ctx context.Context, state *SagaState) error
	Compensate func(ctx context.Context, state *SagaState) error
}

type SagaState struct {
	TransactionID  string
	MerchantID     string
	Amount         int64
	Currency       string
	StripeChargeID string
	WalletHoldID   string
	LedgerEntryID  string
	CompletedSteps []string
}
func RunSaga(ctx context.Context, steps []SagaStep, state *SagaState) error {
	for i, step := range steps {
		if err := step.Execute(ctx, state); err != nil {
			slog.ErrorContext(ctx, "saga step failed",
				slog.String("step", step.Name),
				slog.Int("step_index", i),
				slog.String("transaction_id", state.TransactionID),
				slog.String("error", err.Error()),
			)
			// Run compensations in reverse
			for j := i - 1; j >= 0; j-- {
				if compErr := steps[j].Compensate(ctx, state); compErr != nil {
					slog.ErrorContext(ctx, "compensation failed",
						slog.String("step", steps[j].Name),
						slog.String("transaction_id", state.TransactionID),
						slog.String("error", compErr.Error()),
					)
					// Alert on-call — manual intervention needed
					alertOncall(ctx, state.TransactionID, steps[j].Name, compErr)
				}
			}
			return fmt.Errorf("saga failed at step %s: %w", step.Name, err)
		}
		state.CompletedSteps = append(state.CompletedSteps, step.Name)
	}
	return nil
}
Critical detail: Every saga step must be idempotent. If the orchestrator crashes and restarts, it might re-execute a step that already succeeded. We use idempotency keys for Stripe charges and unique constraint checks for ledger entries to make this safe.
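One way to get stable keys across retries is to derive them from the transaction ID, so a re-executed step always sends the same key. The derivation scheme below (SHA-256 over transaction ID plus step name) is a sketch, not necessarily what the article's system uses:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// idempotencyKey derives a stable key from the transaction ID and step
// name. A retried step produces the identical key, so the downstream
// service (e.g. Stripe's idempotency layer) can deduplicate the call.
func idempotencyKey(transactionID, step string) string {
	sum := sha256.Sum256([]byte(transactionID + ":" + step))
	return hex.EncodeToString(sum[:])
}

func main() {
	k1 := idempotencyKey("txn_123", "charge_card")
	k2 := idempotencyKey("txn_123", "charge_card")
	fmt.Println(k1 == k2) // same inputs, same key — retries are safe
}
```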
When Things Go Really Wrong
The scary scenario is when a compensation itself fails. If we can't refund the Stripe charge, we have a real problem — money has been collected and we can't give it back automatically. For these cases, we persist the saga state to PostgreSQL and push the failed compensation to a dead letter queue. An on-call engineer gets paged, and they have all the context they need to resolve it manually.
Saga State Persistence
We store every saga execution in a saga_executions table. Each row tracks the transaction ID, current step, status (running, completed, compensating, failed), and a JSON blob of the saga state. This gives us two things: crash recovery (if the orchestrator dies, another instance picks up incomplete sagas) and a complete audit trail of every payment attempt.
CREATE TABLE saga_executions (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transaction_id TEXT NOT NULL UNIQUE,
    current_step   INT NOT NULL DEFAULT 0,
    status         TEXT NOT NULL DEFAULT 'running',
    state          JSONB NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_saga_status ON saga_executions (status)
    WHERE status IN ('running', 'compensating');
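Crash recovery then reduces to a polling query over that table. A sketch of what a recovery worker might run — the one-minute staleness window and the `FOR UPDATE SKIP LOCKED` claiming strategy are assumptions, not details from the article:

```sql
-- Claim incomplete sagas that look abandoned. SKIP LOCKED lets several
-- orchestrator instances poll concurrently without grabbing the same row.
SELECT transaction_id, current_step, state
FROM saga_executions
WHERE status IN ('running', 'compensating')
  AND updated_at < now() - interval '1 minute'
FOR UPDATE SKIP LOCKED;
```

Whichever instance claims a row resumes the saga from `current_step`, which is exactly why every step has to be idempotent.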
Production Lessons
After more than a year of running this pattern at about 3,000 transactions per hour, here's what I've learned:
- Timeouts are your biggest enemy. A step that times out is ambiguous — did it succeed or fail? We use shorter timeouts (3-5 seconds) and check the actual state before compensating. If the ledger write actually succeeded but we just didn't get the response, we don't want to reverse it.
- Idempotency keys everywhere. Every Stripe charge uses an idempotency key derived from the transaction ID. Every ledger write has a unique constraint on transaction_id. Every wallet operation checks for duplicate holds. This makes retries safe.
- Monitor saga duration. We alert if any saga takes longer than 30 seconds. Long-running sagas usually mean a downstream service is degraded, and we'd rather fail fast and compensate than hang.
- Dead letter queues save lives. Failed compensations go to a DLQ in SQS with a 14-day retention. We process about 2-3 per week manually. Without the DLQ, those would be silent money leaks.
The real takeaway: The Saga pattern isn't complicated — it's just disciplined. Every forward action gets a compensating action. Every step is idempotent. Every failure is handled explicitly. The hard part isn't the code; it's the discipline of thinking through every failure mode before you ship.
References
- Microservices.io — Saga Pattern
- Temporal.io — Detecting Activity Failures in Workflows
- Stripe API — Idempotent Requests
- PostgreSQL Documentation — JSON Types
- AWS SQS — Dead Letter Queues
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.