Why Message Queues in Payment Systems
Synchronous request-response works fine until it doesn't. A payment service that directly calls the ledger service, the notification service, and the fraud engine in sequence is one slow downstream away from timing out the entire checkout. I've watched a 200ms fraud check spike to 8 seconds during a model reload, taking the whole payment flow down with it.
Message queues decouple these dependencies. The payment service publishes an event — payment.captured — and moves on. Downstream consumers process it at their own pace. The fraud engine can be slow, the notification service can be down for a deploy, and the customer still gets their confirmation in under a second.
But queues introduce their own failure modes. Messages get lost. Consumers crash mid-processing. Events arrive out of order. In payment systems, every one of these failures means money is wrong somewhere. The patterns in this article are the ones I've used to keep that from happening.
Kafka vs RabbitMQ — Picking the Right Tool
I've run both in production for payment workloads. They solve different problems, and picking the wrong one costs you months of workarounds. Here's the honest comparison from running them side by side.
For our core payment event pipeline — the stream that feeds the ledger, reconciliation, and audit trail — we use Kafka. The replay capability alone justifies it. When we found a bug in our ledger consumer last year, we rewound the offset by 48 hours and reprocessed 2.3 million events. With RabbitMQ, those messages would have been gone.
For task-oriented work — sending receipt emails, triggering webhook deliveries, scheduling retry attempts — we use RabbitMQ. The routing flexibility is genuinely useful when you need to fan out a single payment event to six different consumers with different filtering rules. Kafka can do this with consumer groups, but RabbitMQ's exchange-binding model is more natural for it.
My rule of thumb: If you need an immutable event log that multiple services can read independently, use Kafka. If you need a work queue where each message is handed to exactly one worker, use RabbitMQ. Most payment systems need both.
The Transactional Outbox Pattern
This is the pattern that fixed our 312-event data loss. The problem is deceptively simple: you need to update your database and publish a message to the broker, and both need to succeed or neither should. But databases and message brokers are separate systems — distributed transactions that span both (XA-style two-phase commit) are rarely supported by brokers and rarely worth the operational cost.
The naive approach is what we had: write to Postgres, then publish to RabbitMQ. If the publish fails (network blip, broker restart, timeout), your database has the payment record but no event was emitted. Downstream services never find out. The ledger is wrong. Reconciliation breaks.
The transactional outbox solves this by writing the event to the database as part of the same transaction that writes the business data. A separate process (the poller or CDC connector) reads from the outbox table and publishes to the broker.
Here's the schema and the write path in Go:
CREATE TABLE outbox (
    id           BIGSERIAL PRIMARY KEY,
    event_type   TEXT NOT NULL,
    payload      JSONB NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    published    BOOLEAN NOT NULL DEFAULT false,
    published_at TIMESTAMPTZ
);

CREATE INDEX idx_outbox_unpublished ON outbox (id)
    WHERE published = false;
func CreatePayment(ctx context.Context, tx *sql.Tx, p Payment) error {
    // Insert the payment record
    _, err := tx.ExecContext(ctx,
        `INSERT INTO payments (id, amount, currency, status)
         VALUES ($1, $2, $3, $4)`,
        p.ID, p.Amount, p.Currency, "captured",
    )
    if err != nil {
        return fmt.Errorf("insert payment: %w", err)
    }

    // Write the event to the outbox in the SAME transaction
    payload, err := json.Marshal(PaymentCapturedEvent{
        PaymentID: p.ID,
        Amount:    p.Amount,
        Currency:  p.Currency,
        Timestamp: time.Now().UTC(),
    })
    if err != nil {
        return fmt.Errorf("marshal event: %w", err)
    }

    _, err = tx.ExecContext(ctx,
        `INSERT INTO outbox (event_type, payload)
         VALUES ($1, $2)`,
        "payment.captured", payload,
    )
    if err != nil {
        return fmt.Errorf("insert outbox: %w", err)
    }

    return nil // Both writes commit or both roll back
}
The key insight: both writes are in the same Postgres transaction. If the payment insert succeeds but the outbox insert fails, the whole transaction rolls back. No orphaned records. No lost events. The outbox poller is a separate goroutine that reads unpublished rows and pushes them to the broker:
func (p *OutboxPoller) Run(ctx context.Context) {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            p.drainBatch(ctx)
        }
    }
}

func (p *OutboxPoller) drainBatch(ctx context.Context) {
    rows, err := p.db.QueryContext(ctx,
        `SELECT id, event_type, payload FROM outbox
         WHERE published = false
         ORDER BY id ASC LIMIT 100`)
    if err != nil {
        log.Error("outbox query failed", "err", err)
        return
    }
    defer rows.Close()
    for rows.Next() {
        var id int64
        var eventType string
        var payload []byte
        if err := rows.Scan(&id, &eventType, &payload); err != nil {
            log.Error("outbox scan failed", "err", err)
            return
        }
        if err := p.broker.Publish(ctx, eventType, payload); err != nil {
            log.Error("publish failed", "id", id, "err", err)
            return // Retry on next tick
        }
        if _, err := p.db.ExecContext(ctx,
            `UPDATE outbox SET published = true,
             published_at = now() WHERE id = $1`, id); err != nil {
            // The event may be re-published next tick; consumers must be idempotent
            log.Error("mark published failed", "id", id, "err", err)
            return
        }
    }
    if err := rows.Err(); err != nil {
        log.Error("outbox rows iteration failed", "err", err)
    }
}
Why not CDC? Change Data Capture (Debezium tailing the Postgres WAL) is the more sophisticated version of this pattern. It eliminates polling latency and scales better. But it adds operational complexity — another service to deploy, monitor, and debug. For most payment systems doing under 10,000 events per second, the polling approach is simpler and plenty fast. We switched to Debezium only when our outbox table was getting polled 2,000 times per second across multiple service instances.
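If you do go the Debezium route, the Kafka Connect registration looks roughly like the sketch below, using Debezium's outbox EventRouter SMT to route rows from the outbox table by the event_type column. Hostnames, credentials, and the connector name are placeholders, and option names should be verified against the Debezium documentation for your version:

```json
{
  "name": "payments-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "payments-db",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "changeme",
    "database.dbname": "payments",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "event_type"
  }
}
```

The EventRouter reads committed outbox rows from the WAL and publishes them, so the polling goroutine above goes away entirely.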
The "Exactly-Once" Myth
Let me save you some time: exactly-once delivery doesn't exist in distributed systems. What you actually get is at-least-once delivery with idempotent consumers. Kafka's "exactly-once semantics" (EOS) applies within the Kafka ecosystem — produce-transform-consume pipelines using Kafka Streams. The moment your consumer writes to an external database, you're back to at-least-once.
This matters for payment systems because a consumer that processes a payment.captured event twice will credit the merchant twice. The fix is the same pattern I covered in my idempotent API article: every consumer maintains a processed-events table and checks it before doing any work.
func (c *LedgerConsumer) Handle(ctx context.Context, msg Message) error {
    // Idempotency check — have we seen this event before?
    res, err := c.db.ExecContext(ctx,
        `INSERT INTO processed_events (event_id, consumer, processed_at)
         VALUES ($1, $2, now())
         ON CONFLICT (event_id, consumer) DO NOTHING`,
        msg.ID, "ledger-consumer",
    )
    if err != nil {
        return fmt.Errorf("idempotency check: %w", err)
    }
    rowsAffected, err := res.RowsAffected()
    if err != nil {
        return fmt.Errorf("rows affected: %w", err)
    }
    if rowsAffected == 0 {
        log.Info("duplicate event, skipping", "event_id", msg.ID)
        return nil // Already processed
    }
    // Safe to credit the merchant. In production, run this insert and the
    // credit in one transaction so a crash between them can't strand the event.
    return c.creditMerchant(ctx, msg.Payload)
}
Don't rely on broker-level deduplication. Kafka's idempotent producer prevents duplicate publishes from the producer side, but it doesn't help with consumer-side duplicates caused by rebalances, offset commit failures, or consumer restarts. Your consumer must be idempotent regardless of what the broker promises.
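For completeness, these are the standard Kafka producer settings that enable the idempotent producer. They eliminate producer-side duplicates only; consumer-side deduplication is still on you:

```properties
enable.idempotence=true
# Required by the idempotent producer: wait for all in-sync replicas
acks=all
# Must be 5 or fewer for ordering guarantees to hold with idempotence
max.in.flight.requests.per.connection=5
```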
Dead Letter Queues — When Processing Fails
Some messages can't be processed. The payload is malformed. The referenced payment doesn't exist yet (ordering issue). A downstream service returns an unrecoverable error. You need a place for these messages to go that isn't "silently dropped" or "retry forever in an infinite loop."
That place is the dead letter queue (DLQ). After a configurable number of retry attempts, failed messages get routed to a separate queue for manual inspection.
consume ──► success? ──yes──► ACK
               │ no
               ▼
     attempts < max (3)? ──yes──► retry
               │ no
               ▼
              DLQ
In RabbitMQ, DLQ setup is native — you configure an x-dead-letter-exchange on your queue and failed messages route automatically. With Kafka, you typically implement it in your consumer code: after N failed attempts, produce the message to a *.dlq topic.
We alert on DLQ depth in Grafana. Any message landing in the DLQ triggers a Slack notification to the payments channel. An engineer investigates, fixes the root cause (usually a schema mismatch or a missing database migration), and replays the messages. In 18 months, we've had 47 DLQ events total — all resolved within the same business day.
Ordering Guarantees and Partition Keys
Payment events for the same transaction must be processed in order. A payment.refunded event arriving before payment.captured will break your ledger. Kafka guarantees ordering within a partition, so the trick is making sure all events for the same payment land on the same partition.
We use the payment ID as the partition key. Kafka hashes it consistently, so payment.created, payment.captured, and payment.refunded for payment pay_abc123 always go to the same partition and are consumed in order.
// Partition key ensures ordering per payment
err := producer.Produce(ctx, &kafka.Message{
    Topic: "payment-events",
    Key:   []byte(payment.ID), // All events for this payment → same partition
    Value: eventBytes,
})
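The stable key-to-partition mapping is just a hash modulo the partition count. Kafka's Java client uses murmur2 for this; the sketch below uses FNV-1a from the Go standard library purely to stay dependency-free, since the property that matters is that the same key always maps to the same partition:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a key to a partition deterministically.
// Kafka's default partitioner uses murmur2; FNV-1a here is only
// an illustration of the same hash-then-modulo idea.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	// Every event for pay_abc123 maps to the same partition,
	// so created / captured / refunded are consumed in order.
	p1 := partitionFor("pay_abc123", 12)
	p2 := partitionFor("pay_abc123", 12)
	fmt.Println(p1 == p2) // prints true
}
```

This is also why changing the partition count on a live topic is dangerous: the modulo changes, keys remap, and ordering guarantees break for in-flight payments.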
Watch your partition count. More partitions means more parallelism, but it also means more consumers needed to maintain ordering. We run 12 partitions for our payment events topic — enough for 12 concurrent consumers, which handles our throughput with headroom. Don't start with 100 partitions because a blog post told you to. Start with what you need and scale up.
Monitoring Your Queue Health
The three metrics that have saved us from production incidents more than anything else:
- Consumer lag — the gap between the latest produced offset and the consumer's committed offset. If lag is growing, your consumers can't keep up. We alert at 1,000 messages of lag and page at 10,000.
- Outbox table depth — how many unpublished rows are sitting in the outbox. Should be near zero in steady state. A growing count means the poller is stuck or the broker is unreachable.
- DLQ depth — any non-zero value needs investigation. We treat DLQ messages like production bugs — they get a ticket and a root cause analysis.
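The consumer-lag metric above is simple arithmetic over offsets. A minimal sketch, with type and function names of my choosing; in production the offsets would come from the broker's admin API rather than being passed in directly:

```go
package main

import "fmt"

// PartitionOffsets holds the latest produced offset and the consumer's
// committed offset for one partition.
type PartitionOffsets struct {
	Latest    int64
	Committed int64
}

// totalLag sums (latest - committed) across partitions, clamping
// negative deltas that can appear transiently during rebalances.
func totalLag(parts []PartitionOffsets) int64 {
	var lag int64
	for _, p := range parts {
		if d := p.Latest - p.Committed; d > 0 {
			lag += d
		}
	}
	return lag
}

func main() {
	parts := []PartitionOffsets{
		{Latest: 5000, Committed: 4800},
		{Latest: 7000, Committed: 6100},
	}
	fmt.Println(totalLag(parts)) // prints 1100
}
```

Alert thresholds like the 1,000 / 10,000 figures mentioned above would be applied to this total, ideally alongside a rate-of-change check so a slowly growing lag gets caught early.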
The Patterns That Actually Matter
After running message queues in payment systems for several years, the patterns that matter aren't the exotic ones. They're boring and well-understood:
- Use the transactional outbox to guarantee event publication. Never write to the DB and broker separately.
- Make every consumer idempotent. At-least-once delivery is the reality — design for it.
- Use dead letter queues with alerting. Failed messages need a home and an owner.
- Partition by payment ID for ordering. Don't fight the broker's model.
- Use Kafka for event streams, RabbitMQ for task queues. Pick the right tool for the job.
None of this is groundbreaking. But after the 312-event incident, I stopped looking for clever solutions and started applying these patterns consistently. The result: zero lost payment events in 18 months, across three Kafka clusters and two RabbitMQ instances processing over 4 million events per day. Boring works.
References
- Apache Kafka Documentation — official reference for producers, consumers, and exactly-once semantics
- RabbitMQ Reliability Guide — publisher confirms, consumer acknowledgements, and dead lettering
- Microservices.io — Transactional Outbox Pattern — Chris Richardson's canonical description of the pattern
- Confluent — Exactly-Once Semantics in Apache Kafka — deep dive on Kafka's EOS implementation
- RabbitMQ Dead Letter Exchanges — official documentation on DLQ configuration
- Debezium Documentation — CDC-based outbox pattern implementation
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity and may omit error handling, logging, and security measures — always review and adapt for your specific use case and security requirements. This is not financial or legal advice.