April 7, 2026 · 10 min read

Message Queues for Payment Processing — Kafka, RabbitMQ, and the Patterns That Matter

We lost 312 payment events on a Tuesday afternoon because our service wrote to the database and published to RabbitMQ in separate steps. The DB committed. RabbitMQ didn't. Here's everything I learned about message queues in payment systems after cleaning up that mess — and the patterns that have kept us at 99.99% delivery since.

Why Message Queues in Payment Systems

Synchronous request-response works fine until it doesn't. A payment service that directly calls the ledger service, the notification service, and the fraud engine in sequence is one slow downstream away from timing out the entire checkout. I've watched a 200ms fraud check spike to 8 seconds during a model reload, taking the whole payment flow down with it.

Message queues decouple these dependencies. The payment service publishes an event — payment.captured — and moves on. Downstream consumers process it at their own pace. The fraud engine can be slow, the notification service can be down for a deploy, and the customer still gets their confirmation in under a second.

But queues introduce their own failure modes. Messages get lost. Consumers crash mid-processing. Events arrive out of order. In payment systems, every one of these failures means money is wrong somewhere. The patterns in this article are the ones I've used to keep that from happening.

99.99% message delivery rate · <50ms p99 publish latency · 0 lost payment events (18 months)

Kafka vs RabbitMQ — Picking the Right Tool

I've run both in production for payment workloads. They solve different problems, and picking the wrong one costs you months of workarounds. Here's the honest comparison from running them side by side.

Dimension          | Apache Kafka                                                | RabbitMQ
-------------------|-------------------------------------------------------------|----------------------------------------------------
Throughput         | Millions of msgs/sec per cluster                            | Tens of thousands of msgs/sec
Ordering           | Guaranteed per partition                                    | Per-queue FIFO (single consumer)
Message replay     | Yes — offset-based retention                                | No — consumed messages are gone
Delivery semantics | At-least-once (exactly-once with transactions)              | At-least-once (with publisher confirms)
Routing            | Topic-based, simple                                         | Flexible — exchanges, bindings, headers
Best for payments  | Event sourcing, audit logs, high-volume settlement streams  | Task queues, notification dispatch, low-latency RPC

For our core payment event pipeline — the stream that feeds the ledger, reconciliation, and audit trail — we use Kafka. The replay capability alone justifies it. When we found a bug in our ledger consumer last year, we rewound the offset by 48 hours and reprocessed 2.3 million events. With RabbitMQ, those messages would have been gone.

For task-oriented work — sending receipt emails, triggering webhook deliveries, scheduling retry attempts — we use RabbitMQ. The routing flexibility is genuinely useful when you need to fan out a single payment event to six different consumers with different filtering rules. Kafka can do this with consumer groups, but RabbitMQ's exchange-binding model is more natural for it.
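The exchange-binding model is easy to internalize once you see the wildcard rules. Here is an illustrative sketch (not RabbitMQ's actual code) of how a topic exchange decides whether a binding pattern matches a routing key: "*" matches exactly one dot-separated word, "#" matches zero or more.

```go
package main

import (
	"fmt"
	"strings"
)

// topicMatch sketches AMQP topic-exchange semantics: a binding pattern is
// matched word-by-word against a dot-separated routing key, where "*"
// matches exactly one word and "#" matches zero or more words.
func topicMatch(pattern, routingKey string) bool {
	return matchWords(strings.Split(pattern, "."), strings.Split(routingKey, "."))
}

func matchWords(pattern, key []string) bool {
	if len(pattern) == 0 {
		return len(key) == 0
	}
	if pattern[0] == "#" {
		// "#" can absorb zero words, or one word and keep matching.
		if matchWords(pattern[1:], key) {
			return true
		}
		return len(key) > 0 && matchWords(pattern, key[1:])
	}
	if len(key) == 0 {
		return false
	}
	if pattern[0] == "*" || pattern[0] == key[0] {
		return matchWords(pattern[1:], key[1:])
	}
	return false
}

func main() {
	fmt.Println(topicMatch("payment.*", "payment.captured"))       // true
	fmt.Println(topicMatch("payment.#", "payment.refund.partial")) // true
	fmt.Println(topicMatch("payment.*", "payment.refund.partial")) // false
}
```

Binding six consumers with different patterns to one exchange gets you the per-consumer filtering described above without any consumer-side code.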

My rule of thumb: If you need an immutable event log that multiple services can read independently, use Kafka. If you need a work queue where each message is processed exactly once by one worker, use RabbitMQ. Most payment systems need both.

The Transactional Outbox Pattern

This is the pattern that fixed our 312-event data loss. The problem is deceptively simple: you need to update your database and publish a message to the broker, and both need to succeed or neither should. But databases and message brokers are separate systems — there's no distributed transaction that reliably spans both.

The naive approach is what we had: write to Postgres, then publish to RabbitMQ. If the publish fails (network blip, broker restart, timeout), your database has the payment record but no event was emitted. Downstream services never find out. The ledger is wrong. Reconciliation breaks.

The transactional outbox solves this by writing the event to the database as part of the same transaction that writes the business data. A separate process (the poller or CDC connector) reads from the outbox table and publishes to the broker.

[Diagram: Transactional Outbox Pattern. The payment service opens one PostgreSQL transaction (BEGIN TXN) that writes to both the payments and outbox tables; an outbox poller reads the outbox every 500ms and publishes to Kafka/RabbitMQ, where consumers pick the events up.]

Here's the schema and the write path in Go:

CREATE TABLE outbox (
    id          BIGSERIAL PRIMARY KEY,
    event_type  TEXT NOT NULL,
    payload     JSONB NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    published   BOOLEAN NOT NULL DEFAULT false,
    published_at TIMESTAMPTZ
);

CREATE INDEX idx_outbox_unpublished ON outbox (id)
    WHERE published = false;

func CreatePayment(ctx context.Context, tx *sql.Tx, p Payment) error {
    // Insert the payment record
    _, err := tx.ExecContext(ctx,
        `INSERT INTO payments (id, amount, currency, status)
         VALUES ($1, $2, $3, $4)`,
        p.ID, p.Amount, p.Currency, "captured",
    )
    if err != nil {
        return fmt.Errorf("insert payment: %w", err)
    }

    // Write the event to the outbox in the SAME transaction
    payload, err := json.Marshal(PaymentCapturedEvent{
        PaymentID: p.ID,
        Amount:    p.Amount,
        Currency:  p.Currency,
        Timestamp: time.Now().UTC(),
    })
    if err != nil {
        return fmt.Errorf("marshal event: %w", err)
    }
    _, err = tx.ExecContext(ctx,
        `INSERT INTO outbox (event_type, payload)
         VALUES ($1, $2)`,
        "payment.captured", payload,
    )
    if err != nil {
        return fmt.Errorf("insert outbox: %w", err)
    }

    return nil // Both writes commit or both roll back
}

The key insight: both writes are in the same Postgres transaction. If the payment insert succeeds but the outbox insert fails, the whole transaction rolls back. No orphaned records. No lost events. The outbox poller is a separate goroutine that reads unpublished rows and pushes them to the broker:

func (p *OutboxPoller) Run(ctx context.Context) {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            p.pollOnce(ctx)
        }
    }
}

func (p *OutboxPoller) pollOnce(ctx context.Context) {
    rows, err := p.db.QueryContext(ctx,
        `SELECT id, event_type, payload FROM outbox
         WHERE published = false
         ORDER BY id ASC LIMIT 100`)
    if err != nil {
        log.Error("outbox query failed", "err", err)
        return
    }
    defer rows.Close()

    for rows.Next() {
        var id int64
        var eventType string
        var payload []byte
        if err := rows.Scan(&id, &eventType, &payload); err != nil {
            log.Error("outbox scan failed", "err", err)
            return
        }

        if err := p.broker.Publish(ctx, eventType, payload); err != nil {
            // Stop here; unpublished rows are retried on the next tick.
            log.Error("publish failed", "id", id, "err", err)
            return
        }

        // If this update fails after a successful publish, the row is
        // re-published on the next tick -- at-least-once, by design.
        if _, err := p.db.ExecContext(ctx,
            `UPDATE outbox SET published = true,
             published_at = now() WHERE id = $1`, id); err != nil {
            log.Error("mark published failed", "id", id, "err", err)
            return
        }
    }
    if err := rows.Err(); err != nil {
        log.Error("outbox iteration failed", "err", err)
    }
}

Why not CDC? Change Data Capture (Debezium tailing the Postgres WAL) is the more sophisticated version of this pattern. It eliminates polling latency and scales better. But it adds operational complexity — another service to deploy, monitor, and debug. For most payment systems doing under 10,000 events per second, the polling approach is simpler and plenty fast. We switched to Debezium only when our outbox table was getting polled 2,000 times per second across multiple service instances.

The "Exactly-Once" Myth

Let me save you some time: exactly-once delivery doesn't exist in distributed systems. What you actually get is at-least-once delivery with idempotent consumers. Kafka's "exactly-once semantics" (EOS) applies within the Kafka ecosystem — produce-transform-consume pipelines using Kafka Streams. The moment your consumer writes to an external database, you're back to at-least-once.

This matters for payment systems because a consumer that processes a payment.captured event twice will credit the merchant twice. The fix is the same pattern I covered in my idempotent API article: every consumer maintains a processed-events table and checks it before doing any work.

func (c *LedgerConsumer) Handle(ctx context.Context, msg Message) error {
    // Idempotency check — have we seen this event before?
    inserted, err := c.db.ExecContext(ctx,
        `INSERT INTO processed_events (event_id, consumer, processed_at)
         VALUES ($1, $2, now())
         ON CONFLICT (event_id, consumer) DO NOTHING`,
        msg.ID, "ledger-consumer",
    )
    if err != nil {
        return fmt.Errorf("idempotency check: %w", err)
    }

    rowsAffected, err := inserted.RowsAffected()
    if err != nil {
        return fmt.Errorf("rows affected: %w", err)
    }
    if rowsAffected == 0 {
        log.Info("duplicate event, skipping", "event_id", msg.ID)
        return nil // Already processed
    }

    // Safe to credit the merchant. (In production, run the processed_events
    // insert and the credit in one transaction, so a crash between the two
    // can't mark the event handled without applying it.)
    return c.creditMerchant(ctx, msg.Payload)
}

Don't rely on broker-level deduplication. Kafka's idempotent producer prevents duplicate publishes from the producer side, but it doesn't help with consumer-side duplicates caused by rebalances, offset commit failures, or consumer restarts. Your consumer must be idempotent regardless of what the broker promises.

Dead Letter Queues — When Processing Fails

Some messages can't be processed. The payload is malformed. The referenced payment doesn't exist yet (an ordering issue). A downstream service returns an unrecoverable error. You need a place for these messages to go that isn't "silently dropped" or "retried forever."

That place is the dead letter queue (DLQ). After a configurable number of retry attempts, failed messages get routed to a separate queue for manual inspection.

[Diagram: Dead Letter Queue flow. A consumed message is processed; on success, commit the offset / ACK. On failure, back off and retry while the retry count is below the max (3); once retries are exhausted, send the message to the DLQ.]

In RabbitMQ, DLQ setup is native — you configure an x-dead-letter-exchange on your queue, and rejected or expired messages route there automatically. With Kafka, you typically implement it in your consumer code: after N failed attempts, produce the message to a *.dlq topic.

We alert on DLQ depth in Grafana. Any message landing in the DLQ triggers a Slack notification to the payments channel. An engineer investigates, fixes the root cause (usually a schema mismatch or a missing database migration), and replays the messages. In 18 months, we've had 47 DLQ events total — all resolved within the same business day.

Ordering Guarantees and Partition Keys

Payment events for the same transaction must be processed in order. A payment.refunded event arriving before payment.captured will break your ledger. Kafka guarantees ordering within a partition, so the trick is making sure all events for the same payment land on the same partition.

We use the payment ID as the partition key. Kafka hashes it consistently, so payment.created, payment.captured, and payment.refunded for payment pay_abc123 always go to the same partition and are consumed in order.

// Partition key ensures ordering per payment
err := producer.Produce(ctx, &kafka.Message{
    Topic: "payment-events",
    Key:   []byte(payment.ID), // All events for this payment → same partition
    Value: eventBytes,
})

Watch your partition count. More partitions means more parallelism, but it also means more consumers needed to maintain ordering. We run 12 partitions for our payment events topic — enough for 12 concurrent consumers, which handles our throughput with headroom. Don't start with 100 partitions because a blog post told you to. Start with what you need and scale up.
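The property the producer relies on can be shown with a toy partitioner: hash the key, mod by the partition count. Kafka's default partitioner actually uses murmur2; fnv here is purely illustrative. What matters is determinism: same key, same partition, every time.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor sketches key-based partitioning: hash the key and take it
// modulo the partition count. (Kafka's default partitioner uses murmur2;
// fnv is used here only for illustration.)
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}

func main() {
	// Every event keyed by pay_abc123 lands on the same partition.
	p1 := partitionFor("pay_abc123", 12)
	p2 := partitionFor("pay_abc123", 12)
	fmt.Println(p1 == p2, p1 >= 0 && p1 < 12) // true true
}
```

This is also why changing the partition count reshuffles key-to-partition assignments: the modulus changes, so ordering guarantees only hold for events produced after the resize.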

Monitoring Your Queue Health

The three metrics that have saved us from production incidents more than anything else: consumer lag (is any consumer group falling behind the head of its partitions?), DLQ depth (anything non-zero gets a human, per the previous section), and outbox backlog (unpublished rows piling up mean the poller or the broker is unhealthy).
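Consumer lag is the first of those to wire up. It is just the gap between each partition's latest offset and the consumer group's committed offset; a toy sketch (in production, both numbers come from the broker's admin API or a metrics exporter):

```go
package main

import "fmt"

// consumerLag computes per-partition lag: how far a consumer group's
// committed offset trails the partition's latest (end) offset.
// Illustrative only; real offsets come from the broker.
func consumerLag(latest, committed map[int32]int64) map[int32]int64 {
	lag := make(map[int32]int64, len(latest))
	for partition, end := range latest {
		lag[partition] = end - committed[partition]
	}
	return lag
}

func main() {
	lag := consumerLag(
		map[int32]int64{0: 1500, 1: 2000}, // latest offsets
		map[int32]int64{0: 1500, 1: 1250}, // committed offsets
	)
	fmt.Println(lag[0], lag[1]) // 0 750
}
```

Alert on the trend, not just the absolute value: a lag that is steadily growing means consumers can no longer keep up with producers.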

The Patterns That Actually Matter

After running message queues in payment systems for several years, the patterns that matter aren't the exotic ones. They're boring and well-understood:

  1. Use the transactional outbox to guarantee event publication. Never write to the DB and broker separately.
  2. Make every consumer idempotent. At-least-once delivery is the reality — design for it.
  3. Use dead letter queues with alerting. Failed messages need a home and an owner.
  4. Partition by payment ID for ordering. Don't fight the broker's model.
  5. Use Kafka for event streams, RabbitMQ for task queues. Pick the right tool for the job.

None of this is groundbreaking. But after the 312-event incident, I stopped looking for clever solutions and started applying these patterns consistently. The result: zero lost payment events in 18 months, across three Kafka clusters and two RabbitMQ instances processing over 4 million events per day. Boring works.


Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity and may omit error handling, logging, and security measures — always review and adapt for your specific use case and security requirements. This is not financial or legal advice.