The Problem: Synchronous Checkout Is a Revenue Killer
Here's what our checkout controller looked like before the refactor. Every step ran inline, blocking the HTTP response:
def create
  charge = Stripe::Charge.create(amount: cart.total_cents, source: token)
  payment.update!(status: :captured, gateway_id: charge.id)
  LedgerEntry.create!(payment: payment, type: :debit)
  ReceiptMailer.send_receipt(payment).deliver_now
  WebhookDispatcher.notify(:payment_captured, payment)
  MerchantSettlement.queue_for_batch(payment)
  render json: { status: "success" }
end
On a good day, this took 3 seconds. On a bad day — when Stripe was slow or SendGrid was having issues — it ballooned to 8-12 seconds. Our analytics showed a 23% cart abandonment rate on the payment step. Users were clicking "Pay," waiting, assuming it was broken, and leaving.
The Fix: What Stays Synchronous, What Goes Async
The first mistake teams make is moving everything to background jobs. Don't. The charge itself must stay synchronous — the customer needs to know immediately if their card was declined. But everything after the successful charge can be async.
Here's the rule I follow: if the customer needs to see the result, it's synchronous. If it's bookkeeping, notifications, or downstream processing, it's a Sidekiq job.
def create
  # Synchronous — customer needs immediate feedback
  charge = Stripe::Charge.create(amount: cart.total_cents, source: token)
  payment.update!(status: :captured, gateway_id: charge.id)

  # Async — enqueue and respond immediately
  LedgerEntryWorker.perform_async(payment.id)
  ReceiptEmailWorker.perform_async(payment.id)
  WebhookDispatchWorker.perform_async(payment.id, "payment_captured")
  SettlementQueueWorker.perform_async(payment.id)

  render json: { status: "success", payment_id: payment.id }
end
Response time dropped from 4-8 seconds to under 800ms. Cart abandonment on the payment step fell from 23% to 6%.
Retry Strategies: Payment Jobs Are Not Normal Jobs
Sidekiq's default retry behavior — 25 retries with exponential backoff — is fine for sending emails. It's dangerous for payment operations. A ledger entry that gets retried 25 times could create 25 duplicate entries if your job isn't idempotent.
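To see why the default schedule is so dangerous here, it helps to add it up. Sidekiq's documented default delay is roughly (count ** 4) + 15 seconds plus random jitter; ignoring the jitter, a quick sketch:

```ruby
# Sidekiq's documented default retry delay, minus the random jitter:
#   delay(count) = (count ** 4) + 15 seconds
delays = (0...25).map { |count| (count ** 4) + 15 }

puts "first five delays: #{delays.first(5).inspect}"  # => [15, 16, 31, 96, 271]
puts "total wait: ~#{(delays.sum / 86_400.0).round(1)} days"
```

Twenty-plus days of automatic replays is fine for a welcome email. For a ledger write it means 25 separate chances to create a duplicate entry, spread out long after anyone remembers the original failure.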
class LedgerEntryWorker
  include Sidekiq::Worker

  sidekiq_options(
    queue: "payment_critical",
    retry: 5,              # Not 25 — fail fast, alert humans
    dead: true,            # Move to dead set after exhaustion
    lock: :until_executed, # Prevent duplicate execution (sidekiq-unique-jobs)
    on_conflict: :log      # Log duplicates instead of silently dropping
  )

  def perform(payment_id)
    payment = Payment.find(payment_id)

    # Idempotency guard — check before creating
    return if LedgerEntry.exists?(payment_id: payment.id, entry_type: :debit)

    LedgerEntry.create!(
      payment_id: payment.id,
      entry_type: :debit,
      amount_cents: payment.amount_cents,
      currency: payment.currency,
      idempotency_key: "ledger_debit_#{payment.id}"
    )
  end
end
The idempotency guard is non-negotiable. Even with sidekiq-unique-jobs, there are edge cases — Redis failovers, lock expiry during long-running jobs — where a job can execute twice. The database-level check is your last line of defense. For financial operations, belt and suspenders isn't paranoia; it's engineering.
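You can push that last line of defense into the database itself with a unique index on the idempotency key. A sketch, assuming a Rails migration and the column names used above (the migration and index names are illustrative):

```ruby
class AddIdempotencyIndexToLedgerEntries < ActiveRecord::Migration[7.0]
  def change
    # The database now rejects a second debit row for the same key,
    # even if two workers race past the `exists?` guard at the same time.
    add_index :ledger_entries, :idempotency_key,
              unique: true,
              name: "index_ledger_entries_on_idempotency_key"
  end
end
```

With the index in place, the worker can rescue `ActiveRecord::RecordNotUnique` around the `create!` and treat the collision as "already done" rather than a failure worth retrying.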
Custom Retry Logic for Gateway Errors
Not all errors deserve the same retry treatment. A Stripe::RateLimitError should retry quickly. A Stripe::InvalidRequestError should never retry — the request is malformed and will fail forever. We use Sidekiq's sidekiq_retry_in hook:
class WebhookDispatchWorker
  include Sidekiq::Worker

  sidekiq_options queue: "payment_critical", retry: 8

  sidekiq_retry_in do |count, exception|
    case exception
    when Net::OpenTimeout, Net::ReadTimeout
      (count ** 2) + 15 # Aggressive backoff for timeouts
    when Faraday::ConnectionFailed
      (count ** 3) + 60 # Even slower for connection failures
    else
      :kill             # Unknown errors go straight to dead set
    end
  end

  def perform(payment_id, event_type)
    payment = Payment.find(payment_id)
    MerchantWebhook.dispatch(
      merchant: payment.merchant,
      event: event_type,
      payload: PaymentSerializer.new(payment).as_json,
      idempotency_key: "webhook_#{payment.id}_#{event_type}"
    )
  end
end
Dead Letter Queues: When Retries Run Out
When a payment job exhausts its retries, it lands in Sidekiq's dead set. For most apps, that's fine — someone checks the dashboard eventually. For payment systems, "eventually" isn't good enough. We hook into Sidekiq's death handler to trigger immediate alerts:
Sidekiq.configure_server do |config|
  config.death_handlers << ->(job, exception) {
    if job["queue"] == "payment_critical"
      PaymentAlerts.critical(
        worker: job["class"],
        args: job["args"],
        error: exception.message,
        retry_count: job["retry_count"]
      )

      # Also track in our payment ops dashboard
      PaymentMetrics.increment("dead_letter.payment_critical",
        tags: ["worker:#{job['class']}"]
      )
    end
  }
end
Redis persistence is critical for payment jobs. By default, Redis uses RDB snapshots, which means you can lose the last few minutes of data on a crash. For payment-critical queues, enable AOF (Append Only File) persistence with appendfsync everysec at minimum. We lost 340 jobs during a Redis restart before we learned this. Those were 340 receipts that never sent and 340 ledger entries that were missing until our daily reconciliation caught them.
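The relevant redis.conf lines for the payment instance look something like this (a sketch; tune appendfsync against your durability budget, since `always` is safer but slower):

```conf
# redis.conf for the payment-critical Sidekiq instance
appendonly yes            # enable AOF persistence
appendfsync everysec      # at most ~1 second of writes at risk on a crash
aof-use-rdb-preamble yes  # RDB preamble for fast restarts, AOF tail for recent writes
```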
Queue Priority and Isolation
Don't mix payment jobs with your regular application jobs. A spike in report-generation jobs shouldn't delay payment processing. We run separate Sidekiq processes with dedicated queues:
# config/sidekiq_payment.yml
:concurrency: 10
:queues:
  - [payment_critical, 10]
  - [payment_standard, 5]

# config/sidekiq_default.yml
:concurrency: 25
:queues:
  - [default, 5]
  - [mailers, 3]
  - [reports, 1]
The payment_critical queue handles ledger entries and settlement batching — things that affect financial accuracy. The payment_standard queue handles receipts and webhook notifications — important but not financially critical. Each gets its own Sidekiq process so they can't starve each other.
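Running the two configs as separate processes is just two Sidekiq invocations pointed at different config files. In a Procfile it might look like this (the process names are illustrative):

```procfile
sidekiq_payment: bundle exec sidekiq -C config/sidekiq_payment.yml
sidekiq_default: bundle exec sidekiq -C config/sidekiq_default.yml
```

Under systemd or Kubernetes the same idea applies: one unit or deployment per config, so you can scale and restart payment workers independently of everything else.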
Monitoring: You Can't Fix What You Can't See
Sidekiq Pro's metrics are decent, but for payment jobs we needed more. We track three things obsessively:
- Job latency by queue. How long jobs sit in the queue before a worker picks them up. If payment_critical latency exceeds 5 seconds, something is wrong — either workers are overloaded or Redis is struggling.
- Retry rate by worker class. A sudden spike in retries for LedgerEntryWorker means the database might be under pressure. A spike in WebhookDispatchWorker retries means a merchant's endpoint is down.
- Dead set growth. Any job hitting the dead set in payment_critical triggers a PagerDuty alert. Zero tolerance. These are financial operations that need human attention.
Lessons from 18 Months in Production
- Always pass IDs, not objects. perform_async(payment.id), never perform_async(payment). Sidekiq serializes arguments to JSON. If you pass an object, you get a stale snapshot. The job should always fetch the latest state from the database.
- Test your idempotency. Run every payment worker twice with the same arguments in your test suite. If the second run creates duplicate records or sends duplicate emails, your idempotency guard is broken.
- Use transactions carefully. If your job wraps multiple database writes in a transaction but the job fails after the transaction commits, the retry will try to redo work that's already done. Design each step to be independently idempotent.
- Monitor Redis memory. Payment jobs can pile up fast during an outage. We set a maxmemory-policy of noeviction on our payment Redis instance — we'd rather have Sidekiq raise an error than silently drop jobs.
The migration path: Don't try to move everything at once. We migrated one job type per week — receipts first (lowest risk), then webhooks, then ledger entries (highest risk). Each migration got its own PR, its own monitoring dashboard, and a week of observation before moving to the next.