April 11, 2026

Ruby Sidekiq Background Jobs for Payment Processing — How We Stopped Blocking the Checkout Flow

Our checkout endpoint was taking 4-8 seconds because we were doing everything synchronously — charge capture, receipt emails, ledger updates, webhook notifications. Users were abandoning carts. Moving to Sidekiq recovered $18K/month in lost revenue, but the migration had its own set of traps.

The Problem: Synchronous Checkout Is a Revenue Killer

Here's what our checkout controller looked like before the refactor. Every step ran inline, blocking the HTTP response:

def create
  charge = Stripe::Charge.create(amount: cart.total_cents, currency: "usd", source: token)
  payment.update!(status: :captured, gateway_id: charge.id)
  LedgerEntry.create!(payment: payment, entry_type: :debit)
  ReceiptMailer.send_receipt(payment).deliver_now
  WebhookDispatcher.notify(:payment_captured, payment)
  MerchantSettlement.queue_for_batch(payment)
  render json: { status: "success" }
end

On a good day, this took 3 seconds. On a bad day — when Stripe was slow or SendGrid was having issues — it ballooned to 8-12 seconds. Our analytics showed a 23% cart abandonment rate on the payment step. Users were clicking "Pay," waiting, assuming it was broken, and leaving.

[Diagram: checkout flow before and after]
Before (synchronous): Request → Charge → Ledger → Email → Webhook → Response (4-8 sec)
After (async with Sidekiq): Request → Charge → Enqueue Jobs → Response (~800ms)

The Fix: What Stays Synchronous, What Goes Async

The first mistake teams make is moving everything to background jobs. Don't. The charge itself must stay synchronous — the customer needs to know immediately if their card was declined. But everything after the successful charge can be async.

Here's the rule I follow: if the customer needs to see the result, it's synchronous. If it's bookkeeping, notifications, or downstream processing, it's a Sidekiq job.

def create
  # Synchronous — customer needs immediate feedback
  charge = Stripe::Charge.create(amount: cart.total_cents, currency: "usd", source: token)
  payment.update!(status: :captured, gateway_id: charge.id)

  # Async — enqueue and respond immediately
  LedgerEntryWorker.perform_async(payment.id)
  ReceiptEmailWorker.perform_async(payment.id)
  WebhookDispatchWorker.perform_async(payment.id, "payment_captured")
  SettlementQueueWorker.perform_async(payment.id)

  render json: { status: "success", payment_id: payment.id }
end

Response time dropped from 4-8 seconds to under 800ms. Cart abandonment on the payment step fell from 23% to 6%.

Metric                             Synchronous              Sidekiq Async
Checkout response time             4-8 seconds              ~800ms
Cart abandonment (payment step)    23%                      6%
Failure blast radius               Entire checkout fails    Only affected job retries
SendGrid outage impact             Checkout blocked         Receipts delayed, checkout fine
Monthly recovered revenue          (baseline)               +$18K

Retry Strategies: Payment Jobs Are Not Normal Jobs

Sidekiq's default retry behavior — 25 retries with exponential backoff — is fine for sending emails. It's dangerous for payment operations. A ledger job that writes its entry but fails before finishing will be retried, and a non-idempotent job can write a duplicate entry on every retry.

class LedgerEntryWorker
  include Sidekiq::Worker

  sidekiq_options(
    queue: "payment_critical",
    retry: 5,                    # Not 25 — fail fast, alert humans
    dead: true,                  # Move to dead set after exhaustion
    lock: :until_executed,       # Prevent duplicate execution (sidekiq-unique-jobs)
    on_conflict: :log            # Log duplicates instead of silently dropping
  )

  def perform(payment_id)
    payment = Payment.find(payment_id)

    # Idempotency guard — check before creating
    return if LedgerEntry.exists?(payment_id: payment.id, entry_type: :debit)

    LedgerEntry.create!(
      payment_id: payment.id,
      entry_type: :debit,
      amount_cents: payment.amount_cents,
      currency: payment.currency,
      idempotency_key: "ledger_debit_#{payment.id}"
    )
  end
end

The idempotency guard is non-negotiable. Even with sidekiq-unique-jobs, there are edge cases — Redis failovers, lock expiry during long-running jobs — where a job can execute twice. The database-level check is your last line of defense. For financial operations, belt and suspenders isn't paranoia, it's engineering.
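That database-level defense can be enforced by the schema itself, not just an application check. A sketch of a unique index on the idempotency_key column used by the worker above (the migration name and Rails version are illustrative):

```ruby
# Illustrative migration — enforces one ledger entry per idempotency key,
# so a job that slips past the application-level guard raises instead of
# silently double-booking.
class AddUniqueIndexToLedgerEntries < ActiveRecord::Migration[7.1]
  def change
    add_index :ledger_entries, :idempotency_key, unique: true
  end
end
```

With the index in place, a duplicate execution raises ActiveRecord::RecordNotUnique, which the worker can rescue and treat as "already done."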

Custom Retry Logic for Gateway Errors

Not all errors deserve the same retry treatment. A Stripe::RateLimitError should retry quickly. A Stripe::InvalidRequestError should never retry — the request is malformed and will fail forever. The same thinking applies to merchant webhook delivery: timeouts are usually transient, while connection failures often mean the endpoint is down for a while. We use Sidekiq's sidekiq_retry_in hook:

class WebhookDispatchWorker
  include Sidekiq::Worker
  sidekiq_options queue: "payment_critical", retry: 8

  sidekiq_retry_in do |count, exception|
    case exception
    when Net::OpenTimeout, Net::ReadTimeout
      (count ** 2) + 15  # Aggressive backoff for timeouts
    when Faraday::ConnectionFailed
      (count ** 3) + 60  # Even slower for connection failures
    else
      :kill  # Unknown errors go straight to dead set
    end
  end

  def perform(payment_id, event_type)
    payment = Payment.find(payment_id)
    MerchantWebhook.dispatch(
      merchant: payment.merchant,
      event: event_type,
      payload: PaymentSerializer.new(payment).as_json,
      idempotency_key: "webhook_#{payment.id}_#{event_type}"
    )
  end
end
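The same idea extends to workers that call the gateway directly. Pulling the classification into a plain function keeps it unit-testable; the Stripe error class names below are real, but the delay values and the `gateway_retry_in` helper are our own sketch, not Sidekiq or Stripe API:

```ruby
# Retry-delay policy for gateway errors, keyed by exception class name.
# Delays are illustrative policy choices, not library defaults.
GATEWAY_RETRY_POLICY = {
  "Stripe::RateLimitError"      => ->(count) { 10 * (count + 1) },  # retry soon
  "Stripe::APIConnectionError"  => ->(count) { (count ** 2) + 30 }, # slower backoff
  "Stripe::InvalidRequestError" => ->(_)     { :kill }              # never retry
}.freeze

# Returns a delay in seconds, :kill, or nil (nil falls back to
# Sidekiq's default retry schedule).
def gateway_retry_in(count, exception_class_name)
  policy = GATEWAY_RETRY_POLICY[exception_class_name]
  policy ? policy.call(count) : nil
end

# Wired into a worker via the same hook as above:
#   sidekiq_retry_in { |count, ex| gateway_retry_in(count, ex.class.name) }
```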

Dead Letter Queues: When Retries Run Out

When a payment job exhausts its retries, it lands in Sidekiq's dead set. For most apps, that's fine — someone checks the dashboard eventually. For payment systems, "eventually" isn't good enough. We hook into Sidekiq's death handler to trigger immediate alerts:

Sidekiq.configure_server do |config|
  config.death_handlers << ->(job, exception) {
    if job["queue"] == "payment_critical"
      PaymentAlerts.critical(
        worker: job["class"],
        args: job["args"],
        error: exception.message,
        retry_count: job["retry_count"]
      )
      # Also track in our payment ops dashboard
      PaymentMetrics.increment("dead_letter.payment_critical",
        tags: ["worker:#{job['class']}"]
      )
    end
  }
end

Redis persistence is critical for payment jobs. By default, Redis uses RDB snapshots, which means you can lose the last few minutes of data on a crash. For payment-critical queues, enable AOF (Append Only File) persistence with appendfsync everysec at minimum. We lost 340 jobs during a Redis restart before we learned this. Those were 340 receipts that never sent and 340 ledger entries that were missing until our daily reconciliation caught them.
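The relevant redis.conf lines for that setup (everysec bounds the loss window to roughly one second of writes; appendfsync always is safer still, at a throughput cost):

```
# redis.conf for the payment-critical Redis instance
appendonly yes          # enable AOF persistence
appendfsync everysec    # fsync the AOF once per second
```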

Queue Priority and Isolation

Don't mix payment jobs with your regular application jobs. A spike in report-generation jobs shouldn't delay payment processing. We run separate Sidekiq processes with dedicated queues:

# config/sidekiq_payment.yml
:concurrency: 10
:queues:
  - [payment_critical, 10]
  - [payment_standard, 5]

# config/sidekiq_default.yml
:concurrency: 25
:queues:
  - [default, 5]
  - [mailers, 3]
  - [reports, 1]

The payment_critical queue handles ledger entries and settlement batching — things that affect financial accuracy. The payment_standard queue handles receipts and webhook notifications — important but not financially critical. Each gets its own Sidekiq process so they can't starve each other.
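How those two processes get launched depends on your deployment. With Procfile-style process management it might look like this (process names are illustrative; -C is Sidekiq's config-file flag):

```
web: bundle exec puma
sidekiq_payment: bundle exec sidekiq -C config/sidekiq_payment.yml
sidekiq_default: bundle exec sidekiq -C config/sidekiq_default.yml
```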

Results after the migration: ~800ms average checkout response, 99.7% job success rate, and $18K/month in recovered revenue.

Monitoring: You Can't Fix What You Can't See

Sidekiq Pro's metrics are decent, but for payment jobs we needed more. We track three things obsessively:

  1. Job latency by queue. How long jobs sit in the queue before a worker picks them up. If payment_critical latency exceeds 5 seconds, something is wrong — either workers are overloaded or Redis is struggling.
  2. Retry rate by worker class. A sudden spike in retries for LedgerEntryWorker means the database might be under pressure. A spike in WebhookDispatchWorker retries means a merchant's endpoint is down.
  3. Dead set growth. Any job hitting the dead set in payment_critical triggers a PagerDuty alert. Zero tolerance. These are financial operations that need human attention.
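The queue-latency check (point 1) can be a small pure function fed by Sidekiq's own API — Sidekiq::Queue#latency returns the seconds the oldest job has been waiting. The thresholds below match the ones described above, but treat the wiring as a sketch:

```ruby
# Per-queue latency thresholds (seconds) before we page someone.
LATENCY_THRESHOLDS = {
  "payment_critical" => 5,
  "payment_standard" => 30
}.freeze

# Pure decision logic: true only when a known queue is over its threshold.
def latency_breached?(queue_name, latency_seconds)
  threshold = LATENCY_THRESHOLDS[queue_name]
  !!(threshold && latency_seconds > threshold)
end

# In production this runs on a periodic check, fed by Sidekiq's API:
#   latency = Sidekiq::Queue.new("payment_critical").latency
#   alert_ops! if latency_breached?("payment_critical", latency)
```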

Lessons from 18 Months in Production

The migration path: Don't try to move everything at once. We migrated one job type per week — receipts first (lowest risk), then webhooks, then ledger entries (highest risk). Each migration got its own PR, its own monitoring dashboard, and a week of observation before moving to the next.

Disclaimer: This article reflects the author's personal experience and opinions. Code examples are simplified for clarity and may not represent production-ready implementations. Product names, logos, and brands are property of their respective owners. Always verify with official documentation.