April 11, 2026

Distributed Locking in Payment Systems — Why Getting It Wrong Costs Real Money

Race conditions in payment processing don't show up in staging. They show up at 2 AM on a Friday when your on-call engineer discovers 14,000 customers were double-charged. Here's what I've learned about getting distributed locks right.

The Problem: When Two Requests Walk Into a Balance Check

Every payment system eventually hits the same fundamental problem. A customer with $100 in their account submits two $80 payments at nearly the same instant — maybe they double-tapped a checkout button, maybe two different merchants initiated charges simultaneously. Both requests read the balance, both see $100, both approve the transaction. Now you're $60 in the hole.

This isn't a theoretical concern. It's the most common class of financial bug I've encountered in production, and it stems from a simple truth: reading a value and acting on it are two separate operations, and without a lock, anything can happen in between.
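The check-then-act gap can be made concrete with a deliberately serialized sketch (the `settle` helper and the amounts are illustrative, not from a real system): both requests read the balance before either one writes, so both approve.

```go
package main

import "fmt"

// settle simulates two $80 debits that were each approved against the same
// stale $100 read, mimicking the double-charge race described above. Both
// checks happen before either deduction, which is exactly what concurrent
// requests can do without a lock.
func settle(balance, amount int) int {
	read1, read2 := balance, balance // both requests read the same balance
	if read1 >= amount {
		balance -= amount // request 1 approves and deducts
	}
	if read2 >= amount { // request 2's check is now stale
		balance -= amount // ...but it deducts anyway
	}
	return balance
}

func main() {
	fmt.Println(settle(100, 80)) // the account ends up overdrawn at -60
}
```

Real concurrency is nondeterministic, but this interleaving is the one that costs money, and it is the one a per-user lock rules out.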

Database-level serializable isolation can prevent this in simple cases, but most payment systems span multiple services — a balance check in one service, a ledger write in another, a gateway call to a third party. You need coordination that works across process boundaries. That's where distributed locks come in.

Redis-Based Locking: SETNX vs. Redlock

The simplest distributed lock is a Redis SET key value NX EX ttl command. It's atomic — either you get the lock or you don't. For a single Redis instance, this works surprisingly well. The problem is what happens when that single instance goes down.

Single-Instance SETNX

With a single Redis node, you set a key with a TTL and a unique value (typically a UUID). To release, you check that the value still matches yours before deleting — this prevents accidentally releasing someone else's lock. It's fast, simple, and perfectly adequate when your Redis instance is reliable and you can tolerate the small window of unavailability during failover.

For many payment systems, this is actually the right choice. If your Redis is running on a managed service with automatic failover and your lock TTLs are short, the window of risk is small. Don't over-engineer it.

Redlock for Higher Guarantees

Redlock, proposed by Salvatore Sanfilippo, uses N independent Redis instances (typically 5). You acquire the lock on a majority (at least 3 of 5) and only consider it held if the total acquisition time is less than the lock TTL. This survives individual node failures but adds latency and operational complexity.

Martin Kleppmann's well-known critique of Redlock raises valid concerns — particularly around clock drift and GC pauses causing a client to believe it holds a lock when it doesn't. In payment systems, where correctness matters more than availability, you should pair Redlock with fencing tokens (more on that below).

PostgreSQL Advisory Locks

If your payment flow already goes through PostgreSQL, advisory locks are an underrated option. They're not tied to a table or row — you acquire a lock on an arbitrary 64-bit integer, and PostgreSQL handles the rest.

-- Acquire a lock keyed on user_id (blocking)
SELECT pg_advisory_lock(user_id);

-- Do your balance check and deduction
-- ...

-- Release
SELECT pg_advisory_unlock(user_id);

The beauty here is that these locks participate in PostgreSQL's deadlock detection, they're automatically released if the session disconnects, and you don't need any additional infrastructure. The downside is they're scoped to a single PostgreSQL cluster — if your services talk to different databases, this won't work.

For services that are already tightly coupled to a single Postgres instance, advisory locks eliminate an entire class of distributed systems problems. I've used them successfully for per-user payment serialization in systems processing tens of thousands of transactions per hour.
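One practical wrinkle: advisory locks key on a 64-bit integer, while payment identifiers are often strings or UUIDs. A common workaround, sketched here with an illustrative `advisoryKey` helper, is to hash the string into the int64 keyspace. A hash collision merely serializes two unrelated users against each other; it never loses mutual exclusion.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// advisoryKey maps an arbitrary string key (e.g. a user UUID) onto the
// int64 keyspace that pg_advisory_lock expects, using FNV-1a. The mapping
// is deterministic, so every service computes the same lock key for the
// same user.
func advisoryKey(s string) int64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return int64(h.Sum64())
}

func main() {
	// Deterministic: the same user always maps to the same lock key.
	fmt.Println(advisoryKey("user-42") == advisoryKey("user-42")) // true
}
```

You would then pass the result as a query parameter, e.g. `SELECT pg_advisory_xact_lock($1)`; the transaction-scoped variant releases automatically on commit or rollback, which removes the unlock call entirely.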

Locking Strategy Comparison

| Strategy | Consistency | Performance | Complexity | Best For |
|---|---|---|---|---|
| Redis SETNX | Good (single node) | Very fast (~1ms) | Low | Single-region, low-risk flows |
| Redlock | Strong (with fencing) | Moderate (~5-15ms) | High | Multi-node Redis, critical paths |
| PostgreSQL advisory | Strong (ACID-backed) | Fast (~2-5ms) | Low | DB-centric services, per-user locks |
| etcd lease | Very strong (Raft) | Moderate (~10-20ms) | Medium | Kubernetes-native, leader election |

Fencing Tokens: The Safety Net You Actually Need

Here's the scenario that keeps distributed systems engineers up at night: your service acquires a lock, then gets hit by a long GC pause or a network partition. The lock TTL expires. Another process acquires the lock and starts working. The original process wakes up, still believing it holds the lock, and writes stale data.

Fencing tokens solve this. Every time a lock is acquired, you generate a monotonically increasing token. When writing to your datastore, you include the token and reject any write with a token lower than the last one seen. Even if a stale lock holder tries to execute, the write is rejected.

In practice, I store the fencing token alongside the lock in Redis and pass it through the entire payment flow. The ledger service checks the token before committing any balance change. It's a small addition that prevents the worst class of distributed locking bugs.
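A minimal in-memory sketch of that fenced write path (the `Ledger` type here is illustrative; a real ledger would persist the last-seen token transactionally alongside the balance):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrStaleToken = errors.New("stale fencing token")

// Ledger is a toy fenced datastore: it remembers the highest fencing token
// it has seen and rejects writes carrying an older one.
type Ledger struct {
	mu        sync.Mutex
	balance   int64
	lastToken int64
}

// Apply commits a balance change only if token is strictly newer than any
// token seen so far. A lock holder that paused past its TTL will present an
// old token and be rejected here, no matter what it believes about the lock.
func (l *Ledger) Apply(token, delta int64) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if token <= l.lastToken {
		return ErrStaleToken
	}
	l.lastToken = token
	l.balance += delta
	return nil
}

func main() {
	led := &Ledger{balance: 100}
	fmt.Println(led.Apply(1, -80)) // token 1: accepted
	fmt.Println(led.Apply(2, -80)) // token 2: accepted
	fmt.Println(led.Apply(1, -80)) // token 1 again: rejected as stale
}
```

The key property is that the datastore, not the lock service, has the final say on whether a write lands.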

Lock Granularity: What Are You Actually Locking?

Choosing the right lock scope is as important as choosing the right lock implementation. Lock too coarsely (a single global payments lock) and you serialize unrelated traffic; lock too finely (one lock per transaction) and two concurrent requests against the same balance can still race.

In most systems I've worked on, per-user locking with a compound key like lock:payment:{user_id} strikes the right balance. It prevents the double-charge scenario without creating unnecessary contention.

Redis Lock Acquisition in Go

Here's a practical implementation with exponential backoff retry that I've used in production:

package payment

import (
    "context"
    "crypto/rand"
    "encoding/hex"
    "errors"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

var ErrLockNotAcquired = errors.New("failed to acquire lock")

type DistLock struct {
    client *redis.Client
    key    string
    value  string
    ttl    time.Duration
}

// Acquire attempts to acquire a distributed lock with retry.
// Returns a DistLock on success that must be released by the caller.
func Acquire(ctx context.Context, client *redis.Client, resource string,
    ttl time.Duration, maxRetries int) (*DistLock, error) {

    value := generateToken()
    key := fmt.Sprintf("lock:payment:%s", resource)

    backoff := 50 * time.Millisecond

    for attempt := 0; attempt <= maxRetries; attempt++ {
        ok, err := client.SetNX(ctx, key, value, ttl).Result()
        if err != nil {
            return nil, fmt.Errorf("redis error: %w", err)
        }
        if ok {
            return &DistLock{
                client: client,
                key:    key,
                value:  value,
                ttl:    ttl,
            }, nil
        }

        // Exponential backoff, capped below; adding jitter here would
        // further desynchronize retries under heavy contention.
        select {
        case <-ctx.Done():
            return nil, ctx.Err()
        case <-time.After(backoff):
            backoff = backoff * 2
            if backoff > 2*time.Second {
                backoff = 2 * time.Second
            }
        }
    }
    return nil, ErrLockNotAcquired
}

// Release removes the lock only if we still own it.
// Uses a Lua script for atomic check-and-delete.
var releaseScript = redis.NewScript(`
    if redis.call("GET", KEYS[1]) == ARGV[1] then
        return redis.call("DEL", KEYS[1])
    end
    return 0
`)

func (l *DistLock) Release(ctx context.Context) error {
    _, err := releaseScript.Run(ctx, l.client, []string{l.key}, l.value).Result()
    return err
}

func generateToken() string {
    b := make([]byte, 16)
    if _, err := rand.Read(b); err != nil {
        // A crypto/rand failure means the entropy source is broken;
        // failing loudly beats issuing a predictable lock token.
        panic(err)
    }
    return hex.EncodeToString(b)
}

A few things worth noting: the Lua script for release is critical, because without it you could delete another process's lock between the GET and the DEL. The capped exponential backoff limits retry pressure when many processes contend for the same lock (adding jitter would desynchronize those retries further). And the context propagation lets you tie lock acquisition to request timeouts and cancellation.

When Lock TTL Is Too Short: A Production Story

Production Incident — March 2024: We had a payment service with a 3-second lock TTL on per-user balance operations. Under normal load, the entire check-and-deduct flow took about 200ms. Then our payment gateway partner started experiencing elevated latency: responses that normally took 150ms were taking 2.5-4 seconds.

The lock would expire mid-flight, a second request would acquire it, and both would proceed to charge the customer. We discovered the issue when our reconciliation pipeline flagged 347 double-charges over a 90-minute window. Total customer impact: $28,400 in duplicate charges that had to be refunded.

The fix was two-fold: we increased the TTL to 15 seconds and added fencing tokens to the ledger write path. The longer TTL bought us headroom for gateway latency spikes, and the fencing tokens ensured that even if a lock expired, stale operations couldn't commit.

The lesson here isn't just "use longer TTLs." It's that your lock TTL must account for the worst-case latency of everything that happens while the lock is held — including external API calls. If your critical path includes a third-party gateway call, your TTL needs to be significantly longer than that gateway's P99 latency. And fencing tokens are your second line of defense when TTLs inevitably prove insufficient.

Key Takeaways

  1. Start simple. A single Redis SETNX with a reasonable TTL covers 90% of use cases. Don't reach for Redlock until you've outgrown single-instance Redis.
  2. Always use fencing tokens on the write path. Locks can and will expire at the worst possible time.
  3. Lock per-user, not per-transaction, for balance-affecting operations. Per-transaction locks stop the same request from being processed twice, but they can't stop two distinct requests from racing on the same balance.
  4. Set TTLs based on worst-case latency, not average latency. Include external API calls in your calculation.
  5. Consider PostgreSQL advisory locks if your architecture is already DB-centric. Less infrastructure, fewer failure modes.

Disclaimer

The opinions and strategies discussed in this article are based on personal experience and are provided for informational purposes only. Every payment system has unique requirements, compliance obligations, and risk profiles. Always conduct thorough testing and review with your team before implementing distributed locking changes in production financial systems.