Zero-Downtime Database Migrations for Payment Systems

Why Payment Schemas Are Uniquely Dangerous to Migrate

Most web apps can tolerate a few seconds of downtime during a deploy. A blog platform, a project management tool — users refresh and move on. Payment systems don't get that luxury. A schema migration that locks a table for even five seconds can cascade into thousands of failed charges, broken webhooks, and angry merchants on the phone before you've finished your coffee.

The core problem is that payment tables are write-heavy and latency-sensitive simultaneously. Your transactions table is getting hammered with inserts from the payment gateway, updates from async settlement processors, and reads from merchant dashboards — all at the same time. A DDL statement that acquires an ACCESS EXCLUSIVE lock blocks every single one of those operations.

I've seen teams get away with naive migrations for months because their volume was low enough. Then they cross a threshold — maybe 500 transactions per second — and suddenly a migration that "always worked" starts causing timeouts. The pattern I'm going to walk through here scales from day one.

The Expand-Contract Pattern

The expand-contract pattern (sometimes called "parallel change") is the single most important technique for zero-downtime migrations. The idea is simple: never remove or rename anything in place. Instead, you expand the schema first, migrate the data, then contract by removing the old structure.

Here's the flow visualized:

Old Schema
v1 in production

Expand
Add new column/table

Migrate
Backfill + dual-write

Contract
Drop old column/table

Deploy 1: schema expand

Deploy 2: code switch + backfill

Deploy 3: drop old artifacts

The key insight is that each step is independently deployable and reversible. If the backfill goes wrong, you still have the old column. If the new code has a bug, you roll back the code without touching the schema. You never reach a state where rolling back requires a schema change.

A concrete example: renaming a column

Say you need to rename payment_method to instrument_type on a table processing 2,000 writes per second. Here's how it breaks down across three deploys:

-- Deploy 1: EXPAND — add the new column (non-blocking)
ALTER TABLE payment_transactions
    ADD COLUMN instrument_type VARCHAR(30);

-- Create a trigger to dual-write during transition
CREATE OR REPLACE FUNCTION sync_instrument_type()
RETURNS TRIGGER AS $$
BEGIN
    NEW.instrument_type := NEW.payment_method;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_instrument_type
    BEFORE INSERT OR UPDATE ON payment_transactions
    FOR EACH ROW EXECUTE FUNCTION sync_instrument_type();

-- Deploy 2: MIGRATE — backfill existing rows in batches
DO $$
DECLARE
    batch_size INT := 5000;
    rows_updated INT;
BEGIN
    LOOP
        UPDATE payment_transactions
        SET instrument_type = payment_method
        WHERE id IN (
            SELECT id FROM payment_transactions
            WHERE instrument_type IS NULL
            LIMIT batch_size
            FOR UPDATE SKIP LOCKED
        );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        PERFORM pg_sleep(0.1);  -- throttle to avoid overwhelming replication
        RAISE NOTICE 'Updated % rows', rows_updated;
    END LOOP;
END $$;

-- Deploy 3: CONTRACT — after code fully switched to new column
DROP TRIGGER trg_sync_instrument_type ON payment_transactions;
DROP FUNCTION sync_instrument_type();
ALTER TABLE payment_transactions DROP COLUMN payment_method;

Critical: Never combine expand and contract in the same deploy. I've seen teams try to "save time" by adding the new column and dropping the old one in a single migration. That's a recipe for data loss if the deploy rolls back halfway through. Separate deploys, separate PRs, separate review cycles. The extra overhead is worth it.

Migration Strategy Comparison

There are several approaches to zero-downtime migrations. Which one you pick depends on your table size, write volume, and how much operational complexity you're willing to absorb.

Strategy	How It Works	Pros	Cons
Online DDL ALTER TABLE ... ADD COLUMN	Native database support for non-blocking schema changes	Simple, no external tools, fast for additive changes	Adding defaults or NOT NULL on large tables still locks in older PostgreSQL; limited to simple operations
Ghost Tables gh-ost, pg_repack	Creates a shadow copy of the table, replays changes, then swaps	Handles any schema change; pausable and throttleable	Doubles disk usage during migration; swap requires brief lock; complex to operate
Expand-Contract Multi-deploy pattern	Add new structure, dual-write, backfill, switch reads, drop old	Fully reversible at every step; works for any change; no external tools	Slow — requires multiple deploys over days/weeks; more application code to manage
Blue-Green Tables View-based switching	App reads from a view; swap the view's underlying table after migration	Instant cutover via view redefinition; clean rollback	Views add query overhead; not all ORMs handle views well; write routing is tricky

For most payment systems, I default to expand-contract for structural changes (renames, type changes, table splits) and online DDL for simple additive changes (new nullable columns, new indexes). Ghost table tools are the escape hatch for when you need to do something truly invasive like changing a primary key type.

Backfill Strategies That Won't Kill Your Database

The backfill is where most teams get burned. You've added the new column, your trigger is dual-writing new rows, and now you need to fill in the 200 million existing rows. The naive approach — a single UPDATE ... SET new_col = old_col — will generate a transaction so large it bloats your WAL, overwhelms replication lag, and probably gets killed by your statement timeout.

Batch with throttling

The pattern I keep coming back to is batched updates with explicit throttling. Here's a Go implementation I've used in production:

func backfillInstrumentType(ctx context.Context, db *sql.DB) error {
    const batchSize = 5000
    for {
        res, err := db.ExecContext(ctx, `
            UPDATE payment_transactions
            SET instrument_type = payment_method
            WHERE id IN (
                SELECT id FROM payment_transactions
                WHERE instrument_type IS NULL
                LIMIT $1
                FOR UPDATE SKIP LOCKED
            )`, batchSize)
        if err != nil {
            return fmt.Errorf("backfill batch failed: %w", err)
        }
        n, _ := res.RowsAffected()
        if n == 0 {
            return nil // done
        }
        log.Printf("backfilled %d rows", n)
        time.Sleep(100 * time.Millisecond) // let replicas catch up
    }
}

The FOR UPDATE SKIP LOCKED is critical. It means the backfill won't block on rows that are currently being written to by live traffic — it just skips them and picks them up on the next pass. Without this, your backfill and your live writes will deadlock each other under load.

Warning: Monitor replication lag during backfills. If your read replicas fall behind, merchant dashboards start showing stale data and webhook deliveries can duplicate. I set an automatic pause when lag exceeds 5 seconds — the backfill sleeps until replicas catch up, then resumes.

The Migration Timeline

Here's what a real expand-contract migration looks like across a week. Each step is a separate deploy with its own rollback plan.

Day 1 — Schema Expand

Add new column (nullable, no default). Deploy dual-write trigger. All existing code continues working unchanged.

Rollback: drop column + trigger. Zero impact.

Day 2-3 — Backfill

Run batched backfill during low-traffic hours. Monitor replication lag and WAL size. Verify row counts match.

Rollback: stop backfill script. No schema change needed.

Day 4 — Code Switch

Deploy application code that reads from new column. Feature flag recommended — flip reads gradually (10% → 50% → 100%).

Rollback: flip feature flag back. Old column still has all data.

Day 5 — Validation

Run consistency checks: compare old vs new column values. Verify no NULL gaps. Monitor error rates for 24 hours.

Rollback: if mismatches found, revert code switch and investigate.

Day 7 — Contract

Drop trigger, drop old column. Add NOT NULL constraint if needed (use NOT VALID then VALIDATE separately).

Point of no return — ensure backups are current before this step.

Rollback Plans That Actually Work

Every migration plan I write has a rollback section, and I've learned the hard way that "restore from backup" is not a rollback plan. By the time you've restored a 2TB database, you've lost hours of transactions. Real rollback plans are about making each step independently reversible.

The rules I follow:

Expand steps must be backward-compatible. Adding a nullable column doesn't break existing code. Adding a column with a NOT NULL constraint does. Always add nullable first, enforce constraints later.
Keep the old code path alive until validation passes. Don't delete the code that reads from payment_method until you've confirmed instrument_type is fully populated and the new code path has been stable for at least 48 hours.
Use feature flags for the code switch. A feature flag lets you revert reads from new-column to old-column in seconds, without a deploy. This is your fastest rollback mechanism.
The contract step is the point of no return. Once you drop the old column, you can't undo it without a backup restore. Make sure your validation is thorough before you get here.

-- Safe way to add NOT NULL constraint on a large table
-- Step 1: Add constraint as NOT VALID (instant, no table scan)
ALTER TABLE payment_transactions
    ADD CONSTRAINT chk_instrument_type_not_null
    CHECK (instrument_type IS NOT NULL) NOT VALID;

-- Step 2: Validate separately (scans table but doesn't lock writes)
ALTER TABLE payment_transactions
    VALIDATE CONSTRAINT chk_instrument_type_not_null;

The NOT VALID trick is one of those PostgreSQL features that should be more widely known. Adding a constraint as NOT VALID is instant — it only enforces the constraint on new writes. The VALIDATE step scans existing rows but only takes a SHARE UPDATE EXCLUSIVE lock, which doesn't block normal reads or writes.

Lessons From Production Incidents

A few things I've learned that aren't in any textbook:

Lock timeouts save lives. Set lock_timeout = '3s' before any DDL in production. If the migration can't acquire its lock within 3 seconds, it fails instead of queuing behind long-running queries and creating a pileup. You can always retry during a quieter moment.
Test migrations against production-sized data. A migration that runs in 200ms on your dev database with 10,000 rows might take 45 minutes on production with 300 million. I maintain a staging environment with a recent production snapshot specifically for migration testing.
Coordinate with your ORM. If you're using ActiveRecord, Ecto, or GORM, make sure the ORM's schema cache gets refreshed after a migration. I've seen stale schema caches cause INSERT failures because the ORM didn't know about a new column's default value.
Watch out for long-running transactions. A single open transaction from a reporting query can block your DDL indefinitely. Before running a migration, check pg_stat_activity for transactions older than a few minutes and consider canceling them.

-- Always set lock timeout before DDL in production
SET lock_timeout = '3s';

-- Check for long-running transactions that might block DDL
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
    AND xact_start < now() - INTERVAL '2 minutes'
ORDER BY xact_start;

References

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always review and adapt for your specific use case, security requirements, and database version. Test all migrations thoroughly in a staging environment before applying to production.

Why Payment Schemas Are Uniquely Dangerous to Migrate

The Expand-Contract Pattern

A concrete example: renaming a column

Migration Strategy Comparison

Backfill Strategies That Won't Kill Your Database

Batch with throttling

The Migration Timeline

Rollback Plans That Actually Work

Lessons From Production Incidents

References

Related Articles