Why Payment Schemas Are Uniquely Dangerous to Migrate
Most web apps can tolerate a few seconds of downtime during a deploy. A blog platform, a project management tool — users refresh and move on. Payment systems don't get that luxury. A schema migration that locks a table for even five seconds can cascade into thousands of failed charges, broken webhooks, and angry merchants on the phone before you've finished your coffee.
The core problem is that payment tables are write-heavy and latency-sensitive simultaneously. Your transactions table is getting hammered with inserts from the payment gateway, updates from async settlement processors, and reads from merchant dashboards — all at the same time. A DDL statement that acquires an ACCESS EXCLUSIVE lock blocks every single one of those operations.
I've seen teams get away with naive migrations for months because their volume was low enough. Then they cross a threshold — maybe 500 transactions per second — and suddenly a migration that "always worked" starts causing timeouts. The pattern I'm going to walk through here scales from day one.
The Expand-Contract Pattern
The expand-contract pattern (sometimes called "parallel change") is the single most important technique for zero-downtime migrations. The idea is simple: never remove or rename anything in place. Instead, you expand the schema first, migrate the data, then contract by removing the old structure.
Here's the flow visualized:
v1 in production
  → add new column/table (expand)
  → backfill + dual-write (migrate)
  → drop old column/table (contract)
The key insight is that each step is independently deployable and reversible. If the backfill goes wrong, you still have the old column. If the new code has a bug, you roll back the code without touching the schema. You never reach a state where rolling back requires a schema change.
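The phases above can be modeled in migration tooling so that the irreversible step can never run by accident. A minimal sketch (the names here are illustrative, not from any particular tool):

```go
package main

import "fmt"

// Phase models the three expand-contract stages described above.
type Phase int

const (
	Expand   Phase = iota // add new structure; old structure untouched
	Migrate               // backfill + dual-write; both structures live
	Contract              // drop old structure; the point of no return
)

// Reversible reports whether rolling back this phase requires no
// schema change: expand and migrate can be undone by redeploying old
// code, but contract destroys data.
func Reversible(p Phase) bool {
	return p != Contract
}

func main() {
	for _, p := range []Phase{Expand, Migrate, Contract} {
		fmt.Printf("phase %d reversible: %v\n", p, Reversible(p))
	}
}
```

A tool built around this check could, for example, require an explicit `--confirm-contract` flag before running any non-reversible phase.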
A concrete example: renaming a column
Say you need to rename payment_method to instrument_type on a table processing 2,000 writes per second. Here's how it breaks down across three deploys:
```sql
-- Deploy 1: EXPAND — add the new column (non-blocking)
ALTER TABLE payment_transactions
  ADD COLUMN instrument_type VARCHAR(30);

-- Create a trigger to dual-write during the transition. It syncs in
-- both directions so that neither old code (which writes
-- payment_method) nor newly switched code (which writes
-- instrument_type) leaves the other column NULL.
CREATE OR REPLACE FUNCTION sync_instrument_type()
RETURNS TRIGGER AS $$
BEGIN
  IF NEW.instrument_type IS NULL THEN
    NEW.instrument_type := NEW.payment_method;
  ELSIF NEW.payment_method IS NULL THEN
    NEW.payment_method := NEW.instrument_type;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_instrument_type
  BEFORE INSERT OR UPDATE ON payment_transactions
  FOR EACH ROW EXECUTE FUNCTION sync_instrument_type();
```
```sql
-- Deploy 2: MIGRATE — backfill existing rows in batches
DO $$
DECLARE
  batch_size INT := 5000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE payment_transactions
    SET instrument_type = payment_method
    WHERE id IN (
      SELECT id FROM payment_transactions
      WHERE instrument_type IS NULL
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    RAISE NOTICE 'Updated % rows', rows_updated;
    COMMIT; -- commit per batch (PostgreSQL 11+), so the backfill isn't one giant WAL-bloating transaction
    PERFORM pg_sleep(0.1); -- throttle to avoid overwhelming replication
  END LOOP;
END $$;
```
```sql
-- Deploy 3: CONTRACT — after code fully switched to new column
DROP TRIGGER trg_sync_instrument_type ON payment_transactions;
DROP FUNCTION sync_instrument_type();
ALTER TABLE payment_transactions DROP COLUMN payment_method;
```
Critical: Never combine expand and contract in the same deploy. I've seen teams try to "save time" by adding the new column and dropping the old one in a single migration. That's a recipe for data loss if the deploy rolls back halfway through. Separate deploys, separate PRs, separate review cycles. The extra overhead is worth it.
Migration Strategy Comparison
There are several approaches to zero-downtime migrations. Which one you pick depends on your table size, write volume, and how much operational complexity you're willing to absorb.

| Strategy | Best for | Trade-offs |
| --- | --- | --- |
| Expand-contract | Structural changes: renames, type changes, table splits | Multiple deploys; temporary dual-write complexity |
| Online DDL | Additive changes: new nullable columns, `CREATE INDEX CONCURRENTLY` | Limited to operations the database can do without a table rewrite |
| Ghost table tools (gh-ost, pg_repack) | Truly invasive changes, e.g. changing a primary key type | Heaviest operational complexity; copies the whole table |

For most payment systems, I default to expand-contract for structural changes (renames, type changes, table splits) and online DDL for simple additive changes (new nullable columns, new indexes). Ghost table tools are the escape hatch for when you need to do something truly invasive like changing a primary key type.
Backfill Strategies That Won't Kill Your Database
The backfill is where most teams get burned. You've added the new column, your trigger is dual-writing new rows, and now you need to fill in the 200 million existing rows. The naive approach — a single `UPDATE ... SET new_col = old_col` — will generate a transaction so large it bloats your WAL, spikes replication lag, and probably gets killed by your statement timeout.
Batch with throttling
The pattern I keep coming back to is batched updates with explicit throttling. Here's a Go implementation I've used in production:
```go
package migrate

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"
)

func backfillInstrumentType(ctx context.Context, db *sql.DB) error {
	const batchSize = 5000
	for {
		res, err := db.ExecContext(ctx, `
			UPDATE payment_transactions
			SET instrument_type = payment_method
			WHERE id IN (
				SELECT id FROM payment_transactions
				WHERE instrument_type IS NULL
				LIMIT $1
				FOR UPDATE SKIP LOCKED
			)`, batchSize)
		if err != nil {
			return fmt.Errorf("backfill batch failed: %w", err)
		}
		n, _ := res.RowsAffected()
		if n == 0 {
			return nil // done
		}
		log.Printf("backfilled %d rows", n)
		time.Sleep(100 * time.Millisecond) // let replicas catch up
	}
}
```
The `FOR UPDATE SKIP LOCKED` is critical. It means the backfill won't block on rows that are currently locked by live traffic — it just skips them and picks them up on the next pass. Without it, the backfill and your live writes would block each other, and could even deadlock, under load.
Warning: Monitor replication lag during backfills. If your read replicas fall behind, merchant dashboards start showing stale data and webhook deliveries can duplicate. I set an automatic pause when lag exceeds 5 seconds — the backfill sleeps until replicas catch up, then resumes.
The Migration Timeline
Here's what a real expand-contract migration looks like across a week. Each step is a separate deploy with its own rollback plan.

- Day 1 — Deploy 1 (expand): add the nullable column and the dual-write trigger.
- Day 2 — Deploy 2 (migrate): run the throttled backfill, watching replication lag.
- Day 3 — Validate: confirm `instrument_type` is fully populated; add the constraint as `NOT VALID`, then `VALIDATE` separately.
- Day 4 — Code switch: flip the feature flag so reads use the new column.
- Days 4–6 — Soak: let the new code path run stably for at least 48 hours.
- Day 7 — Deploy 3 (contract): drop the trigger, the function, and the old column.

Rollback Plans That Actually Work
Every migration plan I write has a rollback section, and I've learned the hard way that "restore from backup" is not a rollback plan. By the time you've restored a 2TB database, you've lost hours of transactions. Real rollback plans are about making each step independently reversible.
The rules I follow:
- Expand steps must be backward-compatible. Adding a nullable column doesn't break existing code. Adding a column with a `NOT NULL` constraint does. Always add nullable first, enforce constraints later.
- Keep the old code path alive until validation passes. Don't delete the code that reads from `payment_method` until you've confirmed `instrument_type` is fully populated and the new code path has been stable for at least 48 hours.
- Use feature flags for the code switch. A feature flag lets you revert reads from the new column to the old column in seconds, without a deploy. This is your fastest rollback mechanism.
- The contract step is the point of no return. Once you drop the old column, you can't undo it without a backup restore. Make sure your validation is thorough before you get here.
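The feature-flag read switch can be as small as this. A hedged sketch — the flag wiring, type names, and empty-string sentinel are all illustrative assumptions, not the article's code; real systems would use a flag service and `sql.NullString`:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// useNewColumn is the feature flag: flipping it back to false reverts
// reads to the old column in seconds, with no deploy.
var useNewColumn atomic.Bool

type txnRow struct {
	PaymentMethod  string // old column
	InstrumentType string // new column
}

// readInstrumentType is the dual-read path used during the transition.
func readInstrumentType(row txnRow) string {
	if useNewColumn.Load() && row.InstrumentType != "" {
		return row.InstrumentType
	}
	// Fall back to the old column while the flag is off, or for a
	// row the backfill hasn't reached yet.
	return row.PaymentMethod
}

func main() {
	row := txnRow{PaymentMethod: "card", InstrumentType: "card"}
	useNewColumn.Store(true)
	fmt.Println(readInstrumentType(row)) // card
}
```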
```sql
-- Safe way to add a NOT NULL constraint on a large table
-- Step 1: Add constraint as NOT VALID (instant, no table scan)
ALTER TABLE payment_transactions
  ADD CONSTRAINT chk_instrument_type_not_null
  CHECK (instrument_type IS NOT NULL) NOT VALID;

-- Step 2: Validate separately (scans table but doesn't lock writes)
ALTER TABLE payment_transactions
  VALIDATE CONSTRAINT chk_instrument_type_not_null;
```
The NOT VALID trick is one of those PostgreSQL features that should be more widely known. Adding a constraint as NOT VALID is instant — it only enforces the constraint on new writes. The VALIDATE step scans existing rows but only takes a SHARE UPDATE EXCLUSIVE lock, which doesn't block normal reads or writes.
Lessons From Production Incidents
A few things I've learned that aren't in any textbook:
- Lock timeouts save lives. Set `lock_timeout = '3s'` before any DDL in production. If the migration can't acquire its lock within 3 seconds, it fails instead of queuing behind long-running queries and creating a pileup. You can always retry during a quieter moment.
- Test migrations against production-sized data. A migration that runs in 200ms on your dev database with 10,000 rows might take 45 minutes on production with 300 million. I maintain a staging environment with a recent production snapshot specifically for migration testing.
- Coordinate with your ORM. If you're using ActiveRecord, Ecto, or GORM, make sure the ORM's schema cache gets refreshed after a migration. I've seen stale schema caches cause `INSERT` failures because the ORM didn't know about a new column's default value.
- Watch out for long-running transactions. A single open transaction from a reporting query can block your DDL indefinitely. Before running a migration, check `pg_stat_activity` for transactions older than a few minutes and consider canceling them.
```sql
-- Always set lock timeout before DDL in production
SET lock_timeout = '3s';

-- Check for long-running transactions that might block DDL
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start < now() - INTERVAL '2 minutes'
ORDER BY xact_start;
References
- PostgreSQL Documentation — ALTER TABLE
- PostgreSQL Documentation — Explicit Locking
- PostgreSQL Documentation — Constraints (NOT VALID)
- gh-ost — GitHub's Online Schema Migration Tool for MySQL
- pg_repack — Reorganize Tables in PostgreSQL Without Locks
- Martin Fowler — Parallel Change (Expand-Contract Pattern)
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Code examples are simplified for clarity — always review and adapt for your specific use case, security requirements, and database version. Test all migrations thoroughly in a staging environment before applying to production.