PostgreSQL Vacuum and Bloat Management for High-Volume Payment Tables

Why Payment Tables Are Bloat Magnets

Most OLTP tables follow a simple pattern: insert a row, read it many times, occasionally update it. Payment tables are different. A single transaction row gets updated three to five times in its first hour of life: pending → authorized → captured → settled, with possible detours through failed, reversed, or refunded.

PostgreSQL's MVCC model means each of those updates doesn't modify the row in place — it creates a new version and marks the old one as dead. On a table processing 500K transactions per day, that's easily 2 million dead tuples generated daily just from status changes. If autovacuum can't keep up, those dead tuples pile up, the table grows, indexes bloat, and your query latency starts climbing.

Step 1

INSERT

Live tuple v1

Step 2

UPDATE

v1 → dead

v2 live

Step 3

DEAD TUPLES

Wasting space

Step 4

VACUUM

Space reclaimed

Each UPDATE creates a new tuple version. Old versions become dead tuples until VACUUM reclaims them.

The Default Autovacuum Settings Are Wrong for Payments

PostgreSQL's default autovacuum settings are designed for general-purpose workloads. The trigger formula is: autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * n_live_tuples. With defaults of threshold=50 and scale_factor=0.2, autovacuum kicks in when 20% of the table is dead tuples.

For a payments table with 10 million rows, that means autovacuum won't trigger until there are 2 million dead tuples. By that point, your table is already significantly bloated and queries are suffering.

10x

Table size growth from stuck vacuum

50ms → 800ms

Query latency degradation

2M+

Dead tuples per day on a mid-size payment table

Here's what we use for our hot payment tables versus the defaults:

Setting	PG Default	Payment Table
vacuum_scale_factor	0.2 (20%)	0.01 (1%)
vacuum_threshold	50	1000
vacuum_cost_delay	2ms	0ms
vacuum_cost_limit	200	2000
analyze_scale_factor	0.1 (10%)	0.005 (0.5%)

Apply these per-table, not globally. Your archive tables and lookup tables don't need aggressive vacuuming — it would just waste I/O:

ALTER TABLE payments SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 1000,
  autovacuum_vacuum_cost_delay = 0,
  autovacuum_vacuum_cost_limit = 2000,
  autovacuum_analyze_scale_factor = 0.005
);

-- Leave archive tables on defaults or even more relaxed
ALTER TABLE payment_archive SET (
  autovacuum_vacuum_scale_factor = 0.3,
  autovacuum_enabled = true
);

Setting cost_delay to 0 and cost_limit to 2000 means vacuum runs as fast as possible on this table. On modern SSDs, the I/O impact is negligible compared to the cost of letting bloat accumulate.

Monitoring: Know Before It Hurts

The single most useful query for catching bloat early:

SELECT
  schemaname || '.' || relname AS table_name,
  n_live_tup,
  n_dead_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 1) AS dead_pct,
  last_autovacuum,
  last_autoanalyze,
  autovacuum_count
FROM pg_stat_user_tables
WHERE relname LIKE 'payment%'
ORDER BY n_dead_tup DESC;

If dead_pct is consistently above 5% on your payment tables, autovacuum isn't keeping up. For more precise bloat estimation, install the pgstattuple extension:

SELECT * FROM pgstattuple('payments');
-- Look at dead_tuple_percent and free_space_percent

We run the pg_stat_user_tables query every 5 minutes via our monitoring stack and alert when dead tuple ratio exceeds 10% on any payment table. That's saved us from several slow-burn incidents.

The Long-Running Transaction Trap

Here's the thing about vacuum that catches people off guard: it can only clean up dead tuples that are no longer visible to any active transaction. If you have a connection sitting in idle in transaction state for 30 minutes, vacuum can't touch any tuple that was alive when that transaction started.

Production lesson: We had a reporting query that opened a transaction, ran a series of reads over 20 minutes, and never committed. During that window, autovacuum ran but couldn't reclaim anything. Dead tuples accumulated at full speed with zero cleanup. Set idle_in_transaction_session_timeout to something reasonable — we use 5 minutes.

-- Kill the silent killers
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();

-- Find current offenders
SELECT pid, state, age(clock_timestamp(), xact_start) AS tx_age, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < clock_timestamp() - interval '2 minutes'
ORDER BY xact_start;

The Replication Slot Incident

The worst bloat incident I've dealt with wasn't caused by bad autovacuum settings or long transactions. It was a stuck replication slot.

We had a logical replication subscriber that went down on a Friday evening. The replication slot on the primary kept holding its position, which meant PostgreSQL couldn't advance the xmin horizon. Vacuum ran on schedule but couldn't actually remove any dead tuples because the replication slot was telling Postgres "I still need those."

By Monday morning, our 12 GB payments table was 120 GB. Query latency had gone from 50ms to 800ms. The index bloat was even worse — our primary key index alone was 40 GB.

The fix was dropping the stuck replication slot, which immediately allowed vacuum to do its job. But the table didn't shrink — vacuum reclaims space for reuse, it doesn't return it to the OS. We had to run VACUUM FULL during a maintenance window, which locks the table exclusively. For a payments table, that means downtime.

-- Check for stuck replication slots
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       age(xmin) AS xmin_age
FROM pg_replication_slots
WHERE NOT active;

-- Monitor slot lag
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

After that incident, we added two things: an alert on inactive replication slots older than 1 hour, and a max_slot_wal_keep_size setting to prevent unbounded WAL retention. We also switched to pg_repack for online table rebuilds so we'd never need VACUUM FULL on a production payment table again.

The Checklist

If you're running PostgreSQL for payment data, here's the short version:

Set per-table autovacuum on your hot payment tables — scale_factor=0.01, cost_delay=0.
Monitor n_dead_tup from pg_stat_user_tables and alert on dead tuple ratios above 10%.
Set idle_in_transaction_session_timeout to prevent long-running transactions from blocking vacuum.
Alert on inactive replication slots — they're the silent killer of vacuum effectiveness.
Install pg_repack for when you need to reclaim space without locking the table.

Bloat management isn't glamorous work, but in payment systems it's the difference between consistent 50ms queries and a 3 AM page about degraded checkout performance.

References

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.

Why Payment Tables Are Bloat Magnets

The Default Autovacuum Settings Are Wrong for Payments

Monitoring: Know Before It Hurts

The Long-Running Transaction Trap

The Replication Slot Incident

The Checklist

References

Related Articles