The Problem: Reporting vs. Transactions
Payment databases have a split personality. On one side, you have OLTP workloads — fast, small writes for every authorization, capture, and refund. On the other, you have reporting queries — massive aggregations across millions of rows for settlement reports, merchant dashboards, and compliance exports.
These two workloads fight for the same resources. A reporting query that scans 30 days of transactions pins a long-lived snapshot, consumes I/O bandwidth, and pushes hot data out of the buffer cache. Meanwhile, the authorization service is trying to respond in under 200ms.
We hit this wall at around 50,000 transactions per day. The merchant dashboard would run a settlement summary query, and our Datadog alerts would light up with latency spikes on the authorization endpoint. The two workloads were sharing a single PostgreSQL 14 instance on an RDS db.r6g.xlarge.
PostgreSQL Streaming Replication — How It Works
PostgreSQL's streaming replication sends WAL (Write-Ahead Log) records from the primary to one or more standby servers in near real-time. The standby replays these records to maintain an almost-identical copy of the data.
The key configuration on the primary:
# postgresql.conf on primary
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
# pg_hba.conf — allow replication connections
host replication replicator 10.0.0.0/16 scram-sha-256
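The replicator role referenced in pg_hba.conf also has to exist on the primary with the REPLICATION attribute. A minimal sketch (the role name matches the pg_hba.conf entry above; the password is a placeholder):

```sql
-- On the primary: create a dedicated replication role
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';
```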
On AWS RDS or Aurora, you skip most of this — creating a read replica is a few clicks or one Terraform resource. But understanding the underlying mechanism matters when things go wrong.
The Routing Layer — Deciding Where Queries Go
The hardest part isn't setting up replication. It's deciding which queries go to the primary and which go to replicas. Get this wrong and you'll serve stale data to merchants or, worse, route writes to a read-only replica.
We built a simple routing layer in Go:
type DBRouter struct {
    primary  *sql.DB
    replicas []*sql.DB
    mu       sync.Mutex
    next     int
}

func (r *DBRouter) Primary() *sql.DB {
    return r.primary
}

// Replica returns the next replica in round-robin order.
func (r *DBRouter) Replica() *sql.DB {
    r.mu.Lock()
    defer r.mu.Unlock()
    db := r.replicas[r.next]
    r.next = (r.next + 1) % len(r.replicas)
    return db
}

// Usage in handlers:
func (h *Handler) GetSettlementReport(w http.ResponseWriter, r *http.Request) {
    db := h.router.Replica() // reporting → replica
    rows, err := db.QueryContext(r.Context(), settlementQuery, merchantID, startDate, endDate)
    // ...
}

func (h *Handler) ProcessCharge(w http.ResponseWriter, r *http.Request) {
    db := h.router.Primary() // writes → primary
    _, err := db.ExecContext(r.Context(), insertCharge, chargeID, amount, currency)
    // ...
}
Key rule: Any query that follows a write in the same user flow must go to the primary. If a merchant creates a refund and immediately views the transaction detail, that read must hit the primary — otherwise they'll see the pre-refund state and think it failed.
Replication Lag — The Silent Killer
Replication lag is the time between a write on the primary and that write becoming visible on the replica. Under normal conditions, PostgreSQL streaming replication keeps this under 100ms. But "normal conditions" is doing a lot of heavy lifting.
Things that spike replication lag:
- Long-running queries on the replica. If a reporting query holds a snapshot for 30 seconds, the replica can't apply WAL records that conflict with that snapshot. Lag accumulates.
- Heavy write bursts on the primary. End-of-day settlement batches that insert thousands of rows can overwhelm the replica's apply rate.
- Replica resource contention. If the replica is undersized or its disk I/O is saturated by reporting queries, WAL replay falls behind.
- Network issues. Cross-AZ replication adds latency. Cross-region replication adds a lot more.
We monitor lag with a simple query on the replica:
-- Run on replica: check replication lag
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Or from the primary: check all replicas
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS byte_lag
FROM pg_stat_replication;
The outage: We once had a replica fall 45 minutes behind during a batch migration. The merchant dashboard was showing settlement totals from the previous day. Three merchants called support thinking they'd lost money. We now alert when lag exceeds 5 seconds and automatically fail reporting over to the primary when lag exceeds 30 seconds.
| Lag Threshold | Action | Severity |
|---|---|---|
| < 1 second | Normal operation | Healthy |
| 1–5 seconds | Log warning, investigate | Warning |
| 5–30 seconds | Page on-call, show banner in dashboard | High |
| > 30 seconds | Failover reporting to primary | Critical |
Handling hot_standby_feedback and Query Conflicts
Here's a PostgreSQL-specific gotcha that bit us. When the primary vacuums dead tuples, it sends WAL records to the replica saying "these rows are gone." But if a long-running query on the replica still needs those rows, you get a conflict.
By default, PostgreSQL cancels the replica query after max_standby_streaming_delay (30 seconds). Your 5-minute settlement report gets killed mid-execution.
The fix is hot_standby_feedback:
# On the replica
hot_standby_feedback = on # tells primary not to vacuum rows the replica needs
max_standby_streaming_delay = 300s # allow longer queries before cancellation
The tradeoff: enabling hot_standby_feedback means the primary's vacuum can't clean up rows that the replica is reading. If a replica runs a query that holds a snapshot for an hour, table bloat on the primary increases. We set statement_timeout = 600s on the replica to cap this.
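To see whether standby queries are actually being cancelled, PostgreSQL exposes per-database conflict counters on the replica; the confl_snapshot column counts exactly the vacuum-vs-snapshot conflicts described above:

```sql
-- Run on the replica: cancelled-query counts by conflict type
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```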
Read-After-Write Consistency
The trickiest pattern in a read-replica setup is read-after-write. A merchant submits a refund (write to primary), then immediately refreshes the transaction page (read from... where?).
We solved this with a simple session-level flag:
func (h *Handler) ProcessRefund(w http.ResponseWriter, r *http.Request) {
    db := h.router.Primary()
    _, err := db.ExecContext(r.Context(), refundQuery, refundID, txnID, amount)
    if err != nil {
        // handle error
        return
    }

    // Set a cookie that forces primary reads for 5 seconds
    http.SetCookie(w, &http.Cookie{
        Name:     "read_primary",
        Value:    "1",
        MaxAge:   5,
        HttpOnly: true,
        Secure:   true,
        SameSite: http.SameSiteStrictMode,
    })
    // ...
}

func (h *Handler) routeDB(r *http.Request) *sql.DB {
    if cookie, err := r.Cookie("read_primary"); err == nil && cookie.Value == "1" {
        return h.router.Primary()
    }
    return h.router.Replica()
}
After a write, the user's reads go to the primary for 5 seconds — long enough for the replica to catch up. After that, reads go back to the replica. Simple, predictable, and it handles 99% of read-after-write scenarios without any replication lag awareness.
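For the remaining cases, where a fixed TTL isn't strict enough, a lag-aware variant compares WAL positions instead of elapsed time: capture the primary's WAL position after the write, and only serve the read from a replica that has replayed past it. A sketch (the literal LSN is a placeholder for the value captured after the write):

```sql
-- On the primary, immediately after the write:
SELECT pg_current_wal_lsn();

-- On the replica, before serving the read:
SELECT pg_last_wal_replay_lsn() >= '0/3000060'::pg_lsn AS caught_up;
```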
Sizing Your Replicas
A common mistake is making replicas smaller than the primary because "they only handle reads." Reporting queries are often more resource-intensive than OLTP writes. A settlement report that aggregates 30 days of transactions needs substantial memory for sort operations and hash joins.
Our rule of thumb:
- CPU: Same or larger than primary. Reporting queries are CPU-heavy.
- Memory: At least equal to primary. The replica needs the same buffer cache to avoid disk I/O.
- Storage I/O: This is where you can differentiate. Use provisioned IOPS for the primary (consistent latency for transactions) and GP3 for replicas (cheaper, burst capacity is fine for batch reports).
- Number of replicas: Start with one. Add a second only when you have distinct workloads (e.g., one for the merchant dashboard, one for compliance batch exports).
What We'd Do Differently
- Add replicas before you need them. We waited until the primary was struggling. Setting up read routing after the fact meant touching every database call in the codebase. If we'd built the routing layer from day one, adding a replica would have been a config change.
- Monitor replication lag from the start. We didn't have lag monitoring until the 45-minute incident. Now it's one of our top-level Datadog dashboards.
- Use connection pooling on replicas too. We had PgBouncer in front of the primary but connected directly to replicas. Under load, the replica hit max_connections and started rejecting dashboard queries.
- Consider Aurora over vanilla RDS. Aurora's storage-level replication has significantly lower lag than PostgreSQL's WAL-based streaming replication. If you're on AWS and can afford it, Aurora PostgreSQL simplifies a lot of this.
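Putting PgBouncer in front of a replica is a small config file. A sketch, assuming a replica at replica-1.internal and a payments database (hostnames and pool sizes are illustrative):

```ini
; pgbouncer.ini for the replica pool
[databases]
payments_ro = host=replica-1.internal port=5432 dbname=payments

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
```

Transaction pooling works well for stateless reporting queries; it caps the number of server connections at default_pool_size while admitting far more dashboard clients.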
Read replicas aren't glamorous infrastructure. But for payment systems where reporting and transactions share a database, they're the difference between a platform that scales and one that falls over every time a merchant pulls a monthly report.
References
- PostgreSQL Documentation — High Availability, Load Balancing, and Replication
- PostgreSQL Documentation — Hot Standby
- PostgreSQL — pg_stat_replication View
- AWS RDS — Working with Read Replicas
- PgBouncer — Lightweight Connection Pooler for PostgreSQL
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.