The Problem: Reporting vs. Transactions
Payment databases have a split personality. On one side, you have OLTP workloads — fast, small writes for every authorization, capture, and refund. On the other, you have reporting queries — massive aggregations across millions of rows for settlement reports, merchant dashboards, and compliance exports.
These two workloads fight for the same resources. A reporting query that scans 30 days of transactions pins a long-lived snapshot, consumes I/O bandwidth, and pushes hot data out of the buffer cache. Meanwhile, the authorization service is trying to respond in under 200ms.
We hit this wall at around 50,000 transactions per day. The merchant dashboard would run a settlement summary query, and our Datadog alerts would light up with latency spikes on the authorization endpoint. The two workloads were sharing a single PostgreSQL 14 instance on an RDS db.r6g.xlarge.
PostgreSQL Streaming Replication — How It Works
PostgreSQL's streaming replication sends WAL (Write-Ahead Log) records from the primary to one or more standby servers in near real-time. The standby replays these records to maintain an almost-identical copy of the data.
The key configuration on the primary:
# postgresql.conf on primary
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
# pg_hba.conf — allow replication connections
host replication replicator 10.0.0.0/16 scram-sha-256
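The replicator role referenced in pg_hba.conf also has to exist on the primary with the REPLICATION attribute. A minimal sketch (the role name matches the pg_hba.conf entry above; the password is a placeholder):

```sql
-- On the primary: create a dedicated replication role
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';
```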
On AWS RDS or Aurora, you skip most of this — creating a read replica is a few clicks or one Terraform resource. But understanding the underlying mechanism matters when things go wrong.
The Routing Layer — Deciding Where Queries Go
The hardest part isn't setting up replication. It's deciding which queries go to the primary and which go to replicas. Get this wrong and you'll serve stale data to merchants or, worse, route writes to a read-only replica.
We built a simple routing layer in Go:
type DBRouter struct {
    primary  *sql.DB
    replicas []*sql.DB
    mu       sync.Mutex
    next     int
}

func (r *DBRouter) Primary() *sql.DB {
    return r.primary
}

// Replica returns the next replica in round-robin order.
func (r *DBRouter) Replica() *sql.DB {
    r.mu.Lock()
    defer r.mu.Unlock()
    db := r.replicas[r.next]
    r.next = (r.next + 1) % len(r.replicas)
    return db
}

// Usage in handlers:
func (h *Handler) GetSettlementReport(w http.ResponseWriter, r *http.Request) {
    db := h.router.Replica() // reporting → replica
    rows, err := db.QueryContext(r.Context(), settlementQuery, merchantID, startDate, endDate)
    // ...
}

func (h *Handler) ProcessCharge(w http.ResponseWriter, r *http.Request) {
    db := h.router.Primary() // writes → primary
    _, err := db.ExecContext(r.Context(), insertCharge, chargeID, amount, currency)
    // ...
}
Key rule: Any query that follows a write in the same user flow must go to the primary. If a merchant creates a refund and immediately views the transaction detail, that read must hit the primary — otherwise they'll see the pre-refund state and think it failed.
Replication Lag — The Silent Killer
Replication lag is the time between a write on the primary and that write becoming visible on the replica. Under normal conditions, PostgreSQL streaming replication keeps this under 100ms. But "normal conditions" is doing a lot of heavy lifting.
Things that spike replication lag:
- Long-running queries on the replica. If a reporting query holds a snapshot for 30 seconds, the replica can't apply WAL records that conflict with that snapshot. Lag accumulates.
- Heavy write bursts on the primary. End-of-day settlement batches that insert thousands of rows can overwhelm the replica's apply rate.
- Replica resource contention. If the replica is undersized or its disk I/O is saturated by reporting queries, WAL replay falls behind.
- Network issues. Cross-AZ replication adds latency. Cross-region replication adds a lot more.
We monitor lag with a simple query on the replica:
-- Run on replica: check replication lag
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Or from the primary: check all replicas
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS byte_lag
FROM pg_stat_replication;
The outage: We once had a replica fall 45 minutes behind during a batch migration. The merchant dashboard was showing settlement totals from the previous day. Three merchants called support thinking they'd lost money. We now alert when lag exceeds 5 seconds and automatically fail reporting over to the primary when lag exceeds 30 seconds.
| Lag Threshold | Action | Severity |
|---|---|---|
| < 1 second | Normal operation | Healthy |
| 1–5 seconds | Log warning, investigate | Warning |
| 5–30 seconds | Page on-call, show banner in dashboard | High |
| > 30 seconds | Failover reporting to primary | Critical |
Handling hot_standby_feedback and Query Conflicts
Here's a PostgreSQL-specific gotcha that bit us. When the primary vacuums dead tuples, it sends WAL records to the replica saying "these rows are gone." But if a long-running query on the replica still needs those rows, you get a conflict.
By default, PostgreSQL cancels the replica query after max_standby_streaming_delay (30 seconds). Your 5-minute settlement report gets killed mid-execution.
The fix is hot_standby_feedback:
# On the replica
hot_standby_feedback = on # tells primary not to vacuum rows the replica needs
max_standby_streaming_delay = 300s # allow longer queries before cancellation
The tradeoff: enabling hot_standby_feedback means the primary's vacuum can't clean up rows that the replica is reading. If a replica runs a query that holds a snapshot for an hour, table bloat on the primary increases. We set statement_timeout = 600s on the replica to cap this.
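To see whether standby queries are actually being cancelled, PostgreSQL exposes per-database conflict counters on the replica; the confl_snapshot column counts exactly the vacuum-vs-snapshot conflicts described above:

```sql
-- Run on the replica: cancelled-query counts by conflict type
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```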
Read-After-Write Consistency
The trickiest pattern in a read-replica setup is read-after-write. A merchant submits a refund (write to primary), then immediately refreshes the transaction page (read from... where?).
We solved this with a simple session-level flag:
func (h *Handler) ProcessRefund(w http.ResponseWriter, r *http.Request) {
    db := h.router.Primary()
    _, err := db.ExecContext(r.Context(), refundQuery, refundID, txnID, amount)
    if err != nil {
        // handle error
        return
    }

    // Set a cookie that forces primary reads for 5 seconds
    http.SetCookie(w, &http.Cookie{
        Name:     "read_primary",
        Value:    "1",
        MaxAge:   5,
        HttpOnly: true,
        Secure:   true,
        SameSite: http.SameSiteStrictMode,
    })
    // ...
}

func (h *Handler) routeDB(r *http.Request) *sql.DB {
    if cookie, err := r.Cookie("read_primary"); err == nil && cookie.Value == "1" {
        return h.router.Primary()
    }
    return h.router.Replica()
}
After a write, the user's reads go to the primary for 5 seconds — long enough for the replica to catch up. After that, reads go back to the replica. Simple, predictable, and it handles 99% of read-after-write scenarios without any replication lag awareness.
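For the remaining cases, where a fixed TTL isn't strict enough, a lag-aware variant compares WAL positions instead of elapsed time: capture the primary's WAL position after the write, and only serve the read from a replica that has replayed past it. A sketch (the literal LSN is a placeholder for the value captured after the write):

```sql
-- On the primary, immediately after the write:
SELECT pg_current_wal_lsn();

-- On the replica, before serving the read:
SELECT pg_last_wal_replay_lsn() >= '0/3000060'::pg_lsn AS caught_up;
```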
Sizing Your Replicas
A common mistake is making replicas smaller than the primary because "they only handle reads." Reporting queries are often more resource-intensive than OLTP writes. A settlement report that aggregates 30 days of transactions needs substantial memory for sort operations and hash joins.
Our rule of thumb:
- CPU: Same or larger than primary. Reporting queries are CPU-heavy.
- Memory: At least equal to primary. The replica needs the same buffer cache to avoid disk I/O.
- Storage I/O: This is where you can differentiate. Use provisioned IOPS for the primary (consistent latency for transactions) and GP3 for replicas (cheaper, burst capacity is fine for batch reports).
- Number of replicas: Start with one. Add a second only when you have distinct workloads (e.g., one for the merchant dashboard, one for compliance batch exports).
What We'd Do Differently
- Add replicas before you need them. We waited until the primary was struggling. Setting up read routing after the fact meant touching every database call in the codebase. If we'd built the routing layer from day one, adding a replica would have been a config change.
- Monitor replication lag from the start. We didn't have lag monitoring until the 45-minute incident. Now it's one of our top-level Datadog dashboards.
- Use connection pooling on replicas too. We had PgBouncer in front of the primary but connected directly to replicas. Under load, the replica hit max_connections and started rejecting dashboard queries.
- Consider Aurora over vanilla RDS. Aurora's storage-level replication has significantly lower lag than PostgreSQL's WAL-based streaming replication. If you're on AWS and can afford it, Aurora PostgreSQL simplifies a lot of this.
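Putting PgBouncer in front of a replica is a small config file. A sketch, assuming a replica at replica-1.internal and a payments database (hostnames and pool sizes are illustrative):

```ini
; pgbouncer.ini for the replica pool
[databases]
payments_ro = host=replica-1.internal port=5432 dbname=payments

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
```

Transaction pooling works well for stateless reporting queries; it caps the number of server connections at default_pool_size while admitting far more dashboard clients.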
Read replicas aren't glamorous infrastructure. But for payment systems where reporting and transactions share a database, they're the difference between a platform that scales and one that falls over every time a merchant pulls a monthly report.
References
- PostgreSQL Documentation — High Availability, Load Balancing, and Replication
- PostgreSQL Documentation — Hot Standby
- PostgreSQL — pg_stat_replication View
- AWS RDS — Working with Read Replicas
- PgBouncer — Lightweight Connection Pooler for PostgreSQL
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.