Payment Data Retention and Archival Engineering

The Retention Landscape

Payment data retention isn't a single policy — it's a matrix of overlapping requirements from card networks, regulators, tax authorities, and your own business needs. Visa requires transaction records for 13 months minimum. PCI DSS says you must limit storage to what's strictly necessary. GDPR says delete personal data when it's no longer needed. And your finance team wants 7 years of data for audits.

These requirements often conflict. The engineering challenge is building a system that satisfies all of them simultaneously without turning your database into an unqueryable archive.

Regulatory Timelines You Must Know

Requirement	Retention Period	What to Keep
Visa Core Rules	13 months	Transaction records, authorization logs
Mastercard Standards	13 months	Settlement data, dispute records
PCI DSS 4.0	Business-justified only	Minimize stored cardholder data
GDPR / UK GDPR	Purpose-limited	Personal data only as long as needed
Tax / Accounting (varies)	5-7 years	Financial records, invoices, receipts
Chargeback Window	540 days (Visa)	Evidence for dispute resolution

Key insight: The chargeback window is often the binding constraint, not the card network retention rule. Visa allows disputes up to 540 days after the transaction date for certain reason codes. If you archive transaction details before that window closes, you can't defend chargebacks.

Tiered Storage Architecture

The solution is tiered storage — keep recent data hot, move aging data to warm storage, and archive old data to cold storage. Each tier has different query performance characteristics and cost profiles.

Hot

0-90 days

Primary PostgreSQL
Full indexes, fast queries

Warm

90 days - 2 years

Read replica or ClickHouse
Reduced indexes, slower queries

Cold

2-7 years

S3 + Parquet
Batch queries only

Partition-Based Archival with PostgreSQL

PostgreSQL's declarative partitioning is the foundation of the hot tier. Partition your transactions table by month, and archival becomes detaching a partition rather than running a massive DELETE.

CREATE TABLE transactions (
    id          BIGINT GENERATED ALWAYS AS IDENTITY,
    created_at  TIMESTAMPTZ NOT NULL,
    merchant_id UUID NOT NULL,
    amount      BIGINT NOT NULL,
    currency    CHAR(3) NOT NULL,
    status      TEXT NOT NULL
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE transactions_2026_03
    PARTITION OF transactions
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

CREATE TABLE transactions_2026_04
    PARTITION OF transactions
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

When a partition ages out of the hot tier, detach it and move it to the warm tier:

-- Detach without blocking writes (PostgreSQL 14+)
ALTER TABLE transactions
    DETACH PARTITION transactions_2025_12 CONCURRENTLY;

-- Export to Parquet for cold storage
COPY transactions_2025_12 TO PROGRAM
    'parquet-converter --output s3://payments-archive/2025/12/'
    WITH (FORMAT csv, HEADER);

Real-world gotcha: DETACH PARTITION CONCURRENTLY requires PostgreSQL 14+. On older versions, detaching takes an ACCESS EXCLUSIVE lock that blocks all reads and writes on the parent table. We learned this the hard way during a maintenance window that turned into an outage.

The Cold Storage Layer

For data older than 2 years, Parquet files on S3 (or equivalent object storage) give you the best cost-to-queryability ratio. Parquet's columnar format means you can scan specific fields across billions of rows without reading the entire dataset.

Structure your cold storage by date hierarchy for efficient partition pruning:

s3://payments-archive/
  transactions/
    year=2024/
      month=01/
        part-00000.parquet
        part-00001.parquet
      month=02/
        ...
  settlements/
    year=2024/
      month=01/
        ...

Query cold data with tools like AWS Athena, Trino, or DuckDB. For compliance queries that happen once a quarter, the per-query cost model of Athena is far cheaper than keeping the data in a running database.

Here's where it gets tricky. GDPR gives individuals the right to request deletion of their personal data. But tax law requires you to keep financial records for 5-7 years. And PCI DSS requires audit trails.

The practical solution is pseudonymization rather than deletion for data that has a legal retention requirement:

-- Instead of deleting, pseudonymize the personal data
UPDATE transactions_archive
SET
    cardholder_name = 'REDACTED',
    email = encode(digest(email, 'sha256'), 'hex'),
    ip_address = '0.0.0.0',
    billing_address = 'REDACTED'
WHERE customer_id = $1
AND created_at < NOW() - INTERVAL '13 months';

This satisfies GDPR because the data is no longer personally identifiable, while preserving the financial record for tax and audit purposes. Document this approach in your data processing agreement — regulators generally accept pseudonymization as a valid response to erasure requests when legal retention obligations apply.

Building the Archival Pipeline

The archival pipeline runs as a scheduled job — we use a weekly cadence. The key principle: archival must be idempotent. If the job fails halfway through, re-running it should produce the same result.

Identify
Aged partitions

→

Export
To Parquet/S3

→

Verify
Row count + checksum

→

Detach
Partition

→

Drop
After 30-day hold

The 30-day hold between detach and drop is critical. If someone discovers a data issue after archival, you can reattach the partition without restoring from cold storage. We've used this escape hatch twice in production — both times it saved hours of recovery time.

Lessons from Production

Never delete without exporting first. This sounds obvious, but under pressure to free disk space, teams skip the verification step. Always confirm the row count in cold storage matches the source before dropping the partition.
Keep your archival metadata in a separate table. Track which partitions were archived, when, where the Parquet files live, and the row count. This table is your index into cold storage and your audit trail for compliance.
Test your restore path quarterly. Archival is useless if you can't restore. Pick a random archived month each quarter and verify you can query it end-to-end. We found a Parquet schema drift issue this way that would have made 3 months of data unreadable.
Encrypt cold storage at rest and in transit. Payment data in S3 needs server-side encryption (SSE-S3 or SSE-KMS) and bucket policies that enforce HTTPS. A misconfigured bucket with payment data is a PCI DSS violation and a breach waiting to happen.
Automate partition creation, not just archival. If you forget to create next month's partition, inserts fail. Use a cron job or pg_partman to create partitions 3 months ahead.

References

Disclaimer: This article reflects the author's personal experience and opinions. It is not legal advice — consult qualified legal counsel for compliance decisions specific to your jurisdiction. Product names, logos, and brands are property of their respective owners.

The Retention Landscape

Regulatory Timelines You Must Know

Tiered Storage Architecture

Partition-Based Archival with PostgreSQL

The Cold Storage Layer

GDPR and the Right to Erasure Conflict

Building the Archival Pipeline

Lessons from Production

References

Related Articles