Building Multi-Tenant Payment Systems — Lessons from Isolating $2B in Annual Volume

Why Multi-Tenancy Is Uniquely Hard in Payments

Most SaaS multi-tenancy guides talk about data isolation and noisy neighbors. Payments add three dimensions that make everything harder:

Compliance boundaries are per-tenant. Merchant A might be PCI Level 1 while Merchant B is Level 4. Their data can't just be logically separated — audit trails need to prove isolation to different QSAs with different expectations.
Settlement is real money. If your reconciliation job accidentally attributes Tenant A's Stripe payout to Tenant B's ledger, you don't have a bug — you have a financial incident. I've seen this happen exactly once, and the cleanup took three engineers two weeks.
Provider configurations diverge wildly. One tenant uses Stripe Connect with direct charges, another uses Adyen with split payments, a third has a legacy Braintree integration they refuse to migrate. Your abstraction layer needs to handle all of this without leaking state between tenants.

The Architecture at a Glance

Here's the layered isolation model we settled on after two rewrites. Every request passes through tenant identification, then hits middleware that sets the isolation context for everything downstream — database queries, provider API calls, rate limits, and settlement jobs.

[Figure: Multi-Tenant Payment Isolation Layers. API Gateway (X-Tenant-ID header extracted and verified) → Tenant Context Middleware (context propagated to all downstream services) → Payment Service, Settlement Service, Reconciliation → storage layer with RLS policies enforcing row-level tenant filtering: PostgreSQL + RLS, Provider Configs, Ledger (per-tenant).]

Choosing an Isolation Strategy

We evaluated three approaches. The "right" answer depends on your tenant count, compliance requirements, and how much operational complexity your team can absorb.

Criteria	Shared DB + RLS	Schema-per-Tenant	DB-per-Tenant
Data isolation	Logical	Strong	Complete
Compliance audit	Harder to prove to QSAs	Moderate — schema boundaries help	Easiest — physical separation
Operational cost	Low	Medium	High
Migration complexity	Single migration, all tenants	N migrations (one per schema)	N migrations (one per DB)
Connection pooling	Shared pool, simple	Pool per schema or `SET search_path`	Pool per DB — connection explosion
Noisy neighbor risk	High	Medium	None
Best for	100+ tenants, similar compliance	10–100 tenants, mixed compliance	<10 large tenants, strict regulation

We went with shared DB + RLS for most tenants, with the option to "graduate" high-volume merchants to their own schema. The two biggest tenants (each doing 300M+ annually) got dedicated databases. Pragmatism over purity.
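A minimal sketch of how that "graduation" could be routed in code, assuming a lookup table of overrides for graduated tenants and a shared default for everyone else. The `Router` and `Placement` types, their fields, and the DSN strings are hypothetical names for illustration, not part of the system described here.

```go
package main

import "fmt"

// IsolationTier names the three strategies compared in the table above.
type IsolationTier int

const (
	SharedRLS       IsolationTier = iota // shared database, row-level security
	DedicatedSchema                      // same database, tenant-owned schema
	DedicatedDB                          // separate database
)

// Placement says where one tenant's data lives (hypothetical shape).
type Placement struct {
	Tier   IsolationTier
	DSN    string // which database to connect to
	Schema string // schema to use; "public" for shared tenants
}

// Router resolves a tenant to its placement, defaulting to the shared tier.
type Router struct {
	sharedDSN string
	overrides map[string]Placement // graduated tenants only
}

// Resolve returns the override for a graduated tenant, else the shared placement.
func (r *Router) Resolve(tenantID string) Placement {
	if p, ok := r.overrides[tenantID]; ok {
		return p
	}
	return Placement{Tier: SharedRLS, DSN: r.sharedDSN, Schema: "public"}
}

func main() {
	r := &Router{
		sharedDSN: "postgres://shared.internal/payments",
		overrides: map[string]Placement{
			"tenant-big": {Tier: DedicatedDB, DSN: "postgres://big.internal/payments", Schema: "public"},
		},
	}
	fmt.Println(r.Resolve("tenant-small").Tier == SharedRLS)
	fmt.Println(r.Resolve("tenant-big").DSN)
}
```

Keeping the default path free of configuration means a new tenant lands on the shared tier automatically, and graduating one is a single override entry plus the data move.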

Tenant Context Propagation in Go

The foundation of everything is getting the tenant ID into context.Context early and making sure it's impossible to run a database query without it. Here's the middleware we use:

type tenantKey struct{}

// TenantFromContext extracts the tenant ID or panics.
// We intentionally panic here — a missing tenant ID is a
// programming error, not a runtime condition.
func TenantFromContext(ctx context.Context) string {
    tid, ok := ctx.Value(tenantKey{}).(string)
    if !ok || tid == "" {
        panic("tenant ID missing from context — this is a bug")
    }
    return tid
}

func TenantMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tid := r.Header.Get("X-Tenant-ID")
        if tid == "" {
            http.Error(w, `{"error":"missing tenant"}`, 403)
            return
        }
        // Validate tenant exists and is active
        // (cached lookup, ~0.2ms p99)
        if !tenantCache.IsActive(tid) {
            http.Error(w, `{"error":"unknown tenant"}`, 403)
            return
        }
        ctx := context.WithValue(r.Context(), tenantKey{}, tid)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}
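To see the middleware's behavior end to end, here is a small self-contained sketch that drives a simplified copy of it with net/http/httptest. The `activeTenants` map is a hypothetical stand-in for the cached tenant lookup, and `statusFor` is a helper invented for this example.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type tenantKey struct{}

// activeTenants is a hypothetical stand-in for the cached tenant lookup.
var activeTenants = map[string]bool{"tenant-a": true}

// tenantMiddleware rejects missing or unknown tenants and stores the
// tenant ID in the request context for everything downstream.
func tenantMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tid := r.Header.Get("X-Tenant-ID")
		if !activeTenants[tid] { // also covers the empty header
			http.Error(w, `{"error":"unknown tenant"}`, http.StatusForbidden)
			return
		}
		ctx := context.WithValue(r.Context(), tenantKey{}, tid)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// statusFor sends one request through the middleware and reports the
// response status plus the tenant ID the inner handler saw.
func statusFor(tenantHeader string) (status int, seen string) {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen, _ = r.Context().Value(tenantKey{}).(string)
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, "/charges", nil)
	if tenantHeader != "" {
		req.Header.Set("X-Tenant-ID", tenantHeader)
	}
	rec := httptest.NewRecorder()
	tenantMiddleware(inner).ServeHTTP(rec, req)
	return rec.Code, seen
}

func main() {
	fmt.Println(statusFor("tenant-a")) // known tenant reaches the handler
	fmt.Println(statusFor(""))         // missing header is rejected
	fmt.Println(statusFor("tenant-x")) // unknown tenant is rejected
}
```

The inner handler only ever runs with a tenant ID in its context, which is the property the rest of the stack relies on.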

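Then at the database layer, every query runs inside a transaction that sets the RLS session variable first. SET LOCAL only takes effect inside a transaction and SET cannot take bind parameters, so the helper calls set_config on the same transaction that runs the query:

// WithTenantTx runs fn in a transaction scoped to the caller's tenant.
func (db *TenantDB) WithTenantTx(ctx context.Context, fn func(tx *sql.Tx) error) error {
    tid := TenantFromContext(ctx)

    tx, err := db.pool.BeginTx(ctx, nil)
    if err != nil {
        return fmt.Errorf("beginning tenant transaction: %w", err)
    }
    defer tx.Rollback() // no-op once Commit has succeeded

    // set_config(..., true) is the parameterizable form of SET LOCAL:
    // the value lasts for this transaction only.
    if _, err := tx.ExecContext(ctx, "SELECT set_config('app.tenant_id', $1, true)", tid); err != nil {
        return fmt.Errorf("setting tenant context: %w", err)
    }
    if err := fn(tx); err != nil {
        return err
    }
    return tx.Commit()
}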

The corresponding PostgreSQL RLS policy looks like this:

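ALTER TABLE transactions ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS by default; FORCE applies the policy to them too.
ALTER TABLE transactions FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON transactions
    USING (tenant_id = current_setting('app.tenant_id')::uuid);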

This is the part that lets me sleep at night. Even if application code has a bug and forgets a WHERE tenant_id = ? clause, RLS catches it at the database level. Defense in depth.

Hard-learned lesson: Never make tenant context optional. We initially had a "system" context that bypassed RLS for admin operations and migration scripts. Within three months, two different services were accidentally using the system context for regular queries because a developer copied the wrong initialization code. We ripped it out and created a separate, dedicated connection pool for admin operations with its own credentials and audit logging. The extra operational overhead was worth the guarantee.

Per-Tenant Provider Configurations

Each tenant brings their own payment provider relationship. Tenant A has a Stripe account with specific webhook endpoints. Tenant B uses Adyen with a different merchant account per currency. We store these configs encrypted at rest and load them per-request:

type ProviderConfig struct {
    TenantID    string
    Provider    string // "stripe", "adyen", "braintree"
    Credentials EncryptedBlob
    WebhookURL  string
    Metadata    map[string]string // provider-specific settings
}

func (s *PaymentService) Charge(ctx context.Context, req ChargeRequest) (*ChargeResult, error) {
    tid := TenantFromContext(ctx)
    cfg, err := s.configStore.GetProvider(ctx, tid, req.ProviderHint)
    if err != nil {
        return nil, fmt.Errorf("loading provider config for tenant %s: %w", tid, err)
    }

    provider, err := s.providerFactory.Create(cfg)
    if err != nil {
        return nil, err
    }
    return provider.Charge(ctx, req)
}

The providerFactory returns a provider-specific client initialized with that tenant's credentials. We cache the decrypted credentials in memory with a 5-minute TTL — long enough to avoid hitting KMS on every request, short enough that credential rotations propagate quickly.
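A minimal sketch of such a TTL cache, assuming a `decrypt` callback that stands in for the KMS round trip. The `CredentialCache` type and its method names are hypothetical, and for brevity it keys on tenant ID alone and holds one lock across the decrypt call, which a production version would likely refine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type cachedCreds struct {
	value     string
	expiresAt time.Time
}

// CredentialCache keeps decrypted credentials in memory for a fixed TTL.
type CredentialCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	decrypt func(tenantID string) (string, error) // e.g. a KMS call
	entries map[string]cachedCreds
}

func NewCredentialCache(ttl time.Duration, decrypt func(string) (string, error)) *CredentialCache {
	return &CredentialCache{ttl: ttl, decrypt: decrypt, entries: map[string]cachedCreds{}}
}

// Get returns cached credentials, decrypting again once the entry expires.
func (c *CredentialCache) Get(tenantID string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[tenantID]; ok && time.Now().Before(e.expiresAt) {
		return e.value, nil
	}
	v, err := c.decrypt(tenantID)
	if err != nil {
		return "", err // errors are not cached, so the next call retries
	}
	c.entries[tenantID] = cachedCreds{value: v, expiresAt: time.Now().Add(c.ttl)}
	return v, nil
}

func main() {
	calls := 0
	cache := NewCredentialCache(5*time.Minute, func(tid string) (string, error) {
		calls++
		return "secret-for-" + tid, nil
	})
	cache.Get("tenant-a")
	cache.Get("tenant-a") // served from memory
	fmt.Println("decrypt calls:", calls)
}
```

Because entries are keyed by tenant, one tenant's rotation or decrypt failure never touches another tenant's cached credentials.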

Settlement and Reconciliation Isolation

Settlement is where tenant isolation gets really unforgiving. Our reconciliation pipeline runs as a series of tenant-scoped batch jobs. Each job:

Pulls the settlement file from the provider (Stripe payouts, Adyen settlement reports)
Matches each line item against our internal ledger for that specific tenant
Flags discrepancies into a tenant-scoped exceptions queue
Updates the tenant's ledger entries with settlement confirmation
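The matching and flagging steps above can be sketched as a pure function over one tenant's data. The `SettlementLine`, `LedgerEntry`, and `Discrepancy` shapes are hypothetical, and real settlement files carry far more fields (fees, currencies, dates) than this illustration assumes.

```go
package main

import "fmt"

// SettlementLine is one row of a provider settlement file (hypothetical shape).
type SettlementLine struct {
	ProviderRef string
	AmountMinor int64 // minor units, e.g. cents
}

// LedgerEntry is one internal ledger record for a single tenant.
type LedgerEntry struct {
	ProviderRef string
	AmountMinor int64
}

// Discrepancy describes a settlement line that could not be confirmed.
type Discrepancy struct {
	ProviderRef string
	Reason      string
}

// Reconcile matches settlement lines against one tenant's ledger. It returns
// the references that can be marked settled and the discrepancies to queue.
func Reconcile(lines []SettlementLine, ledger []LedgerEntry) (settled []string, issues []Discrepancy) {
	byRef := make(map[string]LedgerEntry, len(ledger))
	for _, e := range ledger {
		byRef[e.ProviderRef] = e
	}
	for _, l := range lines {
		e, ok := byRef[l.ProviderRef]
		switch {
		case !ok:
			issues = append(issues, Discrepancy{l.ProviderRef, "no ledger entry"})
		case e.AmountMinor != l.AmountMinor:
			issues = append(issues, Discrepancy{l.ProviderRef, "amount mismatch"})
		default:
			settled = append(settled, l.ProviderRef)
		}
	}
	return settled, issues
}

func main() {
	ledger := []LedgerEntry{{"ch_1", 1000}, {"ch_2", 2500}}
	lines := []SettlementLine{{"ch_1", 1000}, {"ch_2", 2400}, {"ch_3", 700}}
	settled, issues := Reconcile(lines, ledger)
	fmt.Println(settled)
	fmt.Println(issues)
}
```

Since the function only ever sees the ledger slice it is handed, the caller decides the tenant boundary by what it loads, and the matching logic itself cannot reach across tenants.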

The critical design decision: settlement jobs never share database transactions across tenants. Each tenant's reconciliation runs in its own transaction with its own RLS context. If Tenant A's reconciliation fails and rolls back, Tenant B's settlement is completely unaffected.
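As a rough illustration of that failure isolation, the sketch below runs each tenant's reconciliation as an independent unit and records failures instead of aborting the batch. `RunAll`, `ReconcileFunc`, and `ErrProviderUnavailable` are hypothetical names; the per-tenant function is assumed to open and close its own transaction.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrProviderUnavailable is a hypothetical failure used for the demo.
var ErrProviderUnavailable = errors.New("provider settlement file unavailable")

// ReconcileFunc runs one tenant's reconciliation in its own transaction.
type ReconcileFunc func(tenantID string) error

// RunAll runs every tenant independently. A failure is recorded against
// that tenant and the loop moves on, so one rollback never blocks the rest.
func RunAll(tenantIDs []string, run ReconcileFunc) map[string]error {
	failures := make(map[string]error)
	for _, tid := range tenantIDs {
		if err := run(tid); err != nil {
			failures[tid] = fmt.Errorf("reconciling tenant %s: %w", tid, err)
		}
	}
	return failures
}

func main() {
	failures := RunAll([]string{"tenant-a", "tenant-b"}, func(tid string) error {
		if tid == "tenant-a" {
			return ErrProviderUnavailable
		}
		return nil
	})
	fmt.Println(len(failures), errors.Is(failures["tenant-a"], ErrProviderUnavailable))
}
```

The returned map gives the scheduler a per-tenant retry list without any cross-tenant state.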

We also partition the settlements table by tenant_id using PostgreSQL declarative partitioning. This lets us run VACUUM and other maintenance on one tenant's partition at a time instead of across the whole table, which matters a lot when your largest tenant has 40M rows and your smallest has 2,000.

Rate Limiting and Fair Usage

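Without per-tenant rate limiting, one merchant running a batch import can starve everyone else. We keep one limiter per tenant and operation in Redis, using the redis-cell module's CL.THROTTLE command (it implements GCRA, which behaves much like a token bucket):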

func (rl *RateLimiter) Allow(ctx context.Context, operation string) error {
    tid := TenantFromContext(ctx)
    key := fmt.Sprintf("rl:%s:%s", tid, operation)

    limit := rl.getTenantLimit(tid, operation) // from config
    // CL.THROTTLE replies with five integers:
    // [limited, limit, remaining, retry_after, reset_after].
    res, err := rl.redis.Do(ctx, "CL.THROTTLE", key,
        limit.Burst, limit.Rate, int64(limit.Period.Seconds()), 1,
    ).Slice()
    if err != nil || len(res) == 0 {
        // Fail open: don't block payments because Redis is down
        return nil
    }
    // The first element is 1 when the request is limited, 0 when allowed.
    if limited, _ := res[0].(int64); limited == 1 {
        return ErrRateLimited
    }
    return nil
}

Limits are configurable per tenant. Our largest merchant gets 500 req/s on the charge endpoint; smaller tenants default to 50. We expose current usage in a dashboard so merchants can see when they're approaching limits and request increases.
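A sketch of how that lookup might work, assuming per-operation defaults with optional per-tenant overrides. The `LimitConfig` type and the burst values are assumptions for illustration; only the 500 and 50 req/s figures come from the text above.

```go
package main

import (
	"fmt"
	"time"
)

// Limit describes one rate limit: Rate requests per Period, plus a burst allowance.
type Limit struct {
	Burst  int
	Rate   int
	Period time.Duration
}

// LimitConfig holds per-operation defaults and per-tenant overrides.
type LimitConfig struct {
	Defaults  map[string]Limit            // operation -> limit
	Overrides map[string]map[string]Limit // tenant -> operation -> limit
}

// For returns the tenant's override if one exists, else the operation default.
// An unknown operation yields the zero Limit, which callers should treat as
// a configuration error rather than as "unlimited".
func (c LimitConfig) For(tenantID, operation string) Limit {
	if l, ok := c.Overrides[tenantID][operation]; ok {
		return l
	}
	return c.Defaults[operation]
}

func main() {
	cfg := LimitConfig{
		Defaults: map[string]Limit{"charge": {Burst: 100, Rate: 50, Period: time.Second}},
		Overrides: map[string]map[string]Limit{
			"tenant-big": {"charge": {Burst: 1000, Rate: 500, Period: time.Second}},
		},
	}
	fmt.Println(cfg.For("tenant-small", "charge").Rate) // 50
	fmt.Println(cfg.For("tenant-big", "charge").Rate)   // 500
}
```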

One thing we got wrong initially: we rate-limited at the API gateway level only. That didn't account for async operations — webhook retries, settlement batch jobs, reconciliation queries. We added a second layer of rate limiting at the service level for database-heavy operations, which finally solved the noisy neighbor problem for real.
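The text does not say how the service-level layer is built. One possible shape, shown here purely as an assumption, is an in-process gate that caps how many database-heavy operations a single tenant can run concurrently. `TenantGate` and `ErrTenantBusy` are hypothetical names, and this is a concurrency cap rather than a rate limit in the strict sense.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrTenantBusy is returned when a tenant has no free slots.
var ErrTenantBusy = errors.New("tenant concurrency limit reached")

// TenantGate caps concurrent heavy operations per tenant within one process.
type TenantGate struct {
	mu    sync.Mutex
	max   int
	slots map[string]chan struct{} // one buffered channel per tenant
}

func NewTenantGate(max int) *TenantGate {
	return &TenantGate{max: max, slots: map[string]chan struct{}{}}
}

// TryAcquire reserves a slot for the tenant or returns ErrTenantBusy.
// The returned release function must be called when the work finishes.
func (g *TenantGate) TryAcquire(tenantID string) (release func(), err error) {
	g.mu.Lock()
	ch, ok := g.slots[tenantID]
	if !ok {
		ch = make(chan struct{}, g.max)
		g.slots[tenantID] = ch
	}
	g.mu.Unlock()

	select {
	case ch <- struct{}{}:
		return func() { <-ch }, nil
	default:
		return nil, ErrTenantBusy
	}
}

func main() {
	gate := NewTenantGate(1)
	release, _ := gate.TryAcquire("tenant-a")
	_, err := gate.TryAcquire("tenant-a") // second concurrent job is refused
	fmt.Println(err)
	_, err = gate.TryAcquire("tenant-b") // other tenants are unaffected
	fmt.Println(err)
	release()
}
```

Batch jobs and webhook retries would call `TryAcquire` before touching the database, so a tenant that saturates its own slots cannot consume capacity reserved for others.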

What I'd Do Differently

If I were starting over, I'd invest in tenant-aware observability from day one. We spent months debugging issues where metrics were aggregated across all tenants, making it impossible to tell if a latency spike was a platform problem or one tenant doing something weird. Adding tenant_id as a label to every metric, trace span, and log line should be non-negotiable from the start.
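For the logging part of that, one low-friction option (an assumption on my part, using Go 1.21's log/slog) is a handler wrapper that reads the tenant ID from the context and stamps it on every record. `tenantHandler` and `logOnce` are hypothetical names, and `tenantKey` mirrors the context key used by the middleware.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
)

type tenantKey struct{}

// tenantHandler wraps another slog.Handler and adds tenant_id from the context.
type tenantHandler struct{ slog.Handler }

func (h tenantHandler) Handle(ctx context.Context, r slog.Record) error {
	if tid, ok := ctx.Value(tenantKey{}).(string); ok && tid != "" {
		r.AddAttrs(slog.String("tenant_id", tid))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the derived handler so the tenant
// attribute is not lost when callers use logger.With or logger.WithGroup.
func (h tenantHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return tenantHandler{h.Handler.WithAttrs(attrs)}
}

func (h tenantHandler) WithGroup(name string) slog.Handler {
	return tenantHandler{h.Handler.WithGroup(name)}
}

// logOnce is a demo helper: it logs one line for the given tenant into a
// buffer and returns the JSON output.
func logOnce(tenantID, msg string) string {
	var buf bytes.Buffer
	logger := slog.New(tenantHandler{slog.NewJSONHandler(&buf, nil)})
	ctx := context.WithValue(context.Background(), tenantKey{}, tenantID)
	logger.InfoContext(ctx, msg, "provider", "stripe")
	return buf.String()
}

func main() {
	fmt.Print(logOnce("tenant-a", "charge created"))
}
```

The catch is that callers must use the context-aware methods (`InfoContext`, `ErrorContext`); a plain `logger.Info` call has no context to read from and will log without the tenant attribute.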

I'd also push harder for schema-per-tenant as the default instead of shared-with-RLS. The operational overhead of managing migrations across schemas is real, but the isolation guarantees and the ability to do per-tenant maintenance windows would have saved us several late-night incidents.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.