Why Multi-Tenancy Is Uniquely Hard in Payments
Most SaaS multi-tenancy guides talk about data isolation and noisy neighbors. Payments add three dimensions that make everything harder:
- Compliance boundaries are per-tenant. Merchant A might be PCI Level 1 while Merchant B is Level 4. Their data can't just be logically separated — audit trails need to prove isolation to different QSAs with different expectations.
- Settlement is real money. If your reconciliation job accidentally attributes Tenant A's Stripe payout to Tenant B's ledger, you don't have a bug — you have a financial incident. I've seen this happen exactly once, and the cleanup took three engineers two weeks.
- Provider configurations diverge wildly. One tenant uses Stripe Connect with direct charges, another uses Adyen with split payments, a third has a legacy Braintree integration they refuse to migrate. Your abstraction layer needs to handle all of this without leaking state between tenants.
The Architecture at a Glance
Here's the layered isolation model we settled on after two rewrites. Every request passes through tenant identification, then hits middleware that sets the isolation context for everything downstream — database queries, provider API calls, rate limits, and settlement jobs.
Choosing an Isolation Strategy
We evaluated three approaches. The "right" answer depends on your tenant count, compliance requirements, and how much operational complexity your team can absorb.
| Criteria | Shared DB + RLS | Schema-per-Tenant | DB-per-Tenant |
|---|---|---|---|
| Data isolation | Logical | Strong | Complete |
| Compliance audit | Harder to prove to QSAs | Moderate — schema boundaries help | Easiest — physical separation |
| Operational cost | Low | Medium | High |
| Migration complexity | Single migration, all tenants | N migrations (one per schema) | N migrations (one per DB) |
| Connection pooling | Shared pool, simple | Pool per schema or SET search_path | Pool per DB (connection explosion) |
| Noisy neighbor risk | High | Medium | None |
| Best for | 100+ tenants, similar compliance | 10–100 tenants, mixed compliance | <10 large tenants, strict regulation |
We went with shared DB + RLS for most tenants, with the option to "graduate" high-volume merchants to their own schema. The two biggest tenants (each doing 300M+ annually) got dedicated databases. Pragmatism over purity.
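A hybrid like this needs some lookup that decides, per tenant, which isolation tier applies. Here is a minimal sketch of that idea, assuming a hypothetical IsolationTier type and an in-memory TierResolver (both names are illustrative, not our production code). Unknown tenants fall back to the shared database with RLS, and graduation only ever moves a tenant toward stronger isolation.

```go
package main

import "sync"

// IsolationTier is a hypothetical enum for the three strategies in the table.
type IsolationTier int

const (
	TierSharedRLS IsolationTier = iota // shared DB + RLS (the default)
	TierSchema                         // dedicated schema
	TierDatabase                       // dedicated database
)

// TierResolver maps tenant IDs to tiers. A real system would back this with
// a config store; a map keeps the sketch self-contained.
type TierResolver struct {
	mu    sync.RWMutex
	tiers map[string]IsolationTier
}

func NewTierResolver() *TierResolver {
	return &TierResolver{tiers: make(map[string]IsolationTier)}
}

// Graduate moves a tenant to a stronger tier. It returns false, and changes
// nothing, if the requested tier is not stronger than the current one.
func (r *TierResolver) Graduate(tenantID string, tier IsolationTier) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if tier <= r.tiers[tenantID] {
		return false
	}
	r.tiers[tenantID] = tier
	return true
}

// Resolve returns the tenant's tier; tenants never graduated get the zero
// value, TierSharedRLS.
func (r *TierResolver) Resolve(tenantID string) IsolationTier {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.tiers[tenantID]
}
```

The data-access layer would call Resolve once per request and pick the matching pool or search_path; that wiring is omitted here.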
Tenant Context Propagation in Go
The foundation of everything is getting the tenant ID into context.Context early and making sure it's impossible to run a database query without it. Here's the middleware we use:
type tenantKey struct{}
// TenantFromContext extracts the tenant ID or panics.
// We intentionally panic here: a missing tenant ID is a
// programming error, not a runtime condition.
func TenantFromContext(ctx context.Context) string {
	tid, ok := ctx.Value(tenantKey{}).(string)
	if !ok || tid == "" {
		panic("tenant ID missing from context: this is a bug")
	}
	return tid
}
func TenantMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tid := r.Header.Get("X-Tenant-ID")
		if tid == "" {
			http.Error(w, `{"error":"missing tenant"}`, http.StatusForbidden)
			return
		}
		// Validate tenant exists and is active
		// (cached lookup, ~0.2ms p99)
		if !tenantCache.IsActive(tid) {
			http.Error(w, `{"error":"unknown tenant"}`, http.StatusForbidden)
			return
		}
		ctx := context.WithValue(r.Context(), tenantKey{}, tid)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
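To show how the middleware behaves end to end, here is a small self-contained sketch that can be exercised with net/http/httptest. It is a variant of TenantMiddleware above, not the exact code: the active-tenant check is passed in as a function (an assumption made for testability, instead of the global tenantCache), and probe is a hypothetical helper that sends one request through the chain.

```go
package main

import (
	"context"
	"net/http"
	"net/http/httptest"
)

type tenantKey struct{}

// newTenantMiddleware mirrors TenantMiddleware, but takes the active-tenant
// check as a parameter so tests can stub it.
func newTenantMiddleware(isActive func(string) bool) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			tid := r.Header.Get("X-Tenant-ID")
			if tid == "" || !isActive(tid) {
				http.Error(w, `{"error":"tenant rejected"}`, http.StatusForbidden)
				return
			}
			ctx := context.WithValue(r.Context(), tenantKey{}, tid)
			next.ServeHTTP(w, r.WithContext(ctx))
		})
	}
}

// probe sends one request through the middleware and reports the status
// code plus the tenant ID the downstream handler observed in its context.
func probe(header string) (status int, seen string) {
	active := func(tid string) bool { return tid == "tenant-a" }
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen, _ = r.Context().Value(tenantKey{}).(string)
		w.WriteHeader(http.StatusOK)
	})
	h := newTenantMiddleware(active)(inner)

	req := httptest.NewRequest(http.MethodGet, "/charges", nil)
	if header != "" {
		req.Header.Set("X-Tenant-ID", header)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, seen
}
```

A request for the active tenant reaches the inner handler with the tenant ID in its context; a missing or unknown tenant is rejected before any downstream code runs.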
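Then at the database layer, every query runs inside a transaction that sets the RLS session variable first. Two details matter here: SET LOCAL only takes effect inside a transaction, and SET does not accept bind parameters, so the setting is applied with set_config on the same transaction that runs the query (not on the pool, where the next statement may land on a different connection):
func (db *TenantDB) InTenantTx(ctx context.Context, fn func(tx *sql.Tx) error) error {
	tid := TenantFromContext(ctx)
	tx, err := db.pool.BeginTx(ctx, nil)
	if err != nil {
		return fmt.Errorf("beginning tenant transaction: %w", err)
	}
	defer tx.Rollback() // no-op once Commit has succeeded
	// set_config(..., true) is transaction-scoped, like SET LOCAL
	if _, err := tx.ExecContext(ctx, "SELECT set_config('app.tenant_id', $1, true)", tid); err != nil {
		return fmt.Errorf("setting tenant context: %w", err)
	}
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}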
The corresponding PostgreSQL RLS policy looks like this:
ALTER TABLE transactions ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON transactions
USING (tenant_id = current_setting('app.tenant_id')::uuid);
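This is the part that lets me sleep at night. Even if application code has a bug and forgets a WHERE tenant_id = ? clause, RLS catches it at the database level. Defense in depth. One caveat: superusers and roles with BYPASSRLS always bypass RLS, and table owners do too unless the table has FORCE ROW LEVEL SECURITY, so the application role should be none of those.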
Hard-learned lesson: Never make tenant context optional. We initially had a "system" context that bypassed RLS for admin operations and migration scripts. Within three months, two different services were accidentally using the system context for regular queries because a developer copied the wrong initialization code. We ripped it out and created a separate, dedicated connection pool for admin operations with its own credentials and audit logging. The extra operational overhead was worth the guarantee.
Per-Tenant Provider Configurations
Each tenant brings their own payment provider relationship. Tenant A has a Stripe account with specific webhook endpoints. Tenant B uses Adyen with a different merchant account per currency. We store these configs encrypted at rest and load them per-request:
type ProviderConfig struct {
	TenantID    string
	Provider    string // "stripe", "adyen", "braintree"
	Credentials EncryptedBlob
	WebhookURL  string
	Metadata    map[string]string // provider-specific settings
}
func (s *PaymentService) Charge(ctx context.Context, req ChargeRequest) (*ChargeResult, error) {
	tid := TenantFromContext(ctx)
	cfg, err := s.configStore.GetProvider(ctx, tid, req.ProviderHint)
	if err != nil {
		return nil, fmt.Errorf("loading provider config for tenant %s: %w", tid, err)
	}
	provider, err := s.providerFactory.Create(cfg)
	if err != nil {
		return nil, err
	}
	return provider.Charge(ctx, req)
}
The providerFactory returns a provider-specific client initialized with that tenant's credentials. We cache the decrypted credentials in memory with a 5-minute TTL — long enough to avoid hitting KMS on every request, short enough that credential rotations propagate quickly.
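The credential cache described above can be sketched roughly as follows. This is an illustration under assumptions, not our actual implementation: CredentialCache, its injectable clock, and the decrypt callback (standing in for a KMS round trip) are all hypothetical names. The important property is that the tenant ID is always part of the cache key, so one tenant's credentials can never be served to another.

```go
package main

import (
	"sync"
	"time"
)

type cacheEntry struct {
	secret    []byte
	expiresAt time.Time
}

// CredentialCache is a hypothetical in-memory TTL cache keyed by tenant and
// provider. The clock is injectable so expiry can be tested deterministically.
type CredentialCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	entries map[string]cacheEntry
}

func NewCredentialCache(ttl time.Duration, now func() time.Time) *CredentialCache {
	return &CredentialCache{ttl: ttl, now: now, entries: make(map[string]cacheEntry)}
}

// Get returns cached credentials, calling decrypt only on a miss or after
// the TTL has elapsed. Holding the lock during decrypt keeps the sketch
// simple at the cost of serializing misses.
func (c *CredentialCache) Get(tenantID, provider string, decrypt func() ([]byte, error)) ([]byte, error) {
	key := tenantID + "\x00" + provider // tenant is always part of the key
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && c.now().Before(e.expiresAt) {
		return e.secret, nil
	}
	secret, err := decrypt()
	if err != nil {
		return nil, err // failures are never cached
	}
	c.entries[key] = cacheEntry{secret: secret, expiresAt: c.now().Add(c.ttl)}
	return secret, nil
}

// demoCache exercises the cache with a fake clock and returns how many
// times the decrypt callback actually ran.
func demoCache() int {
	clock := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	c := NewCredentialCache(5*time.Minute, func() time.Time { return clock })
	calls := 0
	decrypt := func() ([]byte, error) { calls++; return []byte("decrypted-secret"), nil }

	c.Get("tenant-a", "stripe", decrypt) // miss: decrypts
	c.Get("tenant-a", "stripe", decrypt) // hit within the TTL
	clock = clock.Add(6 * time.Minute)
	c.Get("tenant-a", "stripe", decrypt) // expired: decrypts again
	c.Get("tenant-b", "stripe", decrypt) // different tenant: separate entry
	return calls
}
```

With a 5-minute TTL, the four lookups in demoCache trigger three decryptions: the second lookup is the only cache hit.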
Settlement and Reconciliation Isolation
Settlement is where tenant isolation gets really unforgiving. Our reconciliation pipeline runs as a series of tenant-scoped batch jobs. Each job:
- Pulls the settlement file from the provider (Stripe payouts, Adyen settlement reports)
- Matches each line item against our internal ledger for that specific tenant
- Flags discrepancies into a tenant-scoped exceptions queue
- Updates the tenant's ledger entries with settlement confirmation
The critical design decision: settlement jobs never share database transactions across tenants. Each tenant's reconciliation runs in its own transaction with its own RLS context. If Tenant A's reconciliation fails and rolls back, Tenant B's settlement is completely unaffected.
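The failure-isolation property can be sketched as a driver loop that runs one job per tenant and records failures per tenant instead of aborting the batch. This is a simplified illustration: reconcileFunc is a hypothetical stand-in for a job that would open its own transaction and set its own RLS context, and the tenant names are made up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// reconcileFunc stands in for one tenant's reconciliation job. In a real
// pipeline it would open its own transaction with that tenant's RLS context.
type reconcileFunc func(ctx context.Context, tenantID string) error

// ReconcileAll runs fn once per tenant and returns the failures keyed by
// tenant. One tenant's error never stops the others from running.
func ReconcileAll(ctx context.Context, tenants []string, fn reconcileFunc) map[string]error {
	failures := make(map[string]error)
	for _, tid := range tenants {
		if err := runOne(ctx, tid, fn); err != nil {
			failures[tid] = err
		}
	}
	return failures
}

// runOne converts a panic in one tenant's job into an error so the batch
// loop can continue with the next tenant.
func runOne(ctx context.Context, tid string, fn reconcileFunc) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("reconciliation for tenant %s panicked: %v", tid, r)
		}
	}()
	return fn(ctx, tid)
}

// demoReconcile runs three tenants where the first one fails, and reports
// how many failed and how many completed.
func demoReconcile() (failed, succeeded int) {
	tenants := []string{"tenant-a", "tenant-b", "tenant-c"}
	failures := ReconcileAll(context.Background(), tenants, func(ctx context.Context, tid string) error {
		if tid == "tenant-a" {
			return errors.New("settlement file missing")
		}
		succeeded++
		return nil
	})
	return len(failures), succeeded
}
```

In demoReconcile, tenant-a fails first and the other two tenants still complete, which is the behavior the per-tenant transaction boundary is meant to guarantee.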
We also partition the settlements table by tenant_id using PostgreSQL declarative partitioning. This lets us run VACUUM and heavier maintenance (VACUUM FULL, REINDEX) per tenant partition instead of against the entire table, which matters a lot when your largest tenant has 40M rows and your smallest has 2,000.
Rate Limiting and Fair Usage
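Without per-tenant rate limiting, one merchant running a batch import can starve everyone else. We use a per-tenant limiter in Redis via the redis-cell module's CL.THROTTLE command, which implements GCRA (a leaky-bucket variant that behaves much like a token bucket):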
func (rl *RateLimiter) Allow(ctx context.Context, operation string) error {
	tid := TenantFromContext(ctx)
	key := fmt.Sprintf("rl:%s:%s", tid, operation)
	limit := rl.getTenantLimit(tid, operation) // from config
	// CL.THROTTLE replies with an array of five integers;
	// the first is 0 if the request is allowed, 1 if limited.
	res, err := rl.redis.Do(ctx, "CL.THROTTLE", key,
		limit.Burst, limit.Rate, int64(limit.Period.Seconds()), 1,
	).Slice()
	if err != nil || len(res) == 0 {
		// Fail open: don't block payments because Redis is down
		return nil
	}
	if limited, ok := res[0].(int64); ok && limited == 1 {
		return ErrRateLimited
	}
	return nil
}
Limits are configurable per tenant. Our largest merchant gets 500 req/s on the charge endpoint; smaller tenants default to 50. We expose current usage in a dashboard so merchants can see when they're approaching limits and request increases.
One thing we got wrong initially: we rate-limited at the API gateway level only. That didn't account for async operations — webhook retries, settlement batch jobs, reconciliation queries. We added a second layer of rate limiting at the service level for database-heavy operations, which finally solved the noisy neighbor problem for real.
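The service-level layer can be as simple as an in-process token bucket per tenant in front of database-heavy operations. The following is a sketch under assumptions, not our production limiter: TenantLimiter, its parameters, and the fake clock in demoLimiter are hypothetical, and a single-process bucket only bounds load per service instance.

```go
package main

import (
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

// TenantLimiter is a hypothetical in-process token bucket per tenant.
type TenantLimiter struct {
	mu      sync.Mutex
	rate    float64 // tokens added per second
	burst   float64 // bucket capacity
	now     func() time.Time
	buckets map[string]*bucket
}

func NewTenantLimiter(rate, burst float64, now func() time.Time) *TenantLimiter {
	return &TenantLimiter{rate: rate, burst: burst, now: now, buckets: make(map[string]*bucket)}
}

// Allow refills the tenant's bucket for the elapsed time, then spends one
// token if available. Buckets are independent, so one tenant draining its
// bucket has no effect on any other tenant.
func (l *TenantLimiter) Allow(tenantID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	t := l.now()
	b, ok := l.buckets[tenantID]
	if !ok {
		b = &bucket{tokens: l.burst, last: t} // new tenants start full
		l.buckets[tenantID] = b
	}
	b.tokens += t.Sub(b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = t
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// demoLimiter uses 1 token/s with a burst of 2 and a fake clock.
func demoLimiter() []bool {
	clock := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	l := NewTenantLimiter(1, 2, func() time.Time { return clock })
	out := []bool{
		l.Allow("tenant-a"), // allowed: first burst token
		l.Allow("tenant-a"), // allowed: second burst token
		l.Allow("tenant-a"), // denied: bucket empty
		l.Allow("tenant-b"), // allowed: separate bucket
	}
	clock = clock.Add(time.Second)
	return append(out, l.Allow("tenant-a")) // allowed: one token refilled
}
```

Async workers (webhook retries, settlement batches) would call Allow with the job's tenant ID before touching the database and back off when it returns false.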
What I'd Do Differently
If I were starting over, I'd invest in tenant-aware observability from day one. We spent months debugging issues where metrics were aggregated across all tenants, making it impossible to tell if a latency spike was a platform problem or one tenant doing something weird. Adding tenant_id as a label to every metric, trace span, and log line should be non-negotiable from the start.
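For log lines, one way to make the tenant label hard to forget is to attach it in the logging handler rather than at every call site. Below is a minimal sketch using the standard library's log/slog; tenantHandler and loggedTenant are hypothetical names, and the context key is redeclared here only to keep the example self-contained. The same idea applies to metric labels and trace span attributes.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log/slog"
)

type tenantKey struct{}

// tenantHandler wraps another slog.Handler and stamps tenant_id on every
// record whose context carries a tenant ID.
type tenantHandler struct{ slog.Handler }

func (h tenantHandler) Handle(ctx context.Context, r slog.Record) error {
	if tid, ok := ctx.Value(tenantKey{}).(string); ok && tid != "" {
		r.AddAttrs(slog.String("tenant_id", tid))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the derived handler so the tenant stamp
// survives logger.With(...) and logger.WithGroup(...).
func (h tenantHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return tenantHandler{Handler: h.Handler.WithAttrs(attrs)}
}

func (h tenantHandler) WithGroup(name string) slog.Handler {
	return tenantHandler{Handler: h.Handler.WithGroup(name)}
}

// loggedTenant logs one line under tenant-a's context and returns the
// tenant_id field found in the emitted JSON.
func loggedTenant() string {
	var buf bytes.Buffer
	logger := slog.New(tenantHandler{Handler: slog.NewJSONHandler(&buf, nil)})
	ctx := context.WithValue(context.Background(), tenantKey{}, "tenant-a")
	logger.InfoContext(ctx, "charge created", "amount_minor", 1200)

	var line map[string]any
	if err := json.Unmarshal(buf.Bytes(), &line); err != nil {
		return ""
	}
	tid, _ := line["tenant_id"].(string)
	return tid
}
```

Call sites only need to use the context-aware methods (InfoContext, ErrorContext, and so on); the handler does the rest.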
I'd also push harder for schema-per-tenant as the default instead of shared-with-RLS. The operational overhead of managing migrations across schemas is real, but the isolation guarantees and the ability to do per-tenant maintenance windows would have saved us several late-night incidents.