Why Rolling Deployments Are Risky for Payment Services
Rolling deployments work fine for most web apps. You gradually replace old pods with new ones, and if something breaks, you tell Kubernetes to roll back. The problem is the "gradually" part. During a rolling update, you have old and new code running simultaneously, handling the same payment flows.
For a stateless API that serves product pages, this is fine. For a payment authorization service, it's a minefield. Consider what happens when a customer starts a checkout on the old version (which creates an authorization hold) and the capture request hits the new version (which expects a different payload format). The capture fails silently, the hold expires in 7 days, and the merchant never gets paid.
We hit exactly this scenario. A schema change to how we stored authorization references meant the new code couldn't find authorizations created by the old code. Twenty-three transactions got stuck in limbo. Blue-green eliminates this class of bug entirely — all traffic hits one version or the other, never both.
Two properties make this work:
- A single traffic switch point
- Both environments read/write the same database
The Deployment Pipeline: Step by Step
Every payment service release follows the same sequence: deploy the new version to the idle environment, run the pre-switch verification against it, flip traffic, and keep the old environment warm for instant rollback. The whole process takes about 15 minutes, and most of it is automated.
The traffic switch is the key moment. Here's the nginx config that makes it work:
```nginx
# /etc/nginx/conf.d/payment-service.conf

upstream payment_blue {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}

upstream payment_green {
    server 10.0.2.10:8080;
    server 10.0.2.11:8080;
    server 10.0.2.12:8080;
}

# This single line controls which environment is live.
# Switch by changing "payment_blue" to "payment_green"
# and running: nginx -s reload
server {
    listen 443 ssl;
    server_name payments.internal;

    location / {
        proxy_pass http://payment_blue;  # <-- THE SWITCH
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        proxy_next_upstream error timeout http_502;
    }
}
```
Why nginx reload and not DNS? DNS-based switching (changing A records) has TTL propagation delays. Even with a 30-second TTL, some clients cache longer. An nginx reload takes effect in under 2 seconds and affects all new connections immediately. For payment traffic, those extra seconds of propagation delay are unacceptable.
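The forward switch itself is one `sed` plus a reload. Here's a runnable sketch of the flip; the file path is a local stand-in for illustration, and the validate/reload steps are left as a comment since they need a live nginx:

```shell
#!/bin/bash
# Sketch of the Blue -> Green flip. In production this edits
# /etc/nginx/conf.d/payment-service.conf; here we use a local stand-in file.
set -euo pipefail

CONF="payment-service.conf"
# Stand-in for the live config's switch line
printf 'proxy_pass http://payment_blue; # <-- THE SWITCH\n' > "$CONF"

# Flip the upstream; using | as the sed delimiter avoids escaping the slashes
sed -i 's|proxy_pass http://payment_blue|proxy_pass http://payment_green|' "$CONF"
grep 'proxy_pass' "$CONF"

# In production, validate before reloading so a typo can't take down traffic:
#   nginx -t && nginx -s reload
```

Running `nginx -t` before the reload is cheap insurance: a malformed config makes `nginx -s reload` a no-op at best, and you want the switch script to fail loudly rather than silently leave traffic on Blue.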
The Hard Part: Database Migrations
Blue-green is straightforward for stateless services. It gets complicated when both environments share a database — which they must for payment systems, because you can't have two separate sources of truth for financial data.
The rule: every database migration must be backward-compatible. The old version of the code must still work after the migration runs. This means:
- Adding a column? Fine — old code ignores it.
- Removing a column? Do it in two deploys. First deploy: stop reading/writing the column. Second deploy (next release): drop the column.
- Renaming a column? Never. Add the new column, backfill, migrate reads, then drop the old one across multiple releases.
- Changing a column type? Same as rename — add new, migrate, drop old.
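Spelled out as SQL, the rename rule above looks like this. The table and column names here are illustrative, not from our real schema:

```shell
# Expand/contract sequence for the "never rename" rule, written out as Postgres
# DDL. Names are made up for illustration.
PLAN=$(cat <<'SQL'
-- Release N (expand): add the new column; old code ignores it
ALTER TABLE payments ADD COLUMN auth_reference text;

-- Release N (backfill): copy data in batches while both versions run
UPDATE payments SET auth_reference = legacy_auth_ref WHERE auth_reference IS NULL;

-- Release N+1: application reads and writes auth_reference only

-- Release N+2 (contract): drop the old column once nothing references it
ALTER TABLE payments DROP COLUMN legacy_auth_ref;
SQL
)
echo "$PLAN"
```

Each release in the sequence is individually safe for both the live and the idle environment, which is the whole point: at no moment does either version depend on schema it can't see.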
```ruby
# Migration that's safe for blue-green:
# Both v2.3 (Blue) and v2.4 (Green) can work with this schema
class AddSettlementBatchIdToPayments < ActiveRecord::Migration[7.1]
  # Concurrent index builds can't run inside a transaction
  disable_ddl_transaction!

  def change
    # Adding a nullable column is always safe
    add_column :payments, :settlement_batch_id, :bigint, null: true
    add_index :payments, :settlement_batch_id, algorithm: :concurrently
  end
end
```
The migration that bit us: Early on, a developer added a NOT NULL constraint in the same deploy that started writing to the column. Blue (old code) didn't write to the new column, so every Blue insert failed with a constraint violation. We lost 12 minutes of payment records before catching it. Now we enforce a strict rule: NOT NULL constraints are always a separate deploy from the column addition.
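What we do now instead, sketched as Postgres DDL (names illustrative): the column ships nullable first, and enforcement comes in a later deploy. Postgres's NOT VALID / VALIDATE pair keeps that enforcement step from taking a long table lock:

```shell
# Two-deploy NOT NULL, as Postgres DDL. Illustrative names, not our real schema.
cat > not-null-plan.sql <<'SQL'
-- Deploy 1: ship the column nullable; Blue and Green both keep inserting
ALTER TABLE payments ADD COLUMN settlement_batch_id bigint;

-- Deploy 2, after a backfill and after confirming no code path writes NULL:
-- NOT VALID adds the constraint without scanning the table under lock...
ALTER TABLE payments ADD CONSTRAINT settlement_batch_id_not_null
    CHECK (settlement_batch_id IS NOT NULL) NOT VALID;
-- ...and VALIDATE scans afterwards without blocking concurrent writes.
ALTER TABLE payments VALIDATE CONSTRAINT settlement_batch_id_not_null;
SQL
cat not-null-plan.sql
```

On Postgres 12+ you can then run `SET NOT NULL`, which uses the validated CHECK constraint to skip the full-table scan, and drop the now-redundant CHECK.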
Health Checks: Don't Switch Until You're Sure
A basic HTTP 200 health check isn't enough for payment services. Our pre-switch verification script checks five things before allowing the traffic switch:
```bash
#!/bin/bash
# pre-switch-verify.sh — Run against the idle (Green) environment
set -euo pipefail

GREEN_HOST="10.0.2.10:8080"
FAILURES=0

# 1. Basic health
echo "Checking health endpoint..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "http://$GREEN_HOST/health")
[[ "$HTTP_CODE" == "200" ]] || { echo "FAIL: health returned $HTTP_CODE"; FAILURES=$((FAILURES+1)); }

# 2. Database connectivity
echo "Checking database..."
DB_CHECK=$(curl -s "http://$GREEN_HOST/health/db" | jq -r '.status')
[[ "$DB_CHECK" == "ok" ]] || { echo "FAIL: database check returned $DB_CHECK"; FAILURES=$((FAILURES+1)); }

# 3. Redis connectivity
echo "Checking Redis..."
REDIS_CHECK=$(curl -s "http://$GREEN_HOST/health/redis" | jq -r '.status')
[[ "$REDIS_CHECK" == "ok" ]] || { echo "FAIL: Redis check returned $REDIS_CHECK"; FAILURES=$((FAILURES+1)); }

# 4. Synthetic payment test (test merchant, $0.01 auth + void)
echo "Running synthetic payment..."
SYNTH=$(curl -s -X POST "http://$GREEN_HOST/internal/synthetic-payment" \
  -H "Content-Type: application/json" \
  -d '{"merchant_id":"test_merchant","amount_cents":1,"currency":"USD"}')
SYNTH_STATUS=$(echo "$SYNTH" | jq -r '.status')
[[ "$SYNTH_STATUS" == "authorized" ]] || { echo "FAIL: synthetic payment returned $SYNTH_STATUS"; FAILURES=$((FAILURES+1)); }

# 5. Schema version check
echo "Checking schema version..."
SCHEMA=$(curl -s "http://$GREEN_HOST/health/schema" | jq -r '.version')
[[ -n "$SCHEMA" && "$SCHEMA" != "null" ]] || { echo "FAIL: could not read schema version"; FAILURES=$((FAILURES+1)); }
echo "Schema version: $SCHEMA"

if [[ $FAILURES -gt 0 ]]; then
  echo "BLOCKED: $FAILURES checks failed. Do NOT switch traffic."
  exit 1
fi
echo "ALL CHECKS PASSED. Safe to switch."
```
Instant Rollback: The Killer Feature
This is why blue-green is worth the infrastructure cost. When something goes wrong after a switch, rollback is literally one command:
```bash
# Rollback: switch traffic back to Blue
sed -i 's/proxy_pass http:\/\/payment_green/proxy_pass http:\/\/payment_blue/' \
  /etc/nginx/conf.d/payment-service.conf
nginx -s reload
echo "Rolled back to Blue at $(date)"
```
No redeployment. No waiting for pods to spin up. No praying that the old Docker image is still in the registry. Blue is already running, warm, and ready. The switch takes under 2 seconds.
Compare this to a rolling deployment rollback: you need to trigger a new deployment with the old image, wait for pods to pull and start (60-90 seconds minimum), and hope the old code can handle whatever state the new code left behind. For payment services, those 60-90 seconds are an eternity.
The Incident That Proved It Was Worth It
Three months after adopting blue-green, we deployed a change to our currency conversion logic. The code passed all tests, but there was an edge case with JPY (Japanese Yen) — a zero-decimal currency. The new code was dividing by 100 to convert from cents, but JPY doesn't use cents. Every JPY transaction was being authorized for 1/100th of the correct amount.
Our monitoring caught the anomaly within 90 seconds — the average transaction amount for JPY merchants dropped by two orders of magnitude. The on-call engineer ran the rollback script. Total time from deploy to rollback: 3 minutes. Total affected transactions: 4 (all JPY, all for a single merchant). We contacted the merchant, voided the incorrect authorizations, and reprocessed them on the rolled-back version.
Without blue-green, this would have been a rolling deployment affecting all traffic for the 10+ minutes it would take to notice, diagnose, and redeploy. At our JPY volume, that's roughly 200 transactions at the wrong amount. The cleanup alone would have taken days.
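The underlying fix, for the curious: treat the minor-unit exponent as per-currency data instead of hard-coding a division by 100. A minimal sketch of the idea (the currency list is abbreviated for illustration; ISO 4217 defines the real exponent for each currency):

```shell
# Convert an amount in minor units to a display amount, honouring zero-decimal
# currencies. Illustrative subset of currencies only.
to_major_units() {
  local amount="$1" currency="$2"
  case "$currency" in
    JPY|KRW|VND) echo "$amount" ;;   # zero-decimal: no minor unit, no division
    *) awk -v a="$amount" 'BEGIN { printf "%.2f\n", a / 100 }' ;;
  esac
}

to_major_units 150000 JPY   # 150000 (yen, not 1500.00)
to_major_units 150000 USD   # 1500.00
```

The same table-driven approach is what lets a test suite assert on JPY, KRW, and friends explicitly, instead of every currency silently sharing the two-decimal code path.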
Cost Considerations
The obvious downside: you're running two environments. That's roughly double the compute cost for the payment service. In practice, it's about 30% more because the idle environment runs at minimum replica count (we scale it down to 1 pod per service when idle, then scale up before a deploy).
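The ~30% figure falls straight out of the replica counts. Each environment runs three instances when live (matching the three servers per nginx upstream above), and the idle side scales down to one:

```shell
# Back-of-envelope for the "about 30% more" claim.
# 3 live pods per service (as in the nginx upstreams), idle scaled down to 1.
LIVE=3; IDLE=1
OVERHEAD=$(awk -v l="$LIVE" -v i="$IDLE" 'BEGIN { printf "%.0f", 100 * i / l }')
echo "idle overhead: ${OVERHEAD}% of live compute"   # -> 33%
```

Your ratio will differ with your replica counts; the point is that the idle environment costs a fraction of the live one, not a full duplicate.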
For our team, the math is simple. The infrastructure overhead is about $800/month. One prevented incident saves us $5K-$50K in refunds, merchant escalations, and engineering time. It paid for itself in the first month.
Cost optimization tip: Use spot/preemptible instances for the idle environment. It doesn't need to be highly available when it's not serving traffic. Scale it up to on-demand instances 5 minutes before a deploy, run the switch, then scale the now-idle environment back down to spot instances.
Lessons Learned
- Automate the switch, but keep a human in the loop. Our deploy script runs health checks automatically, but the actual traffic switch requires a human to type CONFIRM. Fully automated switches are fine once you trust the process — we weren't there yet after the database migration incident.
- Keep both environments on the same infrastructure. Same VPC, same database, same Redis cluster. The only difference should be the application code. If you diverge infrastructure, you're testing two things at once.
- Practice rollbacks regularly. We do a "fire drill" rollback once a month during low-traffic hours. The team needs to be comfortable with the process so it's muscle memory during a real incident.
- Tag everything with the environment name. Logs, metrics, traces — all tagged with env:blue or env:green. When you're debugging post-switch, you need to know which environment generated which data.
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Infrastructure costs and metrics mentioned are illustrative — your values will depend on your scale and provider. Always verify with official documentation.