GitOps for Payment Infrastructure — Why We Stopped SSHing Into Production

The 3 AM Deploy That Changed Everything

About two years ago, one of our engineers SSH'd into a production payment gateway node at 3 AM to hotfix a currency conversion bug. The fix worked — for that node. The other three nodes in the cluster kept running the broken code. For about forty minutes, we had inconsistent pricing across our fleet, and a handful of merchants got charged incorrect settlement amounts.

Nobody did anything malicious. The engineer followed the runbook. But the runbook assumed you'd remember to roll the fix across every node, and at 3 AM, after being paged out of sleep, that assumption fell apart. We spent the next week reconciling transactions and filing incident reports.

That was the week we started taking GitOps seriously.

What GitOps Actually Means (Not Just "Git + Deployments")

I've seen teams claim they're "doing GitOps" because they trigger deployments from a CI pipeline that reads from Git. That's continuous deployment — it's good, but it's not GitOps. The distinction matters, especially for payment systems.

GitOps has a specific contract: your Git repository is the single source of truth for your desired infrastructure state, and an operator running inside the cluster continuously reconciles actual state against that declared state. The key word is continuously. It's not a one-shot deploy — it's a control loop.
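To make the "control loop, not one-shot deploy" idea concrete, here is a minimal sketch of a reconciler. It is not how ArgoCD is implemented; the `Cluster` class, `reconcile_once`, and `run_loop` are hypothetical names, and the cluster is modeled as a plain dictionary of service name to image tag.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Stand-in for the live cluster: maps service name -> deployed image tag."""
    live: dict[str, str] = field(default_factory=dict)

    def apply(self, service: str, image: str) -> None:
        self.live[service] = image

    def delete(self, service: str) -> None:
        self.live.pop(service, None)


def reconcile_once(desired: dict[str, str], cluster: Cluster) -> list[str]:
    """Drive the cluster toward the desired state and return the actions taken."""
    actions = []
    # Create or update anything that differs from what Git declares.
    for service, image in desired.items():
        if cluster.live.get(service) != image:
            cluster.apply(service, image)
            actions.append(f"apply {service}={image}")
    # Remove anything running in the cluster that Git no longer declares.
    for service in list(cluster.live):
        if service not in desired:
            cluster.delete(service)
            actions.append(f"prune {service}")
    return actions


def run_loop(read_desired, cluster, interval_s=180.0, max_iterations=None):
    """Re-read the desired state and reconcile on a fixed interval.

    read_desired is any callable returning the declared state, for example
    a function that parses manifests from a fresh Git checkout (assumed here).
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        reconcile_once(read_desired(), cluster)
        done += 1
        time.sleep(interval_s)
```

The point of the sketch is the second call: running `reconcile_once` again on an already-converged cluster does nothing, while a manual change made between iterations gets corrected on the next pass. A one-shot pipeline has no second pass.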

Figure: GitOps deployment pipeline. Git push → CI build (tests + image) → registry (container image) → ArgoCD (reconcile loop) → K8s cluster (payment services). Reconciliation is continuous, not a one-shot pipeline.

This matters because in payment infrastructure, drift is not just a nuisance — it's a compliance risk. If someone kubectl edits a deployment in production and bumps a resource limit, that change exists nowhere in your audit trail. With GitOps, the operator detects the drift and either reverts it or alerts you. Every change flows through a pull request with reviews, approvals, and a permanent record.

Why Payment Systems Specifically

You could make the case for GitOps in any production environment. But payment infrastructure has characteristics that make it especially compelling:

PCI DSS requires change audit trails. Requirement 6.5.1 and the broader change management controls in PCI DSS v4.0 demand that every production change be documented and approved by an authorized party. Git history, combined with required pull request approvals, gives you most of that record as a side effect of normal work.
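Consistency across nodes is non-negotiable. A payment gateway processing $2M/day cannot have nodes running different versions. The reconciliation loop keeps driving every node toward the same declared version, so a partial rollout gets corrected instead of lingering.
Rollbacks need to be fast and reliable. When a bad deploy starts declining valid cards, you need to revert in minutes, not tens of minutes. git revert plus auto-sync gets you there: the rollback is just another commit, and the operator applies it on its next sync.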
Separation of duties. The person who writes the code shouldn't be the same person who pushes it to production. Combined with branch protection and required reviews, GitOps enforces this structurally: the merge approval is the deployment approval.

Aspect	Traditional Deploy	GitOps Deploy
Trigger	CI pushes to cluster	Operator pulls from Git
Drift Detection	None (manual checks)	Continuous reconciliation
Audit Trail	CI logs (often ephemeral)	Git history (permanent)
Rollback	Re-run old pipeline	`git revert` + auto-sync
Cluster Access	CI needs cluster creds	Operator runs in-cluster
Secret Management	Injected by CI env vars	Sealed Secrets / SOPS

Our Setup: ArgoCD + Kustomize

We evaluated both ArgoCD and Flux. Both are solid CNCF projects. We went with ArgoCD mostly because the UI made it easier to onboard the rest of the team — being able to visualize the sync state of every payment microservice in a dashboard was a big win for our on-call engineers. Flux is arguably more "Kubernetes-native" in its design, and if your team is already comfortable with CRDs and controllers, it's a great choice.

Our repo structure looks roughly like this:

infra-manifests/
├── base/
│   ├── payment-gateway/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml
│   ├── transaction-processor/
│   └── settlement-service/
├── overlays/
│   ├── staging/
│   │   └── kustomization.yaml
│   └── production/
│       ├── kustomization.yaml
│       └── patches/
│           ├── replicas.yaml
│           └── resource-limits.yaml
└── argocd/
    └── applications.yaml

Kustomize overlays let us keep a single base definition and patch per environment. Production gets higher replica counts, stricter resource limits, and tighter PodDisruptionBudgets. Staging mirrors production topology but with smaller instances. The key discipline: nothing gets applied to the cluster except what's in this repo.
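The base-plus-patches idea can be sketched as a recursive dictionary merge. This is a deliberately simplified illustration, not Kustomize's actual algorithm: real strategic-merge patches also handle lists by merge key, and a real Deployment keeps resource limits under the container spec. The function names and all the values below are made up for the example.

```python
import copy


def apply_patch(base: dict, patch: dict) -> dict:
    """Return a copy of base with patch merged in.

    Nested dicts are merged key by key; any other value in the patch
    replaces the value in the base. The inputs are never mutated.
    """
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render(base: dict, patches: list[dict]) -> dict:
    """Apply an environment's patches to the shared base, in order."""
    result = base
    for patch in patches:
        result = apply_patch(result, patch)
    return result


# Hypothetical, flattened stand-in for base/payment-gateway/deployment.yaml.
BASE = {
    "kind": "Deployment",
    "metadata": {"name": "payment-gateway"},
    "spec": {
        "replicas": 2,
        "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
    },
}

# Hypothetical stand-ins for patches/replicas.yaml and patches/resource-limits.yaml.
PRODUCTION_PATCHES = [
    {"spec": {"replicas": 6}},
    {"spec": {"resources": {"limits": {"cpu": "2", "memory": "2Gi"}}}},
]

production = render(BASE, PRODUCTION_PATCHES)
```

Because the patches only name the fields that differ, everything else (the name, the kind, any field added to the base later) flows into every environment automatically.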

Handling Secrets

The obvious question with GitOps is: "What about secrets?" You can't commit Stripe API keys or database passwords to Git, even in a private repo. We use Sealed Secrets: you encrypt the secret client-side with the public key of the in-cluster Sealed Secrets controller, commit the sealed version, and the controller decrypts it in-cluster. SOPS (originally a Mozilla project) with age encryption is another solid option, especially if you're already using it for other config.

Tip: Set up a pre-commit hook that rejects any file matching common secret patterns, for example a *secret*.yaml file whose kind is not SealedSecret. It's a simple safety net that has saved us more than once from accidentally committing plaintext credentials.
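A hook along those lines might look like the sketch below. It assumes the hook is handed the staged file paths as arguments, and it uses a plain regex on top-level `kind:` lines rather than a YAML parser to stay dependency-free. The function names are hypothetical, and the check is intentionally conservative: a secret-looking file with no recognizable kind is rejected too.

```python
import fnmatch
import re
import sys
from pathlib import Path

# File-name patterns that suggest a Kubernetes secret manifest.
SECRET_NAME_PATTERNS = ("*secret*.yaml", "*secret*.yml")

# Matches unindented "kind: Something" lines, one per YAML document.
KIND_LINE = re.compile(r"^kind:[ \t]*['\"]?(\w+)['\"]?[ \t]*$", re.MULTILINE)


def is_suspect_name(path: str) -> bool:
    """True if the file name looks like it holds a secret."""
    name = Path(path).name.lower()
    return any(fnmatch.fnmatch(name, pattern) for pattern in SECRET_NAME_PATTERNS)


def violates_policy(path: str, text: str) -> bool:
    """True if a secret-looking file contains anything other than SealedSecret."""
    if not is_suspect_name(path):
        return False
    kinds = KIND_LINE.findall(text)
    # No kind found at all is treated as a violation to stay on the safe side.
    return not kinds or any(kind != "SealedSecret" for kind in kinds)


def main(paths: list[str]) -> int:
    """Return 1 (block the commit) if any staged file violates the policy."""
    rejected = []
    for path in paths:
        if not is_suspect_name(path):
            continue
        try:
            text = Path(path).read_text(encoding="utf-8")
        except OSError:
            continue  # e.g. the file was deleted in this commit
        if violates_policy(path, text):
            rejected.append(path)
    for path in rejected:
        print(f"refusing to commit {path}: expected kind SealedSecret", file=sys.stderr)
    return 1 if rejected else 0
```

To wire it up, the hook's entry point would call `sys.exit(main(sys.argv[1:]))`. A name-based check like this only catches the obvious mistakes, so it complements a dedicated secret scanner rather than replacing one.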

Drift Detection in Practice

This is where GitOps really earns its keep in payment infrastructure. ArgoCD polls the Git repo every three minutes by default and compares it against the cluster state. When it detects drift, you have two options: auto-sync with selfHeal enabled (revert automatically) or manual sync (alert and wait for human approval).

We use auto-sync for most services but require manual sync for our core payment gateway. The reasoning: if someone makes an emergency change to the gateway (say, scaling up during a traffic spike), we don't want ArgoCD immediately reverting it. Instead, it flags the drift, the on-call engineer gets a Slack alert, and they either commit the change to Git to make it permanent or acknowledge that ArgoCD should revert it.

Warning: If you enable auto-sync with selfHeal, make sure your team understands that any manual kubectl change will be reverted. We had an engineer spend twenty confused minutes wondering why his manual scaling kept getting undone. Document this clearly in your runbooks.
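The split between self-healing services and the manually synced gateway boils down to a diff plus a per-service policy. Here is a rough sketch of that decision, with hypothetical names (`SyncPolicy`, `diff_fields`, `decide`) and manifests reduced to nested dictionaries. It models the behavior described above, not ArgoCD's internal diffing.

```python
from dataclasses import dataclass
from enum import Enum


class SyncPolicy(Enum):
    AUTO_SELF_HEAL = "auto"  # revert drift automatically
    MANUAL = "manual"        # flag drift and wait for a human


@dataclass(frozen=True)
class DriftDecision:
    service: str
    drifted_fields: tuple[str, ...]
    action: str  # "none", "revert", or "alert"


def diff_fields(desired: dict, live: dict, prefix: str = "") -> list[str]:
    """Return dotted paths where the live state differs from the desired state."""
    paths = []
    for key in sorted(set(desired) | set(live)):
        path = f"{prefix}{key}"
        want, have = desired.get(key), live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            paths.extend(diff_fields(want, have, prefix=f"{path}."))
        elif want != have:
            paths.append(path)
    return paths


def decide(service: str, desired: dict, live: dict, policy: SyncPolicy) -> DriftDecision:
    """Pick what to do about drift for one service under its sync policy."""
    drift = tuple(diff_fields(desired, live))
    if not drift:
        action = "none"
    elif policy is SyncPolicy.AUTO_SELF_HEAL:
        action = "revert"
    else:
        action = "alert"
    return DriftDecision(service, drift, action)
```

For example, if an on-call engineer scales the gateway from 4 to 8 replicas by hand, `decide` under `SyncPolicy.MANUAL` reports `replicas` as drifted and returns "alert", whereas the same change on a self-healing service returns "revert".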

92% reduction in deployment incidents
4 min average rollback time (was 25 min)
0 SSH sessions to production (6 months)

Lessons Learned

After running this setup for over a year across three payment service clusters, here's what I'd tell someone starting out:

Start with one non-critical service. We piloted GitOps on our merchant notification service before touching the payment gateway. It let us work out the kinks — sync policies, secret management, RBAC — without risking transaction processing.
Lock down kubectl access aggressively. GitOps only works if the Git repo is actually the source of truth. If half your team still has write access to production namespaces, you'll constantly end up with drift. We moved to read-only cluster access for everyone except break-glass emergency roles.
Invest in your PR review process. Since every production change is now a pull request, your review process is your change management process. We require two approvals for production overlay changes, with at least one from the platform team. It sounds heavy, but it's faster than the old change advisory board meetings.
Monitor the operator itself. ArgoCD is infrastructure too. We run it in a dedicated namespace with its own alerting. If ArgoCD goes down, you lose drift detection — and you might not notice until something goes wrong.
Keep app config and infra config in separate repos. We tried a monorepo initially. It got noisy fast. Application manifests change frequently; cluster-level infrastructure (ingress controllers, cert-manager, monitoring stack) changes rarely. Separate repos with separate sync policies made life much simpler.
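The review rule from the PR lesson above (two approvals for production overlay changes, at least one from the platform team) is simple enough to express as a check. The sketch below is illustrative only: in practice this usually lives in branch protection or code-owner settings rather than custom code. The team members are placeholders, and ignoring the author's own approval is an assumption added here in the spirit of separation of duties.

```python
from dataclasses import dataclass

# Path prefix taken from the repo layout shown earlier.
PRODUCTION_PREFIX = "overlays/production/"

# Placeholder membership list; a real setup would read this from the Git host.
PLATFORM_TEAM = frozenset({"alice", "bob"})


@dataclass(frozen=True)
class PullRequest:
    author: str
    changed_paths: tuple[str, ...]
    approvers: frozenset[str]


def meets_production_rule(pr: PullRequest, platform_team: frozenset[str] = PLATFORM_TEAM) -> bool:
    """Check the two-approval rule for changes under the production overlay.

    Pull requests that do not touch the production overlay are out of scope
    for this rule and pass trivially.
    """
    touches_production = any(p.startswith(PRODUCTION_PREFIX) for p in pr.changed_paths)
    if not touches_production:
        return True
    # Assumption: an author's approval of their own change does not count.
    approvers = pr.approvers - {pr.author}
    return len(approvers) >= 2 and bool(approvers & platform_team)
```

Since the merge is the deployment, a check like this is effectively the change approval gate, which is why it is worth making explicit and testable.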

Is It Worth It?

The migration took us about six weeks for the full payment platform: two weeks for the tooling setup and four weeks to migrate services incrementally. The upfront cost was real. We had to write Kustomize overlays for thirty-something services, set up Sealed Secrets, configure RBAC, and train the team on the new workflow.

But the payoff has been clear. Our PCI audits are smoother because we can point auditors at Git history instead of scrambling to reconstruct who changed what. Deployments went from a source of anxiety to a non-event. And nobody has SSH'd into a production payment node in over six months.

If you're running payment infrastructure on Kubernetes and you're still doing push-based deployments, GitOps is worth the investment. Not because it's trendy, but because the properties it gives you (auditability, consistency, drift detection, fast rollback) are exactly what payment systems demand.
