Mutual TLS Between Payment Microservices

The Problem With One-Way TLS

Standard TLS — the kind you get with HTTPS — solves one problem well: the client verifies the server's identity. Your browser checks that stripe.com is actually Stripe, not someone intercepting traffic. The connection is encrypted, the server is authenticated, and everyone feels safe.

But inside a microservices cluster, one-way TLS has a blind spot. The server has no idea who the client is. Any service that can reach the network endpoint can make requests. If an attacker compromises a single container — say, a log aggregator or a metrics exporter — they can start calling your payment authorization service, your ledger, your settlement engine. The TLS handshake succeeds because the client never had to prove its identity in the first place.

In a payment platform, that's not a theoretical risk. It's the difference between a contained breach and a wire transfer to an account you don't control.

Diagram: One-Way TLS vs. Mutual TLS

One-way TLS (standard): the server presents its certificate to the client. No client certificate is sent, so the server cannot verify the client's identity.
Mutual TLS (mTLS): the server presents its certificate and the client presents its own, so both sides are verified.

How mTLS Actually Works

The mTLS handshake extends the standard TLS handshake with one critical addition: the server requests a certificate from the client, and the client provides one.

1. The client initiates a connection and the server responds with its certificate (same as regular TLS).
2. The server also sends a CertificateRequest message. This is the mTLS part. It tells the client: "I need to see your ID too."
3. The client sends its own certificate, signed by a Certificate Authority the server trusts, plus a CertificateVerify signature proving it holds the matching private key.
4. The server validates the client certificate against its trusted CA pool. If the cert is expired, signed by an unknown CA, or (where revocation checking is configured) revoked, the handshake fails.
5. Both sides derive session keys and start exchanging encrypted data.

The result: every connection between services is encrypted and both endpoints are cryptographically authenticated. A compromised container without a valid certificate simply cannot establish a connection to your payment services.
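To see that claim in action without any infrastructure, here is a minimal in-memory sketch. It creates a throwaway CA, issues one server and one client certificate, starts a test server that requires client certs, and sends one request with a certificate and one without. The names (demo-internal-ca, settlement-service, payment-gateway), the one-hour lifetimes, and the use of net/http/httptest are assumptions for illustration only, not our production setup.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// issue creates a certificate for cn. A nil parent yields a self-signed CA;
// otherwise parent signs a leaf usable for client and server auth on 127.0.0.1.
// It panics on failure, which is acceptable only because this is a demo.
func issue(cn string, serial int64, parent *tls.Certificate) tls.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	signerCert, signerKey := tmpl, any(key) // self-signed unless a parent is given
	if parent == nil {
		tmpl.IsCA, tmpl.BasicConstraintsValid = true, true
		tmpl.KeyUsage = x509.KeyUsageCertSign
	} else {
		signerCert, signerKey = parent.Leaf, parent.PrivateKey
		tmpl.KeyUsage = x509.KeyUsageDigitalSignature
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth}
		tmpl.IPAddresses = []net.IP{net.ParseIP("127.0.0.1")}
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signerCert, &key.PublicKey, signerKey)
	if err != nil {
		panic(err)
	}
	leaf, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key, Leaf: leaf}
}

// request performs one GET, presenting certs (if any) as the client identity.
func request(url string, roots *x509.CertPool, certs ...tls.Certificate) error {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{
			RootCAs: roots, Certificates: certs, MinVersion: tls.VersionTLS13,
		}},
	}
	defer client.CloseIdleConnections()
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// runDemo reports whether a request succeeds with and without a client cert.
func runDemo() (withCert, withoutCert bool) {
	ca := issue("demo-internal-ca", 1, nil)
	server := issue("settlement-service", 2, &ca)
	client := issue("payment-gateway", 3, &ca)
	pool := x509.NewCertPool()
	pool.AddCert(ca.Leaf)

	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "hello %s", r.TLS.PeerCertificates[0].Subject.CommonName)
	}))
	srv.TLS = &tls.Config{
		Certificates: []tls.Certificate{server},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert,
		MinVersion:   tls.VersionTLS13,
	}
	srv.StartTLS()
	defer srv.Close()

	return request(srv.URL, pool, client) == nil, request(srv.URL, pool) == nil
}

func main() {
	withCert, withoutCert := runDemo()
	fmt.Println("request with client cert succeeded:", withCert)
	fmt.Println("request without client cert succeeded:", withoutCert)
}
```

The second request never reaches the handler. One detail worth knowing: with TLS 1.3 the client may only learn of the rejection when it tries to read the response, so the error can surface on the request rather than on the dial.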

Certificate Management — The Hard Part

The mTLS handshake itself is straightforward. What kills teams is certificate management. You need an internal Certificate Authority, a distribution mechanism, rotation automation, and a revocation strategy. Get any of these wrong and you'll be paged at 3 AM because a cert expired and your payment pipeline is down.

Internal CA

Don't use your public-facing CA for internal service certs. Stand up a dedicated internal CA — or better, use an intermediate CA chained to an offline root. Tools like cfssl, HashiCorp Vault's PKI secrets engine, or step-ca from Smallstep make this manageable. We went with Vault because we were already using it for secrets, and the PKI backend lets you issue certs programmatically with TTLs as short as a few hours.

Short-Lived Certificates

Long-lived certificates are a liability. If a cert with a one-year expiry gets compromised, you have a one-year window of exposure unless you catch it and revoke it manually. Short-lived certs — 24 hours or less — flip the model. Even if a cert leaks, it's useless by tomorrow. The tradeoff is that you need automated renewal, which brings us to rotation.

Rotation Strategy

We rotate certs every 12 hours with a 24-hour validity window. That overlap is intentional — it gives you a buffer if the renewal process hiccups. The rotation runs as a sidecar process that fetches a new cert from Vault, writes it to a shared volume, and sends a signal to the application to reload its TLS config. No restarts, no downtime.
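One way to get the "no restarts" behavior in Go is to point tls.Config.GetCertificate at a small reloader instead of a fixed certificate. The sketch below is an assumption about how such a reloader could look, not our exact code. To stay runnable without files on disk, the load function generates a self-signed cert in memory; in a real service it would wrap tls.LoadX509KeyPair on the paths the sidecar writes to.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"math/big"
	"sync"
	"time"
)

// certReloader serves whichever key pair was most recently loaded.
type certReloader struct {
	mu   sync.RWMutex
	cert *tls.Certificate
	load func() (tls.Certificate, error) // in production: wraps tls.LoadX509KeyPair
}

// reload fetches a fresh key pair; on failure the previous pair stays active.
func (r *certReloader) reload() error {
	pair, err := r.load()
	if err != nil {
		return err
	}
	// Parse the leaf once so callers can inspect it (expiry, serial, SANs).
	if pair.Leaf, err = x509.ParseCertificate(pair.Certificate[0]); err != nil {
		return err
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cert = &pair
	return nil
}

// getCertificate has the signature tls.Config.GetCertificate expects, so each
// new handshake picks up the current pair without a restart.
func (r *certReloader) getCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

// selfSigned stands in for a freshly rotated cert with the given serial number.
func selfSigned(serial int64) (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

// demo rotates once and reports the serial served before and after the reload.
func demo() (before, after int64, err error) {
	var serial int64
	r := &certReloader{load: func() (tls.Certificate, error) {
		serial++ // pretend the sidecar rewrote the files since the last load
		return selfSigned(serial)
	}}
	// The server's tls.Config points at the reloader instead of a fixed cert.
	cfg := &tls.Config{GetCertificate: r.getCertificate, MinVersion: tls.VersionTLS13}

	if err = r.reload(); err != nil {
		return 0, 0, err
	}
	first, _ := cfg.GetCertificate(nil) // getCertificate never returns an error
	if err = r.reload(); err != nil {   // what the signal handler would call
		return 0, 0, err
	}
	second, _ := cfg.GetCertificate(nil)
	return first.Leaf.SerialNumber.Int64(), second.Leaf.SerialNumber.Int64(), nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Printf("serial before reload: %d, after reload: %d\n", before, after)
}
```

Existing connections keep the certificate they negotiated; only new handshakes see the rotated one. On the client side, tls.Config.GetClientCertificate plays the same role.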

Certificate Lifecycle: Generate → Deploy → Monitor → Rotate → Revoke

Generate: CA issues cert with short TTL
Deploy: Sidecar writes to shared volume
Monitor: Alert on expiry < threshold
Rotate: Renew before expiry with overlap
Revoke: CRL or OCSP for compromised certs

Tip: If you're on Kubernetes, cert-manager with a Vault issuer handles most of this automatically. It watches certificate resources, renews before expiry, and stores certs as Kubernetes secrets. We cut our cert-related incidents by about 80% after adopting it.

Implementation in Go — The Real Code

Most mTLS tutorials show you a five-line snippet and call it done. Here's what a production-grade setup actually looks like using Go's crypto/tls package. This is close to what we run for inter-service communication between our payment gateway and the ledger service.

mTLS Server

package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    // Load the CA cert that signed client certificates
    caCert, err := os.ReadFile("/etc/certs/ca.pem")
    if err != nil {
        log.Fatalf("failed to read CA cert: %v", err)
    }
    caPool := x509.NewCertPool()
    if !caPool.AppendCertsFromPEM(caCert) {
        log.Fatal("failed to parse CA cert")
    }

    // Load the server's own certificate and private key
    serverCert, err := tls.LoadX509KeyPair(
        "/etc/certs/server.pem",
        "/etc/certs/server-key.pem",
    )
    if err != nil {
        log.Fatalf("failed to load server cert: %v", err)
    }

    tlsConfig := &tls.Config{
        Certificates: []tls.Certificate{serverCert},
        ClientCAs:    caPool,
        ClientAuth:   tls.RequireAndVerifyClientCert, // This is the mTLS part
        MinVersion:   tls.VersionTLS13,
    }

    mux := http.NewServeMux()
    mux.HandleFunc("/api/v1/settle", func(w http.ResponseWriter, r *http.Request) {
        // The client's identity is in the verified certificate
        cn := r.TLS.PeerCertificates[0].Subject.CommonName
        fmt.Fprintf(w, "settlement request accepted from: %s", cn)
    })

    server := &http.Server{
        Addr:      ":8443",
        Handler:   mux,
        TLSConfig: tlsConfig,
    }

    log.Println("mTLS server listening on :8443")
    log.Fatal(server.ListenAndServeTLS("", ""))
}

mTLS Client

package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
)

func newMTLSClient(caCertPath, clientCertPath, clientKeyPath string) (*http.Client, error) {
    caCert, err := os.ReadFile(caCertPath)
    if err != nil {
        return nil, fmt.Errorf("read CA cert: %w", err)
    }
    caPool := x509.NewCertPool()
    if !caPool.AppendCertsFromPEM(caCert) {
        return nil, fmt.Errorf("parse CA cert failed")
    }

    clientCert, err := tls.LoadX509KeyPair(clientCertPath, clientKeyPath)
    if err != nil {
        return nil, fmt.Errorf("load client cert: %w", err)
    }

    return &http.Client{
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{
                Certificates: []tls.Certificate{clientCert},
                RootCAs:      caPool,
                MinVersion:   tls.VersionTLS13,
            },
        },
    }, nil
}

func main() {
    client, err := newMTLSClient(
        "/etc/certs/ca.pem",
        "/etc/certs/client.pem",
        "/etc/certs/client-key.pem",
    )
    if err != nil {
        log.Fatalf("failed to create mTLS client: %v", err)
    }

    resp, err := client.Get("https://settlement-service:8443/api/v1/settle")
    if err != nil {
        log.Fatalf("request failed: %v", err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalf("failed to read response body: %v", err)
    }
    fmt.Println(string(body))
}

The key line is ClientAuth: tls.RequireAndVerifyClientCert. Without it, the server accepts any connection. With it, the server demands a valid client certificate before the HTTP handler ever runs. No cert, no connection — the handshake fails at the TLS layer, before your application code even sees the request.
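Requiring a certificate answers "is this caller one of ours?" but not "is this caller allowed to hit this endpoint?". A small middleware can layer that check on top of the verified identity. The sketch below assumes service identities are carried as DNS SANs and uses a hypothetical allowlist; the hostnames and function names are made up for illustration. The helper fakes the TLS state so the example runs without a real handshake.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requirePeer wraps next and rejects callers whose verified client certificate
// carries no DNS SAN from the allowlist.
func requirePeer(allowed map[string]bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
			http.Error(w, "client certificate required", http.StatusUnauthorized)
			return
		}
		for _, name := range r.TLS.PeerCertificates[0].DNSNames {
			if allowed[name] {
				next.ServeHTTP(w, r)
				return
			}
		}
		http.Error(w, "caller not allowed", http.StatusForbidden)
	})
}

// newGuardedHandler builds a settle endpoint that only the (hypothetical)
// payment gateway identity may call.
func newGuardedHandler() http.Handler {
	settle := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "settlement accepted")
	})
	allowed := map[string]bool{"payment-gateway.payments.svc.cluster.local": true}
	return requirePeer(allowed, settle)
}

// statusFor simulates a request whose TLS layer has already verified a client
// certificate carrying the given DNS SANs, and returns the HTTP status code.
func statusFor(h http.Handler, sans ...string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/settle", nil)
	req.TLS = &tls.ConnectionState{
		PeerCertificates: []*x509.Certificate{{DNSNames: sans}},
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newGuardedHandler()
	fmt.Println(statusFor(h, "payment-gateway.payments.svc.cluster.local")) // 200
	fmt.Println(statusFor(h, "log-aggregator.ops.svc.cluster.local"))       // 403
}
```

With tls.RequireAndVerifyClientCert on the server, PeerCertificates[0] is the client's leaf certificate and has already been chain-verified by the time the handler runs, so the middleware only has to make the authorization decision.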

Warning: Don't set InsecureSkipVerify: true in production, even "temporarily." I've seen this in payment codebases more times than I'd like to admit. On the client side it turns off verification of the server's certificate chain and hostname, so the client will happily talk to an impostor, which defeats the purpose of mTLS. If you're tempted to use it because certs aren't working, fix the certs.

Service Mesh vs DIY — Pick Your Pain

You have two paths to mTLS in a microservices environment, and the right choice depends on your team size and operational maturity.

DIY mTLS (Application-Level)

This is what I showed above — your application code manages TLS configuration directly. You control everything: which CA to trust, which cipher suites to allow, how to reload certs on rotation.

Full control over the TLS configuration per service
No infrastructure dependency — works on bare metal, VMs, or Kubernetes
You own the complexity: cert distribution, rotation, monitoring, revocation
Every service team needs to get it right, or you have gaps

Service Mesh (Istio, Linkerd)

A service mesh injects a sidecar proxy (Envoy for Istio, linkerd2-proxy for Linkerd) next to each service. The proxy handles mTLS transparently — your application code doesn't change at all. It just talks plain HTTP to localhost, and the proxy encrypts and authenticates the connection.

mTLS is automatic — no code changes in your services
Centralized certificate management and rotation
Policy enforcement: you can define which services can talk to which
Adds latency (small but measurable), memory overhead, and operational complexity

We started with DIY mTLS for our core payment services — there were only four of them, and we wanted tight control. When we grew to fifteen services, we migrated to Istio. The tipping point was cert rotation. Managing rotation sidecars for fifteen services was eating more engineering time than the mesh overhead would cost.

Tip: If you go the Istio route, start with PeerAuthentication in PERMISSIVE mode first. This accepts both mTLS and plain-text traffic, so you can migrate services incrementally without breaking everything at once. Switch to STRICT mode only after you've confirmed all services are sending mTLS traffic.

Debugging mTLS in Production

mTLS failures are some of the most frustrating issues to debug because the error messages are often cryptic and the failure happens at the TLS layer, below your application logs. Here are the three issues that have burned us the most:

1. Certificate Expiry

The most common cause of mTLS outages. A cert expires, the handshake fails, and suddenly your payment service can't talk to the ledger. The fix is monitoring: we run a Prometheus exporter that tracks cert_expiry_seconds for every service and alert when any cert is within 6 hours of expiry. But the real fix is short-lived certs with automated rotation — if your certs live for 24 hours and rotate every 12, expiry becomes a non-event.

2. Clock Skew

Certificates have a "not before" and "not after" timestamp. If the clock on your server is off by even a few minutes, a perfectly valid cert can be rejected because the server thinks it's not yet valid or already expired. We hit this in production when a Kubernetes node's NTP sync failed silently. The payment-to-fraud-check connection started failing intermittently — only on pods scheduled to that node.

Warning: Clock skew is insidious because it causes intermittent failures. If mTLS connections fail on some pods but not others, check NTP sync on the underlying nodes before you start blaming certificates. Run timedatectl status or check chronyc tracking on the affected nodes.

3. CN/SAN Mismatch

The client connects to settlement-service.payments.svc.cluster.local but the server certificate's Subject Alternative Name lists only settlement-service.payments. The hostname doesn't match, so the handshake fails. Go has ignored the Common Name for hostname verification since 1.15, so the SAN entries are what count. This bites teams that rename services or change namespaces without reissuing certificates. Always include the full FQDN and any short aliases in the SAN field.
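You can reproduce the mismatch with the same hostname check Go's TLS client uses, x509.Certificate.VerifyHostname. The sketch below builds a bare certificate struct that carries only a SAN list, which is enough for this check; in a real diagnostic the certificate would come from x509.ParseCertificate. The helper name is illustrative.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// matchesSAN reports whether host is covered by a certificate carrying the
// given DNS SANs, using the standard library's hostname verification.
func matchesSAN(sans []string, host string) bool {
	cert := &x509.Certificate{DNSNames: sans}
	return cert.VerifyHostname(host) == nil
}

func main() {
	sans := []string{"settlement-service.payments"}
	hosts := []string{
		"settlement-service.payments",
		"settlement-service.payments.svc.cluster.local",
	}
	for _, host := range hosts {
		fmt.Printf("%-50s match=%v\n", host, matchesSAN(sans, host))
	}
}
```

Only the short name matches. Adding the FQDN to the SAN list makes both pass.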

The Lateral Movement Attack That Changed Our Minds

The incident that convinced our team to prioritize mTLS happened in staging, and it's worth walking through in detail.

We had a log aggregation service — a third-party agent running as a DaemonSet — that had a known CVE we hadn't patched yet. An attacker exploited it during a penetration test (thankfully, not a real attack) and got shell access inside the container. From there, they discovered that our internal services communicated over one-way TLS. The services authenticated users via JWTs at the API gateway, but service-to-service calls behind the gateway had no authentication at all.

The pen tester curled our settlement API from the compromised logging container. It worked. They could have initiated settlement requests, queried transaction data, or called the refund endpoint. The only thing stopping a real attacker would have been guessing the API contract — and with access to internal DNS and some patience, that's not hard.

With mTLS in place, that curl would have failed at the TLS handshake. The compromised container didn't have a valid client certificate signed by our internal CA, so the settlement service would have refused the connection before any HTTP request was processed. The blast radius of the compromise would have been limited to the logging service itself.

That pen test report went straight to leadership, and mTLS moved from "nice to have" to "next sprint."

Getting It Right

If you're running payment microservices without mTLS, you're trusting your network perimeter to do the job of authentication. That's a bet that gets riskier every time you add a service, a dependency, or a third-party agent to your cluster.

Start small. Pick two services that handle the most sensitive data — probably your payment gateway and your ledger — and implement mTLS between them. Use short-lived certs from the beginning, even if manual rotation feels painful at first. That pain is what motivates you to automate it properly.

Once you've proven the pattern, expand to the rest of your services. And when the operational burden of managing certs across a dozen services starts to hurt, that's your signal to evaluate a service mesh.

The goal isn't perfect security — it's defense in depth. mTLS is one layer, but it's the layer that stops a compromised container from becoming a compromised payment platform.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Security configurations should be reviewed by your team's security engineers — always verify with official documentation before deploying to production.
