# Why Payment Systems Need Tracing More Than Most
A typical payment transaction in our system touches six services before a customer sees "Payment Successful." API gateway, authentication, fraud detection, the payment processor integration, ledger writes, and webhook dispatch. That's six places where something can go wrong, six places where latency can creep in, and six places where you need visibility when things break.
I learned this the hard way. We had an incident where settlement amounts were off by a few cents on roughly 2% of transactions. Logs showed everything was "successful." It took us eleven hours to trace the issue to a floating-point rounding error in the fraud scoring service that was silently modifying the amount field before passing it downstream. With distributed tracing, we would have seen the amount change between spans in minutes, not hours.
Payment systems are different from most microservice architectures in a few critical ways:
- Every transaction has a dollar value attached. A 500ms latency spike isn't just a bad user experience — it can mean timeouts that leave money in limbo between systems.
- You need an audit trail. Regulators want to know exactly what happened to a transaction, and "check the logs across eight services" isn't an acceptable answer.
- Failure modes are complex. A payment can partially succeed — the charge goes through but the ledger write fails, or the webhook never fires. Tracing shows you exactly where the chain broke.
## The OpenTelemetry Setup That Actually Works
I've tried a few approaches to instrumenting Go payment services: vendor-specific SDKs, hand-rolled tracing, and eventually OpenTelemetry. OTel won because it's vendor-neutral and the Go SDK is genuinely solid now. Here's the initialization pattern I use across all our payment services:
```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func InitTracer(ctx context.Context, serviceName string) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName),
			semconv.DeploymentEnvironmentKey.String("production"),
		)),
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.1),
		)),
	)
	otel.SetTracerProvider(tp)

	// Without this, the global propagator is a no-op and Inject/Extract
	// silently do nothing — trace context never crosses service boundaries.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp, nil
}
```
The key detail here is ParentBased(TraceIDRatioBased(0.1)). This means we sample 10% of new traces, but if an incoming request already has a trace ID (from an upstream service), we always honor that decision. This keeps traces complete — you never get a trace that's missing spans because a downstream service decided not to sample.
## The Middleware Pattern
Every HTTP handler in our payment services goes through this middleware. It creates a span, attaches payment-specific attributes, and ensures the trace context propagates to the response:
```go
func TracingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, span := otel.Tracer("payment-api").Start(r.Context(),
			r.Method+" "+r.URL.Path,
			trace.WithAttributes(
				attribute.String("payment.merchant_id", r.Header.Get("X-Merchant-ID")),
				attribute.String("payment.idempotency_key", r.Header.Get("Idempotency-Key")),
			),
		)
		defer span.End()

		// statusWriter is a thin http.ResponseWriter wrapper that records
		// the status code so we can attach it to the span afterwards.
		sw := &statusWriter{ResponseWriter: w}
		next.ServeHTTP(sw, r.WithContext(ctx))

		span.SetAttributes(attribute.Int("http.status_code", sw.status))
		if sw.status >= 400 {
			span.SetStatus(codes.Error, http.StatusText(sw.status))
		}
	})
}
```
Notice we're attaching merchant_id and idempotency_key as span attributes. This is critical for payment systems — when something goes wrong, you need to search traces by business identifiers, not just technical ones.
## What to Trace in a Payment Flow
Not every function call needs a span. Over-instrumenting is a real problem — it inflates costs and makes traces unreadable. Here's what I've found actually matters in a payment flow:

- The inbound API request (the root span)
- The fraud check
- The call to the payment processor
- The ledger write
- Webhook dispatch

Each of these spans carries attributes that matter for debugging. The fraud check span includes the risk score and decision. The payment processor span includes the processor's own transaction ID and response code. The ledger write span includes the debit and credit account IDs. When a transaction fails, you can open one trace and see the full story.
## Propagating Context Across Service Boundaries
HTTP propagation is the easy part. OpenTelemetry's Go SDK handles W3C Trace Context headers (traceparent and tracestate) automatically when you use the otelhttp transport. The tricky part is message queues.
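For intuition about what's actually on the wire: `traceparent` is four dash-separated hex fields — version, trace ID, parent span ID, and flags. A hand-rolled parser, purely for illustration (in practice the SDK's propagator handles this):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTraceparent splits a W3C traceparent header into its fields:
// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex).
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return "", "", false, fmt.Errorf("bad trace-flags: %q", parts[3])
	}
	// Bit 0 of trace-flags is the "sampled" flag.
	return parts[1], parts[2], flags&1 == 1, nil
}
```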
We use Kafka heavily for async operations — webhook dispatch, settlement batching, reconciliation jobs. The trace context doesn't magically flow through Kafka. You have to inject it into message headers explicitly:
```go
// Producer: inject trace context into Kafka headers.
// Assumes a package-level `producer` (confluent-kafka-go client).
func produceWithTrace(ctx context.Context, topic string, msg []byte) error {
	carrier := propagation.MapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)

	headers := make([]kafka.Header, 0, len(carrier))
	for k, v := range carrier {
		headers = append(headers, kafka.Header{Key: k, Value: []byte(v)})
	}
	return producer.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Headers:        headers,
		Value:          msg,
	}, nil)
}

// Consumer: extract trace context from Kafka headers.
func consumeWithTrace(msg *kafka.Message) context.Context {
	carrier := propagation.MapCarrier{}
	for _, h := range msg.Headers {
		carrier[h.Key] = string(h.Value)
	}
	return otel.GetTextMapPropagator().Extract(context.Background(), carrier)
}
```
The gotcha that bit us: async workers that process messages in batches. If your worker pulls 50 messages off Kafka and processes them in a loop, each message has its own trace context. You need to create a new span for each message using that message's extracted context — not the worker's ambient context. I've seen teams accidentally parent all 50 message spans under a single "batch processing" span, which makes the traces useless for debugging individual transactions.
Tip on sampling: For payment systems, I recommend a dual sampling strategy. Use a low base rate (5-10%) for normal traffic, but always sample transactions that result in errors, timeouts, or amounts above a threshold. OpenTelemetry's ParentBased sampler combined with a custom tail-sampling rule in the OTel Collector gives you this. You get cost control without losing visibility into the transactions that matter most.
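A sketch of that Collector-side rule using the `tail_sampling` processor. The policy names, thresholds, and the `payment.amount_cents` attribute are placeholders to adapt to your own span schema:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep traces containing an error span.
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow end-to-end transactions.
      - name: slow
        type: latency
        latency: {threshold_ms: 2000}
      # Always keep high-value payments (assumes an integer
      # payment.amount_cents attribute on the root span).
      - name: high-value
        type: numeric_attribute
        numeric_attribute: {key: payment.amount_cents, min_value: 100000}
      # Everything else falls back to a 10% probabilistic sample.
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```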
## Alerting on Trace Data
Traces aren't just for debugging after the fact. The real power is using trace duration data to catch problems before customers start complaining.
We track P50, P95, and P99 latency for every span type in our payment flow. The thresholds we've settled on after months of tuning:
- Payment Processor span P99 > 2s: Alert immediately. This usually means the upstream processor is degraded, and we need to consider failover.
- Fraud Check span P95 > 500ms: Warning. The ML model might be under-provisioned, or the feature store is slow.
- Ledger Write span P99 > 300ms: Alert. Database contention or a missing index. This one has caught two production issues before they became incidents.
- End-to-end transaction P99 > 3s: Critical. Customers are waiting. Something in the chain is broken.
We derive these metrics from traces using the OTel Collector's spanmetrics connector, which generates histograms from span data. These feed into Prometheus, and we alert with standard Prometheus alerting rules. No extra tooling needed.
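The Collector wiring looks roughly like this — the receiver and exporter names are illustrative and depend on your setup:

```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100ms, 300ms, 500ms, 2s, 5s]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, otlp/tempo]  # traces also continue to the backend
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```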
## Comparing Tracing Backends
We've evaluated four backends over the past two years. Here's an honest comparison based on running them in production with payment workloads:
| Backend | Cost | Retention | Query Power | Setup Complexity |
|---|---|---|---|---|
| Jaeger | Free (self-hosted) | Configurable | Basic — tag search | Medium — needs storage backend |
| Grafana Tempo | Free (self-hosted) | Configurable | Good with TraceQL | Low — object storage only |
| Datadog APM | $$$ | 15 days default | Excellent — full analytics | Low — SaaS |
| Honeycomb | $$ | 60 days | Excellent — BubbleUp | Low — SaaS |
We started with Jaeger, moved to Tempo when our Elasticsearch costs got out of hand, and use Honeycomb for a subset of high-value traces where we need deep query capabilities. The OTel Collector makes this easy — you can fan out traces to multiple backends with different sampling rules.
## Lessons from Production
After running distributed tracing across payment microservices for over two years, here's what I wish someone had told me on day one:
- Cardinality will bite you. We once added `payment.amount` as a span attribute with the exact cent value. Millions of unique values. Our Jaeger instance fell over within a week. Use bucketed ranges instead — `amount_bucket: "100-500"` — and put the exact amount in span events, not attributes.
- Trace IDs belong in your API responses. We return the trace ID in an `X-Trace-ID` response header. When a merchant reports a failed payment, they give us the trace ID from the response, and we can pull up the exact trace in seconds. This alone cut our support resolution time by 60%.
- Don't trace PII. It's tempting to attach card numbers or customer emails to spans for easier debugging. Don't. Trace data often has weaker access controls than your primary databases. We scrub everything through an OTel Collector processor that redacts sensitive fields before export.
- Start with the OTel Collector, not direct export. Even if you only have one backend today, run the Collector as an intermediary. It gives you buffering, retry logic, and the ability to switch backends without redeploying every service. We've switched backends twice with zero application code changes.
- Budget for storage early. At 10% sampling with 50,000 transactions per hour, we generate roughly 12GB of trace data per day. That adds up fast. Set retention policies from day one and use tail-based sampling in the Collector to keep interesting traces and drop the boring ones.
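The bucketing trick from the cardinality lesson is only a few lines. The boundary values here are arbitrary examples — pick buckets that match your transaction distribution:

```go
package main

// amountBucket maps an exact amount in cents to a coarse, low-cardinality
// range label (in dollars) suitable for a span attribute. The exact amount
// belongs in a span event, not an attribute.
func amountBucket(cents int64) string {
	switch {
	case cents < 10_000: // under $100
		return "0-100"
	case cents < 50_000: // under $500
		return "100-500"
	case cents < 500_000: // under $5,000
		return "500-5000"
	default:
		return "5000+"
	}
}
```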
## References
- OpenTelemetry Go SDK Documentation
- W3C Trace Context Specification
- Jaeger Documentation
- OpenTelemetry Collector Documentation
Disclaimer: The code examples and architecture patterns in this article are simplified for clarity. Production payment systems require additional considerations around PCI DSS compliance, encryption, error handling, and resilience that go beyond the scope of this post. Always consult your compliance team before implementing changes to payment infrastructure.