April 7, 2026 10 min read

Distributed Tracing for Payment Microservices — Finding the Needle in a $10M Haystack

When a payment fails at 2am and millions of dollars are in flight, you need more than logs. You need to see the exact path a transaction took across every service it touched. Here's how I built that visibility with OpenTelemetry.

Why Payment Systems Need Tracing More Than Most

A typical payment transaction in our system touches six services before a customer sees "Payment Successful." API gateway, authentication, fraud detection, the payment processor integration, ledger writes, and webhook dispatch. That's six places where something can go wrong, six places where latency can creep in, and six places where you need visibility when things break.

I learned this the hard way. We had an incident where settlement amounts were off by a few cents on roughly 2% of transactions. Logs showed everything was "successful." It took us eleven hours to trace the issue to a floating-point rounding error in the fraud scoring service that was silently modifying the amount field before passing it downstream. With distributed tracing, we would have seen the amount change between spans in minutes, not hours.

Payment systems are different from most microservice architectures in a few critical ways: correctness matters more than raw speed (an amount silently changing between services, as in that incident, is far worse than a slow response), every request carries business identifiers such as merchant IDs and idempotency keys that you need to be able to search by, and audit and compliance requirements mean you must be able to reconstruct exactly what happened to any given transaction.

The OpenTelemetry Setup That Actually Works

I've tried a few approaches to instrumenting Go payment services. Vendor-specific SDKs, hand-rolled tracing, and eventually OpenTelemetry. OTel won because it's vendor-neutral and the Go SDK is genuinely solid now. Here's the initialization pattern I use across all our payment services:

package tracing

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func InitTracer(ctx context.Context, serviceName string) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
            semconv.DeploymentEnvironmentKey.String("production"),
        )),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

The key detail here is ParentBased(TraceIDRatioBased(0.1)). This means we sample 10% of new traces, but if an incoming request already has a trace ID (from an upstream service), we always honor that decision. This keeps traces complete — you never get a trace that's missing spans because a downstream service decided not to sample.

The Middleware Pattern

Every HTTP handler in our payment services goes through this middleware. It extracts any incoming trace context, starts a span with payment-specific attributes, and records the response status on the way out:

func TracingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Extract incoming W3C trace context so this span is parented on
        // the upstream trace instead of starting a fresh one.
        ctx := otel.GetTextMapPropagator().Extract(r.Context(),
            propagation.HeaderCarrier(r.Header))

        ctx, span := otel.Tracer("payment-api").Start(ctx,
            r.Method+" "+r.URL.Path,
            trace.WithAttributes(
                attribute.String("payment.merchant_id", r.Header.Get("X-Merchant-ID")),
                attribute.String("payment.idempotency_key", r.Header.Get("Idempotency-Key")),
            ),
        )
        defer span.End()

        sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(sw, r.WithContext(ctx))

        span.SetAttributes(attribute.Int("http.status_code", sw.status))
        if sw.status >= 400 {
            span.SetStatus(codes.Error, http.StatusText(sw.status))
        }
    })
}

// statusWriter captures the status code the handler writes.
type statusWriter struct {
    http.ResponseWriter
    status int
}

func (w *statusWriter) WriteHeader(code int) {
    w.status = code
    w.ResponseWriter.WriteHeader(code)
}

Notice we're attaching merchant_id and idempotency_key as span attributes. This is critical for payment systems — when something goes wrong, you need to search traces by business identifiers, not just technical ones.

What to Trace in a Payment Flow

Not every function call needs a span. Over-instrumenting is a real problem — it inflates costs and makes traces unreadable. Here's what I've found actually matters in a payment flow:

Payment Transaction Trace — 847ms total
API Gateway: 847ms (root span)
Auth Service: 45ms
Fraud Check: 180ms
Payment Processor: 320ms
Ledger Write: 125ms
Webhook Dispatch: 28ms

Each of these spans carries attributes that matter for debugging. The fraud check span includes the risk score and decision. The payment processor span includes the processor's own transaction ID and response code. The ledger write span includes the debit and credit account IDs. When a transaction fails, you can open one trace and see the full story.

Propagating Context Across Service Boundaries

HTTP propagation is the easy part. OpenTelemetry's Go SDK handles W3C Trace Context headers (traceparent and tracestate) automatically when you use the otelhttp transport. The tricky part is message queues.

We use Kafka heavily for async operations — webhook dispatch, settlement batching, reconciliation jobs. The trace context doesn't magically flow through Kafka. You have to inject it into message headers explicitly:

// Producer: inject trace context into Kafka headers.
// (producer is a package-level *kafka.Producer from confluent-kafka-go.)
func produceWithTrace(ctx context.Context, topic string, msg []byte) error {
    carrier := propagation.MapCarrier{}
    otel.GetTextMapPropagator().Inject(ctx, carrier)

    headers := make([]kafka.Header, 0, len(carrier))
    for k, v := range carrier {
        headers = append(headers, kafka.Header{Key: k, Value: []byte(v)})
    }

    return producer.Produce(&kafka.Message{
        TopicPartition: kafka.TopicPartition{Topic: &topic},
        Headers:        headers,
        Value:          msg,
    }, nil)
}

// Consumer: extract trace context from Kafka headers
func consumeWithTrace(msg *kafka.Message) context.Context {
    carrier := propagation.MapCarrier{}
    for _, h := range msg.Headers {
        carrier[h.Key] = string(h.Value)
    }
    return otel.GetTextMapPropagator().Extract(context.Background(), carrier)
}

The gotcha that bit us: async workers that process messages in batches. If your worker pulls 50 messages off Kafka and processes them in a loop, each message has its own trace context. You need to create a new span for each message using that message's extracted context — not the worker's ambient context. I've seen teams accidentally parent all 50 message spans under a single "batch processing" span, which makes the traces useless for debugging individual transactions.

Tip on sampling: For payment systems, I recommend a dual sampling strategy. Use a low base rate (5-10%) for normal traffic, but always sample transactions that result in errors, timeouts, or amounts above a threshold. OpenTelemetry's ParentBased sampler combined with a custom tail-sampling rule in the OTel Collector gives you this. You get cost control without losing visibility into the transactions that matter most.
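One way to express that dual strategy is a collector tail-sampling policy. This sketch assumes the tail_sampling processor from the opentelemetry-collector-contrib distribution; the attribute key and threshold values are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans before deciding per trace
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-always
        type: latency
        latency: {threshold_ms: 2000}
      - name: high-value-always
        type: numeric_attribute
        numeric_attribute: {key: payment.amount_cents, min_value: 100000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

Policies are OR'd together: a trace that matches any of them is kept, so errors and high-value transactions survive even when the baseline rate drops them.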

Alerting on Trace Data

Traces aren't just for debugging after the fact. The real power is using trace duration data to catch problems before customers start complaining.

We track P50, P95, and P99 latency for every span type in our payment flow, against per-span thresholds we've settled on after months of tuning.

We derive these metrics from traces using the OTel Collector's spanmetrics connector, which generates histograms from span data. These feed into Prometheus, and we alert with standard Prometheus alerting rules. No extra tooling needed.
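The collector wiring looks roughly like this; bucket boundaries and exporter names are illustrative, so check your collector version for exact field names:

```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [50ms, 100ms, 250ms, 500ms, 1s, 2s]

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, spanmetrics]  # spans keep flowing to the trace backend
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```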

Comparing Tracing Backends

We've evaluated four backends over the past two years. Here's an honest comparison based on running them in production with payment workloads:

| Backend | Cost | Retention | Query Power | Setup Complexity |
| --- | --- | --- | --- | --- |
| Jaeger | Free (self-hosted) | Configurable | Basic — tag search | Medium — needs storage backend |
| Grafana Tempo | Free (self-hosted) | Configurable | Good with TraceQL | Low — object storage only |
| Datadog APM | $$$ | 15 days default | Excellent — full analytics | Low — SaaS |
| Honeycomb | $$ | 60 days | Excellent — BubbleUp | Low — SaaS |
We started with Jaeger, moved to Tempo when our Elasticsearch costs got out of hand, and use Honeycomb for a subset of high-value traces where we need deep query capabilities. The OTel Collector makes this easy — you can fan out traces to multiple backends with different sampling rules.
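The fan-out itself is just pipeline configuration in the collector. A sketch, where the Tempo endpoint and the HONEYCOMB_API_KEY variable are illustrative (x-honeycomb-team is Honeycomb's real auth header):

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

service:
  pipelines:
    traces/all:
      receivers: [otlp]
      exporters: [otlp/tempo]
    traces/high-value:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/honeycomb]
```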

Lessons from Production

After running distributed tracing across payment microservices for over two years, here's what I wish someone had told me on day one: attach business identifiers (merchant IDs, idempotency keys, amounts) to every span, because those are what you'll actually search by; plan trace propagation through your message queues from the start, not after your first async debugging nightmare; resist over-instrumenting, since a clean trace of six meaningful spans beats a noisy one of sixty; and keep the base sampling rate low while always keeping errors and high-value transactions.


Disclaimer: The code examples and architecture patterns in this article are simplified for clarity. Production payment systems require additional considerations around PCI DSS compliance, encryption, error handling, and resilience that go beyond the scope of this post. Always consult your compliance team before implementing changes to payment infrastructure.