OpenTelemetry Instrumentation for Go Payment Services

Why OpenTelemetry Won Over Vendor SDKs

About eighteen months ago, our payment platform was a mess of observability tooling. We had Datadog's APM agent in the authorization service, New Relic in the settlement pipeline, and a homegrown Prometheus setup for the fraud engine. Three vendors, three sets of instrumentation code, and zero ability to trace a single payment from ingress to settlement. When a merchant reported intermittent timeouts on captures, we spent two days stitching together logs from different systems just to find the root cause — a connection pool exhaustion in a downstream acquirer adapter.

That was the moment I pushed for OpenTelemetry. The pitch was simple: one instrumentation layer, vendor-neutral, and we could swap backends without touching application code. The Go SDK was mature enough, the collector architecture meant we could fan out to multiple backends during migration, and CNCF backing gave us confidence it wasn't going anywhere.

The real win with OTel isn't the traces themselves — it's that your instrumentation code becomes infrastructure you own, decoupled from whatever backend you're paying for this year.

Setting Up the OTel SDK in Go

The bootstrap is straightforward, but there are a few gotchas specific to payment services. You want to initialize the tracer and meter providers early in your application lifecycle, before any payment processing goroutines spin up. Here's the core setup I use across our services:

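package main

import (
    "context"
    "errors"
    "fmt"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initOTel installs the global tracer and meter providers and returns a
// shutdown function that flushes buffered telemetry.
func initOTel(ctx context.Context) (func(context.Context) error, error) {
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceNameKey.String("payment-gateway"),
            semconv.ServiceVersionKey.String("2.4.1"),
            semconv.DeploymentEnvironmentKey.String("production"),
        ),
    )
    if err != nil {
        return nil, fmt.Errorf("build resource: %w", err)
    }

    traceExp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(), // plaintext gRPC to the in-cluster collector
    )
    if err != nil {
        return nil, fmt.Errorf("create trace exporter: %w", err)
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(traceExp,
            sdktrace.WithMaxExportBatchSize(256),
            sdktrace.WithBatchTimeout(5*time.Second),
        ),
        sdktrace.WithResource(res),
        // Respect the upstream sampling decision; sample 10% of new root traces.
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1),
        )),
    )
    otel.SetTracerProvider(tp)
    // The default global propagator is a no-op; without this call no
    // traceparent headers are injected or extracted.
    otel.SetTextMapPropagator(propagation.TraceContext{})

    metricExp, err := otlpmetricgrpc.New(ctx,
        otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
        otlpmetricgrpc.WithInsecure(),
    )
    if err != nil {
        return nil, fmt.Errorf("create metric exporter: %w", err)
    }

    mp := metric.NewMeterProvider(
        metric.WithReader(metric.NewPeriodicReader(metricExp,
            metric.WithInterval(15*time.Second),
        )),
        metric.WithResource(res),
    )
    otel.SetMeterProvider(mp)

    // The caller passes a context with a deadline so shutdown cannot hang.
    return func(ctx context.Context) error {
        return errors.Join(tp.Shutdown(ctx), mp.Shutdown(ctx))
    }, nil
}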

A couple of things worth noting. The ParentBased sampler is critical for payment flows — if an upstream service decides to sample a trace, you want every downstream hop to respect that decision. We sample at 10% for general traffic but force-sample any transaction over a configurable amount threshold. The batch timeout of 5 seconds is a balance between export latency and not hammering the collector during peak checkout hours.
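The sampling policy described above can be sketched without the SDK. This is a minimal, stdlib-only illustration, not the OTel implementation: the shouldSample function, the parentState type, the threshold parameter, and the way the trace ID is mapped onto a ratio are all assumptions made for the example. In a real service the same logic would presumably live in a custom sampler wrapped in ParentBased.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// parentState describes what the upstream caller decided, if there was one.
type parentState int

const (
    noParent parentState = iota
    parentSampled
    parentNotSampled
)

// shouldSample is a hypothetical decision function:
//  1. an existing parent decision always wins (the ParentBased behaviour),
//  2. root spans at or above the amount threshold are always kept,
//  3. all other root spans are kept at the given ratio, derived
//     deterministically from the trace ID so every service agrees.
func shouldSample(parent parentState, traceID [16]byte, amount, threshold, ratio float64) bool {
    switch parent {
    case parentSampled:
        return true
    case parentNotSampled:
        return false
    }
    if amount >= threshold {
        return true
    }
    // Map the low 8 bytes of the trace ID onto [0, 1) and compare to the ratio.
    x := binary.BigEndian.Uint64(traceID[8:]) >> 1 // 63 bits
    return float64(x)/float64(uint64(1)<<63) < ratio
}

func main() {
    var low, high [16]byte
    high[8] = 0xFF // maps close to 1.0, so it falls outside a 10% ratio

    fmt.Println(shouldSample(noParent, low, 25, 1000, 0.1))           // true: ID maps to 0.0
    fmt.Println(shouldSample(noParent, high, 25, 1000, 0.1))          // false: outside the 10%
    fmt.Println(shouldSample(noParent, high, 5000, 1000, 0.1))        // true: forced by amount
    fmt.Println(shouldSample(parentNotSampled, low, 5000, 1000, 0.1)) // false: parent wins
}
```

One consequence worth noticing in the last case: once a parent has decided not to sample, an amount-based rule further downstream cannot bring the trace back, so the force-sample rule only helps at the service that starts the trace.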

Architecture: How the Data Flows

Before diving into the instrumentation details, here's how our OTel pipeline is wired up. The collector acts as a buffer and routing layer — we never export directly from services to backends in production.

[Diagram: the Auth, Capture, and Settlement services export over OTLP/gRPC to the OTel Collector (batch, filter, route), which forwards over OTLP/gRPC to Jaeger (traces), Grafana (metrics), and Loki (logs).]

Fig 1: OTel pipeline — services export via OTLP to the collector, which fans out to storage backends

Custom Spans for Payment Flows

Generic HTTP spans tell you almost nothing useful about payment processing. What you actually need is domain-specific spans that map to your payment lifecycle. I create explicit spans for each phase: authorize, fraud check, capture, and settle. The key is attaching payment-specific attributes that make traces searchable and meaningful.

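import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("payment-gateway")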

func (s *PaymentService) AuthorizePayment(ctx context.Context, req *AuthRequest) (*AuthResponse, error) {
    ctx, span := tracer.Start(ctx, "payment.authorize",
        trace.WithAttributes(
            attribute.String("payment.merchant_id", req.MerchantID),
            attribute.String("payment.transaction_id", req.TransactionID),
            attribute.Float64("payment.amount", req.Amount),
            attribute.String("payment.currency", req.Currency),
            attribute.String("payment.card_network", req.CardNetwork),
            attribute.String("payment.acquirer", req.AcquirerCode),
        ),
    )
    defer span.End()

    // Fraud check as a child span
    fraudResult, err := s.checkFraud(ctx, req)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "fraud check failed")
        return nil, err
    }
    span.SetAttributes(attribute.String("payment.fraud_score", fraudResult.Score))

    // Forward to acquirer
    resp, err := s.acquirerClient.Authorize(ctx, req)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "acquirer authorization failed")
        return nil, err
    }

    span.SetAttributes(
        attribute.String("payment.auth_code", resp.AuthCode),
        attribute.String("payment.response_code", resp.ResponseCode),
    )
    return resp, nil
}

Notice how the ctx gets threaded through every call. That's how OTel propagates the trace context: any span that the fraud check or the acquirer client starts from this ctx becomes a child of payment.authorize. When something goes wrong at 2 AM, you can search Jaeger for payment.merchant_id=MCH_29481 and see the entire payment journey in one waterfall view.

Trace Waterfall: What a Payment Flow Looks Like

Here's a simplified view of what a traced payment authorization looks like in our system. Each bar represents a span, and nesting shows the parent-child relationship:

Trace: payment.authorize — txn_8f3a2c91

payment-gateway: payment.authorize (342ms)
fraud-engine: fraud.check (95ms)
acquirer-adapter: acquirer.authorize (188ms)
payment-gateway: db.write (24ms)

Fig 2: Trace waterfall for a single payment authorization — fraud check and acquirer call are child spans

Metrics That Actually Matter

Traces are great for debugging individual transactions, but metrics are what keep you ahead of incidents. I set up three core instruments for every payment service: a latency histogram, an error counter, and a throughput counter. The histogram bucket boundaries are tuned for payment latencies: most authorizations complete in 200-400ms, but acquirer timeouts can push into multi-second territory.

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    otelmetric "go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("payment-gateway")

// Instrument creation errors are discarded here to keep the example short.
var (
    authLatency, _ = meter.Float64Histogram(
        "payment.authorize.duration",
        otelmetric.WithDescription("Authorization latency in milliseconds"),
        otelmetric.WithUnit("ms"),
        otelmetric.WithExplicitBucketBoundaries(
            25, 50, 100, 200, 350, 500, 750, 1000, 2000, 5000,
        ),
    )

    authErrors, _ = meter.Int64Counter(
        "payment.authorize.errors",
        otelmetric.WithDescription("Authorization error count by type"),
    )

    authTotal, _ = meter.Int64Counter(
        "payment.authorize.total",
        otelmetric.WithDescription("Total authorization attempts"),
    )
)

// recordAuthMetrics records latency and outcome for one authorization attempt.
func recordAuthMetrics(ctx context.Context, duration time.Duration, req *AuthRequest, err error) {
    // Only low-cardinality attributes belong on metrics.
    attrs := []attribute.KeyValue{
        attribute.String("card_network", req.CardNetwork),
        attribute.String("acquirer", req.AcquirerCode),
        attribute.String("currency", req.Currency),
    }

    // Float division keeps sub-millisecond precision; Milliseconds() truncates.
    ms := float64(duration) / float64(time.Millisecond)
    authLatency.Record(ctx, ms, otelmetric.WithAttributes(attrs...))
    authTotal.Add(ctx, 1, otelmetric.WithAttributes(attrs...))

    if err != nil {
        // The full slice expression forces append to copy rather than reuse attrs.
        errAttrs := append(attrs[:len(attrs):len(attrs)],
            attribute.String("error_type", classifyError(err)))
        authErrors.Add(ctx, 1, otelmetric.WithAttributes(errAttrs...))
    }
}

p99 auth latency: 247ms | Error rate (7d avg): 0.03% | Auth TPS (peak): 1,842

One lesson learned the hard way: keep your attribute cardinality under control. Early on, I added merchant_id as a metric attribute. With 12,000+ merchants, that blew up our Prometheus storage. Now merchant_id only goes on trace spans — metrics get aggregated by card_network, acquirer, and currency, which gives us enough dimensionality for dashboards and alerts without the cardinality explosion.
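A back-of-the-envelope calculation shows why. The 12,000 merchants figure comes from the paragraph above; the counts for card networks, acquirers, and currencies are made-up assumptions, and the 13 series per histogram assumes a Prometheus-style layout (one series per bucket including +Inf for the 10 boundaries, plus sum and count). The result is a worst-case upper bound, since not every combination will occur in practice.

```go
package main

import "fmt"

// seriesCount estimates the worst-case number of time series for one
// histogram: series per histogram times every combination of attribute values.
func seriesCount(seriesPerHistogram int64, cardinalities ...int64) int64 {
    total := seriesPerHistogram
    for _, c := range cardinalities {
        total *= c
    }
    return total
}

func main() {
    const perHistogram = 13 // 11 buckets (10 boundaries plus +Inf) + sum + count

    // Hypothetical cardinalities: 5 card networks, 8 acquirers, 20 currencies.
    base := seriesCount(perHistogram, 5, 8, 20)
    withMerchant := seriesCount(perHistogram, 5, 8, 20, 12000)

    fmt.Println(base)         // 10400
    fmt.Println(withMerchant) // 124800000
}
```

Under these assumptions, a single extra attribute takes one histogram from roughly ten thousand series to over a hundred million in the worst case, which is the kind of jump that overwhelms a time-series store.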

Context Propagation Across Services

Payment flows typically cross 3-5 service boundaries. The authorization service calls the fraud engine, which calls the risk scoring service, which might call an external 3DS provider. If context propagation breaks at any hop, you lose the trace.

For gRPC services, the otelgrpc instrumentation handles this automatically once it is attached to the client and server. For HTTP calls to external acquirers, you need to inject the trace context into outgoing headers:

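import (
    "log"
    "net/http"
    "time"

    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "google.golang.org/grpc"
)

// Wrap your HTTP client: this injects W3C traceparent headers
acquirerClient := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
    Timeout:   30 * time.Second,
}

// For gRPC, attach the otelgrpc stats handler; the client interceptors are
// deprecated in recent otelgrpc releases. creds is your TLS configuration.
conn, err := grpc.Dial(target,
    grpc.WithTransportCredentials(creds),
    grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
)
if err != nil {
    log.Fatalf("dial %s: %v", target, err)
}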

Always verify propagation works end-to-end in staging before going to production. I've seen cases where a reverse proxy strips the traceparent header, silently breaking the entire trace chain. A quick integration test that asserts on trace ID continuity saves hours of debugging later.
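A continuity check of this kind can be sketched with the standard library alone. The example below is an assumption about how such a test might look, not our actual test suite: it sends a request carrying a traceparent header through a reverse proxy and reports the trace ID the backend received. The helper names traceIDFrom and traceIDSeenByBackend are hypothetical, and the stripHeader flag simulates a misconfigured proxy. In a real test the request would come from the otelhttp-wrapped client rather than a hand-set header.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "net/http/httputil"
    "net/url"
    "strings"
)

// traceIDFrom extracts the trace ID from a W3C traceparent header value,
// which has the form version-traceid-parentid-flags.
func traceIDFrom(traceparent string) (string, bool) {
    parts := strings.Split(traceparent, "-")
    if len(parts) != 4 || len(parts[1]) != 32 {
        return "", false
    }
    return parts[1], true
}

// traceIDSeenByBackend sends one request through a reverse proxy and returns
// the trace ID that reached the backend, if any.
func traceIDSeenByBackend(traceparent string, stripHeader bool) (string, bool) {
    seen := make(chan string, 1)
    backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        seen <- r.Header.Get("traceparent")
    }))
    defer backend.Close()

    target, err := url.Parse(backend.URL)
    if err != nil {
        return "", false
    }
    proxy := httputil.NewSingleHostReverseProxy(target)
    front := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if stripHeader {
            r.Header.Del("traceparent") // simulate a proxy that drops the header
        }
        proxy.ServeHTTP(w, r)
    }))
    defer front.Close()

    req, err := http.NewRequest(http.MethodGet, front.URL, nil)
    if err != nil {
        return "", false
    }
    req.Header.Set("traceparent", traceparent)
    resp, err := front.Client().Do(req)
    if err != nil {
        return "", false
    }
    resp.Body.Close()
    return traceIDFrom(<-seen)
}

func main() {
    const tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

    id, ok := traceIDSeenByBackend(tp, false)
    fmt.Println(id, ok) // 4bf92f3577b34da6a3ce929d0e0e4736 true

    _, ok = traceIDSeenByBackend(tp, true)
    fmt.Println(ok) // false: the proxy stripped the header
}
```

Pointing the same assertion at the real staging ingress, instead of the in-process proxy used here, is what would catch a header-stripping proxy before production does.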

Collector Configuration

We run the OTel Collector as a deployment in Kubernetes (not as a sidecar — the overhead wasn't worth it for our scale). The config routes traces to Jaeger and metrics to Grafana Mimir. The tail-sampling processor is a game changer for payment services: it lets you keep 100% of error traces and high-latency traces while sampling routine successful transactions at a lower rate.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: false
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]

What I'd Do Differently

If I were starting this instrumentation from scratch, I'd invest more time upfront in defining a span naming convention and attribute schema. We ended up with inconsistencies — some services used payment.authorize, others used authorize_payment — and cleaning that up across 14 services was painful. Write an internal OTel style guide before you write a single line of instrumentation code. Document your attribute names, expected value formats, and cardinality limits.
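
Part of such a style guide can be enforced in code. As a sketch, assuming a convention of lowercase, dot-separated segments (matching payment.authorize above), a small check can run in unit tests or CI. The regular expression and the validSpanName function are assumptions for illustration, not an OTel rule.

```go
package main

import (
    "fmt"
    "regexp"
)

// spanNamePattern encodes a hypothetical house rule: two or more lowercase
// segments separated by dots, e.g. "payment.authorize". Underscores are
// allowed inside a segment but do not replace the dot hierarchy.
var spanNamePattern = regexp.MustCompile(`^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$`)

// validSpanName reports whether name follows the convention.
func validSpanName(name string) bool {
    return spanNamePattern.MatchString(name)
}

func main() {
    for _, name := range []string{"payment.authorize", "authorize_payment", "Payment.Authorize"} {
        fmt.Printf("%-20s %t\n", name, validSpanName(name))
    }
}
```

The same pattern accepts attribute keys such as payment.merchant_id, so one check could cover both span names and attribute names if the guide uses the same shape for each.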

I'd also set up the collector's tail sampling from day one instead of retrofitting it. Head-based sampling at 10% meant we missed rare but critical error traces during our first few months. Tail sampling fixed that, but we lost some valuable debugging data in the interim.

Overall, the migration took about six weeks for our team of four. The payoff has been significant: our mean time to detect payment anomalies dropped from 12 minutes to under 90 seconds, and incident resolution is roughly 3x faster now that we can follow a single trace across the entire payment lifecycle.

Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.
