The Moment REST Stopped Working for Us
Our payment platform had about a dozen microservices — authorization, fraud scoring, ledger, settlement, notifications. All talking REST over JSON. It worked fine at low volume. But once we crossed a few thousand transactions per second, problems started stacking up.
JSON serialization was eating CPU on the hot path. We had no way to enforce schema contracts between teams — someone would rename a field in the fraud service response and the authorization service would silently swallow the missing data. And when we needed real-time settlement status updates, we were polling a REST endpoint every 500ms. It was ugly.
gRPC solved all three problems. Not because it's magic, but because it gives you binary serialization, strict contracts via protobuf, and native streaming. For internal service-to-service communication in a payment system, that combination is hard to beat.
To be clear: REST is still the right choice for your public-facing payment API. Clients expect JSON, browsers need HTTP/1.1 compatibility, and the tooling ecosystem is unmatched. gRPC shines for internal service communication where you control both ends.
REST vs gRPC — Where It Actually Matters
I've seen too many "REST vs gRPC" comparisons that focus on theoretical throughput benchmarks. In our payment system, three things actually mattered: serialization cost on the hot path, enforceable contracts between teams, and native streaming.
The serialization difference alone justified the migration on our authorization path. When you're doing fraud checks, balance lookups, and ledger writes for every transaction, shaving milliseconds off each hop compounds fast.
Protobuf Schema Design for Financial Data
Designing protobuf schemas for payment data is where most teams trip up. Financial amounts need precision, and protobuf doesn't have a native decimal type. Here's what we settled on after some painful lessons:
```proto
syntax = "proto3";

package payment.v1;

import "google/protobuf/timestamp.proto";

// Don't use float/double for money. Ever.
// We represent amounts as minor units (cents) with currency.
message Money {
  int64 minor_units = 1;    // e.g., 2000 = $20.00
  string currency_code = 2; // ISO 4217: "USD", "EUR", "SGD"
}

message AuthorizationRequest {
  string idempotency_key = 1;
  string merchant_id = 2;
  Money amount = 3;
  string payment_token = 4;
  string mcc = 5; // Merchant Category Code
  google.protobuf.Timestamp request_time = 6;
  map<string, string> metadata = 7;
}

message AuthorizationResponse {
  string transaction_id = 1;
  AuthStatus status = 2;
  string decline_reason = 3;
  Money authorized_amount = 4; // May differ from requested (partial auth)
  string authorization_code = 5;
}

enum AuthStatus {
  AUTH_STATUS_UNSPECIFIED = 0;
  AUTH_STATUS_APPROVED = 1;
  AUTH_STATUS_DECLINED = 2;
  AUTH_STATUS_PARTIAL = 3;
  AUTH_STATUS_ERROR = 4;
}

service PaymentAuthorizationService {
  rpc Authorize(AuthorizationRequest) returns (AuthorizationResponse);
  // VoidRequest/VoidResponse messages omitted here for brevity
  rpc Void(VoidRequest) returns (VoidResponse);
}
```
A few things to notice. We use int64 for money, not float or double. Floating point arithmetic and financial calculations don't mix — you'll get rounding errors that turn into real accounting discrepancies. Minor units (cents, pence, sen) keep everything as integers.
The idempotency_key is a first-class field, not an afterthought stuffed into metadata. In payment systems, idempotency isn't optional — it's load-bearing infrastructure.
Architecture — How Our Services Talk
Here's a simplified view of how gRPC connects our payment services. The API gateway translates external REST calls into internal gRPC, and everything downstream speaks protobuf.
[Architecture diagram: external clients call the API gateway over REST; the gateway translates to gRPC and fans out to the internal services (authorization, fraud scoring, ledger, settlement, notifications), all speaking protobuf.]
Streaming for Real-Time Settlement Updates
This is where gRPC really pulled ahead for us. Settlement isn't a request-response operation — it's a process that unfolds over time. A batch of transactions gets submitted, individual items settle or fail, and downstream services need to know about each state change as it happens.
With REST, we were polling every 500ms. With gRPC server-side streaming, the settlement service pushes updates the moment they occur:
```proto
// settlement.proto
syntax = "proto3";

package payment.v1;

import "google/protobuf/timestamp.proto";

service SettlementService {
  // Server streams settlement status updates as they happen
  rpc WatchSettlement(WatchSettlementRequest)
      returns (stream SettlementEvent);
}

message WatchSettlementRequest {
  string batch_id = 1;
  google.protobuf.Timestamp since = 2; // Resume from timestamp
}

message SettlementEvent {
  string transaction_id = 1;
  SettlementStatus status = 2;
  Money settled_amount = 3;
  string failure_reason = 4;
  google.protobuf.Timestamp event_time = 5;
}

enum SettlementStatus {
  SETTLEMENT_STATUS_UNSPECIFIED = 0;
  SETTLEMENT_STATUS_PENDING = 1;
  SETTLEMENT_STATUS_SETTLED = 2;
  SETTLEMENT_STATUS_FAILED = 3;
  SETTLEMENT_STATUS_REVERSED = 4;
}
```
And the Go implementation on the server side:
```go
func (s *settlementServer) WatchSettlement(
	req *pb.WatchSettlementRequest,
	stream pb.SettlementService_WatchSettlementServer,
) error {
	ctx := stream.Context()

	events := s.eventStore.Subscribe(req.BatchId, req.Since.AsTime())
	defer events.Close()

	for {
		select {
		case <-ctx.Done():
			// Client went away or the deadline fired; surface the
			// matching gRPC status (Canceled / DeadlineExceeded).
			return status.FromContextError(ctx.Err()).Err()
		case evt, ok := <-events.Ch():
			if !ok {
				return nil // batch fully settled
			}
			if err := stream.Send(evt.ToProto()); err != nil {
				return err
			}
		}
	}
}
```
The since field in the request lets clients resume from where they left off after a disconnect. This is critical — in payment systems, you can't afford to miss a settlement event, and you can't afford to reprocess the entire batch either.
Error Handling with gRPC Status Codes
One of the underappreciated features of gRPC is its rich error model. Instead of overloading HTTP status codes (is a declined transaction a 400? a 422? a 200 with an error body?), gRPC gives you semantic status codes plus structured error details.
Here's how we map payment-specific errors:
```go
import (
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	epb "google.golang.org/genproto/googleapis/rpc/errdetails"
)

func mapPaymentError(err error) error {
	switch {
	case errors.Is(err, ErrInsufficientFunds):
		st := status.New(codes.FailedPrecondition, "insufficient funds")
		st, _ = st.WithDetails(&epb.ErrorInfo{
			Reason: "INSUFFICIENT_FUNDS",
			Domain: "payment.example.com",
			Metadata: map[string]string{
				"decline_code": "51",
			},
		})
		return st.Err()
	case errors.Is(err, ErrCardExpired):
		st := status.New(codes.FailedPrecondition, "card expired")
		st, _ = st.WithDetails(&epb.ErrorInfo{
			Reason: "CARD_EXPIRED",
			Domain: "payment.example.com",
		})
		return st.Err()
	case errors.Is(err, ErrDuplicateTransaction):
		// AlreadyExists tells the client this is safe to treat as success
		return status.Error(codes.AlreadyExists, "duplicate transaction")
	case errors.Is(err, ErrFraudSuspected):
		return status.Error(codes.PermissionDenied, "transaction blocked")
	default:
		// Don't leak internal errors to callers
		return status.Error(codes.Internal, "payment processing failed")
	}
}
```
The key insight: codes.AlreadyExists for duplicate transactions. When a client retries with the same idempotency key and the original transaction succeeded, returning AlreadyExists tells the caller "this already went through, you're fine." The client doesn't need to parse an error message — the status code carries the semantics.
Deadline Propagation — The Feature You Didn't Know You Needed
In a REST world, each service sets its own HTTP timeout independently. Service A calls B with a 5s timeout, B calls C with a 3s timeout, and nobody coordinates. If A's timeout fires while C is still processing, you get an orphaned transaction.
gRPC propagates deadlines automatically through the call chain. When the API gateway sets a 4-second deadline, every downstream service sees the remaining time:
```go
// Gateway sets the deadline
ctx, cancel := context.WithTimeout(ctx, 4*time.Second)
defer cancel()

// Auth service receives ctx with ~3.9s remaining.
// It calls fraud service — fraud sees ~3.5s remaining.
// Fraud calls the ML scoring service — it sees ~3.1s remaining.
// If time runs out anywhere in the chain, everything unwinds cleanly.
resp, err := authClient.Authorize(ctx, req)
if status.Code(err) == codes.DeadlineExceeded {
	// The entire chain timed out — no orphaned work
	log.Warn("authorization deadline exceeded", "err", err)
}
```
This is huge for payment systems. A card authorization that takes longer than 4 seconds is useless — the customer is already staring at a spinner. Deadline propagation ensures that when you give up, every service in the chain gives up too. No zombie transactions, no wasted compute.
Watch out: Deadline propagation means a slow downstream service can cause cascading timeouts across your entire call graph. Pair it with circuit breakers. We use a 3-strike pattern — if a service hits deadline exceeded three times in a row, we trip the breaker and fail fast for 30 seconds.
Backward Compatibility — Don't Break Production
Protobuf's wire format is designed for backward compatibility, but you still need discipline. Here are the rules we enforce in CI:
- Never reuse field numbers. If you remove a field, mark it as `reserved`. Old clients sending that field number with a different type will corrupt data silently.
- Never change field types. Even a seemingly harmless `int32` to `int64` change can silently truncate values on old clients, and changes like `int32` to `string` break the wire format outright.
- Add new fields, don't modify existing ones. Old clients ignore unknown fields. New clients handle missing fields with defaults.
- Use `buf breaking` in CI. It catches breaking changes before they hit main.
```yaml
# buf.yaml — we run this on every PR
version: v1
breaking:
  use:
    - FILE
# Catches: field deletion, type changes,
# service removal, method signature changes
```
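The first rule above, never reusing field numbers, has direct syntax support. If we retired `decline_reason` from `AuthorizationResponse`, a hypothetical follow-up revision would reserve both the number and the name so neither can be reintroduced with different semantics:

```proto
message AuthorizationResponse {
  string transaction_id = 1;
  AuthStatus status = 2;
  // decline_reason (field 3) was removed; reserving the number and name
  // makes protoc reject any attempt to reuse them.
  reserved 3;
  reserved "decline_reason";
  Money authorized_amount = 4;
  string authorization_code = 5;
}
```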
We version our proto packages (payment.v1, payment.v2) and run both versions in parallel during migrations. The old service keeps serving v1 while the new one handles v2. Once all clients have migrated, we sunset v1. It's more work than just changing a JSON field name, but you never get a silent data corruption bug at 3am.
What I'd Do Differently
If I were starting over, three things:
- Start with gRPC internally from day one. Migrating from REST to gRPC mid-flight is painful. You end up maintaining both protocols for months during the transition. If you know you'll have more than three internal services, just start with gRPC.
- Invest in observability early. gRPC's binary protocol means you can't just read the payloads in your HTTP access logs anymore. Set up proper distributed tracing with OpenTelemetry from the start. We bolted it on later and it was a mess.
- Use `buf` instead of raw `protoc`. The protobuf compiler toolchain is notoriously painful. Buf handles linting, breaking change detection, and code generation with a single config file. We wasted weeks fighting protoc plugin versions before switching.
gRPC isn't the right tool for everything. But for internal communication between payment microservices — where latency matters, contracts must be enforced, and streaming is a real requirement — it's been a clear win for us. The migration took about three months, and we haven't looked back.