Why Polling Killed Our Dashboard
Our ops team had a simple ask: see payment statuses update in real time. What they got instead was a dashboard that polled /api/payments/recent every two seconds, hammering our database with identical queries and still feeling sluggish. During peak hours, the API would start returning stale data because our read replicas were lagging behind the write primary, and the polling interval meant operators were always looking at a world that was at least two seconds old.
Two seconds doesn't sound like much until a payment fails and your ops team doesn't see it for another 2-4 seconds. They'd click refresh out of habit. Multiply that by 30 operators across three time zones, and we had roughly 15 requests per second hitting the same endpoint for data that changed maybe twice a second. It was wasteful, slow, and the team didn't trust the dashboard.
We needed to flip the model: instead of the client asking "anything new?" every two seconds, the server should say "here's what just happened" the moment it happens.
The Architecture
The design is straightforward. The browser opens a persistent WebSocket connection to our Go server. The Go server subscribes to a Redis Pub/Sub channel where our payment service publishes state changes. When a payment transitions — authorized, captured, failed, refunded — the payment service publishes an event to Redis, and every connected WebSocket server instance fans it out to the relevant clients.
Redis Pub/Sub is the glue. It decouples the payment service from the WebSocket layer entirely. The payment service doesn't know or care how many dashboard instances are running — it just publishes events. Each WebSocket server subscribes independently and pushes updates to its own set of connected clients.
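The subscriber side of that glue is a short loop. Here's a minimal sketch using plain channels so the shape is visible on its own; the names (`forwardEvents`, the `broadcast` channel) are illustrative, and in the real server the `events` channel would be the message channel returned by a Redis client's subscribe call (e.g. `pubsub.Channel()` in go-redis) on `payment:events`:

```go
package main

import "fmt"

// forwardEvents drains a stream of raw event payloads (in production this
// would be the channel from a Redis Pub/Sub subscription on
// "payment:events") and hands each one to the hub's broadcast channel.
// The hub, not this loop, decides which clients receive each event.
func forwardEvents(events <-chan string, broadcast chan<- []byte, done <-chan struct{}) {
	for {
		select {
		case payload, ok := <-events:
			if !ok {
				return // subscription closed
			}
			broadcast <- []byte(payload)
		case <-done:
			return
		}
	}
}

func main() {
	events := make(chan string, 2)
	broadcast := make(chan []byte, 2)
	events <- `{"type":"payment.captured","merchant":"mch_a1b2c3"}`
	close(events)

	forwardEvents(events, broadcast, nil)
	fmt.Println(string(<-broadcast))
}
```

Because each server instance runs its own copy of this loop against the same channel, adding a dashboard instance is just one more subscriber; the payment service never changes.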
Polling vs SSE vs WebSocket
Before committing to WebSockets, we evaluated the three main approaches: short polling, Server-Sent Events (SSE), and WebSockets. Here's how they stacked up for our use case.
SSE would have worked for a one-way feed, but our dashboard also sends filter changes and subscription updates from the client — operators can scope their view to specific merchants or payment methods. WebSockets gave us that bidirectional channel without needing a separate REST endpoint for client-to-server messages.
The Go WebSocket Handler
We used gorilla/websocket for the server side. The handler upgrades the HTTP connection, sets up ping/pong for keepalive, and spins up a read pump and write pump per connection. Here's the core of it:
```go
var upgrader = websocket.Upgrader{
	ReadBufferSize:  1024,
	WriteBufferSize: 1024,
	CheckOrigin: func(r *http.Request) bool {
		origin := r.Header.Get("Origin")
		return origin == "https://dashboard.internal.co"
	},
}

func (s *Server) HandleWS(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Printf("upgrade failed: %v", err)
		return
	}

	client := &Client{
		conn: conn,
		send: make(chan []byte, 256),
	}
	s.hub.register <- client

	// Configure keepalive
	conn.SetReadDeadline(time.Now().Add(60 * time.Second))
	conn.SetPongHandler(func(string) error {
		conn.SetReadDeadline(time.Now().Add(60 * time.Second))
		return nil
	})

	go client.writePump(s.hub)
	go client.readPump(s.hub)
}
```
The CheckOrigin function is important — in production, you want to lock this down to your actual dashboard domain. The default gorilla behavior rejects cross-origin requests, which is the right starting point.
The ping/pong mechanism is our heartbeat. The server sends a ping every 30 seconds, and if the client doesn't respond with a pong within 60 seconds, we consider the connection dead and clean it up. This catches cases where a browser tab goes to sleep or a network path silently drops.
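The read pump referenced in the handler isn't shown above; here's a minimal sketch of its shape. The `wsReader` interface and the `scriptedConn` fake stand in for gorilla's `*websocket.Conn` so the control flow is visible on its own, and the names are illustrative rather than our exact code:

```go
package main

import (
	"errors"
	"fmt"
)

// wsReader captures the one method the read pump needs from the
// gorilla *websocket.Conn. On a real connection, a blocked ReadMessage
// also fails when the read deadline expires with no pong.
type wsReader interface {
	ReadMessage() (messageType int, p []byte, err error)
}

// readPump blocks on the connection until a read fails: peer closed,
// heartbeat timed out, or the network dropped. Exiting the loop is
// what triggers cleanup.
func readPump(conn wsReader, inbound chan<- []byte, unregister func()) {
	defer unregister() // remove client from hub, close its send channel
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			return // dead or closed connection; deadline expiry lands here too
		}
		inbound <- msg // e.g. client subscription/filter updates
	}
}

// scriptedConn is a fake connection for the demo below.
type scriptedConn struct{ queue [][]byte }

func (c *scriptedConn) ReadMessage() (int, []byte, error) {
	if len(c.queue) == 0 {
		return 0, nil, errors.New("connection closed")
	}
	m := c.queue[0]
	c.queue = c.queue[1:]
	return 1, m, nil
}

func main() {
	inbound := make(chan []byte, 4)
	cleaned := false
	conn := &scriptedConn{queue: [][]byte{[]byte(`{"type":"subscribe"}`)}}
	readPump(conn, inbound, func() { cleaned = true })
	fmt.Println(len(inbound), cleaned) // one message delivered, cleanup ran
}
```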
Redis Pub/Sub for Fan-Out
Each WebSocket server instance subscribes to a Redis channel called payment:events. When the payment service processes a webhook from Stripe or Adyen, it publishes a JSON event to that channel. Every server instance receives it and forwards it to connected clients that have subscribed to the relevant merchant.
Redis Pub/Sub is fire-and-forget — if a WebSocket server is down when an event is published, it misses it. For our dashboard, this was acceptable because we also load the last 50 events on initial connection. For systems where you can't afford to miss events, consider Redis Streams or Kafka instead.
The fan-out logic in the hub is simple: iterate over registered clients, check if the event's merchant ID matches the client's subscription filter, and if so, push the serialized event into the client's send channel. Non-blocking sends with a default case ensure one slow client doesn't block the entire hub.
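A minimal sketch of that loop, with illustrative names (the `client` struct and its single merchant filter are simplified from the real code):

```go
package main

import "fmt"

type client struct {
	merchantID string      // subscription filter set by the client
	send       chan []byte // buffered; the write pump drains it
}

// fanOut delivers one serialized event to every client subscribed to
// that merchant. The default case is the backpressure valve: if a
// client's buffer is full, we drop rather than block the hub.
// (In production we go further and disconnect slow clients.)
func fanOut(clients map[*client]bool, merchantID string, payload []byte) (delivered, dropped int) {
	for c := range clients {
		if c.merchantID != merchantID {
			continue
		}
		select {
		case c.send <- payload:
			delivered++
		default:
			dropped++ // slow client: buffer full
		}
	}
	return delivered, dropped
}

func main() {
	fast := &client{merchantID: "mch_a1b2c3", send: make(chan []byte, 1)}
	slow := &client{merchantID: "mch_a1b2c3", send: make(chan []byte)} // unbuffered, nobody reading
	other := &client{merchantID: "mch_zzz", send: make(chan []byte, 1)}
	clients := map[*client]bool{fast: true, slow: true, other: true}

	d, x := fanOut(clients, "mch_a1b2c3", []byte(`{"type":"captured"}`))
	fmt.Println(d, x) // one delivered, one dropped; other merchant untouched
}
```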
Connection Lifecycle Management
Heartbeats
The server sends a WebSocket ping frame every 30 seconds. If the pong doesn't come back within the read deadline (60 seconds), the read pump exits, which triggers cleanup. This is critical in cloud environments where idle TCP connections get silently killed by load balancers — AWS ALB, for instance, has a default idle timeout of 60 seconds.
Client-Side Reconnection
The browser side needs to handle disconnections gracefully. We use exponential backoff with jitter to avoid a thundering herd when the server restarts:
```javascript
class PaymentSocket {
  constructor(url) {
    this.url = url;
    this.attempt = 0;
    this.maxDelay = 30000;
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      console.log('Connected to payment feed');
      this.attempt = 0; // Reset on successful connection
      this.subscribe({ merchants: ['mch_a1b2c3'] });
    };

    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      this.handlePaymentEvent(data);
    };

    this.ws.onclose = (event) => {
      if (event.code !== 1000) {
        this.reconnect();
      }
    };
  }

  reconnect() {
    const base = Math.min(1000 * Math.pow(2, this.attempt), this.maxDelay);
    const jitter = base * (0.5 + Math.random() * 0.5);
    this.attempt++;
    console.log(`Reconnecting in ${Math.round(jitter)}ms (attempt ${this.attempt})`);
    setTimeout(() => this.connect(), jitter);
  }

  subscribe(filters) {
    if (this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify({ type: 'subscribe', ...filters }));
    }
  }

  handlePaymentEvent(data) {
    // Update dashboard UI
  }
}
```
The jitter is the key detail. Without it, if your server restarts and 500 clients all try to reconnect at exactly 1s, then 2s, then 4s, you get periodic spikes that can overwhelm the server before it stabilizes. Jitter spreads those reconnections out.
Scaling Considerations
WebSocket connections are stateful, which makes horizontal scaling trickier than stateless HTTP. A few things we learned:
- Sticky sessions are optional with Redis Pub/Sub. Because every server instance subscribes to the same Redis channel, it doesn't matter which instance a client connects to. Any instance can deliver any event. This was a deliberate design choice — it means we can scale WebSocket servers independently without worrying about session affinity.
- Connection limits matter. Each WebSocket connection holds a goroutine for reading and one for writing. At 12K connections, that's 24K goroutines per instance. Go handles this well (goroutine stacks start at about 2KB and grow on demand), but you need to raise your OS file descriptor limits (`ulimit -n`) and keep an eye on memory.
- Graceful shutdown is non-negotiable. When deploying a new version, we send a close frame with code 1001 (Going Away) to all connected clients before shutting down. The client-side reconnection logic kicks in and connects to a healthy instance. Without this, clients see an abrupt disconnect and the user experience suffers.
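The shutdown sequence above can be sketched as follows. The `controlWriter` interface and `fakeConn` stand in for gorilla's `*websocket.Conn`, and `closePayload` mirrors what gorilla's `websocket.FormatCloseMessage` produces per RFC 6455; names are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

const closeMessage = 8 // RFC 6455 close frame opcode (gorilla: websocket.CloseMessage)

// closePayload builds a close frame body per RFC 6455: a 2-byte
// big-endian status code followed by a UTF-8 reason string.
func closePayload(code int, reason string) []byte {
	return append([]byte{byte(code >> 8), byte(code)}, reason...)
}

// controlWriter stands in for the gorilla *websocket.Conn control-frame API.
type controlWriter interface {
	WriteControl(messageType int, data []byte, deadline time.Time) error
}

// shutdown tells every client to go away with 1001 before the process
// exits; their reconnect logic lands them on a healthy instance.
func shutdown(conns []controlWriter) {
	payload := closePayload(1001, "server restarting")
	deadline := time.Now().Add(5 * time.Second)
	for _, c := range conns {
		_ = c.WriteControl(closeMessage, payload, deadline) // best effort
	}
}

// fakeConn records the last control frame for the demo below.
type fakeConn struct{ last []byte }

func (f *fakeConn) WriteControl(_ int, data []byte, _ time.Time) error {
	f.last = data
	return nil
}

func main() {
	c := &fakeConn{}
	shutdown([]controlWriter{c})
	code := int(c.last[0])<<8 | int(c.last[1])
	fmt.Println(code, string(c.last[2:]))
}
```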
Lessons Learned
After running this in production for over a year, here's what I'd tell someone starting out:
- Start with SSE if you only need server-to-client. WebSockets add complexity. If your dashboard is purely a read-only feed, SSE is simpler and works through HTTP/2 proxies without special configuration.
- Always implement backpressure. If a client can't keep up with the event rate, you need to drop messages or disconnect them. A buffered channel with a non-blocking send is the simplest approach in Go — if the channel is full, close the connection.
- Monitor connection counts, not just request rates. Traditional HTTP metrics don't capture WebSocket health. We track active connections, message throughput, and reconnection rates as first-class metrics in Prometheus.
- Test with realistic connection counts. Our load tests spin up 15K concurrent WebSocket clients using `k6` with the WebSocket extension. The behavior at 100 connections is nothing like the behavior at 10K.
- Redis Pub/Sub has no persistence. We learned this the hard way during a Redis failover. Events published during the ~3 second failover window were lost. We added a catch-up mechanism: on reconnect, the client sends its last-seen event ID, and the server backfills from a short-lived Redis Stream.
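The catch-up logic from that last lesson can be sketched over an in-memory slice. In the real server the backfill is an exclusive range read on the Redis Stream (with go-redis, something like `rdb.XRange(ctx, stream, "("+lastSeen, "+")`, where the stream name and field layout are up to you); everything here is illustrative:

```go
package main

import "fmt"

type event struct {
	ID      string // Redis Stream-style IDs: "<ms-timestamp>-<seq>"
	Payload string
}

// backfill returns every event after lastSeen. If lastSeen is no
// longer in the window (the stream was trimmed), we resend the whole
// window rather than silently skip events.
func backfill(window []event, lastSeen string) []event {
	for i, e := range window {
		if e.ID == lastSeen {
			return window[i+1:]
		}
	}
	return window
}

func main() {
	window := []event{
		{ID: "1700000000000-0", Payload: "authorized"},
		{ID: "1700000000001-0", Payload: "captured"},
		{ID: "1700000000002-0", Payload: "refunded"},
	}
	for _, e := range backfill(window, "1700000000000-0") {
		fmt.Println(e.ID, e.Payload)
	}
}
```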
The biggest win wasn't technical — it was trust. Once the ops team saw payments update instantly on their screens, they stopped refreshing and started relying on the dashboard as their primary tool. That behavioral shift was worth more than any latency number.
References
- gorilla/websocket — Go WebSocket package documentation
- MDN Web Docs — WebSocket API
- Redis Pub/Sub documentation
- RFC 6455 — The WebSocket Protocol
- k6 WebSocket testing documentation
Disclaimer: This article reflects the author's personal experience and opinions. Product names, logos, and brands are property of their respective owners. Pricing and features mentioned are subject to change — always verify with official documentation.