A load-balanced system with more servers gives you better latency not just because you have more capacity, but because the queuing math has a shape that surprises most people.
I ran into this recently when someone asked me to walk through an M/M/c queue model. The setup: c servers, each processing one request at a time, with a load balancer holding an infinite queue. Assume each request takes one second to serve on average. If you're offering c * 0.8 requests per second (80% utilization per server), what happens to mean request latency as you increase c?
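Here's a minimal sketch of that setup, assuming the textbook M/M/c results (Poisson arrivals, exponential service times) and a mean service time of one second, so that c * 0.8 requests per second works out to 80% utilization per server. The function names `erlang_c` and `mean_latency` are mine, not from the source.

```python
def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to queue in an M/M/c system.

    offered_load is arrival rate times mean service time (in erlangs).
    It must be below the server count or the queue grows without bound.
    """
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below server count")
    # Erlang B via the standard recurrence, which avoids large factorials.
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    utilization = offered_load / servers
    # Convert the Erlang B value into Erlang C.
    return blocking / (1.0 - utilization * (1.0 - blocking))


def mean_latency(servers: int, utilization: float, service_time: float = 1.0) -> float:
    """Mean time in system (queue wait plus service) for an M/M/c queue."""
    offered_load = servers * utilization
    p_queue = erlang_c(servers, offered_load)
    mean_wait = p_queue * service_time / (servers * (1.0 - utilization))
    return mean_wait + service_time


if __name__ == "__main__":
    # Per-server utilization stays at 80% while the cluster grows.
    for c in (1, 2, 5, 10, 20, 50):
        print(f"c={c:3d}  mean latency = {mean_latency(c, 0.8):.3f} s")
```

With a single server this reduces to the familiar M/M/1 result (five seconds at 80% utilization), and the printed values approach the one-second service time as c grows.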
Most people expect latency to improve slowly, or maybe linearly. The actual result is that latency drops faster than your intuition suggests, especially in the tail.
Here's the specific finding that stuck with me. At 5 servers and half-saturation (offered load equal to half of what the servers can handle), about 13% of requests end up in queue. Double the servers to 10, keep per-server utilization constant at 50%, and that queuing probability drops to 3.6%. Double again to 20 servers, and it falls well below 1% at half load.
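If you want to check those numbers yourself, here's a small sketch that evaluates the Erlang C queuing probability at half load for 5, 10, and 20 servers. It assumes the standard M/M/c model, and `erlang_c` is my own helper name rather than anything from the source.

```python
def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to queue in an M/M/c system."""
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below server count")
    # Erlang B via the standard recurrence, then convert to Erlang C.
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


if __name__ == "__main__":
    for c in (5, 10, 20):
        # Half-saturation: offered load is half the number of servers.
        p_queue = erlang_c(c, 0.5 * c)
        print(f"{c:2d} servers at 50% utilization: P(queue) = {p_queue:.2%}")
```

The first two lines of output should land close to the 13% and 3.6% figures quoted above.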
That's not a linear relationship. The queuing probability is a function of the Erlang C formula, and its shape is convex in a way that favors scale.
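For reference, here is the Erlang C formula in its usual textbook form. The notation is my choice, not the source's: c is the number of servers, λ the arrival rate, μ the per-server service rate, and a = λ/μ the offered load, with a < c.

```latex
C(c, a) =
  \frac{\dfrac{a^{c}}{c!} \cdot \dfrac{c}{c - a}}
       {\displaystyle\sum_{k=0}^{c-1} \frac{a^{k}}{k!}
        + \dfrac{a^{c}}{c!} \cdot \dfrac{c}{c - a}},
\qquad
W_q = \frac{C(c, a)}{c\mu - \lambda}
```

C(c, a) is the probability that an arriving request finds all c servers busy and has to wait, and W_q is the resulting mean time spent in queue.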
The practical takeaway isn't "add more servers." It's that load balancer latency is often not the right thing to optimize when you have headroom. If your services are sitting at 60-70% utilization, adding capacity will reduce queuing more than you'd predict from simple capacity math.
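To make that concrete, here's a sketch that holds total offered load fixed and adds servers. The starting point (10 servers at 65% utilization, so 6.5 erlangs of load) is a made-up illustration, not a figure from the source, and `erlang_c` is my own helper under the standard M/M/c assumptions.

```python
def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to queue in an M/M/c system."""
    if offered_load >= servers:
        raise ValueError("unstable: offered load must be below server count")
    # Erlang B via the standard recurrence, then convert to Erlang C.
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


# Hypothetical cluster: 10 servers at 65% utilization.
BASE_SERVERS = 10
OFFERED_LOAD = 6.5

if __name__ == "__main__":
    base = erlang_c(BASE_SERVERS, OFFERED_LOAD)
    for c in (10, 11, 12, 14):
        p_queue = erlang_c(c, OFFERED_LOAD)
        extra_capacity = (c - BASE_SERVERS) / BASE_SERVERS
        print(
            f"{c} servers (+{extra_capacity:.0%} capacity, "
            f"{OFFERED_LOAD / c:.0%} utilization): "
            f"P(queue) = {p_queue:.2%}, {1 - p_queue / base:.0%} lower than baseline"
        )
```

Each line prints the capacity increase next to the relative drop in queuing probability, so you can compare the two directly.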
The flip side: if you're already running lean (85%+ utilization), adding a couple servers won't move the needle much. You're in the steep part of the curve where queuing probability climbs quickly.
This is a reminder that queueing theory has real teeth in distributed systems design. The counterintuitive behavior is worth knowing when you're sizing clusters or debugging latency spikes that seem out of proportion with load.
Source: Marc Brooker (brooker.co.za) on M/M/c queuing systems and Erlang's C formula.