I spent a month benchmarking LLM gateway overhead. Measured proxy latency down to the microsecond. Ran load tests at 500, 1000, 5000 RPS. Built dashboards to track P99 gateway overhead.
Then my teammate asked: "What percentage of total request time is the gateway?"
I ran the query. The answer was 0.3%.
The Math Nobody Talks About
Here's what LLM API calls actually cost in latency right now (mid-2026):
| Model | Time to First Token | Total Response Time |
|---|---|---|
| GPT-4o | ~850ms | 2-8s |
| Claude Sonnet 4 | ~900ms | 3-15s |
| Claude Fable 5 | ~147s | ~155s |
| GPT-4.1 | ~1,100ms | 3-12s |
| Gemini 2.5 Flash | ~500ms | 1-5s |
Source: Artificial Analysis, ailatency.com, personal measurements
Now here's what gateways add:
| Gateway | Overhead |
|---|---|
| Direct API call (no gateway) | 0ms |
| Python-based proxy | 8-40ms |
| Go/Rust-based proxy | 1-11ms |
So the "debate" is about whether you add 8ms or 1ms to a call that takes 1,000-155,000ms.
That's like arguing about whether to use a faster USB cable to transfer a file that's downloading from a satellite.
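To make the ratio concrete, here is a minimal sketch of the arithmetic. The 40ms and 1ms overheads come from the gateway table; the 3,000ms response time is an illustrative pick from the ranges in the model table, and `gateway_share` is a helper name I made up for this example.

```python
def gateway_share(overhead_ms: float, model_ms: float) -> float:
    """Return gateway overhead as a percentage of total request time."""
    total_ms = overhead_ms + model_ms
    return 100.0 * overhead_ms / total_ms


# Worst-case Python proxy (40ms) and best-case Go/Rust proxy (1ms),
# both in front of an assumed 3,000ms model response.
for label, overhead_ms in [("Python proxy, worst case", 40.0),
                           ("Go/Rust proxy, best case", 1.0)]:
    share = gateway_share(overhead_ms, 3000.0)
    print(f"{label}: {share:.2f}% of total request time")
```

Plug in your own numbers. The slower the model, the smaller both percentages get.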
"But What About P99 at 5000 RPS?"
I've seen the benchmarks floating around. You know the ones — "50x faster P99 latency!" tested on a t3.medium (2 vCPU, 4GB RAM) at 500 RPS.
Let's think about what that test actually measures:
- You take a Python process and a Go process
- You hammer both with 500 requests per second on a tiny machine
- The Python process runs out of resources first (shocking, right?)
- You declare victory
But in production:
- Nobody runs a single-instance proxy at 500 RPS on a 4GB machine. You scale horizontally. That's what Kubernetes is for.
- At 4 instances, P99 drops from 630ms to 150ms at 1000+ RPS. Add more instances, it keeps dropping. (LiteLLM benchmark data)
- The actual LLM call takes 50-1000x longer than any gateway overhead. Your P99 is dominated by the model, not the proxy.
The "50x faster" claim is technically true in the same way that a Ferrari is faster than a bicycle — but if both are stuck behind the same traffic jam (the LLM API call), you arrive at the same time.
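If you want to sanity-check the traffic-jam argument yourself, here is a rough simulation sketch. The lognormal model latency (median around 3,000ms with a long tail) is purely my assumption, not measured data; the gateway overhead ranges are the ones from the table above.

```python
import random
from statistics import quantiles


def p99(samples):
    """99th percentile of a list of latency samples in milliseconds."""
    return quantiles(samples, n=100)[98]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: model latency is lognormal, median ~3,000ms, long right tail.
model_ms = [rng.lognormvariate(8.0, 0.5) for _ in range(10_000)]

# Add per-request gateway overhead drawn from the ranges in the table.
slow_gateway = [m + rng.uniform(8, 40) for m in model_ms]  # Python-based proxy
fast_gateway = [m + rng.uniform(1, 11) for m in model_ms]  # Go/Rust-based proxy

gap_ms = p99(slow_gateway) - p99(fast_gateway)
print(f"P99 behind slow gateway: {p99(slow_gateway):.0f}ms")
print(f"P99 behind fast gateway: {p99(fast_gateway):.0f}ms")
print(f"Difference: {gap_ms:.1f}ms")
```

The gap between the two end-to-end P99s can never exceed the gap between the overheads themselves, so it stays in the tens of milliseconds while the P99 itself is measured in seconds.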
What Actually Affects Your LLM Latency
After a month of measuring, here's what actually moved the needle for us:
1. Model selection (10-100x impact)
Switching from GPT-4o to Gemini 2.5 Flash for non-critical calls cut our average latency by 60%. No gateway change needed.
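```python
from litellm import completion

# Placeholder prompt for illustration; any chat messages list works here
messages = [{"role": "user", "content": "Summarize this ticket in one sentence."}]

# Before: ~850ms TTFT
response = completion(model="gpt-4o", messages=messages)

# After: ~500ms TTFT
response = completion(model="gemini/gemini-2.5-flash", messages=messages)
```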
2. Intelligent routing (2-5x impact)
Routing based on latency, not just round-robin, cut our P99 by 40%:
```yaml
# litellm config.yaml
model_list:
  - model_name: fast-chat
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_KEY_1
  - model_name: fast-chat
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_KEY_2
router_settings:
  routing_strategy: latency-based-routing
```
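For intuition about what latency-based routing does, here is a toy sketch of the idea. This is not LiteLLM's implementation; `LatencyRouter`, the window size, and the deployment names are all hypothetical.

```python
import random
from collections import defaultdict, deque


class LatencyRouter:
    """Pick the deployment with the lowest recent average latency."""

    def __init__(self, deployments, window=20):
        self.deployments = list(deployments)
        # Keep only the most recent `window` latency samples per deployment.
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, deployment, latency_ms):
        """Store one observed latency for a deployment."""
        self.samples[deployment].append(latency_ms)

    def pick(self):
        # Deployments with no samples yet go first so they get measured.
        untested = [d for d in self.deployments if not self.samples[d]]
        if untested:
            return random.choice(untested)
        return min(
            self.deployments,
            key=lambda d: sum(self.samples[d]) / len(self.samples[d]),
        )


router = LatencyRouter(["openai-key-1", "openai-key-2"])
router.record("openai-key-1", 900.0)
router.record("openai-key-2", 450.0)
print(router.pick())  # the deployment with the lower average so far
```

The sliding window matters: a deployment that was slow an hour ago should get another chance once its old samples age out.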
3. Caching (∞ impact on cache hits)
Caching cut redundant calls by ~30% in our agent workflows:
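```python
from litellm import completion

# Note: caching=True only has an effect once a cache backend has been
# configured (litellm.cache); that setup is not shown here.
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    caching=True,
)
```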
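Why is a cache hit effectively free? A stripped-down sketch of exact-match caching shows it. This is an in-memory illustration under my own assumptions, not how any particular gateway stores responses; `cached_completion` and `fake_model` are hypothetical names.

```python
import hashlib
import json

_cache = {}


def cache_key(model, messages):
    """Build a stable key from the model name and the exact messages."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(model, messages, call_model):
    """Return a cached response if present, otherwise call the model once."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call_model(model, messages)
    return _cache[key]


calls = []


def fake_model(model, messages):
    # Hypothetical stand-in for a real provider call.
    calls.append(model)
    return "Paris"


question = [{"role": "user", "content": "What is the capital of France?"}]
first = cached_completion("gpt-4o", question, fake_model)
second = cached_completion("gpt-4o", question, fake_model)
print(f"model calls made: {len(calls)}")
```

The second request never reaches the provider, which is where the "∞ impact" comes from: a dictionary lookup versus a multi-second network call.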
4. Prompt optimization (2-10x impact)
Shorter prompts = faster responses. We trimmed our system prompts from 2000 tokens to 800 and saw 35% faster responses. Zero infrastructure changes.
5. Provider failover (reliability, not speed)
When OpenAI's having a bad day (it happens), automatic failover to Anthropic or Google means your users don't notice:
```yaml
model_list:
  - model_name: production-chat
    litellm_params:
      model: gpt-4o
  - model_name: production-chat
    litellm_params:
      model: claude-sonnet-4-20250514
  - model_name: production-chat
    litellm_params:
      model: gemini/gemini-2.5-pro
```
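The failover behavior itself boils down to "try the next provider when one fails". Here is a minimal sketch of that loop, with hypothetical stand-in functions instead of real provider calls; a real gateway would also handle retries, cooldowns, and narrower error types.

```python
def complete_with_failover(providers, messages):
    """Try each (name, call) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(messages)
        except Exception as exc:  # a real gateway would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_openai(messages):
    # Hypothetical provider that is having a bad day.
    raise TimeoutError("upstream timed out")


def healthy_anthropic(messages):
    # Hypothetical provider that answers normally.
    return "ok"


used, result = complete_with_failover(
    [("gpt-4o", flaky_openai), ("claude-sonnet-4-20250514", healthy_anthropic)],
    [{"role": "user", "content": "ping"}],
)
print(f"served by {used}: {result}")
```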
None of these optimizations require you to care about whether your gateway adds 1ms or 40ms.
The Real Gateway Decision
If you're choosing an LLM gateway, here's what actually matters:
- Provider coverage — Can it talk to all the models you need? (Some gateways support 15 providers. Others support 100+. This matters when you want to try a new model next month.)
- Routing and failover — Does it handle provider outages gracefully?
- Cost tracking — Can you see which team/feature/user is burning tokens?
- Ecosystem and community — When something breaks at 2am, are there people to help? Check GitHub stars, contributor count, and how fast issues get resolved.
- Extensibility — Can you add custom logic without forking the codebase?
Gateway overhead in microseconds? That's item #47 on the list.
The Uncomfortable Truth
The "gateway latency" narrative exists because it's easy to measure and easy to market. "50x faster!" is a great headline.
But if you're building production AI systems, you already know: the hard problems aren't microsecond overhead. They're cost management, provider reliability, model routing, observability, and not burning $10k on a runaway agent at 3am.
I'd rather have a gateway that adds 40ms but tells me exactly which agent call costs $0.47 and why, than one that adds 1ms but leaves me blind.
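That kind of visibility is mostly bookkeeping. Here is a rough sketch of per-tag cost attribution; the prices, the model name, and the log format are all made up for illustration, so substitute your provider's real pricing and your gateway's real usage records.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; not any provider's real pricing.
PRICES = {"example-model": {"input": 2.50, "output": 10.00}}


def call_cost(model, input_tokens, output_tokens):
    """Cost in USD of a single call, given its token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


def cost_by_tag(calls):
    """Sum call costs per tag (a tag could be a team, feature, or agent)."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["tag"]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)


# Hypothetical usage log, as a gateway might record it.
log = [
    {"tag": "research-agent", "model": "example-model",
     "input_tokens": 100_000, "output_tokens": 22_000},
    {"tag": "chat-widget", "model": "example-model",
     "input_tokens": 1_200, "output_tokens": 300},
]
for tag, usd in cost_by_tag(log).items():
    print(f"{tag}: ${usd:.4f}")
```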
What's your biggest LLM infrastructure pain point? Gateway latency probably isn't it. Drop a comment — I'm curious what you're actually struggling with.
Top comments (1)
The percentage-of-total-time question is the sanity check every latency project needs. Shaving 80ms off a gateway does not matter much if the user is waiting 9 seconds on model generation or downstream IO.
I like measuring latency by user-visible phase: request accepted, first token or first progress event, final result, and post-processing. That usually shows which optimization is real and which one just feels satisfying.