Paul Twist

Posted on Jun 18

We Obsessed Over Gateway Latency for a Month. Then We Looked at the Actual Numbers.

#ai #llm #infrastructure #discuss

I spent a month benchmarking LLM gateway overhead. Measured proxy latency down to the microsecond. Ran load tests at 500, 1000, 5000 RPS. Built dashboards to track P99 gateway overhead.

Then my teammate asked: "What percentage of total request time is the gateway?"

I ran the query. The answer was 0.3%.

The Math Nobody Talks About

Here's what LLM API calls actually cost in latency right now (mid-2026):

Model	Time to First Token	Total Response Time
GPT-4o	~850ms	2-8s
Claude Sonnet 4	~900ms	3-15s
Claude Fable 5	~147s	~155s
GPT-4.1	~1,100ms	3-12s
Gemini 2.5 Flash	~500ms	1-5s

Source: Artificial Analysis, ailatency.com, personal measurements

Now here's what gateways add:

Gateway	Overhead
Direct API call (no gateway)	0ms
Python-based proxy	8-40ms
Go/Rust-based proxy	1-11ms

So the "debate" is about whether you add 8-40ms or 1-11ms to a call that takes anywhere from about 1,000ms to 155,000ms.

That's like arguing about whether to use a faster USB cable to transfer a file that's downloading from a satellite.
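To put numbers on that comparison, here is a minimal sketch that computes gateway overhead as a share of total request time. The function name `gateway_share` is made up for illustration, and the inputs are picked from the ranges in the two tables above, not from new measurements.

```python
def gateway_share(gateway_ms: float, model_ms: float) -> float:
    """Return gateway overhead as a percentage of total request time."""
    return 100.0 * gateway_ms / (gateway_ms + model_ms)


# Overheads and response times picked from the ranges in the tables above.
for gateway_ms in (1, 8, 40):
    for model_ms in (3_000, 155_000):
        share = gateway_share(gateway_ms, model_ms)
        print(f"{gateway_ms:>2}ms gateway on a {model_ms:>7,}ms call: {share:.3f}% of total")
```

Even the worst case here (40ms on a 3-second call) stays a little above 1% of the total.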

"But What About P99 at 5000 RPS?"

I've seen the benchmarks floating around. You know the ones — "50x faster P99 latency!" tested on a t3.medium (2 vCPU, 4GB RAM) at 500 RPS.

Let's think about what that test actually measures:

You take a Python process and a Go process
You hammer both with 500 requests per second on a tiny machine
The Python process runs out of resources first (shocking, right?)
You declare victory

But in production:

Nobody runs a single-instance proxy at 500 RPS on a 4GB machine. You scale horizontally. That's what Kubernetes is for.
At 4 instances, P99 drops from 630ms to 150ms at 1000+ RPS. Add more instances, it keeps dropping. (LiteLLM benchmark data)
The actual LLM call typically takes 50-1000x longer than the gateway overhead. Your P99 is dominated by the model, not the proxy.

The "50x faster" claim is technically true in the same way that a Ferrari is faster than a bicycle — but if both are stuck behind the same traffic jam (the LLM API call), you arrive at the same time.
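The traffic-jam point is essentially Amdahl's law: speeding up a small part of the request barely changes the whole. A rough sketch, assuming a 50ms proxy versus a 1ms proxy (a "50x" gap) in front of a 3,000ms model call. All three numbers are illustrative assumptions, not benchmark results.

```python
def end_to_end_speedup(slow_proxy_ms: float, fast_proxy_ms: float, model_ms: float) -> float:
    """Ratio of total request time with the slow proxy to total time with the fast one."""
    return (slow_proxy_ms + model_ms) / (fast_proxy_ms + model_ms)


# Assumed values: a 50x gap between proxies, in front of a 3-second model call.
proxy_speedup = 50 / 1
total_speedup = end_to_end_speedup(slow_proxy_ms=50, fast_proxy_ms=1, model_ms=3_000)

print(f"proxy-only speedup: {proxy_speedup:.0f}x")
print(f"end-to-end speedup: {total_speedup:.3f}x")
```

Under these assumptions the 50x proxy advantage shrinks to under 2% end to end.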

What Actually Affects Your LLM Latency

After a month of measuring, here's what actually moved the needle for us:

1. Model selection (10-100x impact)

Switching from GPT-4o to Gemini 2.5 Flash for non-critical calls cut our average latency by 60%. No gateway change needed.

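from litellm import completion

# Example payload; replace with your own conversation.
messages = [{"role": "user", "content": "Summarize this ticket in one sentence."}]

# Before: ~850ms TTFT
response = completion(model="gpt-4o", messages=messages)

# After: ~500ms TTFT
response = completion(model="gemini/gemini-2.5-flash", messages=messages)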

2. Intelligent routing (2-5x impact)

Routing based on latency, not just round-robin, cut our P99 by 40%:

# litellm config.yaml
model_list:
  - model_name: fast-chat
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_KEY_1
  - model_name: fast-chat
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_KEY_2

router_settings:
  routing_strategy: latency-based-routing
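For intuition about what latency-based routing does, here is a minimal sketch of the idea: keep a sliding window of recent latencies per deployment and send the next request to the one with the lowest average. This is not LiteLLM's implementation. The `LatencyRouter` class and the deployment names are hypothetical.

```python
from collections import deque


class LatencyRouter:
    """Pick the deployment with the lowest recent average latency."""

    def __init__(self, deployments, window=10):
        self.deployments = list(deployments)
        # Keep only the last `window` samples per deployment.
        self.samples = {d: deque(maxlen=window) for d in self.deployments}

    def record(self, deployment, latency_ms):
        self.samples[deployment].append(latency_ms)

    def pick(self):
        # Deployments with no samples yet go first so each one gets measured.
        for d in self.deployments:
            if not self.samples[d]:
                return d
        return min(
            self.deployments,
            key=lambda d: sum(self.samples[d]) / len(self.samples[d]),
        )


router = LatencyRouter(["key-1", "key-2"])  # hypothetical deployment names
router.record("key-1", 900)
router.record("key-2", 450)
choice = router.pick()
print(choice)
```

A real router would also need to handle cooldowns, rate limits, and stale samples, which this sketch ignores.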

3. Caching (∞ impact on cache hits)

Caching cut redundant calls by ~30% in our agent workflows:

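from litellm import completion

# Note: caching=True relies on a cache backend having been configured first
# (LiteLLM exposes this as litellm.cache); see the LiteLLM caching docs.
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    caching=True
)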
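The mechanism behind an exact-match cache is simple enough to sketch without any library: hash the model plus messages, and skip the provider call when the key has been seen before. The `ResponseCache` class and `fake_llm` function below are hypothetical stand-ins for illustration, not LiteLLM internals.

```python
import hashlib
import json


class ResponseCache:
    """Exact-match, in-memory cache keyed on model + messages."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model, messages):
        # sort_keys makes the key independent of dict ordering.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        k = self.key(model, messages)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = call(model, messages)
        self._store[k] = result
        return result


def fake_llm(model, messages):
    # Stand-in for a real provider call.
    return f"{model} answered: {messages[-1]['content']}"


cache = ResponseCache()
question = [{"role": "user", "content": "What is the capital of France?"}]
first = cache.get_or_call("gpt-4o", question, fake_llm)
second = cache.get_or_call("gpt-4o", question, fake_llm)
print(cache.hits, cache.misses)
```

The second call never reaches the provider, which is where the latency win on cache hits comes from.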

4. Prompt optimization (2-10x impact)

Shorter prompts = faster responses. We trimmed our system prompts from 2000 tokens to 800 and saw 35% faster responses. Zero infrastructure changes.

5. Provider failover (reliability, not speed)

When OpenAI's having a bad day (it happens), automatic failover to Anthropic or Google means your users don't notice:

model_list:
  - model_name: production-chat
    litellm_params:
      model: gpt-4o
  - model_name: production-chat
    litellm_params:
      model: claude-sonnet-4-20250514
  - model_name: production-chat
    litellm_params:
      model: gemini/gemini-2.5-pro
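Conceptually, failover is a loop over providers that moves on when one fails. Here is a rough sketch of that idea with stand-in provider functions. `ProviderError`, `call_with_failover`, and both fake providers are hypothetical names for illustration, not part of any gateway's API.

```python
class ProviderError(Exception):
    """Raised by a provider call that failed (timeout, 5xx, rate limit)."""


def call_with_failover(providers, prompt):
    """Try each (name, call) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def flaky_openai(prompt):
    # Simulates a provider having a bad day.
    raise ProviderError("503 from upstream")


def healthy_anthropic(prompt):
    return f"echo: {prompt}"


used, reply = call_with_failover(
    [("openai", flaky_openai), ("anthropic", healthy_anthropic)],
    "hello",
)
print(used, reply)
```

A production version would add retries with backoff and a cooldown for failing providers. The sketch only shows the ordering logic.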

None of these optimizations require you to care about whether your gateway adds 1ms or 40ms.

The Real Gateway Decision

If you're choosing an LLM gateway, here's what actually matters:

Provider coverage — Can it talk to all the models you need? (Some gateways support 15 providers. Others support 100+. This matters when you want to try a new model next month.)
Routing and failover — Does it handle provider outages gracefully?
Cost tracking — Can you see which team/feature/user is burning tokens?
Ecosystem and community — When something breaks at 2am, are there people to help? Check GitHub stars, contributor count, and how fast issues get resolved.
Extensibility — Can you add custom logic without forking the codebase?

Gateway overhead in microseconds? That's item #47 on the list.
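Cost tracking mostly comes down to tagging each call with an owner and multiplying token counts by a price table. A minimal sketch, assuming made-up model names and made-up per-million-token prices (they are not real provider pricing):

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; not real provider pricing.
PRICES = {
    "model-a": {"input": 2.50, "output": 10.00},
    "model-b": {"input": 0.10, "output": 0.40},
}


def call_cost(model, input_tokens, output_tokens):
    """Cost of one call in USD, given token counts and the price table."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def cost_by_team(calls):
    """Sum call costs per team tag."""
    totals = defaultdict(float)
    for c in calls:
        totals[c["team"]] += call_cost(c["model"], c["input_tokens"], c["output_tokens"])
    return dict(totals)


calls = [
    {"team": "search", "model": "model-a", "input_tokens": 100_000, "output_tokens": 20_000},
    {"team": "support", "model": "model-b", "input_tokens": 1_000_000, "output_tokens": 500_000},
    {"team": "support", "model": "model-a", "input_tokens": 200_000, "output_tokens": 0},
]
totals = cost_by_team(calls)
for team, usd in sorted(totals.items()):
    print(f"{team}: ${usd:.2f}")
```

The same grouping works per feature or per user if those tags are attached to each request.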

The Uncomfortable Truth

The "gateway latency" narrative exists because it's easy to measure and easy to market. "50x faster!" is a great headline.

But if you're building production AI systems, you already know: the hard problems aren't microsecond overhead. They're cost management, provider reliability, model routing, observability, and not burning $10k on a runaway agent at 3am.

I'd rather have a gateway that adds 40ms but tells me exactly which agent call costs $0.47 and why, than one that adds 1ms but leaves me blind.

What's your biggest LLM infrastructure pain point? Gateway latency probably isn't it. Drop a comment — I'm curious what you're actually struggling with.

Top comments (1)

Alex Shev • Jun 18

The percentage-of-total-time question is the sanity check every latency project needs. Shaving 80ms off a gateway does not matter much if the user is waiting 9 seconds on model generation or downstream IO.

I like measuring latency by user-visible phase: request accepted, first token or first progress event, final result, and post-processing. That usually shows which optimization is real and which one just feels satisfying.