The latency tax of an LLM gateway: I measured Bifrost's overhead

#mlops #llm #machinelearning #infrastructure

TL;DR: I was skeptical that putting a gateway in front of our LLM calls was worth the added hop. So I measured it. Bifrost's in-process overhead landed in the tens of microseconds at p50 on our box, and the real cost was the extra network hop, not the gateway code. Numbers and config below.

I run the fine-tuning and eval team at Nexus Labs. We're Series B, about 40 people, and our agent-automation product fans out a lot of parallel LLM calls during eval runs. Hundreds of concurrent requests against OpenAI, Anthropic, and a self-hosted vLLM endpoint.

For two years we called provider SDKs directly. Then the usual problems showed up. Key rotation across three OpenAI keys. Failover when Anthropic 529s during an eval batch. No single place to see token spend per experiment.

A gateway solves all of that. My objection was latency. Every abstraction layer costs something, and I don't add layers I can't account for.

What I actually measured

I tested Bifrost because it's written in Go, and I wanted to know whether "high-performance" meant anything or was a README adjective.

Setup: gateway and a mock provider on the same host first, to isolate the gateway's own processing cost from network latency. Then a realistic split with the gateway on a separate node. Each run used 200 concurrent connections, 50k requests, and small chat payloads.

The in-process number was the one I cared about. The gateway's added processing sat in the tens of microseconds at p50. At p99 under load it crept up but stayed well under a millisecond. That's noise next to a 600ms LLM round trip.

The honest cost is the network hop. Put the gateway on a different node and you pay whatever your intra-VPC latency is. For us that was around 1ms. Predictable. Accountable. I can defend it in a design review, which is my only real requirement.
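
To put those figures side by side, here is a back-of-the-envelope calculation. The 50 µs value is an assumed stand-in for "tens of microseconds", since no exact number is given above. The 1 ms hop and 600 ms round trip are the figures quoted in the text.

```python
# Latency budget sketch. GATEWAY_P50_S is an assumption, not a measured value.
GATEWAY_P50_S = 50e-6      # assumed: "tens of microseconds" of gateway processing
NETWORK_HOP_S = 1e-3       # ~1 ms intra-VPC hop to a gateway on another node
LLM_ROUND_TRIP_S = 0.600   # ~600 ms provider round trip


def overhead_share(gateway_s: float, hop_s: float, upstream_s: float) -> float:
    """Fraction of total request time spent on the gateway plus its network hop."""
    added = gateway_s + hop_s
    return added / (upstream_s + added)


share = overhead_share(GATEWAY_P50_S, NETWORK_HOP_S, LLM_ROUND_TRIP_S)
print(f"added latency: {(GATEWAY_P50_S + NETWORK_HOP_S) * 1e3:.2f} ms")
print(f"share of total: {share:.2%}")
```

Under those assumptions the gateway and its hop come to roughly 0.17% of the request, and almost all of that is the hop.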

# rough reproduction with a mock upstream
docker run -p 8080:8080 maximhq/bifrost

# fire 50k requests, 200 concurrent
hey -n 50000 -c 200 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}' \
  http://localhost:8080/v1/chat/completions
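
To turn two such runs into an overhead number, compare the same percentile with and without the gateway in the path. The sketch below assumes you already have per-request latencies as lists of floats (hey can write per-request rows with `-o csv`). The function names and the sample values are illustrative, not measurements from this post.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def added_latency(direct: list[float], via_gateway: list[float], pct: float) -> float:
    """Gateway-run percentile minus direct-run percentile, in the samples' unit."""
    return percentile(via_gateway, pct) - percentile(direct, pct)


# Made-up latencies in milliseconds, only to show the call shape.
direct_ms = [1.0, 2.0, 3.0, 4.0]
gateway_ms = [1.5, 2.5, 3.5, 4.5]
print(added_latency(direct_ms, gateway_ms, 50))
print(added_latency(direct_ms, gateway_ms, 99))
```

Subtracting percentiles is a rough comparison rather than a per-request delta, so repeat each run a few times before trusting a small difference.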

Why Go matters here

Our previous candidate was LiteLLM. It's the most popular option and the provider coverage is excellent. But the proxy is Python, and under our concurrency the per-request overhead and tail latency were higher than I wanted for an eval fan-out. That's not a knock on the project. It's a runtime characteristic. For a low-volume app you'd never notice.

Bifrost runs as a single Go binary or a Docker image, and the OpenAI-compatible API meant our existing client changed by one base URL. No rewrite.
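
As a rough illustration of what "changed by one base URL" means, here is a standard-library sketch that builds the same OpenAI-style request against two base URLs without sending it. `build_chat_request` is a hypothetical helper and both URLs are assumptions. The gateway address simply mirrors the localhost:8080 used in the benchmark above.

```python
import json
import urllib.request

# Illustrative base URLs; adjust both for your own deployment.
DIRECT_BASE_URL = "https://api.openai.com/v1"
GATEWAY_BASE_URL = "http://localhost:8080/v1"


def build_chat_request(base_url: str, model: str, prompt: str, api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )


direct = build_chat_request(DIRECT_BASE_URL, "openai/gpt-4o-mini", "ping")
proxied = build_chat_request(GATEWAY_BASE_URL, "openai/gpt-4o-mini", "ping")
print(direct.full_url)
print(proxied.full_url)
```

The payload and path are identical in both cases. Only the host the client talks to differs, which is why an SDK that accepts a base URL needs no other change.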

{
  "providers": {
    "openai": { "keys": [{"value": "env.OPENAI_KEY_1"}, {"value": "env.OPENAI_KEY_2"}] },
    "anthropic": { "keys": [{"value": "env.ANTHROPIC_KEY"}] }
  }
}

Load balancing across those keys and automatic fallback to Anthropic when OpenAI errors out are config, not code. That removed about 200 lines of retry wrapper we'd accumulated.
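
For a sense of what that kind of wrapper does, here is a minimal sketch of key rotation plus provider fallback. It is an illustration of the pattern only. It is not Bifrost's implementation and not our actual wrapper, and every name in it is hypothetical.

```python
import itertools
from typing import Callable

# A provider call takes (api_key, prompt) and returns text, or raises on failure.
ProviderCall = Callable[[str, str], str]


class KeyRotatingProvider:
    """Round-robins over the API keys configured for one provider."""

    def __init__(self, call: ProviderCall, keys: list[str]) -> None:
        if not keys:
            raise ValueError("at least one key required")
        self._call = call
        self._keys = itertools.cycle(keys)

    def complete(self, prompt: str) -> str:
        return self._call(next(self._keys), prompt)


def complete_with_fallback(providers: list[KeyRotatingProvider], prompt: str) -> str:
    """Try providers in order and return the first success; raise if all fail."""
    last_error = None
    for provider in providers:
        try:
            return provider.complete(prompt)
        except Exception as exc:  # a real wrapper would catch narrower error types
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

A production version also needs backoff, per-error-type handling, and metrics, which is how a wrapper like this grows to a couple of hundred lines.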

How the three compare

Dimension	Bifrost	LiteLLM	Portkey
Runtime	Go binary	Python proxy	Managed / hosted
Self-host	Yes, single binary	Yes	Self-host available, hosted-first
Per-request overhead (my test)	tens of µs p50	higher under heavy concurrency	network-bound, hosted
Provider coverage	23+	broadest	broad
Observability	native Prometheus	callbacks, integrations	strong managed dashboard
Best at	low-overhead self-host	maximum provider breadth	turnkey hosted analytics

Where the others win: LiteLLM has the widest provider list and a huge community, so obscure providers land there first. Portkey's hosted dashboard is more polished than anything you stand up yourself on day one, and if you don't want to run infra, that's a real advantage. I run infra. I wanted a binary and Prometheus.

Observability without a new stack

The thing that closed it for me was native Prometheus metrics. We already scrape Prometheus for our vLLM nodes. Bifrost exposes latency and token counts on the same surface, so per-experiment spend showed up in our existing Grafana boards without a new agent or a vendor SDK.
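
If you want to sanity-check per-experiment totals outside Grafana, a scrape of the metrics endpoint can be summed by label with a few lines of Python. This is a sketch under loud assumptions: the metric name `gateway_tokens_total` and the `experiment` label are hypothetical placeholders, not names taken from Bifrost's documentation, and the parser only handles simple exposition lines.

```python
import re
from collections import defaultdict

# Matches simple exposition lines such as: name{label="value",...} 123
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def sum_by_label(exposition: str, metric: str, label: str) -> dict[str, float]:
    """Sum one metric's samples, grouped by the value of one label."""
    totals = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        if label in labels:
            totals[labels[label]] += float(match.group("value"))
    return dict(totals)


# Hypothetical scrape output, for illustration only.
SCRAPE = """
# TYPE gateway_tokens_total counter
gateway_tokens_total{experiment="exp-a",provider="openai"} 1200
gateway_tokens_total{experiment="exp-a",provider="anthropic"} 300
gateway_tokens_total{experiment="exp-b",provider="openai"} 50
gateway_requests_total{experiment="exp-a"} 9
"""
print(sum_by_label(SCRAPE, "gateway_tokens_total", "experiment"))
```

In day-to-day use the same grouping is a one-line aggregation in your dashboard's query language. The script is only useful for spot checks against the raw endpoint.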

Virtual keys gave us per-experiment budgets too. One key per eval campaign. When a runaway retry loop burned through a budget last month, the key's cap stopped the spend before it reached the bill.
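
The idea behind a per-key budget is simple enough to sketch. The class below is a hypothetical illustration of the cap behaviour, not Bifrost's implementation or its API: each charge is rejected once it would push the key past its limit, so a runaway loop stops at the cap.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push a key past its budget."""


class VirtualKeyBudget:
    """Tracks spend for one key and refuses charges beyond the limit."""

    def __init__(self, limit_usd: float) -> None:
        if limit_usd < 0:
            raise ValueError("limit must be non-negative")
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of {cost_usd} would exceed limit {self.limit_usd}"
            )
        self.spent_usd += cost_usd


# Simulated runaway retry loop: 100 attempts at an assumed $0.25 each, $1.00 cap.
budget = VirtualKeyBudget(limit_usd=1.0)
attempts = 0
try:
    for _ in range(100):
        budget.charge(0.25)
        attempts += 1
except BudgetExceeded:
    pass
print(attempts, budget.spent_usd)
```

The loop is cut off after the fourth charge, with spend sitting exactly at the limit instead of at 100 requests' worth.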

Trade-offs and limitations

This is not free.

You're adding a hop and a process to babysit. If the gateway is a single instance, it's a single point of failure, so you run more than one and load-balance, which is more infra than calling an SDK.

Provider coverage is 23+, which covered every provider we use, but it's narrower than LiteLLM's long tail. Check your specific providers against the supported list before assuming.

The microsecond numbers are mine, on my hardware, with small payloads. Large multimodal requests and streaming behave differently, and you should run hey against your own workload before trusting any blog, including this one. The gateway can't fix a slow provider. It only stops being the reason you're slow.

Semantic caching can cut cost, but for eval determinism I keep it off. Cached responses would poison a regression run.