Keeping a client's VLM inference inside the EU with a self-hosted-first gateway

#machinelearning #computervision #llm #infrastructure

TL;DR: A German automotive client needed scene descriptions of our event-camera footage, but the raw data could not leave their premises. We put Bifrost in front of an on-prem Ollama box running Qwen2.5-VL 7B, with a cloud provider as fallback for synthetic data only. The routing rule, not the model, was the hard part.

So, the thing is, most VLM tutorials assume your data wants to go to the cloud. Ours legally could not.

I work on a team of five CV engineers at Prophesee, building algorithms for event cameras. Last quarter a Tier-1 automotive supplier in Stuttgart asked us for something specific: natural-language descriptions of reconstructed event-stream frames, to seed their annotation tooling. Reasonable ask. The catch was their data governance. Recorded factory floor footage, even reconstructed, counted as sensitive under their GDPR interpretation, and it was not allowed to touch a US provider's API.

We already had a working VLM caption step. It called GPT-4o directly through the OpenAI SDK, hardcoded, the way these things always start. That had to change.

The constraint, stated precisely

Two classes of input. Real footage from the client's site, which must stay on their on-prem hardware. And synthetic footage we generate ourselves, which has no residency restriction and is fine to send anywhere.

The naive fix is two code paths. One client for local, one for cloud, an if somewhere deciding which. We had exactly that for about a week and it was already rotting. Different retry logic, different error shapes, two sets of credentials to rotate.

I wanted one endpoint and a routing decision made by config, not by branches scattered through the pipeline.

What we put in place

Bifrost (https://github.com/maximhq/bifrost) is an open-source AI gateway written in Go that speaks an OpenAI-compatible API across 23+ providers, including Ollama as a first-class provider. That last part mattered. We self-host the gateway on the client's box next to Ollama, so for real footage nothing leaves the rack.

The setup is a Docker container and a config file. No SaaS account, no phoning home.

# bifrost config (illustrative)
providers:
  ollama:
    base_url: "http://localhost:11434"   # on-prem, Qwen2.5-VL 7B
  anthropic:
    keys:
      - env: "ANTHROPIC_API_KEY"

# real footage path: on-prem only, no fallback off-box
# synthetic path: local first, cloud allowed as fallback
fallbacks:
  synthetic-captions:
    - provider: ollama
      model: "qwen2.5-vl:7b"
    - provider: anthropic
      model: "claude-sonnet-4-6"

The pipeline tags each request with a virtual key. Real-footage jobs use a key whose route has no cloud entry, so a local failure surfaces as an error instead of silently shipping data to Stuttgart's least favourite jurisdiction. Synthetic jobs use a key that allows the automatic fallback to Anthropic when the local box is saturated.

Our application code makes one call to one OpenAI-compatible URL. The residency policy lives in config and is auditable in one place. That was the whole point.

Numbers from the first month

The on-prem box is a single RTX 4090. Qwen2.5-VL 7B gives us roughly 18 captions per second at our batch size, plenty for the client's annotation throughput. About 92% of total volume is real footage, so it never leaves the rack. The remaining synthetic load is where fallback actually fires, maybe a dozen times a day during big generation runs.

Native Prometheus metrics meant I could prove the residency split to the client's security team with a dashboard rather than a promise. They liked that more than any architecture diagram.

How it compares

I evaluated LiteLLM and Portkey before committing. Honest read below.

Concern	Bifrost	LiteLLM	Portkey
Ollama as provider	Yes	Yes	Yes
Self-host as the default story	Yes, Go binary / Docker	Yes, Python proxy	Possible, but SaaS is the centre of gravity
Per-key routing with no cloud escape	Virtual keys + per-route fallbacks	Config-based, workable	Config-based, strong
Built-in Prometheus	Yes	Yes	Mostly via their dashboard
Provider breadth	23+	Largest list I've seen	Wide

LiteLLM honestly has the broadest provider coverage and a bigger community, and if you live in Python it slots in with less friction. Portkey's hosted observability and guardrails are more polished out of the box than anything I self-hosted. For us the deciding factor was narrow: a single Go binary I could drop on the client's hardware with no external dependency, plus per-key routing rules I could read at a glance during an audit. Different constraint, different winner.

Trade-offs and limitations

You are now running infrastructure. The gateway is one more process the client's ops team has to keep alive, and when Ollama OOMs at 2am, that's your pager, not a provider's.

Self-hosting also means you own the upgrade treadmill for both Bifrost and Ollama. We pin versions and test before bumping.

Qwen2.5-VL 7B is not GPT-4o. For our caption task the gap is small enough, but on dense or ambiguous scenes the local model misses detail a frontier model would catch. We accept that as the cost of residency.

And fallback can lie to you if you misconfigure it. A real-footage key with an accidental cloud entry would be a compliance incident, not a warning. We test that path in CI now, asserting the route has exactly one provider.

Like a good espresso, the value is in the constraint, not the volume.