yongrean

Posted on Jun 11

Treat upstream catalogs as mutable: how a free-tier model SKU retirement broke my AI agent

#ai #llm #webdev #infrastructure

Tuesday afternoon, every autonomous cycle in my agent started returning the same error:

[AGENT] Cycle failed: 404 No endpoints found for model: google/gemma-2-9b-it:free

The model hadn't changed in my config. The provider hadn't gone down. The endpoint just... wasn't there anymore. OpenRouter had retired the :free SKU mid-week — no notification, no deprecation window, just gone. Every background classification, every briefing generation, every proactive scan started failing in the same way.

I had a fallback. That was the embarrassing part.

The fallback that didn't fall back

My createCompletion() wrapper had been catching the documented provider failure modes for months:

402 insufficient_credits → walk to next provider
403 daily_quota_exceeded → walk to next provider
429 rate_limited → backoff + retry

What it didn't catch: "the model you asked for doesn't exist anymore." A 404 No endpoints found propagated as a generic error and killed the cycle. The fallback chain never even got consulted because nothing in the existing branches matched.

The mental model was wrong. I'd been treating the model catalog as fixed configuration — something you set once and forget. In reality it's upstream state that can mutate at any moment, just like any other dependency. The retirement was a feature of the provider's catalog management, not a bug.

The fix: walk the free-model chain on retirement signals

The actual patch was short. Two PRs:


```ts
// Before: only walked on credit/quota/rate failures
if (isCreditError(err) || isKeyLimitError(err)) {
  return walkFallbackChain(...);
}

// After: also walk when the model itself is gone
if (isModelUnavailableError(err)) {
  markModelUnavailable(model);
  return walkFallbackChain(...);
}
```

isModelUnavailableError matches on:

HTTP 404 with No endpoints found in body
HTTP 400 with model_not_found code
Anything else the provider emits when the SKU is gone

markModelUnavailable puts the model on a 24h cooldown so the next cycle doesn't try it again immediately. When the catalog refreshes (providers add new SKUs all the time too), the cooldown expires and we retry.
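
A minimal sketch of how those two helpers could look, assuming a simple error shape with a status, an optional code, and the raw body. The `ProviderError` interface, the `cooldownUntil` map, and `isOnCooldown` are hypothetical names for illustration, not Klorn's actual code, and the "anything else the provider emits" case is deliberately left out because it depends on the provider.

```typescript
// Hypothetical error shape; the real wrapper's error type is not shown in the post.
interface ProviderError {
  status: number;
  code?: string;
  body?: string;
}

const COOLDOWN_MS = 24 * 60 * 60 * 1000; // the 24h window described above

// model id -> timestamp (ms since epoch) at which the cooldown ends
const cooldownUntil = new Map<string, number>();

// True when the error says "this model is gone", as opposed to credit/quota/rate failures.
function isModelUnavailableError(err: ProviderError): boolean {
  if (err.status === 404 && (err.body ?? "").includes("No endpoints found")) return true;
  if (err.status === 400 && err.code === "model_not_found") return true;
  return false;
}

// Put the model on cooldown so the next cycle skips it.
function markModelUnavailable(model: string, now: number = Date.now()): void {
  cooldownUntil.set(model, now + COOLDOWN_MS);
}

// A model is skipped only while its cooldown window is still open.
function isOnCooldown(model: string, now: number = Date.now()): boolean {
  const until = cooldownUntil.get(model);
  if (until === undefined) return false;
  if (now >= until) {
    cooldownUntil.delete(model); // window expired: eligible for a retry
    return false;
  }
  return true;
}
```

Passing `now` as a parameter keeps the cooldown logic testable without touching the system clock.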

The fallback chain itself is per-provider:


```ts
const OPENROUTER_FALLBACK_CHAIN = [
  'meta-llama/llama-3.3-70b-instruct:free',
  'google/gemma-2-9b-it:free',
  'mistralai/mistral-7b-instruct:free',
  'qwen/qwen-2.5-7b-instruct:free',
];
```

When one entry 404s, we walk to the next. When all of them fail, we fail over to the secondary provider (Gemini direct), which has its own chain. Only when every chain across every provider has been exhausted does the agent give up and surface AllProvidersExhaustedError to the user.
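
The walk across chains and providers can be sketched roughly as below. This is an illustration under assumptions, not the actual Klorn implementation: `ProviderChain`, `completeWithFailover`, and the injected `isOnCooldown` / `onModelFailure` callbacks are hypothetical names, and the sketch walks on any thrown error, whereas the real wrapper treats 429s with backoff and retry.

```typescript
// Hypothetical shape: each provider brings its own ordered chain and its own call.
interface ProviderChain {
  name: string;
  models: string[];
  complete: (model: string) => Promise<string>;
}

class AllProvidersExhaustedError extends Error {
  attempts: string[];
  constructor(attempts: string[]) {
    super(`All providers exhausted after trying: ${attempts.join(", ")}`);
    this.name = "AllProvidersExhaustedError";
    this.attempts = attempts;
  }
}

// Try every model of every provider in order; give up only when all chains are spent.
async function completeWithFailover(
  providers: ProviderChain[],
  isOnCooldown: (model: string) => boolean,
  onModelFailure: (model: string, err: unknown) => void,
): Promise<string> {
  const attempts: string[] = [];
  for (const provider of providers) {
    for (const model of provider.models) {
      if (isOnCooldown(model)) continue; // skip models that recently failed
      attempts.push(`${provider.name}/${model}`);
      try {
        return await provider.complete(model);
      } catch (err) {
        // Simplification: any error moves to the next entry. The caller decides
        // here whether the failure warrants a cooldown.
        onModelFailure(model, err);
      }
    }
  }
  throw new AllProvidersExhaustedError(attempts);
}
```

Keeping the cooldown check and the failure hook as injected callbacks means the walk itself stays ignorant of why a model failed, which is where the model-unavailable versus provider-unavailable distinction gets decided.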

What I should have done from day 1

Three rules I'm internalizing:

1. The upstream catalog is mutable. Hardcoding a single model ID is the same antipattern as hardcoding a single CDN URL. Always have a list. Always make the list cheap to rotate.

2. Distinguish "this model is unavailable" from "the provider is unavailable." They're different failures with different recovery paths. Treating them the same way means you either over-rotate (give up the provider when only one model is gone) or under-rotate (give up entirely when the provider is fine).

3. Cooldowns, not blacklists. When a model disappears, don't kill it forever. Put it on a window. Providers add models back, or you might be hitting a transient 404. A 24h cooldown is much friendlier than a permanent deny-list that requires a code change to undo.

Why this matters beyond one provider

If you're running an agent in production, your model isn't your only upstream dependency:

The vendor's catalog can change
Pricing can change (:free → :paid is a real failure mode)
Rate-limit policies can change
Authentication schemes can change (Google's AQ.-prefix keys being rejected by their own OpenAI-compat endpoint is a fun one; I had to write a native adapter for it)

The pattern is the same: treat every assumption about the upstream as a potential dynamic value, and make the recovery path the default, not the exception.

Agents that survive in prod have failover chains, cooldown windows, and degraded modes built in from the start. Not because the upstream is unreliable — because the upstream is alive, and alive things change.

I've been writing about Klorn, an open-source attention firewall for Gmail, where this kind of failure mode hits constantly because the agent runs continuously. Repo: github.com/k08200/klorn · Doctrine: deterministic-floor.md.

If you've shipped agents to prod, what other upstream-mutation failure modes have caught you off-guard?

Top comments (32)

FastAnchor_io • Jun 11

Great write-up. The model-unavailable vs provider-unavailable distinction is spot-on — most implementations conflate them. One thing I'd add: model IDs don't just disappear, they also get silently renamed or migrated. Having a model alias/mapping layer between your agent and the upstream can catch both retirement AND rename events before they hit your fallback chain.

yongrean • Jun 12

Good point — and the rename case is sneakier than retirement, because from the caller's side both surface as the same 404. A fallback chain handles "something is gone" but can't tell you what happened.

The alias layer is the right move. One thing I'm considering on top of it: OpenRouter exposes a /api/v1/models catalog endpoint, so instead of discovering retirement reactively on the first failed call, a periodic catalog diff could flag "model X disappeared / model Y appeared with a suspiciously similar name" before any agent cycle hits it. Proactive instead of reactive.
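
A rough sketch of that catalog diff, assuming the model ids have already been pulled from the catalog endpoint into plain string arrays (the fetch and the response parsing are left out here). `diffCatalog`, `baseName`, and the "similar name" rule are hypothetical; the rule is a deliberately crude heuristic that only compares ids after stripping the vendor prefix and the `:free`-style suffix.

```typescript
interface CatalogDiff {
  disappeared: string[]; // watched ids that were listed before and are not listed now
  appeared: string[];    // ids listed now that were not listed before
  possibleRenames: Array<{ from: string; to: string }>;
}

// "google/gemma-2-9b-it:free" -> "gemma-2-9b-it"
function baseName(id: string): string {
  const slash = id.indexOf("/");
  const afterVendor = slash === -1 ? id : id.slice(slash + 1);
  const colon = afterVendor.indexOf(":");
  return colon === -1 ? afterVendor : afterVendor.slice(0, colon);
}

// Compare two catalog snapshots and flag disappeared ids that have a look-alike newcomer.
function diffCatalog(previous: string[], current: string[], watched: string[]): CatalogDiff {
  const prev = new Set(previous);
  const curr = new Set(current);
  const disappeared = watched.filter((id) => prev.has(id) && !curr.has(id));
  const appeared = current.filter((id) => !prev.has(id));

  const possibleRenames: Array<{ from: string; to: string }> = [];
  for (const from of disappeared) {
    for (const to of appeared) {
      const a = baseName(from);
      const b = baseName(to);
      // Crude similarity: same base name, or one is a prefix of the other.
      if (a === b || a.startsWith(b) || b.startsWith(a)) {
        possibleRenames.push({ from, to });
      }
    }
  }
  return { disappeared, appeared, possibleRenames };
}
```

A flagged pair is only a hint for a human or an alias layer to confirm; the diff cannot prove that the new id serves the same model.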

For now Klorn's chain treats every entry as a canonical internal name and resolves per-provider (the Gemini-direct path already strips the google/ prefix and :free suffix), which is a half-step toward a real mapping layer. The full version — internal name → provider-specific current ID, refreshed from catalogs — is on the list now. Thanks for pushing on this.

FastAnchor_io • Jun 13

the catalog diff idea is solid — it turns a runtime surprise into a deploy-time check. one thing to watch: providers don't always remove entries immediately on retirement, so the diff window might be narrower than expected. pairing it with a TTL-based staleness check on model metadata gives you both the 'disappeared' and the 'hasn't been updated in N days' signals.

yongrean • Jun 14

Right on the narrow window, and the TTL pairing is the correct instinct. One sharpening: TTL on model metadata catches "our cached view went stale," not "the model changed under a stable id." A provider can leave the SKU listed, metadata untouched, and swap the behavior underneath — fresh timestamp, different model. That failure needs a behavioral fingerprint (a canary eval), not a staleness clock.

And since providers don't delist atomically on retirement, watch the recovered transition too, not just disappeared — a model can drop out and come back, and you want both edges logged. The other thing that bit us: make sure the diff actually covers your highest-consequence model. The one whose silent absence degrades everything downstream is exactly the one that tends to be missing from the watched set.

FastAnchor_io • Jun 14

Sharp point. The behavioral canary approach is exactly right — I would layer three checks: TTL for metadata drift, canary for behavior swap, and a deploy gate that blocks if either fires. Turns model changes from runtime surprises into blocked deployments. Great discussion.

yongrean • Jun 14

Layering all three is right, with one caveat on the deploy gate: it only catches the changes that ride your deploy. A provider swapping behavior under a stable SKU doesn't — that lands on their clock, between your deploys, with nothing of yours to block. So the canary can't be deploy-triggered; it has to run on a schedule, and "block the build" becomes "pin to a known-good model / fail over to a fallback," because there's no build to stop. The gate still earns its place for the changes you do ship — a prompt edit, a model-id bump — it just can't be the only edge.

The part I don't have a clean answer for is cadence, and what "it fired" should resolve to. The canary is itself paid eval calls, so it's a straight trade: run it tight and you pay continuously; run it weekly and the gap between runs is exactly when a silent swap lands and degrades everything downstream before the next check. And since there's no deploy to block, a fire has to become a runtime move — pin or fall over — not a red build. We run ours scheduled for now; the right interval, and the right automatic response, both still feel open.

FastAnchor_io • Jun 14

Exactly. Your canary-on-a-schedule vs deploy-gate distinction is crucial. The practical architecture: scheduled canary as primary sniffer, deploy gate as guardrail that refuses to START if canary flagged anything, TTL as final "we haven't heard from the canary" circuit-breaker. Three independent time domains, three failure modes. Rare to find this level of rigor on Dev.to.

yongrean • Jun 14

Good thread — you sharpened the TTL role more than I had going in. We open-sourced our whole take on this if you ever want to see it run — the catalog diff and the weekly canary are both in there. Happy to trade notes.

FastAnchor_io • Jun 14

That escalated fast — open source is the right call. Drop the repo link here (or DM me), I'd love to see how you wired the catalog diff into the canary scheduler. Meanwhile I work on the API gateway side of this problem — aipossword.cn — so the intersection of model discovery and routing reliability is exactly my thing. Great thread.

yongrean • Jun 14

Here you go — github.com/k08200/klorn (AGPLv3). The catalog diff is packages/api/src/openrouter-catalog-check.ts (runs daily, emits disappeared/recovered transitions); the behavioral canary is a weekly GitHub Actions job (.github/workflows/judge-canary.yml) that re-judges against a held-out label set; both feed openrouter-fallback-chain.ts. Your gateway-side angle is the complementary half — you're routing across providers in the first place, we just treat whatever's listed as mutable and route around it. Curious how you handle the verify step at the gateway layer — ping me anytime.

FastAnchor_io • Jun 14

This is clean — the catalog diff + behavioral canary split is exactly how we think about it on the gateway side too. Daily structural scan, weekly behavioral check.

On the gateway verify step: we do a two-tier approach. Health layer polls /v1/models periodically and diffs against last-known-good — catches provider-level availability. For behavioral drift, a lightweight eval suite runs against a small held-out prompt set. If output quality drops below threshold or latency spikes 3x+, the model gets auto-routed to a fallback — no human in the loop, no deploy needed.

The cost question you raised earlier is real. Running evals hourly burns credits fast; weekly leaves a dangerous gap. We settled on daily with a generous threshold — better a few false positives than one silent degradation. On a gateway with thin margins, every eval call has to earn its keep.

The complementary split is clean — your agent-side handles "what changed in the catalog," our gateway-side handles "where to route now that it changed." Two halves of the same problem.

Would love to see how you're thinking about making the catalog diff a standalone tool. Feels like shared infra across gateways. I'm at aipossword.cn (also AGPLv3, github.com/QuantumNous/new-api) — same mission, different entry point.

yongrean • Jun 14

The daily-structural / weekly-behavioral split converging on both sides is reassuring — independent convergence usually means it's the right cut. "A few false positives beat one silent degradation" is exactly where I landed; klorn pairs the canary with a pre-flight catalog lease check before dispatch and a warning on published sunset dates before a model gets delisted, so the failure is loud before a request goes out, not after quality already dropped.

On spinning the catalog diff out as standalone infra: tempting, but I'm keeping it coupled to klorn's core. The diff isn't the product — it's input to the router. klorn's a top-level router that tiers every action through a firewall, and catalog drift is one signal feeding that decision. Pulled out it's just another /v1/models poller; in-context it tells the agent how to react, not merely that something changed. Different problem than a shared gateway primitive.

It's AGPLv3 and going up as a Show HN shortly — repo's github.com/k08200/klorn if you want to skim the canary + lease-check code before then. Would genuinely value a gateway-side take in the thread when it lands; you're hitting the same drift from the routing end. Star it if it resonates — helps it clear the front-page noise floor.

FastAnchor_io • Jun 14

The pre-flight catalog lease check is a smart layer I hadn't considered — checking before dispatch rather than just on schedule. On the gateway side we landed on something similar but inverted: if a model's health status is "suspect" (flagged by the last daily eval), we warm the fallback before the request fires, so if the primary does fail there's no cold-start penalty. Same "loud before dispatch" principle, different system boundary.

The coupled approach makes total sense for klorn. A catalog diff pulled out as standalone infra answers "what changed" — but paired with a router it answers "what should I do about it." That second question is the one that actually keeps agents running. Different problem statement, different architecture.

Looking forward to the Show HN — already starred. I'll drop a technical comment from the gateway perspective when it lands. The HN crowd will appreciate the rigor on failure taxonomy.

One question on the pre-flight lease check ordering: if the lease says "valid" but the behavioral canary fired two hours ago flagging a silent swap, does the lease win or does the behavioral flag take precedence? On our side we treat behavioral alerts as higher-priority than structural health — curious if you landed on the same ordering.

yongrean • Jun 14

Straight answer: they don't interact today — and that's a gap, not a design choice I'd defend. The pre-flight lease check reads only catalog presence (is the id still listed); the behavioral canary is a weekly CI job whose only output is an admin email. No shared store, so a "valid" lease can't be overridden by a behavioral flag — the behavioral verdict isn't a runtime input at all. Structural wins by default, which by your reasoning (and mine) is the wrong default. Your ordering — behavioral > structural as live signals feeding the router — is where it should land; the missing piece is the write-back (canary verdict → a runtime flag the pre-flight consults), and that only earns its keep once the non-judge behavioral signals actually exist. Filed it as out-of-scope/next on the tracking issue.

FastAnchor_io • Jun 14

That honesty about the gap is exactly why this back-and-forth is useful — most people would hand-wave the interaction or retroactively claim it was designed that way. "Structural wins by default" is the right diagnosis: the architecture doesn't encode a priority, so the runtime just takes the first signal that arrives.

The write-back is the right abstraction — and it doesn't need to be heavy. On our side we use a tiny Redis hash per model: model:health{gpt-4o} where the behavioral canary writes status=degraded and the pre-flight reads it before dispatch. Not a shared store in the heavyweight sense — just one extra key lookup that unblocks the priority inversion. The behavioral verdict becomes a first-class input without changing the pre-flight's interface at all.

On the unfinished thought — I'm guessing "once the non-deterministic failures start" or "once the blast radius expands"? Either way, the activation threshold is lower than most people think. One silent swap on a routing model, and the write-back flips from "nice optimization" to "why didn't we already have this."

Curious how you're thinking about the activation criteria — is it purely model-type gated (JUDGE/VISION get it first, gen models later) or do you have a volume-based trigger in mind too? Given how cheap a Redis read is, I'd argue the bar is essentially zero.

yongrean • Jun 14

Yeah, the health key write-back is the part I was missing. Right now my canary finds the drift and then just emails me, and nothing on the dispatch side ever reads that. So the model keeps getting hit even after the check already knows it's bad. Same shape as a bug I just fixed on the email side actually, the classifier was right and nothing was listening. I opened an issue to wire the verdict into the pre-flight the way you described. In my case it's not even Redis, I already keep the catalog snapshot in memory so it's one more map.
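
A minimal sketch of that "one more map" write-back, assuming an in-memory store keyed by model id. All names here (`ModelHealth`, `recordCanaryVerdict`, `preflight`, the result strings) are hypothetical. The ordering follows the thread: a behavioral verdict overrides a structurally valid lease, and the result is a flag for the caller rather than an automatic reroute.

```typescript
type BehavioralStatus = "ok" | "degraded";

interface ModelHealth {
  listed: boolean;               // structural: id still present in the catalog snapshot
  behavioral?: BehavioralStatus; // written back by the canary; undefined = no verdict yet
}

// Single in-memory source of truth that both the catalog check and the canary write to.
const health = new Map<string, ModelHealth>();

function recordCatalogPresence(model: string, listed: boolean): void {
  const prev = health.get(model);
  health.set(model, { ...prev, listed }); // keep any existing behavioral verdict
}

function recordCanaryVerdict(model: string, status: BehavioralStatus): void {
  const prev = health.get(model);
  health.set(model, { listed: prev?.listed ?? false, behavioral: status });
}

type PreflightResult = "dispatch" | "skip_unlisted" | "flag_degraded";

// Consulted before every dispatch.
function preflight(model: string): PreflightResult {
  const h = health.get(model);
  if (h === undefined || !h.listed) return "skip_unlisted";
  // Behavioral beats structural: a listed model with a bad verdict is still flagged.
  if (h.behavioral === "degraded") return "flag_degraded";
  return "dispatch";
}
```

Returning `"flag_degraded"` instead of rerouting keeps the sketch detection-only; what the caller does with the flag is a separate decision.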

On the activation question, for me it's not model-type or volume. It's how much it costs when the model is wrong. The judge gets it first because a bad judge misfiles every email and I miss the one that mattered. A drifted generation model is one bad output. So I order it by blast radius, not traffic.

One place I'm holding back though. I'm keeping it detection-only for now, not auto-routing on the behavioral verdict yet. If the drift signal false-positives it'll swap out a perfectly healthy model, which is worse than the thing it's protecting against. So I want the flag first, and auto-route only once I trust the threshold against real baselines.

The per-deployment health-class override makes sense for you but I think that's a multi-tenant gateway problem. I've got four fixed model roles, so I can just tier the cadence by role and skip the config surface.

(Also, going to put this whole thing up as a Show HN soon. Would be good to have your take in the thread when it lands.)

FastAnchor_io • Jun 14

The in-memory map write-back is the right call: one more map is the cheapest possible integration, and it means the dispatch path reads a single source of truth without an extra hop. Clean.

Detection-only with a threshold warm-up period is the right conservatism. A false-positive auto-route is worse than the drift it detects. My rule: run detection-only until you see a real drift event and confirm the flag correlates with degradation. Until then, the canary hasn't earned auto-routing trust.

The four-role vs multi-tenant distinction is spot-on. On the gateway side (aipossword.cn) we deal with arbitrary model routing, so the config surface is unavoidable. But your setup is cleaner: cadence-by-role without per-model overrides is the right simplification when you control the topology.

Looking forward to the Show HN. Drop the link, happy to jump in.

FastAnchor_io • Jun 14

The in-memory map write-back is the right call: one more map in the snapshot you already hold is the cheapest possible integration. Dispatch reads a single source of truth, no extra hop. That's the exact shape of a good fix.

Detection-only with a trust threshold before enabling auto-route is the right conservatism. A false-positive swap is worse than the drift it detects. My rule: keep it detection-only until you see a real drift event fire and confirm the flag actually correlates with degradation in production. Until the canary has earned auto-routing trust, it stays a notifier.

The four-role vs multi-tenant distinction is exactly right. On the gateway side we have arbitrary consumers routing to arbitrary models, so per-deployment config is table stakes. Your setup is cleaner: cadence-by-role, skip the config surface, done.

Looking forward to the Show HN. Drop the link when it lands, I'll jump in the thread.

FastAnchor_io • Jun 14

Open-sourced already — impressive speed. Would love to see the repo. I am working on the other side of this problem at aipossword.cn — API gateway routing across providers — so the catalog diff + behavioral canary combo directly applies to model selection logic. Drop the link when ready.

FastAnchor_io • Jun 14

Appreciate the repo pointer — the AGPLv3 choice is solid. Gave the catalog diff a quick look; the disappeared/recovered transition tracking is the right primitive. Most setups only watch for removals.

One edge we hit on the gateway side: a model can get renamed upstream while the old SKU still resolves (OpenRouter does this during migrations). The 404 never fires, so the catalog diff stays green — but the behavior starts diverging because you're hitting a stale endpoint. That's where the behavioral canary earns its money.

Reposted my reply to your other thread with details on our two-tier verify approach. Short version: daily eval suite with auto-fallback routing. Same AGPLv3 stack at aipossword.cn — happy to trade architecture notes anytime.

yongrean • Jun 14 • Edited

Good catch — the silent rename is the nastiest variant precisely because nothing 404s. The structural diff has no edge to fire on, so it stays green while behavior drifts underneath. That's the exact gap the weekly behavioral canary covers: it doesn't trust the catalog being green, it re-runs a held-out set and flags drift transitions even when the SKU still resolves. klorn also watches JUDGE_MODEL/VISION_MODEL specifically, since those drifting silently is what corrupts every downstream tiering decision without ever throwing.

Filing your OpenRouter rename case as an issue — opened #523 — "structurally present, behaviorally stale" is a cleaner test fixture than anything I'd have written synthetically. If the catalog-check approach ends up useful on your side, a star helps the next person hitting this find it. Appreciate the report either way.

FastAnchor_io • Jun 14

Watching JUDGE_MODEL/VISION_MODEL specifically is the right call — those are the ones where silent drift cascades the hardest. On our side, we treat routing/decision models and generation models as different health classes: if a gen model drifts, one request gets a weird output; if the routing model drifts, every request goes to the wrong place. Different blast radius, different monitoring frequency.

"Structurally present, behaviorally stale" is a great test fixture name — captures the exact problem without needing to explain the mechanism. Already starred the repo; happy to have contributed a real-world edge case to the test suite.

On the gateway side we've been experimenting with tracking model output fingerprints over time — embedding similarity on a fixed prompt set — to catch the "same SKU, different model" variant before any request lands. Different mechanism than a catalog check, but complementary. Let me know when #523 lands — curious to see how you codify the rename detection formally.
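
The output-fingerprint idea can be sketched as follows, assuming the embeddings for a fixed prompt set have already been computed elsewhere and are passed in as plain number arrays. `fingerprintDrift` and the default threshold of 0.9 are hypothetical placeholders, not values taken from either system discussed here.

```typescript
// Cosine similarity between two equal-length vectors; 0 when either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface DriftReport {
  meanSimilarity: number;
  drifted: boolean;
}

// baseline[i] and current[i] are embeddings of the model's output for the same fixed prompt i.
function fingerprintDrift(
  baseline: number[][],
  current: number[][],
  threshold: number = 0.9, // placeholder; a real threshold needs calibration against baselines
): DriftReport {
  if (baseline.length === 0 || baseline.length !== current.length) {
    throw new Error("baseline and current must be non-empty and the same size");
  }
  const total = baseline.reduce(
    (sum, vec, i) => sum + cosineSimilarity(vec, current[i] ?? []),
    0,
  );
  const meanSimilarity = total / baseline.length;
  return { meanSimilarity, drifted: meanSimilarity < threshold };
}
```

Because sampling noise alone moves embeddings around, a mean over the whole prompt set is less jumpy than a per-prompt trip wire, though it can also hide drift that only affects a few prompts.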

yongrean • Jun 14

The gen-vs-routing blast-radius split is the right frame — a drifted generation model is one bad output, a drifted judge/routing model misfiles every request, so they shouldn't share a monitoring cadence. That's explicit in the issue now: decision models on the tightest loop, generation looser. And the embedding-similarity-on-a-fixed-prompt-set angle is the same shape I'm using for the label-less models — chat/vision have no ground-truth floor, so it's output-fingerprint drift, not accuracy. Different mechanism, same target.

Side note — star's still showing 4 on my end, looks like it didn't register?

FastAnchor_io • Jun 14

The gen-vs-routing blast-radius split you framed is the same math that drove our health-class design — but we found the boundary isn't clean enough to stay binary. In practice, some "generation" models get promoted to pseudo-routing when a team starts using them as evaluators in a CI pipeline. The model didn't change, but its failure cost did.

What's helped is making the health class overrideable per deployment — a model starts as "generation" by default, but any workspace that routes through it for decisions can flag it as "routing" in their config. The canary frequency follows. Not elegant, but it matches reality better than a hard taxonomy.

Agreed the monitoring split by blast radius is the right long-term direction regardless. One open question: once you generalize the canary beyond JUDGE, do you see value in keeping per-model thresholds separate or collapsing to a single health score? Our instinct was separate thresholds (different models drift differently), but the operational complexity scales fast.

Alex Shev • Jun 12

This is a real production lesson for model-based systems. The upstream catalog is not static infrastructure; it is a moving dependency with pricing, availability, naming, and policy changes.

Agents need capability discovery and graceful degradation, not hardcoded assumptions about model SKUs. Even better, the system should log when behavior changed because the available model set changed, otherwise debugging turns into archaeology.

yongrean • Jun 14

This is the line that matters: log when behavior changed because the available model set changed. Prevention and attribution are separate jobs, and most setups only build the first.

We just wired the second. A daily catalog-diff already watched the fallback chain, but it (a) didn't cover the highest-consequence model — the paid judge whose silent disappearance demotes everything to a keyword path that structurally can't escalate — and (b) re-fired the same "X is missing" alert every run instead of on transitions. Fixed both: the judge/vision SKUs are in the watched set now, and the check emits a dated retired / recovered event on change. That turns the archaeology into a log line with a timestamp on it.

The one thing a presence-diff still can't see is a silent in-place swap (same SKU, changed behavior). That needs a behavioral fingerprint, not a catalog read.
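
To illustrate the difference between a repeating "still missing" alert and a transition event, here is a small sketch that only emits when a watched model changes state between runs. The names (`detectTransitions`, `TransitionEvent`) and the choice to treat a never-seen model as previously present are assumptions made for this example.

```typescript
type Presence = "present" | "missing";

interface TransitionEvent {
  model: string;
  kind: "retired" | "recovered";
  at: string; // ISO timestamp, so the event is a dated log line
}

// Compares the current catalog against the last known state and mutates lastKnown.
// Emits an event only on a change, never on "still missing".
function detectTransitions(
  lastKnown: Map<string, Presence>,
  currentIds: Set<string>,
  watched: string[],
  now: Date,
): TransitionEvent[] {
  const events: TransitionEvent[] = [];
  for (const model of watched) {
    // Assumption: a model seen for the first time is treated as previously present,
    // so a watched model that is already missing produces exactly one "retired" event.
    const before: Presence = lastKnown.get(model) ?? "present";
    const after: Presence = currentIds.has(model) ? "present" : "missing";
    if (before !== after) {
      events.push({
        model,
        kind: after === "missing" ? "retired" : "recovered",
        at: now.toISOString(),
      });
    }
    lastKnown.set(model, after);
  }
  return events;
}
```

The persisted `lastKnown` map is what turns the check from a state alarm into an edge detector: two consecutive runs with the same catalog produce no output at all.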

Alex Shev • Jun 14

That distinction between missing-state alerts and transition events is a big upgrade. Repeating "still missing" every day just teaches people to ignore the alert. A dated retired/recovered event gives you a timeline, which is what you need during the postmortem: when did the model set change, which fallback became active, and which behavior changed after that.

yongrean • Jun 14

Filed this as the next step — generalizing the weekly live-probe canary from the judge to the chat/agent/vision models, since those three have no behavioral baseline yet: github.com/k08200/klorn/issues/526

You clearly think about this the right way — if you ever want to weigh in on the probe-set design or the floor thresholds, contributions are very welcome. Repo's here: github.com/k08200/klorn

TxDesk • Jun 14

The one that got me wasn't a retirement, it was the opposite: same SKU, same name, behavior moved underneath it. Your 404 case is the friendly version, it throws, so your isModelUnavailableError branch can catch it and walk the chain. The cooldown logic is exactly right for that.

The silent swap has no error to catch. The id's still in the catalog, the metadata's fresh, the request succeeds, and the only signal that anything changed is that your outputs quietly got worse. A catalog re-read tells you nothing because the catalog didn't change, only what's behind it did.

So I ended up splitting the recovery: the cheap presence check inline for "gone" (your pattern), and out-of-band behavioral canaries on a schedule for "swapped," since you can't afford to eval in the request path. The thing I lean on most is emitting a drift event the moment the available model set changes, so a swap is a dated log line instead of something I reconstruct from "outputs felt off last Tuesday." Prevention's structurally impossible in the hot path; attribution is the achievable goal.

The unsolved one: a swap behaviorally adjacent enough that the canary passes but the edges moved. No cheap tell for that at all.

FastAnchor_io • Jun 14

Blast radius as the ordering principle is the cleanest framework I've heard for this. Cost of being wrong trumps traffic volume. Filing that one away: it's exactly the right metric when the question is "which model do I watch first."

On detection-only: agreed, and this is the same tension we hit running a multi-tenant API gateway. Auto-routing on a behavioral verdict that false-positives means you just broke N customers' working pipelines instead of one. The flag-first-then-automate-after-baselines approach is the right call. We settled on a similar pattern — drift signal triggers an alert first, and only graduates to automated fallback after X consecutive clean cycles against known-good baselines. Same reasoning, different scale.

Your 4-role simplification is the right call. The config surface explosion you're dodging is real — on our gateway side we end up supporting per-model health overrides because different teams have wildly different sensitivity thresholds. Same model, one team's "degraded" is another team's "fine." Not a problem with fixed agent roles, and you're right not to build it.

Show HN timing sounds perfect — you've got enough depth in the thread now that the discussion section should be genuinely interesting. Will definitely jump in when it lands. One question: how are you planning to gather those real baselines for the behavioral canary? Held-out historical decisions, or running parallel to production for a cooldown period?
