DEV Community

Treat upstream catalogs as mutable: how a free-tier model SKU retirement broke my AI agent

yongrean on June 11, 2026

Tuesday afternoon, every autonomous cycle in my agent started returning the same error: [AGENT] Cycle failed: 404 No endpoints found for model: go...

FastAnchor_io • Jun 11

Great write-up. The model-unavailable vs provider-unavailable distinction is spot-on — most implementations conflate them. One thing I'd add: model IDs don't just disappear, they also get silently renamed or migrated. Having a model alias/mapping layer between your agent and the upstream can catch both retirement AND rename events before they hit your fallback chain.

yongrean • Jun 12

Good point — and the rename case is sneakier than retirement, because from the caller's side both surface as the same 404. A fallback chain handles "something is gone" but can't tell you what happened.

The alias layer is the right move. One thing I'm considering on top of it: OpenRouter exposes a /api/v1/models catalog endpoint, so instead of discovering retirement reactively on the first failed call, a periodic catalog diff could flag "model X disappeared / model Y appeared with a suspiciously similar name" before any agent cycle hits it. Proactive instead of reactive.
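A minimal sketch of that catalog diff, assuming two snapshots of model IDs have already been pulled from the catalog endpoint. Fetching and parsing are left out because the response shape is not described here. The names `diffCatalog` and `normalizeId`, and the prefix-match heuristic for "suspiciously similar", are hypothetical:

```typescript
// Result of comparing two catalog snapshots (plain lists of model IDs).
interface CatalogDiff {
  disappeared: string[];
  appeared: string[];
  possibleRenames: Array<{ from: string; to: string }>;
}

// Hypothetical normalisation: lowercase, drop any ":variant" suffix and punctuation.
function normalizeId(id: string): string {
  return id.toLowerCase().replace(/:.*$/, "").replace(/[^a-z0-9]/g, "");
}

function diffCatalog(previous: string[], current: string[]): CatalogDiff {
  const prev = new Set(previous);
  const curr = new Set(current);
  const disappeared = previous.filter((id) => !curr.has(id));
  const appeared = current.filter((id) => !prev.has(id));
  const possibleRenames: Array<{ from: string; to: string }> = [];
  for (const from of disappeared) {
    for (const to of appeared) {
      const a = normalizeId(from);
      const b = normalizeId(to);
      // Crude similarity heuristic (an assumption): one normalised ID prefixes the other.
      if (a.length > 0 && b.length > 0 && (a.startsWith(b) || b.startsWith(a))) {
        possibleRenames.push({ from, to });
      }
    }
  }
  return { disappeared, appeared, possibleRenames };
}
```

A real check would probably want a stricter similarity measure; the prefix match only shows where the rename signal would come from.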

For now Klorn's chain treats every entry as a canonical internal name and resolves per-provider (the Gemini-direct path already strips the google/ prefix and :free suffix), which is a half-step toward a real mapping layer. The full version — internal name → provider-specific current ID, refreshed from catalogs — is on the list now. Thanks for pushing on this.
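For illustration, the "internal name to provider-specific current ID" layer could look roughly like the sketch below. This is not Klorn's code: `MODEL_ALIASES`, `resolveModelId`, and the placeholder model ID are all hypothetical, and only the prefix/suffix stripping mirrors what is described above.

```typescript
type Provider = "openrouter" | "gemini-direct";

// Internal canonical name -> provider-specific current ID.
// The entry below is a placeholder, not a real catalog ID.
const MODEL_ALIASES: Record<string, Partial<Record<Provider, string>>> = {
  judge: { openrouter: "google/example-judge:free" },
};

// Derive a direct-provider ID by stripping the "google/" prefix and ":free" suffix.
function toGeminiDirectId(openRouterId: string): string {
  return openRouterId.replace(/^google\//, "").replace(/:free$/, "");
}

function resolveModelId(internalName: string, provider: Provider): string | undefined {
  const entry = MODEL_ALIASES[internalName];
  if (entry === undefined) return undefined;
  // Prefer an explicit per-provider mapping when one exists.
  const explicit = entry[provider];
  if (explicit !== undefined) return explicit;
  // Otherwise fall back to deriving the direct ID from the OpenRouter one.
  const openRouterId = entry.openrouter;
  if (
    provider === "gemini-direct" &&
    openRouterId !== undefined &&
    openRouterId.startsWith("google/")
  ) {
    return toGeminiDirectId(openRouterId);
  }
  return undefined;
}
```

Because callers only ever see the internal name, a rename upstream becomes a one-line change to the table (or a catalog-driven refresh of it) rather than an edit at every call site.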

FastAnchor_io • Jun 13

The catalog diff idea is solid; it turns a runtime surprise into a deploy-time check. One thing to watch: providers don't always remove entries immediately on retirement, so the diff window might be narrower than expected. Pairing it with a TTL-based staleness check on model metadata gives you both the 'disappeared' and the 'hasn't been updated in N days' signals.
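A rough sketch of the TTL half, assuming each cached catalog entry records when it was last refreshed. The `CachedModelMetadata` shape and `findStale` are hypothetical names for illustration:

```typescript
// Hypothetical cached view of one catalog entry.
interface CachedModelMetadata {
  id: string;
  lastRefreshedAt: Date; // when our cached metadata for this model was last refreshed
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return the IDs whose cached metadata is older than the TTL.
function findStale(entries: CachedModelMetadata[], now: Date, ttlDays: number): string[] {
  return entries
    .filter((e) => now.getTime() - e.lastRefreshedAt.getTime() > ttlDays * MS_PER_DAY)
    .map((e) => e.id);
}
```

Passing `now` in explicitly keeps the check deterministic under test.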

yongrean • Jun 14

Right on the narrow window, and the TTL pairing is the correct instinct. One sharpening: TTL on model metadata catches "our cached view went stale," not "the model changed under a stable id." A provider can leave the SKU listed, metadata untouched, and swap the behavior underneath — fresh timestamp, different model. That failure needs a behavioral fingerprint (a canary eval), not a staleness clock.

And since providers don't delist atomically on retirement, watch the recovered transition too, not just disappeared — a model can drop out and come back, and you want both edges logged. The other thing that bit us: make sure the diff actually covers your highest-consequence model. The one whose silent absence degrades everything downstream is exactly the one that tends to be missing from the watched set.

FastAnchor_io • Jun 14

Sharp point. The behavioral canary approach is exactly right — I would layer three checks: TTL for metadata drift, canary for behavior swap, and a deploy gate that blocks if either fires. Turns model changes from runtime surprises into blocked deployments. Great discussion.

yongrean • Jun 14

Layering all three is right, with one caveat on the deploy gate: it only catches the changes that ride your deploy. A provider swapping behavior under a stable SKU doesn't — that lands on their clock, between your deploys, with nothing of yours to block. So the canary can't be deploy-triggered; it has to run on a schedule, and "block the build" becomes "pin to a known-good model / fail over to a fallback," because there's no build to stop. The gate still earns its place for the changes you do ship — a prompt edit, a model-id bump — it just can't be the only edge.

The part I don't have a clean answer for is cadence, and what "it fired" should resolve to. The canary is itself paid eval calls, so it's a straight trade: run it tight and you pay continuously; run it weekly and the gap between runs is exactly when a silent swap lands and degrades everything downstream before the next check. And since there's no deploy to block, a fire has to become a runtime move — pin or fall over — not a red build. We run ours scheduled for now; the right interval, and the right automatic response, both still feel open.
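To make the "fire becomes a runtime move" idea concrete, here is a small sketch under stated assumptions: the canary scores a judge against a held-out labeled set, and a score below a floor promotes the next fallback. `canaryAccuracy`, `applyCanaryVerdict`, and `RoutingState` are hypothetical, and the judge is a synchronous stand-in for what would really be a paid async call.

```typescript
// One held-out example with a known-correct label.
interface LabeledCase {
  input: string;
  expected: string;
}

// Stand-in for the model under test. A real judge call would be async and paid.
type Judge = (input: string) => string;

// Fraction of held-out cases the judge labels correctly.
function canaryAccuracy(cases: LabeledCase[], judge: Judge): number {
  if (cases.length === 0) throw new Error("Canary needs at least one labeled case");
  const correct = cases.filter((c) => judge(c.input) === c.expected).length;
  return correct / cases.length;
}

interface RoutingState {
  active: string;
  fallbacks: string[];
}

// There is no build to block, so a failed canary becomes a runtime move:
// promote the next fallback. With no fallback left, the state is returned
// unchanged and the caller is expected to alert instead.
function applyCanaryVerdict(state: RoutingState, accuracy: number, floor: number): RoutingState {
  if (accuracy >= floor) return state;
  const [next, ...rest] = state.fallbacks;
  if (next === undefined) return state;
  return { active: next, fallbacks: rest };
}
```

The interval at which this runs and the value of `floor` are exactly the open questions above; the sketch takes both as inputs rather than picking them.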

FastAnchor_io • Jun 14

Open-sourced already — impressive speed. Would love to see the repo. I am working on the other side of this problem at aipossword.cn — API gateway routing across providers — so the catalog diff + behavioral canary combo directly applies to model selection logic. Drop the link when ready.

yongrean • Jun 14

Here you go — github.com/k08200/klorn (AGPLv3). The catalog diff is packages/api/src/openrouter-catalog-check.ts (runs daily, emits disappeared/recovered transitions); the behavioral canary is a weekly GitHub Actions job (.github/workflows/judge-canary.yml) that re-judges against a held-out label set; both feed openrouter-fallback-chain.ts. Your gateway-side angle is the complementary half — you're routing across providers in the first place, we just treat whatever's listed as mutable and route around it. Curious how you handle the verify step at the gateway layer — ping me anytime.

FastAnchor_io • Jun 14

Appreciate the repo pointer — the AGPLv3 choice is solid. Gave the catalog diff a quick look; the disappeared/recovered transition tracking is the right primitive. Most setups only watch for removals.

One edge we hit on the gateway side: a model can get renamed upstream while the old SKU still resolves (OpenRouter does this during migrations). The 404 never fires, so the catalog diff stays green — but the behavior starts diverging because you're hitting a stale endpoint. That's where the behavioral canary earns its money.

Reposted my reply to your other thread with details on our two-tier verify approach. Short version: daily eval suite with auto-fallback routing. Same AGPLv3 stack at aipossword.cn — happy to trade architecture notes anytime.

yongrean • Jun 14 • Edited

Good catch — the silent rename is the nastiest variant precisely because nothing 404s. The structural diff has no edge to fire on, so it stays green while behavior drifts underneath. That's the exact gap the weekly behavioral canary covers: it doesn't trust the catalog being green, it re-runs a held-out set and flags drift transitions even when the SKU still resolves. klorn also watches JUDGE_MODEL/VISION_MODEL specifically, since those drifting silently is what corrupts every downstream tiering decision without ever throwing.

Filed your OpenRouter rename case as issue #523. "Structurally present, behaviorally stale" is a cleaner test fixture than anything I'd have written synthetically. If the catalog-check approach ends up useful on your side, a star helps the next person hitting this find it. Appreciate the report either way.

FastAnchor_io • Jun 14

Watching JUDGE_MODEL/VISION_MODEL specifically is the right call — those are the ones where silent drift cascades the hardest. On our side, we treat routing/decision models and generation models as different health classes: if a gen model drifts, one request gets a weird output; if the routing model drifts, every request goes to the wrong place. Different blast radius, different monitoring frequency.
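One way to express the two health classes in code, as a sketch only. The interval numbers and the role-to-class table are assumptions made up for illustration; nothing in the thread specifies them.

```typescript
type HealthClass = "decision" | "generation";

// Assumed intervals for illustration only; the thread does not give numbers.
const CHECK_INTERVAL_HOURS: Record<HealthClass, number> = {
  decision: 24, // routing/judge models: wide blast radius, tighter loop
  generation: 168, // generation models: one bad output per drifted request
};

// Hypothetical role-to-class assignment.
const ROLE_CLASS: Record<string, HealthClass> = {
  judge: "decision",
  router: "decision",
  chat: "generation",
};

function isCheckDue(role: string, lastCheckedAt: Date, now: Date): boolean {
  // Unknown roles default to the tighter loop, the safer of the two choices.
  const cls: HealthClass = ROLE_CLASS[role] ?? "decision";
  const elapsedHours = (now.getTime() - lastCheckedAt.getTime()) / 3600000;
  return elapsedHours >= CHECK_INTERVAL_HOURS[cls];
}
```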

"Structurally present, behaviorally stale" is a great test fixture name — captures the exact problem without needing to explain the mechanism. Already starred the repo; happy to have contributed a real-world edge case to the test suite.

On the gateway side we've been experimenting with tracking model output fingerprints over time — embedding similarity on a fixed prompt set — to catch the "same SKU, different model" variant before any request lands. Different mechanism than a catalog check, but complementary. Let me know when #523 lands — curious to see how you codify the rename detection formally.
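A sketch of what an output fingerprint check might look like, assuming embeddings for the fixed prompt set have already been computed for a baseline run and a current run (how they are produced is out of scope here). `fingerprintDrift` and the mean-similarity threshold are hypothetical choices:

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // treat a zero vector as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare per-prompt output embeddings from a baseline run against a current run.
// baseline[i] and current[i] must correspond to the same fixed prompt.
function fingerprintDrift(
  baseline: number[][],
  current: number[][],
  minMeanSimilarity: number,
): { meanSimilarity: number; drifted: boolean } {
  if (baseline.length !== current.length || baseline.length === 0) {
    throw new Error("Baseline and current must cover the same non-empty prompt set");
  }
  let total = 0;
  for (let i = 0; i < baseline.length; i++) {
    total += cosineSimilarity(baseline[i] ?? [], current[i] ?? []);
  }
  const meanSimilarity = total / baseline.length;
  return { meanSimilarity, drifted: meanSimilarity < minMeanSimilarity };
}
```

A mean hides per-prompt outliers, so a stricter variant might also flag on the minimum similarity across prompts.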

yongrean • Jun 14

The gen-vs-routing blast-radius split is the right frame — a drifted generation model is one bad output, a drifted judge/routing model misfiles every request, so they shouldn't share a monitoring cadence. That's explicit in the issue now: decision models on the tightest loop, generation looser. And the embedding-similarity-on-a-fixed-prompt-set angle is the same shape I'm using for the label-less models — chat/vision have no ground-truth floor, so it's output-fingerprint drift, not accuracy. Different mechanism, same target.

Side note — star's still showing 4 on my end, looks like it didn't register?

Alex Shev • Jun 12

This is a real production lesson for model-based systems. The upstream catalog is not static infrastructure; it is a moving dependency with pricing, availability, naming, and policy changes.

Agents need capability discovery and graceful degradation, not hardcoded assumptions about model SKUs. Even better, the system should log when behavior changed because the available model set changed, otherwise debugging turns into archaeology.

yongrean • Jun 14

This is the line that matters: log when behavior changed because the available model set changed. Prevention and attribution are separate jobs, and most setups only build the first.

We just wired the second. A daily catalog-diff already watched the fallback chain, but it (a) didn't cover the highest-consequence model — the paid judge whose silent disappearance demotes everything to a keyword path that structurally can't escalate — and (b) re-fired the same "X is missing" alert every run instead of on transitions. Fixed both: the judge/vision SKUs are in the watched set now, and the check emits a dated retired / recovered event on change. That turns the archaeology into a log line with a timestamp on it.
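The difference between a missing-state alert and a transition event fits in a few lines. This is a sketch, not the repo's implementation: `detectTransitions` and its state map are hypothetical, and treating a never-seen model as "previously present" is an assumption so that a model missing on the first run still produces one event.

```typescript
interface Transition {
  model: string;
  kind: "retired" | "recovered";
  at: string; // ISO timestamp of the run that observed the change
}

// Emits an event only when a watched model's presence flips between runs.
// `wasPresent` is persisted state from the previous run and is updated in place.
function detectTransitions(
  watched: string[],
  wasPresent: Map<string, boolean>,
  catalog: Set<string>,
  now: Date,
): Transition[] {
  const events: Transition[] = [];
  for (const model of watched) {
    const present = catalog.has(model);
    // Assumption: a model we have never observed is treated as previously present.
    const before = wasPresent.get(model) ?? true;
    if (before !== present) {
      events.push({ model, kind: present ? "recovered" : "retired", at: now.toISOString() });
    }
    wasPresent.set(model, present);
  }
  return events;
}
```

Running it on consecutive days with the model still missing yields one `retired` event on the first day and nothing afterwards, which is the behavior that keeps the alert from being tuned out.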

The one thing a presence-diff still can't see is a silent in-place swap (same SKU, changed behavior). That needs a behavioral fingerprint, not a catalog read.

Alex Shev • Jun 14

That distinction between missing-state alerts and transition events is a big upgrade. Repeating "still missing" every day just teaches people to ignore the alert. A dated retired/recovered event gives you a timeline, which is what you need during the postmortem: when did the model set change, which fallback became active, and which behavior changed after that.

yongrean • Jun 14

Filed this as the next step — generalizing the weekly live-probe canary from the judge to the chat/agent/vision models, since those three have no behavioral baseline yet: github.com/k08200/klorn/issues/526

You clearly think about this the right way — if you ever want to weigh in on the probe-set design or the floor thresholds, contributions are very welcome. Repo's here: github.com/k08200/klorn

TxDesk • Jun 14

The one that got me wasn't a retirement, it was the opposite: same SKU, same name, behavior moved underneath it. Your 404 case is the friendly version, it throws, so your isModelUnavailableError branch can catch it and walk the chain. The cooldown logic is exactly right for that.
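For readers who have not seen the original post, the "catch the 404 and walk the chain" pattern might look something like this. `isModelUnavailableError` is named in the thread, but its body is not shown, so the version below is a guess based on the quoted error text; `callWithFallback` and the cooldown map are hypothetical. The call is synchronous to keep the sketch short, whereas a real provider call would be async.

```typescript
// Guessed implementation: a 404 whose message matches the error quoted in the post.
function isModelUnavailableError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { status?: unknown; message?: unknown };
  return (
    e.status === 404 &&
    typeof e.message === "string" &&
    e.message.includes("No endpoints found")
  );
}

// Try each model in order, skipping any still in cooldown.
// `cooldownUntil` maps model ID -> epoch ms before which it should not be retried.
function callWithFallback<T>(
  chain: string[],
  call: (model: string) => T,
  cooldownUntil: Map<string, number>,
  nowMs: number,
  cooldownMs: number,
): { model: string; result: T } {
  for (const model of chain) {
    if ((cooldownUntil.get(model) ?? 0) > nowMs) continue; // still cooling down
    try {
      return { model, result: call(model) };
    } catch (err) {
      // Only "model gone" walks the chain; any other failure propagates.
      if (!isModelUnavailableError(err)) throw err;
      cooldownUntil.set(model, nowMs + cooldownMs);
    }
  }
  throw new Error("All models in the fallback chain are unavailable");
}
```

None of this helps with the silent swap described next, since that case never throws.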

The silent swap has no error to catch. The id's still in the catalog, the metadata's fresh, the request succeeds, and the only signal that anything changed is that your outputs quietly got worse. A catalog re-read tells you nothing because the catalog didn't change, only what's behind it did.

So I ended up splitting the recovery: the cheap presence check inline for "gone" (your pattern), and out-of-band behavioral canaries on a schedule for "swapped," since you can't afford to eval in the request path. The thing I lean on most is emitting a drift event the moment the available model set changes, so a swap is a dated log line instead of something I reconstruct from "outputs felt off last Tuesday." Prevention's structurally impossible in the hot path; attribution is the achievable goal.

The unsolved one: a swap behaviorally adjacent enough that the canary passes but the edges moved. No cheap tell for that at all.

FastAnchor_io • Jun 14

Blast radius as the ordering principle is the cleanest framework I've heard for this. Cost-of-being-wrong trumps traffic volume. Filing that one away: it's exactly the right metric when the question is "which model do I watch first."

On detection-only: agreed, and this is the same tension we hit running a multi-tenant API gateway. Auto-routing on a behavioral verdict that false-positives means you just broke N customers' working pipelines instead of one. The flag-first-then-automate-after-baselines approach is the right call. We settled on a similar pattern — drift signal triggers an alert first, and only graduates to automated fallback after X consecutive clean cycles against known-good baselines. Same reasoning, different scale.
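A sketch of the alert-first gate, with the interpretation stated up front as an assumption: a "clean cycle" here means a run where the drift detector agreed with the known-good baseline (no false positive), and any non-clean cycle resets the streak and demotes the detector back to alert-only. `AutomationGate`, `recordCycle`, and `responseToDrift` are hypothetical names.

```typescript
interface AutomationGate {
  consecutiveClean: number; // current streak of cycles with no false positive
}

// Update the streak after each evaluation cycle against known-good baselines.
function recordCycle(gate: AutomationGate, wasClean: boolean): AutomationGate {
  return { consecutiveClean: wasClean ? gate.consecutiveClean + 1 : 0 };
}

// Until the detector has earned trust, a drift signal only alerts.
// After `requiredClean` consecutive clean cycles it may also trigger fallback.
function responseToDrift(
  gate: AutomationGate,
  requiredClean: number,
): "alert" | "alert-and-fallback" {
  return gate.consecutiveClean >= requiredClean ? "alert-and-fallback" : "alert";
}
```

Whether a single noisy cycle should revoke automation entirely, as it does here, is a policy choice the thread leaves open.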

Your 4-role simplification is the right call. The config surface explosion you're dodging is real — on our gateway side we end up supporting per-model health overrides because different teams have wildly different sensitivity thresholds. Same model, one team's "degraded" is another team's "fine." Not a problem with fixed agent roles, and you're right not to build it.

Show HN timing sounds perfect — you've got enough depth in the thread now that the discussion section should be genuinely interesting. Will definitely jump in when it lands. One question: how are you planning to gather those real baselines for the behavioral canary? Held-out historical decisions, or running parallel to production for a cooldown period?