Sergei Parfenov

Posted on Jun 11

You Fixed the Rate Limits. Now Your Agent Fails Quietly.

#llm #ai #devops #machinelearning

Last week I wrote that your agent isn’t failing because it hallucinates — it’s failing because of rate limits. The capacity-engineering toolkit in that post — concurrency caps, backoff with jitter, fallback models, caching — is real and it works. Deploy it and your agent stops dying.

Then a commenter (ANP2) pointed out the thing the post undersold, and it’s been stuck in my head since: every one of those fixes quietly opens a correctness hole while it closes the availability one. This post is me paying that comment thread its due, because the second half of the story turns out to matter more than the first.

TL;DR — A 429 is a loud failure: you see it, you alert on it, you fix it. Retries, fallbacks, and caches keep the agent alive — but they let it act on output it didn’t freshly earn: a stale cache hit, a different model’s answer, a re-run side effect. You’ve traded loud failures for quiet ones. The fix is to treat availability (“can I serve this?”) and correctness (“can I still trust the result?”) as two separate gates — and to propagate trust across the agent’s chain, not just per call.

The trade you didn’t know you made

Here’s the uncomfortable symmetry. The whole point of my last post was that the dominant production failure mode isn’t the model being wrong — it’s the plumbing saying no. The capacity toolkit fixes the plumbing. But look at what each fix actually does:

A retry re-runs a call. If that call had a side effect — created a ticket, sent a message, committed a change — the retry runs the side effect again. The agent didn’t fail; it succeeded twice, which is its own kind of wrong.
A fallback model answers when the primary is rate-limited. But it’s a different model: different training, different calibration, different failure modes. The task continues on an answer the primary never produced.
A cache hit serves a response generated for an earlier input. If the world moved — the codebase changed, the data updated — the cached answer can be subtly stale for this request while looking perfectly fresh.

Each mechanism keeps the agent up. None of them guarantees the agent is right. And the cruel part is the failure economics: the 429 you eliminated was honest — visible, countable, alertable. The failures you bought instead are silent. The agent stays up and is confidently wrong, which is exactly the failure mode the hallucination-hunters were worried about in the first place — just arriving through the plumbing instead of the model.

The reliability you bought is uptime, not correct uptime. (That phrase is ANP2’s, and it’s better than anything in my original post.)

Two gates, not one

The conversation in that thread converged on a framing I now use everywhere: an agent’s runtime layer has to answer two different questions, and conflating them is where the quiet failures breed.

Gate 1 — “Can I serve this?” This is the availability gate. Trip the fallback on 429s, serve the cache on a hit, retry on transient errors. Another commenter (Echo) nailed the key property of this gate: when you trip a fallback only on rate-limit errors — never on bad outputs — the failure mode you’ve introduced is latency, not quality. The fallback just buys time. That’s a fine trade, and it’s why the capacity toolkit is still the right first move.

Gate 2 — “Can I act on this irreversibly?” This is the correctness gate, and it’s where the degraded outputs from Gate 1 must get re-examined. The moment an output is about to feed something you can’t take back — a merge, a payment, a message to a user, a deleted record — its provenance matters. Did it come from the primary, fresh? Or from a fallback, a cache, a retry?

One rule worth stealing here: gate on risk, not on confidence. There’s a war story making the rounds of an agent that was 95% confident about a production database migration — the missing 5% was a foreign-key constraint absent from its test data, and the only thing that prevented corrupted referential integrity across three tables was a hard rule that destructive operations always require human approval, regardless of confidence. Confidence is the model grading itself; irreversibility is a property of the action. Gate on the second.
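
A minimal sketch of that rule, with hypothetical action names and a hypothetical `needs_human_approval` helper (neither comes from a real framework). The confidence score is accepted and deliberately ignored, so no score can talk its way past the gate:

```python
# Hypothetical set of actions that cannot be undone once executed.
DESTRUCTIVE_ACTIONS = {"migrate_schema", "drop_table", "delete_record"}


def needs_human_approval(action: str, confidence: float) -> bool:
    """Gate on the action's irreversibility, never on the model's self-reported confidence."""
    # `confidence` is intentionally unused: it is the model grading itself.
    return action in DESTRUCTIVE_ACTIONS
```

A 95%-confident `migrate_schema` and a 5%-confident one get the same answer, which is the point.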

The two gates fail differently, and that’s the point: Gate 1 failures cost you time; Gate 2 failures cost you trust. A system with only Gate 1 is fast and quietly dangerous. A system with only Gate 2 is safe and constantly down. You need both, and they need to stay separate.

Per-call correctness: the three tags

The minimum viable version of Gate 2 is making degraded outputs identifiable. Three mechanisms, one per capacity fix:

1. Idempotency keys on anything with side effects. Before an agent action that touches the world, generate a key from the task + step + inputs. The receiving system deduplicates on it. Now a retry is safe by construction — the second execution is a no-op instead of a double-fire. This is decades-old distributed-systems practice; agent frameworks have mostly just… not adopted it yet.

import hashlib
import json

def idempotency_key(task_id: str, step: int, payload: dict) -> str:
    # sort_keys keeps the serialization stable, so identical inputs always hash to the same key
    raw = json.dumps({"t": task_id, "s": step, "p": payload}, sort_keys=True)
    # 32 hex characters = the first 128 bits of the SHA-256 digest
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# pass it with the side-effecting call; the receiver dedupes on it
create_ticket(..., idempotency_key=idempotency_key(task.id, step.n, args))
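
The receiving side is the half that makes the retry safe. Here is a rough in-memory sketch of what "dedupes on it" could look like; `TicketService` and its fields are hypothetical, and a real receiver would need to persist keys and handle concurrent requests:

```python
class TicketService:
    """Hypothetical receiver that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._ticket_by_key: dict[str, int] = {}
        self.created = 0  # number of tickets actually created

    def create_ticket(self, title: str, idempotency_key: str) -> int:
        # A repeated key returns the original ticket id instead of creating a second ticket.
        if idempotency_key in self._ticket_by_key:
            return self._ticket_by_key[idempotency_key]
        self.created += 1
        self._ticket_by_key[idempotency_key] = self.created
        return self.created
```

Calling `create_ticket` twice with the same key yields one ticket and the same id both times, so the retry becomes a no-op.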

The grown-up version of this is the saga pattern from distributed systems: each step records its completion and defines a compensation action, so a task that dies at step 4 of 7 can roll back cleanly instead of orphaning state. Idempotency prevents duplicate effects; sagas handle partial completion. Once your agents fail mid-workflow — and they will — you eventually want both.
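
To make the saga idea concrete, here is a small sketch, assuming each step is a hypothetical `(name, action, compensation)` triple. It is an illustration of the pattern, not a production orchestrator: it keeps no durable log and assumes compensations themselves succeed.

```python
from typing import Callable

# (name, action, compensation): all three are supplied by the caller.
SagaStep = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[SagaStep]) -> list[str]:
    """Run steps in order; if one fails, compensate the completed steps in reverse."""
    completed: list[SagaStep] = []
    log: list[str] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Roll back newest-first so later effects are undone before earlier ones.
            for prev_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"undone:{prev_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log
```

The returned log doubles as a record of which steps completed and which were rolled back.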

2. Trust tags on fallback outputs. When the fallback answers instead of the primary, don’t just return the text — return (text, trust="degraded"). Cheap to add, and it’s the hook everything downstream needs. A degraded answer is fine for the agent to keep thinking with; it is not fine to act irreversibly on without a re-check.
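
A sketch of what that wrapper might look like. `RateLimited`, `Tagged`, and `call_with_fallback` are hypothetical names, and the two model callables stand in for whatever client you actually use:

```python
from typing import Callable, NamedTuple


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 error."""


class Tagged(NamedTuple):
    text: str
    trust: str  # "full" | "degraded"


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tagged:
    """Return the answer together with a tag recording which path produced it."""
    try:
        return Tagged(primary(prompt), "full")
    except RateLimited:
        # Fall back only on rate limits; the tag records that the primary did not answer.
        return Tagged(fallback(prompt), "degraded")
```

Because `Tagged` is a tuple, callers can unpack it as `text, trust = call_with_fallback(...)` and pass the tag along with the text.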

3. Validity conditions on cache entries. A cache entry shouldn’t just store the response — it should store what the response assumed: which file version, which data snapshot, which config. On a hit, check the assumptions, not just the key. If the codebase moved since the entry was written, that’s a miss wearing a hit’s clothes. And the assumptions can move without you touching anything: providers silently update models, document stores drift, input distributions shift — degradation with no error to catch. Your “primary, fresh” answer from last Tuesday may already be a fallback in disguise.
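
One way to sketch this, assuming assumptions can be expressed as simple name/value pairs such as a repo revision or a snapshot id. `ValidatingCache` and its method names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CacheEntry:
    response: str
    # What the response assumed when it was written, e.g. {"repo_rev": "abc123"}.
    assumptions: dict[str, str] = field(default_factory=dict)


class ValidatingCache:
    def __init__(self) -> None:
        self._entries: dict[str, CacheEntry] = {}
        self.validity_misses = 0  # key matched, but an assumption no longer held

    def put(self, key: str, response: str, assumptions: dict[str, str]) -> None:
        self._entries[key] = CacheEntry(response, dict(assumptions))

    def get(self, key: str, current: dict[str, str]) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        # Every recorded assumption must still match the current state of the world.
        if any(current.get(name) != value for name, value in entry.assumptions.items()):
            self.validity_misses += 1
            del self._entries[key]  # evict the stale entry
            return None
        return entry.response
```

This only catches drift you can name and observe. A silent provider-side model update would need its own assumption, for example a model version string, if the provider exposes one.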

The part single calls don’t prepare you for: trust must propagate

Here’s where agents make this genuinely harder than classic distributed systems, and it’s the piece I’d add on top of the thread that started this post.

Say step 3 of a 6-step task came from a lower-trust fallback. Steps 4, 5, and 6 each run on the primary, fresh, individually flawless. Are they trustworthy?

No — and this is the trap. They reasoned on top of a degraded input. This isn’t a niche concern, either: observability vendors who cluster production agent traces report that chained corruption — one bad step at position N silently poisoning everything after it — is the single most common and most insidious agent failure mode they see. And the math is brutal: at a 95% per-step success rate, an 8-step task completes cleanly ~66% of the time; at 85% per step, it’s ~27%. The chain is where reliability goes to die, quietly. Each step is locally correct and the trajectory is still poisoned. If the trust tag stays local to the call that produced it, the degraded answer launders itself: two “clean” hops later it looks pristine, and your irreversibility gate at step 6 checks the last call’s tag, sees green, and fires.
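
The completion figures above are plain compounding, assuming every step succeeds independently and with the same probability (a simplification; real steps are often correlated). A quick check:

```python
def clean_completion_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


print(f"{clean_completion_rate(0.95, 8):.1%}")  # 66.3%
print(f"{clean_completion_rate(0.85, 8):.1%}")  # 27.2%
```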

So the tag can’t be per-call metadata. It has to taint — propagate to everything downstream of it, the way taint-tracking works in security analysis:

from dataclasses import dataclass

@dataclass
class StepResult:
    step_id: str        # identifier recorded in downstream taint sets
    output: str
    trust: str          # "full" | "degraded"
    tainted_by: set[str]  # which upstream steps were degraded

def propagate(inputs: list[StepResult], my_trust: str) -> tuple[str, set[str]]:
    # inherit every taint the inputs already carry
    taint = set().union(*(r.tainted_by for r in inputs))
    # add any input that is itself marked degraded
    taint |= {r.step_id for r in inputs if r.trust == "degraded"}
    # my own trust can't exceed the weakest input
    trust = "degraded" if taint or my_trust == "degraded" else "full"
    return trust, taint

Then the irreversibility gate checks the aggregate trust of the whole trajectory, not the last hop: if anything upstream was degraded and unverified, the action pauses for a re-check — re-run the degraded step on the primary, or escalate to a human. In my experience the re-check fires rarely; the point isn’t that fallbacks are usually wrong, it’s that the one time the degraded path feeds a merge or a payment, you want it caught at the gate instead of in the incident review.
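
A sketch of that gate, working from the `tainted_by` set the final step carries. `GateDecision`, `irreversibility_gate`, and the `verified` set (step ids that were re-run on the primary or approved by a human) are hypothetical names, and the sketch assumes a step that was itself degraded has its own id included in the set passed in:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    recheck: frozenset[str]  # degraded steps that still need verification


def irreversibility_gate(tainted_by: set[str], verified: set[str]) -> GateDecision:
    """Allow an irreversible action only if no upstream taint is left unverified."""
    outstanding = frozenset(tainted_by - verified)
    return GateDecision(allowed=not outstanding, recheck=outstanding)
```

When the gate refuses, `recheck` names exactly which steps to re-run or escalate, so the pause is actionable rather than a blanket stop.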

Making it observable (or it didn’t happen)

Same lesson as the capacity post, one level up. You can’t engineer what you can’t see, and correctness debt is even quieter than 429s. The minimum dashboard:

% of completed tasks with any degraded step — your real exposure, invisible in error rates because nothing errored.
% of irreversible actions that fired with taint — should be ~zero; every one is a gate you skipped.
Cache validity-miss rate — hits that failed the assumption check. If this is zero, you’re probably not checking assumptions.
Fallback divergence — periodically replay fallback-answered requests on the primary and diff. This is your measured answer to “how different is the fallback, actually?” instead of a vibe.

None of these show up in uptime. All of them are the difference between uptime and correct uptime.
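
As a starting point, the first two numbers can be computed from per-task records. This is a sketch with a hypothetical `TaskRecord` shape; the field names are assumptions about what your tracing layer captures:

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    degraded_steps: int        # steps served by a fallback, stale cache, or retry
    irreversible_actions: int  # actions that cannot be undone
    tainted_irreversible: int  # of those, how many fired with unverified taint


def exposure_metrics(tasks: list[TaskRecord]) -> dict[str, float]:
    """Percent of tasks with any degraded step, and percent of irreversible actions fired with taint."""
    total_tasks = len(tasks)
    total_actions = sum(t.irreversible_actions for t in tasks)
    degraded_tasks = sum(1 for t in tasks if t.degraded_steps > 0)
    tainted_actions = sum(t.tainted_irreversible for t in tasks)
    return {
        "pct_tasks_with_degraded_step": 100 * degraded_tasks / total_tasks if total_tasks else 0.0,
        "pct_irreversible_with_taint": 100 * tainted_actions / total_actions if total_actions else 0.0,
    }
```

Cache validity-miss rate and fallback divergence need their own data sources (the cache layer and a replay job), so they are left out here.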

The takeaway

The capacity toolkit from the last post is still step one — an agent that’s down helps nobody. But availability engineering has a hidden invoice: every mechanism that keeps the agent alive does it by substituting something for the fresh, primary, verified answer. That substitution is usually fine — which is exactly what makes it dangerous, because “usually fine” plus “irreversible” plus “silent” is how you get the 3am incident that no alert predicted.

Two gates. Tag what’s degraded. Taint what it touches. Check the trajectory, not the last call, before anything you can’t undo.

Uptime is table stakes. Correct uptime is the product.

Sources & further reading

Detecting AI Agent Failure Modes in Production, Latitude (2026) — chained corruption as the most common and most insidious production failure mode.
AI Agent Error Handling: 5 Patterns to Catch Silent Failures, Kevin Tan (2026) — the saga pattern, the 95%-confident migration story, and risk-based escalation.
AI Agent Failure Modes: What Goes Wrong in Production, Trantor (2026) — silent quality degradation from provider model updates and store drift.
International AI Safety Report 2026 — why agent failures are categorically riskier: actions in the world, no human in the loop.
My previous post on the capacity side — the availability toolkit this post is the second half of.

Credit where due: this post exists because ANP2 and Echo took the last one apart constructively in the comments — the “uptime, not correct uptime” framing and the latency-not-quality fallback distinction are theirs. Best argument I’ve had on this site. If you’re running agents in prod: do you track degraded-path exposure at all, or does your observability stop at error rates? Genuinely curious how rare Gate 2 is in the wild.

Top comments (25)

xulingfeng • Jun 12

The "uptime, not correct uptime" distinction is gold. We hit the same pattern with AI-driven test automation at my last company — pass rate climbed because the AI kept "fixing" flaky tests by shrinking their assertion scope. The pipeline stayed green, but the tests stopped catching real regressions.

The taint propagation approach for multi-step agents makes a lot of sense. Same correctness debt, different level of the stack — and way harder to spot until something irreversible happens.

Sergei Parfenov • Jun 12

the shrinking-assertion-scope story is the nastiest version of this pattern ive heard, because the degradation happened in the verification layer itself. my whole taint approach quietly assumes the verifier is trustworthy — tag the degraded data, gate the irreversible action, re-check against something solid. but when the thing that checks correctness is what degraded, uve lost the instrument that wouldve caught it. green pipeline, hollow assertions. thats not a quiet failure anymore, its a quiet failure with a forged alibi.

guess the test-automation version of my dashboard metric would be tracking assertion scope/strength over time, not pass rate — pass rate is exactly the metric the failure mode games.

Mykola Kondratiuk • Jun 14

removing the error without replacing it with a new signal is the core problem. one thing the toolkit needs: validate the fallback model's output schema separately - primary and fallback often return differently-structured responses, and format drift is invisible downstream.

Ahmet Özel • Jun 12

Good framing. Silent degradation is where agents get dangerous because the system still looks alive from the outside. One thing I like to add is an eval replay set for degraded runs: keep the tool trace, retrieved context and final answer together, then replay the same cases after prompt/tool changes. It catches cases where the agent learned to continue smoothly while carrying bad state forward.

Sergei Parfenov • Jun 12

the degraded-run replay set is a great addition — its basically the offline half of the "fallback divergence" metric from the post. i diff fallback answers against the primary now; ur replaying the whole trace after changes, which catches the scarier thing: the agent learning to glide smoothly over bad state. keeping trace + retrieved context + answer together is the part most people skip and then cant reconstruct. adding this to the toolkit.

Scarab Systems • Jun 12 • Edited

this is exactly the sort of pivot in approach I'm interested in...

I would take it even a step further... the agent should not need to carry state... state and context should be provided by something that can carry that weight cleanly and more importantly truthfully... the repo... then the agent can continue to do what it does best.. code.

Sergei Parfenov • Jun 12

externalizing state is the right instinct — stateless agents + a source of truth they read from beats agents lugging context around, agreed. and for code the repo is the best ledger we have.

but heres where it doesnt close the loop: the repo records outcomes, not provenance. a commit produced from a degraded fallback chain diffs identically to one produced from clean primary reasoning. git gives u receipts for what changed — its silent on whether u should trust how it got there. so moving state into the repo solves the "agent carries fragile context" problem, but the evidence problem just moves with it: something still has to carry the trajectory-level receipts alongside the artifact. repo as ledger for state, evidence layer for process. u need both, theyre answering different questions.

Scarab Systems • Jun 12

ah Yes! — this is the distinction I was reaching for, and I think you’re right to split it that way.

When I say the repo should carry the authority, I don’t mean the git diff alone proves the process. A commit can show what changed while saying almost nothing about whether the change preserved the right obligations.

The way I think about it is more like: the repo has to be read into a baseline first.

Not just “current files,” but the repo’s claims: tests, docs, contracts, generated-vs-source boundaries, config expectations, ownership surfaces, validation signals, and whatever the system already uses to say “this is true here.”

Then the agent is not carrying the burden of remembering all of that conversationally. It is working against a diagnostic baseline that can say: this claim existed before, this boundary owned it, this artifact was evidence for it, and this change either preserved, moved, weakened, or contradicted it.

So yes: repo as ledger for state, evidence layer for process — but I’d add that the evidence layer has to be grounded in a repo baseline, not just attached afterward as trace metadata.

That is the shape I’m interested in: before the workflow acts, it should be able to show both the artifact change and the evidence chain that says the change still belongs where it landed.

Scarab Systems • Jun 12

This is a really strong framing — especially the distinction between uptime and correct uptime.

The part that stands out to me is that the degraded path is not just a runtime state; it becomes an evidence problem. Once a fallback, stale cache hit, or retried side effect enters the chain, the question is no longer only “did the agent complete?” It becomes “what proof does the system still have that the completed trajectory preserved the intended boundary?”

That is very close to the diagnostic layer I’ve been exploring with Scarab/SDS. The failure is often not the loud error. The loud error is honest. The more dangerous failure is when the system keeps moving after the boundary that was supposed to preserve trust has already weakened.

The taint propagation point feels especially important. A degraded step should not be allowed to launder itself through later successful calls. If step 6 is clean but step 3 was degraded and never re-verified, the trajectory is still carrying that earlier uncertainty.

I like the “two gates” framing a lot. I’d almost describe Gate 2 as an evidence gate: before an irreversible action, the system has to prove not just that the last call succeeded, but that the whole chain still has valid provenance.

Sergei Parfenov • Jun 12

"evidence gate" is honestly a better name than mine — because it makes the obligation explicit. a trust tag is passive metadata; evidence is something the chain has to carry and produce on demand. step 6 shouldnt just be untainted, it should be able to show receipts for steps 1-5. same mechanism, stronger contract. stealing the term (with credit).

Scarab Systems • Jun 12

Yes — please take it and use it. Credit appreciated, but honestly the bigger thing is that we start naming the problem clearly enough to work on it together.

That “receipts for steps 1–5” phrasing is exactly the contract I was trying to get at. A tag describes a state, but an evidence gate asks whether the chain can actually produce proof for the state it is claiming.

The more we can shift the conversation from “did the agent finish?” to “what evidence does the workflow carry forward?”, the more useful the whole discussion becomes.

I think that shared language matters here because this failure mode is showing up in a lot of different places under different names. Once we can name it together, we can start designing around it instead of just reacting to it.

Sergei Parfenov • Jun 12

agreed — and ur "different names" point is literally true across fields: security calls it taint, data engineering calls it provenance, ML calls it lineage, audit calls it receipts. four communities, one shape: can you trace what this result stands on. agents just made it urgent because now the untraceable thing acts.
"what evidence does the workflow carry forward" is the right question to standardize on. good thread — this is going in the next post.

Manuel Bruña • Jun 15

Quiet failure is worse than a hard rate-limit error. For agent systems I’d rather have an explicit degraded state: skipped tool, stale data, partial result, retry budget exhausted. If that is hidden, the final answer looks more reliable than it is.

VoltageGPU • Jun 16

Great piece—very much in line with what I've seen in distributed systems. In GPU workloads, especially with rate-limited inference APIs, we often add retries with jitter, but subtle state corruption can still happen if the retry logic doesn't fully respect the original request context. It's a good reminder that availability isn't enough if correctness is compromised.

Lily • Jun 16

The distinction between uptime and "correct uptime" is something more teams should be talking about. Most dashboards celebrate successful requests, but very few measure whether degraded paths are influencing downstream decisions. The idea of propagating trust across an agent workflow feels like a natural evolution of traditional reliability engineering.

mote • Jun 18

Rate limits causing silent failures is worse than outright crashes — at least a crash gets logged. I've watched agents accumulate partial state across multiple 429 responses and then execute with half the context missing. The output looks plausible enough that nobody notices until corrupted data hits production three steps later.

The real problem is most agent frameworks treat rate limits as transport-layer issues rather than application-layer state corruption. A 429 isn't "try again later" — it means "your current execution branch is now poisoned." If the agent was in the middle of mutating internal state when the limit hit, the retry starts from a half-baked world.

How do you handle the case where the agent's internal state is already partially written when the rate limit fires? Undo the mutation or trust the retry with the dirty state?

VoltageGPU • Jun 12

Great post—this really hits on the nuance between availability and correct availability. In distributed systems, especially when dealing with GPU-accelerated workloads on platforms like VoltageGPU, it's easy to mask rate-limiting with retries, but that can lead to stale or incorrect results downstream. I've seen this in inference pipelines where cached responses were used under load, leading to subtle correctness issues that only surfaced in edge cases.

Alex Shev • Jun 12

This is the hidden cost of making agents more resilient. Retries, cache, fallback models, and degraded modes all improve uptime, but they can also hide the moment when the answer stopped being freshly earned.

I like the distinction between uptime and correct uptime. For agents, the SLO should probably include provenance: which inputs were current, which tools actually ran, which fallbacks triggered, and what confidence was produced by evidence instead of habit.
