We logged every rejected tool call for a month. A third were our validation being wrong, not the model.

#llm #python #devops

TL;DR: Everyone logs tool calls that error or return junk. We started logging the calls our own validation REJECTED before they ever ran. Over a month, about 1 in 3 of those rejections were false: the user's intent was valid, and our schema or precheck was too rigid to accept it. We had spent weeks hardening the guardrail and never checked whether it was now blocking real work.

The blind spot in "we added validation"

After an incident where our agent made a structurally valid but wrong tool call, we added a precheck layer in front of every state-mutating tool. Failures dropped, we moved on. What we did not log was the other side of the ledger: every time the precheck said no. A block felt like a success by definition. The agent tried something bad, we stopped it, good.

Then support started forwarding tickets where the agent refused something the user was clearly allowed to do.

What the rejection log showed

So we logged every rejection with three fields: which check fired, the full arguments, and the user-visible outcome. One month, 612 rejections. We hand-reviewed a sample.

Roughly a third were false rejections. The pattern was almost always the same: a check written to stop one specific bad case was also catching a legitimate neighbouring case nobody thought about when they wrote it. The "is this order in the cancellation window" check rejected legitimate cancellations when a timezone mismatch made an order look one hour outside a window it was actually inside. The "does this id exist in retrieved context" check rejected valid ids that arrived through a second tool the author had not considered.

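import logging

log = logging.getLogger(__name__)

def redact(args):
    # Hypothetical stand-in: replace with your own scrubber for PII and secrets.
    return {k: ("<redacted>" if k in {"email", "token"} else v) for k, v in args.items()}

def run_check(check, args, ctx):
    # Assumes each check returns a list of reason strings, empty when it passes.
    failures = check(args, ctx)
    log.info("tool_precheck", extra={
        "check": check.__name__,
        "rejected": bool(failures),
        "reasons": failures,
        # "args" is a reserved LogRecord attribute and raises KeyError in extra.
        "tool_args": redact(args),
        "outcome": "blocked" if failures else "passed",
    })
    return failures
# the 'rejected' branch is the one nobody reads. read it weekly.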
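
Once a sample of rejections has been hand-reviewed, the per-check false-rejection rate is a simple tally. Below is a minimal sketch of that tally, assuming each reviewed rejection is reduced to a (check name, verdict) pair. The check names and verdicts are made-up illustrative data, not our numbers, and `false_rejection_rates` is a hypothetical helper.

```python
from collections import Counter

# Made-up sample for illustration: (check name, reviewer verdict).
# The verdict is True when the reviewer judged the rejection to be false.
reviewed = [
    ("check_cancellation_window", True),
    ("check_cancellation_window", False),
    ("check_id_in_context", True),
    ("check_id_in_context", False),
    ("check_id_in_context", False),
    ("check_refund_limit", False),
]

def false_rejection_rates(reviewed):
    """Return {check name: share of its reviewed rejections judged false}."""
    totals = Counter(name for name, _ in reviewed)
    false_counts = Counter(name for name, is_false in reviewed if is_false)
    return {name: false_counts[name] / totals[name] for name in totals}

rates = false_rejection_rates(reviewed)
overall = sum(is_false for _, is_false in reviewed) / len(reviewed)

# Worst offenders first: these are the checks to loosen or split.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.0%} of reviewed rejections were false")
print(f"overall: {overall:.0%}")
```

Sorting by rate rather than by raw count is a choice: a check that rarely fires but is wrong half the time is a better candidate for a rewrite than a busy check that is almost always right.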

What we changed

Two things. First, a weekly fifteen-minute review of a sample of rejections, same as we review errors. False rejections get the check loosened or split. Second, checks now fail with a specific reason string the agent can act on, not a generic block, so a too-strict check often self-corrects: the agent reads "outside cancellation window by your local timezone" and escalates instead of dead-ending.
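
To make the second change concrete, here is a sketch of a cancellation-window check that returns a reason the agent can act on. It is an illustration under assumptions, not our production check: the 24-hour window, the `placed_at` argument, the `now` context key, and the wording of the reasons are all hypothetical. It compares timezone-aware datetimes, which Python subtracts correctly across different UTC offsets.

```python
from datetime import datetime, timedelta, timezone

CANCELLATION_WINDOW = timedelta(hours=24)  # assumed window length

def check_cancellation_window(args, ctx):
    """Return a list of reason strings; an empty list means the check passed."""
    placed_at = args["placed_at"]
    now = ctx["now"]
    # Naive datetimes are where off-by-an-hour rejections creep in, so refuse
    # to guess and say exactly why.
    if placed_at.tzinfo is None or now.tzinfo is None:
        return ["cannot evaluate cancellation window: timestamp has no timezone; escalate"]
    elapsed = now - placed_at  # aware datetimes are compared in absolute time
    if elapsed > CANCELLATION_WINDOW:
        over = elapsed - CANCELLATION_WINDOW
        return [f"outside cancellation window by {over}; escalate instead of retrying"]
    return []

# Order placed at 10:00 in a UTC+1 zone, which is 09:00 UTC.
placed = datetime(2024, 1, 1, 10, 0, tzinfo=timezone(timedelta(hours=1)))
inside = check_cancellation_window(
    {"placed_at": placed}, {"now": datetime(2024, 1, 2, 8, 30, tzinfo=timezone.utc)}
)
outside = check_cancellation_window(
    {"placed_at": placed}, {"now": datetime(2024, 1, 2, 9, 30, tzinfo=timezone.utc)}
)
print(inside)   # 23.5 hours elapsed, so no reasons
print(outside)  # 24.5 hours elapsed, so one reason naming the overshoot
```

The reason string carries the size of the overshoot, which lets the agent or a human reviewer tell a near miss from a request that is days late.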

The share of rejections that were false fell from about a third to under a tenth over six weeks. The number that matters more: support tickets about the agent refusing valid requests basically stopped.

The tension I have not resolved

Every loosened check lets more through, and what it lets through sits on the exact surface the check was added to close. We have not found a principled way to loosen a guardrail without quietly reopening the hole. Right now we lean on the canonical-examples test (the bad case that prompted the check stays frozen as a must-block), but that only protects against the failures we have already seen.
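
For readers who have not seen a canonical-examples test, here is a rough sketch of the shape, using a toy version of the id check. Everything in it is hypothetical: the check body, the `retrieved_ids` and `tool_result_ids` context keys, and the example ids. The must-pass list is an assumption added for illustration (freezing a confirmed false rejection the same way as the original bad case), and it has the same limit as the must-block list: it only covers cases already seen.

```python
def check_id_in_context(args, ctx):
    """Toy loosened check: accept ids from retrieval or from an earlier tool result."""
    known = set(ctx.get("retrieved_ids", ())) | set(ctx.get("tool_result_ids", ()))
    if args["order_id"] not in known:
        return [f"order_id {args['order_id']!r} not found in retrieved context or tool results"]
    return []

# Frozen cases: the bad call that prompted the check must stay blocked.
MUST_BLOCK = [
    ({"order_id": "A-999"}, {"retrieved_ids": ["A-1"], "tool_result_ids": ["A-2"]}),
]
# Assumption for illustration: confirmed false rejections are frozen as must-pass.
MUST_PASS = [
    ({"order_id": "A-2"}, {"retrieved_ids": ["A-1"], "tool_result_ids": ["A-2"]}),
]

def regressions(check):
    """Return a list of (problem, args) pairs; empty means no frozen case regressed."""
    problems = []
    for args, ctx in MUST_BLOCK:
        if not check(args, ctx):
            problems.append(("reopened hole", args))
    for args, ctx in MUST_PASS:
        if check(args, ctx):
            problems.append(("false rejection is back", args))
    return problems

# Run this whenever a check is loosened or split.
assert regressions(check_id_in_context) == []
```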

If you run guardrails on an agent: do you measure your false-rejection rate at all, and if so, how do you loosen a check without trusting that a frozen example covers the regression? That is the part I keep getting wrong.