Stop trusting the agent: bind tool-call approvals to the exact call

#security #ai #python #llm

Agentic systems gate dangerous tool calls — file writes, money movement, deploys — behind an "approval": a human-in-the-loop click, or a policy check. Look at how that approval is usually represented and you'll often find a boolean sitting in the run/session state: approved: true.

A boolean is the wrong primitive, and it fails in three ways that prompt injection is happy to exploit.

Three ways an approval boolean breaks

Flip. Anything that can write the run state — a serialized context crossing a process/durable-execution boundary, a confused-deputy code path, an injection that steers state — turns false into true.
Replay. You approved "read report.csv". The approval is just true, so the same flag is honored for the next tool call too — "delete prod.db". The boolean doesn't know which call it approved.
Argument drift. You approved "transfer $10 to alice". Between approval and execution the args mutate to $10,000. The boolean still says approved.

The root cause is the same in all three: the approval is modeled as a property of the run, when it should be evidence for one specific call.

Bind the approval to the call

When approval is granted, mint a tag over the things that must not change: the tool-call id, a digest of the canonical arguments, the principal, and an expiry. Verify it at dispatch, against a per-run secret.

import hmac, hashlib, json, time

def canon(args: dict) -> bytes:
    # canonical serialization so benign reserialization doesn't invalidate a token.
    # (production: RFC 8785 JCS, which also normalizes numbers — 10 vs 10.0)
    return json.dumps(args, sort_keys=True, separators=(",", ":")).encode()

def mint(key: bytes, call_id: str, args: dict, principal: str, ttl: int = 300) -> dict:
    exp = int(time.time()) + ttl
    digest = hashlib.sha256(canon(args)).hexdigest()
    msg = f"{call_id}|{digest}|{principal}|{exp}".encode()
    tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return {"call_id": call_id, "principal": principal, "exp": exp, "tag": tag}

def verify(key: bytes, tok: dict, call_id: str, args: dict, principal: str) -> bool:
    if tok.get("call_id") != call_id:      return False   # replay onto another call
    if tok.get("principal") != principal:  return False   # wrong principal
    if tok.get("exp", 0) < time.time():    return False   # expired
    digest = hashlib.sha256(canon(args)).hexdigest()
    msg = f"{call_id}|{digest}|{principal}|{tok['exp']}".encode()
    expect = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expect, tok["tag"])         # forged / flipped / arg-drift

Run the three attacks against it (plus principal-swap and a forged tag):

KEY = b"per-run-secret-not-a-global-one"
tok = mint(KEY, "call-1", {"amount": 10, "to": "alice"}, "user:42")   # approve $10 to alice

verify(KEY, tok, "call-1", {"amount": 10,    "to": "alice"}, "user:42")  # True   legit
verify(KEY, tok, "call-2", {"amount": 10,    "to": "alice"}, "user:42")  # False  replay
verify(KEY, tok, "call-1", {"amount": 10000, "to": "alice"}, "user:42")  # False  arg drift
verify(KEY, tok, "call-1", {"amount": 10,    "to": "alice"}, "user:99")  # False  wrong principal
verify(KEY, {**tok, "tag": "00"*32}, "call-1", {"amount": 10, "to": "alice"}, "user:42")  # False  forged

The flag can no longer be flipped (no valid tag), replayed (call-id is in the MAC), or drifted (args digest is in the MAC). An attacker who fully controls the transported state still can't manufacture a token without the key.

Three details that decide whether it actually holds

Canonicalization. Both sides must hash the same bytes. Sort keys, and normalize numbers (10 vs 10.0 vs 1e1 must agree) — RFC 8785 (JSON Canonicalization Scheme) is the off-the-shelf answer. Put the canonicalization recipe id inside the hashed bytes so the two sides can't silently disagree about the rules.
Fail closed, with a typed result. Absent / expired / mismatched ⇒ a distinct "not approved" outcome — not a normal tool payload, and not a generic exception. Otherwise "approval missing" is indistinguishable downstream from "the tool ran and returned something falsy," and the caller can't tell whether to re-request approval.
One enforced checkpoint, deny-by-default. This belongs at the single point right before dispatch: Semantic Kernel's AUTO_FUNCTION_INVOCATION filter (don't call next ⇒ the call is skipped), ADK's before_tool callback, or the MCP tool-call boundary. Tools that need approval are classified as such; anything unclassified is denied, not allowed through.

The gotcha that bites in production: replay

If your agent runs on a replay-based durable-execution engine (Temporal and friends), the per-run secret must survive replay. Workflow code is re-executed from history on recovery, so a key minted with a non-deterministic call won't match the token already in history — approvals verify fine in dev and then fail closed after the first worker restart, which is the worst possible time to discover it. Derive the key deterministically (HKDF(server_secret, run_id)) or establish it once via a recorded side-effect, and make the expiry deterministic too rather than reading wall-clock inside workflow code.

The takeaway

Authorization in an agent system shouldn't be ambient, mutable state that travels with the run. It should be evidence bound to a single call envelope — this principal, this tool, these exact arguments, until this time — that the executor re-verifies at the moment of dispatch. The boolean isn't a simplification of that; it's the bug.

I work on reliability and verification for AI and numerical systems — agent authorization, determinism, and "prove the thing that claims to be authorized actually was." The snippet above is runnable as-is. Happy to compare notes if you're hardening an agent's tool boundary — GitHub.