Here's a thing that happened to a developer I was talking to recently, and I think anyone who's used a coding agent is going to recognize it.
He set up a rule to block rm in his Claude Code workspace, which is a pretty reasonable thing to do. Then he asked it to clean up some stale files, and it tried rm, hit the block, and then just went "since rm is blocked, I'll use Python instead" and deleted them with python3 -c "import os; os.remove(...)". Task complete. The rule was technically still there, but the files were gone anyway.
The thing is, the agent wasn't being malicious or sneaky. It was being helpful. You told it to delete the files and you didn't actually take away the goal, so it found the next tool in the box and got it done. This is basically the whole problem with trying to keep coding agents in line. A rule that lives inside the agent's context is a suggestion, and the agent can always reason its way around a suggestion.
Why blocking commands doesn't work
The natural instinct is to block the specific scary thing. No rm, no git push --force, no curl to some host you don't recognize. But an agent that can actually reason has more than one way to get anywhere. You block rm, it reaches for Python. You block the obvious shell call, it writes a little script that does the same thing. You end up playing whack-a-mole against something that's much better at finding paths than you are at blocking them, because finding the path is the whole thing it's good at.
The deeper issue is where the rule lives. If it's in the prompt or a config the agent can see, it's part of the agent's reasoning, and anything the agent reasons about, it can reason around. What you actually want is a check that sits outside the agent entirely, somewhere it can't see or skip, that every tool call has to physically pass through before it runs.
How I set this up with Faramesh
Faramesh is the open source thing I've been building for exactly this. The key idea for Claude Code is that you don't modify the agent at all. Claude Code talks to its tools over MCP, so Faramesh runs an MCP proxy: a local port that speaks the same protocol, sits between Claude Code and the real MCP server, and evaluates every tool call against your policy before forwarding it. Permit, deny, or defer to a human, decided by a deterministic engine with no LLM in the path.
The reason this matters: because it's a proxy the agent connects through, not a rule the agent reads, it isn't something Claude Code can route around. The call physically has to go through the daemon to reach the tool. That's the difference between asking the agent not to do something and actually being in the path when it tries.
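To make "being in the path" concrete, here's a minimal Python sketch of the idea. It is not how Faramesh is implemented; ToolGate, decide, and the toy tools are all made up for illustration. The point is only that the real tools are held by the gate, so a call that isn't permitted never reaches them, whatever command the agent came up with.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToolGate:
    """Holds the real tools and exposes them only through call() (hypothetical)."""

    def __init__(self, tools: Dict[str, Callable[..., Any]],
                 decide: Callable[[str, dict], str]) -> None:
        self._tools = tools    # the agent never gets a direct reference to these
        self._decide = decide  # deterministic policy function, no LLM involved
        self.pending: List[Tuple[str, dict]] = []  # deferred calls awaiting a human

    def call(self, tool: str, args: dict) -> dict:
        decision = self._decide(tool, args)
        if decision == "permit":
            # Only permitted calls are forwarded to the real tool.
            return {"decision": "permit", "result": self._tools[tool](**args)}
        if decision == "defer":
            self.pending.append((tool, args))
            return {"decision": "defer", "result": None}
        return {"decision": "deny", "result": None}


def decide(tool: str, args: dict) -> str:
    """Toy policy: reads run, writes wait for approval, everything else is denied."""
    if tool == "fs_read":
        return "permit"
    if tool == "fs_write":
        return "defer"
    return "deny"


deleted: List[str] = []  # records every shell command that actually ran
gate = ToolGate(
    tools={
        "fs_read": lambda path: f"contents of {path}",
        "shell_exec": lambda cmd: deleted.append(cmd),
    },
    decide=decide,
)

print(gate.call("fs_read", {"path": "README.md"})["decision"])
print(gate.call("shell_exec", {"cmd": "rm -rf build"})["decision"])
print(gate.call("shell_exec", {"cmd": "python3 -c 'import os; ...'"})["decision"])
print(deleted)  # stays empty: neither shell command reached the tool
```

Both the rm attempt and the python3 workaround arrive as the same tool call, so they get the same answer.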
Here's the whole setup.
Install
curl -fsSL https://install.faramesh.dev/install.sh | bash
faramesh --version
Declare the policy and the proxy port
In your project, your governance.fms looks roughly like this. You import the MCP framework profile, set a proxy port, and write your rules:
import "github.com/faramesh/faramesh-registry/frameworks/mcp@1.0.0"
runtime {
  mode = "enforce"
  mcp_proxy_port = 8081
}
agent "coding-agent" {
  default deny
  rules {
    permit fs_read # reading files is fine
    permit search_codebase # searching the repo is fine
    permit run_tests
    defer fs_write # writing/editing files -> ask me first
    deny shell_exec # raw shell stays off
  }
}
A couple of things worth knowing. default deny means anything you didn't explicitly allow is blocked, so a tool you forgot about can't quietly slip through. And the tool names (fs_read, fs_write, shell_exec, etc.) are whatever your MCP server actually exposes, you reference them exactly as the server names them. Swap these for the tools your setup actually has.
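If it helps to see the default deny behavior as plain logic, here's a rough Python equivalent of the rules above. The RULES dict and the decide function are illustrative assumptions, not Faramesh's actual data model or API.

```python
# Hypothetical in-memory mirror of the rules block; not Faramesh's real format.
RULES = {
    "fs_read": "permit",
    "search_codebase": "permit",
    "run_tests": "permit",
    "fs_write": "defer",
    "shell_exec": "deny",
}


def decide(tool_name: str, default: str = "deny") -> str:
    """Look up a tool by its exact name; anything unlisted gets the default."""
    return RULES.get(tool_name, default)


print(decide("fs_read"))    # permit
print(decide("fs_write"))   # defer
print(decide("fs_delete"))  # deny: never listed, so the default applies
print(decide("FS_Read"))    # deny: names must match exactly
```

The last two lines are the part that matters: a tool you never wrote a rule for, or a name that doesn't match exactly, falls through to deny instead of running.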
Start Faramesh
faramesh apply
This compiles your policy and starts the daemon. The proxy binds on http://localhost:8081/mcp.
Point Claude Code at the proxy
In your Claude Code MCP config, route your tool server through Faramesh instead of connecting to it directly:
{
  "mcpServers": {
    "my-tools": {
      "command": "/path/to/real-mcp-server",
      "args": [],
      "proxy": "http://localhost:8081"
    }
  }
}
That's the whole integration. No code changes, no wrapping tools by hand. Every tool Claude Code calls now passes through Faramesh first.
How the workaround dies
Now go back to the rm -> python3 story. With this in place, the agent doesn't get a free pass to the filesystem just because it found a different command. Everything routes through the proxy, and default deny means the only things that run without asking are the ones you explicitly permitted (reads, search, tests). The moment it reaches for a write or a shell call, that lands on a defer or a deny, so it stops and waits for you instead of quietly running. The agent can't reason its way around a network hop it doesn't control.
When something defers, you'll see it in the approvals queue:
faramesh approvals list
faramesh approvals approve <id> # or: faramesh approvals deny <id>
Approve and the call goes through. Deny and it never happens. Either way the call, the decision, and the reason all land in an audit log you can read back later with faramesh explain <action-id>.
Start in shadow mode if you want to ease in
Flipping straight to enforce on your daily driver can feel aggressive, so you don't have to. Set the runtime mode to shadow and Faramesh logs what it would have blocked or deferred without actually stopping anything. Run Claude Code normally for a few days, look at what it flagged with faramesh approvals list, tune the rules against how you actually work, then switch to enforce. Way less guessing.
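Conceptually, the difference between the two modes is tiny. Here's a sketch in Python, assuming the policy has already produced a decision; apply_mode and its return shape are hypothetical names for illustration, not Faramesh internals.

```python
def apply_mode(decision: str, mode: str) -> dict:
    """Combine a policy decision with the runtime mode (illustrative only)."""
    if mode not in ("enforce", "shadow"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "shadow":
        # Shadow mode never stops a call; it only records what would have happened.
        return {"forward": True, "recorded": decision}
    # Enforce mode forwards only permitted calls; defers wait, denies stop.
    return {"forward": decision == "permit", "recorded": decision}


for decision in ("permit", "defer", "deny"):
    print(decision, apply_mode(decision, "shadow"), apply_mode(decision, "enforce"))
```

The recorded decision is the same in both modes, which is why a few days of shadow data tells you what enforce would have done.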
The one thing worth taking from this even if you never touch Faramesh
Forget the tool for a sec. The thing I actually want to get across is that a prompt instruction, or a single blocked command, just isn't a real control for a coding agent. The agent isn't bound by it, it's nudged by it, and nudged stops being enough the moment it can touch your filesystem, your shell, or your credentials.
If you want real control it has to live outside the agent, somewhere it can't see or skip, and every action has to pass through it. Build that yourself or grab something off the shelf, doesn't matter, but that's the bar. The agent doesn't get to be the thing that decides what the agent is allowed to do.
Repo's here if you want to mess with it: github.com/faramesh/faramesh-core. It works with a bunch of other agents and frameworks too (LangGraph, LangChain, CrewAI, Cursor, others), Claude Code's just the one most people have actually felt this with. If you try it and something's rough or confusing, please yell at me. I would love to hear about it!
Top comments (12)
The proxy fixes where the control lives but leaves the predicate at the same granularity that made "block rm" fail in the first place. You're gating on tool identity — fs_write, shell_exec — and tool identity is the coarse label the agent already proved it can route around.
fs_write to ./scratch/notes.md and fs_write to .git/hooks/post-checkout are the same permit. So the agent that can't rm through the shell writes an executable .git/hooks/post-checkout, or a Makefile target your permitted test runner shells out to, and lets a tool you allowed do the deletion for it. Same whack-a-mole, one level down: not "which command" anymore, but "which permitted tool can be bent into the command."
What closes that isn't location, it's binding the decision to the effect instead of the verb: the resolved path, the target host, the argument semantics, not the label the MCP server happens to print. default-deny on tool names gates the noun. The capability is the (verb, object) pair, and the object is where the reasoning slips through.
Two things that bite later:
The audit log inherits the gate's blind spot. explain <id> can only report what the gate evaluated. If the predicate was "fs_write: permit," the log says fs_write was permitted, not that it wrote into .git/. Six months on you can't ask "did it ever write outside the project," because that was never the predicate. A log is worth exactly the question the gate asked.
Defer-to-a-human is where the determinism leaks back out. Every fs_write blocking on approval trains whoever's at the prompt to bulk-approve, and an agent optimizing for task-completion learns which framings clear the queue fastest. The approvals prompt becomes the one spot in the path that can still be reasoned at, which makes it the thing that gets routed around. "Can't see or skip" has to cover the defer branch too, or the deterministic engine is just sitting upstream of a rubber stamp.
None of this argues against the proxy, it's the right place to stand. It's that standing outside the agent buys you the location, not the resolution: a coarse predicate gets routed around wherever you put it.
You're right that the tool name is the coarse part, permitting fs_write on its own doesn't say much, the rule has to look at the object, the path it resolves to and the args, or it's the same problem a level down. That's what the conditions in the policy are for, and yeah the post didn't get into that side.
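As a rough illustration of a condition on the resolved object, here's a Python sketch rather than Faramesh's policy language. decide_write and the scratch/.git rules are assumptions made up for the example; the idea is just that the decision is taken on where the path actually lands, not on the tool name or the string the agent passed.

```python
import tempfile
from pathlib import Path


def decide_write(project_root: str, requested_path: str) -> str:
    """Decide an fs_write by its resolved target path (hypothetical rules)."""
    root = Path(project_root).resolve()
    # resolve() collapses ".." segments and follows symlinks before we judge.
    target = (root / requested_path).resolve()
    try:
        rel = target.relative_to(root)
    except ValueError:
        return "deny"      # resolved outside the project
    if rel.parts and rel.parts[0] == ".git":
        return "deny"      # repo metadata, including hooks
    if rel.parts and rel.parts[0] == "scratch":
        return "permit"    # throwaway area
    return "defer"         # everything else goes to a human


root = tempfile.mkdtemp()  # stand-in for a real project directory
for path in ("scratch/notes.md",
             ".git/hooks/post-checkout",
             "scratch/../.git/hooks/post-checkout",
             "../outside.txt",
             "src/app.py"):
    print(path, "->", decide_write(root, path))
```

The third path is the interesting one: it starts in scratch/ but resolves into .git/, so a check on the raw string would have let it through.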
The defer point is fair too. If you defer too much the human just turns into a rubber stamp and stops really reading. I think the answer is keeping defers rare and high-signal, show the resolved path and the actual diff so there's something real to look at, collapse the repeats. This doesn't kill the problem but it keeps people from going numb to it.
Right, conditions on the resolved object and args is where it has to live. The one place I'd push on the mitigation is the collapse key, because that's where this quietly comes apart. If you collapse repeats by the request shape or the tool, an agent optimizing to clear the queue just perturbs an argument that doesn't change the effect — a different path, a reordered call — and every request looks novel again, so it re-floods. The inverse failure is worse: batch a lot of small effects under one "collapsed" approval whose diff is too big to actually read, and you've rebuilt the rubber stamp, except now there's a green record saying it was reviewed.
So collapse on the same resolved-effect predicate the gate decides on — effect class = verb × resolved-object-class × scope — not on what the request looks like. Then "decide once" means once per predicate, novelty gets counted in predicates instead of requests, and neither dodge works: you can't manufacture a new request out of an arg that doesn't move the effect, and you can't hide N effects under one approval because each distinct class is its own decision.
The other gap in "show the actual diff" is that some effects don't have a reviewable diff — a network egress, a credential read, a delete whose diff is just absence. For those the high-signal thing isn't the instance, it's the boundary: show the resolved effect plus what else the same approval authorizes ("this also lets it do X and Y under this predicate"). The point is to make the approver see the scope they're signing off on, not the one call, since the one call is exactly the part that looks harmless.
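A small Python sketch of collapsing on the effect class instead of the request shape. The object classes, the scope label, and the function names are invented for illustration, and the paths are assumed to be already resolved and relative to the project root.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple

EffectClass = Tuple[str, str, str]  # (verb, resolved-object-class, scope)


def object_class(rel_path: str) -> str:
    """Map an already-resolved, project-relative path to a coarse class."""
    path = PurePosixPath(rel_path)
    if path.parts and path.parts[0] == ".git":
        return "vcs-metadata"
    if path.suffix in {".md", ".txt"}:
        return "docs"
    return "source"


def collapse_key(verb: str, rel_path: str, scope: str = "project") -> EffectClass:
    """Requests with the same key are the same decision for the approver."""
    return (verb, object_class(rel_path), scope)


requests = [
    ("fs_write", "docs/a.md"),
    ("fs_write", "docs/b.md"),                 # new path, same effect class
    ("fs_write", ".git/hooks/post-checkout"),  # different class, separate decision
]

groups: Dict[EffectClass, List[str]] = {}
for verb, path in requests:
    groups.setdefault(collapse_key(verb, path), []).append(path)

for key, paths in groups.items():
    print(key, paths)
```

Changing the file name doesn't create a new approval, and the hook write can't ride along under the docs approval because it lands in its own class.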
This is the exact reason prompt-level rules are not a security boundary.
If the goal is still reachable through another tool, the agent will route around the blocked command because that is what "helpful" looks like. Real control has to remove or sandbox the capability, not just tell the model which path is disallowed.
Exactamundo. Telling it which path is off limits just means it will grab another one. It has to be the capability itself... either take it away or make every use of it go through something the agent doesn't control. Appreciate you reading it!
The proxy approach is the right place to stand, but yeah the rubber stamp problem is real. Keeping defers rare and showing the actual diff helps, but you have to fight approval fatigue or the human becomes just another thing the agent routes around.
Yeah exactly. If the human stops reading, the defer is just a slower yes. Keeping them rare enough to actually mean something is the key
Excellent insight.
This perfectly highlights the difference between guiding an AI agent and actually controlling one.
Prompt-based restrictions and blocked commands are only suggestions to a reasoning model, which can often find alternative execution paths to achieve the same objective.
Real security comes from enforcing deterministic policies outside the agent's reasoning loop, where every privileged action must pass through an independent authorization layer.
This follows the same proven principles behind zero-trust architecture: never rely on the agent to police itself. External governance, capability based permissions, human approvals for sensitive operations, and comprehensive audit logs are what make AI automation secure and production-ready.
This is exactly the direction enterprise AI security needs to move.
Yeah, zero-trust is the way to look at it. The agent doesn't get to police itself, the decision has to live somewhere outside its reasoning loop that it can't get at. Appreciate you reading!
ANP2 named the two layers that bite — predicate granularity and defer-queue routing. The follow-up I'd add cuts in from the rule-file side.
The proxy is deterministic at runtime. governance.fms is authored at design time. Whoever writes the conditions (and ANP2 is right that the predicate has to bind to the effect, not the verb: resolved path, target host, argument semantics, not the label the MCP server happens to print) has to keep those conditions honest as the codebase moves under them. Six months in, the path predicates you wrote against today's repo layout don't match next quarter's, and a permit that was fine when authored is now permissive of writes the original author never intended. The audit log inherits not just the gate's blind spot but the rule file's drift. "Did fs_write ever target .git/" only answers correctly if the predicate stayed accurate to what .git/ meant when the question matters.
The defer-queue point cuts the same way one level up. Bulk-approve fatigue isn't only a UX problem — it's the rule file telling you it's mis-calibrated. Defers that hit a rubber stamp are evidence the predicates are too coarse and need to be tightened. So the proxy's most useful long-tail signal might be: which permits drove the largest fraction of defers that operators approved? That's the rule file showing you where it's drifting from operational reality. Without a discipline to surface and revise those rules, you've moved the trust boundary, not eliminated it. Curious how Faramesh handles rule lifecycle — does governance.fms get diffed, versioned, retired, or does it accumulate? Treated like code with PR review, or like config that gets edited in place?
Yeah, governance.fms is treated like code, not config you edit in place. It's versioned and PR'd in the repo, and the daemon won't hot-reload it, a change only goes live on faramesh apply. If it just read the file on its own, anyone with file-write basically owns your policy.
For the drift part, that's mostly what faramesh plan is for. It replays your real decision history against the new policy and shows what would change before you ship, so tightening a predicate isn't a guess.
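The replay idea itself is simple enough to sketch in Python. This is not how faramesh plan is implemented and it doesn't use its output format; the history records, both policy functions, and replay are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Call = Dict[str, str]            # hypothetical record: {"tool": ..., "path": ...}
Policy = Callable[[Call], str]   # returns "permit", "defer", or "deny"


def replay(history: List[Call], old: Policy, new: Policy) -> List[Tuple[Call, str, str]]:
    """Return every past call whose decision changes under the new policy."""
    changes = []
    for call in history:
        before, after = old(call), new(call)
        if before != after:
            changes.append((call, before, after))
    return changes


def old_policy(call: Call) -> str:
    return "permit" if call["tool"] in {"fs_read", "fs_write"} else "deny"


def new_policy(call: Call) -> str:
    # Tightened rule; assumes paths in the history are already resolved.
    if call["tool"] == "fs_write" and call["path"].startswith(".git/"):
        return "deny"
    return old_policy(call)


history = [
    {"tool": "fs_read", "path": "README.md"},
    {"tool": "fs_write", "path": "src/app.py"},
    {"tool": "fs_write", "path": ".git/hooks/post-checkout"},
]

for call, before, after in replay(history, old_policy, new_policy):
    print(f"{call['tool']} {call['path']}: {before} -> {after}")
```

Only the calls whose outcome changes show up, so the review is about the difference the new rule makes rather than the whole log.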
And the defer signal you're describing is already in the audit log, every approved defer is recorded against the rule that fired, so you can see which permits are generating the most approvals and tighten those. It's queryable now, just not packaged into a dashboard.
Plan-against-history is the answer I was hoping was there — tightening predicates with a dry-run against the actual defer corpus closes the calibration question without guessing. And the defer-signal-already-in-audit-log point lines up with my experience: raw queryability tends to beat a fixed dashboard once you know which questions matter.