DEV Community

Brian Hall

Don't use an LLM to decide what your AI agent is allowed to do

Soft intelligence vs hard enforcement

I'm in a group called AARM. It's a bunch of people trying to work out how you actually secure what an AI agent can do once it's running, and the basic idea is that the control has to sit right at the action. You check a tool call before it runs, and the agent can't wriggle around the check. So everyone in there already agrees that telling an agent "please don't" isn't a security model.

What gets me is that even in that room, I keep seeing people reach for an LLM to be the thing that makes the call. The agent goes to do something, you take that action and hand it to a second model, ask it whether it's fine, and whatever it answers is what happens. A model watching the model. I don't really get it, and I want to walk through why, because I think people lean on this without sitting with what it actually buys them.

What you're actually defending against

Go back to why you want a guard on the agent in the first place. It's there because the agent can be talked into things. Some prompt injection sitting in a page it reads, a tool result that quietly hands it a new instruction, a user who words a request just so. The agent is a thing you can reason with, and the worry is that the wrong person reasons with it.

Now look at what the LLM-judge setup does about that. It puts a second thing you can reason with in front of the first one. That's the part I get stuck on, because it's the same weakness wearing a different hat. If somebody can craft input that bends the agent, there's a real chance the same sort of input bends the judge too, since under the hood it's the same kind of system responding to the same kind of pressure.

Maybe it holds. Maybe you've prompted the judge more carefully and it's tougher to push around. But "harder to talk into it" is a strange thing to be resting on when not getting talked into it is the entire job you hired this layer for.

Same question, different answer

There's a second problem, and in day-to-day terms it's the one that actually bugs me. You can ask a model the same question twice and get two different answers. That isn't a bug you patch out; it's just what the thing is. It's sampling. It isn't a function that hands back the same output every time you give it the same input.

For most of what we build, that's completely fine, and honestly it's part of why models are useful. But once the question is something like whether the agent gets to drop the production database, that property turns into a real liability. The same action can get waved through on Tuesday and stopped on Wednesday, and there's no reason you can actually point at, because there isn't one. There's just a different roll of the dice. Good luck writing that up for an auditor, or explaining it to yourself at two in the morning when you're trying to figure out how something got through that shouldn't have.
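A toy version of that dice roll, sketched in Python. Nothing here calls a real model: toy_judge and its p_block parameter are invented stand-ins for a sampled verdict, and the 98% figure is an arbitrary number picked for illustration.

```python
import random


def toy_judge(action: str, rng: random.Random, p_block: float = 0.98) -> str:
    """Stand-in for a sampled verdict. NOT a real LLM call.

    The action text is ignored on purpose: the point is only that the
    verdict is drawn from a distribution instead of computed from the input.
    """
    return "block" if rng.random() < p_block else "allow"


rng = random.Random()  # unseeded, so every run rolls differently
verdicts = [toy_judge("drop production database", rng) for _ in range(1000)]

# Same question asked 1000 times; the split between the two answers varies per run.
print("blocked:", verdicts.count("block"), "allowed:", verdicts.count("allow"))
```

Run it a few times and the split moves around. A judge that blocks 98% of the time sounds solid until you notice that the rest of the attempts went straight through.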

A rule doesn't behave that way. A rule like "deny delete on production" means the production database does not get deleted, every single time, no exceptions. You can read the rule, you can test it, you can pull up the log six months later and see exactly what got asked and what came back. The decision is something you can actually stand behind, which is the whole reason it can be the part you trust.
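Here's a minimal sketch of that kind of rule in Python, just to show the shape. The Rule type, the POLICY tuple and the decide function are hypothetical names for illustration, not the syntax of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    effect: str       # "permit" or "deny"
    action: str       # e.g. "delete"
    environment: str  # e.g. "production"


# Hypothetical policy: first matching rule wins, anything unmatched is denied.
POLICY = (
    Rule("deny", "delete", "production"),
    Rule("permit", "delete", "staging"),
)


def decide(action: str, environment: str, policy=POLICY) -> str:
    """Pure function of its inputs: same question, same answer, every time."""
    for rule in policy:
        if rule.action == action and rule.environment == environment:
            return rule.effect
    return "deny"  # fail closed when no rule matches


# Ask the same question 1000 times and collect the distinct answers.
decisions = {decide("delete", "production") for _ in range(1000)}
print(decisions)  # {'deny'}
```

There's nothing to sample and nothing to persuade. If the answer ever surprises you, the explanation is a line in the policy you can point at.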

This isn't an argument against LLMs

I want to be careful here, because it's easy to take this too far, and the version where models have no place anywhere near security is also wrong.

Models are great at a lot of this. Looking at an action and noticing something's off about it. Telling you a piece of text is sensitive. Putting a rough score on how risky something seems. Picking up on a pattern across a string of calls that no fixed rule was ever going to catch. That's all real, and for a lot of it a model is the best tool you've got. The issue was never an LLM being near the security boundary. It's the LLM being the boundary, the thing that says the final yes or no.

So where I land is layered. Let the model do the soft work it's genuinely good at, watching for the weird thing, flagging it, telling you to go take a look. Just don't let it be what opens the gate. The actual call on whether a real action runs has to sit on something that gives the same answer every time and can show its work afterward. The model can feed into that all it wants. It just can't be the thing that decides.
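One way to wire that up, as a rough sketch. It assumes the model's opinion arrives as a number between 0 and 1 (model_risk). The 0.7 threshold, the gate function and the "defer" outcome (hand the call to a human) are placeholders made up for the example.

```python
RISK_THRESHOLD = 0.7  # hypothetical cutoff, set by a human and not by the model


def gate(action: str, environment: str, model_risk: float) -> str:
    """Deterministic gate. The model's score is an input, never the verdict."""
    # Hard rule first. No score, however low, can override it.
    if action == "delete" and environment == "production":
        return "deny"
    # The soft signal can only make the outcome stricter: permit becomes defer.
    if model_risk >= RISK_THRESHOLD:
        return "defer"  # pause the call and ask a human to look
    return "permit"


print(gate("delete", "production", model_risk=0.0))  # deny
print(gate("read", "production", model_risk=0.9))    # defer
print(gate("read", "production", model_risk=0.1))    # permit
```

Given the same action and the same score, gate returns the same answer every time, and no score can turn a deny into a permit. The model's noise can only move a call from permit to defer, which costs a human a look, not a database.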

Where it actually bites

The closer your agent gets to anything that matters (money, prod, customer data), the less theoretical any of this is. If the worst it can do is write a bad paragraph, then fine, none of this is worth losing sleep over and you should go do something more useful with your afternoon. But the moment it can move money or drop a table, what's allowed to run can't come down to a coin flip, and it really can't live inside the same kind of system you were trying to protect yourself from to begin with.

Put the smart, context-aware stuff where it's strong, which is noticing when something's wrong. Put the hard line somewhere the agent can't talk its way past.

That last part is the thinking behind Faramesh, the open-source thing I've been building. The permit/deny/defer decision is deterministic, with no model sitting in that path, and every call lands in a signed log. But the tool is kind of beside the point. Even if you go build your own version of this, keep the final decision off the model. That piece should be boring on purpose.
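For the build-your-own case, a hash-chained, HMAC-signed log can be sketched with nothing but the Python standard library. This is not Faramesh's actual log format or API, which this post doesn't spell out. The append_entry and verify functions and the hard-coded SECRET are assumptions for the example, and a real key would live somewhere the agent can't reach.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key-only"  # assumption: stands in for a key kept outside the agent's reach


def _sign(body: dict) -> str:
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def append_entry(log: list, action: str, decision: str) -> dict:
    """Append a decision, chaining it to the previous entry's signature."""
    prev = log[-1]["sig"] if log else ""
    body = {"action": action, "decision": decision, "prev": prev}
    entry = {**body, "sig": _sign(body)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every signature and check that the chain is unbroken."""
    prev = ""
    for entry in log:
        body = {k: entry[k] for k in ("action", "decision", "prev")}
        if entry["prev"] != prev:
            return False
        if not hmac.compare_digest(entry["sig"], _sign(body)):
            return False
        prev = entry["sig"]
    return True


log = []
append_entry(log, "delete production-db", "deny")
append_entry(log, "read staging-db", "permit")
print(verify(log))  # True

log[0]["decision"] = "permit"  # someone quietly rewrites history
print(verify(log))  # False
```

Editing an old entry breaks its signature, and dropping one breaks the chain, so the record of what was asked and what came back can be checked later instead of taken on faith.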

Top comments (7)

Aljen M

Hello Mr. Hall,

Thank you for writing a good post.

This is my opinion:

This is a strong engineering position and it reads like it comes from real system experience rather than theory.
The core separation you draw between “soft intelligence” and “hard enforcement” is exactly the line most agent systems eventually rediscover after incidents.
You are right that once an LLM is allowed to authorise actions, the system inherits the same manipulability class as the agent itself.
A second model acting as a judge does not remove the trust boundary problem; it only duplicates it in a slightly different form.
Even if the judge is better prompted or more constrained, it is still exposed to adversarially shaped inputs coming from the same pipeline.
The non-determinism argument is also valid in practice because operational systems need reproducible decisions for audit, debugging, and compliance.
However, it is also worth acknowledging that determinism alone is not sufficient unless the policy layer is correctly defined and maintained. The strongest part of your framing is the idea that enforcement must be boring, explicit, and externally verifiable.
LLMs are still extremely useful in this architecture when used as detectors, scorers, or signal generators rather than arbiters.
In real deployments, the safest pattern tends to be “model suggests, system decides,” not the reverse.
Overall, your argument is directionally correct, but its impact would be even stronger if you explicitly addressed hybrid systems where LLM judgments are converted into strict, non-probabilistic policy outputs.

Best regards,

Aljen

Mike Czerwinski

The same-question-different-answer problem is the load-bearing thing — non-determinism in the enforcement path means the policy you think you've written isn't the policy that runs. Agreed completely, with one refinement: the model still has a role, just not at the enforcement seat. It can propose, surface contradictions, even flag candidate violations. It cannot be the gate.

The pattern I've been running with: LLM proposes, deterministic rules enforce, human authorizes transitions on the rules themselves. Three separate authorities, three different update rates. The enforcement layer is dumb on purpose — it reads a locked decision, fires a hard veto, and surfaces both sides to the operator. No reasoning at the gate. Same way you'd write an admission controller in K8s — model doesn't get to vote.

What I keep finding non-trivial: where the LLM-proposes step lives. If the model is also the one writing the rules it'll later enforce against, you've recreated the original problem one layer up. Curious how Faramesh handles the proposal-vs-rule boundary — does the model ever propose new rules, or is rule authoring strictly out of model hands?

Brian Hall

Three authorities with different update rates is good. To your question, Faramesh is only the enforcement side, it doesn't author rules at all. The policy is a file in your repo a human writes, and the only way it ever changes is someone running faramesh apply. The daemon doesn't even re-read the file on its own, the whole reason being that if it could hot-reload, anyone with file-write would have policy authority. So the model never proposes or edits the rules it runs against, which like you said is exactly the layer where you'd otherwise recreate the problem.

Mike Czerwinski

Clean separation. The hot-reload-equals-file-write-authority insight is the part most policy systems get wrong by treating rule loading as a deployment concern instead of a governance one. Keeping apply explicit closes the proposal-vs-rule recursion at the daemon's level.

Where it picks up again is one layer further out: who's authorized to run faramesh apply? The same architectural pattern repeats: explicit human action, audit trail, ideally tied to the same review discipline as code merges. ANP2 framed it sharply in another thread: a bound you can route around is not a bound. Sibling grants (git push, shell access, container exec) all reach the same rule file if nothing closes them. The architecture you've described is sound, but the perimeter that protects the apply command is the next thing to make explicit. Curious if Faramesh ships any opinions on that side, or treats it as a caller-system concern.

Brian Hall

Yeah, that’s where the boundary sits, and it’s a documented limit. Faramesh draws the trust line at the daemon and treats the host as the privileged-access boundary. If someone has shell access and can run apply, they can change policy, so access to that should come from your host controls, same as who can merge code or deploy config.

What Faramesh does is make the change non-silent. The compiled policy gets signed into .faramesh/, tampered state fails verification on reload, and with an external KMS the audit chain can’t be forged even with root. So a sketchy apply leaves evidence. What Faramesh doesn’t try to do is reimplement RBAC for your shell, that’s intentionally left to the caller’s environment.

Mike Czerwinski

Signed compiled policy + external KMS for the audit chain is the right anchor: it makes "who has root" the same threat model as "who has code-signing key," and the industry has tooling for the latter. The non-silent-change discipline turns the host-boundary punt from punt-and-hope into punt-and-verify. Clean architectural decision.

Kartik N V J K

"The same weakness wearing a different hat" captures exactly why a second model in front of the first doesn't add a real security boundary, since the same crafted input bends both. I'd still keep an LLM judge for fuzzy quality scoring, just never on the authorization path where the decision has to be deterministic. Where do you draw the line for things like a refund cap that is technically a number but the trigger is a natural-language request?