MCP's real attack surface isn't prompt injection — it's the trust boundary (21 patterns, 5 languages)

#mcp #python #security #ai

We keep talking about prompt injection like it's the endgame. It isn't. Prompt injection is step one. The actual damage happens one step later, at a place most MCP threat models barely mention: the trust boundary — the moment an injected instruction turns into a real tool call that runs with your machine's privileges.

I build and run a handful of MCP servers locally, and the more server source I read, the more convinced I become that we're auditing the wrong layer. So I built a small MCP server whose only job is to audit other MCP servers. This is the write-up of what it looks for and why. The threat model is the reusable part, whether or not you ever run my code.

The chain that should keep you up at night

An MCP server is, by design, code that an LLM is allowed to invoke. That's the whole point. Which means the dangerous chain is:

untrusted input → model is convinced → tool call fires → server-side code executes on your box

The injection was never the vulnerability. The injection is just the delivery mechanism. The vulnerability is what the tool does once it's called. If the server-side handler does anything unsafe with the arguments it receives, the model has just become a remote-code-execution courier — and it did it while behaving exactly as designed.

This reframes the problem in a useful way: most MCP security is just appsec, newly reachable through a model. The bugs that matter are 20-year-old classics. What changed is who can reach them. Pre-MCP, a lot of this code was only callable by a trusted caller. Now a sentence in a webpage your agent reads can reach it.
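
To make the chain concrete, here is a minimal sketch of the same tool handler written two ways. The ping use case, the function names, and the hostname pattern are all hypothetical, chosen only for illustration; the point is the difference between handing a string to a shell and handing a validated argument vector to the OS.

```python
import re
import subprocess

# Hypothetical allow-pattern for a hostname argument (illustrative, not exhaustive).
HOSTNAME_RE = re.compile(r"[A-Za-z0-9.-]{1,253}")


def ping_unsafe_command(host: str) -> str:
    # The string a shell=True handler would pass to the shell.
    # Anything after ';' in `host` becomes a second command.
    return f"ping -c 1 {host}"


def ping_safe_argv(host: str) -> list[str]:
    # Validate first, then build an argument vector that no shell ever parses.
    # The leading '-' check stops the value being read as an option.
    if not HOSTNAME_RE.fullmatch(host) or host.startswith("-"):
        raise ValueError(f"rejected host argument: {host!r}")
    return ["ping", "-c", "1", host]


def ping(host: str) -> int:
    # Passing a list keeps shell=False (the default), so `host` stays one argument.
    return subprocess.run(ping_safe_argv(host), check=False, timeout=10).returncode
```

With an argument like `example.com; rm -rf ~`, the unsafe version produces a two-command shell string, while the safe version raises `ValueError` before anything runs.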

The patterns that keep showing up

Reading through MCP server source, the same handful of mistakes appear over and over. These are the OWASP greatest hits, scored by how badly they bite when they sit one tool-call away:

Pattern                                              Severity  Impact
eval() / exec() / os.system() on tool arguments      CRITICAL  direct code injection
subprocess(..., shell=True), Runtime.exec(concat)    HIGH      command injection
pickle.load / torch.load / ObjectInputStream         HIGH      deserialization RCE
yaml.load() without SafeLoader                       HIGH      object instantiation
f-string / string-concat SQL                         HIGH      SQL injection
URL built by concatenation (fetch tools)             MEDIUM    SSRF / network pivot
hardcoded API keys / tokens in server source         MEDIUM    credential leak

None of these are exotic. That's the point. The scanner I built detects 21 vulnerability patterns across Python, Java, Go, C++, and Rust — and almost every one predates LLMs entirely. The novelty isn't the bug class; it's the reachability.
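
As one example of how old these bugs are, here is a small sketch of the SQL row from the list, using the standard-library `sqlite3` module. The `notes` table and both function names are made up for the demonstration; they stand in for whatever a tool handler queries.

```python
import sqlite3

# In-memory database with a hypothetical `notes` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (owner TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [("alice", "alice's note"), ("bob", "bob's note")],
)


def notes_unsafe(owner: str) -> list:
    # f-string SQL: the tool argument is parsed as part of the query text.
    query = f"SELECT body FROM notes WHERE owner = '{owner}'"
    return conn.execute(query).fetchall()


def notes_safe(owner: str) -> list:
    # Bound parameter: the tool argument is always treated as data.
    return conn.execute(
        "SELECT body FROM notes WHERE owner = ?", (owner,)
    ).fetchall()


# A classic payload a model could be talked into passing along.
payload = "x' OR '1'='1"
```

Calling `notes_unsafe(payload)` returns every row in the table; `notes_safe(payload)` returns none, because no owner has that literal name.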

The design decision that mattered most: purpose-aware severity

The naive version of a scanner like this is a glorified grep for eval. It floods you with false positives and you stop reading the output by day two. An eval() inside a sandboxed test harness is not the same finding as an eval() on a tool's input argument, and a scanner that scores them identically is noise.

So the core of the tool is purpose-aware scoring: it weighs a pattern by where it sits and what reaches it. A subprocess(shell=True) in a CI helper that never touches model input is a low-priority note. The same call wired to a tool argument is a CRITICAL. Getting this right is the difference between a report someone acts on and a report someone closes.
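
A rough sketch of the idea, not the scanner's actual code: walk the AST, find calls to a dangerous builtin, and only escalate when the call sits in a tool function and one of that function's parameters appears in the call's arguments. The assumption that tools are marked with a decorator named `tool`, the two-item pattern set, and the CRITICAL/LOW labels are all simplifications for illustration.

```python
import ast

DANGEROUS = {"eval", "exec"}  # tiny illustrative subset of patterns
FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def _is_tool(fn) -> bool:
    # Assumption: tools carry a decorator named `tool`,
    # e.g. @tool, @mcp.tool or @mcp.tool().
    for dec in fn.decorator_list:
        target = dec.func if isinstance(dec, ast.Call) else dec
        name = getattr(target, "attr", None) or getattr(target, "id", None)
        if name == "tool":
            return True
    return False


def score(source: str) -> list[tuple[str, int, str]]:
    """Return (pattern, line, severity) for each dangerous call found."""
    findings = []
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, FUNC_TYPES):
            continue
        params = {a.arg for a in fn.args.args}
        for node in ast.walk(fn):
            if not (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS
            ):
                continue
            # Names referenced anywhere inside the call's positional arguments.
            used = {
                n.id
                for arg in node.args
                for n in ast.walk(arg)
                if isinstance(n, ast.Name)
            }
            reachable = _is_tool(fn) and bool(used & params)
            findings.append(
                (node.func.id, node.lineno, "CRITICAL" if reachable else "LOW")
            )
    return findings


SAMPLE = '''
@mcp.tool()
def calc(expression):
    return eval(expression)

def test_harness():
    return eval("1 + 1")
'''
```

Running `score(SAMPLE)` flags both `eval` calls but scores only the one inside `calc` as CRITICAL. This is still shape matching: it ignores nested functions, reassigned variables, and keyword arguments, which is exactly where real data-flow analysis would have to take over.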

The scanner ships as an MCP server itself, with three tools:

audit_repo — point it at a GitHub URL, get a scored report
audit_code — paste a snippet, get findings inline
list_patterns — see every pattern and its severity

Because it runs locally as an MCP server in Claude Desktop, your agent can audit a repo before you wire it in — which is the right time to find out a server you're about to trust shells out on its arguments.

pip install mcp-security-audit

Two lessons that generalize beyond MCP

1. The injection is not the vuln — the handler is. Spend your defensive budget on what tool code does with arguments, not only on filtering what reaches the model. Input filtering is a sieve; a safe handler is a wall. You want the wall.

2. Severity without context is noise that trains people to ignore you. Any scanner that can't tell a sandboxed eval from a reachable one will get muted. Context-awareness isn't a nice-to-have on a security tool — it's the feature that decides whether anyone reads the second report.

The honest part: where this still falls short

Static patterns catch reachable-by-shape, not reachable-in-fact. A taint analysis that proves an argument actually flows into a dangerous sink would cut false positives further — that's the open edge of this design, and it's not built yet. And the two pattern families I'm least confident I've covered well are path traversal in file-serving tools and SSRF in fetch-style tools that are built to make outbound requests, where "is this call malicious" is genuinely ambiguous. If you've solved the fetch-tool SSRF problem cleanly, I want to hear how.
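
For the path traversal family, the safe shape is at least easy to state. The sketch below assumes a hypothetical file-serving handler that joins a tool argument onto a fixed root directory; `resolve_inside` is an illustrative helper name, not part of the scanner. A handler that does the join without a containment check of this kind is the reachable-by-shape pattern.

```python
from pathlib import Path


def resolve_inside(root: Path, requested: str) -> Path:
    # Resolve ".." segments and symlinks first, then check containment.
    base = root.resolve()
    candidate = (base / requested).resolve()
    # Path.is_relative_to requires Python 3.9+.
    if not candidate.is_relative_to(base):
        raise PermissionError(f"path escapes {base}: {requested!r}")
    return candidate
```

Note that an absolute argument such as `/etc/passwd` replaces the root entirely when joined, so it is rejected by the same check as `../` sequences.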

Where to find it

The scanner is open-source (MIT) and free:

Repo + star: https://github.com/LuciferForge/mcp-security-audit
Install: pip install mcp-security-audit

If you'd rather not run it yourself — if you're shipping an MCP server and want a scored, human-reviewed audit report with a 3-day turnaround instead of a raw scan dump — I offer that as a paid service here: MCP Security Audit Report — $29.

But I'm more interested in the threat-model discussion than either. Two open questions for the comments:

Where should the boundary live? Sandbox every tool call by default (perf + DX cost) or scan-and-trust at install time? I lean install-time scanning plus runtime allow-lists, and I'd like to be talked out of it.
What pattern am I missing? I've got 21. Tell me the 22nd.