DEV Community

Shaher Shamroukh

Giving an AI Agent Write Access to Your App: Guardrails We Built for RobinReach's MCP Tools

A few months ago I wrote about building a production MCP server in Rails, the plumbing of exposing RobinReach's API as a set of MCP tools that Claude and other agents can call.

That post was about connecting an AI agent to your app. This one is about the harder problem: what happens once it's connected and can actually do things, like publish to a client's Instagram, reply to a comment on their behalf, or schedule a week of content. The moment an agent has write access, "it works in the demo" stops being good enough.

The single question every user (and every one of our customers' customers) eventually asks is some version of: "can this thing accidentally touch something it shouldn't?" Specifically, on a platform that manages multiple brands for multiple clients, can the AI agent working on Brand A ever see or post to Brand B?

The answer is no, and the reason why is the part I want to focus on, because it's a different kind of guardrail than the usual "we told the AI not to" approach.

Brand isolation is not a rule the agent follows. It's a wall the agent can't see over.

The easy way to build this would be: give the agent one set of credentials for the whole account, list every brand the user has access to, and then add an instruction like "always check which brand you're working on and never act on the wrong one."

That approach technically works, right up until it doesn't. LLMs make mistakes. They mix up context across a long conversation, they reuse an ID from three messages ago, they occasionally just hallucinate. If the only thing standing between "agent posts to Brand A" and "agent posts to Brand B" is a sentence in a prompt telling it to be careful, that's not a guardrail. That's a hope.

So we built it differently. The connector that the agent talks to is scoped at the API/auth layer, not the prompt layer. When the integration is set up, the credentials issued for that connection are tied to a specific account and a specific set of brands the user actually has access to. Every tool call the agent makes gets validated against that scope on the server, before it ever touches a database row.

What this means in practice:

  • The agent cannot list, read, or write to a brand that isn't in its scope. Not "is instructed not to," cannot. There is no code path where a request for a different brand's data returns anything other than an authorization error.
  • If an account only has one brand, brand_id isn't even something the agent needs to think about; the scope handles it silently.
  • If an account has multiple brands, the agent has to explicitly select and pass a brand_id on every write action, and that brand_id is checked against the connector's allowed scope on every single call, not just the first one.
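
The three rules above can be sketched as a single server-side check. This is a minimal illustration in plain Ruby, not RobinReach's actual code: ConnectorScope, ScopeError, resolve_brand!, and handle_tool_call are hypothetical names, and a real Rails app would load the scope from the connection's credentials rather than build it by hand.

```ruby
require "set"

# Hypothetical error raised when a request falls outside the connector's scope.
class ScopeError < StandardError; end

# Hypothetical value object: the account and brands a connector's credentials are tied to.
class ConnectorScope
  attr_reader :account_id, :brand_ids

  def initialize(account_id:, brand_ids:)
    @account_id = account_id
    @brand_ids = Set.new(brand_ids)
  end

  # Resolves the brand for a tool call, or raises before any data is touched.
  def resolve_brand!(requested_brand_id = nil)
    if requested_brand_id.nil?
      # Single-brand accounts: the scope supplies the brand silently.
      return brand_ids.first if brand_ids.size == 1

      raise ScopeError, "brand_id is required for multi-brand accounts"
    end

    unless brand_ids.include?(requested_brand_id)
      raise ScopeError, "brand #{requested_brand_id} is not in this connector's scope"
    end

    requested_brand_id
  end
end

# Every tool call goes through the same check, not just the first one.
def handle_tool_call(scope, params)
  brand_id = scope.resolve_brand!(params[:brand_id])
  { brand_id: brand_id, status: "authorized" }
end
```

The point of the shape is that the brand is resolved from the scope, never trusted from the request: an ID the agent carried over from earlier in the conversation fails the same way a made-up one does.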

The reason this matters so much is that it moves the guarantee from "the AI is well behaved" to "the infrastructure makes the bad outcome impossible." Those are very different sentences to say to a customer. One is a promise about behavior. The other is a statement about architecture. If you're building anything where an AI agent has access to multiple tenants, customers, or brands, this is the line I'd draw first, before writing a single line of prompt instructions.

The second layer: validating what the agent is about to do, not just where

Scoping solves "is this the right brand." It doesn't solve "is this the right content." That's where validation comes in, and we treat it as a hard, separate step.

validate_post is a required call that happens before create_post, and the agent is instructed it must never skip it. We deliberately did not fold validation into the create step itself, even though that would be simpler. Splitting it forces a "draft, check, fix" loop instead of "fire and see what happens."

What gets checked:

  • Character limits per platform (280 for Twitter, 500 for Threads, 2200 for Instagram, and so on), counting hashtags, emojis, line breaks, and unshortened URLs, because LLMs are notoriously bad at counting characters themselves.
  • Required fields per platform, like a title for Pinterest or media for a Reel.
  • Whether the content makes sense for the platforms it's being sent to at all.

If something fails, the agent gets a structured response back describing exactly what's wrong, and it can correct the content before anything goes near a real social account. In practice this catches the most common failure mode by a wide margin: an agent writing one great LinkedIn post and then naively reusing the same text as a tweet, which is both too long and the wrong tone.
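
A rough sketch of what such a structured response could look like, in plain Ruby. The character limits are the ones quoted above; the REQUIRED_FIELDS key names, the error codes, and the result shape are assumptions made for illustration, and String#length is a simplification of how real platforms count URLs and emojis.

```ruby
# Limits quoted in the post; other platforms are omitted here.
CHARACTER_LIMITS = { "twitter" => 280, "threads" => 500, "instagram" => 2200 }.freeze

# "A title for Pinterest" comes from the post; the :title key name is an assumption.
REQUIRED_FIELDS = { "pinterest" => [:title] }.freeze

# Returns a structured result the agent can act on, instead of raising.
def validate_post(text:, platforms:, fields: {})
  errors = []

  platforms.each do |platform|
    limit = CHARACTER_LIMITS[platform]
    # String#length counts code points; real platforms apply their own rules.
    if limit && text.length > limit
      errors << { platform: platform, code: "too_long", limit: limit, actual: text.length }
    end

    REQUIRED_FIELDS.fetch(platform, []).each do |field|
      value = fields[field]
      if value.nil? || value.to_s.strip.empty?
        errors << { platform: platform, code: "missing_field", field: field }
      end
    end
  end

  { valid: errors.empty?, errors: errors }
end

# The LinkedIn-text-reused-as-a-tweet case: 300 characters fails Twitter, passes Threads.
result = validate_post(text: "a" * 300, platforms: ["twitter", "threads"])
```

Because each error names the platform and the limit it broke, the agent can shorten only the Twitter variant rather than rewriting everything.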

The other guardrails, briefly

A few more things worth a sentence each, because together they form the full picture:

Draft by default. Anything the agent generates proactively lands as a draft, not scheduled or published. Scheduling or publishing only happens when the user actually asks for that outcome. This gives the agent a safe "here's what I made, take a look" state instead of a binary publish or don't.
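
One way to read "draft by default" is that the server, not the agent, picks the status when none is given. A small hypothetical sketch in Ruby; the status names and the resolve_status helper are assumptions, not the actual tool schema.

```ruby
# Assumed status names; the post only distinguishes draft, scheduled, and published.
ALLOWED_STATUSES = %w[draft scheduled published].freeze

# requested_status is only passed when the user explicitly asked to schedule or publish.
def resolve_status(requested_status = nil)
  return "draft" if requested_status.nil?

  unless ALLOWED_STATUSES.include?(requested_status)
    raise ArgumentError, "unknown status: #{requested_status}"
  end

  requested_status
end
```

With this shape, an agent that forgets to specify a status produces a draft, which is the cheap mistake rather than the expensive one.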

Audience aware scheduling. Before scheduling anything, the agent pulls the audience's actual best performing times for that brand and platform, rather than picking a "reasonable sounding" time itself. Left alone, an LLM tends to pick suspiciously round numbers like 9am or noon, because those are common in training data, not because that's when this brand's followers are online.

Voice learned from feedback. Whenever a user edits or rejects generated content, that correction is saved and applied automatically next time. The agent is told to apply it silently, so the output just sounds like the brand without the user re-explaining preferences every session.

Comments are surfaced, not auto handled. The agent can read and reply to comments, but it always shows the user who said what, on which post, and flags anything that looks like a complaint before drafting a response. Replying as the brand to a real customer is high stakes enough that a human stays in the loop.

No raw API leaks into the conversation. Tool names, JSON, internal IDs: none of that reaches the user. Everything is translated into plain language, like "your Facebook page Acme Co has 3 new comments" instead of a payload with internal identifiers. This sounds cosmetic but it's actually a trust guardrail. The moment a non-technical user sees a raw error or an ID, the illusion that they're talking to "their social media manager" breaks, and they become more cautious about giving the tool any access at all.

The pattern, if you're building something similar

Across all of this, the theme is the same. Don't try to make the agent smarter. Make the wrong action structurally harder to take than the right one, and put the hardest boundaries where the cost of a mistake is highest.

For us, that meant brand and tenant isolation enforced at the auth layer, where the agent has no technical ability to even ask for the wrong thing, and content validation enforced as a separate required step, where mistakes get caught before they go live. Everything else, voice, scheduling, comment handling, is built on top of those two foundations.

MCP makes it trivially easy to hand an LLM the keys to your app. The interesting engineering work is making sure some of those keys don't open every door.

Top comments (2)

Max Quimby

"A wall the agent can't see over" is the exact mental model I wish every MCP tutorial led with. Scoping at the auth layer instead of the prompt layer is the difference between a guarantee and a hope — the LLM will eventually reuse an ID from three messages ago, and the only reliable defense is that the request physically cannot resolve to another tenant's data.

The complementary failure I'd add for write tools specifically: even with perfect scoping, the agent can do the correctly-scoped action twice. A flaky network, an ambiguous tool result, a retry — and suddenly the same post is published, or the same reply sent, to Brand A twice. Server-side scope handles cross-brand; it doesn't handle accidental repeat. We started attaching idempotency keys to every write tool call so a replay collapses to a no-op instead of a duplicate.

How are you handling that on the publish/reply tools — a dedupe/idempotency layer, or leaning on the agent not to retry? "Never retry on an ambiguous result" is deceptively hard to enforce from the model side.

Shaher Shamroukh

Great catch, and honestly the gap I should have included in the post.

We do have idempotency keys on publish. The key is derived from content hash + brand + target platform, so a retry within the same window just collapses to a no-op.
For replies we scope the key to the comment ID too, because the same reply text to two different comments is a legitimate action and you don't want to accidentally swallow it.

But to your actual question: no, we're not leaning on the agent not to retry.
I agree with you that it's nearly impossible to enforce from the model side. It has no reliable ground truth about whether the action actually landed, so "be careful" just trades duplicate posts for silent drops. Neither is great.

The infrastructure layer has to own this one. The agent being cautious is a bonus, not a guarantee.
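
The key scheme described in this reply could be sketched roughly as follows. This is an illustration under assumptions, not the production code: idempotency_key and WriteDeduper are hypothetical names, SHA-256 is an arbitrary choice of hash, and the in-memory hash stands in for a shared, persistent store with atomic writes.

```ruby
require "digest/sha2"

# Key from content hash + brand + target platform, as the reply describes.
# comment_id is added for replies so the same text sent to two comments stays distinct.
def idempotency_key(content:, brand_id:, platform:, comment_id: nil)
  parts = [Digest::SHA256.hexdigest(content), brand_id, platform, comment_id]
  Digest::SHA256.hexdigest(parts.compact.join(":"))
end

# Hypothetical single-process store; not thread-safe and not persistent.
class WriteDeduper
  def initialize(window_seconds:)
    @window_seconds = window_seconds
    @seen = {}
  end

  # Yields and returns :performed the first time a key is seen inside the window;
  # returns :duplicate without yielding on a replay.
  def perform_once(key, now: Time.now)
    last = @seen[key]
    return :duplicate if last && (now - last) < @window_seconds

    yield
    @seen[key] = now
    :performed
  end
end
```

A retried publish with the same content, brand, and platform produces the same key and is skipped, while an identical reply to a different comment produces a different key and goes through.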