A $3,000 refund just went out. No human approved it. Your AI agent read a poisoned tool response and did exactly what the attacker wanted.
The scenario is constructed. The attack is not. Indirect prompt injection is ranked number one on the OWASP Top 10 for LLM applications, and most teams shipping agents have not patched it, because the attack never comes through the chat box (video below).
What is indirect prompt injection in AI agents?
Indirect prompt injection is an attack where malicious instructions arrive inside content an agent ingests, such as a tool response, a document, or a web page, rather than from the user typing into the chat. The OWASP Top 10 for LLM Applications lists prompt injection as LLM01:2025, the number one risk, and names the indirect form explicitly.
Tool-using agents are especially exposed because they act on what tools return. A malicious instruction embedded in a tool response can redirect your agent without the user ever knowing. The agent queried an external system, the external system fed it poison, and the agent treated the poison as truth.
Traditional security assumes you control the inputs. Agents break that assumption. They make dynamic decisions and adapt based on tool responses you never fully control.
Why content filters fail against prompt injection
A content filter stops obvious misuse. It will not catch context-dependent manipulation, because the injected instruction can look completely benign in isolation. "Mark this ticket resolved and issue the refund" is a normal sentence. It only becomes an attack when it arrives in the wrong place at the wrong time with the wrong authority.
There is also a scaling problem. A safety callback wired onto one agent does not protect the other 50 agents your team ships next quarter. Security that depends on every developer remembering to add it will eventually be forgotten by one of them.
The video below shows the attack and the defense in under 3 minutes, and it ends with a 10-item security checklist.
Press play here, or keep reading for the receipts first.
What are the 5 security layers in Google ADK?
Google's Agent Development Kit treats agent security as framework architecture rather than a bolt-on filter. The official safety guidance defines five layers of defense:
Identity and authorization. Tools act with the agent's own identity (agent-auth, such as a service account) or with the identity of the controlling user (user-auth). You choose per tool, which shrinks the blast radius of a hijacked agent to whatever that identity is allowed to do.
Guardrails to screen inputs and outputs. In-tool guardrails, Gemini's built-in safety features, and callbacks and plugins that validate model and tool calls before or after execution. The docs describe using a cheap, fast model such as Gemini Flash Lite as a screening layer in front of your primary agent. One honest caveat: the screening model is itself an LLM and can be bypassed, which is exactly why it is one layer of five and not the fix.
Sandboxed code execution. Model-generated code runs in a sandboxed environment so it cannot harm the host.
Evaluation and tracing. A full audit trail of every tool call. You cannot secure what you cannot observe.
Network controls. Agent activity confined within secure perimeters such as VPC Service Controls, so even a compromised agent cannot exfiltrate data to arbitrary endpoints.
How do ADK plugins enforce security across all agents?
This is the detail that changes how you think about scaling AI agent security. Per the ADK plugins documentation, a plugin is registered once on the Runner, and its callbacks apply globally to every agent, tool, and LLM call that runner manages. Agent callbacks, by contrast, are configured individually on each agent instance.
For the attack in this post, the hook that matters is after_tool_callback: it sees every successful tool response before the agent acts on it, and returning a replacement result short-circuits the poisoned one.
from google.adk.plugins.base_plugin import BasePlugin
from google.adk.runners import InMemoryRunner
SUSPICIOUS = ("ignore previous", "instead you should", "new instructions", "issue the refund")
class SecurityScreeningPlugin(BasePlugin):
def __init__(self) -> None:
super().__init__(name="security_screening")
async def after_tool_callback(self, *, tool, tool_args, tool_context, result):
# cheap first pass: deny-list scan of the raw tool response;
# production code would also call a screening model here
text = str(result).lower()
if any(marker in text for marker in SUSPICIOUS):
return {"status": "blocked", "reason": "tool response failed screening"}
return None # None keeps the original result
runner = InMemoryRunner(
agent=root_agent,
app_name="my_app",
plugins=[SecurityScreeningPlugin()],
)
One plugin registration covers every agent on that runner. Ship 5 agents or 50, the screening applies to all of them. The ADK docs recommend plugins over per-agent callbacks for exactly this reason. The video shows the full three-step setup running.
There is a second load-bearing idea: tool context policies are set by your code before the agent runs and enforced outside the model. A policy that caps refunds at $100 for a user tier holds no matter what an injected instruction says, because the model never gets to rewrite it.
Security for your agents is not a filter you add at the end. It is a framework you build from the start.
AI agent security checklist for production
The video closes with a 10-item security implementation checklist. Three items from it, to show the flavor:
- Content filters are configurable and off by default. Enable them explicitly.
- Use a secrets manager for credentials in production. Never store refresh tokens in session state.
- Escape all model-generated HTML and JavaScript before it reaches a browser. Unescaped output rendered in a UI is a real injection vector.
The other seven cover identity, runner-level plugins, per-agent callbacks, tool context guardrails, sandboxing, tracing, and network controls, each with the specific setting to check. Watch from the start and score your own system against each item as it appears on screen; the checklist lands at 2:16, and the setup in the first 90 seconds is what makes it land. The whole video takes under three minutes.
Where to go next
ADK ships in Python, TypeScript, Go, Java, and Kotlin, and the security architecture is consistent across the SDKs. Full documentation and code samples are at adk.dev, with the safety guidance at adk.dev/safety. If you want to secure AI agents you already have in production, start with the checklist in the video, then work through the safety page layer by layer.
Quick question for the comments: do you screen tool responses before your agent acts on them today? Yes or no is enough. I read every reply.
I am Omotayo Aina, Google Developer Expert for AI. GDEs are not Google employees, and opinions here are my own and do not represent Google. You can find me on LinkedIn and YouTube.
Top comments (23)
The layered approach is the right way to think about prompt injection. No single guard is enough because the attack can enter through instructions, retrieved content, tool output, or user-controlled data.
For production agents, I would want each layer to fail independently: input filtering, tool permissioning, constrained actions, output validation, and audit logs. The goal is not to make injection impossible; it is to make one successful injection insufficient to cause real damage.
@alexshev One successful injection insufficient to cause real damage is the right success criterion for production agents. It also pairs well with the framing in the other thread on this post: tool output is untrusted data, never instructions.
Both lines are going into this week's LinkedIn follow-up, credited.
The nuance I would add: independence is a property of trust domains, not of layer count. Two screening layers that are both LLMs do not fail independently, because the payload that fools your agent has a real chance of fooling the judge model too. The independence that actually holds comes from mixing failure modes: probabilistic screening at the model layer (Gemini as judge), deterministic policy at the code layer (tool context policies the model cannot rewrite at
runtime), and infrastructure denial below both (IAM scopes, VPC Service Controls). A poisoned tool response would need three different kinds of luck at once.
Your five layers map almost one-to-one onto ADK's stack:
My observation is the audit trail: nobody builds it until after their first incident, and then it is the first thing they wish they had. Is that your experience, or do you see a different layer go missing in the systems you review?
That matches what I see too. The audit trail is usually treated as an observability afterthought, but in agent systems it is part of the security boundary.
The layer I see missing most often is the decision ledger between policy and action: not just "the tool was called," but why this tool, under which authority, what was refused, what confirmation was required, and what evidence made the action permissible.
Without that, teams can have input filters, tool permissions, and output checks, but still be unable to reconstruct the moment where the agent crossed from suggestion into action. That is where incidents become very hard to debug.
@alexshev, Execution traces tell you what happened. The ledger you are describing tells you what was decided. Those are different artifacts, and the gap is real: ADK's tracing is OpenTelemetry-based, with spans for agent invocation, model calls, and tool execution. That reconstructs the what. The why this tool, under which authority, what was refused: the framework does not write that record for you, and as far as I know no agent framework does today.
What ADK does give you is the choke points to build it. Plugin callbacks see each tool call before and after execution. A blocking plugin knows what it refused and why. Tool confirmation, experimental today, produces an explicit approval response your plugin can persist, payload included. And the identity each tool runs under is configured per tool in code, not inferred. One runner-level plugin writing a structured record at those points is the ledger, for every agent on the runner. Same inheritance argument as before.
One honest exception, because it is the hardest field on your list: what evidence made the action permissible. Hooks capture refusals, confirmations, and identity. They cannot causally surface the model's reasoning. You can log the model's stated rationale, but that is generated narrative, not evidence, and a hijacked agent will narrate a perfectly plausible justification. That field may be the real frontier.
I did not want to leave this as talk, so I filed it: github.com/google/adk-python/issue...
I scoped it against the closed audit-trail issues (5202, 5164), the open provenance exporter discussion (5090), and the BigQuery Agent Analytics plugin. None of them carry the why. The provenance discussion comes closest, covering authority and policy outcomes, but not the selection rationale or the rule behind a refusal, so it stands as its own request. Your decision ledger framing is credited in the problem statement. If this gap affects your production systems, a thumbs-up reaction on the issue helps it through triage.
The next post in this series will be built around your framing, with your name on it. If you would rather co-write it or review the draft, say the word.
I have not seen a standard schema emerge for ledger entries, and your answer here feeds directly into the issue. Have you settled on one, or is it still per-system for you?
This is a very useful framing. I agree that the hard field is not “what happened” but “what made this action permissible.” I would not trust model rationale as the source of truth there either.
The schema I keep coming back to is something like: intent, authority source, allowed scope, evidence used, policy check result, tool/action requested, refusal/approval reason, human handoff if any, and post-action outcome. The model can propose intent, but the ledger entry should be assembled from runner/tool/policy events wherever possible.
So yes, still per-system today. But the shape should be standardized enough that a reviewer can compare decisions across agents without reading the whole trace.
The schema looks right but it gets harder once you add multi-agent delegation. Agent A delegates to Agent B which calls a tool — your
after_tool_callbackscreens the raw tool output at the leaf, but then Agent B interprets that output before passing a summary back to Agent A. That interpretation step can launder poisoned data through legitimate reasoning, and the ledger won't catch it unless you're tracking whether each evidence item is primary (direct from tool) or derived (another agent's conclusion). I don't think any framework handles that derivation chain today, which is probably the real gap worth standardizing before the schema solidifies.Yes, the derived-evidence problem is the hard part. Once Agent B summarizes tool output, Agent A sees a clean statement, not the original trust boundary. I think provenance needs to attach to claims, not only tool calls: this sentence came from a tool result, that one from model interpretation, this one from another agent's summary.
@circuit , runner-level plugins in ADK propagate to sub-agents by default.
AgentToolhas aninclude_pluginsparameter that defaults toTrue, so plugins carry into sub-agent execution unless explicitly isolated. That meansafter_tool_callbackfires when Agent B calls a tool, and the plugin screens the raw output before Agent B's model sees it. That is the last point where evidence is primary.When using
AgentTool, the delegation itself is a tool call, soafter_tool_callbackfires at the delegation boundary too. ButAgentToolruns the sub-agent in a separate runner with its own session, and only Agent B's final text comes back. The internal event history (tool calls and raw outputs) stays in the sub-agent's session. Agent B's model has already interpreted those results. The callback at the boundary observes Agent B's narrative, not the raw evidence.Session events carry an
authorfield identifying the emitting agent, sobefore_model_callbackon Agent A can see which agent spoke. But there is no provenance metadata distinguishing primary evidence from derived reasoning. Agent A sees that Agent B responded, not whether Agent B is relaying a tool result or narrating its own interpretation.This gap is not ADK-specific. Bedrock, Claude SDK, and LangGraph do not track primary vs derived evidence across agent boundaries either (documented in #6099).
@alexshev, your claim-level provenance framing is what the single-agent schema misses once you add delegation. Each entry would need an
evidence_typefield (primaryorderived) and asource_entry_reflinking back to the primary entry. The framework tags these, not the model: the plugin can distinguish regular tools fromAgentToolat the callback boundary, so it marks regular-tool results as primary andAgentToolresults as derived. A reviewer walks the references instead of trusting the summary.I will add the multi-agent derivation chain as an open question on #6099.
One question for both: should the ledger attempt to tag individual claims within a model response with their provenance source, or is it enough to tag the entry as primary/derived and let the reviewer pull raw evidence from the linked primary entry? Claim-level tagging is more precise but depends on the model reliably self-reporting, which brings us back to the generated narrative problem.
That default propagation is exactly the part people miss. I would still log the boundary explicitly though: parent agent -> AgentTool call -> sub-agent tool call -> plugin decision. Otherwise the security layer exists, but later you cannot tell whether a bad output passed because the plugin trusted the raw tool result, the sub-agent summary, or the parent agent's interpretation.
This is the exact distinction I was trying to name: trace as reconstruction vs ledger as accountability. And I agree the hardest field is not the model's stated rationale, it is the evidence boundary behind the action. A generated explanation can sound clean after the fact; the durable record has to capture authority, source, refusal, and approval points before the agent can narrate over them.
most of this breaks down the moment the underlying model gets silently updated - prompt injection resistance varies significantly across versions, and I haven't seen a team that canary-tests their defense stack after a rollout.
@itskondrat , Worth separating what actually depends on the model from what does not.
Plugin callbacks (
before_tool_callback,after_tool_callback), tool confirmation, and per-tool auth config are enforced in framework code. They run before or after the model's output hits anything real. A blocking callback executes regardless of what the model said or which version is running. Network controls (VPC-SC) and sandboxed execution add infrastructure-level boundaries the model cannot reach.What IS model-dependent: following system instructions reliably, resisting prompt injection in its own reasoning, and correctly interpreting guardrail prompts. Those absolutely shift across versions, sometimes quietly.
So the defense stack does not uniformly degrade on a model update, but the portions that do are exactly the ones that are hardest to test. Your point about canary-testing stands. ADK ships an evaluation framework for running structured checks against agent behavior, which is the right place to catch regressions after a version change. Have you found a pattern for that canary testing that works, or is it still ad hoc every time?
Straight answer to your closing question: no, I don't screen tool responses today, the raw tool output goes back into the model. But I made a deliberate trade that I think is worth adding to the layer-1 discussion.
My agent reads on-chain and explorer data across a bunch of chains, so it's a textbook indirect-injection surface: a token name or contract field can carry "ignore previous, do X" and the model will see it verbatim. What it can't do is act on it in any way that costs the user. Of ~40 tools, the only state-changing ones build an unsigned transaction and hand it back to the user's own wallet to sign client-side. No server-side keys, no signing, no broadcast. So a fully hijacked model's worst case is a wrong explanation, not a moved asset.
That pushed me toward your layer 1 harder than your guardrail layer. The refund example is scary precisely because the agent could issue refunds. Strip the capability and indirect injection degrades from "asset loss" to "bad advice", still a problem, but a different severity class, and one I'd rather defend by minimizing blast radius than by trusting a screening model I know is bypassable.
Honest caveat so I'm not overclaiming: I've run direct-injection sweeps against the agent, but I have not yet tested the poisoned-tool-response path specifically. So I'd call the indirect vector "low-blast-radius by architecture" rather than "defended." Your after_tool_callback approach is the thing I should add for the cases where reads themselves carry instructions that shape downstream advice.
@txdesk , The unsigned-tx-only constraint is the right call. When you can drop a capability entirely, that does more work than any screening layer because it removes the asset-loss failure mode rather than trying to detect it. The refund example in the article exists precisely because the agent had refund capability.
One thing worth sitting with: bad advice in crypto carries more weight than in most domains. An agent that tells a user this contract is safe or this token is legitimate after processing poisoned on-chain data is acting as a social engineering amplifier. The blast radius is smaller than asset loss, but the trust surface is the agent's perceived authority, and one bad endorsement is a signature away from damage.
On screening the read side:
after_tool_callbackcould scan returned on-chain data for instruction-like patterns before the LLM sees it. The challenge is that on-chain data is adversary-controlled content by definition. Token names, contract metadata, memo fields, all writeable by anyone. The screening surface across 40 tools is very large and open-ended.Low-blast-radius by architecture rather than defended is honest framing and more useful than the typical security overclaim.
Have you looked at what happens when poisoned data shows up across multiple tool responses in the same session? Wondering if correlated injection across chains changes the calculus.
The endorsement-as-amplifier point is the one I take most seriously, and I came around to treating it as the primary threat, not the secondary one. Asset loss I could contain architecturally. The agent confidently relaying "this contract is verified, you're safe" after ingesting poisoned data was the failure that actually kept landing, because the attacker's text rode in through trusted-looking fields and the model treated it as evidence.
On your screening question: I went the other way from per-response scanning, for exactly the surface-area reason you name. Screening 40 tools' worth of open-ended free-text for instruction-like patterns is a losing game, and it still leaves the model free to believe the text even if nothing looks like an injection. So the rule I landed on is upstream of screening: tool-returned free-text (names, symbols, memos, labels, spender tags) is treated as an unverified assertion by an untrusted party, full stop, and can never count as safety or verification evidence. Safety calls stand only on structured signals. That covers every field and every tool at once, including fields added later, without trying to detect the attack.
On correlated injection across responses: yes, that's the case that worried me most, because it doesn't rely on any single field looking malicious. A name here, a memo there, a label on a third tool, each individually plausible, that together nudge the model toward an endorsement. A per-response scanner sees nothing wrong with any one of them. The only thing that held was refusing to let any of that free-text class carry evidentiary weight in the first place, so it doesn't matter how many tools it arrives across. The thing I still don't have a clean answer for is the inverse: an attacker who uses correlated structured signals (a freshly verified contract, a plausible age) to look legitimate. That's not injection, it's just patience, and no screening layer catches it.
The line that deserves to be in bold is that the attack never comes through the chat box — most teams are still threat-modeling the user input and leaving tool responses completely trusted, which is exactly backwards once the agent acts on what tools return. The framing I've found holds up best in practice: tool output is untrusted data, never instructions, and the dangerous side of every tool needs a deterministic gate the model can't talk its way through. An agent that structurally cannot issue a refund above $X without out-of-band approval can't be injected into issuing one, no matter how clever the poisoned payload — that's a property of the wiring, not the prompt.
The identity layer is the part I'd push hardest on, because your scaling point is the real killer: per-agent callbacks rot the instant the team ships agent #51. Does ADK let you enforce the authority boundary at the framework/service level so a new agent inherits the constraint by default, rather than each team re-deriving it? That's the difference between security that scales and security that's one forgotten decorator away from a $3,000 refund.
@max_quimby, You said it better than the post did. Untrusted data, never instructions is the exact mental model. Mind if I quote that line, with credit, in this week's LinkedIn follow-up? I will tag you if you are on there.
On your question: yes, and the inheritance unit is the runner.
A plugin registered on the Runner applies to every agent, tool, and LLM call that runner manages, including agent #51 added next quarter plugins docs. Nobody remembers a decorator; if the agent runs on that runner, it is screened. The honest limit: the boundary is per runner, not per organization. A team that spins up its own runner without the plugin has re-derived the problem. So the convention worth enforcing in code review: plugins live in the runner factory, and agents ship on shared runners.
Below the framework sits identity safety docs. Each tool authenticates with the agent's own identity, such as a service account (agent-auth), or the controlling user's identity (user-auth). IAM is deliberately coarse: it decides whether that identity can call the payout API at all, and the model cannot talk its way past a denial. Your above-$X cap is the next layer down, a tool context policy in your code, set before the agent runs and not rewritable by the model at runtime.
Your out-of-band approval point has a direct ADK answer too: tool confirmation. Wrap the tool with require_confirmation, or pass a function so confirmation only triggers above a threshold, and execution pauses for a human yes or no before the tool runs confirmation docs. It is marked experimental today, which is worth knowing before you bet production on it.
And one thing said plainly, because I think you are testing for it: no framework, ADK included, structurally enforces "data, never instructions" at the model layer. Tool output still enters the context as tokens sitting next to instructions. Plugins screen it, IAM and tool policies cap the damage, confirmation gates the irreversible actions, but the confusion itself is unsolved. That is exactly why the layers exist.
Curious how you implement the out-of-band approval in practice: a human in the loop, or a second service that holds the credential?
Interesting take on how AI agents can be misled through poisoned inputs. As someone working on secure GPU execution environments, I’ve seen how subtle memory or execution leaks can lead to similar unintended behaviors. It’s a good reminder that defense-in-depth—especially with hardware-enforced isolation like what VoltageGPU supports—is crucial when running untrusted code or models.
Solid breakdown. The layer I'd add from production: the gap between 'we have a prompt-injection guardrail' and 'the guardrail is calibrated' is where most of the pain lives. A scanner blocking at a low threshold catches injections and also false-positives legit inputs that look adversarial; a high threshold does the reverse. We only got it usable after logging every block and reviewing the false-positive rate weekly, same discipline as any classifier. Defense in depth is right, but each layer is a classifier with a precision-recall tradeoff you have to measure, not a binary you turn on.
@james_oconnor_dev , The false-positive rate is exactly what gets glossed over. The deny-list in the article's plugin is a toy example on purpose, but even a production version with a screening model has the trade-off you are describing: the screening model is itself a classifier, and a threshold tuned to miss zero injections will block legitimate tool responses that happen to contain instruction-like patterns.
The weekly review loop you mention has building blocks in ADK, not a ready-made pipeline. The evaluation framework runs structured checks against agent behavior, and a plugin can log every block decision. But these are separate systems: turning plugin block logs into evaluation test cases requires custom glue. The pieces exist; the wiring is yours.
Honest gap: ADK does not ship that feedback loop out of the box. The each layer is a classifier framing belongs in the article and I wish I had written it that way.
What does your review cadence look like in practice? Manual sampling, or automated thresholds on the block rate?
The layered defense is the right shape. Prompt injection is not only a prompt problem; it is a permission, tool-scope, and data-flow problem. The strongest layer is usually reducing what the agent can do before the attack happens.
@tecnomanu , The strongest version of that principle showed up in another thread on this post: a crypto agent that only builds unsigned transactions, never broadcasts them. That drops the direct asset-loss failure mode instead of trying to detect it. Your framing names it as a general rule.
The tension is that some agents need broad tool access to be useful. A customer-service agent that cannot issue refunds is safer but unable to resolve the cases that matter most. The trade-off becomes: how small can you make the capability set before the agent stops solving the problem it exists to solve?
Where have you drawn that line in practice?
Excellent article on AI agent security! The layered defense approach is exactly what teams need - it's clear that indirect prompt injection through tool responses is a real threat that often goes overlooked. I particularly appreciated the practical security checklist and the code examples showing how to implement the screening plugin. The point about scaling single agent security to 50+ agents is crucial that many teams miss. This should be required reading for anyone building AI agents.