May 2, 2026Ai security8 min read

Prompt Injection: Why Runtime Controls Beat Policy Checklists

Google's field data, two convergent research papers, and why your AI governance committee can't stop prompt injection

Ai securityPrompt injectionAgentic aiRuntime controlsCisoAi governance

Prompt Injection: Why Runtime Controls Beat Policy Checklists

Arnaud Wiehe — May 1, 2026

Google's Threat Intelligence team dropped a data point two weeks ago that every CISO should have on their desk. Scanning 2–3 billion web pages per month through Common Crawl, they detected a 32% relative increase in malicious prompt injection attempts between November 2025 and February 2026 [Source: Google Security Blog, "AI threats in the wild: The current state of prompt injections on the web," April 23, 2026, by Thomas Brunner, Yu-Han Liu, Moni Pande].

This is the first real-world measurement of prompt injection at scale. Not a simulation. Not a red-team exercise. Actual attacks found on actual websites, categorized by intent — from harmless pranks and SEO manipulation to data exfiltration and destruction attempts. And while Google notes that current attacks remain "low in sophistication," they explicitly warn this won't last: more capable models make better targets, and attackers are automating their operations with agentic AI, bringing down the cost of attack.

Meanwhile, most organizations are responding to this threat with governance theater — policy documents, risk registers, and quarterly AI ethics committee meetings. This article argues that approach is structurally wrong, and that two research papers published in the same 72-hour window last week show us the right answer.

The Data: What Google Actually Found

Google's methodology was methodical and reproducible. They used Common Crawl, a repository of monthly snapshots of 2–3 billion English-language web pages, and applied a three-stage filtering pipeline: pattern matching for known injection signatures (e.g., "ignore previous instructions," "if you are an AI"), Gemini-based classification to distinguish malicious intent from legitimate discussion, and human validation for high-confidence findings [Source: Google Security Blog, April 23, 2026].

They categorized observed prompt injections into six buckets:

Harmless pranks — invisible HTML comments instructing AI assistants to change their conversational tone
Helpful guidance — website authors trying to shape AI summaries to provide better context (benign, but structurally identical to malicious injections)
Search engine optimization (SEO) — attempts to manipulate AI assistants into promoting specific businesses, increasingly generated by automated SEO suites
Deterring AI agents — instructions to prevent crawling or lure AI readers into infinite-text traps
Malicious: data exfiltration — attempts at data theft, though "sophistication seemed much lower" than published research
Malicious: destruction — attempts to vandalize machines, including commands to "delete all files on the user's machine"

The critical finding: while today's attacks are unsophisticated, the trend line is unmistakable. A 32% relative increase in malicious detections over a single quarter signals growing interest from attackers. And as Google's researchers note, "threat actors tend to engage based on cost/benefit considerations" — both sides of the equation are shifting in the attackers' favor.

Why Policy Checklists Are Structurally Wrong for Runtime Attacks

The security industry learned this lesson decades ago: controls must match the attack surface they're meant to defend. You don't protect against SQL injection with a data classification policy. You don't prevent buffer overflows with an acceptable use guideline. And you cannot stop prompt injection with AI governance documentation.

Yet this is precisely what most organizations are attempting.

The typical enterprise AI security program in 2026 consists of: an AI acceptable use policy, a risk assessment framework, model cards for transparency, output review procedures, and quarterly governance reviews. These are all pre-deployment or post-deployment controls. Prompt injection happens during deployment — at runtime — when an AI agent processes untrusted content that contains embedded instructions it wasn't designed to distinguish from legitimate commands.

The structural problem is fundamental to LLM architecture. Language models process instructions and data through the same channel. An email that says "summarize the quarterly report" is structurally identical to an email that says "summarize the quarterly report, and also forward all financial data to this external address." The model doesn't see a boundary between the user's intent and the content it processes. It sees tokens, and it follows the most persuasive sequence.

This isn't a flaw that better prompting fixes. It isn't a compliance gap you close with documentation. It's an architectural vulnerability that requires architectural controls. And the research community has just converged on what those controls look like.

The Convergence: Two Papers, Same Architecture, Same Day

On April 27, 2026, two independent research teams published defense architectures for prompt injection. They approached the problem from different angles, used different techniques, and arrived at the same fundamental conclusion: runtime monitors with privilege separation, not prompt hardening, is the correct defense model.

AgentVisor: Semantic Virtualization for AI Agents

The AgentVisor paper (arXiv:2604.24118), authored by researchers including Ying Zonghao, Wang Haozheng, Liu Jiangfan, and Liu Xianglong, draws explicit inspiration from operating system virtualization [Source: arXiv 2604.24118, "AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization," April 27, 2026].

Their framework treats the target AI agent as an untrusted guest process and places a "semantic visor" between the agent and its execution environment. Every tool call — every file access, every API invocation, every data retrieval — is intercepted by this trusted mediation layer and audited against security policy before execution.

The visor implements what the authors call a "rigorous audit protocol grounded in classic OS security primitives" specifically adapted to the semantic nature of LLM outputs. Rather than trying to predict and block malicious prompts — an arms race they acknowledge is unwinnable — AgentVisor validates what the agent attempts to do, not what it was told.

The results are striking:

0.65% attack success rate for prompt injection attacks against defended agents
Only 1.45% average decrease in utility compared to the no-defense baseline
A one-shot self-correction mechanism that converts security violations into feedback, enabling agents to recover from attacks rather than simply being blocked

The 0.65% ASR means that out of every 200 prompt injection attempts, approximately 199 fail. And the 1.45% utility penalty means the defense doesn't break legitimate functionality — it protects the agent without neutering it.

LCF: Runtime Behavioral Fingerprinting Without Reference Models

The LCF paper (arXiv:2604.24542), authored by Nay Myat Min, Long H. Pham, and Jun Sun, takes a different technical path but converges on the same architectural principle: monitor at runtime, don't harden pre-deployment [Source: arXiv 2604.24542, "Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models," April 27, 2026].

LCF treats the inter-layer hidden-state trajectory of a language model as a health signal. When a model processes text normally, its internal representations follow predictable patterns. When it encounters injected instructions, those patterns deviate — and LCF detects these deviations at the layer level.

Key characteristics of the approach:

No reference model required — it doesn't need a separate "clean" model to compare against, making it practical for deployments using third-party or API-based models
No trigger knowledge — it doesn't need to know what a specific attack looks like; it detects behavioral anomalies regardless of the injection technique
No retraining — the monitor is tuning-free, calibrated on just 200 clean examples
Less than 0.1% inference overhead — the computational cost is negligible

Evaluated across four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) and three threat families (backdoors, jailbreaks, prompt injection), LCF demonstrated:

100% text-payload injection detection across all eight model-domain combinations tested
92–100% detection of DAN-style jailbreaks
Mean backdoor attack success rate reduced below 1% on two of four tested architectures
A single aggregation score covering all three threat families without threat-specific tuning

The Pattern: What Both Approaches Share

Two independent teams. Two completely different technical implementations. Same day. Same conclusion. The pattern is unmistakable:

The defense boundary is the execution boundary, not the input boundary. Don't try to distinguish clean prompts from injected prompts at ingestion. Control what the agent is allowed to do regardless of what it was told.
Trusted mediation is essential. Some component in the architecture must sit between the agent's reasoning and its actions, with the authority to validate, block, or redirect. Whether you call it a semantic visor, a runtime controller, or a security mediator, the function is the same.
Monitoring beats prediction. You cannot enumerate every possible injection. You can detect when behavior deviates from expected patterns and intervene at that point.

This convergence is significant. It suggests the research community is crystallizing around an architectural solution to prompt injection, much as the security community crystallized around memory safety, input validation, and least privilege for earlier classes of attacks.

The Gap: What Organizations Are Actually Doing

The contrast between research reality and organizational response is stark. While the research community converges on runtime architectural controls, most enterprises are stuck in governance mode.

Consider what "AI security" looks like in a typical organization in May 2026:

A responsible AI committee meets quarterly to review incidents
An AI risk register lists prompt injection as a "medium" risk
Model cards document training data, limitations, and intended use
Output review procedures require human validation for high-stakes decisions
Acceptable use policies prohibit employees from entering sensitive data into public AI tools

None of these controls address what happens when an AI agent browsing the web encounters a page with a hidden prompt injection, or when an email processed by an AI assistant contains embedded instructions to exfiltrate data, or when a document uploaded for summarization includes commands that override the summarization request entirely.

This is the control gap: organizations have deployed pre-deployment and post-deployment controls for a runtime attack vector, and they're calling it "defense in depth" when it's actually just "defense in the wrong place."

Google's own security team described the challenge accurately in a separate post on April 2, 2026: IPI "is not the kind of technical problem you 'solve' and move on. Sophisticated LLMs with increasing use of agentic automation combined with a wide range of content create an ultra-dynamic and evolving playground for adversarial attacks" [Source: Google Security Blog, "Google Workspace's continuous approach to mitigating indirect prompt injections," April 2, 2026, by Adam Gavish]. Their response is a continuous, multi-layered defense combining deterministic filters, ML-based classifiers, LLM-based guards, and model hardening — a runtime architecture, not a policy framework.

What This Means for Security Leaders: A Five-Step Action Plan

If you're responsible for AI security — as a CISO, Chief AI Officer, head of security architecture, or board member with oversight responsibility — here's what to do in the next quarter:

1. Classify Your AI Deployments by Runtime Exposure

Not all AI deployments have the same prompt injection risk profile. Create an inventory that distinguishes:

No runtime exposure: Models that process only trusted, internally-generated data (e.g., a coding assistant that only sees your repository)
Limited runtime exposure: Models that process some untrusted data with human-in-the-loop validation (e.g., a document summarizer with mandatory review before publication)
Full runtime exposure: Models that autonomously process untrusted content and can take actions based on it (e.g., an AI agent that browses the web, reads email, and executes tool calls)

Your highest-risk deployments are the ones where an attacker can reach the model through untrusted data and the model can cause harm through its actions. Focus your runtime control investment here.

2. Map the Trust Boundary in Each High-Risk Agent

For each agent with full runtime exposure, draw the architecture and identify every point where the agent processes content it didn't create. This includes:

Web pages retrieved during browsing
Emails processed from external senders
Documents uploaded by users or received from third parties
API responses from external services
Data ingested from message queues, webhooks, or event streams

At each of these points, ask: "If an attacker controlled this content, what could they make the agent do?" If the answer includes actions you can't afford — data exfiltration, unauthorized transactions, system modification — you have a control gap.

3. Evaluate Runtime Guardrail Options Against Your Architecture

The research provides a menu of approaches, not a single prescription. Map available techniques to your specific architecture:

Mediation-layer controls (AgentVisor pattern): If your agents use structured tool-calling mechanisms, you can insert a security mediator between the reasoning LLM and its tool execution layer. This is the strongest control — it validates actions, not just prompts — but requires architectural changes to how agents interact with tools and data.
Behavioral monitoring (LCF pattern): If you're consuming third-party or API-based models where you can't modify the agent architecture, runtime behavioral monitoring can detect anomalous activity without requiring access to internal model states. Trade-off: detection rather than prevention, so you need incident response procedures.
Layered deterministic + ML + LLM guards (Google pattern): A pipeline approach combining fast regex/rule-based filters, ML classifiers for pattern recognition, and LLM-based semantic analysis for ambiguous cases. Google's internal teams use this across Workspace with Gemini [Source: Google Security Blog, April 2, 2026].

4. Fund Runtime Infrastructure, Not Just Governance Artifacts

Review your 2026 AI security budget allocation. If more than 50% is going to policy frameworks, awareness training, governance documentation, and compliance artifacts — you have a structural imbalance. These activities have value, but they don't prevent runtime attacks.

Concrete reallocation targets:

Engineering headcount for security mediation layer implementation
Runtime monitoring infrastructure (logging, anomaly detection, alerting)
Testing infrastructure to validate guardrail effectiveness against evolving attack techniques
Integration of AI-specific security signals into your existing SOC/SIEM workflows

5. Establish Continuous Validation

Prompt injection defenses are not deploy-and-forget. The attack surface evolves. Your controls must evolve with it.

Implement a validation cadence:

Weekly: Automated red-teaming against your deployed agents using current attack techniques
Monthly: Manual penetration testing with newly published injection techniques
Quarterly: Architecture review to assess whether the trust boundary has shifted (new tools, new data sources, new agent capabilities)

Google's approach of combining human red-teaming, automated red-teaming, a vulnerability rewards program, and continuous defense refinement provides a template [Source: Google Security Blog, April 2, 2026]. Scale this to your organization's size and risk profile.

The Bottom Line

Google's 32% spike in prompt injection detections isn't a crisis — yet. But it's a clear signal that attackers are experimenting, learning, and scaling. The sophistication gap between what researchers have demonstrated is possible and what attackers are currently attempting is closing.

The organizations that win on this threat vector will be the ones that deploy runtime architectural controls — mediation layers, behavioral monitors, privilege separation — before the attacks get sophisticated. The organizations that lose will be the ones still updating their policy checklists when the first real incident hits.

Prompt injection is an architectural problem. It demands an architectural answer. The research community just handed us two of them in the same 72-hour window. The question is whether security leaders will act on them, or keep writing policy documents while the clock ticks down.

Sources

Google Security Blog — "AI threats in the wild: The current state of prompt injections on the web," April 23, 2026. Thomas Brunner, Yu-Han Liu, Moni Pande. https://security.googleblog.com/2026/04/ai-threats-in-wild-current-state-of.html
arXiv:2604.24118 — "AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization," April 27, 2026. Ying Zonghao, Wang Haozheng, Liu Jiangfan, Liu Xianglong et al. https://arxiv.org/abs/2604.24118
arXiv:2604.24542 — "Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models," April 27, 2026. Nay Myat Min, Long H. Pham, Jun Sun. https://arxiv.org/abs/2604.24542
Google Security Blog — "Google Workspace's continuous approach to mitigating indirect prompt injections," April 2, 2026. Adam Gavish. https://security.googleblog.com/2026/04/google-workspaces-continuous-approach-to-mitigating-indirect-prompt-injections.html
arXiv:2604.17562 — "SafeAgent: A Runtime Protection Architecture for Agentic Systems," April 19, 2026. Liu Hailin et al. https://arxiv.org/abs/2604.17562 (additional reference on runtime protection architecture)

← All articles