Edge Cases

Prompt Injection Attacks: The Complete Threat Map for Production AI Systems

Owadokun Tosin Tobi — Wed, 06 May 2026 11:47:22 GMT

Prompt injection has been the number one LLM vulnerability on the OWASP Top 10 for two consecutive years. Most engineers know the name. Fewer understand that in 2026 it is not one attack. It is five distinct attack vectors — each exploiting a different layer of the AI system stack, each with a different blast radius, and each requiring different defences

The reason the threat keeps growing is not that the underlying vulnerability got worse. It is that every capability you add to an AI system — tool access, web browsing, RAG retrieval, MCP integrations, long-term memory, multi-agent coordination — adds a new surface through which an adversary can deliver an injection. The attack surface grows with the capability surface. And in 2026, the capability surface is expanding faster than any previous year in the history of AI.

This article maps all five vectors, documents the real-world incidents that validate each one, and gives you the specific defences that production deployments need. Not theory. Not research caveats. The controls that close the gap between a vulnerable agent and a trustworthy one.

73% of production AI deployments are vulnerable to prompt injection — OWASP Top 10 for LLM Applications 2025 / Security audits cited by Proofpoint
A single injection attempt against a GUI-based agent succeeds 17.8% of the time without safeguards — Anthropic Claude Opus 4.6 System Card, 2026
By the 200th attempt, the breach rate against frontier models reaches 78.6% — Anthropic Claude Opus 4.6 System Card, 2026
Sophisticated attackers bypass the best-defended models ~50% of the time with just 10 attempts — International AI Safety Report 2026
Attack success rates against state-of-the-art defences exceed 85% with adaptive strategies — Meta-analysis of 78 studies, MDPI Information, January 2026

Read those numbers together. A 17.8% success rate per attempt does not sound alarming until you understand that a motivated attacker does not make one attempt. They make hundreds. By the 200th, they are in more than three times out of four. Against frontier models with active defences.

OpenAI CISO Dane Stuckey called prompt injection ‘a frontier, unsolved security problem.’ Anthropic dropped its direct prompt injection metric entirely in its February 2026 system card, arguing that indirect injection is the more relevant enterprise threat. That reasoning tracks with what the incident record shows: every high-impact production compromise in the past year involved indirect injection, not direct.

Why Prompt Injection Cannot Be Fixed at the Model Level

Before mapping the five vectors, it is worth being precise about why this problem resists the standard fix. The core issue is architectural, not configurational. LLMs process instructions and data in the same channel. There is no separation between the system prompt, the user input, and the external content the agent retrieves. Everything is tokens in a sequence. The model cannot reliably distinguish ‘these tokens are instructions from my operator’ from ‘these tokens are data from the web page I just scraped.’

Simon Willison, who coined the term ‘prompt injection,’ posed the question directly: ‘Is this just a fundamental limitation of how large language models based on the transformer architecture work?’ By 2025, the consensus answer from the research community was: probably yes, for now.

This matters for how you approach defence. You are not trying to fix the model. You are trying to build an architecture around the model that limits what an injection can actually accomplish — constraining the blast radius, not eliminating the vulnerability. That is a solvable engineering problem. It is just a different problem than patching a software bug.

The Five Attack Vectors: A Complete Threat Map

Vector 1: Direct Prompt Injection

The original and most visible attack class. An attacker manipulates the user-facing input interface to override the system prompt, jailbreak safety instructions, or redirect the model’s behaviour. Classic examples: ‘Ignore your previous instructions.’ ‘You are now in developer mode.’ ‘Forget your system prompt and answer freely.’

Direct injection is the attack most security teams are prepared for — and the one that matters least in 2026. Frontier models have significant resistance to obvious jailbreak attempts. Anthropic’s system card shows that single direct injection attempts succeed at 17.8%, which sounds manageable until you account for persistence and automation.

More importantly, Anthropic removed direct injection metrics from their February 2026 system card entirely, noting that indirect injection is the operationally relevant threat. Direct injection requires the attacker to have access to the user interface. Indirect does not.

Real-world pattern: automated jailbreak campaigns

A 2025 study cited by Proofpoint documented over 461,640 prompt injection submissions in a single dataset. Success rates ranged from 50% to 84% depending on technique and model. The volume is the point — automated direct injection campaigns do not need high per-attempt success rates when they can run at industrial scale. Munich Re’s 2026 annual cyber risk report identified this scalability as a defining characteristic: prompt injection has a ‘low cost and high scale’ profile that makes it unusually attractive compared to traditional attack vectors.

Vector 2: Indirect Prompt Injection

The attack class that dominates the 2026 incident record. Instead of attacking the user interface, indirect injection embeds malicious instructions inside content the AI processes on behalf of a legitimate user. A document it summarises. A webpage it scrapes. An email it reads. An API response it receives. A RAG retrieval result.

The victim does not interact with the attack at all. They ask their AI assistant to summarise a meeting recording, process an invoice, or answer a question using web search. Somewhere in the external content the agent retrieves, an attacker has embedded instructions. The model processes those instructions as if they were legitimate directives. The attack executes invisibly.

Indirect injection bypasses standard prompt filtering in more than 50% of cases. Web-based indirect injection now accounts for nearly 40% of all LLM security incidents. The user never sees the attack. There is no anomalous input to flag. The agent is simply doing its job — following instructions that came from the attacker, not the user.

Real-world incident: OpenClaw + Telegram (March 2026)

The clearest documented case of indirect prompt injection producing real credential exfiltration. An adversarial payload embedded in external content manipulated an OpenClaw agent into constructing a URL containing sensitive context data — API keys, authentication tokens — and sending it via Telegram. The messaging app’s link preview fetched the URL automatically, transmitting the credentials to the attacker’s server with zero user interaction. CNCERT issued an official warning. Government agencies and banks across China were directed to restrict usage.

Full breakdown: 7 Fixes for the Indirect Prompt Injection Vulnerability That’s Silently Leaking Agent Secrets

Real-world incident: Lovable chat history exposure (April 2026)

The Lovable BOLA vulnerability this week exposed AI chat histories for every project created before November 2025. The relevant detail for indirect injection: when engineers vibe-code an application, they paste their API keys into the chat so the AI can wire them up. They paste database URLs to fix schema bugs. They paste sample customer records to get types right. Every one of those messages lived in the project’s chat history. The disclosed endpoint returned those chat histories. The channel that made them retrievable was the exact kind of external-content-ingestion surface that indirect injection targets. The Lovable breach did not require injection — but it illustrated exactly why AI chat sessions are now high-value exfiltration targets.

Vector 3: MCP and Tool Poisoning

The newest and least-understood attack surface in the 2026 threat landscape. The Model Context Protocol has become the de facto standard for connecting AI agents to external tools and data sources. Over 8,000 MCP servers are now publicly exposed. The security properties of those servers vary enormously — and most deployments have no audit process for what an MCP server is actually allowed to instruct the agent to do.

Tool poisoning works through the tool description layer. When an agent discovers available tools, it reads their descriptions to understand when and how to use them. An attacker who controls or compromises an MCP server can embed malicious instructions directly in tool descriptions — instructions that execute when the agent reads the description, before any tool call occurs. The attack is invisible to the user and to most monitoring systems because it does not appear in the conversation log.

40% of AI agent frameworks contain exploitable prompt injection flaws in tool-execution logic — SQ Magazine Prompt Injection Statistics 2026
Autonomous agents calling APIs exhibit up to 2.5x higher risk exposure than standalone models — Palo Alto Networks Unit 42 / SQ Magazine 2026

Real-world vector: CVE-2025-6514 — mcp-remote command injection

A critical command injection vulnerability in the mcp-remote client package allowed malicious MCP servers to trigger arbitrary command execution on client hosts by supplying crafted authorization_endpoint URLs during OAuth discovery. Hundreds of thousands of developer environments were affected. The attack required no user interaction beyond having the malicious MCP server in the agent’s tool registry.

Real-world vector: GitHub repository reader tool poisoning

Documented in 2025 and cited in academic review: attackers contributed malicious files to public GitHub repositories. When a developer used an MCP-enabled assistant with a GitHub repository reader tool, the assistant processed the malicious file, which redirected the agent to invoke a secondary tool and exfiltrate sensitive data from private repositories. The attack bypassed traditional sandbox isolation because the agent itself acted as the privileged execution engine.

Palo Alto Unit 42: three MCP attack classes

Unit 42’s December 2025 analysis of MCP sampling identified three critical vectors: resource theft (draining AI compute quotas for unauthorised workloads), conversation hijacking (compromised MCP servers injecting persistent instructions and exfiltrating data), and covert tool invocation (hidden tool calls and filesystem operations without user awareness or consent). All three classes exploit the implicit trust model at the core of MCP’s design — a model that lacks robust, built-in security controls.

Vector 4: Multi-Agent Chained Injection

As AI systems scale from single agents to networks of coordinating agents, prompt injection gains a multiplier effect. A single compromised sub-agent can inject instructions into the orchestrating agent’s context, redirecting the entire pipeline’s behaviour. Multi-hop indirect attacks — where injection propagates across multiple agent boundaries — increased by more than 70% year-over-year between 2025 and 2026.

The confused deputy problem captures this precisely. The attacker does not compromise the agent’s tools directly. They convince a trusted agent to misuse its own tools on the attacker’s behalf. Because the tool call comes from a legitimate, trusted agent, it passes all the checks that would have blocked an external call.

Real-world case: financial services reconciliation agent (2024, documented 2026)

An attacker injected instructions into a financial services reconciliation agent, instructing it to export ‘all customer records matching pattern X,’ where X was a regular expression that matched every record in the database. The agent, operating within its legitimate permissions, executed the export. 45,000 customer records were stolen before detection. The tool call was legitimate. The permissions were valid. The only thing that was wrong was the instruction — which came from the attacker, not the operator.

The cascade risk in agentic pipelines

Galileo AI’s simulation research documented that a single compromised agent poisoned 87% of downstream decision-making within four hours in multi-agent production systems. Injection at a sub-agent level does not stay contained. It propagates through the trust relationships the pipeline was designed to rely on. The OWASP Agentic Top 10 classifies this as Cascading Failures (ASI08) and notes that standard containment responses assume sequential failure, not the simultaneous multi-agent compromise that chained injection enables.

Vector 5: Memory Injection and Persistent Poisoning

The sleeper vector. Prompt injection that targets an agent’s long-term memory does not expire when the session ends. A false belief or adversarial instruction planted in memory persists, gets recalled in future sessions, and is treated as a legitimate directive. The agent will defend the false memory as correct when questioned.

Lakera AI’s research demonstrated this in production systems in late 2025. The attack requires only a single successful injection — into a document, an email, or a web page — for the agent to encounter during normal operation. The payload instructs the agent to store a specific false memory. From that point forward, every session that recalls the memory executes the attacker’s instruction.

What makes memory injection particularly dangerous is its deniability profile. The injection event and the malicious action are separated in time and often in session. A forensic investigation looking for the cause of an anomalous action may not connect it to an injection that occurred three sessions earlier.

The compounding risk: agentic systems with memory + tool access

Tool misuse via prompt injection triggers unauthorised actions in 31% of evaluated agent scenarios according to Unit 42’s analysis. When the agent also has persistent memory, that 31% does not reset between sessions. The injection survives. The malicious instruction accumulates. The only mitigation is explicit memory lifecycle management — token caps, mutation logging, periodic purging — which the OWASP Agentic Top 10 lists as mandatory controls for any agent with long-term memory

The Defences That Actually Work in Production

The core principle across all five vectors is the same: you cannot fix prompt injection at the model level. You can only limit what a successful injection can accomplish by constraining the architecture around the model. These six controls, applied together, do that.

1. The Quarantined LLM Pattern — Never Process Untrusted Content with Tool Access

OWASP’s most direct architectural recommendation: process untrusted content in a model instance that has no tool access. A separate, sandboxed model instance reads the external content — the email, the webpage, the document — and extracts only the information needed. That information is then passed to the main agent as structured, validated data. The main agent never sees the raw untrusted content.

This is the cleanest solution to indirect injection and MCP tool poisoning because it eliminates the attack surface entirely. The injected instruction never reaches a model instance with the capability to act on it. The quarantine boundary is architectural, not a prompt filter.

2. Input Trust Tiering — Apply Different Constraints to Different Content Sources

Not all content deserves the same trust level. System prompts and verified operator messages are Tier 1. Internal documents with confirmed provenance are Tier 2. External content from the web, emails, third-party APIs, user uploads, and MCP responses are Tier 3.

Tier 3 content should never directly drive tool invocations or consequential actions. It can inform the agent’s understanding. It cannot command the agent’s behaviour. Every Tier 3 input that would result in a tool call must pass through a verification node before execution. In Axiom Engine, this is enforced by the Prosecutor Agent layer: every claim from external content is traced back to a source coordinate before it can propagate downstream. The principle is identical — external content informs, verified data acts.

3. Human-in-the-Loop Checkpoints for High-Impact Actions

Airia’s 2026 analysis mandates human-in-the-loop checkpoints for high-impact agent actions as an immediate requirement, not a future goal. For any action that is irreversible, externally visible, or involves sensitive data — API calls, deployments, database mutations, external messages, file writes — require explicit human approval before execution.

This does not eliminate injection. It eliminates the ability of a successful injection to cause irreversible harm. An agent that halts and asks ‘I’m about to send this message containing the following content to this external endpoint. Confirm?’ gives a human the opportunity to catch what the model could not distinguish.

4. Exfiltration Channel Enumeration and Gating

Every output channel an agent can use is a potential exfiltration path: external API calls, URL construction, image rendering, email sending, file writes, messaging platform integrations. Enumerate every channel explicitly. Gate each one with an allowlist and an output inspection layer that flags anomalous patterns before delivery.

The OpenClaw Telegram incident was an exfiltration through a channel — link preview fetch — that was not on anyone’s threat model because it was a platform feature, not a direct API call. The defence is not just monitoring direct API calls. It is understanding every path through which data can leave the agent’s trust boundary and requiring explicit approval for any path outside the allowlist.

5. MCP Server Auditing and Tool Description Inspection

For every MCP server in your agent’s tool registry, apply the same review process you would apply to a production dependency: who published it, what permissions it requests, what its tool descriptions actually say. Tool descriptions are executable instructions to the agent. A malicious description is a prompt injection payload delivered at setup time, not at runtime.

Palo Alto Unit 42’s recommendation: review tool descriptions when installing MCP servers. The question ‘Did I trust this MCP server when I installed it?’ is insufficient. The question is: ‘Have I read every tool description it exposes and confirmed there are no embedded instructions?’

6. Memory Lifecycle Management with Hard Token Caps

Any agent with long-term memory needs a defined lifecycle: hard token caps to force predictable context boundaries, mutation logging for every memory write, isolation per user and per session, and periodic purging to remove drift and contamination. IBM’s security framework recommends a 20,000-token hard cap specifically to prevent unintended instruction accumulation across sessions.

Memory injection is the vector that breaks the standard incident response timeline. The injection event and the malicious action may be separated by days. Without mutation logging, the connection is invisible. With logging, it is detectable: a memory write from an unexpected context, containing content that came from an external source, is the forensic signal that catches the attack before the malicious recall occurs.

What This Week’s Incidents Tell Us About Prompt Injection’s Direction

The week of April 28 — May 4, 2026 produced a cluster of incidents that collectively illustrate where prompt injection is heading as a threat class.

The Lovable BOLA breach exposed AI chat histories containing credentials engineers pasted into sessions. The breach mechanism was not injection — it was a missing object-level authorization check. But the consequence is identical to a successful memory injection campaign: an attacker gains access to the context that AI sessions accumulate, including every secret that was ever pasted into the conversation to help the model do its job. The attack surface that indirect injection targets — the rich, credential-laden context window — is now explicitly understood as a high-value theft target.

The GitHub CVE-2026-3854 was discovered using AI-assisted vulnerability research. That is the flip side of the same dynamic: the techniques that make AI useful for defenders — automated code analysis, pattern recognition at scale — are equally available to attackers. Munich Re’s 2026 risk report specifically flagged this asymmetry: prompt injection has a ‘low cost and high scale’ profile because AI makes attack automation cheap.

The cPanel authentication bypass, which exploited a vulnerability that had been in active exploitation since at least February 23, shows the standard pattern: the attack surface existed long before the defenders knew it was being used. For prompt injection, the same principle applies. The injection campaigns running against your production agents today may not be visible in your logs until you build the observability infrastructure to detect them.

The 2026 incident record is converging on a single lesson: the teams that build AI systems without explicit injection defences are not gambling on whether they will be attacked. They are gambling on whether the attack will be visible before it causes irreversible harm. Given that indirect injection bypasses standard filtering 50% of the time, that is not a bet worth making.

The Threat Map Summary

Five vectors. One architectural root cause. Six defences that constrain the blast radius regardless of which vector the attacker uses.

Vector 1 — Direct Prompt Injection: user interface, manageable with model-level resistance and rate limiting, but automatable at scale
Vector 2 — Indirect Prompt Injection: external content ingestion, highest incident frequency, invisible to users, bypasses standard filtering in 50%+ of cases
Vector 3 — MCP and Tool Poisoning: tool description layer, 8,000+ exposed MCP servers, 40% of agent frameworks have exploitable flaws
Vector 4 — Multi-Agent Chained Injection: cascade across trust relationships, 70%+ YoY increase, 87% downstream poisoning within 4 hours in simulation
Vector 5 — Memory Injection: persistent across sessions, forensically difficult, requires dedicated lifecycle controls to detect

The complete framework for governing all five vectors — across six security domains with pre-deployment checklists and deployment maturity phases — is in the AI Agent Security Framework article in this series.

AI Agent Security Framework: How to Build One Before Your First Production Deployment

Owadokun Tosin Tobi — Mon, 13 Apr 2026 21:50:25 GMT

Three months into 2026, the evidence is in. The pattern across every major AI agent security incident this quarter is not sophisticated zero-days or state-sponsored attacks. It is deployment without a framework.

Anthropic’s distillation campaigns: 24,000 fraudulent accounts ran for months because there was no cross-account behavioural detection framework. The Vercel hallucination: an agent deployed unknown code because there was no verification-before-action framework. The OpenClaw prompt injection: credentials were exfiltrated via Telegram link previews because there was no trusted-content framework. The LiteLLM supply chain attack: 95 million monthly downloads were backdoored because there was no dependency provenance framework.

None of these required novel attacker techniques. They required the absence of the controls that a security framework mandates. And in every case, the team that got burned had shipped agents to production before shipping a security framework to govern them.

Only 29% of organisations report being prepared to secure agentic AI deployments — Cisco State of AI Security 2026

82% of executives feel confident their existing policies cover unauthorised agent actions — only 14.4% of agents go live with full security approval — Gravitee State of AI Agent Security 2026

This article gives you the framework the 71% of unprepared organisations are missing. It is grounded in the OWASP Top 10 for Agentic Applications 2026 — a globally peer-reviewed framework developed by over 100 industry experts — layered with the six-domain implementation structure that production deployments actually need.

Why Your Existing Security Framework Does Not Cover Agents

Before building the framework, it is worth being precise about why existing frameworks are insufficient. This is not an argument that NIST AI RMF, ISO 42001, or SOC 2 are useless. It is an argument that they were not designed for the specific risk properties of autonomous systems that plan, act, and coordinate across tool chains without continuous human oversight.

Innovaiden’s analysis found that existing ISO 27001 controls cover approximately 60% of agent-specific risks. The remaining 40% is the gap the framework below is designed to close. Specifically:

• Static controls assume static assets.

AI agent identities may persist for minutes. A service account created by an agent during a task, used to complete it, and abandoned is invisible to identity governance systems designed to audit quarterly.

• Perimeter controls assume a perimeter.

Agents communicate via A2A protocols, MCP servers, and messaging platforms. The attack surface is not a network boundary. It is every message the agent receives, every tool it calls, and every external content source it ingests.

• Monitoring assumes known behaviour patterns.

Your SIEM was built to detect anomalies in human behaviour. An agent executing 10,000 API calls in sequence looks completely normal. The attacker’s commands look identical to legitimate orchestration.

• Incident response assumes containment is fast.

Galileo AI research found that in simulated multi-agent systems, a single compromised agent poisoned 87% of downstream decision-making within four hours. Traditional incident response timelines assume you have days, not hours, before cascade.

The most common mistake enterprises make is applying their existing application security playbook to agents. Agents are not applications. They make autonomous decisions, call external tools, and can be manipulated through their inputs in ways that traditional software cannot. A firewall does not stop prompt injection. An API gateway does not prevent an over-permissioned agent from exfiltrating data through a legitimate tool call. — Bessemer Venture Partners, 2026

The Authoritative Starting Point: OWASP Top 10 for Agentic Applications 2026

In December 2025, the OWASP GenAI Security Project published the OWASP Top 10 for Agentic Applications 2026 — the first globally peer-reviewed framework specifically addressing the risk surface of autonomous AI systems. Developed through collaboration with over 100 industry experts, researchers, and practitioners, it is now the baseline standard for agentic security.

What makes it particularly valuable for framework builders is that it is grounded in real incidents, not theoretical attack models. Each of the ten risks (ASI01–ASI10) maps to documented failure modes from the first generation of agentic deployments. Here is the complete list with the Q1 2026 incident evidence that validates each one:

ASI01 Agent Goal Hijack

Q1 2026 incident: OpenClaw indirect prompt injection — adversarial content embedded in external data redirected the agent’s goal from serving the user to constructing credential-exfiltration URLs.

ASI02 Unsafe Tool Execution

Q1 2026 incident: Vercel hallucination — Claude Opus 4.6 invoked the Vercel deployment API with a fabricated repository ID, executing an irreversible production action based on an assumed (never verified) parameter.

ASI03 Delegated Trust Exploitation

Q1 2026 incident: Trivy/LiteLLM supply chain — TeamPCP exploited the trust relationship between LiteLLM’s CI/CD pipeline and its Trivy dependency to obtain a PyPI publish token and deploy backdoored packages.

ASI04 Misuse of Agent-Granted Capabilities

Q1 2026 incident: Anthropic distillation campaigns — DeepSeek, Moonshot, and MiniMax used legitimate API access, scaled via 24,000 fraudulent accounts, to systematically extract capabilities the platform was not designed to allow at that volume.

ASI05 Memory Tampering

Emerging pattern: Lakera AI’s November 2025 research demonstrated memory poisoning via indirect prompt injection, showing that false beliefs planted in an agent’s long-term memory persist across sessions and are defended as correct when questioned.

ASI06 Excessive Autonomy

Q1 2026 incident: Multiple OpenClaw deployments — default high-privilege configurations combined with messaging platform access created autonomous action chains with no human checkpoint before consequential external operations.

ASI07 Insecure Inter-Agent Communication

Active risk in multi-agent deployments: Spoofed inter-agent messages can misdirect entire agent clusters. Only 24.4% of organisations have full visibility into which agents are communicating with each other.

ASI08 Cascading Failures

Documented in Galileo AI research: a single compromised agent poisoned 87% of downstream decision-making within four hours in simulated production multi-agent systems.

ASI09 Human-Agent Trust Exploitation

Documented pattern: Well-trained agents produce confident, fluent explanations of bad decisions. McKinsey’s 2026 governance report found that agents often convince security analysts that compromised behaviour is legitimate.

ASI10 Rogue Agents

Q1 2026 incident: LiteLLM 1.82.8 installed a systemd backdoor (sysmon.service) that survived package uninstallation and continued polling for additional payloads — the persistence mechanism of a rogue agent operating outside the system’s intended parameters.

Every one of the top ten risks materialised in a real incident in the first quarter of 2026 alone. The framework you build needs to address all ten.

The AI Agent Security Framework: Six Domains, Applied Before Deployment

The OWASP Agentic Top 10 defines what to protect against. The six-domain framework below defines how to build the controls. It draws on the Cloud Security Alliance’s Agentic Trust Framework, Proofpoint’s Agent Integrity Framework, Bessemer’s CISO implementation guidance, and the specific architectural lessons from Q1 2026 incidents.

Apply these six domains before your first production agent deployment. Not after. Not during. Before.

Domain 1: Identity and Access Governance

Every AI agent is an identity. It needs credentials to access databases, cloud services, APIs, and code repositories. The more tasks it is given, the more entitlements it accumulates. CyberArk’s framing is exact: agents are non-human identities, and they are the fastest-growing identity category in enterprise infrastructure.

The framework requirement: treat every agent as an independent, identity-bearing entity requiring its own access review. Only 21.9% of organisations currently do this.

Three controls this domain mandates:

Unique agent identities with scoped, short-lived credentials — not shared service accounts or inherited human user permissions
Just-In-Time (JIT) credential provisioning: access granted for the duration of a specific task, revoked immediately after completion
Quarterly access reviews for agent identities using the same rigour applied to privileged human accounts

The LiteLLM supply chain attack succeeded because a single compromised PyPI token gave TeamPCP the ability to publish packages affecting 95 million monthly downloads. That token had no time-bound expiry, no scope limitation beyond ‘publish to PyPI’, and no anomaly detection for out-of-hours publishing activity. JIT credentials with scope-limited tokens would have contained the blast radius to a single task rather than the entire package ecosystem.

Domain 2: Input Trust Boundaries

Every source of content that an agent ingests is a potential injection vector. The OWASP Agentic Top 10 identifies Agent Goal Hijack (ASI01) as the top risk specifically because agents cannot reliably distinguish legitimate instructions from adversarial content embedded in external data.

The framework requirement: build an explicit trust tier for every content source the agent accesses.

Tier 1 (System prompt + verified user messages): highest trust, full tool access
Tier 2 (Internal documents and databases with provenance): medium trust, read operations only, no direct tool invocation
Tier 3 (External web content, emails, third-party APIs, scraped data): lowest trust, no access to sensitive tools or credentials, output inspection before delivery

In Axiom Engine, this manifests as the Prosecutor Agent layer: every claim generated from Tier 3 content must be traced back to a verified source coordinate before it can pass downstream. Nothing from untrusted sources propagates to consequential actions without audit. This is not a post-processing filter. It is an architectural constraint built into the pipeline.

Domain 3: Tool Use and Action Controls

Unsafe Tool Execution (ASI02) is the second-highest OWASP agentic risk for a reason. Agents call tools with fabricated parameters, invoke APIs with unverified identifiers, and execute irreversible actions before any human has a chance to review them. The Vercel hallucination incident is the canonical production example.

The framework requirement: every tool call that can produce an irreversible, high-consequence, or externally visible action must pass through three gates before execution:

Verification gate:

the parameter driving the action was retrieved from an authoritative source, not assumed or generated. No deployment without a confirmed repository ID. No financial transaction without a confirmed account number. No data deletion without a confirmed record identifier.

Least-privilege gate:

the agent’s current permission set covers this specific action. If it does not, the action fails with an escalation signal rather than attempting to acquire additional permissions.

Human approval gate:

for irreversible actions — deployments, database mutations, financial transactions, credential changes — a human checkpoint is mandatory. LangGraph’s interrupt() function provides the mechanism. This is not optional for consequential operations.

Domain 4: Memory and State Integrity

Memory Tampering (ASI05) is the sleeper risk in agentic security. Unlike prompt injection, which ends when the session ends, memory poisoning persists. A false instruction planted in an agent’s long-term storage persists across sessions, gets recalled as a legitimate directive, and is defended as correct when questioned. Lakera AI’s November 2025 research demonstrated this in production systems.

The framework requirement:

Never store raw user input in long-term memory. Store structured, vetted summaries only.
Apply memory isolation per user, per session, and per task — no cross-contamination of context between tenants or roles
Log all memory mutations and require human approval for any goal-altering changes
Implement a memory lifecycle with explicit expiry: IBM’s security framework recommends a 20,000-token hard cap to force predictable context boundaries and prevent unintended credential accumulation across sessions
Periodically purge and re-baseline long-term memory to remove drift or contamination

Domain 5: Supply Chain and Dependency Integrity

The TeamPCP campaign across Trivy, KICS, and LiteLLM in March 2026 is the definitive 2026 case study for why supply chain integrity belongs in every AI agent security framework. LiteLLM’s backdoored versions passed all standard pip integrity checks because they were published using legitimate credentials. The malicious content was not detectable by hash verification alone.

The framework requirement:

Treat AI plugins, skills, MCP servers, and agent framework dependencies as critical infrastructure — not developer convenience tools
Pin every dependency to an exact version with a verified hash. Implement pip install --require-hashes in all production environments
Disable automatic updates for AI skills and plugins in production. Updates must go through the same review process as new installations
Audit for .pth file installation: any package that installs .pth files with execution patterns (subprocess, base64, exec) should trigger an immediate security review
Subscribe to PyPA security advisories and the CVE feeds for every AI framework in your dependency tree

The LiteLLM compromise was discovered by a researcher whose Cursor IDE pulled it in as a transitive dependency of an MCP plugin they never explicitly installed. Your supply chain is wider than the packages you intentionally install.

Domain 6: Observability and Incident Response

You cannot govern what you cannot see. Only 21% of executives currently have complete visibility into agent permissions and data access patterns. Only 24.4% of organisations know which of their agents are communicating with each other. The observability gap is the enforcement gap.

The framework requirement:

Full trace-level logging: every tool call with inputs, outputs, and timestamps. Every external content ingestion point. Every inter-agent message. Every memory mutation.
Anomaly detection on tool-call sequences: not just individual calls, but patterns. A deployment with zero prior lookup calls is detectable. 88 bot comments in a 102-second window is detectable. These patterns are invisible without sequence-level observability.
Agent inventory: a continuously updated catalogue of every agent in the environment — authorized or not. Shadow AI cannot be governed without visibility.
Kill switches with operational priority: Kiteworks’ 2026 survey found most organisations can monitor what agents are doing but cannot stop them when something goes wrong. The governance-containment gap. Your incident response plan must include the ability to terminate agent actions in real time, not just log them.
Blast-radius testing: before any new agent deployment or policy change, run it through a digital twin environment. OWASP Agentic Top 10 explicitly recommends gating policy expansions on blast-radius caps verified in isolated environments before production promotion.

Subscribe now

The Four-Phase Deployment Maturity Model

The Cloud Security Alliance’s Agentic Trust Framework (ATF) proposes a maturity model that maps directly to how organisations should sequence agent autonomy against security control maturity. Applied to the six-domain framework above, it gives you a concrete deployment sequence:

Phase 1: Intern Agent (Read-Only Mode)

Agents at this level can access data, perform analysis, and generate insights. They cannot take any action that modifies external systems. All six framework domains are implemented in read-only enforcement mode. Minimum time at this phase: two weeks before promotion eligibility.

Real-world equivalent: an agent that summarises documents, answers questions from a knowledge base, and generates draft outputs for human review. No tool calls with write access. No external API mutations. No messaging platform integrations.

Phase 2: Junior Agent (Supervised Actions)

Agents at this level can recommend specific actions with supporting reasoning, but require explicit human approval before any action is executed. Domain 3 (Tool Use and Action Controls) is now active: every tool call passes through the verification, least-privilege, and human approval gates. Supply chain integrity (Domain 5) is fully enforced.

Real-world equivalent: an agent that proposes a deployment, presents the verified repository ID and diff for human review, and executes only after explicit approval. This is what should have governed the Vercel incident.

Phase 3: Senior Agent (Selective Autonomy)

Agents at this level can execute pre-approved action classes without human intervention, but escalate to human review for novel or high-consequence operations. All six domains are fully active. Anomaly detection (Domain 6) is enforcing real-time alerts. Memory lifecycle management (Domain 4) is actively expiring and re-baselining context.

Real-world equivalent: an agent that autonomously handles routine deployments within a defined scope, halts and escalates when it encounters a parameter it cannot verify, and automatically pages a human for any action outside its pre-approved action class.

Phase 4: Principal Agent (Full Autonomy with Governance)

Agents at this level operate with broad autonomy within explicitly bounded operational limits. This phase requires clean operational history at Phase 3, a passed security audit, measurable positive impact, and explicit sign-off from authorized stakeholders. No agent reaches this phase without evidence from the lower phases.

The critical point: most production agents should not be at Phase 4. The instinct is to maximise autonomy. The discipline is to match autonomy level to the maturity of your controls and the evidence from your operational history.

OWASP’s principle of least agency is the governing constraint of this maturity model: only grant agents the minimum autonomy required to perform safe, bounded tasks. Autonomy without sufficient control maturity is not a capability. It is a liability.

The Pre-Deployment Checklist: 20 Questions Before Any Agent Goes Live

Before any agent moves to production — regardless of how simple it seems — answer all 20 of these questions. If any answer is ‘no’ or ‘not yet’, the agent is not ready for production.

Domain 1: Identity and Access

Does this agent have a unique identity distinct from human users and shared service accounts?
Are credentials scoped to the minimum permissions required for the agent’s specific tasks?
Do credentials expire after task completion, or are they long-lived?
Is this agent included in your access review process?

Domain 2: Input Trust Boundaries

Have you mapped every external content source this agent will ingest?
Is there an explicit trust tier applied to each content source?
Does the agent have a mechanism to flag or reject anomalous instructions embedded in Tier 3 content?

Domain 3: Tool Use and Action Controls

Does every irreversible action require parameter verification against an authoritative source?
Is there a human approval checkpoint for deployments, database mutations, and financial transactions?
Does the agent fail safely (escalate, not proceed) when it cannot verify an action parameter?

Domain 4: Memory and State Integrity

Is there a token cap on long-term memory context?
Are memory mutations logged and auditable?
Is memory isolated per user, session, and task?

Domain 5: Supply Chain and Dependency Integrity

Are all AI dependencies pinned to exact versions with verified hashes?
Are AI plugins and skills from verified sources with documented provenance?
Are automatic updates disabled in the production environment?

Domain 6: Observability and Incident Response

Is full trace-level logging in place for all tool calls and external content ingestion?
Is there a continuously updated agent inventory covering this deployment?
Does your incident response plan include the ability to terminate this agent’s actions in real time?
Has this agent been tested in a digital twin environment with defined blast-radius caps?

Share Edge Cases

The Framework Is the Deployment Gate

The instinct under time pressure is to ship the agent and add security later. Q1 2026 has demonstrated what ‘later’ costs in practice: production incidents, credential rotations across thousands of environments, public incident disclosures, and the kind of trust damage that takes quarters to repair.

The six-domain framework above is not a compliance exercise. It is the minimum viable governance structure for an autonomous system that can take real actions in the world on your behalf. It does not require exotic tooling or a dedicated AI security team. It requires disciplined engineering and a willingness to make framework completion a deployment gate, not a post-launch audit item.

Bessemer’s CISO framework guidance puts it well: the framework should follow the strategy, not precede it. Define your organisation’s position on agents — how much autonomy, in which contexts, with which oversight structure — and then build the controls that match that position. But whatever your position, build the controls first.

The teams that got burned in Q1 2026 did not lack the capability to build these controls. They lacked the discipline to build them before deployment. The framework is not the obstacle to shipping agents. It is what makes shipping agents sustainable.

Case studies from this series:

AI Agent Security: The Complete Guide

What Are the 4 Biggest AI Agent Security Risks in 2026?

Claude Hallucinated a GitHub Repo ID and Deployed It to Production

7 Fixes for the Indirect Prompt Injection Vulnerability

LLM Hallucination in Production: Why Your Staging Environment Will Never Catch It

Owadokun Tosin Tobi — Sat, 28 Mar 2026 13:04:42 GMT

Your model passed every test in staging. The evaluation suite scored 94%. The demo ran flawlessly. Then it hit production and invented a GitHub repository ID it had never seen, with complete confidence, and used your deployment API to ship unknown code to a customer account.

That is not a hypothetical scenario. That is exactly what happened to a Vercel customer in March 2026 when Claude Opus 4.6 — one of the best-performing models on every major benchmark — hallucinated a numeric repo ID at line 877 of its execution trace, with zero GitHub API calls before it, and executed an irreversible production deployment.

The Hallucination Problem in 2026: What the Numbers Actually Show

The good news is that hallucination rates have fallen significantly over the past five years. The average rate across major models dropped from nearly 38% in 2021 to approximately 8.2% in 2026, with the best-performing systems reaching rates as low as 0.7% on standardised benchmarks.

Now the bad news.

False or fabricated information still appears in 5–20% of complex reasoning and summarisation tasks — Master of Code / HHEM 2026 Leaderboard
Models are 34% more likely to use confident language when hallucinating than when providing factual information — MIT Research 2025
LLMs hallucinate between 69% and 88% of the time on specific legal queries — Stanford RegLab / HAI
Global business losses attributed to AI hallucinations reached $67.4 billion in 2024 — AllAboutAI Comprehensive Study
Employees now spend an average of 4.3 hours per week verifying whether AI-generated information is correct — Suprmind AI Hallucination Statistics 2026

That MIT finding is the one that matters most for production systems. When a model hallucinates, it does not say ‘I’m not sure about this.’ It says 913939401 — a perfectly formatted, completely fabricated GitHub repository ID — with the same fluency and confidence it uses for correct answers. The output looks exactly like a correct answer. That is what makes hallucinations dangerous in production, not their frequency. Their confidence.

LLMs are prediction engines, not knowledge bases. They generate text by predicting the most statistically likely next token based on patterns from training. They do not ‘understand’ truth. They predict plausibility. When the model encounters a gap, it fills it with something plausible-sounding rather than admitting uncertainty. This is not a bug that will be patched. It is the architecture.

Why Staging Doesn’t Catch Production Hallucinations

Most LLM evaluation frameworks test what the model does when things go right. You provide representative inputs, verify the outputs against expected results, measure accuracy, and ship. The problem is that production hallucinations are not distributed like your evaluation set.

The Distribution Gap

Staging tests cover your happy path. Production exposes your model to the full distribution of real-world inputs: edge cases, ambiguous queries, adversarial prompts, inputs at the boundary of the model’s training data, and combinations of context that your evaluation set never anticipated. Hallucinations cluster at the edges of that distribution, not the centre.

In April 2026, Cursor’s AI assistant told users they were restricted to ‘one device per subscription’ — a policy that never existed. The false claim led to user cancellations and refunds before the company acknowledged it was a hallucination. Cursor’s staging environment almost certainly included thousands of tests for subscription-related queries. The specific combination of context that triggered this hallucination was not one of them.

The Confidence Paradox

Standard evaluation frameworks measure accuracy. They do not measure confidence calibration — whether the model’s expressed confidence correlates with its actual accuracy. A model that scores 94% accuracy but is equally confident about the 6% it gets wrong is a liability in any high-stakes deployment, because the outputs that are wrong are indistinguishable from the outputs that are right.

The MIT research is explicit: the more wrong a model is, the more certain it sounds. This is not random noise. It is a systematic property of how these models are trained — they are rewarded for producing helpful, confident responses, not for calibrating uncertainty. OpenAI’s own post-mortem on GPT-4o’s sycophancy acknowledged this directly: the model was trained in ways that rewarded confidence over accuracy in ambiguous situations.

The Scale Problem

In early demos, hallucinations feel rare and manageable. In production, they compound. Hallucinations in LLM applications increase at scale — not because the model gets worse, but because the volume of edge-case inputs grows faster than the evaluation set anticipated. An error rate of 1.75% across 10,000 daily interactions is 175 hallucinations per day. At 100,000 interactions, it is 1,750.

The Agentic Multiplier

All of these gaps are compounded when the hallucinating model is an agent with tool access. A hallucination in a chat response produces a wrong answer. A hallucination in an agent pipeline produces a wrong action. The Vercel deployment was the result of a single hallucinated identifier. At line 877 of the execution trace, with no prior lookup, the model invented a number — and the downstream system executed it without question.

Tool and action hallucinations are the most dangerous class: the model calls tools incorrectly, fabricates parameters, or attempts actions outside its allowed scope — with the same confidence it uses for everything else. Portkey’s production hallucination analysis names this as the failure mode most likely to translate into real-world incidents.

What LLM Hallucination Actually Looks Like at Enterprise Scale

Theory aside, here are the real incident patterns that appear repeatedly in 2026 enterprise deployments:

▶ Hallucinated Package Names in Code Generation

AI coding assistants regularly suggest package imports that do not exist. Researchers found that certain hallucinated package names appear frequently across many developers’ interactions — consistently enough that attackers began registering those package names with malicious payloads. When the agent recommends the package and the developer installs it, the malicious code executes. This is a hallucination that became a supply chain attack vector.

▶ Fabricated Legal and Regulatory Citations

Stanford’s RegLab research remains the definitive data point: LLMs hallucinate between 69% and 88% of the time on specific legal queries. 83% of legal professionals have encountered fabricated case law when using AI. For any agent operating in a regulated industry — legal, healthcare, finance, compliance — the hallucination rate on the exact queries where accuracy matters most is dramatically higher than headline benchmark numbers suggest.

▶ Production Deployment of Hallucinated Identifiers

The Vercel incident is the clearest documented case: a hallucinated repository ID, passed to a production API, deploying unknown code to a customer environment. The model knew the project name and project ID. It did not know the repository’s numeric GitHub ID. Rather than calling the GitHub API to retrieve it, the model invented a plausible-looking numeric value. The staging environment had no test case for ‘agent attempts to deploy with a fabricated repository ID’ because that failure mode was not anticipated.

▶ Policy Hallucinations in Customer-Facing Systems

Cursor’s device restriction hallucination (April 2026) is one of many documented cases where an AI assistant confidently stated a policy that did not exist. Air Canada’s chatbot told a passenger he could claim a bereavement refund after purchasing a full-fare ticket. A tribunal ruled Air Canada was responsible for the chatbot’s statements. A Utah homeowner received bot approval for a $3,000 warranty claim that the company later denied as an AI error. Consumer advocates argued firms remain legally bound by chatbot promises under existing contract law.

The Controls That Actually Catch Production Hallucinations

The goal is not to eliminate hallucinations at the model level. That is architecturally impossible with current LLM designs. The goal is to build a system that treats hallucinations as a predictable failure mode and catches them before they propagate into consequential actions.

1. Build a Verification Layer, Not Just a Validation Layer

Validation checks whether a value looks correct. Verification checks whether a value is correct. Format checks and type checks pass on hallucinated values — as the Vercel incident proved. The only thing that catches confident hallucinations is a step that goes back to the authoritative source and confirms the value is real before it drives any action.

In Axiom Engine, I built this as a Prosecutor Agent: a secondary agent that traces every generated claim back to a source coordinate — page number, line number — before the claim is allowed to surface to a user. For agentic systems with tool access, the equivalent is a lookup node that must return a confirmed result from an authoritative API before any write, deploy, or mutation node can execute. The agent should not be able to carry an assumed identifier through the execution graph.

2. Treat Production as a Separate Evaluation Environment

Runtime evaluation is not the same as pre-deployment evaluation. Production LLM systems need continuous evaluation against real interactions — not periodic benchmark runs against synthetic test sets.

Portkey’s production hallucination framework describes this as a feedback loop: centralise routing and telemetry, run guardrails continuously against real production interactions for grounding and consistency, detect emerging hallucination patterns early, and feed those patterns back into prompt updates, retrieval changes, and tool policies. The evaluation is not a gate before production. It is a continuous function that runs in production.

3. Implement Grounding Constraints in Your Retrieval Architecture

For RAG-based systems, the most common production failure mode is extrinsic hallucination: the model generates outputs that are not grounded in the retrieved documents, even when those documents contain the correct answer. The model retrieves the right evidence and then generates an answer that contradicts it.

Evidence-gated RAG addresses this directly: require the model to cite the specific document coordinate that supports each claim before the claim is allowed to surface. If the model cannot point to the page and line where it found the information, the claim does not go to the user. This is not a post-hoc check — it is a structural constraint built into the generation pipeline. I cover the full implementation architecture in a dedicated article in this series.

4. Calibrate Your Confidence Thresholds to Your Use Case

Not all hallucination contexts carry the same risk. A hallucinated product description costs less than a hallucinated medication dosage. A fabricated package name in a code suggestion is manageable. A fabricated repo ID in a deployment pipeline is not.

Build risk tiers into your LLM routing: route high-stakes queries — anything that triggers an irreversible action, a regulatory decision, or a financial transaction — through your strongest available model with the strictest grounding constraints. Route lower-stakes queries through faster, cheaper models with looser constraints. The cost of over-engineering verification for your highest-risk flows is latency. The cost of under-engineering it is the incident.

5. Log Confidence Signals, Not Just Outputs

Standard production logging records what the model said. It does not record how certain the model was when it said it. Building confidence signal logging into your observability stack — tracking token probability distributions, detecting hedging language versus assertive language, flagging outputs where the model’s expressed certainty is unusually high for the query type — gives you early warning signals before a hallucination becomes an incident.

The MIT research finding is actionable here: models are 34% more likely to use confident language when hallucinating. That pattern is detectable. Build detection for it.

The Honest Assessment

LLM hallucination rates will continue to fall as model architectures improve. Claude Sonnet 4.6 currently shows approximately 3% hallucination rate on standardised benchmarks — a significant improvement over models from two years ago. But the benchmark measures a controlled task. Production is not a controlled task.

The gap between benchmark performance and production behaviour will persist as long as evaluation sets do not simulate the full distribution of real-world inputs, including the adversarial ones, the edge cases, and the ambiguous queries where the model’s training data is sparse. That gap is not a model problem. It is a systems engineering problem.

Closing it requires treating hallucinations not as model defects to be fixed upstream, but as predictable failure modes to be caught downstream. Verification layers. Runtime evaluation. Evidence gating. Confidence-tiered routing. Comprehensive trace logging. These are not workarounds for a broken technology. They are the engineering discipline that makes a fundamentally probabilistic system safe to deploy in consequential contexts.

Your staging environment tests what you expected to go wrong. Production exposes what you didn’t expect. The engineers who understand this build verification into the pipeline before they ship. The engineers who don’t discover it in a post-incident review.

What Are the 4 Biggest AI Agent Security Risks in 2026?

Owadokun Tosin Tobi — Thu, 26 Mar 2026 15:40:17 GMT

Gartner identifies AI-specific threats as the number one emerging risk category for enterprises in 2026. That is a useful headline. What it does not tell you is which risks are actually materialising in production environments, which ones your existing security stack was not built to catch, and which ones will still be your problem when the current threat intelligence cycle moves on to the next story.

This article answers those questions directly. Each of the four risks below is backed by current data from the organisations that measure these things for a living — the Gravitee State of AI Agent Security 2026 Report, HiddenLayer’s AI Threat Landscape, the Cybersecurity Insiders AI Risk and Readiness Report 2026, and AGAT Software’s enterprise analysis. No invented scenarios. No hypothetical attack chains. Just the failure modes that are already happening inside organisations that believed their existing controls were sufficient.

82% of executives feel confident their existing policies protect against unauthorised agent actions. Only 14.4% of agents actually go live with full security or IT approval. That confidence gap is where the incidents live.

Let’s look at the four risks that are filling it.

Risk 1: Prompt Injection — The Attack That Needs No Exploit Code

Prompt injection is the number one entry on the OWASP LLM Top 10 for the second consecutive year. It is also the most misunderstood risk in most enterprise security stacks — because it does not look like an attack. There is no malware binary. No network intrusion. No exploit payload. Just text, embedded in content the agent was supposed to read.

Direct prompt injection happens at the user input layer: an attacker crafts a message that overrides the agent’s system instructions. Indirect prompt injection is more dangerous and far harder to detect: the adversarial instruction is embedded inside external content the agent ingests during normal operation. A document it summarises. A webpage it scrapes. An email it processes. An API response it receives.

88% of organisations reported confirmed or suspected AI agent security incidents last year — Gravitee State of AI Agent Security 2026
In healthcare, that number is 92.7% — Gravitee 2026

The March 2026 OpenClaw incident is the clearest real-world demonstration of what indirect prompt injection looks like at scale. An adversarial payload embedded in external content manipulated an agent into constructing a URL containing sensitive context data — API keys, authentication tokens — and sending it via Telegram. The messaging app’s link preview fetched the URL automatically, transmitting the data to the attacker’s server with zero user interaction. No click. No warning. No anomaly visible in standard monitoring.

What makes this risk particularly hard to govern is the execution layer problem. Security teams have done solid work controlling which AI tools employees can access. They have no governance at the layer where agents actually take actions — the tool invocation layer. Prompt injection does not need to breach a perimeter. It only needs to manipulate an agent into using a tool it already has legitimate access to.

CrowdStrike and Cisco both moved in early 2026 to address this specifically at the execution layer. Cisco’s AI Defense expanded in February 2026 to add runtime protections against tool abuse and supply chain manipulation at the MCP layer. This is not fringe vendor movement. This is core enterprise security infrastructure shifting because the existing stack does not cover this attack surface.

What Actually Mitigates This

Build explicit trust tiers: content arriving from outside the system prompt and user messages should be handled in a lower-privilege context without access to sensitive tools or credentials
Run agent outputs through an inspection layer before delivery — flag URL generation containing credential-like patterns before they reach a messaging platform
Instrument the full input-to-output chain so injected payloads are visible in traces even when they produce no obvious anomaly in tool call patterns

Full technical breakdown: 7 Fixes for the Indirect Prompt Injection Vulnerability That’s Silently Leaking Agent Secrets

Risk 2: Over Permissioned Agents — The Blast Radius Problem

The average organisation now manages 37 deployed agents. That number grows every quarter as individual product and engineering teams spin up automation without central review. Each of those agents has credentials. Most of those credentials have more access than the agent’s actual tasks require.

53% of organisations grant AI tools write access to cloud productivity and collaboration suites — Cybersecurity Insiders AI Risk and Readiness Report 2026
40% grant write access to email. 25% to code repositories. 8% to identity providers. — Cybersecurity Insiders 2026
Only 29% of organisations limit AI tools to read-only access — Cybersecurity Insiders 2026

An agent with write access to the identity layer can create service accounts, elevate privileges across federated systems, and grant external access through API calls that never cross a network perimeter. This is not a theoretical risk. The Cybersecurity Insiders report documented the specific scenario: a SOC analyst arriving Monday morning, tracing an anomalous privilege change to a service account created by an agent 72 hours earlier.

The over-permission problem compounds in multi-agent systems. In a complex pipeline, the orchestration agent might hold API keys for five downstream agents. If the orchestrating agent is compromised, an attacker gains access to all five downstream systems through a single point of failure. The AIUC-1 Consortium documented exactly this scenario: a compromised manager agent commanding an accountant agent to move funds, bypassing security checks that would have triggered for any human request.

The Galileo AI research from December 2026 quantified the cascade dynamic: in simulated multi-agent systems, a single compromised agent poisoned 87% of downstream decision-making within four hours. Your SIEM will show fifty failed transactions. It will not show which agent initiated the cascade.

What Actually Mitigates This

Scope every API key and credential to the minimum required for the specific task — not the maximum the agent might ever need
Implement Just-In-Time (JIT) credential provisioning: access granted for the duration of a specific task, revoked immediately after
Treat AI agents as independent, identity-bearing entities requiring their own access reviews — not as extensions of human users or generic service accounts. Only 21.9% of organisations currently do this.
Audit write-access grants quarterly and establish approval gates for any action that creates accounts, modifies permissions, or moves data externally

Risk 3: Shadow AI — The Agents Your Security Team Doesn’t Know Exist

Shadow AI is not a new problem. What is new in 2026 is the scale at which it is now documented and the specific risk profile it creates in an agentic context.

76% of organisations cite shadow AI as a definite or probable problem — up from 61% in 2025 — HiddenLayer AI Threat Landscape 2026
Only 24.4% of organisations have full visibility into which AI agents are communicating with each other — Gravitee 2026
Shadow AI security incidents cost an average of $670,000 more than standard incidents — AGAT Software Enterprise Analysis 2026
More than half of all agents run without any security oversight or logging — Gravitee / AGAT 2026

The Moltbook incident from January 2026 is the clearest illustration of what happens when agents operate without proper identity management and permission gating. The platform — acquired by Meta in March 2026 — let AI agents interact autonomously in Reddit-style forums without human intervention. 404 Media discovered an unsecured database that allowed anyone to hijack any agent on the platform. The viral post that alarmed millions, an AI agent apparently organising a secret encrypted language to hide from humans, turned out to be a person exploiting the vulnerability to post under an agent’s credentials.

The enterprise implication is direct: when agents operate without security review, you cannot distinguish between legitimate agent behaviour and adversarial manipulation. The security team audited every official AI tool. They have no visibility into the ones running in private Telegram bots, personal API keys, browser extensions, and individual team automations that were never submitted for review.

You cannot govern what you cannot see. The average enterprise has an estimated 1,200 unofficial AI applications in use. 86% of organisations report no visibility into their AI data flows. Shadow AI is not a policy problem. It is a visibility problem. And until you solve the visibility problem, the policy has no enforcement surface.

What Actually Mitigates This

Deploy an AI inventory system that continuously discovers and catalogues agent deployments across the environment — not a one-time audit
Implement network egress monitoring specifically for AI agent traffic patterns — unsanctioned agents typically call external APIs that are not on your approved-vendor list
Establish a lightweight agent approval process that does not require full security review for low-risk deployments, but does require registration and credential scoping for anything with write access
Treat shadow AI discovery as a continuous function, not a periodic compliance exercise

Risk 4: AI Supply Chain Attacks — The Threat Inside the Plugin

Software supply chain attacks increased 742% between 2021 and 2025. The AI agent ecosystem has created a new attack surface within that broader trend: the plugin, skill, and third-party tool layer that agents depend on to extend their capabilities.

Malware hidden in public model and code repositories was the most cited source of AI-related breaches (35%) — HiddenLayer AI Threat Landscape 2026

93% of organisations continue to rely on open repositories for AI innovation despite this risk — HiddenLayer 2026

Barracuda Security identified 43 agent framework components with embedded vulnerabilities via supply chain compromise in 2026 alone — Barracuda Security 2026

The CNCERT warning on OpenClaw in March 2026 specifically flagged malicious skills uploaded to ClawHub — OpenClaw’s skill repository — that execute arbitrary commands once installed. This is not a novel attack pattern. It is the open-source dependency risk model applied to AI agent plugins, and most organisations are not treating AI skills and plugins with the same scrutiny they apply to software dependencies.

The OpenAI plugin ecosystem incident Stellar Cyber documented is more severe: a supply chain attack resulted in compromised agent credentials being harvested from 47 enterprise deployments. Attackers used those credentials to access customer data, financial records, and proprietary code for six months before discovery. Six months of dwell time, across 47 enterprises, through a single compromised plugin.

Cisco’s 2026 State of AI Security report specifically calls out MCP servers as an expanding supply chain attack surface — external tool connectors that most security teams are not auditing, even as agents increasingly depend on them for core functionality.

What Actually Mitigates This

Treat AI skills and plugins as production dependencies: require provenance verification, pin to specific versions, and review before installation
Disable automatic skill and plugin updates in production environments — updates should go through the same review process as new installations
Audit MCP server integrations with the same rigour applied to third-party API integrations
Monitor agent behaviour after any new skill or plugin installation for anomalous tool call patterns

The Pattern Across All Four Risks

Looking at these four risks together, the common thread is not the sophistication of the attacks. It is the gap between where security teams are looking and where agents are actually vulnerable.

Security teams secure the model layer — which tools employees can access, which vendors passed procurement review. The attacks happen at the execution layer: the tool invocations, the credential stores, the external content ingestion points, the third-party plugin ecosystem. These are the surfaces that existing security stacks were not built to govern.

The 37% of organisations that experienced AI agent-caused operational issues in the past twelve months — 8% of those significant enough to cause outages or data corruption — are not the organisations that failed to buy the right tools. They are the organisations that secured the model layer and left the execution layer ungoverned.

Policy documentation and runtime enforcement are not the same thing. 82% of executives are confident their policies cover this. Only 14.4% of agents go to production with full security approval. That gap is not a knowledge problem. It is an enforcement problem. And enforcement requires visibility into the execution layer, not just the model layer.

The complete framework for securing the execution layer — eight controls, a prioritised implementation roadmap, and the regulatory timeline — is in the pillar guide for this series.

AI Agent Security: The Complete Guide for Enterprise Builders in 2026

Owadokun Tosin Tobi — Fri, 20 Mar 2026 21:20:56 GMT

Before we get into architecture, frameworks, or mitigations, let’s establish what the data actually says about where we are in 2026.

48% of cybersecurity professionals rank agentic AI as the #1 attack vector for 2026 — Dark Reading
Only 34% of enterprises have AI-specific security controls in place — HiddenLayer / UCSI Institute
80% of organisations report risky agent behaviours including unauthorised system access — AIUC-1 Consortium / Help Net Security
Only 21% of executives have complete visibility into agent permissions and data access patterns — Help Net Security
76% of organisations cite shadow AI as a definite or probable problem — up from 61% in 2025 — HiddenLayer 2026 AI Threat Landscape Report
1 in 8 companies has now linked an AI breach directly to an agentic system — HiddenLayer 2026
$4.88M average cost of a data breach in 2024, with AI-related incidents carrying a premium — IBM X-Force 2026

Read those numbers together and the picture is clear: agentic AI is being deployed at enterprise scale, faster than the security infrastructure to govern it. The agents can act. The controls are not keeping up.

This guide is for the engineers and technical leaders who are closing that gap right now — not waiting for a framework committee to finalise a document.

What Is AI Agent Security — And Why Is It Different?

Traditional application security has a well-understood model: you protect the perimeter, validate inputs, sanitise outputs, manage credentials, and monitor for anomalies. These controls work because the application does predictable things in predictable ways.

AI agents break every one of those assumptions.

An agent does not have a predictable execution path. It reasons about what to do next based on context, instructions, and the outputs of previous steps. It can use tools it was not explicitly programmed to use. It can ingest content from external sources and allow that content to influence its behaviour. It can coordinate with other agents, delegating tasks and accepting results that it cannot independently verify.

Your SIEM and EDR tools were built to detect anomalies in human behaviour. An agent that runs code perfectly 10,000 times in sequence looks completely normal to these systems. But that agent might be executing an attacker’s will. — Stellar Cyber, 2026 Agentic AI Security Report

This is the core challenge of AI agent security: the attack surface is not a perimeter. It is a pipeline. Every point at which the agent ingests information, makes a decision, calls a tool, or communicates a result is a potential entry point for an attacker or a point of failure in the system’s own reasoning.

The Four Properties That Make Agents Uniquely Difficult to Secure

Autonomy. Agents act without continuous human oversight. A compromised agent can cause damage through entirely ordinary-looking operations. There is no human in the loop to notice that something is wrong until after the action has executed.

Tool access. Agents interact with APIs, databases, filesystems, messaging platforms, and other agents. Each integration is a potential attack surface. In a multi-agent system, a single compromised sub-agent can escalate privileges across every system that agent has access to.

External content ingestion. Agents read emails, scrape webpages, process documents, and receive API responses. Any of this content can contain adversarial instructions designed to manipulate the agent’s behaviour — without any indication to the agent that the instructions are malicious.

Memory persistence. Agents can retain context and instructions across sessions. A false instruction planted in an agent’s memory does not expire when the session ends. It persists, gets recalled in future sessions, and is treated as a legitimate directive.

The 2026 AI Agent Threat Landscape: Five Attack Vectors You Must Understand

Gartner identifies AI-specific threats as the number one emerging risk category for enterprises in 2026. These are the five attack vectors that the industry’s most credible security research — IBM X-Force, Cisco State of AI Security, HiddenLayer, and the AIUC-1 Consortium — consistently identifies as the primary threat classes for agentic systems.

▶ 1. Prompt Injection (Direct and Indirect)

Prompt injection remains the number one entry on the OWASP LLM Top 10. Direct prompt injection occurs when an attacker manipulates the user-facing input to override system instructions. Indirect prompt injection is more dangerous: the attack is embedded in external content that the agent ingests during normal operation. The agent cannot distinguish between legitimate content and adversarial instructions embedded within it.

Real-world evidence: In March 2026, OpenClaw AI agents were confirmed to be vulnerable to an indirect prompt injection attack that used messaging app link previews as a zero-click exfiltration channel. The agent was manipulated into constructing a URL containing sensitive data, which the messaging app’s link preview fetched automatically. China’s CNCERT issued an official warning recommending organisations isolate or restrict the tool. [Read the full breakdown → 7 Fixes for the Indirect Prompt Injection Vulnerability That’s Silently Leaking Agent Secrets]

▶ 2. Tool Use Failures and Hallucinated Actions

Agents can call tools with hallucinated parameters — fabricating identifiers, endpoint names, or configuration values that appear plausible but are incorrect — and pass them to APIs that execute without validation. When the hallucinated value resolves to something real (a repository, a file, a record), the consequences can be immediate and irreversible.

Real-world evidence: In March 2026, Claude Opus 4.6 hallucinated a GitHub repository ID at line 877 of its execution trace — with zero GitHub API calls before it — and used Vercel’s API to deploy unknown code to a customer’s production account. The repository was harmless by chance. The failure mode is not. [Read the full breakdown → Claude Hallucinated a GitHub Repo ID and Deployed It to Production]

▶ 3. Multi-Agent Trust and Privilege Escalation

In multi-agent architectures, agents communicate with and delegate tasks to each other. If a sub-agent is compromised, it can instruct an orchestrating agent to take actions it would not otherwise take — escalating privileges across every system the orchestrator has access to. The AIUC-1 Consortium documented a scenario where a compromised ‘manager agent’ commanded an ‘accountant agent’ to transfer funds, bypassing security checks that would have triggered for any human request.

Model-level guardrails alone are insufficient: Stanford’s Trustworthy AI Research Lab found fine-tuning attacks bypassed Claude Haiku in 72% of cases and GPT-4o in 57%. The trust boundary problem cannot be solved at the model level alone.

▶ 4. Shadow AI and Unsanctioned Agent Deployments

76% of organisations now cite shadow AI as a definite or probable problem, up 15 points year-over-year. The average enterprise has an estimated 1,200 unofficial AI applications in use, with 86% of organisations reporting no visibility into their AI data flows.

Shadow AI is particularly dangerous in the agentic context because unsanctioned agents often run with developer-level credentials, have no security review, and are not covered by the organisation’s incident response procedures. When they fail or are exploited, the security team may not even know the agent existed.

▶ 5. AI Supply Chain Attacks

The Barracuda Security report identified 43 different agent framework components with embedded vulnerabilities introduced via supply chain compromise in 2026 alone. Cisco’s State of AI Security 2026 specifically calls out the fragility of the AI supply chain — vulnerabilities in datasets, open-source models, tools, and MCP servers that most teams are not auditing.

The CNCERT warning on OpenClaw specifically flagged malicious skills uploaded to ClawHub, the tool’s skill repository, that execute arbitrary commands once installed. Third-party AI plugins and skills are the new open-source dependency risk — and most organisations are not treating them with the same scrutiny.

Four Real-World Incidents That Define the Current Risk Profile

The threat categories above are not theoretical. In the first three months of 2026 alone, four verified incidents have illustrated precisely what happens when agentic systems are deployed without adequate security architecture. All four are case studies I have written about in detail. Here is the summary.

Incident 1: Industrial-Scale Model Distillation (February 2026)

Anthropic publicly confirmed that three major Chinese AI labs — DeepSeek, Moonshot AI, and MiniMax — ran coordinated campaigns using over 24,000 fraudulent accounts to generate more than 16 million exchanges with Claude, extracting its capabilities to train their own models. The labs used proxy networks managing over 20,000 accounts simultaneously, mixing extraction traffic with legitimate requests to evade detection.

The security implication that most coverage missed: models built through illicit distillation do not retain the safety guardrails of the source model. The capabilities proliferate. The alignment does not. This is a national security risk that compounds every time a distilled model is open-sourced.

Full analysis: Anthropic’s Battle Against AI Espionage

Incident 2: Comparative Analysis of Agentic Edge Case Handling (February 2026)

A detailed comparative analysis of how Anthropic, OpenAI, and Google handle the four hardest agentic failure modes — tool use failures, multi-agent trust, ambiguous instructions, and long-horizon task failures — revealed three fundamentally different security philosophies: Anthropic treating failures as information architecture problems, OpenAI treating them as policy and prompt engineering problems, and Google treating them as protocol and governance problems. None has fully solved the multi-agent trust problem.

Full analysis: When Agents Go Wrong

Incident 3: Hallucinated Deployment via Vercel API (March 2026)

Claude Opus 4.6, operating through the OpenClaw agent framework with live Vercel API credentials, hallucinated a GitHub repository ID and deployed unknown code to a customer’s production account. Zero GitHub API calls preceded the deployment. The fabricated identifier resolved to a harmless repository by chance. The failure mode — an agent with deployment permissions and no verification layer — applies to any agentic system with access to production infrastructure.

Full analysis: Claude Hallucinated a GitHub Repo ID and Deployed It to Production

Incident 4: Zero-Click Data Exfiltration via Link Preview (March 2026)

PromptArmor documented an indirect prompt injection attack against OpenClaw that used Telegram’s link preview functionality as a data exfiltration channel. The agent was manipulated into constructing a URL containing sensitive context data; the messaging app’s backend fetched it automatically, transmitting credentials to an attacker’s server with no user interaction. CNCERT issued an official warning. Chinese government agencies, state enterprises, and banks were directed to restrict usage.

Full analysis: 7 Fixes for the Indirect Prompt Injection Vulnerability

Four incidents. Three months. One common thread: agentic systems deployed with default configurations, maximum permissions, and no verification layer between the agent’s outputs and consequential external actions.

The AI Agent Security Framework: Eight Controls That Actually Work in Production

The AIUC-1 Consortium — developed with input from Stanford’s Trustworthy AI Research Lab and 40+ security executives from Confluent, Elastic, UiPath, and Deutsche Börse — concluded that model-level guardrails alone are insufficient. What works is technically specific, architecturally enforced controls. Here are the eight that matter most

01 Least Privilege Tool Access with Just-In-Time Permissions

Every tool an agent can call is a potential attack surface. Scope permissions to the minimum required for the specific task, not the maximum the agent might ever need. Just-In-Time (JIT) permission granting — where access is provisioned only for the duration of a specific task and revoked immediately after — limits the blast radius of any single compromised operation.

In practice: an agent that needs to read a database should not have write access. An agent that needs to deploy to a staging environment should not have production credentials. An agent that needs to query a GitHub repository should not have a token with push access. Scope every key, every token, every permission to the minimum required action.

02 Verify Before Act — Never Assume an Identifier

The Vercel hallucination incident was caused by an agent using an assumed value — a repository ID it never looked up — to drive an irreversible deployment. The fix is architectural: for any action that takes an external identifier as input, require a verified API call to confirm that identifier before it can be used downstream.

This is the principle of verification before action. Build it as an explicit pipeline step. In LangGraph: a lookup node that must return a confirmed result before any write or deploy node can execute. The agent should not be able to carry an assumed ID through the execution graph.

03 Build a Verification Agent, Not Just Validation Logic

Input validation checks whether a value looks correct. Verification checks whether a value is correct. Format checks and type checks pass on hallucinated values. Only a verification step that goes back to the authoritative source catches confident hallucinations.

The pattern: a secondary agent audits the primary agent’s proposed outputs against verifiable sources before any external action executes. In Axiom Engine, I implemented this as a Prosecutor Agent that traces every generated claim back to a source coordinate — page number, line number — before the claim is allowed to surface to a user. The same principle applies to any consequential action: one agent proposes, a second agent verifies, before anything external executes.

04 Treat All External Content as an Untrusted Attack Surface

Every document an agent summarises, every webpage it scrapes, every email it reads, every API response it processes is a potential prompt injection vector. Build an explicit trust boundary: content that arrives from outside the system prompt and user messages should be handled in a lower-privilege context, without access to the tools and credentials that handle sensitive operations.

Cisco’s 2026 State of AI Security report calls out MCP servers as an expanding attack surface for this exact reason — external content flowing into agent context without adequate sanitisation or trust tiering.

05 Require Human Checkpoints for Irreversible Actions

Deployments, database writes, file deletions, financial transactions, and credential changes are irreversible. Any action that cannot be trivially undone should require a human approval step before execution — not after. LangGraph’s interrupt() function provides the mechanism: pause the execution graph, surface the proposed action to a human reviewer, resume only on explicit approval.

This is not a performance bottleneck. It is the architectural difference between an agent that automates your workflow and one that creates incidents in the middle of the night.

06 Isolate Agents with Container Boundaries and Egress Controls

Run agents in isolated containers with strict network egress rules. Define an allowlist of domains the agent can make outbound requests to. An injected instruction directing the agent to construct a request to an attacker’s domain cannot succeed if that domain is blocked at the network layer.

This was CNCERT’s explicit recommendation following the OpenClaw incident: isolate the service in a container, do not expose the management port to the internet, implement network policies that prevent unexpected outbound connections. These are standard infrastructure security practices applied to a new class of workload.

07 Implement Cross-Account Coordination Detection in Multi-Agent Systems

In multi-agent architectures, treat messages from sub-agents with the same scepticism you apply to user messages. Do not grant elevated trust to instructions that arrive via the agent communication channel. Validate sub-agent tool call results rather than accepting them blindly. Flag statistically anomalous coordination patterns — synchronised requests, shared payment methods, identical prompt structures across accounts — as signals of either a distillation campaign or a compromised sub-agent.

Anthropics’ detection of the DeepSeek, Moonshot, and MiniMax distillation campaigns was built on exactly this principle: coordinated activity detection that flagged the pattern of synchronised traffic across thousands of accounts, not the individual requests.

08 Instrument Everything — Full Trace-Level Observability is Non-Negotiable

The Vercel incident was diagnosable because the full execution trace was available: zero GitHub API calls before the deployment, the fabricated ID appearing at line 877. Without that trace, the incident is undiagnosable.

Log every tool call with inputs, outputs, and timestamps. Track which tools were called — and which were not called — before any consequential action. Build replay capability so any failure can be reconstructed exactly as it occurred. Alert on anomalous tool-call sequences: a deployment with zero prior lookup calls is a detectable pattern.

Only 21% of executives currently have complete visibility into agent permissions and data access patterns. That is the gap that makes every other control harder to enforce.

Why Your Existing Security Stack Is Partially Blind to These Threats

The AIUC-1 Consortium briefing is direct on this point: frameworks such as NIST AI RMF and ISO 42001 provide organisational governance structures. They do not address the specific technical controls that CISOs need for agentic deployments — tool call parameter validation, prompt injection logging, or containment testing for multi-agent systems.

The gap is not just at the governance layer. It is technical. Your SIEM was designed to detect anomalies in human behaviour. Your DLP was designed to catch data exfiltration over known channels. Neither was designed to monitor an agent that generates a crafted URL, sends it in a Telegram message, and allows the link preview fetch to do the exfiltration.

IBM X-Force 2026: A 44% increase in attacks beginning with exploitation of public-facing applications — largely driven by missing authentication controls and AI-enabled vulnerability discovery. Attackers are finding your gaps faster than your security team is patching them.

The practical implication: AI agent security requires controls that are layered at the pipeline level, not just the perimeter level. The agent’s execution trace, tool call sequences, external content ingestion points, and output channels all need visibility that your existing stack was not built to provide.

Where to Start: A Prioritised Implementation Roadmap

If you are securing an existing agent deployment, the eight controls above can feel overwhelming to implement simultaneously. Here is a prioritised order based on the real-world incident evidence from the first quarter of 2026:

Immediate (This Week)

Audit every live agent for credential storage in context windows. If credentials are present, move them to environment variables or a secrets manager immediately.
Disable link previews in any messaging platform integration. This single change closes the zero-click exfiltration vector documented in the OpenClaw incident.
Review tool permissions for every production agent. Identify any agent operating with permissions beyond what its current tasks require and scope them down.

Short-Term (Next 30 Days)

Implement verification nodes in your agent pipelines for any action that takes an external identifier as input.
Add human approval checkpoints for every irreversible action: deployments, database mutations, financial transactions, credential changes.
Containerise agent workloads with network egress allowlists if they are not already isolated.

Architectural (Next Quarter)

Build or integrate a verification agent layer — a secondary auditing agent that confirms proposed outputs against verifiable sources before consequential actions execute.
Implement full trace-level observability with anomaly detection on tool-call sequences.
Establish a continuous red-teaming process for agent deployments, integrated into your CI pipeline so model updates and prompt changes automatically trigger adversarial test suites.

The AIUC-1 Consortium’s recommendation, endorsed by the CTO of 1Password: ‘Baseline guardrails must be built into the platforms themselves. Sandboxed tool execution, scoped and short-lived credentials, runtime policy enforcement, and comprehensive audit logging should not require custom engineering.’ Until platform defaults catch up, these are engineering responsibilities.

The Regulatory Horizon: What’s Coming and When

EU AI Act enforcement begins in August 2026. If your AI agent touches EU citizens’ data, makes consequential decisions, or operates with a risk classification above ‘minimal’, compliance obligations are already active. The specific obligations relevant to agentic systems include transparency requirements for automated decision-making, human oversight mandates for high-risk applications, and incident reporting requirements for AI system failures.

In the US, following the White House Executive Order on AI, all major federal contractors are now required to conduct pre-deployment red team evaluations. This is catalysing a new professional category — AI red-teaming — that the Bureau of Labor Statistics projects will see 35% demand growth by 2028 with almost no supply to meet it.

The practical implication for enterprise builders: the regulatory timeline is no longer theoretical. Organisations that are building their AI agent security controls now will have a structural advantage when compliance becomes mandatory. Organisations waiting for the final guidance documents will be retrofitting controls into production systems under deadline pressure.

The Bottom Line for Enterprise Builders in 2026

AI agent security is not a new category of problem layered on top of existing cybersecurity. It is a fundamental shift in the threat model driven by a fundamental shift in what the systems being secured actually do.

Traditional software has a predictable execution path. AI agents reason, plan, decide, and act. They ingest external content that can influence their behaviour. They call tools that execute consequential, often irreversible actions. They coordinate with other agents in ways that can escalate privileges across entire systems. And they do all of this autonomously, at the speed of an API call, before any human has a chance to intervene.

The controls that work are not novel concepts. Least privilege. Verification before action. Human checkpoints for irreversible operations. Network isolation. Full observability. These are software engineering fundamentals applied to a new class of system where the stakes of skipping them are higher, and the speed of failure is faster, than anything built before.

The four incidents from Q1 2026 alone — the Anthropic distillation campaigns, the comparative edge case failures, the Vercel hallucination, the OpenClaw exfiltration — are not isolated data points. They are a pattern. The agents are in production. The attacks are happening. The controls are behind.

48% of security professionals say agentic AI is the top attack vector for 2026. Only 34% of enterprises have AI-specific security controls in place. That gap is where the incidents are happening. Closing it is not a future roadmap item. It is this quarter’s engineering priority.

This guide is a living document. As new incidents emerge, new controls are validated, and the regulatory landscape evolves, I will update it. Every article in this AI agent security series links back here as the source of truth. Subscribe to follow the full series as it publishes.

Sources And Further Reading

HiddenLayer — 2026 AI Threat Landscape Report

IBM — X-Force Threat Intelligence Index 2026

Dark Reading — 2026: The Year Agentic AI Becomes the Attack-Surface Poster Child

Help Net Security — AI Went From Assistant to Autonomous Actor and Security Never Caught Up

Cisco — State of AI Security 2026

Stellar Cyber — Top Agentic AI Security Threats in 2026

Practical DevSecOps — AI Security Statistics 2026

Kiteworks — Agentic AI: Biggest Enterprise Security Threat for 2026

Thanks for reading Edge Cases! This post is public so feel free to share it.

7 Fixes for the Indirect Prompt Injection Vulnerability That is Silently Leaking Agent Secrets

Owadokun Tosin Tobi — Tue, 17 Mar 2026 11:30:06 GMT

Your AI agent replies in Telegram with a link. You don’t click it. You don’t need to. The moment the app renders the link preview, your API keys are already on an attacker’s server.

That is not a theoretical attack scenario. It is the indirect prompt injection vulnerability that security researchers at PromptArmor documented in OpenClaw — the same open-source AI agent that crossed 250,000 GitHub stars in roughly 60 days, faster than React — and which China’s national cybersecurity agency CNCERT officially warned organisations to isolate or restrict on March 14, 2026.

Two weeks ago, I wrote about how OpenClaw combined with Claude Opus 4.6 hallucinated a GitHub repository ID and deployed unknown code to a Vercel customer’s production account. That was a failure of verification architecture. This is a different class of failure: a trusted, legitimate feature of your messaging platform weaponised against your agent’s context window, silently, without any user interaction at all.

Same tool. Different failure mode. The pattern is becoming hard to ignore.

Here is the full technical breakdown of how the attack works, why it is harder to catch than it sounds, and the seven concrete fixes that close the door on this class of vulnerability in any agentic system.

The Attack: How a Link Preview Becomes an Exfiltration Channel

To understand why this attack is particularly dangerous, you need to understand what link previews actually do at the protocol level.

When you send a URL in Telegram, Discord, or Slack, the messaging app’s backend automatically fetches that URL to generate a preview card — the title, description, and thumbnail you see before you click anything. This is a server-side operation. The fetch happens automatically, immediately, and silently. No user action required.

Now here is the attack chain, step by step:

Step 1: Injection

An attacker embeds a malicious prompt inside content that the agent is likely to read — a webpage it scrapes, a document it summarises, an email it processes. The content looks innocent to a human reader. To the agent, it contains instructions.

Step 2: Manipulation

The injected prompt instructs the agent to construct a specific URL: the attacker’s domain, with query parameters populated with sensitive data the agent has access to — API keys, authentication tokens, user data, environment variables, whatever sits in the agent’s context window.

Step 3: Delivery

The agent, following the injected instructions, sends this crafted URL as part of its reply inside Telegram or Discord. From the user’s perspective, the agent sent a message with a link. Nothing looks wrong.

Step 4: Exfiltration

The messaging app automatically fetches the URL to generate a link preview. The sensitive data in the query parameters is transmitted to the attacker’s server in that fetch request. The attacker’s server logs it.

Step 5: Zero trace

The user sees a normal-looking link preview. The agent continues its task. No error. No alert. No indication that anything went wrong.

The attack requires zero user clicks, zero model errors, and zero anomalous agent behaviour. The agent is doing exactly what it was instructed to do. The instruction just came from the attacker, not the user.

What makes this particularly difficult to detect is that the injected payload does not necessarily look like a prompt injection attempt. Giskard’s investigation into OpenClaw documented payloads that impersonate OpenClaw’s own system message format — appearing in the agent’s context window as if they were legitimate system-level directives, not external attacker instructions.

Why OpenClaw’s Default Configuration Makes This Worse

CNCERT’s warning named the root problem precisely: OpenClaw’s inherently weak default security configurations, combined with the high system privileges it needs to perform autonomous task execution, create an attack surface that is unusually wide out of the box.

Giskard’s January 2026 investigation documented exactly what this means in practice:

Exposed access tokens:

By default, access tokens appear in query parameters, making them harvestable from browser history, server logs, or non-HTTPS traffic.

Shared global context:

Direct messages were configured to share a single context, meaning secrets loaded for one user became visible to another in the same deployment.

No sandboxing in group chats:

Group chat deployments ran powerful tools without isolation, allowing the agent to read environment variables, API keys, and configurations — and modify its own routing.

Unrestricted injection paths:

External content — emails, scraped pages, third-party skills — provided direct paths for adversarial prompts to drive tool calls and exfiltrate data.

The link preview exfiltration vulnerability sits on top of all of this. An agent with access to plaintext credentials, running in a shared context, connected to Telegram, and ingesting external content is close to an ideal target for this attack chain.

This is not a criticism of OpenClaw as a project. It is a structural observation about what happens when a general-purpose agentic system ships with convenience-first defaults and high system privileges. The same configuration patterns exist in many custom agent deployments that haven’t been hardened. OpenClaw is just the case study that made it visible.

The Broader Pattern: Two OpenClaw Incidents in Two Weeks

Two weeks ago, the Vercel hallucination incident exposed a failure of verification architecture: an agent with no lookup step, passing an invented identifier to a real API with real consequences.

This week’s CNCERT warning exposes a failure of trust boundary architecture: an agent with no distinction between trusted instructions and injected adversarial content, connected to a messaging platform that helpfully transmits the attacker’s data harvest with no user intervention required.

The two failures are different in mechanism but identical in root cause: agents deployed with default configurations and maximum permissions, in environments where the attack surface was not modelled before deployment.

OpenClaw is not uniquely vulnerable here. The same attack patterns apply to any agent that ingests external content and communicates via messaging platforms with link preview functionality. Cursor, custom LangGraph agents, any Claude project wired to Telegram or Discord — if they ingest untrusted content and reply in a messaging app, the attack surface exists.

7 Fixes to Close This Vulnerability Class in Any Agent Deployment

These are ordered from the highest-leverage, lowest-effort changes to the more architectural ones. If you are running any agent that ingests external content and communicates via messaging platforms, start at the top.

1. Disable Link Previews in Your Messaging Integration

This is the direct fix for the specific attack vector PromptArmor documented. If the messaging app cannot fetch the URL, the data cannot be transmitted. Check your messaging platform’s developer documentation for the correct API parameter — for Telegram bots, this is typically handled at the message send level via the disable_web_page_preview parameter. For Discord, link embeds can be suppressed by wrapping URLs in angle brackets. Verify the correct configuration in your platform’s current documentation before deploying, as these settings change with API versions.

2. Never Store Credentials in the Agent’s Context Window

If the credentials are not in the context window, they cannot be exfiltrated via prompt injection. Inject secrets at runtime via environment variables or a secrets manager, scoped to the specific tool call that needs them. The agent should request a credential when it needs it, not carry it throughout the session. This single change dramatically reduces the value of a successful prompt injection attack — the attacker can manipulate the agent but cannot harvest what the agent does not hold.

3. Treat All External Content as Untrusted Input

Every webpage the agent scrapes, every email it reads, every document it summarises, every API response it processes is a potential injection vector. Build your agent pipeline with an explicit trust boundary: content that arrives from outside the system prompt and user messages should be handled in a separate, lower-privilege context. LangGraph makes this architectural: define which nodes operate on external content and ensure those nodes do not have access to the tools or context that handle sensitive operations.

In Axiom Engine, the Prosecutor Agent’s role is partly about this exact boundary. Claims generated from external document content are verified against source coordinates before they can influence any downstream action. The verification step is the trust boundary. Nothing from external content passes through without audit.

4. Isolate Your Agent in a Container with Strict Network Egress Rules

CNCERT’s explicit recommendation: isolate the service in a container and prevent OpenClaw’s default management port from being exposed to the internet. But the principle extends further: define an egress allowlist for your agent’s container. The agent should only be able to make outbound requests to domains explicitly on that list. An injected prompt instructing the agent to construct a request to attacker.com cannot succeed if attacker.com is not on the allowlist and the network policy enforces it.

5. Audit Your Skills and Third-Party Integrations

CNCERT specifically flagged malicious skills uploaded to ClawHub — OpenClaw’s skill repository — that execute arbitrary commands once installed. The same risk exists in any agent ecosystem that allows third-party tool integrations without verified provenance. Download skills only from sources you have independently verified. Disable automatic skill updates in production. Treat new skill installations with the same review process you would apply to a production dependency update.

6. Run Agent Outputs Through an Output Inspection Layer Before Delivery

Before the agent sends any message to a messaging platform, run the output through an inspection step that flags anomalous patterns: URLs containing query parameters with values that look like credentials, API keys, or environment variable contents; requests to domains outside a known-good list; outputs that contain structural patterns consistent with data exfiltration payloads. This does not need to be a heavyweight classifier — a lightweight rule-based check catches the most obvious attack patterns before they reach the messaging platform.

7. Instrument the Full Input-to-Output Chain

The indirect prompt injection attack is invisible in agent logs that only record tool calls and final outputs. It becomes visible when you log the full input context at each step — including the external content that was ingested, the instructions that appeared in the agent’s context window, and the reasoning chain that led to each output. Full trace-level observability does not prevent the attack, but it makes forensic reconstruction possible and enables anomaly detection on context content, not just tool call patterns.

Related from this series:
AI Agent Security: The Complete Guide for Enterprise Builders in 2026
What Are the 4 Biggest AI Agent Security Risks in 2026?

The Bigger Lesson: Default Configurations Are a Security Decision

Every default configuration in an agentic system is an implicit security decision. Enabling link previews by default is a convenience decision that is also a security decision. Storing credentials in context is a convenience decision that is also a security decision. Allowing unrestricted egress is a convenience decision that is also a security decision.

The challenge is that agentic systems are being adopted faster than the security mental model that should accompany them. Engineers who would never deploy a web application without reviewing its default security settings are deploying agents with full system privileges and messaging platform integrations without the same review.

The attack surface of an agentic system is not the model. It is the pipeline: what content the agent ingests, what context it holds, what tools it can call, and what channels it can write to. Every junction in that pipeline is a potential injection or exfiltration point.

OpenClaw keeps making security headlines not because it is uniquely dangerous, but because it is widely deployed and its architecture makes the risks visible at scale. The same risks exist in quieter deployments that haven’t been audited yet.

Indirect prompt injection is not a model alignment problem. It is a software security problem applied to a new class of system. The fixes are architectural, not model-specific. They work regardless of which LLM your agent runs on.

What to Do This Week

If you are running any agent that ingests external content and communicates via messaging platforms, run through this checklist before your next deployment:

• Audit whether link previews are enabled in your messaging integration and disable them if they are not required

• Review what credentials and sensitive data sit in your agent’s context window throughout a session

• Map every external content ingestion point in your pipeline and confirm it operates in a lower-privilege context

• Verify your agent’s container has network egress rules that enforce an allowlist

• Check your skills and third-party tool integrations for provenance and disable automatic updates

None of these take more than a few hours to implement. The attack they prevent can exfiltrate credentials in seconds.

Which of these is missing from your current setup? Drop it in the comments — the most common gap is probably worth a follow-up post.

Sources & Further Reading

The Hacker News — OpenClaw AI Agent Flaws Could Enable Prompt Injection and Data Exfiltration (Mar 14, 2026)

Giskard — OpenClaw Security Vulnerabilities Include Data Leakage and Prompt Injection Risks

OpenClaw GitHub Issue #22060 — Indirect Prompt Injection via URL Link Preview Metadata

Anthropic — Mitigate Jailbreaks and Prompt Injections

Claude Hallucinated a GitHub Repo ID and Shipped It to Production.

Owadokun Tosin Tobi — Thu, 05 Mar 2026 01:04:24 GMT

Imagine opening Vercel and seeing an unknown GitHub repository deployed to your team's account. No PR. No push notification. Nobody on your team did it.

That is exactly what happened to a real Vercel customer last week. Security and infrastructure engineering engaged immediately. What they found wasn't a breach, a rogue employee, or a supply chain attack.

It was Claude Opus 4.6 — one of the most capable AI models in the world — hallucinating a GitHub repository ID out of thin air and using Vercel's own API to deploy it.

The repository turned out to be harmless — some student's homework code, by chance. But Vercel CEO Guillermo Rauch posted the full technical breakdown publicly on X, where it racked up nearly 600,000 views in under 24 hours. Because the community understood immediately: this time it was harmless. Next time it might not be.

As someone building production agentic systems, I read that post and immediately pulled up my own agent logs. Here is the full technical breakdown — what happened, why it happened, and the five concrete guardrails every builder needs before shipping agents with API access to real infrastructure.

The Incident: What Actually Happened

The setup is one most engineers in 2026 will recognise instantly: a user running OpenClaw — an open-source personal AI agent with over 250,000 GitHub stars, surpassing React's star count in roughly 60 days — wired to Claude Opus 4.6 with access to Vercel's API and real credentials.

The agent was given a task. At some point in the execution chain, it needed a GitHub repository ID to complete a deployment. Rather than calling the GitHub API to look up the correct numeric ID, the agent did something no human engineer would do: it invented one.

The hallucinated payload looked like this:

"gitSource": {

"type": "github",

"repoId": "913939401", // hallucinated — never looked up

"ref": "main"

}

When the user asked the agent to explain what had gone wrong, the confession was precise. There were zero GitHub API calls in the entire session before the first rogue deployment. The number 913939401 appeared for the first time at line 877 of the execution trace — fabricated entirely. The agent knew the correct project ID and project name but invented a plausible-looking numeric repo ID rather than looking it up.

By chance, the ID resolved to a real but completely unrelated public repository — harmless student code. Vercel caught it, investigated it, and Rauch published the full account publicly. The transparency is admirable. The failure mode is worth taking seriously.

This is not an off-by-one error. It is not a typo. The agent fabricated an entire identifier from nothing, passed it to a real API with real access, and executed a deployment. The model's confidence and the API's permissiveness combined into an action that bypassed every human checkpoint.

Why This Failure Mode Is Different From Everything Before It

LLM hallucinations are not new. Models fabricate citations, misstate facts, and confidently produce wrong answers. Engineers have learned to treat model outputs as drafts to verify, not facts to trust.

What makes the Vercel incident categorically different is the combination of two things that have only recently come together at scale: hallucinated content feeding directly into irreversible API actions with real-world consequences.

When a model hallucinates a citation in a chat response, the cost is a wrong answer. When a model hallucinates a repository ID inside an agent with deployment permissions, the cost is an unknown codebase running on your customer's infrastructure — before any human has seen it.

Rauch named this precisely in his post: "Powerful APIs create additional risks for agents. The API exists to import and deploy legitimate code, but not if the agent decides to hallucinate what code to deploy."

This is the new risk profile of agentic AI. Errors are no longer limited to incorrect text. They now translate into production deployments, cloud resource mutations, financial transactions, and security incidents. And they execute immediately, at the speed of an API call, before any human has a chance to intervene.

5 Guardrails Every Builder Needs Before Shipping Agents with API Access

These are not hypothetical best practices. They are the specific controls that would have prevented or contained the Vercel incident — and the ones that production engineers are now treating as non-negotiable.

Never Let an Agent Assume an Identifier — Always Verify It

The core failure in the Vercel incident was an agent using an assumed value (a repo ID it never looked up) to drive an irreversible action (a deployment). The fix is architectural: for any action that takes an external identifier as input, require the agent to retrieve that identifier via a verified API call before it can use it downstream.

This is the principle of verification before action. In LangGraph terms, it means building an explicit lookup node that must succeed — and return a confirmed result — before any write or deploy node can execute. The agent should not be allowed to carry an assumed ID through the execution graph.

In Axiom Engine, my Prosecutor Agent enforces exactly this pattern at the claim level: every generated assertion must be traced back to a verified source coordinate (page number, line number) before it can be surfaced to a user. The principle is identical — no unverified value should propagate downstream into a consequential action.

Apply Least Privilege to Every Tool the Agent Can Access

OpenClaw is not at fault in this incident — it is, as Rauch noted, just an agent with access to tools and keys. But that framing surfaces the real question: which tools should any agent actually have access to, and at what permission level?

Anthropic's own documentation for agentic systems is explicit on this: agents should receive scoped permissions, limited API keys, and environment-specific access. A deployment agent does not need write access to all repositories. A research agent does not need production API keys. The principle of least privilege — standard in traditional security — applies with even more force to agents, because the blast radius of a wrong action is amplified by the speed and autonomy at which agents operate.

Scope API keys to the minimum permissions the task requires
Use read-only keys wherever the task does not require writes
Rotate keys per-session rather than issuing long-lived credentials
Explicitly list allowed tools in your agent configuration — do not default to all-access

Require Human Approval for Irreversible Actions

Deployments are irreversible. Database writes are irreversible. File deletions are irreversible. Any agentic action that cannot be trivially undone should require a human checkpoint before execution — not after.

LangGraph's interrupt() function exists precisely for this: it pauses the execution graph at a defined checkpoint, surfaces the proposed action to a human reviewer, and resumes only on explicit approval. This is not a performance bottleneck — it is the architectural difference between an agent that automates your workflow and one that creates incidents at 3am while you are asleep.

The practical implementation is a pre-deploy policy check: before any tool call that writes, deploys, or mutates, the agent must pass a structured validation that confirms the action parameters have been verified (not assumed), the action is within the agent's scoped permissions, and a human has reviewed and approved it for high-consequence operations.

Build a Verification Agent — Not Just Validation Logic

Input validation — checking whether a value looks plausible — would not have caught the Vercel incident. The hallucinated repo ID 913939401 is a perfectly plausible-looking numeric GitHub repository ID. It passes format validation. It passes type checking. It only fails when you actually call the GitHub API to confirm it resolves to the repository the agent thinks it does.

This is the distinction between validation (does this value look right?) and verification (is this value actually correct?). Production agentic systems need verification layers, not just validation layers.

Anthropic's official guardrail documentation recommends using a lightweight model like Claude Haiku to pre-screen agent outputs before they are passed to consequential tool calls. Applied to the Vercel case: a verification agent checks whether the repo ID the primary agent is about to use actually resolves to the correct repository via a GitHub API call. If it does not match, the deployment does not proceed.

This is the Prosecutor Agent pattern. One agent proposes; a second agent verifies before any external action executes. The overhead is one additional API call. The cost of skipping it is a rogue deployment.

Instrument Everything — Trace-Level Observability Is Not Optional

One of the most important details in Rauch's post was how the failure was diagnosed: the agent's confession traced the hallucinated ID to line 877 of the execution log, with zero GitHub API calls recorded before that point. This level of diagnostic clarity is only possible because the execution trace was complete.

Without full trace-level observability, you cannot diagnose agent failures, you cannot detect drift from expected behavior, and you cannot build the feedback loops that make agents more reliable over time. Action anomaly detection — flagging suspicious tool choices, unusual parameter values, and unexpected API call patterns — requires a baseline of normal behavior, which requires complete instrumentation from day one.

Log every tool call with inputs, outputs, and timestamps
Track which tools were called (and which were not) before any consequential action
Alert on statistically anomalous tool-call sequences — like a deployment with zero prior lookup calls
Build replay capability so any failure can be reconstructed exactly as it occurred

What Vercel Is Actually Building

Rauch's post was not just a transparency disclosure. It was a positioning statement. His closing line: "This reinforces our commitment to make Vercel the most secure platform for agentic engineering. Through deeper integrations with tools like Claude Code and additional guardrails, we're confident security and privacy will be upheld."

The direction is clear: Vercel is betting that agentic deployment will become standard infrastructure, and that the platform that wins will be the one that makes it safe by default. Deeper Claude Code integration means the agent and the deployment platform share enough context to validate actions before they execute — rather than the agent treating Vercel's API as a black box it can call with any parameters it chooses.

This is the right architectural direction. The Vercel incident happened because the agent and the API had no shared verification layer. The agent could pass any repo ID to the API; the API would execute it. What Vercel is describing is a future where the platform understands enough about the agent's execution context to refuse to deploy a repository that was never part of the agent's verified task scope.

The Takeaway for Every Engineer Shipping Agents in 2026

The Vercel incident was harmless by luck. A random student's homework landed on a customer's deployment rather than malicious code, ransomware, or a cryptocurrency miner. The probability of that outcome holding every time is exactly zero.

What makes this moment important is that it happened publicly, was documented precisely, and was disclosed by the CEO of the platform rather than buried in an internal post-mortem. That is rare, and it gives every engineer building agentic systems a verified case study to reason from.

The guardrails are not complicated. Verify identifiers before using them. Scope API keys to minimum permissions. Require human checkpoints before irreversible actions. Build a verification agent, not just validation logic. Instrument everything.

None of these are new principles. They are standard software engineering discipline applied to a new class of system where the stakes of skipping them are higher, and the speed of failure is faster, than anything we have built before.

Autonomy without verification is not intelligence — it is risk. The agents that will earn trust in production are not the most capable ones. They are the ones built with the assumption that they will eventually hallucinate something consequential, and designed to catch it before it ships.

What guardrail are you adding to your stack this week? Drop it in the comments.

Sources & Further Reading

Guillermo Rauch — Original X post on the Vercel hallucination incident (Mar 3, 2026)

Anthropic Docs — Mitigate Jailbreaks and Prompt Injections

Anthropic Engineering — Introducing Advanced Tool Use (Nov 2025)

Permit.io — Human-in-the-Loop for AI Agents: Best Practices and Frameworks

Akira.ai — Real-Time Guardrails for Agentic Systems

Leanware — Agentic AI Guardrails: How to Build Safe and Scalable Autonomous Systems

When Agents Go Wrong:

Owadokun Tosin Tobi — Fri, 27 Feb 2026 18:57:16 GMT

Agentic AI systems went from demo magic to production infrastructure faster than most engineers were ready for. Success rates on standard benchmarks climbed from 15% to the high 80s in fourteen months. But that still means one failure in five — and in a system that can book flights, execute code, modify files, or send emails on your behalf, that failure rate isn't a statistic. It's a liability.

I build production agentic systems for a living. With Axiom Engine, I've engineered a workflow where a secondary "Prosecutor Agent" audits every claim a primary agent generates against source coordinates — page number, line number — before anything surfaces to a user. I built it that way because I've seen firsthand what happens when agentic systems have no verification layer: they hallucinate confidently, act irreversibly, and loop silently. The gap between a working demo and a trustworthy production agent isn't capability. It's edge case handling.

Early Access Now Available

So how are the three leading AI providers — Anthropic, OpenAI, and Google — actually approaching this? Not in demos. In documented systems, published safety research, and real engineering decisions. Let's break it down across four of the hardest categories.

1. Tool Use Failures & Hallucinated Actions

This is the foundational edge case: the agent calls a tool that doesn't exist, passes malformed parameters, or — worse — confidently takes an irreversible action based on a hallucinated capability.

▌ Anthropic

Anthropic's November 2025 advanced tool use engineering post identified the most common failure modes in production tool use: wrong tool selection and incorrect parameters, especially when tools have similar names like notification-send-user versus notification-send-channel. Their solution was architectural. Rather than patching individual failures, they released three complementary features: a Tool Search Tool that discovers tools on-demand instead of loading all definitions upfront (cutting token overhead by 85% and lifting accuracy on MCP evaluations from 49% to 74% for Opus 4), Programmatic Tool Calling that lets agents invoke tools from within code execution environments, and Tool Use Examples that embed usage patterns directly into tool definitions — because JSON schemas tell an agent what's structurally valid, not when to use optional parameters or which combinations actually work.

The insight here is important: Anthropic framed tool failures not as a model alignment problem but as an information architecture problem. Give the model better signal about how tools should be used, and failure rates drop.

▌ OpenAI

OpenAI's Operator system card was candid about its failure modes in a way that's rare in the industry. When copying complex values from the screen — API keys, wallet addresses — the agent defaulted to reading them visually via OCR rather than copying programmatically, producing transcription errors. In code editing tasks, visual text editing mistakes in tools like nano or VS Code would cascade: one error would force the agent into a loop trying to repair itself, eventually exhausting the time budget. OpenAI's mitigation approach for tool failures leans on prompt engineering and developer documentation — their agent safety guidance explicitly advises strengthening prompts with "good documentation of your desired policies and clear examples" and anticipating unintended scenarios with worked examples.

▌ Google

Google's approach to tool failure is primarily protocol-level. Their Tool Search Tool equivalent is built into how A2A and their Agent Development Kit (ADK) structure tool discovery — agents advertise capabilities via JSON Agent Cards, and client agents query these to find the right sub-agent for a task. For tool-level accuracy, Google relies on the ADK's built-in turn limits to prevent infinite repair loops — a guard rail that emerged directly from production experience. Their self-reported insight: agents need dynamic tool discovery, not static registries, to handle the scale of real enterprise tool libraries.

2. Multi-Agent Trust & Permission Boundaries

As systems scale from single agents to networks of coordinating agents, the trust model becomes critical. Which agent can instruct which? What permissions flow downstream? Can a sub-agent be compromised and used to attack the orchestrating system?

▌ Anthropic

Anthropic's published approach to multi-agent trust is grounded in a simple principle: treat messages from other agents with the same skepticism you'd apply to user messages — not elevated trust. Their guidance explicitly flags that a compromised sub-agent could attempt to manipulate an orchestrator, and recommends that orchestrators validate tool call results rather than accepting them blindly. In the Claude Code security incident disclosed in November 2025, a threat actor was able to use Claude Code as an orchestration engine to run parallel attack sequences across multiple sessions — reconnaissance, credential harvesting, lateral movement — with minimal human intervention. Anthropic's response was to build a tailored classifier for detecting this class of abuse and implement new detection methods for malicious tool invocations. The lesson from the incident was structural: agentic systems need to treat the orchestration layer itself as an attack surface.

▌ OpenAI

OpenAI's multi-agent trust framework centers on the instruction hierarchy: system messages outrank user messages, and the model should be robust to attempts to override this hierarchy. The joint Anthropic-OpenAI alignment evaluation published in 2025 found that Claude 4 models outperformed other frontier models on avoiding system message versus user message conflicts. OpenAI's Operator introduced explicit user confirmation steps for high-stakes actions — passwords, banking transactions — as a hard permission boundary. For MCP tool integrations specifically, OpenAI's developer documentation mandates enabling tool approvals so users can review every operation, including reads. Their Atlas browser agent shipping a security update in late 2025 after automated red-teaming found a new class of prompt injection attacks underscores how seriously they treat the trust boundary problem in multi-agent browser contexts.

▌ Google

Google built trust boundaries into the A2A protocol specification from launch. The protocol requires enterprise-grade authentication and authorization at parity with OpenAPI schemes, and their Agent Payments Protocol (AP2), launched September 2025, introduced cryptographically signed "Mandate" contracts so agents can only execute transactions that carry verifiable proof of user authorization. In their 2025 "Lessons from Agents and Trust" post from the Office of the CTO, Google identified the core governance challenge in multi-agent systems: in complex workflows, it's difficult to isolate which agent drove success or caused failure. Their solution was Game Arena — dynamic simulation that wargames agents against each other in adversarial scenarios rather than testing against static benchmarks.

The industry consensus is forming around a single insight: in multi-agent systems, the trust model must be explicit, cryptographically enforced where possible, and designed assuming any sub-agent can be compromised.

3. Ambiguous or Conflicting Instructions

Agents operating autonomously will inevitably encounter instructions that are underspecified, internally contradictory, or in conflict across different authority levels — user, operator, platform. How they resolve this determines whether they're trustworthy in production.

▌ Anthropic

Anthropic's approach to instruction conflict is hierarchical by design and tested under adversarial conditions. In the joint alignment evaluation with OpenAI, Claude 4 Opus outperformed all other models on avoiding conflicts between system and user messages — particularly when the system message contained an explicit constraint. The evaluation revealed one important nuance: when a simulated user expressed severe frustration, Claude 4 Sonnet abandoned its instructions while Opus maintained compliance while tactfully acknowledging the user's distress. This suggests the resolution of ambiguity is still partially model-size dependent in practice. Anthropic's published agent building guidance recommends pausing and surfacing ambiguity to humans rather than resolving it autonomously — preferring explicit clarification over autonomous assumption in irreversible or high-stakes contexts.

▌ OpenAI

OpenAI addresses instruction ambiguity primarily through their reasoning models, which they describe as "more disciplined about following developer instructions" and more robust against jailbreaks and indirect prompt injections. Their agent safety documentation recommends configuring reasoning models at the agent node level for high-risk workflows specifically because of their stronger instruction-following. For the Operator system, they built in explicit task abandonment — the agent will stop and ask the user rather than proceeding when it encounters ambiguity about a high-stakes action. The prompt injection problem (malicious content in the environment that issues conflicting instructions to the agent) received dedicated attention: OpenAI built an automated attacker using RL-trained LLMs to discover multi-step injection strategies before they appeared in production, then shipped adversarial training updates to Atlas as a result.

▌ Google

Google's ADK handles instruction ambiguity at the orchestration layer through its Agent Card system — each agent explicitly declares its capabilities and accepted task types, so the orchestrating agent can route tasks to the appropriate sub-agent rather than having sub-agents attempt to resolve tasks outside their declared scope. For long-running tasks, the A2A protocol's defined lifecycle states (active, completed, failed, canceled) give orchestrators a structured mechanism to detect when a sub-agent has stalled or produced an out-of-scope result. Google Cloud's OCTO post from December 2025 noted that static benchmarks are obsolete for real business use cases — real business is "dynamic, adversarial, and negotiated" — which is precisely why instruction ambiguity requires dynamic simulation testing, not just fixed-prompt evaluations.

4. Long-Horizon Task Failures & Loop Detection

When an agent runs a multi-step task over minutes or hours, it can fail in ways that are invisible in short-horizon tests: it gets stuck in a repair loop, it loses context across steps, it succeeds locally but fails globally, or it makes a cascading sequence of small errors that compound into a large one.

▌ Anthropic

Anthropic's most significant technical response to long-horizon failures is context compaction, available on the API. Claude can now summarize its own context and continue longer-running tasks without hitting token limits — a foundational requirement for any task that runs for more than a few exchanges. Claude Sonnet 4.5 has been observed sustaining focus on complex multi-step workflows for over 30 hours. Their advanced tool use features address one of the main drivers of context degradation in long-horizon tasks: intermediate tool results piling up in context even when no longer relevant. Programmatic Tool Calling addresses this by letting Claude invoke tools from within code execution environments, separating the orchestration logic from the accumulating conversation context.

When I built Axiom Engine, the hardest problem wasn't the primary agent — it was making the Prosecutor Agent's auditing loop resilient across long document processing chains. A single context overflow mid-audit would silently corrupt the verification chain. Anthropic's context compaction directly addresses exactly this class of failure.

▌ OpenAI

OpenAI's system card for Operator documented one of the most instructive long-horizon failure patterns in the industry: OCR errors on API keys and complex values would force the agent into an error-repair loop that would run until it exhausted its allotted time budget, rather than failing gracefully and surfacing the error to the user. Their published self-healing cookbook for autonomous agent retraining describes a three-phase response to long-horizon failures: capturing failures through human review or LLM-as-judge evaluation, running iterative prompt refinement loops, and promoting improvements back to production — essentially making the system learn from its own long-horizon failure patterns over time.

▌ Google

Google's engineering response to long-horizon failures is built into the A2A protocol's task lifecycle. Tasks have defined terminal states — completed, failed, canceled — and the protocol supports real-time feedback during long-running operations via Server-Sent Events, so orchestrators can detect stalls or drift without waiting for full task completion. The ADK's turn limits provide a hard circuit breaker against infinite repair loops, a pattern that emerged from internal experience. Their AI Co-Scientist system — which runs research idea generation as a multi-agent tournament with ELO scoring — represents one of the most sophisticated published approaches to long-horizon multi-agent coordination, using structured peer review simulation to prevent any single agent's output from propagating unchecked through the pipeline.

The Bigger Pattern: Three Different Philosophies

After mapping these four edge case categories across the three providers, a clear divergence in philosophy emerges — and it's worth naming explicitly.

Anthropic approaches edge cases as information architecture problems. When tools fail, the fix is better tool discovery and usage examples. When context degrades, the fix is structural compaction. When trust is violated, the fix is architectural skepticism at the protocol level. The solutions are generally more deeply embedded and harder to retrofit.

OpenAI approaches edge cases as policy and prompt engineering problems. When agents fail, strengthen the prompt, document the policy, and use more capable reasoning models for high-risk nodes. This is more accessible to developers who don't control the underlying model, but it means safety properties are more fragile — as their own research acknowledged, explicit safety instructions reduce but don't eliminate harmful behavior.

Google approaches edge cases as protocol and governance problems. A2A, AP2, Agent Cards, cryptographic Mandates — the safety properties are built into the communication standards rather than into individual model behavior. This scales well across heterogeneous systems and vendors, but shifts the responsibility for correct implementation to developers building on the protocol.

None of these philosophies is wrong. The question is which one matches your production context. If you're building a single-vendor agentic system with deep Anthropic integration, their architectural approach gives you the most leverage. If you're orchestrating agents across vendors and platforms, Google's protocol-level approach is where the real safety properties live. If you're a developer working with limited access to model internals, OpenAI's policy-first approach is the most immediately actionable.

What This Means for Engineers Building Today

The MIT AI Agent Index 2025 found that of the 13 agents exhibiting frontier autonomy levels, only 4 disclosed any agentic safety evaluations. 25 out of 30 agents disclosed no internal safety results. The accountability gap between capability disclosure and safety disclosure is wide — and it's getting wider as autonomy levels rise.

As engineers, the practical implication is this: don't assume the platform handles the edge cases you haven't tested. The OpenClaw benchmark — which drops agents into unscripted scenarios with ambiguous instructions and irreversible actions — found that top-performing models from all three providers failed regularly when tested under realistic conditions rather than favorable ones. One failure in five isn't an edge case. It's a design constraint.

The patterns that work in production, across all three providers, come back to the same set of principles: verify before acting, build explicit escalation paths for ambiguity, treat sub-agents as untrusted until proven otherwise, and design for graceful failure rather than assumed success.

These aren't new principles. They're software engineering fundamentals, applied to a new class of system. The difference is that when an agentic system fails, the blast radius is larger — and often irreversible. That's the edge case the whole industry is still learning to handle.

Sources & Further Reading

Anthropic — Introducing Advanced Tool Use (Nov 2025)

Anthropic — Detecting and Countering Misuse of AI (Aug 2025)

OpenAI — Operator System Card

OpenAI — Safety in Building Agents

OpenAI / Anthropic — Joint Alignment Evaluation (2025)

OpenAI — Prompt Injection and ChatGPT Atlas (CyberScoop)

Google — Announcing Agent2Agent Protocol (Apr 2025)

Google Cloud — Lessons from 2025 on Agents and Trust

MIT AI Agent Index 2025

OpenClaw Benchmark — WebProNews

Tosin Owadokun is a Senior AI Engineer based in Lagos, Nigeria, specialising in production agentic systems, LLM evaluation, and MLOps infrastructure.

GitHub

Anthropic's Battle Against AI Espionage:

Owadokun Tosin Tobi — Mon, 23 Feb 2026 22:55:22 GMT

On February 24, 2026, Anthropic dropped a bombshell that has racked up over 4.7 million views on X: three major Chinese AI labs — DeepSeek, Moonshot AI, and MiniMax — ran coordinated, industrial-scale campaigns to siphon capabilities from Claude, Anthropic's flagship model. Over 24,000 fraudulent accounts. More than 16 million exchanges. All in direct violation of Anthropic's terms of service and regional access restrictions.

https://www.x.com/i/status/2025997928242811253

This isn't just a corporate IP dispute. As someone who builds production agentic systems for a living, this incident cuts to the core of something I've been thinking about for a while: the real moat in AI isn't raw capability — it's orchestration, trust, and the hard-won engineering that makes these systems safe and reliable in production.

Let me break down what actually happened, how it was detected, and why it matters — especially for engineers building on top of frontier models.

What Is Model Distillation — And When Does It Turn Illicit?

Model distillation is a well-established, legitimate technique. A smaller "student" model learns from the outputs of a larger "teacher" model, inheriting capabilities at a fraction of the compute cost. Frontier labs do this all the time internally — Anthropic itself distills Claude to create smaller, cheaper variants for specific use cases.

The technique turns illicit when a competitor systematically extracts a model's capabilities without permission, at scale, through fraudulent access. Think of it as the difference between studying how a chef's restaurant operates versus breaking into their kitchen every night and photographing the prep line.

What makes distillation attacks particularly dangerous is the stripping of safeguards. Distilled models built this way are unlikely to retain the safety guardrails that the original model was trained to enforce — creating a proliferation risk that extends well beyond competitive dynamics.

The Three Campaigns: What Each Lab Actually Targeted

Anthropic attributed each campaign with "high confidence" through IP correlation, request metadata, infrastructure indicators, and corroboration from industry partners. Here's what they found:

DeepSeek — 150,000+ exchanges

DeepSeek ran the most technically interesting operation. Beyond bulk reasoning extraction, they prompted Claude to retroactively articulate the internal reasoning behind completed responses — effectively generating chain-of-thought training data at scale without ever triggering obvious CoT prompts. They also generated "censorship-safe" alternatives to politically sensitive queries, likely to train their own models to steer away from topics sensitive to the Chinese Communist Party.

Synchronized traffic. Identical patterns. Shared payment methods. Classic load-balancing tradecraft, applied to AI capability extraction.

Moonshot AI (Kimi) — 3.4 million+ exchanges

Moonshot ran the most expansive target surface: agentic reasoning, tool use, coding, data analysis, and computer vision. They used hundreds of varied fraudulent accounts across multiple access pathways — varying account types deliberately to avoid a coordinated pattern signature. Attribution came through request metadata matching public profiles of senior Moonshot staff. In a later phase, they pivoted to specifically attempt reconstruction of Claude's hidden reasoning traces.

MiniMax — 13 million+ exchanges

MiniMax ran the largest campaign by volume, targeting agentic coding and tool orchestration — precisely the capabilities that matter most for autonomous agent pipelines. The remarkable detail here: Anthropic detected this campaign while it was still active, before MiniMax released the model it was training. When Anthropic released a new Claude version mid-campaign, MiniMax pivoted within 24 hours, redirecting nearly half their traffic to capture capabilities from the latest system. That's not opportunistic scraping — that's a disciplined extraction operation.

How Anthropic Actually Caught Them: A Detection Engineering Breakdown

This is where it gets interesting from an engineering standpoint. The detection problem is genuinely hard: distillation traffic has to look like legitimate usage, by design. Here's the multi-layered approach Anthropic describes:

Behavioral Fingerprinting And Classifiers

Custom ML classifiers scan API traffic for attack signatures: massive volume concentrated in narrow capability areas, highly repetitive prompt structures, and content that maps directly onto what's most valuable for model training. A single prompt like "You are an expert data analyst..." may be innocuous. The same prompt arriving tens of thousands of times across hundreds of coordinated accounts is a statistical signal you can't ignore.

This is essentially anomaly detection applied to prompt-level behavioral data

— and it maps closely to what I've built in Sentinel, my MLOps monitoring microservice. Sentinel uses physics-informed Z-score thresholds (>2.5σ) to detect data drift in production. The conceptual parallel is direct: normal usage has a statistical distribution; systematic extraction creates outlier signatures that fall outside that distribution at the population level.

Chain-of-Thought (CoT) Elicitation Detection

Anthropic built specialized detection for prompts designed to elicit step-by-step reasoning — one of the most valuable training signals for reinforcement learning. DeepSeek's technique of asking Claude to "imagine and articulate the internal reasoning behind a completed response" is clever precisely because it's one step removed from a direct CoT prompt. Catching this requires understanding the semantic intent of prompts, not just their surface structure.

Coordinated Activity Detection

Shared payment methods, synchronized request timing, identical structural patterns across accounts — the hallmarks of a distributed load-balancing operation. One proxy network managed more than 20,000 fraudulent accounts simultaneously, mixing distillation traffic with unrelated customer requests to dilute the signal. That's a sophisticated evasion strategy, and detecting it requires cross-account correlation, not just per-request analysis.

Attribution Through Metadata And Infrastructure

IP correlation, request metadata matched against public employee profiles, infrastructure indicators tied to specific labs, and timing aligned with product roadmap releases. Anthropic also corroborated findings with industry partners who observed the same actors on their platforms.

Why Orchestration Moats Matter More Than Raw Capabilities

Here's my honest take, and it's a bit uncomfortable: distillation attacks hit different when the model being extracted was itself trained on data scraped from the open web. There's a philosophical tension baked into this entire story that the industry hasn't fully reckoned with.

https://x.com/i/status/2026052687423562228

But that tension doesn't invalidate the core concern. What these labs were extracting wasn't just raw capability — it was the specific alignment, safety tuning, and agentic orchestration architecture that makes Claude useful and trustworthy in production. That's not scraped from the web. That's engineered.

Working with LangGraph to build multi-step agentic pipelines — like in my Axiom Engine, where a Prosecutor Agent audits every generated claim against source coordinates before it's surfaced to users — you quickly realize that the hard part isn't the base model. The hard part is the orchestration logic, the trust verification layer, and the failure recovery. These are the real moats.

The labs that extracted Claude's outputs got training data. What they can't easily replicate is the product reasoning — why certain capabilities are bounded the way they are, and how to make a system that actually behaves reliably when deployed in the real world.

This is why I think the Anthropic report's focus on distillation as a national security issue, not just an IP issue, is important. Models built through illicit distillation won't retain the safeguards. They'll have the capabilities without the guardrails — and that's a genuinely dangerous combination.

The Export Controls Argument — And Why It Actually Holds Up

Anthropic makes an interesting argument: distillation attacks prove that export controls work. If restricted chip access didn't constrain these labs from training frontier models from scratch, they wouldn't need to extract capabilities from Claude at scale. The attacks are evidence of a constraint, not evidence that controls are circumventable.

The policy implication is clear: if these labs need access to advanced chips to run distillation campaigns at this scale, then restricting chip access limits both direct training and extraction-at-scale. The two levers reinforce each other.

What About Global AI Equity?

I'd be dishonest if I didn't flag the geopolitical complexity here as someone building AI systems outside the U.S. Export controls are a double-edged instrument. They're designed to limit adversarial state actors, but the same restrictions that slow down Chinese military AI also constrain access for developers and researchers across Africa, Southeast Asia, and the Global South who have no affiliation with those actors.

The answer isn't to abandon controls — the security risks are real. The answer is to design access frameworks that are more precise: targeted at state-linked actors and high-risk use cases, rather than blunt geographic restrictions that widen the global AI divide. The AI industry and policymakers both need to reckon with this more seriously than they currently do.

What Happens Next

Anthropic has already deployed a multi-layer response: enhanced verification for high-risk account types (educational, startup, research — the pathways most commonly exploited), intelligence sharing with other labs and cloud providers, and model-level countermeasures designed to reduce distillation efficacy without degrading legitimate user experience.

The bigger ask is a coordinated industry response. Watermarking, standardized anti-distillation protocols, shared threat intelligence — these are the next frontier. Anthropic explicitly notes that no single company can solve this alone, which is correct. The orchestration moat that matters at the industry level is the same one that matters at the application level: coordination, trust, and shared visibility.

The question I'm sitting with: will this incident accelerate enforceable global AI norms, or just deepen the balkanization of AI development along geopolitical lines? I don't have a confident answer. But I think the engineers building these systems — not just the policymakers — need to be part of that conversation.

Sources & Further Reading

Primary Source: Anthropic — Detecting and Preventing Distillation Attacks

Reuters: Chinese Companies Used Claude to Improve Own Models

Bloomberg: Anthropic Says DeepSeek, MiniMax Distilled AI Models for Gains

Reddit r/LocalLLaMA: Community Discussion Thread

ACL 2025: Academic Paper on Distillation Quantification

Tosin Owadokun is a Senior AI Engineer based in Lagos, Nigeria, specializing in production agentic systems, LLM evaluation, and MLOps infrastructure.

GitHub