Why agent security is a hard problem to solve

AI agents mix private data, external actions, and untrusted content. Jari Worsley explains prompt injection, why LLMs blur the lines between code and data, and why defending them is so hard.

Security, AI, and agents are an impossibly large topic to cover in a blog. But let me try and give a taste of the problems in plain language to encourage you to investigate and learn more yourself. I have to assume some familiarity with security concepts.

The most secure computers are those with no network access, in secure rooms, with the very minimum of functionality needed to do their job. That is the opposite of OpenClaw and any type of AI Assistant or Agent.

The trifecta of security is availability, confidentiality and integrity. Computers in secure rooms don't meet the available criteria unless you only need them to be available in that room.

The lethal trifecta

Let's start with "the lethal trifecta" (credit to Simon Willison). To increase the usefulness (and risk) of an agent, we give it access to our private data (so it can assist us), the ability to communicate externally, and exposure to untrusted content (our emails, WhatsApp, Slack, etc). The same three capabilities increase risk: the private data is where our secrets are, the ability to communicate externally is how those secrets leak out, and the untrusted content is the attack path.

Let's concentrate on the untrusted content. If you have some security experience, you might now be thinking about SQL injection attacks, or more generally about injection attacks: e.g., cross-site scripting, remote code execution, and more. Untrusted user input is sent to an interpreter, which then executes part of it.

You might consider how to protect against them. User input just needs proper validation, and then we are good to go, yes?

Wrong.

At this point, you have the wrong mental model. Aspects of LLM make this very, very different.

LLM do not process text.

LLM do not process text. They process tokens. There is no distinction between the code and the data in those tokens. That is, the code (your prompt) is just another token alongside your data (the email you want summarised). To the model, it is all just tokens. And if those tokens call tools that email your secrets somewhere? It was all in the tokens: this is "prompt injection". Something that OpenAI describe as "remains an open challenge for agent security".

Simon Willison coined the term 'prompt injection,' I think. He considers it unsolvable, and I'm inclined to agree. Poisoning the context window is something very difficult to defend against.

And you still have the wrong mental model.

Take this one step further (with credit to the Angles of Attack team). It might NEVER be possible to properly defend against prompt injection attacks. It's all in the maths, take a deep breath and read on.

Let's hone in on the token step. Language models don't read text. They process numbers. The text-to-numbers step is a compression, mapping a higher-dimensional space to a lower one. Like converting from a 3-dimensional object to a 2-dimensional one, such as a cube to a square. Or a trapezoid to a square. Or an infinite number of shapes to a square.

Still with me?

Language models across many dimensions

Now take that idea and apply it to language models across many dimensions (e.g., 25). There are an endless variety of ways to encode an attack prompt to yield the same 25-dimensional compressed input that the model processes. You may have heard of this type of attack from the Echogram vulnerability.

The same capability that makes language models useful, extracting meaning from text, is also what makes them vulnerable, which is an open AI security problem and may never be solved.

Urgh. Now what?

What I've highlighted here is just a taster of the issues.

There is much hard work and learning to do. Don't assume that your frame of reference from earlier technology applies to neural networks. Look at the OWASP threats for agentic applications.

Of course, big risks and big opportunities go hand in hand. There is a calculated risk or tradeoff, right? If you can afford to lose everything you give the bot, then you're fine. Trading off your need for confidentiality against availability and utility. But it pays to truly understand the risk.

Interested in what Jari has discussed?

Not sure what your APIs are exposing when agents interact at scale? Learn more about ClawdBot and agent security—and get guidance from our experts on securing, observing, and shaping agent behaviour before it goes viral.

Connect with Jari