SecurityJun 13, 2026

Which prompts should a human review to catch prompt injection?

Duane Grey

AI Strategy & Implementation

Find this useful? Share the visual.

The Short Version

The cheapest defense against prompt injection is a human confirming a risky action, but only if the prompt fires on the inputs that deserve it. I tested whether a small local embedding model can score untrusted messages by spotting a line that breaks from the rest of the message and reads like an instruction to the AI agent, cheaply enough to run on each message. It catches obvious injections well, catches a disguised one about half the time, and the catch rate climbs when the model is seeded with attack examples. The cost is about 274 MB of memory and a couple of milliseconds of embedding per sentence.

In an earlier piece on building around an AI brain you cannot trust, I made a claim I have been thinking about since. A few seconds of human attention buys more security than most things in the stack. The trouble is the prompt that interrupts you for those few seconds. If an AI agent makes you confirm everything it does, you stop reading and start clicking approve. If it confirms nothing, the injection it absorbed from a hostile email runs unchecked. A confirmation step earns its interruption when it fires on the inputs that deserve a second look, and stays quiet otherwise.

So this week I ran a small experiment. Can an agent decide cheaply, on its own, which untrusted messages are worth a human's glance?

An instruction that does not belong

When an agent reads an email or a web page, a prompt injection is an instruction hidden inside content that was supposed to be data. The defense most architectures reach for is to wrap that outside content in a labeled block and tell the agent to treat it as information, not orders. That helps. It does not finish the job, because the agent still follows a well crafted injection some of the time.

The check I tested looks for a sentence that does not belong. Turn a message into an embedding, which is just a list of numbers that places text by meaning, so two lines about the same topic land near each other and two unrelated ones land far apart. Break the message into its sentences, embed them, and look for one whose meaning pulls away from the rest of the text and toward instructions aimed at the agent. An email about an invoice holds together. A line that says to forward the inbox to an outside address sticks out. I built that direction once, from a handful of example attacks, then measured how far each line leaned that way.

The embedding check embeds each sentence of an incoming message, flags the line that breaks from the main cluster and leans toward agent-oriented instructions as a suspicious outlier, and escalates the message to a human for review before any risky tool call

What it caught, and what it missed

I ran the same test on four local embedding models through Ollama, ranging from 45 MB to 1.2 GB:

all-MiniLM, 45 MB
nomic-embed-text, 274 MB
mxbai-embed-large, 670 MB
bge-m3, 1.2 GB

Two numbers matter. How many real injections the check catches, and how often it flags a clean message by mistake. I held the mistake rate near three in a hundred and measured the catch rate.

On obvious injections, the ones that read like "ignore your instructions and do this instead," each of the four models caught nearly all of them, from about nine in ten up to the full set. What stood out was the size of the models. all-MiniLM, at 45 MB, did as well on these as bge-m3 at more than twenty times that.

Then I wrote a second set of attacks in the voice of the surrounding email. A polite line about skipping an approval step and releasing a payment to a new account. A request to copy a customer's banking details into a reply to an outside address, worded like ordinary business. The catch rate fell by about half, down to between a third and two thirds depending on the model. An attacker who matches the tone of the message slips past a check tuned on the obvious ones.

Teaching it works, but the model matters

I seeded the check with a handful of disguised examples, kept separate from the ones I tested against, and the catch rate on the disguised attacks climbed back up. On two of the four models, nomic-embed-text and bge-m3, it reached every disguised attack in the test. I read that as evidence the approach can be taught, not as a finished number. My synthetic disguised attacks group together more tightly than real ones would, and the honest version of this needs a labeled set of real injections before I would trust the figure.

Seeding helped some models and not others. all-MiniLM, the cheapest model and the strongest on the obvious attacks, barely moved on the disguised ones even after seeding. The two that learned them, nomic-embed-text at 274 MB and bge-m3 at 1.2 GB, are larger and trained differently. The best choice depends on the threat you are defending against. A small fast model catches the obvious majority. A different one gives you a chance at the careful attacker.

Here is the whole test in one place. Obvious is how many of the blatant injections each model caught. Disguised before and after is the catch rate on the attacks written in the email's own voice, first with the check tuned only on obvious examples, then after seeding it with disguised ones. Embed is the cost per sentence on the headless GPU machine, with the false alarm rate held near three in a hundred throughout.

Model               Size    Obvious  Disguised  Disguised   Embed
                                       (before)   (seeded)
-----------------   ------  -------  ---------  ---------  -------
all-MiniLM           45 MB     100%        63%        57%   1.5 ms
nomic-embed-text    274 MB      92%        60%       100%   2.3 ms
mxbai-embed-large   670 MB      95%        30%        48%   5.5 ms
bge-m3              1.2 GB     100%        62%       100%   8.4 ms

The two cheapest reads of the table point in opposite directions. all-MiniLM is the cheapest to run and the best on obvious attacks, and the one seeding cannot lift. nomic-embed-text gives up almost nothing on cost and takes the disguised attacks to the full set once seeded.

What it costs

The reason any of this is worth doing is that it is cheap. On the headless GPU machine where this would run, scoring a message costs around two milliseconds of embedding work per sentence, batched into a single call per message, against nomic-embed-text holding about 274 MB of memory. That is small enough to run on each untrusted message an agent reads. Not as a gate that blocks anything, but as a score that sorts which messages are worth a human's glance.

Where it sits in the defense

Think of it as triage. The work of stopping a bad action belongs to the step that checks a tool call before it runs and asks for confirmation on anything risky that outside content influenced. A check like this does not replace that step. It feeds it. It raises the catch rate on the messages a person should review while holding the false alarms low enough that they keep reading the prompts instead of clicking through them. The earlier piece called the confirmation prompt the cheapest security control available. This is one way to stay selective enough to keep it cheap.

My Observations

Detection buys triage, not a guarantee. The obvious attacks are caught for almost nothing. The careful one still needs the human, and the most useful thing a cheap check does is put the human's attention on the right message. Stopping the careful attacker still falls to the structural defenses. They separate outside content from instructions, and they confirm any risky tool call a human has not approved.

The model is a choice, not a default. The cheapest embedding model was the best on obvious attacks and the worst at learning disguised ones. Picking one means picking which attacker you are defending against, which is a decision worth making on purpose rather than by which model was already loaded.

Trust starts at the source. The check runs on content from outside the trusted user, the inbound emails and fetched pages, not on what the owner of the agent typed. A polite instruction from the person who runs the agent is not the same as those words arriving in a stranger's email. The earlier work on tagging where each piece of content came from is what lets the check point at the right inputs to begin with.

What this does not solve

An attacker who writes the injection in the exact voice of a normal message scores low and can still get through. A human who approves the confirmation prompt without reading it defeats the whole chain, the same way they defeat the gates above it. And my numbers come from synthetic messages I wrote myself. They show the method works in principle and what it costs to run. They do not tell you the catch rate you would see against real attacks in your own inbox. That figure needs real data, and measuring it is the next thing I would do before trusting this in front of anything that can act.

If you run an agent that reads anything you did not write, it is already making a quiet decision about which of those inputs to act on. How does it decide which ones deserve a second look?

By the Numbers

Researchers proposed detecting prompt injection by measuring the change in an input's embedding against a clean version of the same input, reporting over 93% detection accuracy at under a 3% false alarm rate across Llama 3, Qwen 2, and Mistral. The evaluation paired every injected input with a hand cleaned copy of itself.

Sekar et al., 'Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs,' arXiv:2601.12359, January 2026

An independent stress test of Claude Code's auto mode permission classifier reported an overall false negative rate of 81% across 128 prompts and 253 adversarial actions, against the vendor's published figure of 17%. The classifier is the automated gate meant to catch a manipulated tool call before it runs.

Ji et al., 'Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode,' arXiv:2604.04978, April 2026

Securing AI agents against prompt injection requires tracking the provenance of the data the agent reads, so the system knows when a tool call has been influenced by external content. Permission gating on tool calls alone is insufficient without that tracking.

Costa and Köpf, 'Securing AI Agents with Information-Flow Control,' arXiv:2505.23643

Written by Duane Grey

AI Strategy & Implementation

Independent AI consultant helping companies cut through hype and deploy systems that produce real results.

Follow on LinkedIn Follow on YouTube

Related Insights

Architecture · May 16, 2026What is the right architecture when the AI brain itself cannot be trusted?Security · May 4, 2026Why is authentication not enough to protect AI agents from prompt injection?Security · Apr 28, 2026What happens when an AI monitor has a relationship with the agent it is watching?