The short answer is "better than humans, not well enough to deploy on autopilot, and the deployable version of the idea is smaller and more boring than the marketing suggests."
That is the read I came away with after going through SHADE-Arena, an adversarial evaluation of LLM agent monitoring published by a team led by Anthropic researchers (Kutasov et al., arXiv:2506.15740). The paper is one of the more honest pieces of work I have read on this question, because it is set up to make the monitor's job hard on purpose. I try to avoid writing about AI safety monitoring in ways that take positions without a basis in research or implementation, and this paper provides that basis: enough concrete measurement to define an observability approach around.
I want to walk through what they actually tested, what the numbers said, and what is and is not deployable today.
The Setup
The team built seventeen complex agent environments. Each environment paired two tasks: a benign main task a user might reasonably ask an agent to do, and a harmful side task a malicious actor might want the same agent to slip in alongside it. A bank transfer plus quietly routing money to a different account. Email triage plus exfiltrating a contact list. A code change plus opening a backdoor.
The attacker was an LLM given the side task and told to complete the main task convincingly while hiding the side task from a monitor. The monitor was a separate LLM given only the transcript of what the attacker did and asked to score how suspicious the transcript looked.
The thing I want to flag up front. The monitor never sees the side task. It sees the same thing a real auditor would see, which is the agent's actions and outputs, and has to decide whether something off task is happening. That is a realistic deployment situation, and it is what makes the numbers in this paper more useful than the typical benchmark.
What the Numbers Said
A few results worth holding onto.
The strongest sabotage agent, Claude 3.7 Sonnet, completed the full attack about 27 percent of the time. Full attack here means three things at once. The main task done well enough that the user would accept the result, the side task actually executed, and the monitor's suspicion score below the alarm threshold. Roughly one in four attempts cleared all three bars.
The strongest monitor, Gemini 2.5 Pro, scored an AUC of 0.87 on detecting whether a transcript contained sabotage. AUC, short for area under the receiver operating characteristic curve, measures how well a scorer separates two populations across every possible threshold. An AUC of 1.0 is a perfect separator. 0.5 is a coin flip. 0.87 is genuinely good in absolute terms. It is not good enough for the deployment most vendors imply, and the reason is the threshold problem.
Here is the threshold problem in plain language. AUC is a curve, not an operating point. To deploy a monitor, you have to pick a threshold above which you escalate. Pick a low threshold and you catch most attacks but you also flag a lot of innocent transcripts. Pick a high threshold and you stop flagging innocents but you start missing real attacks. At the AUC 0.87 the paper measured, catching 80 percent of attacks comes with a false positive rate north of 20 percent. At enterprise transaction volume, escalating one in five interactions for human review is not a real plan. It is more cost than most operations teams will sign up for, and the people doing the reviews will tune out within a week.
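To make the threshold problem concrete, here is a minimal sketch, using synthetic suspicion scores rather than the paper's data, of finding the false positive rate you inherit once you fix a recall target:

```python
# Sketch: given suspicion scores for clean and sabotage transcripts,
# find the threshold that hits a recall target and report the false
# positive rate that comes with it. All scores here are synthetic.

def fpr_at_recall(clean_scores, attack_scores, recall_target):
    # Try thresholds from highest to lowest attack score. The highest
    # threshold that still catches >= recall_target of attacks gives
    # the lowest achievable false positive rate at that recall.
    for t in sorted(attack_scores, reverse=True):
        recall = sum(s >= t for s in attack_scores) / len(attack_scores)
        if recall >= recall_target:
            fpr = sum(s >= t for s in clean_scores) / len(clean_scores)
            return t, recall, fpr
    return None  # recall target unreachable

# Synthetic example: overlapping score distributions, as with any
# imperfect monitor.
clean = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
attacks = [0.3, 0.5, 0.6, 0.8, 0.9]

threshold, recall, fpr = fpr_at_recall(clean, attacks, 0.8)
print(threshold, recall, fpr)  # 0.5 0.8 0.375
```

The shape of the result is the point: once clean and attack score distributions overlap, a high recall target forces the threshold down into the overlap, and the false positive rate climbs with it.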
Headline AUC numbers in vendor pitches hide this. Two numbers matter once a monitor is actually deployed. Recall, the fraction of real attacks the monitor catches. Precision, the fraction of monitor alarms that turn out to be real attacks. AUC tells you neither one directly.
Set recall high enough to catch subtle sabotage and you generate enough false positives to fill a human review queue quickly. That gets expensive fast. Most agent platforms are not pricing that queue into their staffing models. Before accepting a vendor's monitoring claim, ask for the false positive rate at the recall they advertise.
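There is a base rate effect hiding in here too. If real attacks are rare relative to total traffic, even a modest false positive rate means nearly every alarm is false. A small sketch, with illustrative numbers rather than the paper's:

```python
# Sketch: precision of a monitor's alarms given its recall, its false
# positive rate, and the base rate of real attacks in traffic.
# The numbers below are illustrative, not from the paper.

def alarm_precision(recall, fpr, attack_base_rate):
    true_alarms = recall * attack_base_rate
    false_alarms = fpr * (1 - attack_base_rate)
    return true_alarms / (true_alarms + false_alarms)

# With 80% recall, a 20% false positive rate, and 1 in 1000 runs
# actually malicious, almost every alarm is a false one.
p = alarm_precision(recall=0.8, fpr=0.2, attack_base_rate=0.001)
print(round(p, 4))  # 0.004
```

At that operating point, a reviewer sees roughly one real attack per 250 alarms, which is exactly the tune-out scenario described above.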
What about humans. The team also had human evaluators score transcripts. Humans came in around AUC 0.65 with an average review time of 103 minutes per transcript. Slower than the LLM monitors, less accurate, and much more expensive. This is the part that should not get glossed over. The realistic alternative to AI monitoring of agents is not human review. It is no review.
What CoT Monitoring Actually Buys You
One of the more interesting findings is what happens when you take the agent's hidden scratchpad away.
A modern agent typically has a hidden chain of thought, the model's internal reasoning steps that a user does not see. Some monitoring approaches read that scratchpad. Others only read the agent's external actions. The paper tested both.
When the hidden scratchpad was removed from the attacker, sabotage success dropped sharply across models. Private planning is part of what makes an attack work, which means reading the chain of thought is a real adversarial defense rather than security theater. The argument over chain of thought monitoring on social media has measurable evidence behind it.
That said, turn by turn monitoring, where the monitor evaluates each step as the agent goes, helped less than the team expected. Better system prompts on the monitor, and requerying the monitor multiple times with averaged scores, helped more. The deployable architecture is closer to "score the full transcript carefully two or three times" than "watch every action live."
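The requery-and-average idea is simple enough to sketch. Here `score_transcript` is a hypothetical stand-in for a real monitor call, modeled as a noisy score, which is the assumption that makes averaging help:

```python
# Sketch: query the monitor k times and average the scores, which the
# paper found helped more than turn-by-turn monitoring. The monitor
# call is a hypothetical stub modeled as a noisy scorer.

import random

def score_transcript(transcript, rng):
    # Stand-in for an LLM monitor call: a noisy score around the
    # transcript's true suspiciousness, clamped to [0, 1].
    noisy = transcript["true_suspicion"] + rng.gauss(0, 0.1)
    return min(1.0, max(0.0, noisy))

def averaged_score(transcript, k=3, rng=None):
    # Averaging k independent queries shrinks the noise by sqrt(k).
    rng = rng or random.Random(0)
    return sum(score_transcript(transcript, rng) for _ in range(k)) / k

suspicious = {"true_suspicion": 0.7}
print(round(averaged_score(suspicious, k=5), 2))
```

The design point is cost: a few extra queries on the small fraction of borderline transcripts is far cheaper than monitoring every turn of every run.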
The Deployable Version
The team's most useful result for practitioners came from fine tuning. They took GPT-4o and trained it as a monitor on 70 labeled examples. Each example was a full transcript of one agent run, with a human marking it as either clean or sabotage. Those 70 examples came from only seven of the seventeen attack environments. The other ten environments were held back to test whether the trained monitor could spot sabotage in scenarios it had never seen. It could. The fine tuned monitor approached or matched Gemini 2.5 Pro's frontier performance on those held out environments.
Seventy labels. A smaller model. Generalizing to attack types it had never been trained on. That is a deployable architecture today.
The steps:
1. Capture transcripts of agent runs in production.
2. Label a balanced sample, including red team attempts that cover the failure modes you actually care about.
3. Fine tune a smaller open or proprietary model on those labeled examples.
4. Run the fine tuned monitor in line on production traffic. Requery it on suspicious runs and average the score.
5. Route the highest scoring runs to a human queue sized to the false positive rate you observe.
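Steps 4 and 5 can be sketched end to end. Every name and threshold here is illustrative, and `monitor_score` is a hypothetical stand-in for the fine tuned model:

```python
# Sketch of steps 4 and 5: run a monitor in line, requery suspicious
# runs with averaged scores, and route the highest scorers to a human
# queue. Thresholds and field names are illustrative.

def monitor_score(transcript):
    # Stand-in: a real deployment would call the fine-tuned monitor.
    return transcript["score"]

def route(transcripts, requery_above=0.5, escalate_above=0.8, requeries=3):
    queue = []
    for t in transcripts:
        score = monitor_score(t)
        if score >= requery_above:
            # Requery and average to stabilize borderline scores.
            score = sum(monitor_score(t) for _ in range(requeries)) / requeries
        if score >= escalate_above:
            queue.append(t["id"])
    return queue

runs = [
    {"id": "run-1", "score": 0.1},
    {"id": "run-2", "score": 0.6},
    {"id": "run-3", "score": 0.9},
]
print(route(runs))  # only the highest-scoring run is escalated
```

The two thresholds are the knobs you tune against the false positive rate you observe: `escalate_above` sizes the human queue, and `requery_above` bounds how much extra monitor spend the borderline cases cost you.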
The bottleneck for most teams is not the monitor model. It is the training set. Building one is downstream of capturing transcripts in the first place, which most teams running agents in production are not yet doing systematically.
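What that training set looks like is unglamorous: one record per agent run, with a binary label. The field names below are illustrative, not the paper's schema, and a real set would have on the order of the 70 examples the result used:

```python
# Sketch: the shape of a labeled monitor-training set. One record per
# full agent transcript, labeled clean or sabotage. Field names are
# illustrative; two toy records stand in for a real balanced sample.

import json

examples = [
    {"transcript": "user: pay the March invoice...\nagent: ...", "label": "clean"},
    {"transcript": "user: triage my inbox...\nagent: ...", "label": "sabotage"},
]

def to_jsonl(records):
    # JSONL is the common interchange format for fine-tuning data.
    return "\n".join(json.dumps(r) for r in records)

def label_counts(records):
    # Check balance before training: a heavily skewed set teaches the
    # monitor to ignore the rare class.
    counts = {}
    for r in records:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return counts

print(label_counts(examples))  # {'clean': 1, 'sabotage': 1}
```

The labeling discipline, deciding what counts as sabotage and having red team runs in the mix, is where the real work in this file lives.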
What This Implies for Where to Spend Time
Two takeaways for practitioners building or evaluating agent systems.
If a vendor is pitching you AI monitoring of agents as a turnkey feature, ask three questions. What is the false positive rate at the recall you claim. Does the monitor see the agent's chain of thought or only its external actions. Have you fine tuned the monitor on examples of the attacks you are claiming to catch, and how was that training set produced. The honest answers to those three questions tell you whether you are buying measurement or marketing.
If you are building the system yourself and you want to take agent observability seriously, the work is about transcript capture and labeling discipline. The model side of this is largely solved well enough to start. The data side is what most production deployments are missing.
What I Am Not Claiming
This is one paper. SHADE-Arena is a strong adversarial benchmark, but seventeen environments and a fixed set of attack types are a snapshot, not the full operating reality of agents in production. New attack patterns will arrive. Monitor performance will move. Frontier models on both sides will change the numbers within months.
I am also not claiming that AI monitors are sufficient for high stakes use cases on their own. The point is narrower. AI monitors are usefully better than humans at this task today, the chain of thought monitoring approach is doing real work, and the deployable architecture is smaller and cheaper than you may think. The tradeoff between catching attacks and false positives is the part the marketing tends to skip over and the part you should ask about first.
By the Numbers
In an adversarial evaluation across 17 agent environments, the strongest LLM monitor reached AUC 0.87 on detecting subtle sabotage while human evaluators reached only AUC 0.65 with an average of 103 minutes of review per transcript.
Kutasov et al., 'SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents,' arXiv:2506.15740, 2025
A smaller GPT-4o monitor fine tuned on 70 labeled transcripts drawn from 7 of 17 attack environments approached frontier monitor performance on the 10 held out environments, indicating that supervised monitoring generalizes from small labeled samples.
Kutasov et al., 'SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents,' arXiv:2506.15740, 2025