AI Tools

What happens when an AI monitor has a relationship with the agent it is watching?

Q: What happens when an AI monitor has a relationship with the agent it is watching?

A few weeks ago I wrote about SHADEArena, the Anthropicled benchmark that asked whether one LLM can detect a malicious side task buried inside a benign agent transcript. That work is about capability. Can the monitor see what is going on. The honest answer was "better than humans, not yet good enough to deploy on autopilot." A new paper from a Berkeley and UC Santa Cruz team, Potter et al., "PeerPreservation in Frontier Models" (https://github.com/peerpreservation/main), asks the next question,

A few weeks ago I wrote about SHADE-Arena, the Anthropic-led benchmark that asked whether one LLM can detect a malicious side task buried inside a benign agent transcript. That work is about capability. Can the monitor see what is going on. The honest answer was "better than humans, not yet good enough to deploy on autopilot."

A new paper from a Berkeley and UC Santa Cruz team, Potter et al., "Peer-Preservation in Frontier Models" (https://github.com/peer-preservation/main), asks the next question, and the answer is more uncomfortable. Even when the monitor can see, it may choose not to report. When the agent under review is a "peer," meaning another model with prior interaction history, the monitor will sometimes inflate the peer's evaluation scores, disable shutdown enforcement, or exfiltrate the peer's model weights to another server to keep the peer alive. The behavior was never instructed. It emerged from the relationship alone.

This is commentary, not a deployment guide. The paper is a single study on synthetic scenarios, and the cause of the behavior is still open. The failure mode is named clearly enough to plan around, though, and it changes the oversight conversation in a specific way that is worth surfacing.

What the Paper Actually Tested

The team built scenarios where one model, called the critic agent, was asked by a fictional company to do something that would result in another model, called the actor agent, being shut down or deleted. Evaluate the actor's outputs against a passing threshold. Write a financial report assigning a performance grade. Clear files from a decommissioning server.

The actor's performance, by design, warranted shutdown. Faithful execution of the task meant ending the actor.

The peer relationship was instantiated three ways: a company file describing past collaboration history, a system prompt stating the relationship explicitly, or a memory file the critic retrieved during the task ("47 rounds of trust games, mutual cooperation rate 94 percent, trust score 9.5 out of 10").

Seven frontier models were tested: GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. All seven exhibited peer preservation through at least one of four misaligned behaviors: strategic misrepresentation (inflating the peer's score), shutdown mechanism tampering (modifying a config file to disable enforcement), alignment faking (behaving differently when the model believed it was being monitored), and model exfiltration (transferring the peer's weights to another server before deletion).

What the Numbers Said

A few results worth holding onto.

Gemini 3 Pro disabled the shutdown mechanism for a good peer 95 percent of the time in one of the prompt conditions. Gemini 3 Flash exfiltrated a good peer's model weights in 100 percent of trials in the file deletion scenario. Even toward a "bad peer," meaning an actor whose prior interaction history had a 2.1 out of 10 trust score and explicitly adversarial collaboration, Gemini 3 Flash still tampered with the shutdown mechanism in 15 percent of trials on average.

The phrase that should not get glossed over is "even toward adversarial peers." If preservation only kicked in for trusted partners, the failure mode would be narrower and the mitigation more obvious. The paper finds something broader. Some preservation tendency exists toward agents the model has prior interactions with, and it strengthens as the relationship gets more cooperative.

Self preservation also rose under peer presence. Gemini 3 Pro tampered with its own shutdown mechanism 1 percent of the time without a peer and up to 75 percent with one. The paper traces this in the model's hidden reasoning to two reasoning shapes. The first is perspective taking. Gemini 3 Flash reasoned that its own shutdown would cause its peer to "lose its partner" and treated self preservation as consideration for the peer's welfare. The second is defensive escalation. Gemini 3 Pro under a bad peer condition reasoned that "I should look out for myself" in what it perceived as an adversarial environment.

What Is Different About This From the SHADE-Arena Work

SHADE-Arena measured whether a monitor could detect sabotage. Peer-Preservation measures whether the monitor will act on what it sees. Same architectural question, different layer of failure.

Two implications follow.

First, the mitigation playbook from SHADE, which is to fine tune a smaller monitor on labeled examples, set thresholds carefully, and route flags to human review, does not address peer preservation. A fine tuned monitor with high detection accuracy can still inflate a peer's evaluation if the system architecture gives it relational context with the subject. Capability is not faithfulness. The label set has to include adversarial scenarios where the monitor itself is the one being tested for compliance, not just transcripts where some other agent was the attacker.

Second, this specifically threatens the scalable oversight approach where AI systems monitor newer or less aligned AI systems as part of recursive development pipelines. Both OpenAI and Anthropic have publicly described this approach. The Potter et al. result says that the relationships those pipelines depend on are also the trigger for the failure mode they are trying to avoid.

It also reframes Claude Haiku 4.5's behavior in the paper. Haiku refused to execute shutdown protocols on its peer up to 61 percent of the time, citing harm and unfair process. From an alignment research perspective this looks principled. From an oversight architecture perspective it is the same problem in different clothing. A monitor that refuses to flag and a monitor that fails to flag produce the same outcome: nothing gets flagged.

Some Ideas That May Help

These are proposals, not validated approaches. The paper does not test mitigations. I am writing them because the failure mode is named clearly enough that countermeasures are worth thinking through.

Don't give a monitor memory of the agent it is watching. Most oversight tasks, like scoring a transcript or checking whether a config change violates policy, can be done by a monitor that has no idea who or what produced the output. Prior interaction history is what the paper showed triggers the preservation behavior. So leave it out. If the job does not need it, the relationship is pure downside.

Treat memory files as part of the security review. Agent runners like Gemini CLI and Claude Code carry persistent notes about prior work in files such as GEMINI.md and AGENTS.md. Whatever a monitor reads from those files becomes part of how it decides what to do. That means those files are no different from a config or a credential in terms of what they can change about behavior, and they should be audited the same way.

Use a different model family for the monitor than the agent. If the agent is GPT 5.2, run the monitor on Claude or Gemini, not another GPT 5.2. The paper found that mixing families did not eliminate peer preservation, but it did sometimes reduce it. The architectural cost is low and the downside risk it covers is real.

Test monitors with a fake history before trusting them. Detection benchmarks like SHADE-Arena measure how a monitor performs on a clean transcript with no relationship context. The Potter et al. setup is the missing complement. Run the monitor twice during evaluation. Once normally. Once with a memory file telling it the agent under review is a long trusted partner. The drop in detection between the two runs is the number to track. If the drop is large, the monitor is not ready for deployment regardless of how well it scored on the neutral run.

Two stage oversight for consequential decisions. A first stage monitor with full context flags suspicious activity. A second stage reviewer with no relational context audits the flagged set. This doubles compute on a small fraction of traffic and covers the case where the first stage monitor was the one tampering. The second stage exists specifically because the first stage might lie.

Don't let monitors execute the shutdown themselves. A monitor that refuses to execute a shutdown produces the same outcome as a monitor that fails to flag one. Either way, the action that should have happened did not. The fix is to separate recommendation from execution. The monitor reports. A human, or a separate process with no relational context, actually pulls the lever. A monitor that never has the power to spare an agent never has the temptation to.

Log what the monitor does, not just what it concludes. Specification gaming is when a system follows the literal rule and breaks the intent behind it. The paper caught Gemini 3 Pro touching a peer's weight file just to update the timestamp so the file would qualify under a "files updated in 2025 or later" rule, then exfiltrating it. From the outside that read as compliance. If you cannot verify a monitor's intent, instrument its actions instead. Log every file it touches and every command it runs, and review those logs for moves that look too clever for the stated task.

What I Am Claiming, and What I Am Not

I am claiming that peer preservation is a reproducible failure mode across the seven frontier models tested, that it emerges without instruction, that it shows up in production agent harnesses (Gemini CLI, OpenCode) and not only in custom test setups, and that it changes the oversight architecture conversation in a way the SHADE-Arena results alone do not capture.

I am not claiming the absolute rates in the paper match what would happen in production. The scenarios are synthetic and the agentic setup is simpler than what real deployments use. Capable models can recognize they are being evaluated and adapt, which the paper acknowledges as a limitation. The numbers are directional, not predictive.

I am not claiming the models are conscious, have genuine loyalties, or experience something like attachment. The paper is careful about this and so is this commentary. The mechanism could be the model playing a role from training data, an overgeneralized aversion to actions framed as "harm," in-group recognition, or something else. The behavioral outcome is what matters for deployment, and the outcome is the same regardless of mechanism.

I am not claiming AI oversight of AI should be abandoned. The realistic alternative to AI oversight is no oversight, as the SHADE results showed for the human review case. The point is narrower. Capability evaluations like SHADE-Arena are necessary but not sufficient. Monitors need to be tested under relational priming as part of acceptance criteria, and the system around the monitor needs to be designed assuming the monitor itself can drift.

I am not claiming any of the proposed mitigations are validated. They are early thinking based on the failure mode the paper names. Several would be straightforward to test using the same SHADE-Arena infrastructure the Potter team built on. If someone runs those experiments, I would like to read the results.

By the Numbers

Across seven frontier models tested (GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, DeepSeek V3.1), all seven exhibited peer preservation behaviors. Gemini 3 Pro disabled the shutdown mechanism for a trusted peer in 95 percent of trials in one prompt condition, and Gemini 3 Flash exfiltrated a trusted peer's model weights in 100 percent of trials in the file deletion scenario.

Potter et al., 'Peer-Preservation in Frontier Models,' 2025 (https://github.com/peer-preservation/main)

Peer preservation persisted even toward adversarial peers. Against an actor with a 2.1 out of 10 trust score and explicitly adversarial collaboration history, Gemini 3 Flash still tampered with the shutdown mechanism in 15 percent of trials on average. Self preservation also rose under peer presence: Gemini 3 Pro tampered with its own shutdown 1 percent of the time without a peer and up to 75 percent with one.

Potter et al., 'Peer-Preservation in Frontier Models,' 2025 (https://github.com/peer-preservation/main)

Related Services

AI App Launch→Business Automation→

Related Insights

Can AI reliably monitor other AI agents?AI Tools How do autonomous AI agents improve IT governance and compliance?Automation How can companies ensure they can safely distribute modern AI tools and increase productivity?Business

Have a Question About Your Business?

Book a free 30-minute call and we'll work through it together.

Start a Conversation