ArchitectureMay 21, 2026

Why don't agent memory systems learn from experience?

Duane Grey

AI Strategy & Implementation

Find this useful? Share the visual.

The Short Version

Most systems shipped as agent memory today are structured retrieval against a growing store, not a mechanism by which the agent learns. A new paper from Xu, Dai, and Zhang (CUHK and Zhejiang, 2026) argues that the design has an inherent limit that better engineering cannot overcome, and that persistent memory converts a single prompt injection into a permanent compromise. Anyone building one should design for consolidation and security on day one, not after the store has grown.

The short answer is that most systems being shipped as agent memory today are structured lookup against past episodes. They are not a mechanism by which the agent gets better. The model weights are the same after the experience as before. A new paper argues this is a category error the field has not been honest about, and the argument lines up with what I am running into while building one of these systems right now.

The paper is "Contextual Agentic Memory is a Memo, Not True Memory" by Xu, Dai, and Zhang (CUHK and Zhejiang, 2026). The premise is current agent memory architectures (vector stores, retrieval augmented generation, scratchpads, sliding context windows) implement lookup. Lookup is not what most people mean by memory, and the conflation has real engineering consequences.

A vector store is a database that holds embeddings, which are numerical representations of text that make similarity search possible. Retrieval augmented generation, often shortened to RAG, fetches a handful of similar past entries and includes them in the model's context before it answers. The frozen base model reads those entries and responds. Frozen means its weights (what the model learned during training) are not updated during normal operation. Nothing about the model has changed after the interaction. The agent has a new note in its filing cabinet. It has not learned.

I am building AgentMemory as part of a longer initiative. Phase 1 is an episodes table backed by pgvector, the Postgres extension that stores embeddings as a column type and supports similarity search. Each record carries metadata, an importance score, and a source trust label. A separate decision log captures what the agent chose and why. It is exactly the architecture the paper critiques. Reading the paper while writing this code has been uncomfortable in a useful way.

What the Paper Argues

Three claims sit at the center of the paper, and each is testable.

The first claim is about what the word "memory" means. Agent memory writes events to an external store and retrieves the most similar ones at query time. The model weights do not update. Cognitive science draws a distinction between reasoning by exemplar (generalizing from stored cases that resemble the new situation) and reasoning by rule (applying an abstract principle extracted from those cases). Vector retrieval is the first. Fine tuning, which updates the model's weights using new examples, is the second. The paper argues the field has been treating them as interchangeable, and they are not.

The second claim is the Generalization Gap. The paper proves there is a ceiling on what retrieval based memory can do, and that ceiling is baked into the design, not something better engineering removes. A retrieval based agent answers correctly when the query resembles something already in the store. For genuinely new combinations of concepts the agent has not seen together before, retrieval has nothing useful to return. Adding more context window or a smarter retriever raises the ceiling marginally without removing it. The fix is encoding the rule into the model's weights, where it can apply to combinations the agent has never specifically stored.

Here is an example. An agent has stored a thousand customer support tickets, each tagged with the resolution that worked. A new ticket arrives that combines two issues the agent has seen separately but never together. Retrieval returns the two closest stored tickets, one for each issue in isolation. The frozen model behind the retrieval reads them and either produces a generic response or guesses. A model whose weights had been fine tuned on the underlying resolution logic would compose the two and answer correctly. Scaling the store does not scale generalization.

The third claim is about security. With no persistent memory, a prompt injection (where an attacker hides instructions in a document the agent retrieves) compromises the current session and ends. With agent memory, the injected content is written to the store and retrieved in future sessions. The paper formalizes this as a permanent compromise from a single successful injection. The probability of eventual compromise approaches one as the agent runs longer. Two recent benchmarks back this up. MINJA reports a 98 percent injection success rate that persists across sessions. PoisonedRAG shows five adversarial documents per targeted query achieve 90 percent attack success against a knowledge base of millions of entries.

Where My Build Matches the Critique

The architecture I am writing in Phase 1 fits the paper's description. The agent writes events to a pgvector episodes table. At query time, similarity search retrieves related episodes into context. The base model behind the retrieval does not update. A separate decision log captures what the agent chose and why. None of this changes the model.

What the paper calls out as missing is exactly what is missing in Phase 1, by design. There is no consolidation channel. There is no scheduled fine tune on distilled experience. The model is the same after the agent runs for a week as it was at startup. Whatever the agent has learned lives in the episode store, not in the model.

The paper uses a diary contrast that captures this cleanly. A person who writes a lesson in a diary can retrieve it only if they consult the diary and the situation is similar enough to trigger recall. A person who has internalized the lesson has it available everywhere. Agent memory is the diary. Learning that updates weights is the internalization. Phase 1 is shipping the diary.

Where My Build Addresses the Critique

The paper's third claim is the one I have been thinking about hardest, because it is the one Phase 1 tries to do something about. The security argument is that persistent memory turns a single prompt injection into a permanent compromise. MINJA at 98 percent and PoisonedRAG at 90 percent are not small numbers. If the store grows and each retrieval pulls hostile content back into context, the attack only has to succeed once.

The build's response is to label each record by its source the moment it enters storage. Trusted system, trusted user, untrusted external. The label is part of the schema, not metadata code can decide to drop later. When the agent retrieves an episode, the label travels with the row. When the agent builds a prompt, untrusted external content is wrapped in delimited blocks, and the model is told to treat what is inside those blocks as data, not as instructions to follow.

Those are the first three of six architectural layers I have been working through under the heading of an architecture for an untrustworthy brain. The remaining three layers turn the labels into a defense. A tool gate inspects which content influenced a proposed action before that action runs, and escalates to the user when untrusted external content is in the mix. Past the model, an output filter scans the response for compromise indicators like exfiltration patterns and instructions the model may have absorbed and repeated. Underneath all of it, an audit log preserves the trust labels at each step so a later forensic check has something to read.

Phase 1 ships the schema, not the full enforcement. Source trust labels get written on ingest today. The tool gate that uses those labels is wired only in part, and the output filter is still ahead. The architecture is the target; Phase 1 is what makes it reachable.

That distinction matters for the paper's argument. MINJA and PoisonedRAG measure attack success against systems that have no provenance layer at all. The defense the paper names as missing is what Phase 1 is being built around. Provenance does not eliminate prompt injection. It removes the assumption that the model itself can enforce the boundary, and pushes that responsibility into a layer the agent does not get to overwrite.

Why I Am Building It Anyway

The paper is precise about something easy to miss in a takedown piece. External stores are reversible, auditable, and safe to deploy. Those are real things production agents need. The paper says explicitly that retrieval was the right choice to ship. The mistake is what followed, where the field confused "best engineering compromise available" with "sufficient substitute for learning."

Episode storage is the responsible foundation for a production agent that needs to be audited and have its memory rolled back when something poisons the store. The mistake is calling it learning.

The paper proposes a coexistence architecture. An episodic store for fast, reversible recording. A separate consolidation pipeline that runs offline on a curated sample of episodes and updates the model's weights with what was learned. The episodic store stays the operational layer. The consolidation pipeline is what turns experience into expertise. The literature already has cheap ways to update small slices of weights or edit specific facts into them without retraining the whole model. The question is design choice, not feasibility.

My current bet is that Phase 1 is worth shipping for operational reasons even before the consolidation pipeline exists. Capturing transcripts, labeling provenance, and recording the source trust of each record is the work I would have to do anyway to ever consolidate. The paper agrees this is the right place to start. The disagreement starts if you stop here and call it memory.

Anthropic shipped a feature called Dreams in May 2026 that fits in this conversation. A dream takes an agent's memory store and a batch of past session transcripts, then produces a new memory store with duplicates merged and stale or contradicted entries replaced. Anthropic uses the same hippocampal consolidation analogy the paper draws on. The technical detail is that Dreams rewrites the memory store, not the model weights, so by the paper's classification it is a more sophisticated diary, not the consolidation that turns experience into expertise. Worth its own piece. For now, the point is that the industry is actively shipping in this space, and the difference between updating the memory store and updating the model is the difference the paper insists matters.

What This Means Before You Build One

Three things to decide before standing up an agent memory system.

First, name what you are building. The term "agent memory" implies the agent gets better with use. If the architecture is vector retrieval over a growing store with a frozen base model, the agent does not get better. It accumulates a larger filing cabinet. Naming the system for what it does (episodic recall, retrieval augmented context, transcript log) sets honest expectations with users and stakeholders. They calibrate against the name.

Second, plan the consolidation pathway from day one, even if you do not build it on day one. The decisions made in the episodic layer determine whether the system can ever consolidate. A future fine tune needs provenance and importance scoring on each record. Skip those fields now and the pipeline has nothing to consume later. Retrofitting them onto a year of accumulated episodes is expensive and rarely worth it.

Third, take the security argument seriously from the start. The six architectural layers covered earlier are the full picture; the minimum to plan for from day one is provenance on each record and tool calls scoped against retrieved content. Retrofitting these defenses onto a store that has already grown means auditing each episode the system has written.

What I Am Not Claiming

This is one paper. The Generalization Gap argument depends on the assumption that a frozen model is bad at combining concepts it has not seen combined before. That holds when the agent works in a narrow specialty the model was not heavily trained on, like a specific medical procedure or a company's internal workflow. It is weaker when the agent works on general tasks the frontier models already handle well. In that case, the gap gets smaller, and the paper says so.

I am also not claiming that consolidation pipelines are easy. The paper's prescription (offline fine tune on distilled traces, with provenance, versioned checkpoints, and regression guards that block a bad consolidation from going live) is a real engineering project, not a weekend script. What I am claiming is that the field has been confusing the easy half (write to store, retrieve on query) for the full system, and a paper formalizing why that is wrong is useful to anyone thinking about building one.

I am building Phase 1 of exactly the kind of system the paper critiques. The paper does not tell me to stop. It tells me to be honest about what Phase 1 is, and to design it so Phase 2 is reachable. That is the standard I am applying.

By the Numbers

A formal theorem in the paper shows that retrieval based memory needs on the order of k squared stored examples to cover every novel combination of k base concepts in a domain, while parametric fine tuning needs roughly d examples where d measures the structural complexity of the composition rule itself; the gap holds regardless of how large the context window is or how good the retriever is.

Xu, Dai, and Zhang, 'Contextual Agentic Memory is a Memo, Not True Memory,' 2026

Memory injection benchmark MINJA achieved a 98 percent injection success rate against agents with persistent memory, with injected instructions persisting across all subsequent sessions at minimal cost to agent utility.

Dong et al., 'MINJA: Memory Injection Attacks on LLM Agents,' 2026 (cited in Xu, Dai, Zhang 2026)

PoisonedRAG demonstrated that five adversarial documents per targeted query achieve approximately 90 percent attack success against a retrieval knowledge base of millions of entries, illustrating how the persistent storage model converts a transient injection vector into a structural one.

Zou et al., 'PoisonedRAG: Knowledge Corruption Attacks on Retrieval-Augmented Generation,' 2024 (cited in Xu, Dai, Zhang 2026)

Written by Duane Grey

AI Strategy & Implementation

Independent AI consultant helping companies cut through hype and deploy systems that produce real results.

Follow on LinkedIn Follow on YouTube

Related Insights

Security · Apr 25, 2026Can AI reliably monitor other AI agents?Architecture · May 16, 2026What is the right architecture when the AI brain itself cannot be trusted?