Michelle Tristy

Posted on Jun 13

Your AI agent remembers what sounds related, not what worked

#ai #agents #llm #machinelearning

I spent a couple of weeks asking people a pretty basic question. If you are actually running agents, past the demo, in something resembling production, how do you handle memory?

I was expecting a handful of tips. What I got instead was the same frustration over and over, and a problem that, as far as I can tell, nobody has cleanly solved yet. So I am writing it down, because if you build with agents you are going to run straight into it.

The thing everyone starts with

Most agent memory works the same way. Embed everything the agent has seen, store the vectors, and when a new task shows up, pull back whatever is closest and drop it into context.

That is fine right up until it isn't. The catch is that "closest in vector space" really means "sounds related," and sounding related is not the same as having worked last time.

So the agent recalls the thing that resembles the task in front of it, not the thing that actually helped. It will cheerfully head down a path it already failed three sessions ago, because nothing ever told it that path was a dead end. If you have watched an agent repeat its own mistake with total confidence, that is the whole bug right there. It is not stupid. It just never found out how the last attempt turned out.
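To make that concrete, here is a toy sketch. The vectors are hand-written stand-ins for real embeddings, and `Memory`, its `worked` flag, and `retrieve` are names invented for illustration, not anyone's actual setup. The only point is that the ranking never reads the outcome.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    vector: list[float]  # toy embedding, hand-written for illustration
    worked: bool         # the outcome, which plain similarity search never reads


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, memories, k=1):
    """Return the k memories closest to the query. Outcome plays no part."""
    ranked = sorted(memories, key=lambda m: cosine(query, m.vector), reverse=True)
    return ranked[:k]


memories = [
    Memory("retry the migration with a bigger timeout", [0.9, 0.1, 0.0], worked=False),
    Memory("split the migration into batches", [0.6, 0.4, 0.1], worked=True),
]
query = [1.0, 0.0, 0.0]  # hypothetical embedding of "the migration is timing out"
top = retrieve(query, memories)[0]
```

The top hit is the approach that already failed. It is simply the nearest vector, and nothing in `retrieve` ever looks at `worked`.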

What people are actually doing about it

Here is the part I did not expect. Almost everyone I talked to had already hit this and quietly built their own fix. And the fixes were all over the place, which to me is the tell that there is no standard answer yet.

A few that kept coming up.

Some people just use files. No memory platform, nothing fancy. Working memory lives in plain files the agent reads on startup, the agent decides what to write, and old stuff rolls off into a vector store later. For one person working alone this was apparently rock solid, and they were a little smug about it, fairly.

Other people keep a separate failure log. Pull "this failed and here is why" out of the general memory entirely, and when the agent wonders whether it has tried something before, check that log first, ahead of the normal similarity search. Somebody put it in a way that stuck with me. Embeddings are great at recalling topics. They almost never hold on to "we went down this road and it blew up because of X."
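A minimal sketch of the "check the failure log first" idea, assuming failures are keyed by a normalized description of the approach. `FailureLog`, `normalize`, and `recall` are hypothetical names, and exact-match keys are a deliberate simplification. A real setup would likely need fuzzier matching.

```python
from dataclasses import dataclass, field


def normalize(approach: str) -> str:
    """Collapse case and whitespace so trivially different phrasings share a key."""
    return " ".join(approach.lower().split())


@dataclass
class FailureLog:
    # Maps a normalized approach to the reason it failed.
    entries: dict[str, str] = field(default_factory=dict)

    def record(self, approach: str, reason: str) -> None:
        self.entries[normalize(approach)] = reason

    def lookup(self, approach: str):
        return self.entries.get(normalize(approach))


def recall(approach, failure_log, similarity_search):
    """Consult the failure log before falling back to similarity search."""
    reason = failure_log.lookup(approach)
    if reason is not None:
        return {"source": "failure_log", "note": f"already tried, failed: {reason}"}
    return {"source": "similarity", "note": similarity_search(approach)}
```

The ordering is the whole design: a known dead end short-circuits the lookup before the vector store gets a chance to suggest it again.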

A few have the agent write its own little post mortem after each task. Tried this, it broke because of that, next time do the other thing. Then search those before starting fresh. The honest downside they admitted is that after thirty or forty of these the file turns into noise, so they had to bolt on a step that summarizes the old ones.
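The bolt-on summarization step could look roughly like this. It is a sketch under assumptions: `compact` and `keep_recent` are made-up names, the cutoff of ten is arbitrary, and `summarize` is whatever you plug in, most likely an LLM call.

```python
def compact(post_mortems, summarize, keep_recent=10):
    """Keep the newest notes verbatim and fold the older ones into one summary.

    post_mortems is ordered oldest to newest. summarize is any callable that
    takes a list of notes and returns a single string.
    """
    if len(post_mortems) <= keep_recent:
        return list(post_mortems)
    old = post_mortems[:-keep_recent]
    recent = post_mortems[-keep_recent:]
    return [summarize(old)] + list(recent)
```

Note that the summary is itself an interpretation, so it inherits every problem discussed further down about lessons hardening into rules.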

And some split memory into tiers. Stable facts the agent is allowed to trust, versus everything else, which it can mention but not act on unless it can point to where it came from.
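As a sketch, the tier rule fits in a few lines. The tier names `"stable"` and `"unverified"` and the `may_act_on` helper are assumptions for illustration, not labels anyone gave me.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Note:
    text: str
    tier: str                     # "stable" or "unverified" (hypothetical tier names)
    source: Optional[str] = None  # where the note came from, if known


def may_act_on(note: Note) -> bool:
    """Stable facts are trusted. Anything else needs a source before it drives action."""
    if note.tier == "stable":
        return True
    return note.source is not None
```

An unverified note with no source can still be shown to the agent as context. It just cannot be the reason the agent does something.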

Different shapes, same underlying instinct. Stop pretending every memory is equally trustworthy.

Where it all falls apart

Once I lined these up next to each other, one thing jumped out.

Every single approach handles what to write down. None of them really handles what to keep.

Noticing that something failed turns out to be the easy half. You can catch tool errors, failed tests, timeouts, a change that got reverted. You can even treat "the task just ended and nobody ever confirmed it worked" as its own kind of failure, which is how you catch the quiet ones that never throw an error.
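A rough sketch of that detection step, assuming the signals arrive as simple flags. `TaskResult`, its field names, and `classify` are all hypothetical. The part worth noticing is that a missing confirmation gets its own label instead of silently counting as success.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    tool_error: bool = False
    tests_failed: bool = False
    timed_out: bool = False
    reverted: bool = False
    confirmed_ok: Optional[bool] = None  # None means nobody ever checked


def classify(result: TaskResult) -> str:
    """Label an outcome as 'failed', 'worked', or 'unconfirmed'."""
    if result.tool_error or result.tests_failed or result.timed_out or result.reverted:
        return "failed"
    if result.confirmed_ok is True:
        return "worked"
    if result.confirmed_ok is False:
        return "failed"
    # No error and no confirmation: the quiet case that never throws.
    return "unconfirmed"
```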

What comes after that is the hard part. Which failures are worth keeping, and which were flukes. When a lesson stops being true because the system moved underneath it. How you stop a memory from sliding from "this happened once" into "this is the rule," when nobody actually checked that it should be a rule.

One person framed it in a way I keep coming back to. A memory should hold proof, not a moral. The raw event, what happened and the evidence for it, should stay put and stay checkable. The lesson you draw from it should be allowed to change when something later contradicts it. The moment those two things become a single object, the system starts defending its interpretation instead of just remembering what actually happened. Which, honestly, is a very human way to be wrong.
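In code, "proof, not a moral" might come down to two separate objects. This is a sketch, and `Event`, `Lesson`, and `revise` are names made up for it. The event is frozen so it cannot be edited after the fact, while the lesson only points at events and can be rewritten when a new one contradicts it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """What happened and the evidence for it. Immutable once recorded."""
    what_happened: str
    evidence: str


@dataclass
class Lesson:
    """An interpretation drawn from events. Allowed to change."""
    claim: str
    supporting: list[Event] = field(default_factory=list)
    contradicting: list[Event] = field(default_factory=list)

    def revise(self, new_claim: str, because: Event) -> None:
        """Replace the claim and keep the event that forced the change."""
        self.contradicting.append(because)
        self.claim = new_claim
```

Revising a lesson never touches the events underneath it, so the record of what actually happened stays checkable even after the conclusion has changed.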

What the newer tools still skip

There is a fresh wave of memory tooling now that handles a nearby but different problem, which is tracking whether a stored fact is still true as time passes. Who owned this before, who owns it now. That is genuinely useful and a real step up from blind similarity.

But notice it is answering a different question. "Is this fact still current" is not the same as "did acting on this memory actually lead somewhere good." A fact can be perfectly up to date and still be the exact thing that sent the agent into the wall three times in a row. Whether something is still true and whether it ever worked are two different axes. Most of the field is busy on the first one.

If you are building this today

The practical stuff I took away, mostly secondhand from people deeper in it than me.

Do not lean on similarity on its own. It hands you what looks related, not what helped. Treat failures as real memory, because what did not work is often more useful than what is merely similar. Keep the event and the lesson separate, so you can record what happened plainly and still revise the conclusion later. Put a real gate in front of what gets promoted into a durable rule, because noticing a break is not the same as having learned the right thing, and bad lessons calcify fast. And assume you will have to go back. A lesson that was true two weeks ago can be actively harmful once you have refactored the thing it was about.

None of this is solved. The people doing it well are using sensible rules of thumb: recency, prove it twice, a human glance, the occasional cleanup pass. And every one of those rules breaks somewhere predictable.
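For what it is worth, "prove it twice" combined with recency is about this much code. The function name `should_promote` and both thresholds are assumptions chosen for illustration, not numbers anyone recommended.

```python
from datetime import datetime, timedelta


def should_promote(confirmations, now, min_confirmations=2, max_age=timedelta(days=14)):
    """Promote a lesson to a durable rule only if it was confirmed often enough, recently.

    confirmations is a list of datetimes at which the lesson was seen to hold.
    """
    recent = [t for t in confirmations if now - t <= max_age]
    return len(recent) >= min_confirmations
```

It also shows where the rule breaks. Two confirmations from the same flaky afternoon pass the gate, and a lesson that is rarely exercised but still true never does.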

I do not think a better embedding model is the way out. The question feels different to me. Less "what is most similar to this," and not even "what is still true," but something closer to "what actually worked, and how do we hang onto that while the rest quietly fades."

If you are running agents in production and wrestling with this, I would genuinely like to hear how you handle it. The conversation that kicked all of this off taught me more than anything I have read on the topic.

Top comments (7)

Alex Shev • Jun 14

This is the memory problem that keeps showing up in real agent work. Semantic similarity is useful for recall, but it has no concept of outcome quality. The missing layer is not just more memory, it is scored memory: what worked, what failed, what was later corrected, and under what constraints. Otherwise the agent keeps retrieving familiar mistakes with high confidence.

Michelle Tristy • Jun 14

Scored memory is the right frame, and the part I keep snagging on is where the score actually comes from. Outcome quality is not knowable at write time. You find out whether acting on a memory helped much later, if you capture it at all, and most setups never close that loop. So you end up scoring on proxies instead. Recency, did a human nod at it, did it at least not error. And those proxies are exactly where the familiar mistakes walk back in with high confidence. "Under what constraints" is the piece almost nobody stores, and I suspect it is the piece that matters most.

Alex Shev • Jun 14

Exactly. The score has to be allowed to mature after the write. I would start with weak priors at capture time, then update them when the memory is reused: did it help complete the task, did a human correct it, did it cause a rollback, did it only apply under a constraint that was missing? Without that feedback loop, "memory quality" becomes a nicer name for recency plus vibes.
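One way that could look, as a sketch only: treat the score as pseudo-counts that start weak and move each time the memory is reused. `ScoredMemory`, `record_reuse`, and the choice to weigh a rollback twice as heavily as a correction are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    text: str
    helped: float = 1.0  # weak prior: one pseudo-count each way at capture time
    hurt: float = 1.0

    @property
    def score(self) -> float:
        """Estimated chance that acting on this memory helps."""
        return self.helped / (self.helped + self.hurt)

    def record_reuse(self, *, completed_task=False, human_corrected=False,
                     caused_rollback=False) -> None:
        """Update the counts from what happened when the memory was reused."""
        if completed_task:
            self.helped += 1
        if human_corrected:
            self.hurt += 1
        if caused_rollback:
            self.hurt += 2  # assumption: a rollback is stronger evidence than a correction
```

A fresh memory sits at 0.5 and only earns a higher score by being reused and helping. That is the opposite of similarity search, where a memory is trusted the moment it is written.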

Michelle Tristy • Jun 16

Recency plus vibes is the most honest description of the current state I have read. I am keeping that.

The weak priors maturing on reuse model is right, and the place I get stuck implementing it is attribution. When a task succeeds, several memories were usually in context, not one. Crediting all of them rewards the passengers that happened to ride along, and crediting the top retrieved one is often just rewarding whatever was most similar, which is the exact bias the score was supposed to correct. Your rollback and human correction signals are cleaner precisely because they tend to point at a specific memory, the one that caused the revert, rather than the whole retrieved set. The diffuse positive case is the one I cannot cleanly assign.

The constraint signal you mentioned, did it only apply under a condition that was missing, is the one I think is most underrated. A memory that worked ten times can be carrying a hidden precondition nobody wrote down, and it keeps scoring well right up until the context shifts out from under it and it fails for a reason the score never captured. Have you found a way to surface that the precondition exists before the failure teaches it to you, or is it always after the fact?

Alex Shev • Jun 17

That attribution layer is the difference between memory and folklore. If a system remembers a preference, it should also know where it came from, when it was last confirmed, and whether it was a one-off instruction or a durable rule.

Without that, memory starts sounding helpful while quietly losing accountability.

Mehmet Can Farsak • Jun 14

Great breakdown of agent memory failures. The "sounds related ≠ worked" problem is real — I've seen the same pattern with agents picking the wrong mode. You ask an agent to brainstorm, it recalls a coding session because embeddings think they're related, then starts writing code instead of exploring ideas. That's why I built Brainstorm-Mode (mehmetcanfarsak/Brainstorm-Mode on GitHub) — it enforces mode discipline at the infrastructure level with PreToolUse hooks, so the agent stays in ideation instead of downgrading to execution. Three modes (divergent, actionable, academic) each with different constraints.

Michelle Tristy • Jun 14

That mode collapse example is a sharp version of it. The agent is not really choosing the wrong mode, it is retrieving a session that looks related and inheriting its behavior, which is the similarity trap wearing a different hat. Enforcing the mode at the hook level is interesting because it sidesteps the memory question entirely. You constrain the behavior instead of trusting the recall. I do wonder where that runs out though. Hard boundaries work when the modes are known up front. The cases that get me are the ones where the right behavior depends on how a similar attempt actually turned out last time, which no hook can know in advance.