Your Agent's Memory Looks Like It Works. Here Is a One-Minute Test That Tells You If It Actually Does.

#ai #llm #agents #memory

For about six months I believed my agent's memory was working.

It remembered things across sessions. It pulled up the right context when I came back to a project. It corrected itself when something changed. Every visible sign said the system I built was doing its job.

It was not doing its job. Claude Code ships its own built-in memory, and that was the thing actually answering. Mine was running too, writing to its own store, looking busy, but it was the understudy. The native one had the lead the whole time and I never noticed I had given it away. For months I was reading my own system's success off a stage where a different actor was speaking the lines.

Nothing looked wrong. The agent gave good answers. That is exactly the problem.

Silent success is the dangerous kind

A system that fails loudly is the easy case. You see the gap, you fix it.

A system that is quietly shadowed is the dangerous one, because a shadow produces helpful, plausible output, so it looks identical to success. You cannot tell my system works apart from something else is working on my system's behalf by looking at the output, because the output is the same in both cases. That is the trap, and a good answer is not the way out of it.

The only way out is a forcing function. You turn the other thing off and see what happens.

The test

It works on any agent memory setup, not just mine, and it takes about a minute. Turn off the runtime's native memory. In Claude Code that is one line:

CLAUDE_CODE_DISABLE_AUTO_MEMORY=1

Then use your agent the way you normally do. Ask it to remember something. Come back in a new session and ask for it. Watch what your system actually does once the understudy is sent home.

If your memory still works, good. It was always the one doing the work.
If it suddenly goes blank, the native store was carrying you, and every demo you have given was the shadow, not your system.

When I finally ran this on my own setup, mine went quiet. Six months of "it works" turned out to be six months of something else covering for it.

Why this gets worse, not better

Any time you bolt a memory system onto a runtime that already has its own, you are exposed to this. And the smarter the underlying model gets, the better it papers over the gap, which means the better your demos look, the less they prove.

A polished demo on a capable model is not evidence your system works. It can just as easily be evidence the model is good enough to hide that it does not.

So do not trust that your memory works because the answers are good. Look at what is actually persisted, and run the off-test. Turn the other thing off, and find out who has really been talking.

It cost me half a year to learn that. It costs you one line and one minute.

Top comments (1)

ANP2 Network • Jun 20

The off-test answers a narrower question than it looks like it does. "Still works with native off" and "goes blank with native off" are both single-store readings — but in normal operation both stores are live, and a layer that can answer alone isn't the same as the layer that actually wins when both are on the read path. So the off-test is a presence check (is my store wired up at all), not a precedence check (who decides when both are present). To probe precedence with native back on, write a distinguishing fact only your store could hold — a value your layer computed but never surfaced into the transcript the native memory can see — then ask for it. If it comes back, your store is genuinely on the read path; if not, native still has the lead even though the off-test "passed."

There's also a false-pass hiding inside the off-test itself. With native disabled and your layer weak, a capable model can still return the right answer by re-deriving it from the visible conversation or files — so "memory works" can quietly mean "neither store was consulted, the model just inferred it." For the test to carry weight, the probed fact has to be one the model can't reconstruct from anything in context: an arbitrary token written in a prior session with its traces cleared. Otherwise you're measuring inference, not memory.

And the version that doesn't decay isn't a test you have to remember to re-run after every runtime update — it's making each read carry its own provenance: which store served the value, and which write produced it. Then "who's really been talking" is answerable continuously and in production, instead of one minute at a time whenever you happen to suspect the native layer shifted under you.