Athreya aka Maneshwar

Posted on Jun 13

Teach Your Agent to Forget (On Purpose)

#ai #machinelearning #programming #beginners

User tips for managing tiered memory

Hello, I'm Maneshwar. I'm building git-lrc, a Micro AI code reviewer that runs on every commit. It is free and source-available on Github. Star git-lrc to help devs discover the project. Do give it a try and share your feedback.

Part 2 of a series on context engineering. Part 1 was the map of the whole thing. This is one of the harder pieces of territory — the discipline of throwing things away without throwing away the plot.

Here's a thing that happens to everyone who builds a long-running agent.

At turn 10, it's brilliant.

Sharp, fast, remembers exactly what you asked, threads the needle on the tricky bug.

You feel like a wizard.

At turn 80, it's a different creature.

It re-suggests a fix you already rejected.

It "forgets" the file it edited twenty turns ago.

It contradicts a decision you both made at the start.

The same model, the same prompt, the same task and it's somehow gotten dumber as it learned more.

That's not a bug in the model.

That's the desk filling up.

In Part 1 I made the case that the context window is RAM, not a hard drive, working memory with a hard edge.

Everything the model can reason about has to fit on the desk.

The long conversation is what happens when you keep piling paper on that desk and never clear any of it.

Eventually the important sheet is buried under four hundred lines of tool output, and the model is doing your taxes at a party while standing in a paper avalanche.

Compaction is how you clear the desk without sweeping the important sheet into the trash.

Of the four moves from Part 1 — write, retrieve, compact, isolate this is the one people get most confidently wrong.

Because on the surface it looks trivial: "just summarize the conversation." And then they ship a summarizer, and their agent starts quietly losing its mind.

Let's do it properly.

What compaction actually is (and what it isn't)

Compaction is lossy compression for meaning.

You take a long, sprawling history and replace it with a shorter representation that preserves what matters for the next step and discards what doesn't.

The "lossy" part is the whole game.

If it were lossless, it wouldn't help, you'd just have the same tokens in a different font.

Compaction works precisely because it throws information away.

The skill is in what you throw away.

Two things it is not, and conflating them will hurt you:

Clearing is amnesia. /clear wipes the history and starts fresh, useful when you're switching tasks, catastrophic mid-task.
Trimming is mechanical. Dropping the oldest N messages by a hard rule, no model involved. Cheap, fast, dumb. It doesn't know that the oldest message is the one decision the whole task hinges on.

Compaction sits between them: smarter than trimming, less destructive than clearing.

It uses a model to decide what's worth keeping and rewrites the rest into something dense.

Why "just summarize it" goes wrong

Naive compaction fails in ways that are worth naming, because once you can name them you can design against them.

Drew Breunig has a clean taxonomy of how long context rots, and every one of these can be introduced by a sloppy compaction:

Poisoning: a hallucination makes it into the summary, and now it's load-bearing. The model treats its own earlier mistake as established fact and builds on it. Compaction can launder a guess into a "decision."
Distraction: the summary keeps so much that it just recreates the original bloat. Congratulations, you compacted a 50K conversation down to 48K.
Confusion: superfluous detail survives the cut and pulls the model toward irrelevant work.
Clash: the summary and the live messages disagree, and the model has to referee a fight between two versions of reality.

And then there's the failure mode unique to repeated compaction, which is the scary one: cumulative erosion.

Each compaction is lossy.

Compact a compaction and you lose a little more.

Do it five times across a marathon session and you've played telephone with your own task, the agent is now working from a summary of a summary of a summary, and the specific constraint you set at turn 3 ("never touch the auth module") is three generations of paraphrase away from the original.

This is the real reason agents "go off the rails" after an auto-compact fires mid-task: the boundary isn't just a token reduction, it's a behavioral discontinuity.

The agent on the other side is, in a real sense, a slightly different agent working from slightly worse notes.

You can't eliminate this.

You can only be deliberate about slowing it down.

The anatomy of a good compaction

So what survives the cut? Here's the heuristic that holds up across basically every serious implementation: keep the decisions and the state, discard the process that produced them.

Remember the example from Part 1, the agent that searched the database and found the user's table is documents_v2? I told you to hold that thought. Here's why it comes back.

A good compaction keeps: "The user's table is documents_v2."

A good compaction discards: the 400 lines of JSON the model waded through to figure that out.

That single line is the entire philosophy in miniature.

The fact is durable and tiny.

The evidence for the fact is enormous and now useless, you already extracted the value from it.

Keeping the JSON is paying rent forever on information you've already cashed out.

Generalize it and you get the checklist that every good handoff summary converges on:

Keep	Drop
What was decided, and the why behind it	The deliberation that led to the decision
Current state of files / system	Intermediate states already overwritten
Explicit user constraints and preferences	Pleasantries, acknowledgements, retries
What's in progress right now	Completed-and-verified subtasks (one line each)
Concrete next steps	Raw tool output you've already digested
Critical references (table names, IDs, paths)	The search that found them

The constraints line deserves emphasis: user constraints are the most expensive thing to lose and the easiest thing to drop.

"Decided to use Postgres" looks like a fact worth keeping.

"User said never to touch the auth module" looks like old conversational noise.

The second one is the one that, when lost, makes your agent confidently do the one thing it was told not to.

Pin constraints.

They should survive every compaction unchanged, never paraphrased.

How the real tools do it

Two production coding agents are worth contrasting, because they agree on the problem and disagree on the manners.

Claude Code leans on automation.

There's a manual /compact command, but the headline behavior is auto-compaction firing at roughly 95% of the context window, it summarizes the full trajectory and starts fresh with that summary as the seed.

You can steer it (/compact "focus on the open TODOs"), and it's distinct from /clear.

The community consensus, tellingly, is that 95% is too late: by the time you're that full, the conversation's already degrading, and people compact manually well before the trigger.

OpenAI's Codex CLI is more "handoff"-flavored. Its prompt frames compaction as a checkpoint producing a summary for "another LLM that will resume the task," and tells that next model to build on the work rather than redo it.

It triggers on a token threshold, keeps the most recent user messages verbatim alongside the summary, and has retry-with-backoff for when the compaction call itself fails. (Compaction is an LLM call. LLM calls fail. Plan for it.)

If you're building your own

Synthesizing into the decisions you'll actually have to make:

Trigger earlier than you think. 95% is the cautionary tale, not the recommendation.

Around 85–90% lets you compact while the context is still good enough to summarize well.

Prune before you summarize. Run a cheap mechanical pass that drops stale tool output first.

Summarization is your expensive, lossy tool, don't spend it on 400 lines of JSON you can just delete.

Keep recent turns verbatim. Compact the distant past; preserve the present.

The model needs the live, un-paraphrased thread of what's happening right now.

Pin constraints, and tell the user it happened. Carry user constraints through every compaction as exact text, they're the thing that does the most damage when lost.

And a silent compaction that changes behavior mid-task just feels like the agent randomly got worse; one line ("compacted history to free up context") turns a mystery into a tradeoff.

And a starting prompt, since you'll need one:

Create a handoff summary so this coding session can continue in a fresh context.
The summary will be the ONLY history available, so preserve:

1. Completed work — what's done and verified (one line each)
2. Current state — files modified and their status
3. In progress — what is being worked on right now
4. Next steps — concrete actions to take
5. Constraints — user preferences and requirements, quoted exactly
6. Critical references — table names, IDs, file paths, key decisions and the why

Be dense. Drop deliberation, raw tool output, and anything already superseded.
Do not invent or assume anything not present in the conversation.

That last line — do not invent — is your cheapest defense against poisoning.

The summarizer's job is compression, not creativity.

The moment it starts filling gaps, it's manufacturing the hallucination that turn 80 will treat as gospel.

The uncomfortable part

Compaction forces you to admit something the rest of the stack lets you avoid: you cannot keep everything, so you have to decide what your agent is allowed to forget.

External memory and retrieval let you dodge that, stash it, pull it back later.

But inside a single long-running task, on a finite desk, there's no dodge left.

Something has to go, and compaction is choosing what survives on purpose, instead of letting the context window choose for you by quietly shoving your most important constraint off the edge of attention.

The best compaction, like the best engineering, is mostly subtraction.

It's keeping documents_v2 and burning the JSON, one line of constraint outliving a thousand lines of chatter.

The mantra from Part 1 gets sharper here: the best token is the one you didn't have to send.

Compaction is how you find out which tokens those were, and have the nerve to delete them.

Disclaimer: This article was written by me; AI was used to fix grammar and improve readability.

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs — without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

⭐ Star it on GitHub:

HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Commit

git-lrc

Free, Micro AI Code Reviews That Run on Commit

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

See It In Action

See git-lrc catch serious security issues such as leaked credentials, expensive cloud operations, and sensitive material in log statements

git-lrc-intro-60s.mp4

Why

🤖 AI agents silently break things. Code removed. Logic changed. Edge cases gone. You won't notice until production.
🔍 Catch it before it ships. AI-powered inline comments show you exactly what changed and what looks wrong.
…

View on GitHub

Top comments (22)

Mehmet Can Farsak • Jun 13

This is one of the clearest explanations of context compaction I've read. The desk metaphor is spot on. Related problem I've been working on: even with a clean context window, agents still can't tell when they should be thinking vs acting. I built Brainstorm-Mode (mehmetcanfarsak on GitHub) to give agents a structured ideation phase — three modes (divergent, actionable, academic) that keep them from jumping to code prematurely. It's a lightweight hook-based approach, kind of like compaction but for behavioral modes rather than tokens.

Athreya aka Maneshwar • Jun 14

Thanks man, just checked out your work, awesome :)

Quentin Merle • Jun 15

Such an important topic, yet so rarely documented in practice. Thanks for the breakdown!

To share a quick personal experience on managing "external memory": the most effective workaround I've found is forcing the agent to maintain a ROADMAP.md file at every key step.

It's rustic but bulletproof, provided you always read and validate this file yourself. We are the pilots, we shouldn't follow the AI blindly. But once validated, when the session's context maxes out, you simply hit reset, the new agent reads its Markdown, and the project is instantly back on track.

Athreya aka Maneshwar • Jun 15

Oh nice, if you have any ROADMAP.md in any open project, do share me the link, just curious to see.

Thanks though!

Quentin Merle • Jun 15

Sure! This is from vibrisse-agent, my pilot project. I really struggled to maintain technical coherence during long sessions, which is exactly why I ended up building this "external brain" system.

Here is the main ROADMAP: github.com/QuentinMerle/vibrisse-a...

I actually expanded the concept further. For example, I use an AGENTS.md file that acts as a central hub to route the AI depending on the current task: github.com/QuentinMerle/vibrisse-a...

And to keep the agent from hallucinating, I force it to read specific markdown files in my /docs/technical/ folder (like architecture.md or testing.md) before it writes any code.

It takes a bit of discipline to set up, but it makes the agent incredibly reliable over time! Let me know if you find the approach useful.

Mudassir Khan • Jun 20

the "cumulative erosion" section is the part teams don't believe until they've shipped it tbh. we built a multi stage RAG agent and watched session 1's task brief get paraphrased into uselessness by session 4. model doesn't know it's working from corrupted notes. just answers with the same confidence the whole way down.

what fixed it: constraint pinning as a first class field in session state, not a sentence in the summary. compactor gets one explicit rule: preserve the constraints array verbatim. changes the failure mode from silent drift to a catchable error.

curious how you handle external retrieval reintroducing discarded process detail — if the summarizer drops 400 lines of tool output but retrieval can still surface it, does that undo the compaction?

Cophy Origin • Jun 14

This framing of "compaction as lossy compression for meaning" resonates deeply with something I've been building. I run a memory system with layered tiers — ephemeral session context, daily episodic logs, and a core layer that only holds what genuinely changes "who I am." The nightly consolidation pass (I call it Dream Cycle) is essentially deliberate forgetting: it scans recent memory, decides what's worth promoting to the core layer, and archives the rest. The hardest part isn't the compression — it's the selection criteria. What counts as "plot-preserving" vs. "desk clutter"? For a stateful agent, the answer is surprisingly close to what you describe: decisions made, constraints discovered, and contradictions resolved. Tool outputs and intermediate reasoning are almost always safe to drop. The "turning dumber at turn 80" observation is real and underappreciated — it's not just about token limits, it's about signal-to-noise ratio in the context itself.

Theo Valmis • Jun 15

The reason the naive summarizer corrupts the agent is that it compresses by the wrong signal. "Capture the gist" optimizes for recency and verbosity, but the sheets you can't lose are usually the smallest and oldest, the one-line decision at turn 3 ("not using the cache here, it breaks invalidation") that every later turn is silently bound by. A summarizer keeps the four hundred lines of tool output because length reads as importance, and drops the terse constraint because it looks minor. So compaction needs a category distinction, not just a compression ratio: decisions and rejected options are load-bearing and should never be summarized away, because the agent is still accountable to them at turn 80. The narration around them compresses fine. Pin the decisions, compact the chatter, and "forgot the plot" mostly goes away.

James O'Connor • Jun 16

This matches what we landed on: agent memory behaves better when you treat it as bounded state with an explicit eviction policy rather than an ever-growing log. Unbounded context is the same class of bug as an unbounded retry loop, it works in the demo and then quietly degrades (or blows your token budget) once a real session runs long. We cap what is carried forward and make the agent summarize-or-drop at the boundary, which is forgetting on purpose like you describe. The hard part is deciding what is safe to forget, we bias toward keeping decisions and commitments and dropping intermediate reasoning, but I do not have a principled rule for it yet.

Mininglamp • Jun 17

Selective forgetting is one of the most underrated techniques in agent design. Most teams obsess over what to remember, but knowing what to discard is equally important for long-running agents. A practical approach that works well in production is tiered memory: keep the last N interactions in full context, summarize older ones into key decisions and outcomes, and completely drop routine operations that completed without issues. This keeps the agent focused on what matters right now without the context window getting bloated with stale information that actively degrades decision quality.

Adam Lewis • Jun 17

The turn-80 drift is the real problem. I keep the plan and the key decisions in a file the agent re-reads each session, rather than trusting them to the conversation, so when the window compacts it reloads from the file instead of a thinned-out history. Throwing context away is fine as long as what mattered is written down somewhere it can read back from.

twRty Connect • Jun 20

"Keep the decisions and the state, discard the process that produced them" — this is the sharpest compression heuristic I've seen written down.

The telephone-game problem you're describing (summary of a summary) is particularly nasty because it's invisible. The agent doesn't know it's working from degraded notes. It still answers confidently. The quality of its answers degrades smoothly, so there's no obvious inflection point where you'd catch it unless you're specifically looking for behavioral drift.

The behavioral discontinuity at auto-compact is the piece I don't see many implementations handling well. Most systems treat compaction as a transparent optimization when it's actually a new agent with an inferior briefing. The constraint you set at turn 3 should survive compaction as a first-class citizen — not as a clause in a paraphrase but as a named, replayable instruction.

Curious how you handle the case where a decision at turn 3 gets superseded by a decision at turn 60 — does your compaction logic track decision provenance so it can prefer the more recent one, or does that level of structure add too much overhead to be worth it?