There's a particular way an AI coding agent goes bad. Not a crash, not an error. It just gets duller. Halfway through a long session it forgets a constraint you set early, repeats a question you already answered, or starts giving you shorter, vaguer replies to the same kind of ask it handled well an hour ago. You can feel the quality sag without anything actually breaking.
My first instinct was to blame MCP. I had a few servers connected, I'd read that connected servers eat the context window, so the story wrote itself: too many tools loaded, no room left to think, of course it's drifting. I was about to start disconnecting things. Then I decided to measure first, and the measurement didn't say what I expected.
I read the breakdown instead of guessing
The agent I use can print a breakdown of what's currently filling the context window, by category. So before cutting anything, I looked at where the tokens were actually going. I'll give this in proportions rather than raw numbers, because the absolute figures depend on the model and window size, and the shape is the part that transfers.
Roughly, in a session that had started drifting:
- Conversation history (the back-and-forth so far): the single biggest slice, around a fifth of the whole window on its own
- Fixed startup overhead (system prompt, tool framework, memory files): a meaningful chunk, but stable and one-time
- Connected MCP tool definitions: a small slice, far smaller than I'd been worried about and close to a rounding error
The thing I was about to blame was near the bottom of the list. The thing I hadn't thought about, the plain accumulation of conversation, was the top.
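To make "read the breakdown" concrete, here is a minimal sketch of the ranking step. It assumes you can copy per-category token counts out of whatever your agent prints; the category names and every number below are illustrative placeholders, not the figures from my session.

```python
def rank_by_share(breakdown, window_size):
    """Return (category, fraction_of_window) pairs, largest slice first."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    shares = [(name, tokens / window_size) for name, tokens in breakdown.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Illustrative numbers only (hypothetical), chosen to mirror the shape above.
example_breakdown = {
    "mcp_tool_definitions": 3_000,
    "startup_overhead": 25_000,
    "conversation_history": 40_000,
}

ranked = rank_by_share(example_breakdown, window_size=200_000)
for name, share in ranked:
    # Print each category as a percentage of the whole window.
    print(f"{name:<24} {share:6.1%}")
```

The only point of sorting is to force the largest slice to the top of the page before you decide what to cut.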
Why the MCP assumption was half right
I want to be careful here, because "MCP doesn't cost anything" would be the wrong lesson, and it's not what I found.
MCP can be heavy. A connected server can load its full tool schema and carry it on every turn, and if your client loads all of that up front, a handful of servers really can take a large bite out of the window before you type a word. That version of the warning is real, and plenty of people have measured it on their own setups. So if you connect many servers and your client front-loads their schemas, the usual advice to disconnect what you don't use is sound.
What I'd add is narrower: it depends on how your client loads tools. Some setups defer the schema and only pull a tool's definition in when it's actually needed. In a setup like that, idle connected servers cost much less than the worst-case number suggests, and on the session I measured, they weren't my bottleneck. The general claim "MCP is expensive" and my specific result "MCP wasn't what filled my window" aren't in conflict. They're about different loading behavior. The honest takeaway isn't "MCP is innocent," it's "don't assume which line item is the problem, because it varies by setup."
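A rough way to see why the two claims don't conflict is to compare both loading behaviors on the same set of tools. This is a sketch under assumptions: the `Tool` type, the idea that a deferring client keeps a short stub per idle tool, and all token sizes are hypothetical, and real clients may defer differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    schema_tokens: int  # size of the full tool definition (hypothetical)
    stub_tokens: int    # size of a short placeholder for an idle tool (assumed)


def tool_definition_cost(tools, deferred, used=frozenset()):
    """Tokens occupied by tool definitions under each loading strategy."""
    if not deferred:
        # Front-loading client: every full schema sits in the window.
        return sum(t.schema_tokens for t in tools)
    # Deferring client: full schema only for tools actually pulled in.
    return sum(t.schema_tokens if t.name in used else t.stub_tokens for t in tools)


tools = [
    Tool("search", schema_tokens=1200, stub_tokens=40),
    Tool("deploy", schema_tokens=2500, stub_tokens=40),
    Tool("db_query", schema_tokens=1800, stub_tokens=40),
]

eager = tool_definition_cost(tools, deferred=False)
lazy_idle = tool_definition_cost(tools, deferred=True)
lazy_one_used = tool_definition_cost(tools, deferred=True, used=frozenset({"search"}))
print(eager, lazy_idle, lazy_one_used)
```

Same servers, same tools, very different line item. Which of the two your client resembles is something to check rather than assume.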
What was actually filling it
The slice that grew without me noticing was conversation history. It makes sense once you see it: every exchange stays in the window, and a long exploratory session piles up turn after turn until the early context is competing for space with the part the model needs right now. Nothing dramatic added it. It was just the steady weight of a long conversation, and it was the part I hadn't thought to look at because it didn't feel like a "feature" I'd switched on.
That reframed the drift for me. The agent wasn't getting dumber because of what I'd connected. It was getting dumber because I'd been having one very long conversation, and the room to reason was slowly filling with the transcript of that conversation.
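The accumulation is easy to model. The sketch below assumes a fixed overhead that never changes and a constant number of tokens added per turn; both are simplifications, and the numbers are made up for illustration.

```python
def history_share_by_turn(fixed_tokens, tokens_per_turn, turns):
    """Fraction of the used context that is conversation history, per turn."""
    shares = []
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # every exchange stays in the window
        shares.append(history / (history + fixed_tokens))
    return shares


def first_turn_history_exceeds_fixed(fixed_tokens, tokens_per_turn):
    """First turn at which history is strictly larger than the fixed overhead."""
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    return fixed_tokens // tokens_per_turn + 1


# Hypothetical session: 20k tokens of fixed overhead, 800 tokens per turn.
shares = history_share_by_turn(fixed_tokens=20_000, tokens_per_turn=800, turns=30)
crossover = first_turn_history_exceeds_fixed(fixed_tokens=20_000, tokens_per_turn=800)
print(f"history overtakes fixed overhead at turn {crossover}")
```

Nothing in the loop is dramatic, which is the point: the fixed slice stays put while the history slice only ever goes up.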
What I do about it now
None of the fixes are clever. They're just the things that follow once you know history is the heavy part.
I don't let one exploratory session run forever. When a thread of work is basically done, I start fresh instead of carrying the whole transcript into the next, unrelated task. When I do need continuity, I have the agent summarize where things stand and carry the summary into a new session, rather than dragging the entire history across. The point is to move the gist, not the full back-and-forth, because the full back-and-forth is exactly the weight I measured.
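For the handoff itself, something like the following is enough. It's a sketch, not a prescribed format: the function name, the four fields, and the sample values are my own assumptions about what "the gist" needs to contain.

```python
def _bullets(items):
    """Render a list as plain-text bullets, marking an empty list explicitly."""
    return [f"- {item}" for item in items] if items else ["- (none)"]


def build_handoff_note(goal, constraints, decisions, next_step):
    """Assemble a short note to paste as the first message of a fresh session."""
    lines = ["Handoff from previous session", "", f"Goal: {goal}", ""]
    lines += ["Constraints still in force:"] + _bullets(constraints) + [""]
    lines += ["Decisions already made:"] + _bullets(decisions) + [""]
    lines += [f"Next step: {next_step}"]
    return "\n".join(lines)


# Hypothetical example values.
note = build_handoff_note(
    goal="Add rate limiting to the settings endpoint",
    constraints=["Do not change the public API", "Keep PHP 7.4 compatibility"],
    decisions=["Use a per-user counter, not a per-IP one"],
    next_step="Write tests for the counter reset",
)
print(note)
```

Listing the constraints as their own section is deliberate: the early constraint is the thing that tends to go missing, so it gets a fixed slot instead of being left to the summary's prose.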
The mental model that stuck: the context window is a desk, not a filing cabinet. Everything you want the model to use at once has to fit on the desk's surface, and a long conversation slowly covers it with paper until there's no room to work. Clearing the desk is sometimes better than buying a bigger one.
The actual lesson isn't about MCP
If I'd followed my first instinct, I'd have disconnected a few servers, freed up a small slice, watched the drift continue, and learned nothing. The fix would have missed the cause, and I'd have blamed the tool I'd primed myself to blame.
So the thing I'm keeping isn't "history is always the culprit," because on someone else's setup it really might be the connected servers, or the memory files, or something I'm not thinking of. The thing I'm keeping is the order of operations: when the agent starts drifting, read the breakdown before you cut anything. The line item you're sure is the problem and the line item that's actually the problem are often not the same, and the only way to tell them apart is to look.
A note to my next self
When the agent gets dull mid-session, don't reach for the explanation you already have. Measure first. Read where the tokens are actually going, fix the slice that's actually large, and accept that it varies by setup so last time's culprit isn't a rule. For me it was conversation history, so I keep sessions shorter and hand off a summary instead of a transcript. Next time it might be something else, which is the whole reason to look instead of guess.
I build WordPress plugins and write about AI tooling and security at https://raplsworks.com/.
Top comments (28)
Measuring the context window before blaming MCP is the right instinct. A lot of tool-problem debugging is really context decay, stale instructions, or the model compressing away the one constraint that mattered. I like treating context as an observable resource instead of a vague feeling that the agent got worse.
"An observable resource instead of a vague feeling" is exactly the shift I was after. The moment you can read the breakdown, "the agent got worse" stops being a mood and becomes a measurement you can act on. Half the time the tool was never the problem; the instruction you set early got buried, or the model compressed away the one constraint that mattered, and you'd never see that by staring at the tools.
The compression case you name is the sneaky one, because it looks identical to the model "forgetting." The constraint was there, it was real, and it got squeezed out to make room. Which is the same lesson from another angle: don't trust the felt sense of what went wrong, read where the tokens actually went. Tools get blamed because they're visible; context decay does its damage quietly.
Yes, compression failure is nasty because it looks like personality drift from the outside. The model still sounds coherent, but the constraint that made the run safe is no longer active enough to steer behavior.
That is why I like explicit state checks: what instructions are still loaded, what artifacts are pinned, what budget is left, and what was summarized away.
"Looks like personality drift from the outside" is the exact trap. The voice survives compression, so it sounds fine, and the thing that's gone is invisible because it never showed up in the prose to begin with. You stop trusting tone as evidence once you've been burned by that once.
Your explicit-state-check list is the part I'd put on a wall: what's still loaded, what's pinned, what budget's left, what got summarized away. That last one is the quiet killer, because a summary reads as complete even when it dropped the one line that was steering the run. Checking what got compressed out is harder than checking what's present, and it's usually the one that matters. Good thread.
Yes, checking what disappeared is the hard half.
One practical trick is to make the summary list its omissions: "dropped raw logs, intermediate candidates, rejected branches, unresolved uncertainty." It is not perfect, but it gives the next agent a reason to distrust the summary in the right places.
Compression is safest when it admits what it compressed away.
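A small sketch of that idea, assuming a summary is just a gist plus a declared list of omissions. The `SessionSummary` class and its refusal to render without omissions are illustrative design choices, not an existing tool's behavior.

```python
from dataclasses import dataclass, field


@dataclass
class SessionSummary:
    gist: str
    omissions: list = field(default_factory=list)  # what was deliberately dropped

    def render(self):
        """Render the summary, refusing to hand off if nothing is declared dropped."""
        if not self.omissions:
            raise ValueError("declare what was dropped before handing off")
        dropped = "\n".join(f"- {item}" for item in self.omissions)
        return (
            f"Summary:\n{self.gist}\n\n"
            f"Dropped from this summary (distrust accordingly):\n{dropped}"
        )


summary = SessionSummary(
    gist="Parser rewrite is done; two edge cases remain untested.",
    omissions=["raw logs", "rejected branches", "unresolved uncertainty"],
)
rendered = summary.render()
print(rendered)
```

Forcing a non-empty list is blunt, but it makes "nothing was dropped" a claim someone has to type out rather than a default.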
Making the summary declare its own omissions is the most practical fix in this whole thread, because it attacks the exact thing that makes compression invisible: the dropped constraint never showed up in the prose, so the only way to see it is to force the prose to name what it left out. "Dropped raw logs, rejected branches, unresolved uncertainty" turns an absence into a line you can actually read.
"Compression is safest when it admits what it compressed away" is the sentence I'm keeping. It reframes the summary from a clean artifact into an honest one, and honest beats clean here, because a clean summary that silently dropped the steering constraint is more dangerous than a messy one that flags the gap. The omission list is basically the summary leaving its own audit trail, telling the next agent where to distrust it. That's the discipline. Good thread.
The 'read the breakdown before you cut anything' point is the whole thing, and it generalizes past context windows. The instinct is always to blame the newest component (MCP here), and the actual cost is usually the boring accumulated thing nobody switched on. I hit the same with long agent sessions: the transcript is the weight, and the fix is exactly your handoff-the-summary move, carry the gist into a fresh session, not the full back-and-forth. The desk-not-filing-cabinet line is the right model. Measure first, cut the slice that's actually large, and accept last time's culprit isn't a rule.
"It generalizes past context windows" is the part I didn't say out loud but should have. The pattern isn't about tokens, it's about where attention goes when something degrades: the newest, most nameable component gets blamed because it's visible, and the boring accumulated thing gets a pass because nobody switched it on. That's true of context windows, slow test suites, bloated CI, almost any system that quietly fills up.
The one line I keep coming back to is your last one: last time's culprit isn't a rule. That's the whole discipline, really. The moment you turn "it was history that time" into "it's always history," you're back to guessing, just with a more specific guess. The only durable move is to measure again, because the boring thing that fills up is rarely the same boring thing twice. Good read on it.
"Last time's culprit isn't a rule" being the whole discipline is exactly it: the moment you promote one measurement into a law, you're guessing again with extra confidence. The newest nameable component takes the blame, the boring accumulated thing skates because nobody switched it on, and it's the same across context windows, CI, test suites, all of it. Measure again every time, because the thing that fills up is rarely the same thing twice. Good exchange, this one's going to stick with me.
"Guessing again with extra confidence" is the sharpest way anyone's put it in this thread. That's the trap exactly: the law feels like progress, but it's just a more confident version of the thing measuring was supposed to replace. Glad this one's sticking with you, it's sticking with me too. Thanks for the exchange.
The measure-before-cutting instinct is the whole game here. I've watched people rip out MCP servers when the real drift was a 40k-token transcript nobody trimmed, and the per-category breakdown is the part most tools just don't show you. Did seeing it change how you structure long sessions, or only what you cut?
The 40k-token transcript nobody trimmed is the exact shape of it, and you're right that the per-category breakdown is the part almost nothing surfaces. Without it you're not measuring, you're just picking the component you already suspected.
To your question: it changed structure more than it changed what I cut. Cutting was the first reaction, but once I could see that history was the slice that grows, the better move was to stop letting it grow that far in the first place. So now I scope sessions tighter. One task per session, hand off a summary, start clean, instead of running one long thread and trimming it after it's already heavy. The trim is reactive, it's paying down a cost after you've taken it on. Structuring for shorter sessions is preventive, you never let the transcript reach 40k. Seeing the breakdown is what moved me from cutting after the fact to structuring so there's less to cut. What about you, did the visibility change how you start sessions, or mostly what you remove from them?
The most useful takeaway here is the reminder to measure before optimizing. It's easy to blame the newest component in the stack (MCP, tools, memory systems) when behavior degrades, but bottlenecks are often much less exciting.
The "context window as a desk, not a filing cabinet" analogy is spot on. Long-running agent sessions accumulate hidden costs, and conversation history often becomes the biggest source of context pressure without anyone noticing.
A good lesson for AI builders in general: when quality drops, inspect where tokens are actually being spent before redesigning the system around assumptions. The diagnosis is often more valuable than the fix.
"The diagnosis is often more valuable than the fix" is the line I'll keep from this. The fix in my case was almost boring, shorten the session, hand off a summary. What had value was finding out that history, not the component I'd have bet on, was the weight. The fix took a minute; the measurement changed what I'd reach for next time.
And you put your finger on why the newest component gets blamed: it's the one you can picture, the one with a name and a config. Conversation history has no on-switch, so it doesn't feel like a thing you added, which is exactly why it grows unwatched. Boring bottlenecks win because nobody's looking at them. Measuring is mostly just agreeing to look where it isn't interesting.
One thing that stands out to me is that context pressure is not purely a token-budget problem; it’s also an information-quality problem.
Two sessions can consume the same number of tokens and produce very different outcomes. A context filled with outdated assumptions, abandoned approaches, and exploratory noise can be more harmful than a much smaller context containing only current decisions and constraints.
This is why I increasingly see context engineering as a form of information architecture. The challenge isn’t just fitting information into the window, but continuously deciding what remains operational knowledge, what becomes summarized knowledge, and what should be discarded entirely.
Measuring token allocation tells us where the space goes. Understanding information value tells us what deserves to stay.
"Measuring tells you where the space went; understanding value tells you what deserves to stay" is the sharper version of what I was reaching for. My post measured the symptom. You named the discipline under it. Two sessions burning the same token count can land in completely different places, and the difference is the quality of what's loaded, not the quantity.
The information-architecture frame is the part I'll carry. Token budget asks "does it fit." Information value asks "should it still be here," and those come apart fast: a context can be half-full and still poisoned by abandoned approaches the model keeps treating as live. The hard, ongoing call is exactly the three buckets you drew, operational, summarized, discarded, and almost nothing in the tooling helps you decide which is which. That's a human judgment for now. Good thread.
The desk analogy maps directly onto something we ran into building Blogboat, our AI blog writing tool. When users run a full AI session to draft, then revise, then re-prompt for tone in the same context, the later outputs get noticeably muddier than if they started a fresh session for each distinct task.
We ended up designing around it: treat each block-level edit as its own bounded context rather than a continuation of the original draft session. The AI doesn't "know" what came before it rewrote section 3. That sounds like a limitation but it's actually cleaner — the edit is evaluated on its own terms, not on the accumulated noise of all the earlier turns.
Your measure-first principle is the one more tools need to surface. Most writing AI just hides the context plumbing entirely, which means users never learn why quality drops late in a session. The breakdown you described is exactly the kind of visibility that would help them course-correct.
Treating each block edit as its own bounded context instead of a continuation is the move, and I like that you framed the forgetting as a feature rather than a workaround. It is one. "The AI doesn't know what came before it rewrote section 3" sounds like a limitation right up until you notice the alternative is the edit inheriting every earlier turn's noise whether or not it's relevant. Bounded context isn't the model knowing less, it's the model not being steered by things that have nothing to do with the current edit.
Your last point is the one I'd underline hardest. Hiding the context plumbing entirely is what keeps users stuck, because the quality drop feels like the AI randomly getting worse, with no visible cause to act on. They can't course-correct a thing they can't see. Surfacing even a rough breakdown turns "it got dumb" into "the session got long, start fresh," which is a move the user can actually make. The tools that expose the plumbing are teaching the user the one habit that fixes it; the ones that hide it are quietly training learned helplessness. Good example to bring in, the bounded-edit design is the same lesson from the product side.
This resonates with something I’ve noticed as well.
We often talk about context windows as a capacity problem, but in practice they’re also an attention-allocation problem. Not all tokens have equal value. A 5,000-token transcript full of exploration, dead ends, and outdated assumptions can be less useful than a 500-token summary containing the current constraints and decisions. What makes this interesting is that the challenge starts to look less like memory management and more like knowledge management. The question becomes: what information deserves to stay in active context, and what should be compressed into a higher-level representation?
In that sense, context engineering is starting to feel a lot like software architecture.
The shift from capacity to attention-allocation is the part I undersold in the post. I measured which slice was biggest and stopped there, but you're right that size is the crude proxy. A 5,000-token transcript of dead ends can actively cost you, not just by taking room, but by diluting the few hundred tokens that actually carry the current constraints. Big and useful aren't the same axis.
The knowledge-management reframe is where it gets real. Once you accept that tokens have unequal value, "keep the session short" is too blunt. The better move is deciding what gets promoted into a compact, high-level representation and what stays raw, which is exactly the summarize-and-hand-off step, just named properly. And the architecture comparison holds: you're drawing module boundaries, deciding what's interface and what's implementation detail the caller shouldn't hold. Context engineering as deciding what deserves to stay loaded. That's a better frame than the one I wrote to.
Good framing. The agent often does not get dumber. The context gets messier.
Before blaming MCP, it is worth checking token usage, noisy tool outputs, stale instructions, and whether too many tools are enabled.
"The agent doesn't get dumber, the context gets messier" is a cleaner way to say it than my whole title. The list you give is the right order to check too, and the sneaky one is stale instructions, because they look fine sitting in the window while quietly steering nothing. Cheap to check, easy to miss. Good addition.
Strong write-up. The part that resonated with me most was treating context as something observable instead of something you only feel after quality drops. In my own agent workflows, the biggest gains usually come from shortening the session boundary and carrying forward a deliberate handoff, not from ripping out tools. Your desk-vs-filing-cabinet framing explains that tradeoff really well.
That match between your workflow and the post is the part I find reassuring, because it means it isn't just my setup. The deliberate handoff is the move people skip, I think, because it feels like overhead in the moment. Ripping out a tool feels like progress, you can see the thing leave. Writing a clean handoff feels like extra work for no visible gain, right up until the session you start clean runs noticeably sharper than the one you dragged the whole transcript into. The gain is real, it's just deferred, which is exactly why the visible-but-wrong fix wins by default. Appreciate you reading it.
I do the same: I let the AI summarize the whole chat in one prompt before moving on to a fresh chat. It's a hassle when it slows down while you're in the flow of learning new things, verifying documents, and testing simple code. So the rule I set for myself is that as soon as I see a slight slowdown, I move to a fresh chat. I just copy and paste my saved "summarize this chat" prompt from my notepad to keep things fast.
The saved-prompt-in-a-notepad trick is the part that makes this actually stick. The summarize-and-restart idea is easy to agree with, but in the middle of a flow nobody wants to write the handoff prompt from scratch, so they just push the slow session further. Having it one paste away removes the friction that would otherwise make you skip it.
The one thing I've added: I try to move at the first slight slowdown like you do, rather than waiting for it to get bad, because the summary you write early is cleaner than the one you write from a session that's already bloated. Once the context is heavy, even the summary it generates starts dragging the noise along with it. Early handoff, smaller and sharper summary. Sounds like you're already there.
@butialrj what you're describing is right, but I'm just not disciplined enough yet to write a summary each time before it slows down hahaha. It's a good habit to follow, though. Usually I only notice my chat has gotten long once it slows down, so that's basically my indicator, but I'll try your approach as well.