DEV Community

Kunal

Originally published at kunalganglani.com

Context Engineering for AI Agents: 4 Pillars That Replace Prompt Engineering [2026]


Context engineering is what happens when you stop obsessing over the perfect prompt and start thinking about everything else the model can see. It's the discipline of managing what information your AI agent can access, remember, and reason about at each step of execution. Not the words you whisper to the model. The entire information environment around it.

Think of it this way: prompt engineering is choosing what to say. Context engineering is furnishing the model's brain.

If you've built agents that work brilliantly for 10 turns and then fall apart, you've already felt why this matters. The model didn't get dumber. Its desk got buried.

Why Prompt Engineering Alone Can't Save Your Agent

I keep seeing the same pattern play out. A team spends weeks perfecting their system prompt. They test edge cases, tweak tone, dial in the instructions. The agent looks incredible in demos. Then it hits real users with long, messy conversations and starts contradicting itself by turn 40.

The prompt was never the problem. Everything else was.

Anthropic's engineering team wrote up what they learned from working with dozens of teams building agents. Their headline finding wasn't about prompts at all: the most successful implementations used simple, composable patterns rather than complex frameworks. The corollary I draw is that the model's information environment is a first-class architectural decision, not an afterthought bolted on after launch.

Prompt engineering asks: "What do I say to the model?" Context engineering asks: "What does the model have access to at each decision point?" That second question is harder, more structural, and increasingly the one that separates agents that ship from agents that demo well.

I've shipped enough agent features to know that a perfect prompt inside a polluted context window produces garbage. And a mediocre prompt inside a well-managed context produces surprisingly good results. The context is the product.

This tracks what Lilian Weng, then at OpenAI, laid out back in 2023: in-context learning is essentially the model using its short-term memory, while long-term memory relies on an external vector store with fast retrieval. Context engineering takes that insight and turns it into something you can actually build against.

Context Engineering vs Prompt Engineering: What Actually Changed

The terminology shift isn't branding. It reflects a real expansion in what builders need to control.

Dimension | Prompt Engineering | Context Engineering
Scope | System prompt + user message | Entire information environment: prompt, memory, retrieved docs, tool outputs, conversation history
Lifecycle | Static per deployment | Dynamic per turn; changes as the agent acts
Core skill | Wordsmithing instructions | Architecting information flow
Failure mode | Wrong output for a given input | Degradation over time as context fills
Optimization target | Single-turn accuracy | Multi-turn coherence and reliability
Memory model | Implicit (whatever fits in the window) | Explicit: write, retrieve, compact, isolate
Who owns it | The prompt author | The system architect

Prompt engineering isn't dead. It's a subset. You still need to write good prompts, but by my rough estimate that's now about 20% of the work. The other 80% is deciding what information surrounds that prompt at runtime.

If you've been following the evolution from generative AI to agentic AI, this should feel obvious. Agents don't just respond to prompts. They execute multi-step plans, call tools, read files, query databases. Every one of those steps produces output that lands in the context window. Managing that flow is the entire game.

The Four Pillars: Write, Retrieve, Compact, Isolate

Maneshwar (Athreya), the developer behind git-lrc, laid out the clearest framework I've come across in his Context Engineering 101 series on Dev.to. Four operational pillars. Clean, actionable, and they map to real production decisions:

1. Write — deciding what gets written into the context and how. Not everything an agent observes should land in the window. Tool outputs, intermediate reasoning, file contents — each write is a choice. Write too much and you bury signal in noise. Write too little and the agent loses track of what happened two steps ago.

2. Retrieve — pulling the right information at the right time. This is where RAG and vector databases earn their keep. Instead of cramming everything into the window upfront, you retrieve relevant context on demand. The agent's window stays lean.

3. Compact — lossy compression that preserves meaning. Teams get this one wrong constantly, and they're confident about it, which makes it worse. It looks like "just summarize the conversation." It's not. Compaction is disciplined, lossy compression that deliberately throws away the right information. As Maneshwar puts it: the context window is RAM, not a hard drive. Working memory with a hard edge.

4. Isolate — giving each subtask its own dedicated, uncontaminated context window. Instead of one omniscient agent trying to do everything, you spawn focused sub-agents with clean contexts. The SQL agent doesn't see the email-drafting instructions. Each desk stays small and sharp.

These four moves are the difference between an agent that degrades at turn 80 and one that stays sharp across thousands of interactions. If you've worked on AI agent control flow, you'll recognize that context management is the missing layer that makes control flow actually work.

The "Dumb at Turn 80" Problem

This is the failure pattern that makes context engineering urgent. And if you've run agents in production, you've probably seen it yourself.

At turn 10, your agent is brilliant. Sharp, fast, remembers exactly what you asked. By turn 80, the same model with the same prompt is re-suggesting rejected fixes and contradicting earlier decisions. It hasn't gotten dumber. The context window has filled with stale tool outputs, old reasoning chains, and irrelevant intermediate results.

I've watched this happen in real time. You pull up the logs and trace the degradation step by step: around turn 30-40 the agent starts losing precision on earlier decisions. By turn 60 it's hallucinating context it never had. By turn 80 it's a different creature entirely. Same model. Same prompt. Completely different behavior.

The metaphor Maneshwar uses nails it: your context window is a desk, not a filing cabinet. Everything the model can reason about has to fit on the desk at once. Keep piling papers and eventually the important sheet — your original goal, the user's constraints, that critical decision made at turn 5 — is buried under four hundred lines of tool output nobody needs anymore.

This is why naive summarization fails as a compaction strategy. A generic summarizer doesn't know which details are load-bearing for future decisions. It produces a tidy summary that strips out exactly the information the agent needs three turns later. Good compaction requires understanding the task structure: what's the agent trying to accomplish, and what specific details does it need to remember to get there?
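
To make the contrast with generic summarization concrete, here is a minimal sketch of task-aware compaction. It assumes each history entry carries a hypothetical `pinned` flag that marks load-bearing details (the goal, rejected options, user constraints); the `Entry` type, the `keep_recent` cutoff, and the sample history are all illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    turn: int
    kind: str             # e.g. "goal", "decision", "tool_output", "chat"
    text: str
    pinned: bool = False  # load-bearing: must survive compaction verbatim


def compact(entries: list[Entry], current_turn: int, keep_recent: int = 5) -> list[Entry]:
    """Keep pinned and recent entries; shrink or drop everything else."""
    compacted: list[Entry] = []
    for e in entries:
        is_recent = current_turn - e.turn < keep_recent
        if e.pinned or is_recent:
            compacted.append(e)
        elif e.kind == "tool_output":
            # Lossy on purpose: keep a pointer to the output, not the payload.
            stub = f"[turn {e.turn} tool output archived, {len(e.text)} chars]"
            compacted.append(Entry(e.turn, e.kind, stub))
        # Stale, unpinned chat is dropped entirely.
    return compacted


history = [
    Entry(1, "goal", "Migrate billing service to Postgres", pinned=True),
    Entry(5, "decision", "Rejected: in-place upgrade (downtime too long)", pinned=True),
    Entry(12, "tool_output", "x" * 4000),
    Entry(13, "chat", "ok, thanks"),
    Entry(79, "tool_output", "pg_dump finished"),
]
window = compact(history, current_turn=80)
```

The decision from turn 5 survives word for word, the 4,000-character tool dump collapses to a one-line stub, and the small talk disappears. Deciding what gets pinned is the hard part, and it depends on the task structure rather than on the summarizer.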

If you've looked at Netflix's Headroom approach to AI agent cost optimization, you know that token efficiency isn't just a cost problem. Every unnecessary token in the window is attention the model wastes on irrelevant information. Cost and coherence are the same optimization.

How to Design Your Context Architecture

Here's the architecture that actually works in production. After building systems that handle multi-step agent workflows, I've converged on a layered approach that maps directly to the four pillars.

Layer 1: The Active Window

This is the model's working memory: everything currently in the context window. It should contain:

  • The system prompt (compact, focused on the current task)
  • The current user goal or instruction
  • Recent conversation turns (compacted, not raw)
  • Retrieved context relevant to the current step
  • Tool outputs from the current action chain

The discipline here is aggressive curation. Not everything that happened belongs in the active window. After each tool call, ask: does this output need to persist, or can it be summarized and pushed to external storage? Most teams never ask this question. They just let the window fill up and wonder why the agent gets confused.
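
One way to enforce that curation is to assemble the window under an explicit token budget. The sketch below is an assumption-heavy illustration: the priority order (system prompt, goal, current tool outputs, retrieved context, then history), the `build_active_window` helper, and the 4-characters-per-token estimate are all mine, and a real system would use the provider's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def build_active_window(
    system_prompt: str,
    goal: str,
    tool_outputs: list[str],
    retrieved: list[str],
    recent_turns: list[str],
    budget: int,
) -> tuple[list[tuple[str, str]], int]:
    """Fill the window in priority order, skipping items that would exceed the budget."""
    sections = [
        ("system", [system_prompt]),
        ("goal", [goal]),
        ("tool", tool_outputs),
        ("retrieved", retrieved),
        ("history", recent_turns),
    ]
    window: list[tuple[str, str]] = []
    used = 0
    for label, items in sections:
        for item in items:
            cost = estimate_tokens(item)
            if used + cost > budget:
                continue  # candidate for summarizing and pushing to external storage
            window.append((label, item))
            used += cost
    return window, used


window, used = build_active_window(
    system_prompt="You are a migration assistant.",
    goal="Migrate billing to Postgres.",
    tool_outputs=["a" * 400],
    retrieved=["b" * 400],
    recent_turns=["c" * 400, "short note"],
    budget=220,
)
```

With this budget the long history turn is left out while the short one still fits. The point is that the window is built on every step from a priority list, instead of growing by default.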

Layer 2: Session Memory

External short-term storage. A database or cache that holds the full conversation history, tool outputs, and intermediate results for the current session. The active window pulls from session memory via retrieval, not by keeping everything loaded.

Anthropic's engineering team identifies four memory storage types: in-context (the active window), external storage (databases with fuzzy or exact search), in-weights (baked into model parameters via fine-tuning), and in-cache (saved KV computation states). Any production context layer worth its salt uses at least the first two.
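
As a rough sketch of the external-storage half, session memory can be as simple as a SQLite table that holds every event while the active window pulls back only what it needs. The `SessionMemory` class, its schema, and the sample events are hypothetical; the substring search stands in for whatever exact or fuzzy search you actually run.

```python
import sqlite3


class SessionMemory:
    """Full session log lives outside the window; the agent pulls slices back in."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events (turn INTEGER, kind TEXT, content TEXT)"
        )

    def append(self, turn: int, kind: str, content: str) -> None:
        self.db.execute("INSERT INTO events VALUES (?, ?, ?)", (turn, kind, content))
        self.db.commit()

    def search(self, keyword: str, limit: int = 3) -> list[tuple[int, str, str]]:
        # Substring match via LIKE, newest first; swap in fuzzy or embedding search as needed.
        rows = self.db.execute(
            "SELECT turn, kind, content FROM events "
            "WHERE content LIKE ? ORDER BY turn DESC LIMIT ?",
            (f"%{keyword}%", limit),
        )
        return rows.fetchall()


memory = SessionMemory()
memory.append(5, "decision", "Use blue-green deploy")
memory.append(12, "tool_output", "deploy log: 400 lines of output")
memory.append(40, "chat", "what about rollback?")
hits = memory.search("deploy")
```

Nothing is lost when an entry leaves the window, because the full record is still one query away.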

Layer 3: Persistent Memory

Long-term storage. Vector embeddings in a vector database, structured data in Postgres, knowledge graphs. Information that persists across sessions and can be retrieved when relevant.

This is where Lilian Weng's 2023 insight becomes operational: long-term memory can't live in the context window. It requires external stores with fast retrieval. That's the retrieve pillar doing its job.
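
The retrieval step itself is easy to sketch. The example below uses a toy bag-of-words vector and cosine similarity so it runs with the standard library alone; the `embed` function is a deliberate stand-in for a real embedding model, and `PersistentMemory` is a hypothetical in-process substitute for a vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PersistentMemory:
    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def store(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = PersistentMemory()
store.store("user prefers postgres over mysql")
store.store("deploys happen on tuesday mornings")
store.store("billing service is owned by payments")
top = store.retrieve("postgres user preference", k=1)
```

Only the best-matching memory is pulled into the window for the current step; the rest stays in storage until a query makes it relevant.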

Layer 4: The Orchestration Layer

This layer decides what moves between the other three. It handles compaction (when and how to compress the active window), retrieval (what to pull from session or persistent memory), and isolation (when to spawn a sub-agent with a clean window).

In my experience building multi-agent systems, the orchestration layer is where most teams under-invest. They build the retrieval pipeline, set up the vector store, write the system prompt, then wire it all together with duct-tape logic and hope for the best. The orchestration layer deserves the same architectural rigor as everything else. Probably more.

Why Your Context Layer Can't Live Inside the Model

The industry learned this one the hard way.

Jonathan Murray of Backboard.io documented what happened when a government export-control directive pulled the Claude Fable 5 model offline for all customers globally. Zero deprecation notice. One day it was there; the next it wasn't. Cohere CEO Aidan Gomez called it "a massive wake-up call."

If your agent's memory and context live inside a vendor's model context window, they're rented. They can disappear overnight due to geopolitical factors completely outside your control. I think that's an unacceptable risk for any production system, and after the Fable 5 incident, I'm not sure how anyone argues otherwise.

Murray lays out four requirements for a production context layer that I consider non-negotiable at this point:

  • Model-agnostic — works with any LLM provider
  • Off-model — persists in infrastructure you own, not in the vendor's window
  • Portable — can move across providers and regions
  • Programmatically accessible — API and CLI, not buried in a vendor dashboard
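
Here is a minimal sketch of what those four requirements might look like in code, under loud assumptions: `Provider`, `ContextStore`, `EchoProvider`, and `ask` are hypothetical names, the "vendors" are fakes that only echo a message count, and a JSON file stands in for infrastructure you own.

```python
import json
import tempfile
from pathlib import Path
from typing import Protocol


class Provider(Protocol):
    """Anything with a complete() method qualifies, whichever vendor sits behind it."""

    def complete(self, messages: list[dict[str, str]]) -> str: ...


class ContextStore:
    """Off-model memory: plain JSON on storage you control."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.messages: list[dict[str, str]] = (
            json.loads(path.read_text()) if path.exists() else []
        )

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        self.path.write_text(json.dumps(self.messages, indent=2))


class EchoProvider:
    """Hypothetical stand-in for a vendor SDK."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, messages: list[dict[str, str]]) -> str:
        return f"{self.name} saw {len(messages)} messages"


def ask(store: ContextStore, provider: Provider, question: str) -> str:
    store.add("user", question)
    answer = provider.complete(store.messages)
    store.add("assistant", answer)
    return answer


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "context.json"
    first = ask(ContextStore(path), EchoProvider("vendor-a"), "Summarize the incident")
    # Vendor A goes away: reload the same file and carry on with vendor B.
    second = ask(ContextStore(path), EchoProvider("vendor-b"), "What did we decide?")
```

The second provider picks up with the full history from the first, because the history never lived inside either of them.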

This matters a lot if you're building with agent frameworks or agent orchestration tools. Many frameworks couple memory tightly to a specific model provider. The Fable 5 incident proved that coupling is a business continuity risk, not a technical preference.

For teams running local LLM infrastructure, this is actually an advantage you might not have appreciated yet. Your context layer already lives on hardware you control.

The Omniscient Agent Anti-Pattern

Here's the thing nobody's saying about most agent architectures: they're built backwards.

The default approach is to build one magnificent system prompt that contains every capability, bolt on every tool the agent might need, and feed the entire conversation history into a single context window. It feels elegant. It's actually a disaster.

Maneshwar calls this the omniscient agent anti-pattern, and his description is painfully accurate: "The SQL instructions bleed into the email-drafting instructions. The tool the model needs is buried under nine tools it doesn't." The result is an agent that is mediocre at everything rather than excellent at anything.

I've built agents like this. You probably have too. The fix is isolation — the fourth pillar.

In a properly isolated architecture, the orchestrator passes sub-agents a compressed task description, not the full conversation history. The sub-agent gets a fresh window, completes focused work, and returns a single compact result. The orchestrator's window stays clean. The sub-agent's window stays focused. Everyone wins.

Two implementation strategies that work:

  1. Sub-agent spawning — the heavy approach. Each subtask gets its own agent instance with a dedicated context window. Good for complex, multi-step subtasks where the sub-agent needs room to think.
  2. Task-scoped context windows — the lighter approach. You clear and rebuild the context window for each task phase within a single agent. Works well for sequential workflows where tasks don't overlap.
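
A minimal sketch of the first strategy, sub-agent spawning, follows. Everything in it is illustrative: `SubAgent` and `Orchestrator` are hypothetical classes, the tools are lambdas returning canned strings, and the sub-agent runs each of its tools once where a real one would call a model in a loop.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SubAgent:
    name: str
    instructions: str
    tools: dict[str, Callable[[str], str]]
    window: list[str] = field(default_factory=list)

    def run(self, brief: str) -> str:
        # Fresh window: only this agent's instructions and the task brief.
        self.window = [self.instructions, brief]
        for tool_name, tool in self.tools.items():
            self.window.append(f"{tool_name}: {tool(brief)}")
        # Return one compact result, not the working history.
        return f"{self.name} done: {self.window[-1][:60]}"


@dataclass
class Orchestrator:
    agents: dict[str, SubAgent]
    window: list[str] = field(default_factory=list)

    def delegate(self, agent_name: str, brief: str) -> str:
        result = self.agents[agent_name].run(brief)
        self.window.append(result)  # only the summary lands in the orchestrator's window
        return result


sql_agent = SubAgent("sql", "You write read-only SQL.", {"run_query": lambda q: "42 rows"})
email_agent = SubAgent("email", "You draft short emails.", {"draft": lambda b: "Hi team, ..."})
boss = Orchestrator({"sql": sql_agent, "email": email_agent})
result = boss.delegate("sql", "Count overdue invoices")
```

The SQL agent never sees the email instructions or the email tool, and the orchestrator's window grows by one line per delegated task regardless of how much work happened inside the sub-agent.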

Most tutorials on building AI agents with Python default to the omniscient pattern. That works for demos. It breaks in production. Context engineering is what bridges that gap.

The context window is not a hard drive. It's a desk. And a desk with forty open folders on it doesn't make you productive — it makes you paralyzed.

Context Engineering in Practice: A Decision Framework

Theory is clean. Production is messy. Here's the decision framework I use when building agentic AI systems, developed through trial and a lot of error:

When to compact:

  • The conversation has exceeded 30-40 turns
  • Tool outputs are accumulating faster than the agent is using them
  • The agent starts repeating earlier suggestions or contradicting prior decisions (the clearest signal)
  • You're spending more than 50% of your token budget on conversation history

When to isolate:

  • The agent handles more than 3 distinct task types
  • Your system prompt exceeds ~800 tokens of instructions
  • Tools from one domain are interfering with another domain's reasoning
  • You notice the agent selecting wrong tools for the task at hand

When to retrieve instead of store:

  • Reference material exceeds 2,000 tokens
  • The information is only relevant to specific query patterns
  • You're working with knowledge that updates frequently
  • Multiple agents need access to the same information

When to write directly to context:

  • The information is critical for the immediate next step
  • It's small (under 500 tokens)
  • It's likely to be referenced in the next 3-5 turns
  • Retrieval latency would break the user experience
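
The thresholds above translate into a few small predicates. This is a sketch under assumptions, not a complete policy: it uses the lower end of the 30-40 turn range, it leaves out signals that are harder to measure (tool outputs piling up, cross-domain interference, retrieval latency), and the names `WindowStats`, `should_compact`, `should_isolate`, and `placement` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    turns: int
    history_tokens: int
    total_budget: int
    repeated_suggestion: bool = False  # agent re-suggested something already rejected


def should_compact(s: WindowStats) -> bool:
    return (
        s.turns > 30
        or s.repeated_suggestion
        or s.history_tokens > 0.5 * s.total_budget
    )


def should_isolate(task_types: int, system_prompt_tokens: int, wrong_tool_picks: int) -> bool:
    return task_types > 3 or system_prompt_tokens > 800 or wrong_tool_picks > 0


def placement(tokens: int, needed_next_step: bool, updates_often: bool) -> str:
    """Return "write" to put the item in the window, "retrieve" to keep it external."""
    if tokens > 2000 or updates_often:
        return "retrieve"
    if needed_next_step and tokens < 500:
        return "write"
    return "retrieve"
```

For example, `should_compact(WindowStats(turns=10, history_tokens=5000, total_budget=8000))` returns `True` because history has passed half the budget, even though the conversation is still short. Treat the numbers as starting points to tune against your own logs.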

This framework maps to the prompt engineering patterns I've developed over time, but extends them into a full lifecycle model. A good prompt is still necessary. It's just no longer sufficient.

What This Means for AI Agent Architecture Going Forward

Context engineering isn't a trend. It's the formalization of lessons that every serious agent builder has learned through painful production experience. I've been through enough of those lessons to know they don't go away just because a new model drops.

The implications are concrete. Teams building production AI systems need to budget real engineering time for context management, not just model selection and prompt tuning. If you're evaluating agent frameworks, ask how they handle compaction and isolation, not just tool calling and chain-of-thought. If you're hiring AI engineers, context engineering should be a core competency on the job description, not a nice-to-have.

The Anthropic team's insight holds: keep the patterns simple and composable. Don't reach for a framework that abstracts away your context management. You need to see what's on the desk at every step. Abstraction layers that hide the context are abstraction layers that hide the bugs.

My prediction: by the end of 2026, "context engineer" will be as common a title in AI security and agent teams as "prompt engineer" was in 2024. The builders who internalize the four pillars now — write, retrieve, compact, isolate — will be the ones shipping agents that actually survive contact with real users across thousands of turns.

Stop whispering to the model. Start furnishing its brain.

Frequently Asked Questions

What is the difference between context engineering and prompt engineering?

Prompt engineering focuses on crafting the right instructions to send to an LLM. Context engineering is broader — it manages the entire information environment the model has access to at each decision point, including conversation history, retrieved documents, tool outputs, and memory across sessions. Prompt engineering is one piece of context engineering, but context engineering also includes memory architecture, retrieval strategies, and information lifecycle management.

Why do AI agents get worse over long conversations?

AI agents degrade over long conversations because their context window — the information they can reason about at once — fills up with stale tool outputs, old reasoning chains, and irrelevant details. The model doesn't get dumber; its working memory gets polluted. This is called the "dumb at turn 80" problem. Solving it requires deliberate compaction (compressing history while preserving meaning) and isolation (giving subtasks their own clean context windows).

What are the four pillars of context engineering?

The four pillars are Write (deciding what information enters the context window), Retrieve (pulling relevant information from external storage on demand), Compact (lossy compression of conversation history that preserves meaning), and Isolate (giving each subtask its own dedicated context window to prevent cross-contamination). Together, they form a complete framework for managing an AI agent's information environment.

How do you prevent context contamination in multi-agent systems?

The primary strategy is isolation. Instead of passing full conversation history to every sub-agent, the orchestrator passes a compressed task description. Each sub-agent gets a fresh context window with only the instructions, tools, and information that specific job needs. When the sub-agent finishes, it returns a compact result — not its entire working history. This keeps each agent's context small, focused, and free from irrelevant information.

Should your AI agent's memory live inside the model's context window?

No. Memory stored solely in a vendor's context window is rented, not owned. The Fable 5 crisis in 2026 demonstrated this when an export-control directive pulled a frontier model offline overnight. A production memory layer should be model-agnostic, persistent in infrastructure you control, portable across providers, and accessible via API. External memory stores — vector databases, structured databases — give you durability and portability that in-context memory cannot.

