WonderLab

Posted on Jun 18

Agent Series (22): Context Engineering Deep Dive — Quantifying Three Context Management Strategies

#ai #agents #langchain #opensource

The Linear Cost Problem

Agents aren't stateless API calls: they need to remember conversation history. Every turn accumulates in the context window, so each call carries more tokens than the last:

Turn 1:   ~1K tokens   ← cheap
Turn 10:  ~5K tokens   ← manageable
Turn 50:  ~25K tokens  ← getting expensive
Turn 100: ~50K tokens  ← replaying the entire history on every call

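This isn't theoretical. The demo's 30-turn project discussion takes ~2,500 tokens of full history; after 100 turns that number is ~8,000, growing linearly with every exchange. (Its messages are shorter than in the illustration above, but the growth pattern is the same.)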
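
A back-of-the-envelope sketch of what that growth means for the total bill. It assumes a flat 500 tokens per turn, a rate read off the illustration above rather than measured; `prompt_tokens` and `cumulative_tokens` are names made up for this example. The per-call size grows linearly, so the total across all calls grows roughly quadratically.

```python
TOKENS_PER_TURN = 500  # assumed average, read off the illustration; not measured


def prompt_tokens(turn: int) -> int:
    """Tokens sent on a given turn when the full history is replayed."""
    return turn * TOKENS_PER_TURN


def cumulative_tokens(turns: int) -> int:
    """Total tokens sent across every call up to and including `turns`."""
    return sum(prompt_tokens(t) for t in range(1, turns + 1))


for n in (10, 50, 100):
    print(f"turn {n:>3}: {prompt_tokens(n):>6,} per call, "
          f"{cumulative_tokens(n):>9,} cumulative")
```

Under that assumption, the 100th call sends 50,000 tokens, but the conversation as a whole has sent about 2.5 million.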

Three common responses to this problem:

Strategy          Approach                                     Intuitive trade-off
──────────────────────────────────────────────────────────────────────────────────
Naive             Pass full history every time                 Expensive, but accurate
Sliding Window    Keep only the last N messages                Saves tokens, may lose info
Rolling Summary   LLM compresses old messages + keeps recent   Balanced?

This article benchmarks all three with real numbers to test whether the intuitive trade-offs hold.

Demo Design

Conversation Construction

30 turns of project discussion covering database choice, cache config, migration ownership, deployment platform, CI/CD, authentication, and 24 other technical decisions. Key design choice: the important decisions are placed in turns 1–4 (the earliest), with recent turns containing less critical content. This forces context-loss failures to surface.

Three Strategy Implementations

Strategy 1: Naive (baseline)

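import time
from langchain_core.messages import HumanMessage, SystemMessage

def run_naive(history: list, query: str, keywords: list[str]) -> StrategyResult:
    # Send the system prompt, the full history, and the new query on every call.
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + history + [HumanMessage(content=query)]
    tokens = count_messages_tokens(msgs)  # prompt size, computed before the call
    t0 = time.perf_counter()
    text = str(llm.invoke(msgs).content)
    latency = time.perf_counter() - t0
    return StrategyResult(text, tokens, latency, recall_score(text, keywords))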
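
The helpers `StrategyResult` and `count_messages_tokens` are not shown in the article. A minimal sketch of what they might look like follows. The field names, the 4-characters-per-token heuristic, and the `Msg` stand-in class are all assumptions for illustration; the results section only says the token counts are estimated.

```python
from dataclasses import dataclass


@dataclass
class StrategyResult:
    # Field order matches the positional call in run_naive; names are assumed.
    text: str
    tokens: int
    latency_s: float
    recall: float


@dataclass
class Msg:
    """Stand-in for a chat message; anything with a `.content` works."""
    content: str


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return len(text) // 4


def count_messages_tokens(messages: list) -> int:
    """Sum the per-message estimate over a list of message-like objects."""
    return sum(estimate_tokens(str(m.content)) for m in messages)


sample = [Msg("a" * 40), Msg("b" * 8)]
print(count_messages_tokens(sample))
```

A real tokenizer would give more accurate numbers, but for comparing strategies against each other a consistent estimate is usually enough.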

Strategy 2: Sliding Window (truncation)

def run_sliding_window(
    history: list, query: str, keywords: list[str], window: int = 12
) -> StrategyResult:
    recent = history[-window:]   # keep only the last 12 messages
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + recent + [HumanMessage(content=query)]
    ...
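
To make the truncation concrete, here is a small sketch of which turns survive the window. It uses plain strings in place of message objects and treats each turn as one list entry, both of which are simplifications; `sliding_window` is a name made up for this example.

```python
def sliding_window(history: list[str], window: int = 12) -> list[str]:
    """Keep only the last `window` entries of the history."""
    # Guard the zero case: history[-0:] is history[0:], i.e. the WHOLE list.
    return history[-window:] if window > 0 else []


history = [f"turn {i}" for i in range(1, 31)]  # 30 entries, oldest first
kept = sliding_window(history)
dropped = history[: len(history) - len(kept)]

print("first kept:  ", kept[0])
print("dropped span:", dropped[0], "to", dropped[-1])
```

With 30 entries and a window of 12, everything from turn 1 through turn 18 is dropped, which includes all four decisions the test queries ask about.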

Strategy 3: Rolling Summary

def summarize(messages: list) -> str:
    """Compress a block of conversation into a bullet-point decision list."""
    text = "\n".join(
        f"{'User' if isinstance(m, HumanMessage) else 'Assistant'}: {m.content}"
        for m in messages
    )
    prompt = (
        "Compress the following project discussion into concise bullet points.\n"
        "Preserve: every decision made, owner names, technical choices, exact numbers.\n"
        "Remove: conversational filler, redundancy.\n\n"
        f"Conversation:\n{text}\n\n"
        "Bullet-point summary:"
    )
    return str(llm.invoke([HumanMessage(content=prompt)]).content)


def run_rolling_summary(
    history: list, query: str, keywords: list[str],
    recent_window: int = 8, cached_summary: str | None = None,
) -> tuple[StrategyResult, str]:
    old = history[:-recent_window]
    recent = history[-recent_window:]
    summary = cached_summary if cached_summary is not None else summarize(old)

    # Inject the summary into the system prompt (named to avoid shadowing the sys module)
    system_text = SYSTEM_PROMPT + f"\n\n## Earlier Meeting Notes (Summary)\n{summary}"
    msgs = [SystemMessage(content=system_text)] + recent + [HumanMessage(content=query)]
    ...

Key design: the cached_summary parameter lets the summary be built once and reused across all 4 test queries — the 38.2s build cost is paid only once.

Test Queries (all targeting the earliest decisions)

Query 1: What database did we choose? Who owns it, and why?
         Keywords: postgresql / timescaledb / david / acid / time-series

Query 2: What's our caching technology and TTL configuration?
         Keywords: redis / cluster / 1 hour / 5 minute / 16

Query 3: Who is responsible for database migrations, and what approvals are needed?
         Keywords: sarah / backend lead / 2 / senior / flyway

Query 4: What deployment platform and cluster configuration did we decide on?
         Keywords: kubernetes / eks / helm / argocd / 3-node

Recall = keywords found in response / total keywords. Simple but deterministic — no extra LLM judge calls needed.
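
The article does not show `recall_score` itself. One plausible implementation of the formula above is sketched here, assuming case-insensitive substring matching; the sample answer is invented for illustration.

```python
def recall_score(text: str, keywords: list[str]) -> float:
    """Fraction of keywords that appear in the response (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


# Hypothetical response that states the decision but not the rationale.
answer = "We chose PostgreSQL with the TimescaleDB extension; David owns it."
keywords = ["postgresql", "timescaledb", "david", "acid", "time-series"]
score = recall_score(answer, keywords)
print(f"{score:.0%}")  # 3 of 5 keywords found
```

One caveat with substring matching: very short keywords such as "2" or "16" can match inside unrelated numbers (for example, "2" inside "24h"), so scores for queries with numeric keywords may be slightly inflated.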

Results

History: 30 turns  |  Full context: ~2,485 estimated tokens
Rolling summary build time: 38.2s (one-time, cached for all queries)

Per-Query Recall

Query                          Naive   Sliding   Rolling
─────────────────────────────────────────────────────────
DB decision (turn 1)            100%        0%        0%
Cache config (turn 2)            60%       40%       80%
Migration ownership (turn 3)     80%       20%       60%
Deployment platform (turn 4)     80%       20%       60%

Aggregate Metrics

Strategy                   Avg Tokens   Avg Latency   Avg Recall
─────────────────────────────────────────────────────────────────
Naive (full history)            2,513          9.6s        80%
Sliding Window (last 12)          604         17.4s        20%
Rolling Summary                 1,289          8.5s        50%

Token reduction vs Naive:
  Sliding Window: -76%
  Rolling Summary: -49%

Key insights:
  Highest recall:       Naive (full history)
  Most token-efficient: Sliding Window (last 12)
  Best quality/cost:    Rolling Summary
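
The aggregate figures can be re-derived from the per-query table. The sketch below does that arithmetic and also computes "recall points per 1K tokens" as one possible reading of quality/cost; that metric is an assumption, since the article does not define how the ranking was computed.

```python
# Numbers copied from the two tables above.
results = {
    "Naive": {"tokens": 2513, "recall": [100, 60, 80, 80]},
    "Sliding Window": {"tokens": 604, "recall": [0, 40, 20, 20]},
    "Rolling Summary": {"tokens": 1289, "recall": [0, 80, 60, 60]},
}
baseline = results["Naive"]["tokens"]


def summarize_row(name: str) -> tuple[float, float, float]:
    """Return (avg recall %, token reduction vs Naive, recall per 1K tokens)."""
    row = results[name]
    avg_recall = sum(row["recall"]) / len(row["recall"])
    reduction = 1 - row["tokens"] / baseline
    recall_per_1k = avg_recall / (row["tokens"] / 1000)
    return avg_recall, reduction, recall_per_1k


for name in results:
    avg, red, per_1k = summarize_row(name)
    print(f"{name:<16} recall {avg:>3.0f}%  reduction {red:>4.0%}  "
          f"recall/1K tokens {per_1k:.1f}")
```

By this measure Rolling Summary comes out ahead (about 38.8 versus 33.1 for Sliding Window and 31.8 for Naive), though the margin is modest and rests on only four queries.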

Three Counter-Intuitive Findings

Finding 1: Truncation's cost far exceeds intuition

Sliding Window saves 76% of tokens but drops recall from 80% to 20%.

This isn't surprising in principle, since turns 1–4 fall well outside a 12-message window, but the magnitude is striking. Query 1 (database decision) scores 0%: zero out of five keywords found. The model isn't "fuzzy on the details." It has no idea the decision was ever made.

Takeaway: Sliding Window works for stateless short-horizon tasks. For anything requiring retrieval of early decisions, it's a trap.

Finding 2: Summaries occasionally outperform raw history

Query 2 (cache config): Rolling Summary 80% > Naive 60%.

Why does a compressed summary beat the original? In raw history, the caching discussion is scattered across 2,500 tokens — the model has to locate the relevant turn inside a sea of noise. The summary packs all decisions into a structured list; signal density is higher, extraction is easier.

This exposes a hidden Naive failure mode: longer context = sparser signal = model may miss information that's literally right there. Adding tokens doesn't always add accuracy.

Finding 3: Compression loss is a real bug, not noise

Query 1 with Rolling Summary scores 0%, yet the summary explicitly contains:

- Database: PostgreSQL with TimescaleDB extension (David, DB Lead)

Keywords postgresql, timescaledb, and david are in the summary, but none appeared in the model's response. Investigation reveals that the model answered the question about database choice but didn't mention ACID compliance or time-series, the technical reasons for the choice. Those two keywords were unrecoverable from the start: the summary bullet preserved the decision, not the rationale.

This is the fundamental cost of compression: summaries keep "what was decided" and lose "why it was decided". Queries that require reasoning over the rationale (not just recalling the fact) hit this gap hard.

When to Use Each Strategy

Task type                                       Recommended strategy
─────────────────────────────────────────────────────────────────────
Short, stateless, each turn independent         Naive (history is short anyway)
Long conversation, only recent turns matter     Sliding Window (save tokens)
Long conversation, need to recall early events  Rolling Summary (balanced)
Need precise recall of technical rationale      Naive, or Rolling + explicit
                                                "preserve reasons" in prompt

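One way to apply the "preserve reasons" suggestion from the last row is to add a rationale clause to the summarization prompt. The sketch below is a variation on the prompt in `summarize`; the `build_summary_prompt` name, the `preserve_rationale` flag, and the extra clause wording are assumptions, and the article's benchmark did not test whether this improves recall.

```python
def build_summary_prompt(conversation: str, preserve_rationale: bool = True) -> str:
    """Build the compression prompt, optionally asking to keep the 'why'."""
    preserve = "every decision made, owner names, technical choices, exact numbers"
    if preserve_rationale:
        # Added clause: keep the justification next to each decision.
        preserve += ", and the stated reason for each decision"
    return (
        "Compress the following project discussion into concise bullet points.\n"
        f"Preserve: {preserve}.\n"
        "Remove: conversational filler, redundancy.\n\n"
        f"Conversation:\n{conversation}\n\n"
        "Bullet-point summary:"
    )


print(build_summary_prompt("User: Why PostgreSQL?\nAssistant: ACID guarantees."))
```

Keeping rationale will make the summary longer, so the token savings shrink somewhat; the right balance depends on how often queries ask "why" rather than "what".
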
Rolling Summary production optimizations:

Summarization granularity: trigger once every 20–30 turns, not per message — frequent compression defeats the purpose
Prompt requires specifics: the "Preserve: every decision, owner names, exact numbers" instruction is non-negotiable
Two-layer structure: summary (old) + recent (last N) — do not remix the summary into the recent messages and re-compress
Lazy build: build the summary the first time it's needed, then cache; don't rebuild on every query

The Actual Rolling Summary Output

This is what the demo built from 22 turns of conversation. It shows both the compression ratio and how much information survives:

- Database: PostgreSQL with TimescaleDB extension (David, DB Lead)
- Caching: Redis Cluster, TTL: 1h sessions, 5m dashboards, 16GB max per node
- Database Migrations: Sarah (Backend Lead), Flyway, 2 senior-engineer approvals
- Deployment: Kubernetes on AWS EKS, Helm charts, 3-node prod, 1-node staging, ArgoCD
- API Versioning: URL path versioning, 2 major versions, 6m deprecation notice
- Authentication: JWT, 24h TTL users, 1h admins, 30-day Redis refresh TTL
- Logging: Structured JSON, Fluentd → ELK, 30-day hot, 1-year cold S3 Glacier
- Rate Limiting: Token bucket, 100 req/min standard, 1000 req/min premium, Redis
- CI/CD: GitHub Actions + ArgoCD, blue-green, 5m health check
- Internal Services: REST external, gRPC + Protocol Buffers internal
  ... (22 bullet points total; ~1,800 tokens original → ~600 tokens compressed, 3:1 ratio)

Design Checklist

Strategy selection

[ ] Identify how far back the task needs to recall
[ ] Sliding Window: only when recent context is self-sufficient
[ ] Rolling Summary: when early decisions matter but verbatim history isn't required
[ ] Naive: when history is inherently short, or when reasoning over rationale is required

Rolling Summary implementation

[ ] Prompt explicitly requires: decisions, owners, numbers, technical choices
[ ] Summary injected into system prompt, not mixed into the message list
[ ] Summary built once and cached — not rebuilt per query
[ ] Trigger threshold: summarize when messages exceed N turns (recommend 20–30)

Recall validation

[ ] Test queries must target early turns, not only recent ones
[ ] Keywords must include technical reasons, not just conclusions ("acid" not just "postgresql")
[ ] Run keyword-recall benchmark before choosing a production strategy

Summary

Three conclusions:

Truncation is a trap without measurement: Sliding Window saves 76% of tokens but drops average recall from 80% to 20%, and to 0% on the turn-1 database query. Without a benchmark, the cliff edge is invisible: you ship, and your agent silently forgets everything outside the window
Summary compression loses "why," not "what": Decisions survive compression; their rationale often does not. Queries that need the reasoning chain are better served by Naive or a summary prompt that explicitly requires preserving reasons
Rolling Summary's build cost is a one-time payment: 38 seconds looks alarming, but it's a cached operation. At query time the ~600-token summary stands in for ~1,800 tokens of old history, and the measured request size averages 1,289 tokens instead of 2,513, a 49% reduction on every subsequent call

Full demo code: agent-21-context-engineering-deep
