LLM Context Window Management: Strategies and Patterns

Managing context windows in production LLM applications is one of those problems that everyone underestimates until their app crashes or costs spiral out of control. Token limits are hard walls, not soft guidelines, and the strategies you choose upfront determine whether your system stays reliable at scale.

Why Context Windows Break Production Apps

Most developers hit context limits the same way: they chain a few prompts together, test with small inputs, ship, then watch the system fail when a user uploads a 50-page PDF. Context windows are finite buffers — once you exceed them, the API returns an error, not a graceful degradation.

The practical pain points:

Input truncation: silently cutting content to make it fit means the model reasons over incomplete data
Cost explosion: naively stuffing the window on every call burns tokens on static content (system prompts, tool schemas)
Latency degradation: larger contexts mean slower TTFT (time to first token)

Understanding the trade-offs between different management strategies is the first step to avoiding these failure modes.
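
One way to make the hard limit explicit is a pre-flight check that refuses to send a request that cannot fit. The sketch below is an illustration, not a provider API: `ContextBudgetError` and `check_budget` are hypothetical names, the per-message overhead of 4 is a rough assumption, and `count_tokens` is any callable you supply that mirrors your provider's tokenizer.

```python
from typing import Callable


class ContextBudgetError(ValueError):
    """Raised when a request would not fit in the context window (hypothetical name)."""


def check_budget(
    messages: list[dict],
    count_tokens: Callable[[str], int],
    context_window: int,
    reserved_for_output: int,
    per_message_overhead: int = 4,  # assumption: rough per-message framing cost
) -> int:
    """Return the estimated prompt size, or raise if it cannot fit."""
    prompt_tokens = sum(
        count_tokens(m["content"]) + per_message_overhead for m in messages
    )
    # Leave room for the completion; the window covers input and output together.
    available = context_window - reserved_for_output
    if prompt_tokens > available:
        raise ContextBudgetError(
            f"prompt needs ~{prompt_tokens} tokens, only {available} available"
        )
    return prompt_tokens
```

Failing in your own code, before the network call, lets you decide what to do next (truncate, summarize, or reject the input) instead of discovering the overflow from an API error.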

Strategy 1: Sliding Window with Hard Token Limits

The simplest approach is a sliding window that keeps only the most recent N tokens of conversation history. It is not smart — it discards early context — but it is predictable and cheap to implement.

import tiktoken

def truncate_messages(messages: list[dict], max_tokens: int, model: str = "gpt-4o") -> list[dict]:
    enc = tiktoken.encoding_for_model(model)

    total = 0
    result = []

    # Always keep the system message
    if messages and messages[0]["role"] == "system":
        system_msg = messages[0]
        total += len(enc.encode(system_msg["content"])) + 4  # +4 for message overhead
        result = [system_msg]
        messages = messages[1:]

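    # Walk backwards through history and keep what fits. Every kept message
    # is inserted at the same index (right after the system message, if any),
    # so the result stays in chronological order.
    insert_at = len(result)
    for msg in reversed(messages):
        tokens = len(enc.encode(msg["content"])) + 4
        if total + tokens > max_tokens:
            break
        total += tokens
        result.insert(insert_at, msg)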

    return result

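This works for conversational bots where recency matters more than total history. The critical constraint: count tokens the same way the API does. tiktoken covers OpenAI models (encoding_for_model raises a KeyError for model names it does not recognize), and the +4 per message is only an approximation of the chat framing overhead. For other providers, check their tokenizer documentation.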
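
When no local tokenizer is available for a provider, a conservative character-based estimate can stand in until you wire up the real one. This is a rough heuristic, not any provider's counting rule: `estimate_tokens`, the 4-characters-per-token ratio, and the 20% safety margin are all assumptions that you should calibrate against the usage numbers the API reports.

```python
import math


def estimate_tokens(
    text: str,
    chars_per_token: float = 4.0,  # assumption: rough average for English prose
    safety_margin: float = 1.2,    # assumption: overestimate to stay under the limit
) -> int:
    """Conservatively estimate a token count from character length."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token * safety_margin)
```

Overestimating wastes a little of the window; underestimating produces the overflow error you were trying to avoid, so the margin deliberately errs high. Code, non-English text, and unusual formatting can tokenize far less efficiently than prose.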

Strategy 2: Semantic Compression with Summarization

When history matters — support tickets, code sessions, research threads — truncation loses information. A better pattern: periodically summarize older conversation segments and replace them with a compressed version.

import tiktoken
from typing import Callable

def compress_history(
    messages: list[dict],
    summarize_fn: Callable[[list[dict]], str],
    threshold_tokens: int = 3000,
    model: str = "gpt-4o"
) -> list[dict]:
    enc = tiktoken.encoding_for_model(model)

    def count_tokens(msgs: list[dict]) -> int:
        return sum(len(enc.encode(m["content"])) + 4 for m in msgs)

    if count_tokens(messages) <= threshold_tokens:
        return messages

    system = [m for m in messages if m["role"] == "system"]
    tail   = [m for m in messages if m["role"] != "system"]

    if len(tail) <= 4:
        return messages

    to_compress = tail[:-4]
    recent      = tail[-4:]

    summary     = summarize_fn(to_compress)
    summary_msg = {
        "role": "system",
        "content": f"[Earlier conversation summary]\n{summary}"
    }

    return system + [summary_msg] + recent


def call_llm_for_summary(msgs: list[dict]) -> str:
    # Call your language model here with a summarization prompt.
    # Return the summary as a plain string.
    ...

# compressed = compress_history(conversation_history, call_llm_for_summary)

The cost: one extra LLM call per compression event. The benefit: the model retains the gist of the full conversation without unbounded context growth. For applications handling sensitive data, review your security hardening checklist before routing conversation history to external API endpoints — what goes into the prompt is often more sensitive than what comes out.
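
The `call_llm_for_summary` stub leaves the summarization prompt open. A minimal sketch of that piece, assuming a chat-style message format, might look like the following. `build_summary_request`, the `max_words` parameter, and the prompt wording are hypothetical; the actual model call is left to whatever client you use.

```python
def build_summary_request(msgs: list[dict], max_words: int = 150) -> list[dict]:
    """Turn older turns into a chat request asking for a compact summary."""
    # Flatten the turns into a plain transcript, one line per message.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in msgs)
    return [
        {
            "role": "system",
            "content": (
                "Summarize the conversation below in at most "
                f"{max_words} words. Preserve names, decisions, open questions, "
                "and any constraints the user stated."
            ),
        },
        {"role": "user", "content": transcript},
    ]
```

Inside `call_llm_for_summary`, you would send the result of `build_summary_request(msgs)` to your model and return the text of its reply. Capping the summary length matters: an unbounded summary can end up larger than the turns it replaced.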

Strategy 3: Chunked Retrieval for Document-Heavy Workloads

For document-heavy workloads — legal review, codebase Q&A, long reports — you do not want to send the entire document on every call. You want only the chunks that are relevant to the current query.

The pattern: embed your documents at ingestion time, then at query time retrieve only the top-K chunks by semantic similarity.

import numpy as np
import tiktoken
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr = np.array(a)
    b_arr = np.array(b)
    denom = np.linalg.norm(a_arr) * np.linalg.norm(b_arr)
    if denom == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return float(np.dot(a_arr, b_arr) / denom)

def retrieve_relevant_chunks(
    query_embedding: list[float],
    chunks: list[Chunk],
    top_k: int = 5,
    max_tokens: int = 2000,
    model: str = "gpt-4o"
) -> list[Chunk]:
    enc = tiktoken.encoding_for_model(model)

    scored = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_embedding, c.embedding),
        reverse=True
    )

    selected     = []
    total_tokens = 0

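    # Consider up to 2x top_k candidates in relevance order and keep the ones
    # that fit the token budget. An oversized chunk is skipped rather than
    # ending the scan, so smaller lower-ranked chunks can still be used.
    for chunk in scored[:top_k * 2]:
        chunk_tokens = len(enc.encode(chunk.text))
        if total_tokens + chunk_tokens > max_tokens:
            continue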
        selected.append(chunk)
        total_tokens += chunk_tokens

    return selected

Key observation: top_k is a starting point, not a ceiling. Always apply a secondary token budget — individual chunk sizes vary, and a fixed top_k can blow your context limit without warning.
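
The retrieval code assumes chunks already exist. For the ingestion side, a minimal sketch of splitting a document into overlapping chunks before embedding could look like this. `chunk_words` and its defaults are hypothetical, and sizing by words is an assumption made for simplicity; many pipelines split on tokens or on document structure (headings, paragraphs) instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-based chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some text twice. Smaller, more uniform chunks also make the secondary token budget easier to fill without waste.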

Context Packing: Order and Structure Matter

Beyond which content to include, the order within the context window affects both cost and quality. A reliable structure:

[System prompt — static, always first]
[Retrieved context — RAG chunks, relevant docs]
[Compressed history summary — if applicable]
[Recent conversation turns — last 4-8 messages]
[Current user message]

A few non-obvious rules that trip people up:

Keep system prompts static — any variation between calls disables prompt caching, which can meaningfully cut costs on high-volume deployments
Do not inject user data into the system prompt — it defeats caching and introduces a prompt-injection attack surface
Separate tool schemas from instructions — some APIs cache tool definitions independently; mixing them into the main prompt body resets the cache
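
The ordering above can be sketched as a single assembly function. This is one possible layout under stated assumptions, not a required format: `pack_context` is a hypothetical helper, the `[Reference material]` label is invented for illustration, and the choice of roles for retrieved text and the summary is an assumption. Some providers require strict user/assistant alternation or allow only one system message, so adjust the roles to your API.

```python
from typing import Optional


def pack_context(
    system_prompt: str,
    retrieved_chunks: list[str],
    history_summary: Optional[str],
    recent_turns: list[dict],
    user_message: str,
) -> list[dict]:
    """Assemble messages in a fixed order: static content first, volatile last."""
    # The system prompt is passed through unchanged so it stays identical across calls.
    messages = [{"role": "system", "content": system_prompt}]

    # Assumption: retrieved text goes in its own message, outside the system prompt.
    if retrieved_chunks:
        joined = "\n\n".join(retrieved_chunks)
        messages.append({"role": "user", "content": f"[Reference material]\n{joined}"})

    # Same summary message shape that compress_history produces.
    if history_summary:
        messages.append({
            "role": "system",
            "content": f"[Earlier conversation summary]\n{history_summary}",
        })

    messages.extend(recent_turns)
    messages.append({"role": "user", "content": user_message})
    return messages
```

Keeping assembly in one place means the ordering rules are enforced by code rather than by convention, and any change to the layout happens in a single function.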

Measuring What You Actually Send

You cannot optimize what you do not measure. A simple decorator logs token usage per call:

import logging
from functools import wraps

logger = logging.getLogger(__name__)

def log_token_usage(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        response = func(*args, **kwargs)
        # usage may be absent or None, depending on the client and call type
        usage = getattr(response, "usage", None)
        if usage is not None:
            logger.info(
                "llm_call_tokens",
                extra={
                    "prompt_tokens":     usage.prompt_tokens,
                    "completion_tokens": usage.completion_tokens,
                    "total_tokens":      usage.total_tokens,
                }
            )
        return response
    return wrapper

Feed this into whatever observability stack you use. Over time, the distribution of prompt token counts reveals whether your management strategies are working — or whether certain query patterns are silently bypassing them and sending far more than intended.
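
If you do not have a metrics backend yet, even a small offline summary of the logged counts shows the shape of the distribution. The sketch below is a hypothetical helper (`summarize_prompt_tokens` and the `budget` parameter are not part of any library); it uses the nearest-rank percentile method, which is one of several common definitions.

```python
def summarize_prompt_tokens(counts: list[int], budget: int) -> dict:
    """Summarize logged prompt token counts against an intended budget."""
    if not counts:
        return {"calls": 0, "p50": 0, "p95": 0, "max": 0, "over_budget": 0}

    ordered = sorted(counts)

    def percentile(p: int) -> int:
        # Nearest-rank method: the smallest value with at least p% of
        # samples at or below it. Integer math avoids float rounding.
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {
        "calls": len(ordered),
        "p50": percentile(50),
        "p95": percentile(95),
        "max": ordered[-1],
        "over_budget": sum(1 for c in ordered if c > budget),
    }
```

A p50 that sits comfortably under budget next to a p95 or max far above it is the signature of a code path that skips your truncation or retrieval limits.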

The Takeaway

Context window management is not a set-it-and-forget-it problem. The right strategy depends on your workload:

Short chat sessions: a sliding window with a hard token cap is sufficient
Long-running sessions with continuity requirements: periodic summarization keeps coherence without unbounded growth
Document or codebase Q&A: chunked retrieval with a secondary token budget cap
All production deployments: instrument token usage from day one

All three patterns compose naturally — sliding window for recent turns, summarization for older history, RAG for external documents. The added complexity pays off once you have been paged at 2 am because an LLM call returned a context overflow error on a payload that always worked fine in testing.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.