A few months ago, I decided to add an AI chat assistant to my side project—a simple project management dashboard. The goal was to let users ask questions like "What tasks are overdue?" or "Summarize the last team meeting." Sounds simple, right? Just plug in an API call and you're done.
Spoiler: it wasn't that simple. Here's the story of how I went from "this should be easy" to wrestling with context windows, token math, and latency—and what I eventually settled on.
The naïve approach
I started with the obvious: send the entire conversation history plus the user's latest message to an LLM API. For the first few exchanges, it worked beautifully. But after 10–15 messages, the responses became slow or incoherent, or (worse) the API started throwing 400 errors because my context exceeded the token limit.
I was using GPT-4 with an 8k token limit. Each message from the assistant included full task descriptions, meeting notes, and markdown formatting. The history grew fast. By message 20, I was bumping against 7k tokens just for the conversation, leaving almost no room for the system prompt or the new query.
What I tried:
- Truncating history by keeping only the last N messages. That worked for speed but made the assistant lose memory of earlier context. "What did we say about the deadline?" would get blank stares.
- Summarization: every 5 messages, I'd ask another LLM to summarize the conversation so far, and inject that summary as a single message. That helped memory but doubled my API costs and latency. Plus the summaries were often lossy.
- Sliding window with relevance scoring: I tried scoring each past message by similarity to the current query and only keeping the top few. That required an embedding model, a vector store, and added complexity I didn't need yet.
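To make the first and third options concrete, here is a minimal sketch. The function names are hypothetical, and the relevance score uses a bag-of-words cosine similarity purely as a stand-in for a real embedding model and vector store, so treat it as an illustration of the shape of the idea rather than what the app ran.

```python
import math
import re
from collections import Counter


def keep_last_n(history, n):
    """Truncation: keep only the n most recent messages."""
    return history[-n:] if n > 0 else []


def _bag_of_words(text):
    # Crude stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_most_relevant(history, query, k):
    """Relevance scoring: keep the k messages most similar to the query,
    returned in their original chronological order."""
    q = _bag_of_words(query)
    scored = [(_cosine(_bag_of_words(m["content"]), q), i)
              for i, m in enumerate(history)]
    top = sorted(scored, reverse=True)[:k]
    return [history[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Truncation forgets anything older than the window, while relevance scoring can pull an old deadline discussion back in at the price of scoring every message on every turn.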
The breakthrough: streaming + context window management
I realized the problem had two parts: the latency of generating long responses, and the token budget for keeping history. My solution wasn't a single tool—it was a combination of two techniques I'd read about: streaming responses and a fixed-size context window that prioritizes different types of content.
Streaming responses
Instead of waiting for the full reply, I streamed chunks as they arrived. This made the UX feel instant, even if the total generation time was the same. Users saw characters appearing a few hundred milliseconds after hitting enter. I used the standard stream=True parameter in the OpenAI Python SDK and sent the chunks via Server-Sent Events.
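A rough sketch of that pipeline, assuming the v1 OpenAI Python SDK's chunk shape (`chunk.choices[0].delta.content`). The helper names, the JSON payload format, and the `[DONE]` end marker are illustrative choices, and `fake_stream` is a stand-in so the example runs without a network call.

```python
import json
from types import SimpleNamespace


def iter_deltas(stream):
    """Yield the text fragments from an OpenAI-style chat completion stream."""
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks have content None
            yield delta


def to_sse(deltas):
    """Wrap each fragment as a Server-Sent Events frame."""
    for text in deltas:
        # JSON-encode so newlines in the text cannot break the SSE framing.
        payload = json.dumps({"delta": text})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"  # end marker: a convention, not part of SSE itself


# With the real SDK, the stream would come from something like:
#   stream = client.chat.completions.create(
#       model="gpt-4", messages=messages, stream=True)


def fake_stream(parts):
    """Offline stand-in that mimics the SDK's chunk objects."""
    for part in parts:
        delta = SimpleNamespace(content=part)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

A web framework's streaming response can then iterate over `to_sse(iter_deltas(stream))` with the content type set to `text/event-stream`.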
The context budgeting strategy
I divided the context into three logical slots:
- System prompt (~500 tokens) – fixed, never changes.
- Dynamic context (~2000 tokens) – recent actions, task summaries, project state. This gets updated every time a user action changes something.
- Conversation history (~4000 tokens) – a sliding window of the last N exchanges, but with an extra rule: if the user mentions something older, I inject a compressed version from a cache.
The key insight: I didn't need the whole chat history verbatim. I needed enough to answer the current question and maintain coherence over a session of ~30 messages. Once the session ended (e.g., user closes the panel), I didn't keep the history for the next session.
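As a quick check on the arithmetic, the three slots can be written down as targets and compared with the 8k window. This is only a sketch: the constant and function names are made up for illustration, and the slot sizes are the approximate figures from the list above.

```python
CONTEXT_LIMIT = 8192  # GPT-4's 8k window, in tokens

# Approximate slot targets, not hard limits.
SLOTS = {
    "system_prompt": 500,
    "dynamic_context": 2000,
    "conversation_history": 4000,
}

prompt_budget = sum(SLOTS.values())            # tokens sent with each request
reply_headroom = CONTEXT_LIMIT - prompt_budget  # what is left for the answer


def over_budget(usage):
    """Return the names of slots whose actual token usage exceeds the target."""
    return [name for name, used in usage.items() if used > SLOTS[name]]
```

With these numbers the prompt side adds up to 6,500 tokens, which leaves roughly 1,700 tokens for the model's reply.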
Here's a simplified example of how I implemented the context builder in Python:
# count_tokens(text) -> int is a helper defined elsewhere (tiktoken-based in the real app)
def build_messages(user_query, session_state, system_prompt, history, max_tokens=6000):
    # Reserve tokens for the system prompt, the new query, and a 200-token buffer
    sys_tokens = count_tokens(system_prompt)
    budget = max_tokens - sys_tokens - count_tokens(user_query) - 200
    messages = [{"role": "system", "content": system_prompt}]

    # Add dynamic context if there's space
    dyn_context = session_state.get("dynamic_context", "")
    dyn_tokens = count_tokens(dyn_context)
    if dyn_context and dyn_tokens + 500 < budget:  # leave room for at least one exchange
        messages.append({"role": "system", "content": f"Current project state: {dyn_context}"})
        budget -= dyn_tokens

    # Add as much history as fits, walking from newest to oldest
    insert_at = len(messages)  # history goes after the system messages
    for msg in reversed(history):
        msg_tokens = count_tokens(msg["content"])
        if budget - msg_tokens < 100:
            break
        messages.insert(insert_at, msg)  # fixed index keeps chronological order
        budget -= msg_tokens

    # Add user query
    messages.append({"role": "user", "content": user_query})
    return messages
This is rough—I used tiktoken for actual counting—but it shows the idea. I also added a fallback: if the budget is too small for even one previous message, I inject a short summary instead.
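For reference, a token counter along these lines could look like the sketch below. It uses tiktoken's `encoding_for_model` when the package and its encoding data are available, and otherwise falls back to a rough four-characters-per-token estimate, which is an assumption rather than anything exact. The `fallback_summary_message` helper and its `summary` argument are hypothetical: the summary is assumed to have been produced elsewhere.

```python
def make_token_counter(model="gpt-4"):
    """Return a count_tokens(text) function, using tiktoken when available."""
    try:
        import tiktoken
        encoding = tiktoken.encoding_for_model(model)
        return lambda text: len(encoding.encode(text))
    except Exception:
        # Assumption: roughly 4 characters per token for English text.
        return lambda text: (len(text) + 3) // 4


count_tokens = make_token_counter()


def fallback_summary_message(history, summary, budget):
    """Return a summary message when not even the newest history message
    fits in the remaining budget; otherwise return None."""
    if not history or not summary:
        return None
    if count_tokens(history[-1]["content"]) <= budget:
        return None  # the normal history loop can handle it
    return {"role": "system",
            "content": f"Summary of earlier conversation: {summary}"}
```

Note that this counts message text only. Chat APIs add a few tokens of overhead per message, which is one reason to keep a buffer in the budget.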
The tool that tied it together
For the actual API calls, I tried a few providers. One that fit well was InterWest AI because they support streaming natively and have configurable context limits. I just set max_tokens (which caps the length of each reply in OpenAI-compatible APIs) to 4096 and let the client handle the chunking. No special SDK needed: it's compatible with the OpenAI client once you point it at a different base URL.
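Pointing the OpenAI client at another provider usually comes down to two constructor arguments. In this sketch the base URL, the environment variable names, and the model name are all placeholders, since the post doesn't give the real values; check the provider's documentation for those.

```python
import os

# Placeholder endpoint: substitute the provider's documented base URL.
PROVIDER_BASE_URL = os.environ.get("CHAT_BASE_URL", "https://api.example.com/v1")


def make_client():
    """Build an OpenAI-compatible client for the configured provider."""
    from openai import OpenAI  # imported lazily so this module loads without the SDK
    return OpenAI(base_url=PROVIDER_BASE_URL, api_key=os.environ["CHAT_API_KEY"])


def request_params(messages, model="example-model"):
    """Keyword arguments for client.chat.completions.create(**params)."""
    return {
        "model": model,        # placeholder model name
        "messages": messages,
        "max_tokens": 4096,    # cap on the reply length
        "stream": True,        # receive the reply as chunks
    }
```

Keeping the parameters in one function makes it easy to swap providers without touching the context-building code.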
But honestly, the same approach works with any provider. The technique matters more than the endpoint.
Trade-offs and limitations
- Cost: I still pay per token, but because I'm not sending the full history every time, I reduced payload size by ~40%, which cut both cost and latency.
- Memory: For very long sessions (>50 messages) the assistant will forget early context unless I cache key facts. I added a simple vector store for that, but it's overkill for most use cases.
- Session continuity: When users return the next day, they start fresh. Some apps need long-term memory, and this approach doesn't handle that out of the box.
What I'd do differently next time
If I were starting today, I'd skip the manual token counting and use a library like aispy or LangChain's memory modules. But those come with their own complexity and dependencies. For a simple chat feature, my custom budget is easier to reason about.
I'd also invest earlier in rate limiting and retry logic—streaming helps with UX, but if the API rate-limits you mid-stream, it's a bad experience.
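A minimal retry wrapper with exponential backoff and jitter might look like this. `RateLimited` is a hypothetical stand-in for whatever exception the client raises on HTTP 429 (in the v1 OpenAI Python SDK that would be `openai.RateLimitError`), and the attempt count and delays are arbitrary starting points.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the provider's rate-limit error (HTTP 429)."""


def with_retries(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call() and retry on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller show an error
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from clients that failed together.
            sleep(delay + random.uniform(0, delay / 2))
```

This covers failures before the first chunk arrives. If the stream breaks partway through a reply, the request has to be restarted, so the UI needs a way to discard or replace the partial text.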
The real lesson
Adding AI to an app isn't just about calling an API. It's about managing context in a way that feels natural to users without blowing your token budget. Streaming buys you the perception of speed, but a good context strategy buys you intelligence.
I'm curious: how do you handle conversation memory in your apps? Do you use sliding windows, summarization, or something else entirely? I'd love to hear what's worked for you.