Originally published on PrepStack.
We took an enterprise RAG assistant from an 18% wrong-answer rate to 3% — without changing the model. The lever wasn't the prompt. It was the context we assembled and fed the model.
The mental shift
The model isn't your product; the context you assemble is. Prompt engineering tweaks the wording. Context engineering controls what data enters the window. Treat the context window like a CPU cache — a scarce, governed resource — not a junk drawer.
The pipeline
Naive top-k RAG dumped 8 fuzzy chunks into a ~14,000-token prompt and hoped. We replaced it with a real pipeline, split across ASP.NET Core (orchestration) and a Python FastAPI service (retrieval + ranking):
- Rewrite the vague user question into a self-contained query
- Hybrid retrieval — BM25 keyword + vector, not vector-only
- Cross-encoder re-rank a wide candidate pool down to the best 6
- Budget the window (~3,500 tokens, every token allocated)
- Compress chunks to only the sentences that matter
- Ground + cite every claim — or refuse and route to a human
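
Two of the steps above, hybrid retrieval and window budgeting, can be sketched in a few lines of Python. This is not the code from the full write-up. Reciprocal rank fusion (RRF) is assumed here as one common way to merge a BM25 ranking with a vector ranking, since this post does not say how the two are combined. The function names, the sample chunks, and the whitespace-based `count_tokens` are all hypothetical; a real service would count tokens with the model's own tokenizer.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one fused ranking.

    Assumption: RRF stands in for whatever fusion the real pipeline uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def count_tokens(text):
    # Crude stand-in: counts whitespace-separated words, not model tokens.
    return len(text.split())


def pack_within_budget(ranked_ids, chunks, budget_tokens):
    """Walk chunks in ranked order, skipping any that would overflow the budget."""
    selected, used = [], 0
    for cid in ranked_ids:
        cost = count_tokens(chunks[cid])
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; try the next chunk
        selected.append(cid)
        used += cost
    return selected, used


# Hypothetical chunks and rankings, for illustration only.
chunks = {
    "a": "refund policy allows returns within thirty days",
    "b": "shipping takes five business days",
    "c": "refunds are issued to the original payment method",
    "d": "our office is closed on public holidays",
}
bm25_ranking = ["a", "c", "b"]
vector_ranking = ["a", "d", "c"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
selected, used = pack_within_budget(fused, chunks, budget_tokens=20)
print(fused, selected, used)
```

With a 20-word budget, the packer keeps `a` and `c`, skips `d` because it would overflow, and still fits the shorter `b`. Skipping rather than stopping at the first overflow is a choice made for this sketch; a pipeline that re-ranks and compresses first might prefer to stop early.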
The results
| Metric | Before | After |
|---|---|---|
| Wrong-answer (hallucination) rate | 18% | 3% |
| Context tokens/request | ~14,000 | ~3,500 |
| Cost per query | $0.021 | $0.008 |
| Retrieval recall@5 | 0.71 | 0.94 |
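
To put the table in relative terms, here is a small sketch that computes the percentage drop for the three metrics that went down. The numbers are copied from the table; the helper name `relative_reduction` is just an illustration.

```python
def relative_reduction(before, after):
    """Fraction by which a metric fell, relative to its starting value."""
    return (before - after) / before


# (before, after) pairs taken from the results table.
metrics = {
    "wrong-answer rate": (0.18, 0.03),
    "context tokens/request": (14_000, 3_500),
    "cost per query (USD)": (0.021, 0.008),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_reduction(before, after):.0%} lower")
```

Context tokens fall by three quarters while cost per query falls by a little under two thirds. That gap would be consistent with part of the per-query cost not scaling with context size, though the post does not break the cost down.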
The context window is a budget you spend on relevance, not a bucket you fill with hope.
Read the full breakdown — with all the C# and Python code — on PrepStack:
https://prepstack.co.in/blog/context-engineering-enterprise-genai-part-1-context-management