
DEV Community

kirandeepjassal-crypto

Posted on Jun 17 • Originally published at prepstack.co.in

Context Engineering for Enterprise AI: Cutting RAG Hallucination from 18% to 3% (C# + Python)

#ai #rag #dotnet #python

Originally published on PrepStack.

We took an enterprise RAG assistant from an 18% wrong-answer rate to 3% — without changing the model. The lever wasn't the prompt. It was the context we assembled and fed the model.

The mental shift

The model isn't your product; the context you assemble is. Prompt engineering tweaks the wording. Context engineering controls what data enters the window. Treat the context window like a CPU cache — a scarce, governed resource — not a junk drawer.

The pipeline

Naive top-k RAG dumped 8 fuzzy chunks into a ~14,000-token prompt and hoped. We replaced it with a real pipeline, split across ASP.NET Core (orchestration) and a Python FastAPI service (retrieval + ranking):

1. Rewrite the vague user question into a self-contained query
2. Hybrid retrieval: BM25 keyword search plus vector search, not vector-only
3. Cross-encoder re-rank a wide candidate pool down to the best 6
4. Budget the window (~3,500 tokens, every token allocated)
5. Compress chunks to only the sentences that matter
6. Ground and cite every claim, or refuse and route to a human
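
To make two of these steps concrete, here is a minimal Python sketch of hybrid fusion and window budgeting. It is an illustration under stated assumptions, not the production code: reciprocal rank fusion is one common way to merge BM25 and vector rankings (the post does not say which method was used), `count_tokens` is a crude whitespace stand-in for a real tokenizer, and the names `Chunk`, `reciprocal_rank_fusion`, and `pack_to_budget` are hypothetical. The cross-encoder re-rank would sit between the two functions and is left out here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Assumption: RRF stands in for whatever fusion the real pipeline uses.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def count_tokens(text):
    # Crude stand-in: counts whitespace-separated words.
    # A real system would use the target model's tokenizer.
    return len(text.split())


def pack_to_budget(chunks, budget_tokens, max_chunks=6):
    """Walk chunks in ranked order, keeping those that fit the token budget."""
    kept, used = [], 0
    for chunk in chunks:
        if len(kept) == max_chunks:
            break
        cost = count_tokens(chunk.text)
        if used + cost > budget_tokens:
            continue  # too big for what is left; a smaller chunk may still fit
        kept.append(chunk)
        used += cost
    return kept, used


if __name__ == "__main__":
    bm25_ranking = ["d3", "d1", "d7"]    # hypothetical keyword results
    vector_ranking = ["d1", "d9", "d3"]  # hypothetical embedding results
    print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Documents that both retrievers agree on (`d1` and `d3` here) rise to the top of the fused list, which is the point of going hybrid. The packing step then enforces the budget as a hard limit instead of a hope: anything that does not fit is dropped before the prompt is built.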

The results

Metric	Before	After
Hallucination rate	18%	3%
Context tokens/request	~14,000	~3,500
Cost per query	$0.021	$0.008
Retrieval recall@5	0.71	0.94
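
For a sense of scale, the relative changes can be worked out directly from the table. This is just arithmetic on the figures above; the helper name `relative_reduction` is ours.

```python
def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


# (before, after) pairs copied from the results table.
metrics = {
    "hallucination rate": (0.18, 0.03),
    "context tokens/request": (14_000, 3_500),
    "cost per query (USD)": (0.021, 0.008),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_reduction(before, after):.0%} lower")

# hallucination rate: 83% lower
# context tokens/request: 75% lower
# cost per query (USD): 62% lower
```

Recall@5 moved the other way, from 0.71 to 0.94, a gain of 0.23 in absolute terms.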

The context window is a budget you spend on relevance, not a bucket you fill with hope.

Read the full breakdown — with all the C# and Python code — on PrepStack:
https://prepstack.co.in/blog/context-engineering-enterprise-genai-part-1-context-management
