DEV Community

kirandeepjassal-crypto

Originally published at prepstack.co.in

Context Engineering for Enterprise AI: Cutting RAG Hallucination from 18% to 3% (C# + Python)

We took an enterprise RAG assistant from an 18% hallucination (wrong-answer) rate to 3% without changing the model. The lever wasn't the prompt. It was the context we assembled and fed the model.

The mental shift

The model isn't your product; the context you assemble is. Prompt engineering tweaks the wording. Context engineering controls what data enters the window. Treat the context window like a CPU cache — a scarce, governed resource — not a junk drawer.

The pipeline

Naive top-k RAG dumped 8 fuzzy chunks into a ~14,000-token prompt and hoped. We replaced it with a real pipeline, split across ASP.NET Core (orchestration) and a Python FastAPI service (retrieval + ranking):

  1. Rewrite the vague user question into a self-contained query
  2. Hybrid retrieval — BM25 keyword + vector, not vector-only
  3. Cross-encoder re-rank a wide candidate pool down to the best 6
  4. Budget the window (~3,500 tokens, every token allocated)
  5. Compress chunks to only the sentences that matter
  6. Ground + cite every claim — or refuse and route to a human
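To make a few of these steps concrete, here is a minimal Python sketch of steps 2, 4, 5 and the refusal path in step 6. It is an illustration under stated assumptions, not the production code: reciprocal rank fusion is one common way to merge BM25 and vector rankings and may not be the method used here, `count_tokens` is a whitespace stand-in for a real tokenizer, the word-overlap compressor is deliberately crude (it does not filter stop words), and every function name is hypothetical.

```python
import re
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Step 2 (assumed technique): merge ranked lists of chunk ids
    with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def count_tokens(text):
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def compress(chunk, query):
    """Step 5: keep only sentences that share at least one word with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s for s in sentences
        if query_terms & set(re.findall(r"\w+", s.lower()))
    ]
    return " ".join(kept)


def build_context(query, ranked_ids, chunks, budget=3500):
    """Steps 4 and 6: pack compressed chunks in rank order within the budget,
    tagging each with its id so the answer can cite it."""
    used, parts = 0, []
    for chunk_id in ranked_ids:
        text = compress(chunks[chunk_id], query)
        cost = count_tokens(text)
        if not text or used + cost > budget:
            continue  # skip empty or over-budget chunks, try the next one
        parts.append(f"[{chunk_id}] {text}")
        used += cost
    if not parts:
        return None  # no grounded evidence: caller should refuse or escalate
    return "\n".join(parts)
```

A caller would fuse the two rankings, then call `build_context`; a `None` result is the signal to refuse and route to a human rather than let the model answer from nothing. The cross-encoder re-rank (step 3) would sit between the two calls and is omitted because it needs a model dependency.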

The results

Metric                   Before    After
Hallucination rate       18%       3%
Context tokens/request   ~14,000   ~3,500
Cost per query           $0.021    $0.008
Retrieval recall@5       0.71      0.94
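For a sense of scale, the short Python sketch below turns the before/after figures into relative changes and shows one common definition of recall@k. The post does not say exactly how recall@5 was computed, so `recall_at_k` is an assumption, and both function names are hypothetical.

```python
def relative_change(before, after):
    """Signed fractional change from before to after."""
    return (after - before) / before


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant ids that appear in the top-k retrieved ids
    (one common definition; assumed, not taken from the post)."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(set(relevant))


# Before/after figures copied from the results table.
results = {
    "hallucination rate": (0.18, 0.03),
    "context tokens": (14000, 3500),
    "cost per query": (0.021, 0.008),
    "recall@5": (0.71, 0.94),
}
for name, (before, after) in results.items():
    print(f"{name}: {relative_change(before, after):+.0%}")
# hallucination rate: -83%
# context tokens: -75%
# cost per query: -62%
# recall@5: +32%
```

Context tokens fall by 75% while cost per query falls by about 62%, which would be consistent with part of the per-query cost not scaling with context size, though the figures here do not break that down.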

The context window is a budget you spend on relevance, not a bucket you fill with hope.

Read the full breakdown — with all the C# and Python code — on PrepStack:
https://prepstack.co.in/blog/context-engineering-enterprise-genai-part-1-context-management
