This article covers the third layer of the full-stack architecture: the Hybrid Retrieval Layer. Core engineering challenge: general-purpose embedding models drift on domain-specific terminology, and single-path vector retrieval cannot distinguish fine-grained semantic differences.
📦 Source code: production-rag-engineering (esg/services/embedding_service.py, esg/services/search_service.py)
0. The Pain Point
Part 1 built the knowledge base. Part 2 handled chunking. The first version of the system used text-embedding-ada-002 for retrieval — OpenAI's most mainstream embedding model at the time.
The results:
- Recall rate: 82% — 18% of relevant content simply wasn't found
- False positive rate: 12% — querying "Scope 1 emission intensity" returned "Scope 3 emissions"
- "Low-carbon" and "zero-carbon" were close together in vector space — the system couldn't tell them apart
The first instinct was to tune the similarity threshold: drop from 0.85 to 0.75? To 0.65?
After a full round of testing, recall went up — but false positives went up in lockstep. Lower threshold = cast a wider net = pull in more irrelevant content.
This wasn't a threshold problem. It was a model problem.
More precisely: it was a semantic drift problem caused by a general-purpose model operating on specialized domain text. ada-002's training corpus is predominantly general text. ESG domain terminology is poorly encoded in its vector space — related terms end up far apart, unrelated terms end up close together.
This problem isn't unique to ESG. Legal statutes, medical diagnostics, financial compliance — any domain with dense specialized terminology will hit the same semantic drift when using a general-purpose embedding model.
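The threshold experiment described above can be reproduced with a small sweep. The sketch below is illustrative only: the scored pairs are made-up toy data, `sweep_thresholds` is a hypothetical helper, and the false positive rate is assumed to mean the share of returned results that are irrelevant (the article does not define it formally).

```python
from typing import Iterable


def sweep_thresholds(
    scored: list[tuple[float, bool]], thresholds: Iterable[float]
) -> list[dict]:
    """scored holds (similarity, is_relevant) pairs from manually annotated results."""
    total_relevant = sum(1 for _, relevant in scored if relevant)
    rows = []
    for threshold in thresholds:
        # Everything at or above the threshold counts as "returned"
        returned = [(s, relevant) for s, relevant in scored if s >= threshold]
        hits = sum(1 for _, relevant in returned if relevant)
        false_positives = len(returned) - hits
        rows.append({
            "threshold": threshold,
            "recall": hits / total_relevant if total_relevant else 0.0,
            "false_positive_rate": false_positives / len(returned) if returned else 0.0,
        })
    return rows


# Toy data, invented for illustration only
toy_scores = [
    (0.90, True), (0.86, True), (0.80, False), (0.78, True),
    (0.70, False), (0.66, True), (0.60, False),
]
rows = sweep_thresholds(toy_scores, [0.85, 0.75, 0.65])
for row in rows:
    print(row)
```

On this toy data, recall and false positive rate rise together as the threshold drops, which is the pattern that pointed away from threshold tuning.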
1. What Retrieval Needs to Solve
Vector retrieval in domain-specific scenarios has three core tensions:
Tension 1: General-purpose models drift on specialized terminology
"Carbon footprint" and "carbon accounting" have similar meanings in general text, but in ESG compliance they refer to different things — the former is product lifecycle emissions, the latter is a data measurement methodology. They are not interchangeable. General-purpose models can't distinguish this fine-grained difference.
Tension 2: High similarity score ≠ semantic relevance
Vector similarity measures "distance in vector space," not "business semantic relevance." "Energy consumption" and "spill incidents" may be close in a general vector space (both are environment-related), but they map to completely different compliance clauses.
Tension 3: Single-path vector retrieval can't distinguish fine-grained variants of the same concept
GRI has three emission scopes: Scope 1, Scope 2, and Scope 3. In vector space, all three are close together. Single-path retrieval easily returns Scope 3 content when querying for Scope 1.
The solution isn't a single fix — it's three progressive layers: model selection → semantic drift mitigation → dual validation.
2. Embedding Model Selection: Not "Pick the Most Expensive"
Test methodology:
We sampled 200 ESG domain terms as queries — covering Environmental, Social, and Governance categories, including long-form terms like "Scope 1 emission intensity calculation" and short terms like "carbon intensity." We ran each query against the GRI knowledge base, manually annotated ground truth, and compared Top-3 recall accuracy across four models.
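A minimal sketch of how such a comparison could be scored, assuming "Top-3 recall accuracy" means the share of queries whose top 3 results contain at least one manually annotated chunk. The function name, the chunk IDs, and the two sample queries are hypothetical.

```python
def top_k_hit_rate(
    retrieved: dict[str, list[str]],
    ground_truth: dict[str, set[str]],
    k: int = 3,
) -> float:
    """Share of queries whose top-k retrieved chunk IDs include an annotated chunk."""
    if not ground_truth:
        return 0.0
    hits = 0
    for query, relevant_ids in ground_truth.items():
        top_k = retrieved.get(query, [])[:k]
        if any(chunk_id in relevant_ids for chunk_id in top_k):
            hits += 1
    return hits / len(ground_truth)


# Hypothetical annotations and retrieval output for two queries
ground_truth = {
    "carbon intensity": {"c1"},
    "Scope 1 emission intensity calculation": {"c7", "c8"},
}
retrieved = {
    "carbon intensity": ["c4", "c1", "c9"],
    # The relevant chunk c7 is ranked fourth, so it misses the top 3
    "Scope 1 emission intensity calculation": ["c2", "c3", "c5", "c7"],
}
rate = top_k_hit_rate(retrieved, ground_truth, k=3)
print(rate)
```

Running the same function over each model's retrieval output gives one comparable number per model.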
Four-model comparison:
| Model | Recall Rate | Cost per item | Deployment | Elimination reason |
|---|---|---|---|---|
| text-embedding-3-large | 91% | $0.0001 | API | ✅ Final selection |
| text-embedding-ada-002 | 85% | $0.00006 | API | Unstable long-text encoding; Scope term confusion |
| BGE-M3 | 82% | $0 (local) | Self-hosted | Limited ESG training data; poor fine-grained term distinction |
| Tongyi Qianwen Embedding | 83% | Low | API | Acceptable Chinese ESG terms; poor cross-language consistency |
Why not BGE-M3 (self-hosted)?
The intuition is that self-hosting is cheaper — but when you run the full cost calculation:
| Dimension | text-embedding-3-large | BGE-M3 self-hosted |
|---|---|---|
| Monthly API / server cost | ~$8/mo (100K items, batch discount) | ~$50/mo (GPU instance) |
| Development adaptation cost | 0 (out of the box) | 2 weeks (domain adaptation + fine-tuning) |
| Recall rate | 91% | 82% |
| Long-text encoding stability | Stable | Noticeable drift on long terms |
Self-hosting costs roughly 6x more per month, requires 2 weeks of adaptation work, and delivers recall that is 9 percentage points lower.
This isn't "expensive = better." It's model selection based on a clear ROI calculation.
How is data security handled?
Text is desensitized before upload — regex identifies and replaces sensitive information (company names, revenue figures, client data). Only ESG terminology and report fragments are uploaded, with no corporate identity information. We also signed OpenAI's Data Processing Agreement, satisfying compliance requirements.
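The regex-based desensitization step might look roughly like the sketch below. The company names, the money pattern, and the placeholder tokens are all assumptions made for illustration; the actual rules used in the project are not shown in this article.

```python
import re

# Hypothetical examples; a real deployment needs a reviewed and much longer list
COMPANY_NAMES = ["Acme Holdings", "Globex Corp"]

# Currency symbol or code, a number, and an optional magnitude word
MONEY_PATTERN = re.compile(
    r"(?:USD|US\$|\$|€|¥)\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion|k|M|B)\b)?",
    re.IGNORECASE,
)


def desensitize(text: str) -> str:
    """Replace known company names and monetary amounts with neutral placeholders."""
    for name in COMPANY_NAMES:
        text = re.sub(re.escape(name), "[COMPANY]", text, flags=re.IGNORECASE)
    return MONEY_PATTERN.sub("[AMOUNT]", text)


sample = "Acme Holdings reported revenue of $12.5 million and Scope 1 emissions of 4,200 tCO2e."
print(desensitize(sample))
```

Note that the emissions figure is left untouched because it carries no currency prefix, which keeps the ESG content usable for embedding.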
3. Semantic Drift Mitigation: Disambiguate Before Retrieval
Switching to a better model improved recall from 82% to 91% — but false positive rate remained at 12%.
Root cause analysis: Even with 3-large, fine-grained ESG term distinction is still insufficient. "Low-carbon" and "zero-carbon" have similarity 0.85. "Scope 1 emission intensity" and "Scope 3 emissions" have similarity 0.78. The model treats them as semantically close — but in business terms they are completely different.
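For reference, the similarity figures quoted here are assumed to be cosine similarities (the Milvus configuration later in the article uses the COSINE metric). A minimal pure-Python version, with toy two-dimensional vectors rather than real embeddings:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch; undefined mathematically
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

The score only reflects direction in the embedding space, which is why two terms with different compliance meanings can still score 0.85.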
The solution is a three-layer augmentation strategy that layers domain knowledge on top of the model:
Layer 1: Domain term dictionary (500+ entries)
The dictionary maps professional terms, abbreviations, and synonyms:
ESG_TERM_DICT = {
    "Scope 1": {
        "definition": "Direct GHG emissions from sources owned or controlled by the organization",
        "synonyms": ["direct emissions", "direct carbon emissions", "Scope 1 emissions"],
        "domain": "Environmental",
        "distinct_from": ["Scope 2", "Scope 3"],  # explicit disambiguation
    },
    "low-carbon": {
        "definition": "Reduced carbon emissions, but emissions still exist",
        "distinct_from": ["zero-carbon", "net-zero emissions"],  # key: explicitly not zero-carbon
        "domain": "Environmental",
    },
    # 500+ entries...
}
Dictionary data sourced from three layers:
- GRI official standard documents → 200+ core terms extracted
- 10 industry ESG reports → 300+ commonly used terms extracted
- ESG domain experts → synonyms and fine-grained disambiguation relationships annotated
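One way a query could be mapped to its canonical dictionary entry is an alias index built from the term names and their synonyms. This is a sketch under assumptions: `build_alias_index` and `detect_term` are hypothetical helpers, and the two-entry dictionary is a trimmed copy of the example above.

```python
from typing import Optional

# Trimmed copy of the dictionary structure shown in the article
TERM_DICT = {
    "Scope 1": {
        "synonyms": ["direct emissions", "direct carbon emissions", "Scope 1 emissions"],
        "distinct_from": ["Scope 2", "Scope 3"],
    },
    "low-carbon": {
        "distinct_from": ["zero-carbon", "net-zero emissions"],
    },
}


def build_alias_index(term_dict: dict) -> dict[str, str]:
    """Map every lowercase term name and synonym to its canonical term."""
    index = {}
    for term, info in term_dict.items():
        index[term.lower()] = term
        for synonym in info.get("synonyms", []):
            index[synonym.lower()] = term
    return index


def detect_term(query: str, alias_index: dict[str, str]) -> Optional[str]:
    """Return the canonical term for the longest alias found in the query, if any."""
    lowered = query.lower()
    for alias in sorted(alias_index, key=len, reverse=True):
        if alias in lowered:
            return alias_index[alias]
    return None


alias_index = build_alias_index(TERM_DICT)
print(detect_term("direct emissions from company vehicles", alias_index))
print(detect_term("water usage", alias_index))
```

Checking longer aliases first avoids a short alias shadowing a more specific one.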
Layer 2: Domain hints embedded in prompt
At encoding time, dictionary information is embedded in the prompt to give the model precise semantic context:
from typing import Optional

def build_embedding_prompt(text: str, term: Optional[str] = None) -> str:
    base_prompt = f"Encode text: {text}"
    if term and term in ESG_TERM_DICT:
        term_info = ESG_TERM_DICT[term]
        # Append dictionary knowledge so the encoded text carries domain context
        domain_hint = (
            "\nDomain context:\n"
            f"- {term} is an ESG {term_info['domain']} domain term\n"
            f"- Definition: {term_info['definition']}\n"
            f"- Synonyms: {', '.join(term_info.get('synonyms', []))}\n"
            f"- Distinct from: {', '.join(term_info.get('distinct_from', []))}\n"
        )
        return base_prompt + domain_hint
    return base_prompt
Layer 3: Post-retrieval reranking
After retrieving Top 5 candidates, the term dictionary is used to rerank results — chunks containing standard synonyms get a score boost; chunks containing terms in the "distinct_from" relationship get downweighted:
def rerank_results(query_term: str, results: list) -> list:
    term_info = ESG_TERM_DICT.get(query_term, {})
    for result in results:
        # Start from the vector similarity if no rerank score has been set yet
        # (assumes each result carries a "similarity_score", as in dual_verify)
        result.setdefault("rerank_score", result.get("similarity_score", 0.0))
        # Contains standard synonym → boost score
        if any(syn in result["text"] for syn in term_info.get("synonyms", [])):
            result["rerank_score"] += 0.1
        # Contains "distinct_from" term → penalize score
        if any(dt in result["text"] for dt in term_info.get("distinct_from", [])):
            result["rerank_score"] -= 0.15
    return sorted(results, key=lambda x: x["rerank_score"], reverse=True)
Two real incident cases:
Case 1: Low-carbon vs. zero-carbon
- Problem: querying "low-carbon" returned zero-carbon content with similarity 0.85
- Root cause: model treats both as "reducing carbon emissions" — semantically close
- Fix: dictionary explicitly marks the distinct_from relationship; prompt emphasizes "low-carbon ≠ zero-carbon"
- Result: similarity dropped from 0.85 to 0.65; retrieval now distinguishes them precisely
Case 2: Scope 1 emission intensity vs. Scope 3 emissions
- Problem: querying "Scope 1 emission intensity" returned Scope 3 content with similarity 0.78
- Root cause: model treats Scope 1 and Scope 3 as both "emissions-related" — close in vector space
- Fix: dictionary gives each Scope its own precise definition and mutual distinct_from relationships
- Result: similarity dropped from 0.78 to 0.55; Scope confusion false positive rate < 1%
Three-layer augmentation results: false positive rate 12% → 3%, term matching accuracy 82% → 90%.
4. Dual Validation: A High Score on One Path Isn't Enough
After semantic drift mitigation, one problem remained: high vector similarity, but business semantics are unrelated.
Typical case: querying for GRI 306 waste management clauses returned a report chunk about "spill incident handling" with similarity 0.82. In vector space, the two are genuinely close (both are environmental incident-related) — but "waste management" and "spill incidents" are completely different compliance clauses.
The fundamental limitation of single-path vector retrieval: vector similarity is a statistical measure of "text distance in vector space" — not a business measure of "semantic relevance."
The solution is dual validation: keyword hard match + vector similarity — both must pass to count as a hit.
def dual_verify(query: dict, candidate_chunk: dict) -> bool:
    # Condition 1: vector similarity threshold met
    vector_match = candidate_chunk["similarity_score"] >= 0.7
    # Condition 2: keyword hard match (core keywords from the queried clause must appear)
    required_keywords = query.get("required_keywords", [])
    matched = sum(1 for kw in required_keywords if kw in candidate_chunk["text"])
    # At least half the keywords (rounded up), and never fewer than one, must match,
    # so a query without required keywords cannot pass
    keyword_match = matched >= max(1, (len(required_keywords) + 1) // 2)
    return vector_match and keyword_match
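One caveat with a plain substring check is that "Scope 1" also matches inside "Scope 10", and matching is case-sensitive. The sketch below shows a stricter matcher that could be swapped in. `contains_keyword` is a hypothetical helper, not part of the project's code, and its word-boundary logic assumes space-delimited text, so it would need adapting for Chinese reports.

```python
import re


def contains_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive match that refuses to match inside a longer word or number."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print("Scope 1" in "Scope 10 emissions were restated")                # plain substring check
print(contains_keyword("Scope 10 emissions were restated", "Scope 1"))  # boundary-aware check
print(contains_keyword("scope 1 emissions rose 4%", "Scope 1"))
```

The lookbehind and lookahead are used instead of `\b` so that keywords ending in punctuation or digits behave predictably.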
Three-layer false positive filter (complete flow):
Layer 1: Keyword hard match (millisecond-level)
    When querying for GRI 305 (greenhouse gas emissions),
    retrieved chunks must contain at least 2 of:
    ["Scope 1", "Scope 2", "emissions volume", "calculation method"]
    → Filters out chunks like "spill incidents" that score high but fail keyword match
    → Eliminates ~60% of obvious false positives
Layer 2: LLM semantic cross-validation (< 1s)
    For chunks passing Layer 1, ask the LLM:
    "Does this content actually answer the disclosure points required by the clause?"
    → Filters out chunks that "mention emissions but lack calculation method and data source"
    → Eliminates ~30% of remaining semantically irrelevant chunks
Layer 3: Manual spot-check calibration (monthly)
    Monthly spot-check of 100 retrieval results, manually judged for false positives
    If false positive rate > 5%, trigger keyword library update or threshold adjustment
    → Continuous calibration to prevent system degradation as business evolves
Dual validation results: accuracy 70% → 94%, false positive rate 15% → 3%.
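The three-layer flow can be sketched as a small pipeline. This is an illustration under assumptions: `filter_candidates` and `needs_recalibration` are hypothetical names, the LLM check is injected as a plain callable so the example runs without any API, and the stub used here only looks for a phrase rather than calling a model.

```python
from typing import Callable


def filter_candidates(
    chunks: list[dict],
    required_keywords: list[str],
    min_keyword_hits: int,
    llm_check: Callable[[str], bool],
) -> list[dict]:
    # Layer 1: cheap keyword hard match runs first
    stage1 = [
        c for c in chunks
        if sum(kw in c["text"] for kw in required_keywords) >= min_keyword_hits
    ]
    # Layer 2: the slower semantic check only sees Layer 1 survivors
    return [c for c in stage1 if llm_check(c["text"])]


def needs_recalibration(spot_check: list[bool], max_fp_rate: float = 0.05) -> bool:
    """Layer 3: spot_check marks each manually reviewed result as a false positive or not."""
    if not spot_check:
        return False
    return sum(spot_check) / len(spot_check) > max_fp_rate


chunks = [
    {"chunk_id": "a", "text": "Scope 1 and Scope 2 emissions volume, with calculation method per GHG Protocol"},
    {"chunk_id": "b", "text": "Spill incidents were handled within 24 hours"},
    {"chunk_id": "c", "text": "Scope 1 and Scope 2 totals are mentioned without further detail"},
]
keywords = ["Scope 1", "Scope 2", "emissions volume", "calculation method"]


def stub_llm(text: str) -> bool:
    # Stand-in for the real LLM call
    return "calculation method" in text


kept = filter_candidates(chunks, keywords, min_keyword_hits=2, llm_check=stub_llm)
print([c["chunk_id"] for c in kept])
print(needs_recalibration([True] * 6 + [False] * 94))
```

Ordering the layers from cheapest to most expensive keeps the LLM call volume proportional to what survives the keyword filter.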
5. Vector Store Selection and Parameter Tuning
Why Milvus?
Three options compared:
| Option | Performance | Multi-condition filtering | Ecosystem | Elimination reason |
|---|---|---|---|---|
| Milvus | Million-scale vectors at 50ms | ✅ Single query handles it | Mature Python SDK | ✅ Final selection |
| Pinecone | Comparable performance | ⚠️ Weak filtering capability | Good | Multi-condition filtering requires multiple queries — high cost |
| FAISS | Strong performance | ❌ Not supported | Average | Pure vector library, no metadata filtering support |
Milvus's core advantage: multi-condition filtering in a single query:
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 20},
}
# `collection` is assumed to be an already-loaded pymilvus Collection.
# A single query combines the vector search with scalar filters
# on character count and embedding model version.
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param=search_params,
    limit=3,  # top_k=3
    expr="char_count >= 20 and embedding_model == 'text-embedding-3-large'",
    output_fields=["chunk_id", "page_range", "similarity_score"],
)
The three retrieval parameters:
| Parameter | Value | Design rationale |
|---|---|---|
| top_k | 3 | Retrieve 3 candidates for LLM judgment — more introduces noise, fewer risks missing content |
| Similarity threshold | 0.7 | Calibrated against 500 reports — 0.7 is the balance point between recall and false positives |
| nprobe | 20 | IVF_FLAT search scope — at nlist=128, nprobe=20 balances accuracy and speed |
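The three parameters can be kept together as plain configuration, with the similarity threshold applied after the search returns. This is a sketch, not the project's code: `apply_threshold` and `probed_fraction` are hypothetical helpers, the hit tuples are toy data, and the index dictionary follows the shape pymilvus expects for an IVF_FLAT index.

```python
# Index and search settings as plain data (values taken from the table above)
INDEX_PARAMS = {"index_type": "IVF_FLAT", "metric_type": "COSINE", "params": {"nlist": 128}}
SEARCH_PARAMS = {"metric_type": "COSINE", "params": {"nprobe": 20}}
TOP_K = 3
SIMILARITY_THRESHOLD = 0.7


def apply_threshold(
    hits: list[tuple[str, float]],
    threshold: float = SIMILARITY_THRESHOLD,
    top_k: int = TOP_K,
) -> list[tuple[str, float]]:
    """Keep at most top_k hits, then drop any whose similarity is below the threshold."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [hit for hit in ranked[:top_k] if hit[1] >= threshold]


def probed_fraction(nprobe: int, nlist: int) -> float:
    """Share of IVF clusters scanned per query."""
    return nprobe / nlist


toy_hits = [("a", 0.82), ("b", 0.69), ("c", 0.75), ("d", 0.71)]
print(apply_threshold(toy_hits))
print(probed_fraction(20, 128))
```

With nlist=128 and nprobe=20, each query scans about 16% of the clusters, which is the accuracy and speed trade-off the table refers to.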
Real incident: concurrency above 10 caused latency to spike from 50ms to 200ms
Early after launch, when concurrent queries exceeded 10, latency jumped from 50ms to 200ms with occasional timeouts.
Diagnosis:
- Checked Milvus server resources — CPU and memory were not saturated. Not a resource bottleneck.
- Checked index parameters — nprobe=10 gave too narrow a search scope; queue backlog built up under concurrency.
- Checked caching — high-frequency queries (e.g., "GRI 305-1 carbon emissions") were re-executing full searches every time.
Two-step fix:
import json

import redis

# Fix 1: increase nprobe for better stability under concurrency
search_params = {"params": {"nprobe": 20}}  # increased from 10 to 20

# Fix 2: cache high-frequency query results (Redis, TTL=1 hour)
cache = redis.Redis()

def cached_search(query_vector: list, query_key: str) -> list:
    cached = cache.get(query_key)
    if cached:
        return json.loads(cached)
    # milvus_search is defined elsewhere in the service; results must be JSON-serializable
    results = milvus_search(query_vector)
    cache.setex(query_key, 3600, json.dumps(results))  # cache for 1 hour
    return results
Result: latency dropped from 200ms to 80ms, cache hit rate 70%, stable support for 10+ concurrent queries.
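The cache only helps if equivalent queries map to the same key. One possible way to derive `query_key` is to hash a normalized form of the query text together with the parameters that affect the result. The helper name, the key prefix, and the choice of fields are assumptions for illustration.

```python
import hashlib


def make_query_key(
    query_text: str,
    model: str = "text-embedding-3-large",
    top_k: int = 3,
) -> str:
    """Build a stable cache key from the normalized query and result-shaping parameters."""
    # Lowercase and collapse whitespace so trivially different inputs share a key
    normalized = " ".join(query_text.lower().split())
    payload = f"{model}|{top_k}|{normalized}".encode("utf-8")
    return "search:" + hashlib.sha256(payload).hexdigest()[:32]


key_a = make_query_key("GRI 305-1  carbon emissions")
key_b = make_query_key(" gri 305-1 carbon emissions ")
print(key_a == key_b)
```

Including the model name and top_k in the key prevents stale results from being served after either one changes.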
6. Cost Control
Once model selection was finalized, cost control relied on two mechanisms:
Mechanism 1: Batch processing for volume discount
The OpenAI Embeddings API accepts multiple inputs in a single request. Submitting 100 items per batch reduced per-item cost by 20%:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def batch_embed(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-large",
            input=batch,  # batch submission
        )
        # Embeddings come back in the same order as the inputs
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings
Mechanism 2: Cache embeddings for high-frequency terms
The GRI clause library is relatively static — vectors for 300+ clauses don't need to be regenerated on every request. Pre-compute and cache them at startup, saving 30% of API calls:
# Preload GRI clause vectors at startup
def preload_gri_embeddings():
    clauses = get_all_gri_clauses()  # ~300 clauses
    embeddings = batch_embed([c["text"] for c in clauses])
    for clause, embedding in zip(clauses, embeddings):
        cache.set(
            f"gri_embedding:{clause['disclosure_id']}",
            json.dumps(embedding),
            ex=86400,  # 24-hour cache
        )
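On the read side, a lookup can fall back to the embedding call when a cached vector has expired. The sketch below is self-contained and hedged: `InMemoryCache` is a stand-in that mimics only the `get` and `set` calls used here (a Redis client would be used in practice), and `get_gri_embedding` and `fake_embed` are hypothetical names.

```python
import json
from typing import Callable


class InMemoryCache:
    """Minimal stand-in for a Redis client, exposing only get/set for this example."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str):
        return self._data.get(key)

    def set(self, key: str, value: str, ex: int = 0) -> None:
        self._data[key] = value  # expiry is ignored in this stand-in


def get_gri_embedding(
    disclosure_id: str,
    clause_text: str,
    cache,
    embed_fn: Callable[[str], list[float]],
) -> list[float]:
    key = f"gri_embedding:{disclosure_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: compute once, then store for 24 hours
    embedding = embed_fn(clause_text)
    cache.set(key, json.dumps(embedding), ex=86400)
    return embedding


calls = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)  # count how often the "API" is hit
    return [0.1, 0.2]


demo_cache = InMemoryCache()
first = get_gri_embedding("305-1", "Direct (Scope 1) GHG emissions", demo_cache, fake_embed)
second = get_gri_embedding("305-1", "Direct (Scope 1) GHG emissions", demo_cache, fake_embed)
print(len(calls))
```

The second lookup is served from the cache, so the embedding function is invoked only once.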
Final cost comparison:
| Option | Monthly cost | Recall rate | Miss rate |
|---|---|---|---|
| ada-002 (original) | ~$6/mo | 85% | 12% |
| 3-large (unoptimized) | ~$10/mo | 91% | 5% |
| 3-large (batch + cache optimized) | ~$8/mo | 91% | 5% |
| BGE-M3 self-hosted | ~$50/mo | 82% | 15% |
3-large optimized costs only $2/month more than ada-002, with recall 6 percentage points higher and a miss rate 7 percentage points lower.
7. Wrapping Up: The Retrieval Decision Tree
When facing a new retrieval scenario, two questions determine the approach:
Q1: Does the data contain domain-specific terminology?
├─ Yes (legal / medical / financial / ESG or other specialized domains)
│ → General-purpose models will drift
│ → Required: domain term dictionary + prompt domain hints + post-retrieval reranking
│ → Go to Q2
└─ No (general text)
→ General-purpose embedding model + single-path vector retrieval is sufficient
Q2: Does the query require fine-grained semantic distinction?
├─ Yes (e.g., Scope 1 vs. Scope 3, low-carbon vs. zero-carbon)
│ → Single-path vector retrieval is not enough
│ → Required: dual validation (keyword hard match + vector similarity)
│ → Add three-layer false positive filter (keywords → LLM cross-validation → manual spot-check)
└─ No (coarse-grained semantic distinction is sufficient)
→ Single-path vector retrieval + similarity threshold is sufficient
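The decision tree above can also be written as a small function, which makes the branching explicit. The function name and the returned labels are illustrative and simply restate the tree.

```python
def choose_retrieval_strategy(has_domain_terms: bool, needs_fine_grained: bool) -> list[str]:
    """Return the components the decision tree calls for."""
    # Q1: general text needs nothing beyond the baseline
    if not has_domain_terms:
        return ["general-purpose embedding model", "single-path vector retrieval"]
    # Q1 = yes: drift mitigation is required in every case
    steps = ["domain term dictionary", "prompt domain hints", "post-retrieval reranking"]
    # Q2: fine-grained distinctions add dual validation and the three-layer filter
    if needs_fine_grained:
        steps += [
            "dual validation (keyword hard match + vector similarity)",
            "three-layer false positive filter",
        ]
    else:
        steps.append("single-path vector retrieval + similarity threshold")
    return steps


for step in choose_retrieval_strategy(has_domain_terms=True, needs_fine_grained=True):
    print(step)
```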
Transferability of this retrieval approach:
- Domain term dictionary → swap in legal / medical / financial terminology; the logic is identical
- Prompt domain hints → applicable to any specialized domain; just replace the dictionary content
- Dual validation → applicable to any scenario requiring high-precision recall; swap in the keyword library for your business domain
Source Code
All implementations referenced in this article are available here:
👉 github.com/muzinan123/production-rag-engineering
Relevant files for this part:
- esg/services/embedding_service.py: multi-provider embedding + batch write + 4-layer metadata
- esg/services/search_service.py: Milvus vector retrieval, top_k + threshold dual-parameter filtering
Next up: Retrieval is solid. Relevant content is being surfaced. But a high semantic similarity score does not equal a correct business conclusion. Similarity 0.88 — but the company only disclosed total emissions volume, with no calculation method and no data source. Does that satisfy GRI 305-1? Between "retrieved content" and "a quantifiable, auditable conclusion," there are three gaps. → Part 4 — Judgment Engine