James Lee

Posted on Jun 18

Part 3 — Vector Retrieval in Domain-Specific Terminology Scenarios: From Model Selection to Dual Validation

#ai #architecture #rag #nlp

This article covers the third layer of the full-stack architecture: the Hybrid Retrieval Layer. Core engineering challenge: general-purpose embedding models drift on domain-specific terminology, and single-path vector retrieval cannot distinguish fine-grained semantic differences.

📦 Source code: production-rag-engineering — esg/services/embedding_service.py, esg/services/search_service.py

0. The Pain Point

Part 1 built the knowledge base. Part 2 handled chunking. The first version of the system used text-embedding-ada-002 for retrieval — OpenAI's most mainstream embedding model at the time.

The results:

Recall rate: 82% — 18% of relevant content simply wasn't found
False positive rate: 12% — querying "Scope 1 emission intensity" returned "Scope 3 emissions"
"Low-carbon" and "zero-carbon" were close together in vector space — the system couldn't tell them apart

The first instinct was to tune the similarity threshold: drop from 0.85 to 0.75? To 0.65?

After a full round of testing, recall went up — but false positives went up in lockstep. Lower threshold = cast a wider net = pull in more irrelevant content.
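The trade-off is easy to reproduce on a small labeled set. The sketch below is illustrative only: the scores and labels are made-up toy data, and "false positive rate" is assumed to mean the share of returned chunks that are irrelevant.

```python
def sweep_thresholds(scored, thresholds):
    """Return (threshold, recall, false_positive_rate) for each threshold.

    `scored` holds (similarity, is_relevant) pairs from a labeled test set.
    """
    total_relevant = sum(1 for _, relevant in scored if relevant)
    rows = []
    for threshold in thresholds:
        returned = [(s, rel) for s, rel in scored if s >= threshold]
        hits = sum(1 for _, rel in returned if rel)
        false_positives = len(returned) - hits
        recall = hits / total_relevant if total_relevant else 0.0
        fp_rate = false_positives / len(returned) if returned else 0.0
        rows.append((threshold, recall, fp_rate))
    return rows


# Toy data, not measurements from the project
scored = [
    (0.90, True), (0.86, True), (0.80, False), (0.78, True),
    (0.70, False), (0.66, True), (0.60, False),
]
rows = sweep_thresholds(scored, [0.85, 0.75, 0.65])
for threshold, recall, fp_rate in rows:
    print(f"threshold={threshold:.2f} recall={recall:.2f} fp_rate={fp_rate:.2f}")
```

On this toy set, each step down in threshold raises recall and the false positive rate together, which is the pattern described above.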

This wasn't a threshold problem. It was a model problem.

More precisely: it was a semantic drift problem caused by a general-purpose model operating on specialized domain text. ada-002's training corpus is predominantly general text. ESG domain terminology is poorly encoded in its vector space — related terms end up far apart, unrelated terms end up close together.

This problem isn't unique to ESG. Legal statutes, medical diagnostics, financial compliance — any domain with dense specialized terminology will hit the same semantic drift when using a general-purpose embedding model.

1. What Retrieval Needs to Solve

Vector retrieval in domain-specific scenarios has three core tensions:

Tension 1: General-purpose models drift on specialized terminology

"Carbon footprint" and "carbon accounting" have similar meanings in general text, but in ESG compliance they refer to different things — the former is product lifecycle emissions, the latter is a data measurement methodology. They are not interchangeable. General-purpose models can't distinguish this fine-grained difference.

Tension 2: High similarity score ≠ semantic relevance

Vector similarity measures "distance in vector space," not "business semantic relevance." "Energy consumption" and "spill incidents" may be close in a general vector space (both are environment-related), but they map to completely different compliance clauses.
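For reference, a minimal sketch of what the similarity score actually computes, assuming cosine similarity (the metric used later in the Milvus configuration). The three-dimensional vectors are invented for illustration; real embeddings have thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: geometry only, no business meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for two environment-related phrases
energy_consumption = [0.9, 0.3, 0.1]
spill_incidents = [0.8, 0.4, 0.2]
score = cosine_similarity(energy_consumption, spill_incidents)
print(f"similarity = {score:.3f}")
```

Nothing in the formula knows which compliance clause a phrase belongs to, which is why a high score alone cannot be treated as relevance.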

Tension 3: Single-path vector retrieval can't distinguish fine-grained variants of the same concept

GRI has three emission scopes: Scope 1, Scope 2, and Scope 3. In vector space, all three are close together. Single-path retrieval easily returns Scope 3 content when querying for Scope 1.

The solution isn't a single fix — it's three progressive layers: model selection → semantic drift mitigation → dual validation.

2. Embedding Model Selection: Not "Pick the Most Expensive"

Test methodology:

We sampled 200 ESG domain terms as queries — covering Environmental, Social, and Governance categories, including long-form terms like "Scope 1 emission intensity calculation" and short terms like "carbon intensity." We ran each query against the GRI knowledge base, manually annotated ground truth, and compared Top-3 recall accuracy across four models.
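A minimal sketch of how such a comparison can be scored. It assumes a query counts as a hit when at least one annotated chunk appears in the top k results; the article does not spell out the exact metric, and the chunk IDs and model names below are placeholders.

```python
def top_k_hit_rate(retrieved, ground_truth, k=3):
    """Share of queries with at least one annotated chunk in the top k results."""
    if not retrieved:
        return 0.0
    hits = 0
    for query, ranked_ids in retrieved.items():
        relevant = ground_truth.get(query, set())
        if relevant & set(ranked_ids[:k]):
            hits += 1
    return hits / len(retrieved)


# Manually annotated ground truth (placeholder chunk IDs)
ground_truth = {
    "Scope 1 emission intensity calculation": {"c12"},
    "carbon intensity": {"c07", "c31"},
}
# Ranked results per model for the same queries (placeholder data)
runs = {
    "model_a": {
        "Scope 1 emission intensity calculation": ["c12", "c40", "c03"],
        "carbon intensity": ["c18", "c22", "c05"],
    },
    "model_b": {
        "Scope 1 emission intensity calculation": ["c40", "c12", "c09"],
        "carbon intensity": ["c31", "c02", "c07"],
    },
}
scores = {name: top_k_hit_rate(results, ground_truth) for name, results in runs.items()}
print(scores)
```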

Four-model comparison:

Model	Recall rate (Top-3)	Cost per item (USD)	Deployment	Verdict / elimination reason
text-embedding-3-large	91%	$0.0001	API	✅ Final selection
text-embedding-ada-002	85%	$0.00006	API	Unstable long-text encoding; Scope term confusion
BGE-M3	82%	$0 (local)	Self-hosted	Limited ESG training data; poor fine-grained term distinction
Tongyi Qianwen Embedding	83%	Low	API	Acceptable Chinese ESG terms; poor cross-language consistency

Why not BGE-M3 (self-hosted)?

The intuition is that self-hosting is cheaper — but when you run the full cost calculation:

Dimension	text-embedding-3-large	BGE-M3 self-hosted
Monthly API / server cost	~$8/mo (100K items, batch discount)	~$50/mo (GPU instance)
Development adaptation cost	0 (out of the box)	2 weeks (domain adaptation + fine-tuning)
Recall rate	91%	82%
Long-text encoding stability	Stable	Noticeable drift on long terms

Self-hosting costs roughly 6x more per month ($50 vs. $8), requires 2 weeks of adaptation work, and delivers recall 9 percentage points lower (82% vs. 91%).

This isn't "expensive = better." It's model selection based on a clear ROI calculation.

How is data security handled?

Text is desensitized before upload — regex identifies and replaces sensitive information (company names, revenue figures, client data). Only ESG terminology and report fragments are uploaded, with no corporate identity information. We also signed OpenAI's Data Processing Agreement, satisfying compliance requirements.
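A rough sketch of what regex-based desensitization can look like. The patterns and placeholder tokens here are assumptions for illustration, not the project's actual rules; a production rule set would need to be far more thorough and tested against real reports.

```python
import re

# Hypothetical rules: (pattern, placeholder). Order matters; each runs once.
REDACTION_RULES = [
    # Capitalized name followed by a legal suffix, e.g. "Acme Holdings Ltd."
    (re.compile(r"\b[A-Z][A-Za-z&]*(?: [A-Z][A-Za-z&]*)* (?:Inc|Ltd|LLC|Corp|Co)\b\.?"), "[COMPANY]"),
    # Currency amounts, e.g. "$12.5 million"
    (re.compile(r"(?:USD|\$|€|¥)\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion|M|B)\b)?"), "[AMOUNT]"),
    # Email addresses
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
]


def desensitize(text):
    """Replace sensitive spans with placeholders before text leaves the network."""
    for pattern, placeholder in REDACTION_RULES:
        text = pattern.sub(placeholder, text)
    return text


sample = (
    "Acme Holdings Ltd. reported revenue of $12.5 million "
    "and Scope 1 emissions of 4,200 tCO2e."
)
redacted = desensitize(sample)
print(redacted)
```

The emissions figure survives because it carries no currency marker, so the ESG content the embedding model needs is left intact.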

3. Semantic Drift Mitigation: Disambiguate Before Retrieval

Switching to a better model improved recall from 82% to 91% — but false positive rate remained at 12%.

Root cause analysis: Even with 3-large, fine-grained ESG term distinction is still insufficient. "Low-carbon" and "zero-carbon" have similarity 0.85. "Scope 1 emission intensity" and "Scope 3 emissions" have similarity 0.78. The model treats them as semantically close — but in business terms they are completely different.

The solution is a three-layer augmentation strategy that layers domain knowledge on top of the model:

Layer 1: Domain term dictionary (500+ entries)

The dictionary maps professional terms, abbreviations, and synonyms:

ESG_TERM_DICT = {
    "Scope 1": {
        "definition": "Direct GHG emissions from sources owned or controlled by the organization",
        "synonyms": ["direct emissions", "direct carbon emissions", "Scope 1 emissions"],
        "domain": "Environmental",
        "distinct_from": ["Scope 2", "Scope 3"]  # explicit disambiguation
    },
    "low-carbon": {
        "definition": "Reduced carbon emissions, but emissions still exist",
        "distinct_from": ["zero-carbon", "net-zero emissions"],  # key: explicitly not zero-carbon
        "domain": "Environmental"
    },
    # 500+ entries...
}

Dictionary data sourced from three layers:

GRI official standard documents → 200+ core terms extracted
10 industry ESG reports → 300+ commonly used terms extracted
ESG domain experts → synonyms and fine-grained disambiguation relationships annotated
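The dictionary only helps if a query can be mapped to its entries. One simple way to do that, sketched here as an assumption rather than the project's actual lookup, is a case-insensitive substring match over each term and its synonyms. The three-entry dictionary is a cut-down stand-in for ESG_TERM_DICT.

```python
# Cut-down stand-in for the full dictionary
SAMPLE_TERMS = {
    "Scope 1": {"synonyms": ["direct emissions"]},
    "Scope 3": {"synonyms": ["value chain emissions"]},
    "low-carbon": {},
}


def find_dictionary_terms(query, term_dict):
    """Return dictionary keys whose name or any synonym appears in the query."""
    lowered = query.lower()
    matches = []
    for term, info in term_dict.items():
        surface_forms = [term] + info.get("synonyms", [])
        if any(form.lower() in lowered for form in surface_forms):
            matches.append(term)
    return matches


print(find_dictionary_terms("Scope 1 emission intensity", SAMPLE_TERMS))
print(find_dictionary_terms("Direct emissions from boilers", SAMPLE_TERMS))
print(find_dictionary_terms("zero-carbon roadmap", SAMPLE_TERMS))
```

Plain substring matching is deliberately naive: a term like "Scope 1" would also match inside "Scope 10", so word-boundary matching may be worth adding for larger dictionaries.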

Layer 2: Domain hints embedded in prompt

At encoding time, dictionary information is appended to the text sent to the embedding model, giving it precise semantic context:

from typing import Optional

def build_embedding_prompt(text: str, term: Optional[str] = None) -> str:
    """Build the text to embed, appending dictionary context when the term is known."""
    base_prompt = f"Encode text: {text}"

    if term and term in ESG_TERM_DICT:
        term_info = ESG_TERM_DICT[term]
        # Entries such as "low-carbon" have no synonyms, hence the .get() defaults
        domain_hint = f"""
Domain context:
- {term} is an ESG {term_info['domain']} domain term
- Definition: {term_info['definition']}
- Synonyms: {', '.join(term_info.get('synonyms', []))}
- Distinct from: {', '.join(term_info.get('distinct_from', []))}
"""
        return base_prompt + domain_hint

    return base_prompt

Layer 3: Post-retrieval reranking

After retrieving Top 5 candidates, the term dictionary is used to rerank results — chunks containing standard synonyms get a score boost; chunks containing terms in the "distinct_from" relationship get downweighted:

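def rerank_results(query_term: str, results: list) -> list:
    # Look the term up once; unknown terms get no boost and no penalty
    term_info = ESG_TERM_DICT.get(query_term, {})
    synonyms = term_info.get("synonyms", [])
    distinct_terms = term_info.get("distinct_from", [])

    for result in results:
        # Start from the vector similarity so that += / -= never hit a missing key
        result.setdefault("rerank_score", result["similarity_score"])

        # Contains standard synonym → boost score
        if any(syn in result["text"] for syn in synonyms):
            result["rerank_score"] += 0.1

        # Contains "distinct_from" term → penalize score
        if any(dt in result["text"] for dt in distinct_terms):
            result["rerank_score"] -= 0.15

    return sorted(results, key=lambda x: x["rerank_score"], reverse=True)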
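To make the boost and penalty arithmetic concrete, here is a self-contained sketch with two toy chunks. It is a simplified variant, not the project's code: it assumes the rerank score starts from the vector similarity, matches case-insensitively, and uses a one-entry stand-in dictionary.

```python
# One-entry stand-in for the term dictionary
TERMS = {
    "Scope 1": {
        "synonyms": ["direct emissions"],
        "distinct_from": ["Scope 2", "Scope 3"],
    }
}


def rerank(query_term, results, boost=0.1, penalty=0.15):
    """Return copies of results sorted by similarity adjusted for dictionary hits."""
    info = TERMS.get(query_term, {})
    reranked = []
    for result in results:
        text = result["text"].lower()
        score = result["similarity_score"]
        if any(s.lower() in text for s in info.get("synonyms", [])):
            score += boost
        if any(d.lower() in text for d in info.get("distinct_from", [])):
            score -= penalty
        reranked.append({**result, "rerank_score": round(score, 4)})
    return sorted(reranked, key=lambda r: r["rerank_score"], reverse=True)


results = [
    {"chunk_id": "a", "text": "Scope 3 emissions from purchased goods", "similarity_score": 0.78},
    {"chunk_id": "b", "text": "Direct emissions from company-owned boilers", "similarity_score": 0.72},
]
ranked = rerank("Scope 1", results)
for r in ranked:
    print(r["chunk_id"], r["rerank_score"])
```

The Scope 3 chunk starts ahead on raw similarity (0.78 vs. 0.72) but ends behind after the penalty and boost are applied (0.63 vs. 0.82).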

Two real incident cases:

Case 1: Low-carbon vs. zero-carbon

Problem: querying "low-carbon" returned zero-carbon content with similarity 0.85
Root cause: model treats both as "reducing carbon emissions" — semantically close
Fix: dictionary explicitly marks distinct_from relationship; prompt emphasizes "low-carbon ≠ zero-carbon"
Result: similarity dropped from 0.85 to 0.65; retrieval now distinguishes them precisely

Case 2: Scope 1 emission intensity vs. Scope 3 emissions

Problem: querying "Scope 1 emission intensity" returned Scope 3 content with similarity 0.78
Root cause: model treats Scope 1 and Scope 3 as both "emissions-related" — close in vector space
Fix: dictionary gives each Scope its own precise definition and mutual distinct_from relationships
Result: similarity dropped from 0.78 to 0.55; Scope confusion false positive rate < 1%

Three-layer augmentation results: false positive rate 12% → 3%, term matching accuracy 82% → 90%.

4. Dual Validation: A High Score on One Path Isn't Enough

After semantic drift mitigation, one problem remained: high vector similarity, but business semantics are unrelated.

Typical case: querying for GRI 306 waste management clauses returned a report chunk about "spill incident handling" with similarity 0.82. In vector space, the two are genuinely close (both are environmental incident-related) — but "waste management" and "spill incidents" are completely different compliance clauses.

The fundamental limitation of single-path vector retrieval: vector similarity is a statistical measure of "text distance in vector space" — not a business measure of "semantic relevance."

The solution is dual validation: keyword hard match + vector similarity — both must pass to count as a hit.

def dual_verify(query: dict, candidate_chunk: dict) -> bool:
    # Condition 1: vector similarity threshold met
    vector_match = candidate_chunk["similarity_score"] >= 0.7

    # Condition 2: keyword hard match (core keywords from the queried clause must appear)
    required_keywords = query.get("required_keywords", [])
    matched = sum(1 for kw in required_keywords if kw in candidate_chunk["text"])
    # At least half the keywords must match, rounded up, so 3 keywords require 2 hits
    # (len // 2 would round down and accept 1 of 3). With no keywords, nothing can match.
    min_matches = max(1, (len(required_keywords) + 1) // 2)
    keyword_match = matched >= min_matches

    return vector_match and keyword_match

Three-layer false positive filter (complete flow):

Layer 1 — Keyword hard match (millisecond-level)
  When querying for GRI 305 (greenhouse gas emissions),
  retrieved chunks must contain at least 2 of:
  ["Scope 1", "Scope 2", "emissions volume", "calculation method"]
  → Filters out chunks like "spill incidents" that score high but fail keyword match
  → Eliminates ~60% of obvious false positives

Layer 2 — LLM semantic cross-validation (< 1s)
  For chunks passing Layer 1, ask the LLM:
  "Does this content actually answer the disclosure points required by the clause?"
  → Filters out chunks that "mention emissions but lack calculation method and data source"
  → Eliminates ~30% of remaining semantically irrelevant chunks

Layer 3 — Manual spot-check calibration (monthly)
  Monthly spot-check of 100 retrieval results, manually judged for false positives
  If false positive rate > 5%, trigger keyword library update or threshold adjustment
  → Continuous calibration to prevent system degradation as business evolves
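The flow above can be sketched as a small pipeline. This is an illustration under assumptions: `llm_check` is a hypothetical callable standing in for the real LLM call, the stub below fakes it with a string test, and the sample chunks are invented.

```python
def keyword_hits(text, keywords):
    """Count how many required keywords appear in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered)


def filter_false_positives(chunks, clause, keywords, min_hits, llm_check):
    """Layer 1 (cheap keyword match) runs first, so the LLM only sees survivors."""
    kept = []
    for chunk in chunks:
        if keyword_hits(chunk["text"], keywords) < min_hits:
            continue  # Layer 1: fails keyword hard match
        if not llm_check(clause, chunk["text"]):
            continue  # Layer 2: LLM says the disclosure points are not answered
        kept.append(chunk)
    return kept


def needs_recalibration(false_positives, sample_size, limit=0.05):
    """Layer 3: monthly spot-check; True when the false positive rate exceeds the limit."""
    return sample_size > 0 and false_positives / sample_size > limit


def stub_llm_check(clause, text):
    # Stand-in for a real LLM call, for demonstration only
    return "calculation method" in text.lower()


GRI_305_KEYWORDS = ["Scope 1", "Scope 2", "emissions volume", "calculation method"]
chunks = [
    {"id": "c1", "text": "Spill incident handling procedure for the Q3 leak"},
    {"id": "c2", "text": "Scope 1 and Scope 2 totals are listed for 2023"},
    {"id": "c3", "text": "Scope 1 emissions volume with calculation method based on fuel invoices"},
]
kept = filter_false_positives(chunks, "GRI 305", GRI_305_KEYWORDS, 2, stub_llm_check)
print([c["id"] for c in kept])
print(needs_recalibration(false_positives=6, sample_size=100))
```

Ordering the layers from cheapest to most expensive keeps the LLM call off the hot path for chunks that a string match can already reject.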

Dual validation results: accuracy 70% → 94%, false positive rate 15% → 3%.

5. Vector Store Selection and Parameter Tuning

Why Milvus?

Three options compared:

Option	Performance	Multi-condition filtering	Ecosystem	Verdict / elimination reason
Milvus	Million-scale vectors at 50ms	✅ Single query handles it	Mature Python SDK	✅ Final selection
Pinecone	Comparable performance	⚠️ Weak filtering capability	Good	Multi-condition filtering requires multiple queries — high cost
FAISS	Strong performance	❌ Not supported	Average	Pure vector library, no metadata filtering support

Milvus's core advantage: multi-condition filtering in a single query:

search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 20}
}

# Single query combines vector search with scalar filters: character count + model version
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param=search_params,
    limit=3,  # top_k=3
    expr="char_count >= 20 and embedding_model == 'text-embedding-3-large'",
    output_fields=["chunk_id", "page_range", "similarity_score"]
)

The three retrieval parameters:

Parameter	Value	Design rationale
top_k	3	Retrieve 3 candidates for LLM judgment — more introduces noise, fewer risks missing content
Similarity threshold	0.7	Calibrated against 500 reports — 0.7 is the balance point between recall and false positives
nprobe	20	IVF_FLAT search scope — at nlist=128, nprobe=20 balances accuracy and speed
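As a rough illustration of how nlist and nprobe relate, the sketch below writes the two parameter sets as plain dictionaries and computes the share of IVF clusters scanned per query. The dictionary shapes follow the pymilvus convention used in the search call above; the helper function is hypothetical.

```python
NLIST = 128   # number of IVF clusters built at index time (from the table above)
NPROBE = 20   # number of clusters scanned per query

index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": NLIST},
}
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": NPROBE},
}


def probed_fraction(nprobe, nlist):
    """Share of clusters a single query scans; 1.0 means an exhaustive search."""
    return min(nprobe, nlist) / nlist


print(f"nprobe=20: {probed_fraction(20, NLIST):.1%} of clusters scanned")
print(f"nprobe=10: {probed_fraction(10, NLIST):.1%} of clusters scanned")
```

With pymilvus, `index_params` would typically be passed to `collection.create_index(field_name="embedding", index_params=index_params)`. A larger nprobe scans a larger share of clusters per query, which generally improves recall at the cost of more work per search.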

Real incident: concurrency above 10 caused latency to spike from 50ms to 200ms

Early after launch, when concurrent queries exceeded 10, latency jumped from 50ms to 200ms with occasional timeouts.

Diagnosis:

Checked Milvus server resources — CPU and memory were not saturated. Not a resource bottleneck.
Checked index parameters — nprobe=10 gave too narrow a search scope; queue backlog built up under concurrency.
Checked caching — high-frequency queries (e.g., "GRI 305-1 carbon emissions") were re-executing full searches every time.

Two-step fix:

# Fix 1: increase nprobe for better stability under concurrency
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 20},  # increased from 10 to 20
}

# Fix 2: cache high-frequency query results (Redis, TTL=1 hour)
import json
import redis

cache = redis.Redis()

def cached_search(query_vector: list, query_key: str) -> list:
    cached = cache.get(query_key)
    if cached is not None:
        return json.loads(cached)  # Redis returns bytes; json.loads accepts them

    # milvus_search is the project's search wrapper; its results must be JSON-serializable
    results = milvus_search(query_vector)
    cache.setex(query_key, 3600, json.dumps(results))  # cache for 1 hour
    return results

Result: latency dropped from 200ms to 80ms, cache hit rate 70%, stable support for 10+ concurrent queries.
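The cache hit rate depends heavily on how the cache key is derived. One possible scheme, offered as an assumption since the article does not show it, normalizes the query text and hashes it together with the parameters that affect the result set. The function name and key prefix are hypothetical.

```python
import hashlib
import json


def make_query_key(query_text, model="text-embedding-3-large", top_k=3, expr=""):
    """Build a stable cache key from the normalized query and search parameters."""
    # Collapse case and whitespace so trivially different queries share a key
    normalized = " ".join(query_text.lower().split())
    payload = json.dumps(
        {"q": normalized, "model": model, "top_k": top_k, "expr": expr},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"search:{digest}"


key_a = make_query_key("GRI 305-1  Carbon Emissions")
key_b = make_query_key("gri 305-1 carbon emissions")
key_c = make_query_key("gri 305-1 carbon emissions", top_k=5)
print(key_a == key_b, key_a == key_c)
```

Including the model name and filter expression in the key avoids serving results cached under a different embedding model or filter.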

6. Cost Control

Once model selection was finalized, cost control relied on two mechanisms:

Mechanism 1: Batch processing for volume discount

The OpenAI Embeddings API accepts a list of inputs in a single request. In this project, submitting 100 items per batch reduced per-item cost by about 20%:

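from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def batch_embed(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-large",
            input=batch  # one request carries the whole batch
        )
        # Sort by index so each embedding lines up with its input text
        ordered = sorted(response.data, key=lambda item: item.index)
        all_embeddings.extend([item.embedding for item in ordered])
    return all_embeddings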

Mechanism 2: Cache embeddings for high-frequency terms

The GRI clause library is relatively static — vectors for 300+ clauses don't need to be regenerated on every request. Pre-compute and cache them at startup, saving 30% of API calls:

# Preload GRI clause vectors at startup
def preload_gri_embeddings():
    clauses = get_all_gri_clauses()  # ~300 clauses
    embeddings = batch_embed([c["text"] for c in clauses])

    for clause, embedding in zip(clauses, embeddings):
        cache.set(
            f"gri_embedding:{clause['disclosure_id']}",
            json.dumps(embedding),
            ex=86400  # 24-hour cache
        )
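The read side of this cache is not shown above. A minimal sketch, assuming a Redis-like object with `get` and `set`, might fall back to re-embedding when an entry has expired. `get_gri_embedding`, `InMemoryCache`, and `fake_embed` are hypothetical names used only so the example runs without Redis or an API key.

```python
import json


class InMemoryCache:
    """Tiny stand-in exposing the subset of the Redis client API used here."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, ex=None):
        self._data[key] = value  # TTL ignored in this stand-in


def get_gri_embedding(clause, cache, embed_fn, ttl_seconds=86400):
    """Return the cached clause vector, recomputing and re-caching it on a miss."""
    key = f"gri_embedding:{clause['disclosure_id']}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    embedding = embed_fn(clause["text"])
    cache.set(key, json.dumps(embedding), ex=ttl_seconds)
    return embedding


calls = []


def fake_embed(text):
    calls.append(text)  # record each simulated API call
    return [0.1, 0.2, 0.3]


demo_cache = InMemoryCache()
clause = {"disclosure_id": "305-1", "text": "Direct (Scope 1) GHG emissions"}
first = get_gri_embedding(clause, demo_cache, fake_embed)
second = get_gri_embedding(clause, demo_cache, fake_embed)
print(first == second, len(calls))
```

The second lookup is served from the cache, so the embedding function is called only once.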

Final cost comparison:

Option	Monthly cost	Recall rate	Miss rate
ada-002 (original)	~$6/mo	85%	12%
3-large (unoptimized)	~$10/mo	91%	5%
3-large (batch + cache optimized)	~$8/mo	91%	5%
BGE-M3 self-hosted	~$50/mo	82%	15%

3-large optimized costs only $2/month more than ada-002, with recall 6 percentage points higher and miss rate 7 percentage points lower.

7. Wrapping Up: The Retrieval Decision Tree

When facing a new retrieval scenario, two questions determine the approach:

Q1: Does the data contain domain-specific terminology?
  ├─ Yes (legal / medical / financial / ESG or other specialized domains)
  │   → General-purpose models will drift
  │   → Required: domain term dictionary + prompt domain hints + post-retrieval reranking
  │   → Go to Q2
  └─ No (general text)
      → General-purpose embedding model + single-path vector retrieval is sufficient

Q2: Does the query require fine-grained semantic distinction?
  ├─ Yes (e.g., Scope 1 vs. Scope 3, low-carbon vs. zero-carbon)
  │   → Single-path vector retrieval is not enough
  │   → Required: dual validation (keyword hard match + vector similarity)
  │   → Add three-layer false positive filter (keywords → LLM cross-validation → manual spot-check)
  └─ No (coarse-grained semantic distinction is sufficient)
      → Single-path vector retrieval + similarity threshold is sufficient
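The same tree can be written as a small function, which is handy as a checklist when scoping a new project. This is only a restatement of the two questions above; the function name and the step labels are illustrative.

```python
def choose_retrieval_strategy(domain_specific, fine_grained):
    """Map the two decision-tree answers to the list of required components."""
    if not domain_specific:
        # Q1 = No: general text, no need to ask Q2
        return ["general-purpose embedding model", "single-path vector retrieval"]

    # Q1 = Yes: drift mitigation is required in every case
    steps = [
        "domain term dictionary",
        "prompt domain hints",
        "post-retrieval reranking",
    ]
    if fine_grained:
        # Q2 = Yes: add dual validation and the false positive filter
        steps += [
            "dual validation (keyword hard match + vector similarity)",
            "three-layer false positive filter",
        ]
    else:
        # Q2 = No: a similarity threshold is enough
        steps.append("single-path vector retrieval + similarity threshold")
    return steps


for step in choose_retrieval_strategy(domain_specific=True, fine_grained=True):
    print("-", step)
```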

Transferability of this retrieval approach:

Domain term dictionary → swap in legal / medical / financial terminology; the logic is identical
Prompt domain hints → applicable to any specialized domain; just replace the dictionary content
Dual validation → applicable to any scenario requiring high-precision recall; swap in the keyword library for your business domain

Source Code

All implementations referenced in this article are available here:

👉 github.com/muzinan123/production-rag-engineering

Relevant files for this part:

esg/services/embedding_service.py — multi-provider embedding + batch write + 4-layer metadata
esg/services/search_service.py — Milvus vector retrieval, top_k + threshold dual-parameter filtering

Next up: Retrieval is solid. Relevant content is being surfaced. But a high semantic similarity score does not equal a correct business conclusion. Similarity 0.88 — but the company only disclosed total emissions volume, with no calculation method and no data source. Does that satisfy GRI 305-1? Between "retrieved content" and "a quantifiable, auditable conclusion," there are three gaps. → Part 4 — Judgment Engine
