Utku Catal

Posted on Jun 19 • Originally published at utkucatal.com

Building a RAG From Scratch — Collect, Clean, Embed, Refuse

#ai #rag #programming #productivity

The first version of this thing told me we sold a hydraulic excavator. We don't. It said it with full confidence — a clean part number, a price, a tidy little description. All invented.

That was the moment I stopped trusting "just put the data in the prompt" as a plan.

So I rebuilt it with one rule: it answers from the catalog, or it doesn't answer at all. The catalog is about 411 industrial products — motors, pumps, gears, that kind of thing. Ask for that excavator and it should say it's not there. Ask it to write a poem and it should politely decline.

This post walks through the whole thing in the order I built it: collect the data, clean it, embed it, and make the assistant refuse anything it can't back with a real product. Nothing fancy — one Postgres table and a few hundred lines of Python.

Here's the whole system on one page before we dig in:

The data prep and indexing happen once. Everything below the line runs per question.

New to RAG? RAG (Retrieval-Augmented Generation) means: before the model answers, you search your own data for relevant pieces and paste them into the prompt. The model answers from those pieces instead of from memory. The "retrieval" is the search; the "generation" is the model writing the answer.

Collect

The data lived as product pages on a website. I didn't want to hit the site every time I changed my mind about parsing, so I split collection into two stages: download once, parse as many times as I want.

Stage one downloads the HTML and saves each page to disk. Two details made this painless:

It's idempotent. If a file already exists and isn't tiny, it's skipped. Re-running only fetches what's missing.
No separate manifest. The listing pages already know a few things about each product (URL, brand, name, price, id). Instead of saving that in a side file, I write it as a comment at the top of each HTML page:

def download_one(url, meta):
    fpath = PAGES / slugify(url)
    if fpath.exists() and fpath.stat().st_size > 1000:
        return url  # already have it, skip
    h = s.fetch(url)
    if not h:
        return None
    header = s.build_meta_header({"url": url, **meta})
    fpath.write_text(header + h, encoding="utf-8")
    return url

So every saved file carries its own metadata. The parser later reads it straight from the file. The site gets hit once, and everything after that is offline.

This sounds like a small thing, but it's the part most tutorials skip. You will get the parsing wrong on the first try. You want to fix it and re-run without touching the network again.

Clean

Stage two never goes online. It reads the saved pages and turns them into one clean catalog.json:

for fpath in sorted(PAGES.glob("*.html")):
    h = fpath.read_text(encoding="utf-8")
    meta = s.extract_meta(h)        # the comment header from stage one
    url = meta.get("url") or fpath.stem
    products.append(s.parse_product(url, meta, h))

Each product comes out as a flat record: id, title, manufacturer, oem_pn, condition, price_eur, weight_kg, dimensions_cm, category, url, and a description.

One habit that saved me a lot of guessing: after parsing, print how many records actually have each field filled in.

--- field coverage ---
  manufacturer   411/411
  price_eur      411/411
  weight_kg      405/411
  ...

If a field is mostly empty, you find out now, not when a user's filter silently returns nothing. Clean data is boring to talk about and it's where most of the real work is.

Embed

Now the part people think of as "the AI part," which is actually the smallest.

Each product becomes one block of text — title, brand, part number, category, condition, and description joined together. One product, one chunk. The descriptions are short, so there's no reason to split them.

def product_text(p):
    parts = [p.get("title"), p.get("manufacturer"), p.get("oem_pn"),
             p.get("category"), p.get("condition"), p.get("description")]
    return " | ".join(x for x in parts if x)

Then I turn each block into a vector with bge-m3, a model that runs locally. (A vector is just a list of numbers that captures meaning — two texts about similar things get vectors that point the same way. To find products like a query, you embed the query too and look for the closest vectors.)

def embed(texts):
    return embedder().encode(texts, normalize_embeddings=True)

I picked bge-m3 for three reasons: it's free, it runs on my machine (no embedding API key), and it's multilingual — I can ask in one language and match English product text. It outputs 1024 numbers per product, which matters in a second.

The vectors and the metadata go into pgvector, a Postgres extension that stores vectors right next to normal columns:

cur.execute(f"""
    CREATE TABLE IF NOT EXISTS products (
        url           text PRIMARY KEY,
        title         text,
        manufacturer  text,
        price_eur     double precision,
        weight_kg     double precision,
        category      text,
        description   text,
        embedding     vector({DIM})
    )
""")

Then an HNSW index for fast similarity search:

cur.execute("""
    CREATE INDEX IF NOT EXISTS products_emb_idx
    ON products USING hnsw (embedding vector_cosine_ops)
""")

This is why 1024 dimensions matters: pgvector's HNSW index caps at 2000, so bge-m3 fits with room to spare. Inserts use ON CONFLICT ... DO UPDATE, so re-running on a changed catalog just updates rows instead of failing.

The big win of putting vectors inside Postgres shows up next.

Retrieve

Now the real question: how do you find the right products for a query like "Siemens motors under €2000"?

Pure vector search struggles here. The descriptions are templated and look a lot alike, so "motor" matches dozens of things. And the model has no real sense of "under €2000." That's a number, not a meaning.

So I use both. First I ask Claude to pull the structured parts out of the query and hand them back as plain JSON:

FILTER_SYSTEM = (
    "Extract structured search filters from the user's product query and "
    "return ONLY this JSON:\n"
    '{"semantic_query": str, "manufacturer": str|null, "category": str|null, '
    '"max_price": number|null, "min_weight": number|null}'
)

For "Siemens motors under €2000" that gives manufacturer: "Siemens", max_price: 2000, and a leftover semantic_query for the fuzzy part. If the model ever fails or returns junk, the code falls back to treating the whole query as a plain semantic search — so a bad parse degrades quietly instead of crashing.

Then the structured filters and the vector search run as one SQL query. The filters become WHERE clauses; the vector similarity becomes the ORDER BY:

sql = f"""
    SELECT id, title, manufacturer, price_eur, weight_kg, url, description,
           1 - (embedding <=> %s) AS sim
    FROM products
    {clause}                       -- manufacturer ILIKE, price_eur <= ...
    ORDER BY embedding <=> %s       -- cosine distance to the query vector
    LIMIT %s
"""

This is the payoff of keeping vectors in Postgres. The exact filters (brand, price, weight) do the hard cutting; the vector order handles the fuzzy "which of these is most relevant" part. One statement. No second system to keep in sync.

Refuse

Good retrieval gets you the right products. But the assistant still needs to know when there are no right products — and stop. This is where most RAG systems quietly hallucinate, and it's the part I cared about most.

I used two guardrails, and the first one is the important one.

1. A similarity threshold that runs before the model. Every match comes back with a sim score between 0 and 1. Anything below a floor gets dropped:

results = [r for r in results if r["sim"] >= MIN_SIM]   # MIN_SIM = 0.45

If nothing clears the bar, the model is never called at all:

def answer(client, query, results):
    if not results:
        return "I couldn't find a relevant product in the catalog."
    ...

That's the whole trick. You can't hallucinate an answer the model was never asked to write. Ask for a hydraulic excavator that isn't in the data, the search comes back empty, and the reply is a flat "not found" — no LLM involved. This is much stronger than asking a model to behave, because it removes the chance to misbehave.

2. A system prompt that locks the model to what it was given. When there are matches, the prompt is strict:

GEN_SYSTEM = (
    "You are an industrial product catalog assistant. Answer ONLY from the "
    "PRODUCTS provided in the message. If none of them are relevant, say "
    "\"I couldn't find a relevant product in the catalog.\" Never invent "
    "products or specs, and don't answer from general knowledge. Politely "
    "refuse general chat. Always show the id and url of every product you use."
)

So a "write me a poem" gets a polite refusal, and a real answer always cites the product id and URL you can check.

People reach for the prompt first. It's the weaker of the two. The prompt asks for good behavior; the threshold makes bad behavior impossible. Use both, but lean on the threshold.

Seeing it work

Put the query side together and a single question takes this path:

The front end is a small Streamlit chat, but with one thing I'd add to any RAG: it shows its work. Under each answer there are two panels — the filters it pulled from your question with the scored products it retrieved, and the exact prompt it sent to the model.

That transparency is not a nice-to-have. When an answer looks wrong, you want to know which half broke: did retrieval bring back the wrong products, or did the model misread good ones? Showing the retrieval and the prompt tells you in one glance.

A few real examples:

You ask	What happens
`vacuum pumps over 300 kg`	weight filter + semantic match
`Siemens motors under €2000`	brand + price filter
`VTLF 2.250/0-79.1`	exact part lookup
`hydraulic excavator`	refused — empty retrieval, model never called
`write me a poem`	refused — off-topic

What I'd change

It's only fair to say what this doesn't do yet.

The threshold is one number. MIN_SIM = 0.45 fits this catalog. Too high and good answers vanish; too low and junk slips through. A smarter version would tune it from real queries instead of by feel.
No reranker. For a few hundred products, cosine order is plenty. Scale up and you'd add a reranking pass to tidy the top results.
Eval is manual. I check a few known queries by hand. The real fix is a small set of (question, expected_id) pairs, so a prompt tweak can't quietly break retrieval.

None of that changes the shape of it, though. Collect carefully, clean honestly, embed simply, refuse by design. The refusing is what makes it trustworthy — and that came from the architecture, not from asking the model nicely.

DEV Community