Navas Herbert

Build Your First RAG System From Scratch - No Frameworks, Just Python


You have probably heard the term RAG thrown around in every AI discussion lately. Retrieval-Augmented Generation. It sounds complicated, but the idea is beautifully simple once you see it in action.

In this article you will build a working RAG system from scratch using pure Python and local models running on your own machine. No LangChain. No complex frameworks. Just the raw concepts so you actually understand what is happening at every step.

By the end you will have a FastAPI server you can query with questions, and you will understand how it works well enough to explain it to someone else.


What is RAG and Why Should You Care?

Imagine you hire a very smart assistant who has read millions of books. If you ask them a question, they will answer confidently - but sometimes they will guess, and sometimes they will be wrong. This is what a standard LLM does. It answers from memory, and memory can be wrong or outdated.

Now imagine before asking your question, you hand that assistant the exact pages from the right book. Suddenly they are not guessing anymore. They are reading and summarising. That is RAG.

RAG addresses two of the biggest problems with plain LLMs:

  • Hallucination - the model invents facts it does not know.
  • Staleness - the model's training data has a cutoff date.

With RAG, your model answers from documents you control. You can update those documents any time without touching the model itself.


The Big Picture

Before writing any code, understand the two phases of RAG:

Phase 1 - Indexing (done once)

Your document
    ↓
Split into chunks
    ↓
Convert each chunk to a vector (embedding)
    ↓
Store vectors in a vector database

Phase 2 - Retrieval and Generation (happens for every question)

User asks a question
    ↓
Convert question to a vector
    ↓
Find the most similar chunks in the database
    ↓
Combine chunks + question into a prompt
    ↓
Send prompt to LLM
    ↓
Get a grounded answer

That is the entire system. Everything else is implementation detail.
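
To make the two phases concrete before any model is involved, here is a minimal sketch that runs both of them on three hand-written chunks. It is a toy under stated assumptions: toy_embed and toy_cosine are hypothetical stand-ins that count words instead of calling a real embedding model, and the final step only builds a prompt string rather than calling an LLM.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: count lowercase words."""
    return Counter(text.lower().split())


def toy_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Phase 1 - Indexing: embed every chunk once and keep text and vector together.
chunks = [
    "chunking splits a document into small overlapping pieces",
    "an embedding is a list of numbers that represents meaning",
    "the prompt combines retrieved chunks with the question",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Phase 2 - Retrieval and generation: embed the question, pick the closest
# chunk, and place it in front of the question.
question = "what is an embedding"
query_vector = toy_embed(question)
best_chunk, _ = max(index, key=lambda item: toy_cosine(query_vector, item[1]))
prompt = f"Context:\n{best_chunk}\n\nQuestion: {question}"
print(prompt)
```

Word counting only matches exact words, which is exactly the weakness a real embedding model removes. The shape of the pipeline stays the same in the steps that follow.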


What You Need

  • Python 3.11+
  • Ollama installed on your machine

Pull these two models into Ollama:

ollama pull nomic-embed-text
ollama pull llama3.2

Then start Ollama in a separate terminal:

ollama serve

Install Python dependencies:

pip install fastapi uvicorn numpy requests

Step 1 - Chunking

We cannot send an entire document to the LLM every time someone asks a question. Context windows have limits, and resending the whole document is wasteful. Instead we split the document into small overlapping pieces called chunks.

Create a file called 1_chunking.py, and put the text you want to query in a file called document.txt in the same folder:

def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """
    Split text into overlapping chunks by word count.

    chunk_size: number of words per chunk
    overlap:    words repeated between consecutive chunks
    """
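    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        # Stop once a chunk reaches the last word; otherwise the next chunk
        # would only repeat the tail of this one.
        if end == len(words):
            break
        # Move forward by (chunk_size - overlap) to create the overlap
        start += chunk_size - overlap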

    return chunks


if __name__ == "__main__":
    with open("document.txt", "r") as f:
        document = f.read()

    chunks = chunk_text(document)
    print(f"Document split into {len(chunks)} chunks")

    for i, chunk in enumerate(chunks):
        print(f"\nChunk {i + 1} ({len(chunk.split())} words):")
        print(chunk)

Why word-based splitting? It is more natural than character-based splitting, which can cut a word in half. A chunk of 100 words is roughly a short paragraph regardless of how long those words are.

Why overlap? A sentence that falls at the boundary between two chunks would get cut in half without overlap. The last 20 words of one chunk appear at the start of the next, so a sentence split at a boundary usually appears whole in at least one of the two chunks.
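
The overlap is easier to see with small numbers. The sketch below is a hypothetical helper, chunk_spans, that returns only the start and end word indices of each chunk; it is an illustration, not part of the project files. It assumes the loop stops as soon as a chunk reaches the final word.

```python
def chunk_spans(n_words: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) word indices covered by each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:
            break
        start += step
    return spans


words = "one two three four five six seven eight nine ten eleven twelve".split()
for start, end in chunk_spans(len(words), chunk_size=5, overlap=2):
    print(f"words[{start}:{end}] ->", " ".join(words[start:end]))

# words[0:5] -> one two three four five
# words[3:8] -> four five six seven eight
# words[6:11] -> seven eight nine ten eleven
# words[9:12] -> ten eleven twelve
```

Each chunk starts 3 words after the previous one (5 minus 2), so the last two words of every chunk are repeated at the start of the next.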

Run it:

python 1_chunking.py

You should see your document split into several chunks. The exact count depends on its length (the sample used here gives 6 to 8). Try changing chunk_size to 50 and notice how many more chunks you get.


Step 2 - Embeddings

An embedding is a list of numbers that represents the meaning of a piece of text. Similar text produces similar numbers. This is what makes search possible.

Create 2_embedding.py:

import numpy as np
import requests


def embed_text(text: str) -> np.ndarray:
    """
    Convert text to an embedding vector using Ollama.
    Returns a numpy array of floats.
    """
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text}
    )
    response.raise_for_status()
    vector = response.json()["embedding"]
    return np.array(vector, dtype=np.float32)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Measure how similar two vectors are.
    1.0 = identical meaning, 0.0 = unrelated, -1.0 = opposite
    """
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    pairs = [
        (
            "RAG retrieves documents to help an LLM answer questions.",
            "Retrieval-Augmented Generation uses external knowledge.",
            "Similar topic"
        ),
        (
            "RAG retrieves documents to help an LLM answer questions.",
            "My favourite food is jollof rice.",
            "Unrelated"
        ),
    ]

    for a, b, label in pairs:
        score = cosine_similarity(embed_text(a), embed_text(b))
        print(f"{label}: {score:.4f}")
        print(f"  A: {a}")
        print(f"  B: {b}\n")

Run it:

python 2_embedding.py

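You will see the similar pair score clearly higher than the unrelated pair. The exact values depend on the embedding model, and unrelated text often scores well above 0.0 in practice, so compare the scores against each other rather than against a fixed scale. This ranking is the mathematical foundation of RAG search.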
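
If the formula feels abstract, it helps to compute it by hand on tiny vectors. The sketch below uses hand-picked two-dimensional vectors (an assumption for illustration; real embeddings have hundreds of dimensions) and a plain-Python cosine function so it runs without Ollama.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # second vector is twice the first
perpendicular = cosine([1.0, 0.0], [0.0, 1.0])     # nothing in common
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # points the other way

print(f"same direction: {same_direction:.2f}")
print(f"perpendicular:  {perpendicular:.2f}")
print(f"opposite:       {opposite:.2f}")
```

Note that the first pair scores about 1.0 even though one vector is twice as long as the other. Cosine similarity compares direction only, which is why it works well for comparing texts of different lengths.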


Step 3 - Vector Search

Now we embed every chunk and store it. When a question arrives, we embed it and find the chunks with the highest similarity score. This is your mini vector database.

Create 3_vector_search.py:

import numpy as np
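from importlib import import_module

# File names that start with a digit are not valid in a normal `import`
# statement, so the earlier steps are loaded with import_module instead.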

chunk_text        = import_module("1_chunking").chunk_text
embed_text        = import_module("2_embedding").embed_text
cosine_similarity = import_module("2_embedding").cosine_similarity


def build_vector_store(chunks: list[str]) -> list[dict]:
    """Embed every chunk and store it with its text."""
    print(f"Embedding {len(chunks)} chunks...")
    store = []
    for i, chunk in enumerate(chunks):
        vector = embed_text(chunk)
        store.append({"text": chunk, "vector": vector})
        print(f"  [{i+1}/{len(chunks)}] done")
    return store


def search(query: str, store: list[dict], top_k: int = 3) -> list[dict]:
    """Find the top_k most relevant chunks for a query."""
    query_vector = embed_text(query)
    results = []
    for item in store:
        score = cosine_similarity(query_vector, item["vector"])
        results.append({"score": score, "text": item["text"]})
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:top_k]


if __name__ == "__main__":
    with open("document.txt", "r") as f:
        document = f.read()

    chunks = chunk_text(document, chunk_size=100, overlap=20)
    store  = build_vector_store(chunks)

    question = "What are the two phases of RAG?"
    results  = search(question, store, top_k=2)

    print(f"\nQuestion: {question}\n")
    for rank, result in enumerate(results, 1):
        print(f"#{rank} Score: {result['score']:.4f}")
        print(f"   {result['text'][:200]}...")

Run it:

python 3_vector_search.py

The chunk about indexing and retrieval should rank first. Notice how the score reflects how relevant each chunk is.
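
The loop in search compares the question with one chunk at a time, which is fine for a handful of chunks. As an optional variation, the same ranking can be computed in one step by stacking the vectors into a matrix. This is a sketch, not the article's implementation: top_k_cosine is a hypothetical helper, and the small two-dimensional vectors are made up so the example runs without Ollama.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, top_k: int = 3) -> list[tuple[int, float]]:
    """Return (row_index, score) for the top_k rows most similar to query."""
    # Scale every vector to length 1 so a dot product equals cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit            # one score per stored chunk
    order = np.argsort(scores)[::-1][:top_k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Three made-up chunk vectors, one per row.
vectors = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], dtype=np.float32)
query = np.array([1.0, 0.1], dtype=np.float32)

hits = top_k_cosine(query, vectors, top_k=2)
print(hits)  # rows 0 and 1, in that order
```

With the store from this step, the matrix could be built with np.vstack([item["vector"] for item in store]), and each returned row index points back to the chunk at the same position in the store.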


Step 4 - Building the Prompt

This is the heart of RAG. We take the retrieved chunks and combine them with the question into a single prompt that tells the LLM to answer only from what we give it.

Create 4_prompt_builder.py:

def build_rag_prompt(question: str, chunks: list[tuple[float, str]]) -> str:
    """
    Build the RAG prompt: instructions + context + question.
    The key instruction is: answer ONLY from the context below.
    This discourages hallucination.
    """
    context_block = "\n\n---\n\n".join(
        f"[Relevance: {score:.2f}]\n{text}"
        for score, text in chunks
    )

    return f"""You are a helpful assistant. Answer the user's question using ONLY the context below.
If the answer is not in the context, say "I do not have enough information to answer that."
Do not use any outside knowledge.

=== CONTEXT ===

{context_block}

=== QUESTION ===

{question}

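=== YOUR ANSWER ==="""


if __name__ == "__main__":
    # Made-up scores and chunks, just to preview the prompt layout
    sample_chunks = [
        (0.82, "RAG has two phases: indexing and retrieval."),
        (0.74, "Indexing splits a document into chunks and embeds them."),
    ]
    print(build_rag_prompt("What are the two phases of RAG?", sample_chunks))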

This prompt has three sections:

Instructions - tells the LLM how to behave. The anti-hallucination line is critical.

Context - the retrieved chunks. The LLM reads only these.

Question - what the user actually asked.

Run it standalone to see what the prompt looks like before it reaches the LLM:

python 4_prompt_builder.py

Seeing the raw prompt is one of the best teaching moments in RAG. Students immediately see why the system is far less likely to hallucinate when the instruction says to answer only from the context.


Step 5 - Putting It All Together

Create 5_rag_pipeline.py:

import requests
from importlib import import_module

chunk_text        = import_module("1_chunking").chunk_text
build_vector_store = import_module("3_vector_search").build_vector_store
search            = import_module("3_vector_search").search
build_rag_prompt  = import_module("4_prompt_builder").build_rag_prompt


def generate_answer(prompt: str) -> str:
    """Send the RAG prompt to Ollama and return the answer."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.2",
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }
    )
    response.raise_for_status()
    return response.json()["message"]["content"].strip()


def ask(question: str, store: list[dict], top_k: int = 2) -> str:
    results    = search(question, store, top_k=top_k)
    top_chunks = [(r["score"], r["text"]) for r in results]
    prompt     = build_rag_prompt(question, top_chunks)
    return generate_answer(prompt)


if __name__ == "__main__":
    with open("document.txt", "r") as f:
        document = f.read()

    chunks = chunk_text(document, chunk_size=100, overlap=20)
    store  = build_vector_store(chunks)

    questions = [
        "What are the two phases of RAG?",
        "How does cosine similarity work?",
        "Who invented RAG and when?",   # not in the document — watch what happens
    ]

    for question in questions:
        print(f"\nQuestion: {question}")
        print(f"Answer:   {ask(question, store)}\n")
        print("-" * 60)

Run it:

python 5_rag_pipeline.py

The last question - who invented RAG - is not in the document. A well-built RAG system should say it does not have enough information. That moment is always the most memorable for students.
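
The refusal above relies entirely on the model following the prompt. One optional extra guard, sketched here as an assumption rather than part of the original pipeline, is to skip the LLM call when no retrieved chunk is similar enough to the question. The names guarded_answer and FALLBACK_ANSWER are hypothetical, and the 0.5 threshold is an arbitrary starting point that would need tuning for your embedding model and document.

```python
from typing import Callable

FALLBACK_ANSWER = "I do not have enough information to answer that."


def guarded_answer(
    results: list[dict],
    answer_fn: Callable[[list[dict]], str],
    min_score: float = 0.5,  # assumed threshold, tune it on your own data
) -> str:
    """Call answer_fn only when at least one chunk reaches min_score."""
    relevant = [r for r in results if r["score"] >= min_score]
    if not relevant:
        return FALLBACK_ANSWER
    return answer_fn(relevant)


# Fake search results in the same shape that search() returns.
on_topic = [{"score": 0.81, "text": "RAG has two phases."}, {"score": 0.32, "text": "Unrelated."}]
off_topic = [{"score": 0.21, "text": "Unrelated."}]


def fake_llm(chunks: list[dict]) -> str:
    """Stand-in for the real prompt-building and generation step."""
    return f"answered from {len(chunks)} chunk(s)"


print(guarded_answer(on_topic, fake_llm))
print(guarded_answer(off_topic, fake_llm))
```

In the real pipeline, answer_fn would build the prompt from the remaining chunks and pass it to generate_answer.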


Adding a Web Interface

Running scripts in a terminal is fine for learning, but a real system needs an API. Add FastAPI in a single file called app.py:

from fastapi import FastAPI
from pydantic import BaseModel
from importlib import import_module

chunk_text        = import_module("1_chunking").chunk_text
build_vector_store = import_module("3_vector_search").build_vector_store
search            = import_module("3_vector_search").search
build_rag_prompt  = import_module("4_prompt_builder").build_rag_prompt

# generate_answer lives in the pipeline file from Step 5
generate_answer = import_module("5_rag_pipeline").generate_answer

app = FastAPI(title="MiniRAG")

# Build vector store once at startup
with open("document.txt") as f:
    document = f.read()

chunks = chunk_text(document, chunk_size=100, overlap=20)
STORE  = build_vector_store(chunks)


class Question(BaseModel):
    question: str
    top_k: int = 2


@app.get("/")
def root():
    return {"status": "ok", "message": "MiniRAG is running"}


@app.post("/ask")
def ask(request: Question):
    results    = search(request.question, STORE, top_k=request.top_k)
    top_chunks = [(r["score"], r["text"]) for r in results]
    prompt     = build_rag_prompt(request.question, top_chunks)
    answer     = generate_answer(prompt)
    return {
        "question": request.question,
        "answer":   answer,
        "sources":  [r["text"] for r in results],
    }

Run it:

uvicorn app:app --reload

Open http://localhost:8000/docs to get an interactive interface where you can type questions and see answers directly in your browser.
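
You can also call the endpoint from a script instead of the browser. The sketch below uses only the standard library and assumes the server is running on uvicorn's default address, http://localhost:8000. The helper names build_ask_request and ask_server are hypothetical.

```python
import json
from urllib import request


def build_ask_request(question: str, top_k: int = 2,
                      base_url: str = "http://localhost:8000") -> request.Request:
    """Prepare a POST request matching the Question model in app.py."""
    body = json.dumps({"question": question, "top_k": top_k}).encode("utf-8")
    return request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_server(question: str, top_k: int = 2) -> dict:
    """Send the question to the running server and return the parsed JSON."""
    req = build_ask_request(question, top_k)
    # Local models can be slow, so allow a generous timeout.
    with request.urlopen(req, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running:
# reply = ask_server("What are the two phases of RAG?")
# print(reply["answer"])
# print(reply["sources"])
```

The reply contains the same three keys the endpoint returns: question, answer, and sources.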


What You Just Built

Let us recap what each file does:

File                  Job                                           Needs Ollama?
1_chunking.py         Splits document into word chunks              No
2_embedding.py        Converts text to vectors                      Yes
3_vector_search.py    Finds relevant chunks for a question          Yes
4_prompt_builder.py   Combines chunks and question into a prompt    No
5_rag_pipeline.py     Runs the full pipeline end to end             Yes
app.py                Exposes everything as a web API               Yes

Files 1 and 4 need no model at all. This is useful if you are teaching: you can explain chunking and prompt design without any model setup.


What to Try Next

Once you have the basics working, experiment with these:

Change chunk size. Set chunk_size=50 and notice you get more chunks. Set it to 200 and you get fewer. Ask the same question and compare the retrieved chunks. Does the answer change?

Change top_k. Set top_k=1 and the LLM gets only one chunk. Set it to 5 and it gets five. More context does not always mean better answers.

Replace the document. Swap document.txt for any text file — a company policy, an article, your CV. The entire system works on any knowledge base.

Ask off-topic questions. Ask about something completely unrelated to your document. Watch the system say it does not know instead of inventing an answer. That is the anti-hallucination instruction in action, though a small model may occasionally ignore it.


The Bigger Picture

What you built today is called naive RAG or basic RAG. It works well for getting started and for teaching. When you are ready to go deeper, the next steps are:

Persistent vector store - save embeddings to disk with ChromaDB or FAISS so you do not re-embed every time you restart.

Sentence-based chunking - split at sentence boundaries instead of word count for more natural chunks.

Re-ranking - after retrieval, use a second model to reorder the chunks by relevance before passing them to the LLM.

Evaluation - use tools like RAGAS to measure how well your system is actually retrieving and answering.
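
Before reaching for ChromaDB or FAISS, the idea behind a persistent store can be tried with a plain JSON file. This is a deliberately simpler stand-in for those tools, not a replacement: save_store and load_store are hypothetical helpers, and the demo vector is made up.

```python
import json
import os
import tempfile


def save_store(store: list[dict], path: str) -> None:
    """Write chunk texts and vectors to a JSON file."""
    serialisable = [
        # Convert each value to a plain float so numpy arrays can be saved too.
        {"text": item["text"], "vector": [float(x) for x in item["vector"]]}
        for item in store
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serialisable, f)


def load_store(path: str) -> list[dict]:
    """Read the store back; vectors come back as plain lists of floats."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round trip with a made-up entry.
demo_store = [{"text": "hello world", "vector": [0.1, 0.2, 0.3]}]
demo_path = os.path.join(tempfile.gettempdir(), "minirag_store_demo.json")
save_store(demo_store, demo_path)
restored = load_store(demo_path)
print(restored[0]["text"], len(restored[0]["vector"]))
```

On startup you could load the file if it exists and call build_vector_store only when it does not. Wrap each loaded vector in np.array if you want arrays back. JSON gets large and slow with many chunks, which is where the dedicated tools earn their place.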

Each of these is a clear upgrade path from what you built today. And because your code is split into one file per concept, you can swap out any single piece without touching the others.
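
As an example of swapping one piece, here is a minimal sketch of sentence-based chunking that could stand in for chunk_text. It rests on simplifying assumptions: sentences are detected with a basic regular expression that splits after ., ! or ?, there is no overlap between chunks, and a single sentence longer than max_words becomes its own oversized chunk. The name chunk_by_sentence is hypothetical.

```python
import re


def chunk_by_sentence(text: str, max_words: int = 100) -> list[str]:
    """Group whole sentences into chunks of at most max_words words."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would push it over the limit.
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight."
print(chunk_by_sentence(sample, max_words=6))
# ['One two three. Four five six.', 'Seven eight.']
```

Because it takes a string and returns a list of strings, just like chunk_text, the rest of the pipeline does not need to change.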


The Full Project

The complete project, including environment switching between Ollama locally and Groq on Render for deployment, is available on GitHub:

github.com/Navashub/rag-projects

Clone it, run through the lessons in order, and then replace document.txt with something from your own domain. That is the fastest way to make RAG feel real.


If this helped you understand RAG, drop a comment with what document you tested it on. The most interesting use cases always come from the community.
