Introduction to RAG with LLMs

#learnai #oxlo #ai

We are building a minimal retrieval-augmented generation pipeline that answers questions from a local text file. This helps developers ground LLM outputs in private documents without the overhead of a managed vector database. We will run embeddings and inference on Oxlo.ai using its OpenAI-compatible API and flat per-request pricing detailed at https://oxlo.ai/pricing.

What you'll need

You need Python 3.10 or newer, the openai and numpy packages, and an Oxlo.ai API key from https://portal.oxlo.ai. Export the key as the OXLO_API_KEY environment variable, which the code below reads. Install the dependencies with pip.

pip install openai numpy

Step 1: Chunk your documents

Start with a raw text file and split it into overlapping chunks. The sizes below are measured in characters, not tokens. Small chunks improve retrieval precision, while overlap preserves context across chunk boundaries.

import os

SAMPLE_DOCS = """Product: OxloDocs
Password Reset: Click "Forgot password" on the login page. Enter your email. Wait up to five minutes. Check spam if missing.
Billing: Pro plans are billed monthly. Contact support for enterprise invoicing.
API Limits: Free tier includes 60 requests per day. Pro includes 1,000 requests per day.
"""

with open("docs.txt", "w", encoding="utf-8") as f:
    f.write(SAMPLE_DOCS)

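def load_and_chunk(file_path, chunk_size=200, overlap=40):
    # Sizes are in characters. The overlap must be smaller than the chunk,
    # otherwise the window would never advance.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:  # skip windows that contain only whitespace
            chunks.append(chunk)
    return chunks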

chunks = load_and_chunk("docs.txt")
print(f"Loaded {len(chunks)} chunks")
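
To see how the sliding window behaves, here is a small sketch that computes only the character offsets each chunk would cover. The helper name chunk_spans is hypothetical and not part of the pipeline; it simply mirrors the chunk_size and overlap defaults used above.

```python
def chunk_spans(text_length, chunk_size=200, overlap=40):
    """Return the (start, end) character offsets of each chunk window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]

# A 500-character text with the default settings (step = 160).
spans = chunk_spans(500)
print(spans)  # [(0, 200), (160, 360), (320, 500), (480, 500)]
```

Note the last window: it covers characters 480 to 500, which the previous window already contains. A short tail chunk that adds nothing new can appear whenever the remaining text is no longer than the overlap, so you may want to drop it in a larger corpus.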

Step 2: Generate embeddings

Send all chunks to the Oxlo.ai embeddings endpoint in a single batch call. Because Oxlo.ai uses request-based pricing, you pay a flat rate per batch call regardless of chunk length, which keeps costs predictable for large knowledge bases.

import os
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=os.environ["OXLO_API_KEY"])

def get_embeddings(texts, model="e5-large"):
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

embeddings = get_embeddings(chunks)
print(f"Embeddings shape: {embeddings.shape}")
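
For a larger corpus you will probably not want to send every chunk in one call. The sketch below splits the input into fixed-size batches and stacks the results. The helper embed_in_batches and the batch size of 64 are assumptions for illustration, not documented Oxlo.ai limits; a stand-in embedder replaces the real API so the example runs offline.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts in fixed-size batches and stack the results row-wise.

    embed_fn is any callable that maps a list of strings to a 2-D array-like,
    for example the get_embeddings function from this step.
    batch_size=64 is an assumed value, not a documented limit.
    """
    batches = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batches.append(np.asarray(embed_fn(batch), dtype=float))
    if not batches:
        return np.empty((0, 0))
    return np.vstack(batches)

# Stand-in embedder so the sketch runs without network access.
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(text)), 1.0] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors.shape)  # (5, 2)
print(calls)          # [2, 2, 1]
```

In the real pipeline you would pass get_embeddings as embed_fn. Each batch is one request, so the batch size also determines how many requests you make.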

Step 3: Retrieve relevant chunks

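Rank the chunks by cosine similarity and keep the top three that best match the user question. Cosine similarity is the dot product of unit-length vectors, so the function normalizes both sides first instead of assuming the endpoint returns normalized embeddings. This replaces a heavy vector database with a few lines of NumPy.

def retrieve(query_embedding, doc_embeddings, doc_chunks, top_k=3):
    # Normalize to unit length so the dot product equals cosine similarity.
    query = query_embedding / np.linalg.norm(query_embedding, axis=-1, keepdims=True)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = (docs @ query.T).flatten()
    # argsort is ascending, so take the last top_k indices and reverse them.
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [doc_chunks[i] for i in top_indices]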

question = "How do I reset my password?"
query_emb = get_embeddings([question])
retrieved = retrieve(query_emb, embeddings, chunks)

for i, text in enumerate(retrieved, 1):
    print(f"Chunk {i}: {text[:120]}...")
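
A dot product only equals cosine similarity when the vectors have unit length. The toy example below uses made-up two-dimensional vectors (not real embeddings) to show how a long vector can win on raw dot product while a shorter vector pointing in the same direction as the query wins on cosine similarity.

```python
import numpy as np

def normalize(vectors):
    """Scale each vector (along the last axis) to unit length."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

# Two toy "documents": the first is long, the second points the same way as the query.
docs = np.array([[3.0, 0.0],
                 [1.0, 1.0]])
query = np.array([1.0, 1.0])

raw_scores = docs @ query                        # dot product on raw vectors
cos_scores = normalize(docs) @ normalize(query)  # cosine similarity

print(raw_scores)             # [3. 2.]
print(np.argmax(raw_scores))  # 0
print(np.argmax(cos_scores))  # 1
```

If your embedding model already returns unit-length vectors, the two rankings agree and the normalization is a cheap no-op.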

Step 4: Build the prompt and system instructions

Inject the retrieved chunks into the prompt and give the model strict instructions to rely only on the provided context. This reduces hallucinations, though it does not eliminate them.

SYSTEM_PROMPT = """You are a precise support assistant. Answer the user's question using only the provided context. If the context does not contain the answer, say you do not know. Do not make up facts."""

def build_messages(question, contexts):
    context_block = "\n\n".join(f"Context {i+1}:\n{c}" for i, c in enumerate(contexts))
    user_content = f"Context:\n{context_block}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
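
Model context windows are finite, so it can help to cap how much retrieved text goes into the prompt. Here is a minimal sketch of one way to do that, assuming the chunks arrive in best-first order. The helper trim_contexts and the 1,500-character budget are hypothetical; pick a budget that fits the model you use.

```python
def trim_contexts(contexts, max_chars=1500):
    """Keep chunks in ranked order until the character budget is used up.

    max_chars=1500 is an arbitrary placeholder, not a model limit.
    """
    kept = []
    used = 0
    for chunk in contexts:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += len(chunk)
    return kept

ranked = ["a" * 10, "b" * 10, "c" * 10]
print(len(trim_contexts(ranked, max_chars=25)))  # 2
```

You would call this on the retrieved chunks before passing them to build_messages. Stopping at the first chunk that does not fit keeps the ranking intact rather than skipping ahead to smaller, lower-ranked chunks.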

Step 5: Generate the answer

Call Llama 3.3 70B through Oxlo.ai to synthesize the retrieved chunks into a concise answer. The flat per-request pricing means your total cost stays the same even when you pack multiple chunks into a long context prompt.

def ask(question):
    q_emb = get_embeddings([question])
    contexts = retrieve(q_emb, embeddings, chunks, top_k=3)
    messages = build_messages(question, contexts)

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=messages,
    )
    return response.choices[0].message.content

answer = ask("How do I reset my password?")
print(answer)
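
One optional guard before calling the model is to skip generation when nothing retrieved is similar enough to the question, and return "I do not know" directly. The sketch below returns each chunk with its cosine score and filters by a threshold. The function retrieve_scored and the min_score value of 0.5 are assumptions for illustration; a useful threshold depends on the embedding model and has to be tuned on your own data.

```python
import numpy as np

def retrieve_scored(query_vec, doc_vecs, doc_chunks, top_k=3, min_score=0.5):
    """Return up to top_k (chunk, score) pairs whose cosine score is >= min_score."""
    query = np.asarray(query_vec, dtype=float).ravel()
    docs = np.asarray(doc_vecs, dtype=float)
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    order = np.argsort(scores)[::-1][:top_k]  # best match first
    return [(doc_chunks[i], float(scores[i])) for i in order if scores[i] >= min_score]

# Toy vectors standing in for real embeddings.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
toy_chunks = ["a", "b", "c"]

hits = retrieve_scored([1.0, 0.0], toy_docs, toy_chunks)
print([chunk for chunk, _ in hits])  # ['a', 'c']

misses = retrieve_scored([0.0, -1.0], toy_docs, toy_chunks)
print(misses)  # []
```

When the returned list is empty, ask could return a fixed fallback message and avoid a chat request altogether.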

Run it

Put the pieces together and ask a few questions. The script below assumes the chunks and embeddings from the earlier steps are already in memory, runs each question through ask, and prints the final answer.

if __name__ == "__main__":
    questions = [
        "How do I reset my password?",
        "What are the API limits?",
    ]
    for q in questions:
        print(f"Q: {q}")
        print(f"A: {ask(q)}")
        print()

Example output:

Q: How do I reset my password?
A: You can reset your password by clicking "Forgot password" on the login page, entering your email, and waiting up to five minutes for the reset email. If you do not see it, check your spam folder.

Q: What are the API limits?
A: The free tier includes 60 requests per day, and the Pro tier includes 1,000 requests per day.

Wrap-up

This gives you a working RAG pipeline in under a hundred lines. If you need stronger reasoning, swap llama-3.3-70b for deepseek-v3.2 or kimi-k2.6 on Oxlo.ai. For multilingual documents, qwen-3-32b can handle generation across languages, but retrieval quality still depends on the embedding model, so check that it covers your languages before reusing the embedding step unchanged.

Two concrete next steps: add a re-ranking pass with a cross-encoder to boost retrieval accuracy, and persist the embeddings to disk so you do not rebuild the index on every startup. If you scale to thousands of chunks, you can keep the same Oxlo.ai inference logic and swap the NumPy search for a vector database later.
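
As a starting point for the persistence step, here is a minimal sketch that stores the chunks and their embeddings together in one NumPy archive. The function names save_index and load_index and the file name index.npz are hypothetical. The sketch does not detect changes to the source document, so you would still need to rebuild the index when docs.txt changes.

```python
import os
import tempfile

import numpy as np

def save_index(path, chunks, embeddings):
    """Write chunks and embeddings side by side into a single .npz file."""
    np.savez(path, chunks=np.array(chunks), embeddings=np.asarray(embeddings))

def load_index(path):
    """Read back the chunk list and the embedding matrix."""
    with np.load(path) as data:
        return data["chunks"].tolist(), data["embeddings"]

# Round trip with toy data in a temporary directory.
index_path = os.path.join(tempfile.mkdtemp(), "index.npz")
save_index(index_path, ["first chunk", "second chunk"], np.eye(2))

loaded_chunks, loaded_embeddings = load_index(index_path)
print(loaded_chunks)            # ['first chunk', 'second chunk']
print(loaded_embeddings.shape)  # (2, 2)
```

At startup you could check whether the file exists with os.path.exists and call get_embeddings only when it does not.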