Using LLMs for Text Summarization: Best Practices and Strategies

#learnai #oxlo #ai

Here is how I built a document summarizer that ingests long articles and emits structured briefs. The tool is useful for research teams, support queues, or anyone who needs to distill reports into actionable takeaways. I wired it to Oxlo.ai, which bills each pass as one flat-rate request regardless of how long the input text is.

What you'll need

Python 3.10 or newer
An Oxlo.ai API key from https://portal.oxlo.ai
The OpenAI SDK: pip install openai

Step 1: Configure the client and verify the connection

First I pointed the OpenAI SDK at Oxlo.ai and verified that my API key was active. I used Llama 3.3 70B as the workhorse because it handles long context reliably and starts instantly.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY", "YOUR_OXLO_API_KEY")
)

# Smoke test
test = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Say OK"}]
)
assert test.choices[0].message.content is not None
print("Client ready:", test.choices[0].message.content)
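The placeholder fallback in os.getenv keeps the snippet easy to paste, but a missing key then only shows up as an authentication error on the first request. A minimal sketch of a fail-fast alternative follows. The helper name require_api_key is hypothetical (it is not part of the OpenAI SDK or Oxlo.ai), and the env parameter exists only so the function can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for this tutorial, not an SDK function.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    # Treat the tutorial's placeholder string as "not configured".
    if not key or key == "YOUR_OXLO_API_KEY":
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

With this in place, the client could be built with api_key=require_api_key() instead of the os.getenv fallback.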

Step 2: Define the system prompt

I treat the system prompt as the product spec. The instructions below force the model to stay faithful to the source text and emit a predictable structure.

SYSTEM_PROMPT = """You are a precise document summarizer. Your job is to read the provided text and produce a structured summary containing:
- A one-sentence headline.
- A 2-3 paragraph summary that captures the main arguments, findings, or narrative.
- A bulleted list of key takeaways (max 5).
Do not introduce facts that are not in the text. Do not use flowery language. If the text is ambiguous, state that plainly."""
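If the prompt is the product spec, it helps to check replies against it. The sketch below is one rough way to do that for the plain-text brief. It assumes takeaways arrive as lines starting with "- ", "* ", or a bullet character followed by a space, which is a guess about model formatting and not something the API guarantees. The name check_brief is hypothetical.

```python
# Assumed bullet markers; real model output may use other styles.
BULLET_PREFIXES = ("- ", "* ", "\u2022 ")


def check_brief(summary: str, max_takeaways: int = 5) -> list[str]:
    """Return a list of problems found in a plain-text brief (empty if none)."""
    lines = [ln.strip() for ln in summary.splitlines() if ln.strip()]
    if not lines:
        return ["summary is empty"]
    problems = []
    bullets = [ln for ln in lines if ln.startswith(BULLET_PREFIXES)]
    if lines[0].startswith(BULLET_PREFIXES):
        problems.append("first line is a bullet, expected a headline")
    if not bullets:
        problems.append("no takeaways found")
    elif len(bullets) > max_takeaways:
        problems.append(f"{len(bullets)} takeaways, expected at most {max_takeaways}")
    return problems
```

An empty list means the reply matched the parts of the spec that are cheap to verify. It says nothing about whether the summary is faithful to the source.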

Step 3: Build the core summarization function

With the prompt in place, I wrapped the API call in a function that accepts any string and returns the generated summary. This single-pass version works for texts that fit comfortably in the context window.

def summarize(text: str, model: str = "llama-3.3-70b") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Summarize the following article:\n\n{text}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

article = (
    "Artificial intelligence has transformed modern software development. "
    "Large language models can now generate code, review pull requests, and debug errors. "
    "However, integrating these capabilities into production pipelines requires careful attention to latency, cost, and accuracy. "
    "Teams that treat LLMs as deterministic engines often face surprise bills and inconsistent output. "
    "A better approach is to design robust prompts, validate outputs with unit tests, and fall back to smaller models when full reasoning is unnecessary."
)

print(summarize(article))
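Whether a text "fits comfortably" can be estimated before spending a request. The sketch below uses the common rule of thumb of about 4 characters per token for English text. That ratio is a heuristic rather than a real tokenizer, and the context_tokens and reserve_tokens defaults are placeholder assumptions, not published limits for any model on Oxlo.ai.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 characters per token is a heuristic only."""
    return int(len(text) / chars_per_token) + 1


def fits_single_pass(text: str, context_tokens: int = 8000,
                     reserve_tokens: int = 1500) -> bool:
    """True if the text plausibly fits with room for the prompt and the reply.

    Both limits are placeholders; check the real window for your model.
    """
    return estimate_tokens(text) <= context_tokens - reserve_tokens
```

A caller could use fits_single_pass(text) to choose between the single-pass function and a chunked pipeline.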

Step 4: Handle long documents with map-reduce chunking

Real articles often exceed what can be processed in one pass. I split the text into chunks, summarize each chunk, and then summarize the resulting summaries in one final pass. This version runs the reduce step once, so it assumes the combined partial summaries fit in a single request. Because Oxlo.ai charges a flat rate per request rather than per token, a document split into n chunks costs n + 1 requests no matter how long each chunk is, which keeps the pipeline economical for long source material. See https://oxlo.ai/pricing for details.

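def chunk_text(text: str, max_chars: int = 3000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for p in text.split("\n\n"):
        p = p.strip()
        if not p:
            continue  # skip blank paragraphs
        # Hard-split any single paragraph longer than max_chars so that
        # no chunk can exceed the limit.
        for i in range(0, len(p), max_chars):
            piece = p[i:i + max_chars]
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks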

def summarize_long(text: str, model: str = "llama-3.3-70b") -> str:
    chunks = chunk_text(text)
    if len(chunks) == 1:
        return summarize(chunks[0], model=model)

    # Map step: summarize each chunk
    partials = [summarize(c, model=model) for c in chunks]

    # Reduce step: summarize the summaries
    combined = "\n\n---\n\n".join(partials)
    return summarize(combined, model=model)

# Simulate a long document by repeating the article
long_article = "\n\n".join([article] * 10)
print(summarize_long(long_article))
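For very long inputs the joined partial summaries can themselves outgrow one request. A minimal sketch of a variant that repeats the reduce step until the text fits is shown below. It takes the summarizer as a parameter (summarize_fn stands in for the summarize function above) so the control flow can be tested without network calls. The names split_chunks and summarize_recursive and the max_depth guard are assumptions added for illustration.

```python
from typing import Callable


def split_chunks(text: str, max_chars: int) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars, hard-splitting long ones."""
    chunks, current = [], ""
    for para in filter(None, (p.strip() for p in text.split("\n\n"))):
        for i in range(0, len(para), max_chars):
            piece = para[i:i + max_chars]
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def summarize_recursive(
    text: str,
    summarize_fn: Callable[[str], str],
    max_chars: int = 3000,
    max_depth: int = 3,
) -> tuple[str, int]:
    """Map-reduce summarization that repeats the reduce step while needed.

    Returns (summary, number_of_requests). max_depth stops the recursion
    if the partial summaries never shrink below max_chars.
    """
    chunks = split_chunks(text, max_chars)
    if len(chunks) <= 1 or max_depth == 0:
        return summarize_fn(text), 1
    partials = [summarize_fn(c) for c in chunks]  # map step
    combined = "\n\n---\n\n".join(partials)
    # Reduce step: recurse in case the combined summaries are still too long.
    final, calls = summarize_recursive(combined, summarize_fn, max_chars, max_depth - 1)
    return final, len(partials) + calls
```

The returned request count makes the cost visible under per-request pricing. When max_depth reaches zero the function summarizes whatever text remains in a single call, which may still be too long for the model.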

Step 5: Lock the output format with JSON mode

Machine-readable output makes it easier to pipe the result into emails, dashboards, or databases. I switched the system prompt to request JSON and set response_format accordingly. For this step I used Qwen 3 32B, which follows structured instructions reliably.

import json

JSON_SYSTEM_PROMPT = """You are a precise document summarizer. Read the provided text and return a JSON object with exactly these keys:
- headline: string, one sentence.
- summary: string, 2-3 paragraphs.
- takeaways: array of strings, max 5 items.
Do not include markdown code fences. Output raw JSON only. Do not hallucinate."""

def summarize_json(text: str, model: str = "qwen-3-32b") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JSON_SYSTEM_PROMPT},
            {"role": "user", "content": f"Summarize the following article:\n\n{text}"},
        ],
        temperature=0.2,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

result = summarize_json(article)
print(json.dumps(result, indent=2))
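Even with JSON mode requested, a reply can come back empty, wrapped in a code fence, or missing a key, and json.loads will either raise or return something the rest of the pipeline does not expect. The sketch below validates the three keys named in the prompt. The helper name parse_summary is hypothetical, and the fence-stripping step is a defensive assumption rather than documented behavior of any model.

```python
import json
from typing import Optional

REQUIRED_KEYS = {"headline", "summary", "takeaways"}
FENCE = "`" * 3  # markdown code fence marker


def parse_summary(raw: Optional[str], max_takeaways: int = 5) -> dict:
    """Parse and validate the model's JSON reply; raise ValueError if malformed."""
    if not raw:
        raise ValueError("model returned no content")
    cleaned = raw.strip()
    # Unwrap a markdown fence if the model added one despite the instructions.
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["takeaways"], list):
        raise ValueError("takeaways must be a list")
    # Enforce the "max 5 items" rule from the prompt on the client side.
    data["takeaways"] = [str(t) for t in data["takeaways"][:max_takeaways]]
    return data
```

In summarize_json, the final line could call parse_summary(response.choices[0].message.content) in place of json.loads, so that malformed replies fail with a readable error.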

Run it

Here is the complete script I run from the command line. It feeds a sample article into both the plain and JSON pipelines.

import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY", "YOUR_OXLO_API_KEY")
)

SYSTEM_PROMPT = """You are a precise document summarizer. Your job is to read the provided text and produce a structured summary containing:
- A one-sentence headline.
- A 2-3 paragraph summary that captures the main arguments, findings, or narrative.
- A bulleted list of key takeaways (max 5).
Do not introduce facts that are not in the text. Do not use flowery language. If the text is ambiguous, state that plainly."""

JSON_SYSTEM_PROMPT = """You are a precise document summarizer. Read the provided text and return a JSON object with exactly these keys:
- headline: string, one sentence.
- summary: string, 2-3 paragraphs.
- takeaways: array of strings, max 5 items.
Do not include markdown code fences. Output raw JSON only. Do not hallucinate."""

def summarize(text: str, model: str = "llama-3.3-70b") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Summarize the following article:\n\n{text}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

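def chunk_text(text: str, max_chars: int = 3000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for p in text.split("\n\n"):
        p = p.strip()
        if not p:
            continue  # skip blank paragraphs
        # Hard-split any single paragraph longer than max_chars so that
        # no chunk can exceed the limit.
        for i in range(0, len(p), max_chars):
            piece = p[i:i + max_chars]
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks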

def summarize_long(text: str, model: str = "llama-3.3-70b") -> str:
    chunks = chunk_text(text)
    if len(chunks) == 1:
        return summarize(chunks[0], model=model)
    partials = [summarize(c, model=model) for c in chunks]
    combined = "\n\n---\n\n".join(partials)
    return summarize(combined, model=model)

def summarize_json(text: str, model: str = "qwen-3-32b") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JSON_SYSTEM_PROMPT},
            {"role": "user", "content": f"Summarize the following article:\n\n{text}"},
        ],
        temperature=0.2,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    sample = (
        "The history of the internet begins in the 1960s with government-funded research into packet switching. "
        "ARPANET, the precursor network, sent its first message in 1969. Through the 1970s and 1980s, academic institutions adopted TCP/IP, laying the groundwork for a global network. "
        "The invention of the World Wide Web by Tim Berners-Lee in 1989 introduced browsers and hyperlinks, democratizing access to information. "
        "Today the internet underpins commerce, communication, and entertainment, yet it still grapples with issues of privacy, security, and equitable access."
    )

    print("=== Plain Summary ===")
    print(summarize_long(sample))

    print("\n=== JSON Summary ===")
    print(json.dumps(summarize_json(sample), indent=2))

When I ran this, the plain pipeline returned a concise three-paragraph brief with a headline and bullet points. The JSON pipeline returned structured output like this:

{
  "headline": "The internet evolved from a 1960s government research project into a global infrastructure that reshaped society.",
  "summary": "The article traces the internet's origins to 1960s packet-switching research and the 1969 ARPANET. The adoption of TCP/IP by universities in the 1970s and 1980s created the technical foundation for a worldwide network. Tim Berners-Lee's World Wide Web in 1989 made the internet accessible to the public, and today it is central to modern life even as it faces ongoing challenges with privacy and equity.",
  "takeaways": [
    "ARPANET sent the first packet-switched message in 1969.",
    "TCP/IP adoption by academics was critical to scaling the network.",
    "The World Wide Web democratized internet access starting in 1989.",
    "Modern internet use spans commerce, communication, and entertainment.",
    "Privacy, security, and equitable access remain unresolved issues."
  ]
}
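Once the result is a dictionary with these three keys, turning it into a plain-text brief for an email or a ticket comment takes only a few lines. A minimal sketch, assuming the dictionary has exactly the shape shown above (render_brief is a hypothetical helper name):

```python
def render_brief(result: dict) -> str:
    """Format a summary dict (headline, summary, takeaways) as plain text."""
    lines = [result["headline"], "", result["summary"], "", "Key takeaways:"]
    lines += [f"- {item}" for item in result["takeaways"]]
    return "\n".join(lines)
```

The same dictionary could be written to a database row or a dashboard payload without reparsing any text.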

Next steps

If you want to productize this, add a caching layer for repeated documents so you do not burn requests on identical inputs. You could also swap in DeepSeek V3.2 for coding-heavy articles, or Kimi K2.6 when you need vision support for summarizing slide decks. Both are available on Oxlo.ai with the same flat per-request pricing, detailed at https://oxlo.ai/pricing.
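The caching idea can be sketched as a small wrapper around any summarizer function. This version keeps results in an in-memory dictionary keyed by a SHA-256 digest of the input text. The name make_cached is hypothetical, and a production version would probably persist the cache and include the model name in the key, neither of which is shown here.

```python
import hashlib
from typing import Callable


def make_cached(summarize_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a summarizer so identical inputs are only sent once."""
    cache: dict[str, str] = {}

    def cached(text: str) -> str:
        # Hash the text so the cache key stays small for long documents.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = summarize_fn(text)
        return cache[key]

    return cached
```

Wrapping the earlier function with cached_summarize = make_cached(summarize) would make a repeated document cost one request instead of two.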