How I Stopped Wrestling with LLM Outputs (and Started Using Structured Extraction)

#python #api #ai #tutorial

I spent a weekend building an invoice parser. Three days later, I had something that worked about 60% of the time – and failed spectacularly on the other 40%. The problem wasn't the model. It was how I was asking it to respond.

It all started innocently enough. We had a pile of PDF invoices from different vendors, and I wanted to extract fields like invoice number, date, total amount, and line items. My first attempt was straightforward: feed the text to GPT-3.5-turbo, ask it to return JSON, then parse the response. Classic approach, right?

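# Note: this is the legacy (pre-1.0) openai SDK interface; openai.ChatCompletion was removed in openai>=1.0.
# invoice_text holds the text already extracted from the PDF.
import openai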

prompt = f"""Extract invoice details from the following text. Return JSON with fields: invoice_number, date, total, currency, vendor_name, line_items (array of objects with description, quantity, unit_price, amount).

Text: {invoice_text}
"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0
)

result = response["choices"][0]["message"]["content"]

The first few results looked great. Then the messy reality hit.

What Went Wrong

Inconsistent keys: Sometimes it returned invoice_number, other times invoiceNumber or "Invoice No."
Extra fluff: "Here is the JSON:" before the actual output.
Partial outputs: The response would cut off mid-object when the invoice was long.
Cost: Every run used a large context window because I was sending the invoice text repeatedly, even if the schema was static.

I tried:

Regex to strip markdown code blocks.
Asking the model to "only output valid JSON."
Using a system message with a strict schema.
Even fine-tuned a smaller model (which worked for one vendor, not the others).

None of them gave me the reliability I needed. The core issue was that I was treating the LLM like a structured API, but it's a language model – it wants to talk, not serialize.

The Turning Point: Function Calling (Tools API)

When OpenAI introduced function calling (now called tool calling), the lights went on. Instead of wrestling with prose, I could define a schema for the model to fill in. The model decides which function to call and supplies the arguments as a JSON string. By default those arguments usually follow the schema but are not guaranteed to; setting strict: true on the function definition (OpenAI's Structured Outputs) is what makes the API enforce it.

Here's what my code now looks like:

import json
from pydantic import BaseModel, Field
from typing import List, Optional

class LineItem(BaseModel):
    description: str = Field(description="Description of the item or service")
    quantity: int = Field(description="Number of units")
    unit_price: float = Field(description="Price per unit")
    amount: float = Field(description="Line total")

class Invoice(BaseModel):
    invoice_number: str = Field(description="Unique invoice identifier")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    vendor_name: str = Field(description="Name of the vendor")
    currency: str = Field(default="USD", description="Currency code")
    total: float = Field(description="Total amount")
    line_items: List[LineItem] = Field(description="Line items on the invoice")

# Convert a Pydantic model class (not an instance) to an OpenAI function definition
def pydantic_to_function(model: type[BaseModel]) -> dict:
    schema = model.model_json_schema()
    return {
        "name": model.__name__,
        # Pydantic fills "description" from the class docstring, if there is one
        "description": schema.get("description", ""),
        "parameters": schema
    }

functions = [pydantic_to_function(Invoice)]

# The call below goes straight to OpenAI. I also ran it through a generic gateway that handles
# retries, fallbacks, etc., for example `https://ai.interwestinfo.com/v1/chat/completions`:
# a unified endpoint that supports multiple providers and adds reliability.
import os
import httpx

API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment

response = httpx.post(
    "https://api.openai.com/v1/chat/completions",  # or your preferred endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You extract invoice data accurately."},
            {"role": "user", "content": invoice_text}
        ],
        "tools": [
            {
                "type": "function",
                "function": f
            } for f in functions
        ],
        "tool_choice": {"type": "function", "function": {"name": "Invoice"}}
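    },
    timeout=60.0,  # httpx defaults to 5 seconds, which long invoices can exceed
)

response.raise_for_status()  # surface HTTP errors before touching the body
data = response.json()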
tool_call = data["choices"][0]["message"]["tool_calls"][0]
arguments = json.loads(tool_call["function"]["arguments"])
invoice = Invoice(**arguments)
print(invoice.model_dump_json(indent=2))
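
The happy-path parsing above assumes a tool call is always present and complete. A slightly more defensive version might look like the sketch below. The helper name extract_tool_arguments is my own, not part of any SDK, and it only assumes the standard Chat Completions response shape (choices, finish_reason, message.tool_calls).

```python
import json


def extract_tool_arguments(data: dict, expected_name: str) -> dict:
    """Return the parsed arguments of the first tool call, or raise ValueError.

    Hypothetical helper: `data` is the decoded Chat Completions response body.
    """
    choice = data["choices"][0]

    # "length" means the output hit the token limit, so the JSON is likely cut off
    if choice.get("finish_reason") == "length":
        raise ValueError("response was truncated by the token limit")

    tool_calls = choice["message"].get("tool_calls") or []
    if not tool_calls:
        raise ValueError("model replied with text instead of a tool call")

    function = tool_calls[0]["function"]
    if function["name"] != expected_name:
        raise ValueError(f"unexpected function: {function['name']}")

    try:
        # The arguments arrive as a JSON string, not as a parsed object
        return json.loads(function["arguments"])
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
```

The returned dict can then go into Invoice(**arguments) as before, so malformed responses fail with one clear error type before Pydantic validation runs.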

Why This Works

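The model still emits JSON, but as the arguments of a function call rather than as free-form chat text, so there is no preamble or markdown fence to strip. With strict: true on the function definition, OpenAI constrains those arguments to the schema; without it, they are usually well-formed but still worth validating, which is what the Pydantic step above does. Long outputs can still be cut off by the token limit, so check finish_reason. The model can also call multiple functions in one turn if needed.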
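
On OpenAI, schema enforcement for tool arguments is tied to strict: true on the function definition, and strict mode expects every object in the schema to set additionalProperties: false and list all of its properties as required. Here is a minimal sketch of a helper that prepares a schema that way. The names make_strict and to_strict_tool are mine, and the schema Pydantic generates may need further adjustments (some JSON Schema keywords are not accepted in strict mode), so treat this as a starting point and check the provider docs.

```python
import copy


def make_strict(schema: dict) -> dict:
    """Return a copy of a JSON schema with every object closed and all properties required.

    Hypothetical helper; it does not handle every JSON Schema keyword.
    """
    schema = copy.deepcopy(schema)  # leave the caller's schema untouched

    def visit(node):
        if isinstance(node, dict):
            if node.get("type") == "object" and "properties" in node:
                node["additionalProperties"] = False
                node["required"] = list(node["properties"])
            # Recurse into nested schemas, including "$defs" and "items"
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(schema)
    return schema


def to_strict_tool(name: str, description: str, schema: dict) -> dict:
    """Wrap a JSON schema as a Chat Completions tool entry with strict mode on."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": make_strict(schema),
            "strict": True,
        },
    }
```

Because strict mode makes every field required, a field that may be absent has to be expressed as nullable (for example Optional[str] in Pydantic) rather than left out of the required list.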

Real-World Trade-offs

Latency: The schema adds input tokens the model has to process on every call. With a small model like gpt-4o-mini, responses usually came back in under a second for me.
Cost: You no longer describe the JSON format in prose, but the function schema is not free: it is sent in the tools field of every request and counts toward input tokens. For repetitive extraction with an unchanged schema, provider-side prompt caching can reduce that cost.
Flexibility: Not all providers support tool calling equally. Anthropic's tool use is similar, but models like Llama may require different handling. If you need multi-provider support, a gateway layer (like the one at https://ai.interwestinfo.com/) can normalize this across backends. I used it for a while before settling on direct OpenAI calls for simplicity.
Complexity: You now have to define Pydantic models for each extraction task. That's an upfront investment that pays off when you need to iterate on parsing rules.
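
To make the flexibility point concrete, here is a rough sketch of what "similar but not identical" means for Anthropic. Their Messages API takes tools as flat objects with an input_schema key, forces a tool with {"type": "tool", "name": ...}, and returns the arguments already parsed inside a tool_use content block. The three function names below are my own, not part of any SDK.

```python
def openai_function_to_anthropic_tool(function: dict) -> dict:
    """Map an OpenAI-style function definition to an Anthropic tool definition."""
    return {
        "name": function["name"],
        "description": function.get("description", ""),
        # Anthropic calls the JSON schema "input_schema" instead of "parameters"
        "input_schema": function["parameters"],
    }


def anthropic_forced_tool_choice(name: str) -> dict:
    """Anthropic's equivalent of forcing a specific function via tool_choice."""
    return {"type": "tool", "name": name}


def extract_anthropic_tool_input(message: dict, name: str) -> dict:
    """Pull the arguments out of an Anthropic Messages API response body."""
    for block in message.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == name:
            # "input" is already a dict here, so no json.loads is needed
            return block["input"]
    raise ValueError(f"no tool_use block for {name!r} in response")
```

The same Pydantic model can feed both providers this way; only the envelope around the schema changes.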

Alternatives I Tried (and When They Make Sense)

JSON mode (OpenAI's response_format={"type": "json_object"}): Works for simple extractions, but doesn't enforce a specific schema, so you still need to validate on your side.
Instructor (Python library): Wraps function calling into a clean Pydantic interface. If you're on OpenAI, it's a great time-saver. I used it for a prototype but wanted to understand the raw plumbing first.
Local models (e.g., llama.cpp): You can use grammars to enforce JSON output. More complex to set up and free of per-token API charges, but slower in production.

For my invoice parser, function calling was the sweet spot. I now use it for all structured extraction tasks: extracting contact info from emails, parsing log entries, even converting natural language commands into API calls.

What I'd Do Differently Next Time

I'd start with function calling from day one. I wasted too much time trying to coax unstructured output into structure. Also, I'd define the schema as tightly as possible – use enums for fixed values, required fields with descriptions, and always set additionalProperties: false to prevent hallucinated keys.
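
In Pydantic v2 terms, a tighter schema could look something like the sketch below. The StrictInvoice model and its three-currency list are made up for illustration; the relevant pieces are Literal (which becomes an enum in the JSON schema) and extra="forbid" (which becomes additionalProperties: false).

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field


class StrictInvoice(BaseModel):
    """Hypothetical trimmed-down invoice used only to show the schema options."""

    # extra="forbid" rejects unknown keys and emits additionalProperties: false
    model_config = ConfigDict(extra="forbid")

    invoice_number: str = Field(description="Unique invoice identifier")
    # Literal restricts the value to a fixed set; the currency list is an assumption
    currency: Literal["USD", "EUR", "GBP"] = Field(description="ISO 4217 currency code")
    total: float = Field(description="Total amount")


schema = StrictInvoice.model_json_schema()
```

With this in place, a hallucinated key such as vendor_nickname fails validation instead of being silently dropped, and the schema sent to the model spells out the allowed currency values.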

Another lesson: test with edge cases. What happens when an invoice has missing fields? Your Pydantic model should handle that with default values or Optional types. The model may omit a required field it can't find, or invent a value for it. Without strict mode, an omission surfaces as a Pydantic ValidationError when you build the model, not as an API error. So be explicit about what's optional.
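
As a rough illustration of being explicit about optional fields, the sketch below marks the date as optional and reports which fields failed validation instead of crashing. PartialInvoice and parse_invoice are hypothetical names, and which fields you treat as optional depends on your invoices.

```python
from typing import List, Optional, Tuple

from pydantic import BaseModel, Field, ValidationError


class PartialInvoice(BaseModel):
    """Hypothetical model: number and total are mandatory, the date may be absent."""

    invoice_number: str = Field(description="Unique invoice identifier")
    total: float = Field(description="Total amount")
    date: Optional[str] = Field(
        default=None,
        description="Invoice date in YYYY-MM-DD format, or null if not present",
    )


def parse_invoice(arguments: dict) -> Tuple[Optional[PartialInvoice], List[str]]:
    """Validate tool-call arguments; return the model plus the names of failing fields."""
    try:
        return PartialInvoice.model_validate(arguments), []
    except ValidationError as exc:
        # Each error carries a "loc" tuple pointing at the offending field
        problems = [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
        return None, problems
```

A call such as parse_invoice({"total": 10}) returns no model and names invoice_number as the problem, which is a useful signal for a retry or for routing the invoice to manual review.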

The Bigger Picture

Structured extraction is just one example of a deeper pattern: treating LLMs as function callers, not text completers. This same pattern powers agent frameworks (like OpenAI's Assistants API, LangChain agents, etc.). By giving the model a set of clearly defined tools, you move from hoping it returns the right shape to building a system where its output has to fit a shape you defined.

If you're still parsing raw LLM output with regex, stop. You deserve better. Take the afternoon to define a schema and switch to tool calling. Your future self will thank you.

What's your go-to method for enforcing structure on LLM outputs? I'm still exploring hybrid approaches that combine JSON mode and tool calling for faster fallbacks. Would love to hear what's working (or not) in your stack.