Integrating LLMs into Chatbots: Best Practices and Examples


Building a production chatbot around a large language model requires more than calling a chat completions endpoint. You need to manage growing context windows, handle multi-step tool calls, stream tokens to reduce perceived latency, and keep costs predictable as user sessions deepen. This guide covers the architectural patterns, memory strategies, and implementation details that separate prototypes from reliable conversational agents.

Architecture Patterns for LLM Chatbots

Most chatbots use one of three architectures. A thin proxy forwards user messages directly to the LLM and returns the reply. An orchestration layer adds pre-processing, memory retrieval, and post-processing stages. A fully agentic loop lets the model decide when to call tools, reflect on results, and generate a final answer. Oxlo.ai supports all three through standard OpenAI SDK-compatible endpoints, so you can point your existing client at https://api.oxlo.ai/v1 without rewriting request logic.
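As a rough illustration of the middle option, the sketch below runs one turn through an orchestration layer using only the standard library. The names orchestrate, call_model, and retrieve are hypothetical, and the model call is injected as a plain function so the same code can wrap any client. It assumes retrieved context is passed as an extra system message, which is one convention among several.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def orchestrate(
    user_text: str,
    history: List[Message],
    call_model: Callable[[List[Message]], str],
    retrieve: Callable[[str], List[str]] = lambda query: [],
) -> str:
    """Run one turn: pre-process, retrieve memory, call the model, post-process."""
    cleaned = user_text.strip()  # pre-processing
    snippets = retrieve(cleaned)  # memory retrieval (hypothetical hook)

    messages = list(history)  # copy so the retrieved context is not persisted
    if snippets:
        context = "Relevant context:\n" + "\n".join(snippets)
        messages.append({"role": "system", "content": context})
    messages.append({"role": "user", "content": cleaned})

    reply = call_model(messages).strip()  # model call plus post-processing

    # Persist only the user and assistant turns.
    history.append({"role": "user", "content": cleaned})
    history.append({"role": "assistant", "content": reply})
    return reply
```

A thin proxy is the degenerate case where retrieve returns nothing and no post-processing is applied. An agentic loop would repeat the model call until the reply contains no further tool requests.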

Managing Context and Memory

Long conversations eventually exceed the model's context limit, and even before that point a crowded prompt can dilute the model's attention to the details that matter. Three common mitigations are a sliding window that retains the last N messages, a running summary of older turns folded into the system prompt, and a vector database that stores conversation history so relevant snippets can be retrieved via embedding search. Because token-based providers scale cost with every word in the prompt, a long system prompt or retrieved context can quickly inflate your bill. Oxlo.ai uses request-based pricing, so the cost stays flat per turn even when you pack the context window with retrieved documents or lengthy prior turns.
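The sliding-window option can be sketched in a few lines. This is a minimal version, assuming messages are plain dicts with a role key and that all system messages should always be kept. The function name sliding_window is hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]


def sliding_window(messages: List[Message], max_messages: int) -> List[Message]:
    """Keep every system message plus the most recent max_messages other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent
```

Counting messages is a crude proxy for counting tokens, so a production version would likely trim by token budget instead. If the history contains tool calls, the cut point also needs care so that a tool message is never separated from the assistant message that requested it.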

Tool Use and Function Calling

Function calling lets a chatbot query APIs, check inventory, or trigger actions. Define your tools with JSON Schema, pass them in the tools parameter, and execute the model-chosen function in your backend. Append the assistant message that contains the tool_calls to the messages array, then add each result as a tool message carrying the matching tool_call_id before requesting the final response. Models available on Oxlo.ai, including Llama 3.3 70B, Qwen 3 32B, and DeepSeek R1, support this workflow with fully OpenAI-compatible function signatures.
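To make the workflow concrete, here is a sketch of a tool definition and a backend dispatcher that turns one tool call into the tool message sent back to the model. The check_inventory function, its stock data, and the REGISTRY and run_tool_call names are all hypothetical. The TOOLS list follows the OpenAI-style tools format.

```python
import json
from typing import Any, Callable, Dict


def check_inventory(sku: str) -> Dict[str, Any]:
    """Hypothetical backend function; a real one would query a database."""
    stock = {"A100": 12, "B200": 0}
    return {"sku": sku, "in_stock": stock.get(sku, 0)}


# Tool description in the OpenAI-style "tools" format, parameters as JSON Schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Look up the stock level for a product SKU.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}]

# Maps tool names the model may choose to the functions that implement them.
REGISTRY: Dict[str, Callable[..., Any]] = {"check_inventory": check_inventory}


def run_tool_call(tool_call_id: str, name: str, arguments_json: str) -> Dict[str, str]:
    """Execute one model-chosen tool call and build the tool message to send back."""
    try:
        args = json.loads(arguments_json)
        result = REGISTRY[name](**args)
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report the failure to the model instead of crashing the turn.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}
```

Returning errors as tool content gives the model a chance to correct its arguments on the next request. Whether that is desirable depends on how much you trust the model to recover.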

Latency and Streaming Responses

Perceived latency often matters more to users than total generation time. Enable streaming so tokens are displayed as they are generated instead of after the full completion finishes. This is especially important for multi-turn chats where users expect immediate feedback. Oxlo.ai serves popular models with no cold starts, so the time-to-first-token remains consistent even after periods of low traffic.
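Time-to-first-token is easy to measure yourself. The helper below is a small sketch, assuming the stream can be exposed as an iterable of text pieces. The name measure_stream and the open_stream callable are hypothetical. The timer starts before the stream is opened so that request overhead is included.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    open_stream: Callable[[], Iterable[str]],
) -> Tuple[Optional[float], float, str]:
    """Return (seconds to first non-empty piece, total seconds, full text)."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    parts = []
    for text in open_stream():
        if not text:
            continue  # ignore empty pieces such as role-only chunks
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(text)
    total = time.perf_counter() - start
    return first_token_at, total, "".join(parts)
```

With an SDK client, open_stream would be a small generator function that issues the streaming request and yields the text content of each chunk. Logging both numbers per request makes regressions in perceived latency visible separately from changes in output length.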

Cost Control with Request-Based Pricing

Token-based billing makes cost forecasting difficult for chatbots. A single user session with a long system prompt, multi-turn history, and retrieved documents can consume thousands of tokens per request. Oxlo.ai charges one flat rate per API request regardless of prompt length or output size. For chatbots that maintain long context or run agentic loops, this model can be significantly cheaper than token-based alternatives. See the exact rates on the Oxlo.ai pricing page. You can start building on the free tier, which includes 60 requests per day across more than 16 models, then scale to Pro or Premium plans as traffic grows.
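The comparison comes down to simple arithmetic that you can run against your own traffic. The sketch below assumes every request resends the full history, so the prompt grows by a fixed amount each turn. All function names are hypothetical, and every rate is a parameter you must supply from a provider's published pricing. No real prices are assumed here.

```python
from typing import List


def prompt_sizes(turns: int, base_prompt_tokens: int, growth_per_turn: int) -> List[int]:
    """Prompt size per request when every turn resends the full history."""
    return [base_prompt_tokens + i * growth_per_turn for i in range(turns)]


def token_based_cost(
    prompts: List[int],
    completion_tokens: int,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> float:
    """Session cost when billing is per token (rates are caller-supplied)."""
    prompt_cost = sum(prompts) / 1000 * usd_per_1k_prompt
    completion_cost = len(prompts) * completion_tokens / 1000 * usd_per_1k_completion
    return prompt_cost + completion_cost


def request_based_cost(requests: int, usd_per_request: float) -> float:
    """Session cost when billing is a flat rate per request (rate is caller-supplied)."""
    return requests * usd_per_request
```

Under token billing the prompt total grows roughly quadratically with the number of turns, while the flat model grows linearly with the number of requests. Note that an agentic loop may issue several requests per user turn, so count requests rather than turns when estimating the flat-rate side.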

Example Implementation

The following Python snippet shows a minimal streaming chat request against Oxlo.ai using the official OpenAI Python SDK. Because Oxlo.ai is fully OpenAI API compatible, the only changes are the base_url and the API key.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain request-based pricing."}
]

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=messages,
    stream=True
)

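for chunk in stream:
    # Some chunks can arrive with no choices (for example a trailing usage chunk).
    if not chunk.choices:
        continue
    # delta.content is None on role-only and tool-call chunks, so fall back to "".
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()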

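To add tool use, include the tools parameter in the same request. When streaming, the model emits tool calls as fragments in delta.tool_calls, with the JSON arguments split across several chunks. Your loop needs to concatenate the fragments for each call index, execute the completed calls, and return the results as follow-up tool messages.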
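Capturing streamed tool calls might look like the sketch below. It assumes chunks follow the OpenAI streaming shape, where each fragment in delta.tool_calls has an index, an optional id, and a function with optional name and arguments. The names collect_tool_calls and fake_chunk are hypothetical, and fake_chunk exists only so the logic can be exercised offline without a network call.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Optional


def collect_tool_calls(chunks: Iterable[Any]) -> List[Dict[str, str]]:
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls: Dict[int, Dict[str, str]] = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        for frag in chunk.choices[0].delta.tool_calls or []:
            call = calls.setdefault(frag.index, {"id": "", "name": "", "arguments": ""})
            if frag.id:
                call["id"] = frag.id
            if frag.function and frag.function.name:
                call["name"] = frag.function.name
            if frag.function and frag.function.arguments:
                # Arguments arrive as partial JSON text; append in order.
                call["arguments"] += frag.function.arguments
    return [calls[i] for i in sorted(calls)]


def fake_chunk(
    index: int,
    call_id: Optional[str] = None,
    name: Optional[str] = None,
    arguments: Optional[str] = None,
) -> SimpleNamespace:
    """Build a stand-in for one streamed chunk, for offline testing only."""
    function = SimpleNamespace(name=name, arguments=arguments)
    frag = SimpleNamespace(index=index, id=call_id, function=function)
    delta = SimpleNamespace(content=None, tool_calls=[frag])
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

The accumulated arguments string is only valid JSON once the stream has finished, so parse it after the loop rather than per chunk.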

Evaluation and Guardrails

Before shipping, measure hallucination rates on your domain-specific questions, test end-to-end latency under load, and verify that tool call arguments are schema-compliant. Log every request and response so you can trace failures. If you are comparing providers, run an A/B test using the same OpenAI SDK code by swapping the base_url and API key. Oxlo.ai fits naturally into this workflow because it uses the same request shapes and response formats.
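For the schema-compliance check, a lightweight validator is often enough to catch the most common failures in logs. The sketch below covers only required fields and primitive types. It is not a full JSON Schema validator, and the name argument_errors is hypothetical. A dedicated validation library is a better fit for nested or constrained schemas.

```python
import json
from typing import Any, Dict, List

# Maps JSON Schema primitive type names to Python types.
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def argument_errors(arguments_json: str, schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; empty means the arguments passed these basic checks."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    errors = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in args
    ]
    for key, value in args.items():
        spec = schema.get("properties", {}).get(key)
        if spec is None:
            continue  # unknown fields are allowed by default in JSON Schema
        type_name = spec.get("type")
        expected = _TYPES.get(type_name) if isinstance(type_name, str) else None
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if is_bool_as_number or not isinstance(value, expected):
            errors.append(f"wrong type for {key}: expected {type_name}")
    return errors
```

Running this over logged tool calls gives a simple compliance rate per model, which is a useful number to track when comparing providers or model versions.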

Conclusion

Reliable LLM chatbots need careful context management, streaming, tool use, and cost controls. Oxlo.ai gives you an OpenAI-compatible API with request-based pricing, no cold starts, and a broad model catalog including Llama 3.3 70B, Qwen 3 32B, and DeepSeek R1. If your chatbot relies on long conversations or large prompts, switching to a flat per-request model can simplify both your architecture and your budget.