Voice assistants built on large language models have moved beyond simple command recognition. Instead of mapping utterances to fixed intents, a modern assistant transcribes speech to text, reasons over context with an LLM, and synthesizes a natural reply. The result is a flexible system that handles follow-up questions, tool use, and multi-turn conversations without retraining rigid dialogue graphs. Oxlo.ai provides the inference backbone for this stack: OpenAI-compatible endpoints for audio transcription, chat completion, and text-to-speech, all on a flat per-request pricing model that stays predictable as conversations grow.
Architecture Overview
A minimal LLM voice assistant has three stages:
- Speech-to-text (STT): Capture microphone audio and transcribe it. Oxlo.ai hosts Whisper Large v3, Whisper Turbo, and Whisper Medium under the audio/transcriptions endpoint.
- Reasoning and generation: Send the transcript to a chat model. Oxlo.ai carries 45+ models, from fast general-purpose workhorses like Llama 3.3 70B to agentic reasoners like Qwen 3 32B and Kimi K2.6. All are reachable through the standard chat/completions endpoint with full tool-use and JSON-mode support.
- Text-to-speech (TTS): Stream the assistant’s reply into audio. Oxlo.ai offers Kokoro 82M TTS on the audio/speech endpoint.
Because Oxlo.ai exposes all three stages through a single OpenAI-compatible base URL, you can build the entire pipeline with one client and one API key. There are no cold starts on popular models, so the assistant begins responding immediately after the user stops speaking.
Prerequisites and Setup
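You need Python 3.10 or newer, a working microphone, and an Oxlo.ai API key. The playback step later in this guide also calls ffplay, which ships with FFmpeg, so make sure it is on your PATH. Install the client and audio utilities: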
pip install openai sounddevice numpy scipy
Configure the client to point to Oxlo.ai:
import openai

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)
Capturing and Transcribing Audio
Use the computer’s default microphone to record a short clip, then send the WAV bytes to the Oxlo.ai transcription endpoint. Whisper Turbo is a strong default for real-time assistants because it balances speed and accuracy.
import os
import tempfile

import numpy as np
import sounddevice as sd
from scipy.io.wavfile import write

SAMPLE_RATE = 16000  # Hz
DURATION = 5  # seconds

def record_audio():
    # Record a fixed-length mono clip as 16-bit PCM and block until it finishes.
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype=np.int16)
    sd.wait()
    return audio

def transcribe(audio_np):
    # Reserve a temp path, then close the handle so the file can be reopened on any OS.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        path = f.name
    try:
        write(path, SAMPLE_RATE, audio_np)
        with open(path, "rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-large-v3-turbo",
                file=audio_file
            )
    finally:
        os.remove(path)  # delete=False means cleanup is our job
    return transcript.text
If your assistant runs in a noisy environment, you can preprocess the buffer with a voice-activity detector (VAD) before sending it, or switch to Whisper Large v3 for higher fidelity at slightly higher latency.
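The preprocessing idea above can be sketched with a simple energy threshold. This is not a trained VAD model, only a rough stand-in that trims leading and trailing silence. The function name `trim_silence`, the 30 ms frame length, and the threshold value are illustrative assumptions, not part of Oxlo.ai or sounddevice.

```python
import numpy as np

def trim_silence(audio, sample_rate=16000, frame_ms=30, threshold=500.0):
    """Drop leading and trailing frames whose RMS energy is below `threshold`.

    `audio` is int16 PCM with shape (n,) or (n, 1), as returned by record_audio().
    The threshold is an illustrative value; tune it for your microphone.
    Returns an empty array when no frame is loud enough.
    """
    samples = np.asarray(audio).reshape(-1).astype(np.float64)
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return np.zeros(0, dtype=np.int16)

    # Split into whole frames and compute the RMS energy of each one.
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))

    voiced = np.flatnonzero(rms >= threshold)
    if voiced.size == 0:
        return np.zeros(0, dtype=np.int16)

    # Keep everything between the first and last voiced frame.
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return samples[start:end].astype(np.int16)
```

If the result is empty, the loop can skip the transcription request entirely. A dedicated VAD library will usually cope better with background noise than a fixed threshold.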
Generating the Response
Pass the transcript to an Oxlo.ai chat model. For a general voice assistant, Llama 3.3 70B is a reliable flagship. If the assistant must reason over prior conversation state or call smart-home APIs, Qwen 3 32B and Kimi K2.6 excel at agentic tool use.
The example below uses a system prompt that keeps replies concise, which reduces TTS latency and sounds more natural in spoken form.
def generate_reply(user_text):
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant. Keep answers under two sentences unless asked for detail."},
            {"role": "user", "content": user_text}
        ],
        stream=False
    )
    return response.choices[0].message.content
For multi-turn conversations, append each exchange to the messages list. Because Oxlo.ai uses request-based pricing, long context windows and extended system prompts do not inflate the per-interaction cost the way token-based billing would. This makes the platform especially economical for agents that carry large memory buffers or few-shot examples. See https://oxlo.ai/pricing for plan details.
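One way to keep that messages list manageable is a small history helper. The sketch below is an assumption about how you might structure it: the class name `ConversationHistory` and the `max_turns` cap are hypothetical and not part of the OpenAI client or Oxlo.ai.

```python
class ConversationHistory:
    """Keeps the system prompt plus the most recent exchanges."""

    def __init__(self, system_prompt, max_turns=10):
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.system = {"role": "system", "content": system_prompt}
        self.max_turns = max_turns  # one turn = user message + assistant reply
        self.turns = []

    def add_exchange(self, user_text, assistant_text):
        """Store a finished exchange and drop the oldest ones past the cap."""
        self.turns.append({"role": "user", "content": user_text})
        self.turns.append({"role": "assistant", "content": assistant_text})
        self.turns = self.turns[-2 * self.max_turns:]

    def messages_for(self, user_text):
        """Build the `messages` list for the next chat completion request."""
        return [self.system, *self.turns, {"role": "user", "content": user_text}]
```

In generate_reply you would pass `history.messages_for(user_text)` as `messages`, then call `history.add_exchange(user_text, reply)` once the reply is known.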
If your assistant needs to control devices or query APIs, enable function calling in the same request. Oxlo.ai models support parallel tool execution and JSON mode, so you can route the assistant to real-world actions before it speaks.
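As a rough illustration of function calling, the sketch below declares one tool in the OpenAI-compatible `tools` format and dispatches a call to a local handler. The `set_light` function, the `HANDLERS` table, and `run_tool_call` are hypothetical names invented for this example; a real assistant would call an actual smart-home API instead.

```python
import json

# Hypothetical device-control function; replace with a real smart-home call.
def set_light(room, on):
    return {"room": room, "on": on}

# Tool schema in the OpenAI-compatible `tools` format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Turn a light on or off in a given room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "on": {"type": "boolean"},
            },
            "required": ["room", "on"],
        },
    },
}]

HANDLERS = {"set_light": set_light}

def run_tool_call(name, arguments_json):
    """Execute one tool call and return a JSON string for the `tool` message."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the handler signature.
        return json.dumps({"error": str(exc)})
```

To use it, pass `tools=TOOLS` to `client.chat.completions.create`. For each entry in `response.choices[0].message.tool_calls`, call `run_tool_call(call.function.name, call.function.arguments)` and append the result as a message with role `tool` and the matching `tool_call_id`, then request a second completion so the model can phrase the spoken confirmation.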
Speaking the Reply
Oxlo.ai hosts Kokoro 82M TTS on the audio/speech endpoint. The model is lightweight and generates natural-sounding speech quickly.
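import os
import subprocess

def speak(text):
    response = client.audio.speech.create(
        model="kokoro-82m",
        voice="af_bella",
        input=text
    )
    # Write the audio to a temp file and close it before handing it to ffplay.
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(response.content)
    try:
        subprocess.run(["ffplay", "-nodisp", "-autoexit", f.name], check=True)
    finally:
        os.remove(f.name)  # delete=False means cleanup is our job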
If you prefer not to spawn ffplay, write the bytes to a PyAudio stream or save the file and play it with any local audio player.
Wiring the Pipeline
Combine the three stages into a loop that listens, thinks, and speaks:
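if __name__ == "__main__":
    print("Voice assistant started. Speak now...")
    while True:
        print("Listening...")
        audio = record_audio()
        user_input = transcribe(audio).strip()
        if not user_input:
            continue  # nothing was transcribed; listen again
        print(f"You: {user_input}")
        # Whisper usually adds capitalization and punctuation ("Stop."), so normalize first.
        if user_input.lower().strip(" .,!?") in ["exit", "quit", "stop"]:
            break
        reply = generate_reply(user_input)
        print(f"Assistant: {reply}")
        speak(reply)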
This blocking loop is simple, but production assistants usually pipeline the stages. For example, you can stream the LLM response and feed sentence-length chunks into TTS as they arrive, cutting perceived latency by hundreds of milliseconds.
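The sentence-chunking step can be sketched as a small generator that buffers streamed text fragments and yields each sentence as soon as it is complete. The name `sentences_from_stream` and the punctuation-based boundary rule are assumptions made for illustration; the rule is deliberately naive and will split on abbreviations such as "Dr. Smith".

```python
import re

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(deltas):
    """Yield complete sentences from an iterable of text fragments."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

With `stream=True`, the chat completion returns chunks whose text lives in `chunk.choices[0].delta.content`. Feeding those fragments (skipping `None` values) into this generator and calling `speak()` on each yielded sentence lets playback start before the full reply has been generated.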
Choosing Models and Optimizing Latency
Oxlo.ai carries multiple options for each stage, so you can tune the assistant for speed, reasoning depth, or cost.
- Fast daily queries: DeepSeek V3.2 handles coding and reasoning and is available on the free tier.
- Deep reasoning: DeepSeek R1 671B MoE or Kimi K2 Thinking provide chain-of-thought reasoning for complex user requests.
- Long-context memory: Kimi K2.6 and DeepSeek V4 Flash support context windows of 131K to 1M tokens, letting the assistant reference entire meeting transcripts or codebases without truncation.
- Code-heavy assistants: Qwen 3 Coder 30B or Oxlo.ai Coder Fast generate precise snippets when the user asks for programming help.
Because there are no cold starts on popular models, the first request after idle time is just as fast as the tenth. This is critical for voice UIs where users expect immediate feedback.
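When comparing models for speed, it helps to measure each stage rather than guess. The helper below is a minimal sketch that uses only the standard library; the names `timed`, `timings`, and `report` are hypothetical.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> list of durations in seconds

@contextmanager
def timed(stage):
    """Record wall-clock seconds spent in the `with` block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(stage, []).append(time.perf_counter() - start)

def report():
    """Return the mean duration per stage in milliseconds."""
    return {stage: 1000 * sum(v) / len(v) for stage, v in timings.items()}
```

Wrapping each call in the main loop, for example `with timed("stt"): user_input = transcribe(audio)`, and printing `report()` after a few turns shows which stage dominates the response time for a given model choice.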
Conclusion
Building a voice assistant today is primarily an integration exercise: wire a microphone to a transcription service, an LLM, and a speech synthesizer. Oxlo.ai consolid