Building a Voice Assistant with an LLM: A Step-by-Step Guide

#product #oxlo #ai

Voice assistants built on large language models have moved beyond simple command recognition. Instead of mapping utterances to fixed intents, a modern assistant transcribes speech to text, reasons over context with an LLM, and synthesizes a natural reply. The result is a flexible system that handles follow-up questions, tool use, and multi-turn conversations without retraining rigid dialogue graphs. Oxlo.ai provides the inference backbone for this stack: OpenAI-compatible endpoints for audio transcription, chat completion, and text-to-speech, all on a flat per-request pricing model that stays predictable as conversations grow.

Architecture Overview

A minimal LLM voice assistant has three stages:

Speech-to-text (STT): Capture microphone audio and transcribe it. Oxlo.ai hosts Whisper Large v3, Whisper Turbo, and Whisper Medium under the audio/transcriptions endpoint.
Reasoning and generation: Send the transcript to a chat model. Oxlo.ai carries 45+ models, from fast general-purpose workhorses like Llama 3.3 70B to agentic reasoners like Qwen 3 32B and Kimi K2.6. All are reachable through the standard chat/completions endpoint with full tool-use and JSON-mode support.
Text-to-speech (TTS): Stream the assistant’s reply into audio. Oxlo.ai offers Kokoro 82M TTS on the audio/speech endpoint.

Because Oxlo.ai exposes all three stages through a single OpenAI-compatible base URL, you can build the entire pipeline with one client and one API key. There are no cold starts on popular models, so the assistant begins responding immediately after the user stops speaking.

Prerequisites and Setup

You need Python 3.10 or newer, a working microphone, an Oxlo.ai API key, and FFmpeg on your PATH (the playback step later in this guide calls ffplay). Install the client and audio utilities:

pip install openai sounddevice numpy scipy

Configure the client to point to Oxlo.ai:

import openai

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)
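
Hard-coding the key is fine for a first test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead; the variable name OXLO_API_KEY and the helper load_api_key are assumptions made for this example, not names defined by Oxlo.ai:

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key stored in env_var, or raise a clear error.

    The variable name is a convention chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before starting the assistant."
        )
    return key


# Usage with the client shown above:
# client = openai.OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key())
```

Failing at startup with an explicit message is easier to debug than an authentication error raised on the first request.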

Capturing and Transcribing Audio

Use the computer’s default microphone to record a short clip, then send the WAV bytes to the Oxlo.ai transcription endpoint. Whisper Turbo is a strong default for real-time assistants because it balances speed and accuracy.

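import os
import tempfile

import numpy as np
import sounddevice as sd
from scipy.io.wavfile import write

SAMPLE_RATE = 16000  # Hz; Whisper models operate on 16 kHz audio
DURATION = 5  # seconds

def record_audio():
    # Record a fixed-length mono clip from the default input device.
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype=np.int16)
    sd.wait()  # block until the recording is finished
    return audio

def transcribe(audio_np):
    # Reserve a temp path and close the handle right away so the WAV can be
    # written and reopened by name on every platform, including Windows.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        wav_path = f.name
    try:
        write(wav_path, SAMPLE_RATE, audio_np)
        with open(wav_path, "rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-large-v3-turbo",
                file=audio_file
            )
    finally:
        os.remove(wav_path)  # delete=False means we must clean up ourselves
    return transcript.text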

If your assistant runs in a noisy environment, you can preprocess the buffer with a voice-activity detector (VAD) before sending it, or switch to Whisper Large v3 for higher fidelity at slightly higher latency.
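
As a rough illustration of that preprocessing step, the sketch below trims leading and trailing silence with a simple RMS energy threshold. This is a crude stand-in for a real voice-activity detector, not one of the trained VAD models; the function name trim_silence, the 30 ms frame size, and the threshold of 500 (on the int16 scale) are all assumptions that would need tuning for a given microphone.

```python
import numpy as np


def trim_silence(audio: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 30, threshold: float = 500.0) -> np.ndarray:
    """Return the audio between the first and last frame whose RMS >= threshold.

    Accepts the (N, 1) int16 array produced by record_audio() or a flat array,
    and returns a flat array of the same dtype (empty if nothing is loud enough).
    """
    flat = audio.reshape(-1)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(flat) // frame_len
    if n_frames == 0:
        return flat[:0]

    # Split into fixed-size frames and compute the RMS energy of each one.
    frames = flat[: n_frames * frame_len].astype(np.float64)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))

    voiced = np.flatnonzero(rms >= threshold)
    if voiced.size == 0:
        return flat[:0]  # nothing above the threshold: treat as silence

    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return flat[start:end]
```

A caller could skip the transcription request entirely when the returned array is empty, which avoids sending pure silence to the API.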

Generating the Response

Pass the transcript to an Oxlo.ai chat model. For a general voice assistant, Llama 3.3 70B is a reliable flagship. If the assistant must reason over prior conversation state or call smart-home APIs, Qwen 3 32B and Kimi K2.6 excel at agentic tool use.

The example below uses a system prompt that keeps replies concise, which reduces TTS latency and sounds more natural in spoken form.

def generate_reply(user_text):
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant. Keep answers under two sentences unless asked for detail."},
            {"role": "user", "content": user_text}
        ],
        stream=False
    )
    return response.choices[0].message.content

For multi-turn conversations, append each exchange to the messages list. Because Oxlo.ai uses request-based pricing, long context windows and extended system prompts do not inflate the per-interaction cost the way token-based billing would. This makes the platform especially economical for agents that carry large memory buffers or few-shot examples. See https://oxlo.ai/pricing for plan details.
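
One way to keep that history is a small helper that stores completed exchanges and rebuilds the messages list for each request. The sketch below is illustrative: the class name ConversationMemory and the max_exchanges cap are assumptions, and the cap exists to stay inside the model's context window rather than to control cost.

```python
from collections import deque


class ConversationMemory:
    """Keep the most recent user/assistant exchanges for multi-turn chat."""

    def __init__(self, system_prompt: str, max_exchanges: int = 10):
        self.system_prompt = system_prompt
        # Oldest exchanges fall off automatically once the cap is reached.
        self.exchanges = deque(maxlen=max_exchanges)

    def build_messages(self, user_text: str) -> list:
        """Return system prompt + stored history + the new user message."""
        messages = [{"role": "system", "content": self.system_prompt}]
        for user, assistant in self.exchanges:
            messages.append({"role": "user", "content": user})
            messages.append({"role": "assistant", "content": assistant})
        messages.append({"role": "user", "content": user_text})
        return messages

    def record(self, user_text: str, reply: str) -> None:
        """Store a completed exchange after the model has answered."""
        self.exchanges.append((user_text, reply))
```

In generate_reply, the messages argument would become memory.build_messages(user_text), followed by memory.record(user_text, reply) once the reply is available. Storing whole exchanges, rather than individual messages, guarantees the trimmed history never starts with an orphaned assistant turn.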

If your assistant needs to control devices or query APIs, enable function calling in the same request. Oxlo.ai models support parallel tool execution and JSON mode, so you can route the assistant to real-world actions before it speaks.
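
A minimal sketch of the local half of function calling follows, assuming Oxlo.ai accepts the standard OpenAI tools schema (the text above says the endpoints are OpenAI-compatible, but this has not been verified against a specific model). The set_light function is a hypothetical smart-home action invented for the example.

```python
import json


def set_light(room: str, on: bool) -> dict:
    """Hypothetical smart-home action; replace with a real device call."""
    return {"room": room, "on": on}


# Tool description in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Turn a light on or off in a given room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "on": {"type": "boolean"},
            },
            "required": ["room", "on"],
        },
    },
}]

HANDLERS = {"set_light": set_light}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return a JSON string for the 'tool' message."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json or "{}")
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong/missing arguments from the model.
        return json.dumps({"error": str(exc)})


# With the OpenAI SDK, the request would pass tools=TOOLS, and each entry in
# response.choices[0].message.tool_calls would be handled like this:
#   result = run_tool_call(call.function.name, call.function.arguments)
#   messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

Returning errors as JSON instead of raising lets the model see what went wrong and either retry or explain the failure to the user in speech.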

Speaking the Reply

Oxlo.ai hosts Kokoro 82M TTS on the audio/speech endpoint. The model is lightweight and generates natural-sounding speech quickly.

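import os
import subprocess
import tempfile

def speak(text):
    response = client.audio.speech.create(
        model="kokoro-82m",
        voice="af_bella",
        input=text
    )
    # Close the file before playback so every byte is flushed to disk.
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(response.content)
        mp3_path = f.name
    try:
        # ffplay ships with FFmpeg and must be available on PATH.
        subprocess.run(["ffplay", "-nodisp", "-autoexit", mp3_path], check=True)
    finally:
        os.remove(mp3_path)  # delete=False means we must clean up ourselves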

If you prefer not to spawn ffplay, write the bytes to a PyAudio stream or save the file and play it with any local audio player.

Wiring the Pipeline

Combine the three stages into a loop that listens, thinks, and speaks:

if __name__ == "__main__":
    print("Voice assistant started. Speak now...")
    while True:
        print("Listening...")
        audio = record_audio()
        user_input = transcribe(audio)
        print(f"You: {user_input}")

        # Whisper usually adds capitalization and punctuation ("Stop."), so normalize first.
        if user_input.lower().strip(" .!?") in ["exit", "quit", "stop"]:
            break

        reply = generate_reply(user_input)
        print(f"Assistant: {reply}")
        speak(reply)

This blocking loop is simple, but production assistants usually pipeline the stages. For example, you can stream the LLM response and feed sentence-length chunks into TTS as they arrive, cutting perceived latency by hundreds of milliseconds.
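
The chunking half of that idea can be sketched without any network calls. The generator below buffers streamed text deltas and yields a sentence as soon as its terminating punctuation is followed by whitespace. The name sentences_from_stream is made up for this example, and the regex split is deliberately naive: it will also break after abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from an iterable of partial text chunks."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is known to be a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


# With the OpenAI SDK and stream=True, the deltas could come from:
#   stream = client.chat.completions.create(..., stream=True)
#   deltas = (c.choices[0].delta.content or "" for c in stream if c.choices)
#   for sentence in sentences_from_stream(deltas):
#       speak(sentence)
```

Because speak() blocks during playback, a fuller version would run synthesis and playback on a separate thread or queue so the next sentence is being generated while the current one plays.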

Choosing Models and Optimizing Latency

Oxlo.ai carries multiple options for each stage, so you can tune the assistant for speed, reasoning depth, or cost.

Fast daily queries: DeepSeek V3.2 handles coding and reasoning and is available on the free tier.
Deep reasoning: DeepSeek R1 671B MoE or Kimi K2 Thinking provide chain-of-thought reasoning for complex user requests.
Long-context memory: Kimi K2.6 and DeepSeek V4 Flash support 131K to 1M tokens, letting the assistant reference entire meeting transcripts or codebases without truncation.
Code-heavy assistants: Qwen 3 Coder 30B or Oxlo.ai Coder Fast generate precise snippets when the user asks for programming help.

Because there are no cold starts on popular models, the first request after idle time is just as fast as the tenth. This is critical for voice UIs where users expect immediate feedback.
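
When comparing models, it helps to measure each stage rather than guess where the time goes. Here is a small timing helper as a sketch; the name timed and the stage labels are arbitrary choices for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, store: dict):
    """Append the wall-clock duration of the with-block to store[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed calls are still counted.
        store.setdefault(stage, []).append(time.perf_counter() - start)


# Usage inside the main loop:
#   timings = {}
#   with timed("stt", timings):
#       user_input = transcribe(audio)
#   with timed("llm", timings):
#       reply = generate_reply(user_input)
#   print({stage: round(sum(v) / len(v), 3) for stage, v in timings.items()})
```

Averaging over several turns, instead of reading a single measurement, smooths out network jitter when deciding between, for example, two Whisper variants.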

Conclusion

Building a voice assistant today is primarily an integration exercise: wire a microphone to a transcription service, an LLM, and a speech synthesizer. Oxlo.ai consolid