What are the three stages of a voice loop?

Speech-to-text (Whisper transcribes the user's audio), language processing (an LLM reads the transcript and produces a reply), and text-to-speech (a TTS model turns the reply into audio). Each stage adds latency, so the felt responsiveness of a voice app is the sum of all three plus network time.

How do I make it feel fast?

Stream wherever you can — stream the LLM tokens and begin TTS on the first sentence rather than waiting for the whole reply, and stream the audio out. Also detect when the user stops speaking (voice activity detection) so you start transcribing immediately instead of waiting for a fixed timeout.

Can I run the whole loop in a browser?

You can capture audio and play audio in the browser, but for hosted Whisper/LLM/TTS you proxy the API calls through your own backend so your API key is never exposed to the client. Fully local in-browser models exist but are heavier and lower quality than hosted ones.

How much does a voice conversation cost?

Roughly the sum of Whisper transcription (priced per minute of audio), the LLM call (priced per token), and TTS (priced per character of reply). For a short back-and-forth this is typically a few cents per turn. The planner below estimates it from your turn length and conversation length.

How do I handle interruptions?

A natural conversation lets the user talk over the assistant. Detect incoming speech while audio is playing, stop playback, and cancel any in-flight LLM and TTS requests for the previous turn. This barge-in handling is what separates a robotic demo from a real assistant.

How to Build a Voice AI App with Whisper and TTS

What you are building

A voice AI app lets a person talk to an assistant and hear it talk back. Under the hood it is a three-stage loop: speech-to-text turns the user’s audio into a transcript, an LLM reads that transcript and decides what to say, and text-to-speech turns the reply into audio that plays back. Whisper handles the first stage, a chat model the middle, and a TTS model the last. The engineering challenge is not any single stage — it is making the whole loop feel fast and natural.

How the loop works

Audio is captured in the browser or app and sent to Whisper, which returns a transcript. That transcript, plus your system prompt and the conversation so far, goes to a chat model that produces a text reply. The reply is sent to a text-to-speech model, whose audio is streamed back and played. Because hosted models need an API key, the client never calls the providers directly — it sends audio to your backend, which proxies the three calls and streams results back, keeping the key secret.

Latency is everything in voice. Each stage adds delay, and the total — plus network round trips — is what the user feels as responsiveness. The tricks that matter most: use voice activity detection so you begin transcribing the instant the user stops, stream the LLM tokens, and start TTS on the first complete sentence instead of waiting for the full answer. The planner below estimates the per-turn latency and cost of your loop from the turn length and model choices so you can see where the time and money go.

Tips for a natural assistant

Stream every stage you can and pipeline them — do not wait for the full transcript or full reply when you can start the next stage early. Handle barge-in (the user interrupting): detect speech during playback, stop the audio, and cancel the in-flight requests. Keep replies short; long TTS output is slow and costly, and people interrupt long monologues anyway. Proxy all provider calls through your backend to protect your key. And measure each stage’s latency separately so you optimise the slowest one rather than guessing.