What you are building
A voice AI app lets a person talk to an assistant and hear it talk back. Under the hood it is a three-stage loop: speech-to-text turns the user’s audio into a transcript, an LLM reads that transcript and decides what to say, and text-to-speech turns the reply into audio that plays back. Whisper handles the first stage, a chat model the middle, and a TTS model the last. The engineering challenge is not any single stage — it is making the whole loop feel fast and natural.
How the loop works
Audio is captured in the browser or app and sent to Whisper, which returns a transcript. That transcript, plus your system prompt and the conversation so far, goes to a chat model that produces a text reply. The reply is sent to a text-to-speech model, whose audio is streamed back and played. Because hosted models need an API key, the client never calls the providers directly — it sends audio to your backend, which proxies the three calls and streams results back, keeping the key secret.
Latency is everything in voice. Each stage adds delay, and the total — plus network round trips — is what the user feels as responsiveness. The tricks that matter most: use voice activity detection so you begin transcribing the instant the user stops, stream the LLM tokens, and start TTS on the first complete sentence instead of waiting for the full answer. The planner below estimates the per-turn latency and cost of your loop from the turn length and model choices so you can see where the time and money go.
Tips for a natural assistant
Stream every stage you can and pipeline them — do not wait for the full transcript or full reply when you can start the next stage early. Handle barge-in (the user interrupting): detect speech during playback, stop the audio, and cancel the in-flight requests. Keep replies short; long TTS output is slow and costly, and people interrupt long monologues anyway. Proxy all provider calls through your backend to protect your key. And measure each stage’s latency separately so you optimise the slowest one rather than guessing.