Audio token cost estimator
GPT-4o Audio and the Realtime API bill spoken audio by the token rather than by the minute, and the per-token rate for audio is far higher than for text. Because cost scales with audio duration, even a short voice agent loop can become surprisingly expensive at volume. This tool converts seconds of audio into token equivalents and a dollar figure so you can size a voice feature before you ship it.
How audio tokens work
Audio-capable models charge a fixed number of tokens per second of audio (about 10 tokens per second in each direction), and audio tokens are priced separately from text tokens. A speech-to-speech turn therefore has three cost components: input audio tokens, output audio tokens, and any text tokens that ride along (system prompts, transcripts, function results). This tool computes the audio portion from your durations and adds the text tokens at the cheaper text rate, giving you an estimated per-call total.
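The arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the names `Rates`, `audio_tokens`, and `estimate_call_cost` are hypothetical, the 10 tokens per second figure is the approximation quoted above, and the prices in the usage example are placeholders rather than real list prices.

```python
import math
from dataclasses import dataclass

# Approximation quoted in the text: about 10 audio tokens per second in each
# direction. Treat it as an assumption and confirm it against current docs.
AUDIO_TOKENS_PER_SECOND = 10.0


@dataclass(frozen=True)
class Rates:
    """Prices in US dollars per one million tokens (hypothetical structure)."""
    audio_in: float
    audio_out: float
    text_in: float
    text_out: float


def audio_tokens(seconds: float,
                 tokens_per_second: float = AUDIO_TOKENS_PER_SECOND) -> int:
    """Convert an audio duration in seconds to a token count, rounding up."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    # round() guards against tiny floating-point overshoot before ceil().
    return math.ceil(round(seconds * tokens_per_second, 9))


def estimate_call_cost(in_seconds: float, out_seconds: float,
                       text_in_tokens: int, text_out_tokens: int,
                       rates: Rates) -> dict:
    """Return the dollar cost of one call, broken down by component."""
    per_token = 1_000_000  # rates are quoted per million tokens
    audio_in_cost = audio_tokens(in_seconds) * rates.audio_in / per_token
    audio_out_cost = audio_tokens(out_seconds) * rates.audio_out / per_token
    text_cost = (text_in_tokens * rates.text_in
                 + text_out_tokens * rates.text_out) / per_token
    return {
        "audio_in": audio_in_cost,
        "audio_out": audio_out_cost,
        "text": text_cost,
        "total": audio_in_cost + audio_out_cost + text_cost,
    }


if __name__ == "__main__":
    # Placeholder prices for illustration only; substitute current pricing.
    example_rates = Rates(audio_in=100.0, audio_out=200.0,
                          text_in=5.0, text_out=20.0)
    print(estimate_call_cost(30, 20, 1000, 200, example_rates))
```

Keeping the four rates in one object makes it easy to rerun the same call profile against a different model's price sheet, and multiplying the per-call total by expected call volume gives a rough monthly figure.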
Tips and notes
- Trim silence and dead air before sending audio in — every second you send is billed, whether or not it carries speech.
- For transcription-only workloads, a dedicated speech-to-text model is usually far cheaper than routing audio through a full multimodal chat model.
- In a realtime voice loop, the model often re-hears prior audio context on each new turn; cap the history window to avoid paying for the same seconds repeatedly.
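To see why the history window matters, here is a rough sketch of how billed input seconds grow over a conversation. It rests on a simplifying assumption that may not match any particular API: on every turn, all retained audio from earlier turns (both user and assistant) is billed again as input at the full rate. The function name `billed_input_seconds` and its parameters are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Each turn is (user_audio_seconds, assistant_audio_seconds).
Turn = Tuple[float, float]


def billed_input_seconds(turns: Sequence[Turn],
                         max_history_turns: Optional[int] = None) -> float:
    """Total input audio seconds billed across a conversation.

    Simplified model (an assumption): each turn pays for its new user audio
    plus all audio from the prior turns still inside the history window.
    max_history_turns=None means the history is never trimmed.
    """
    total = 0.0
    for i, (user_seconds, _assistant_seconds) in enumerate(turns):
        # Index of the oldest earlier turn that is still kept as context.
        start = 0 if max_history_turns is None else max(0, i - max_history_turns)
        history = sum(u + a for u, a in turns[start:i])
        total += user_seconds + history
    return total


if __name__ == "__main__":
    conversation = [(5.0, 10.0)] * 4  # four turns: 5 s in, 10 s out each
    print("uncapped:", billed_input_seconds(conversation))
    print("capped to 1 prior turn:", billed_input_seconds(conversation, 1))
```

Under this model the uncapped total grows roughly with the square of the number of turns, while a fixed window keeps growth linear. Real services may discount cached context or truncate by tokens rather than by turns, so treat the output as an upper-bound style estimate.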