How many tokens is one second of audio?

GPT-4o Audio counts roughly 10 tokens per second of input audio and about the same for output audio. Each of those tokens is priced well above a text token. This tool uses the per-second token rates published for the audio-capable models.

Why is audio so much more expensive than text?

Audio tokens are priced higher per million tokens than text tokens, and they accrue by the second rather than by the word. At roughly 10 tokens per second, a minute of audio input is about 600 audio tokens, which can cost as much as thousands of text tokens. Duration therefore drives the bill more than transcript length.
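
To make the comparison concrete, the sketch below converts a duration into audio tokens and then into the number of text tokens that would cost the same. The 10 tokens-per-second figure is the approximate rate quoted above; both prices are hypothetical placeholders, not published rates, so substitute the current pricing for your model.

```python
# Hypothetical placeholder prices in USD per million tokens (NOT published rates).
AUDIO_PRICE_PER_M = 40.00
TEXT_PRICE_PER_M = 2.50
TOKENS_PER_SECOND = 10  # approximate rate quoted in the text


def text_token_equivalent(seconds: float) -> float:
    """Return how many text tokens cost the same as `seconds` of audio input."""
    audio_tokens = seconds * TOKENS_PER_SECOND
    # Equal spend at both rates:
    # audio_tokens * audio_price == text_tokens * text_price
    return audio_tokens * AUDIO_PRICE_PER_M / TEXT_PRICE_PER_M


print(text_token_equivalent(60))
```

With these placeholder prices, one minute of audio costs as much as 9,600 text tokens. The ratio of the two prices is what matters, so the result changes whenever either rate does.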

Does a speech-to-speech turn count both input and output audio?

Yes. A realtime voice round-trip bills the incoming audio as input and the model's spoken reply as output, plus any text tokens. Estimate each leg separately and add them.

What about text accompanying the audio?

Text tokens in the same request are billed at the model's normal text rate, which is much cheaper than audio. The tool adds them at the text rate so your total reflects both modalities.

Is my audio uploaded?

No. You only enter a duration in seconds. Nothing is uploaded or stored.

Audio Token Cost Estimator

GPT-4o Audio and the realtime API bill spoken audio as tokens, not as minutes, and the per-token rate is far higher than text. Because cost scales with audio duration, a short voice agent loop can become surprisingly expensive at volume. This tool converts seconds of audio into token equivalents and a dollar figure so you can size a voice feature before you ship it.
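
As a rough sizing sketch, a per-call estimate can be scaled to a monthly figure. The function name, the $0.02 per-call cost, and the call volume below are made-up inputs for illustration, and the calculation assumes the same volume every day.

```python
def monthly_cost(per_call_usd: float, calls_per_day: int, days: int = 30) -> float:
    """Scale a per-call estimate to a monthly figure (assumes uniform daily volume)."""
    return per_call_usd * calls_per_day * days


# Hypothetical inputs: a $0.02 voice turn, 2,000 calls per day.
print(f"${monthly_cost(0.02, 2_000):,.2f}")
```

A per-call cost that looks negligible becomes a four-figure monthly line item at this volume, which is why it helps to run the estimate before launch.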

How audio tokens work

Audio-capable models count a fixed number of tokens per second of audio, about 10 per second in each direction, and price those audio tokens separately from text tokens. A speech-to-speech turn therefore has three components: input audio tokens, output audio tokens, and any text tokens that ride along (system prompts, transcripts, function results). This tool computes the audio leg from your duration and adds text tokens at the cheaper text rate, giving you a realistic per-call total.
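
The three-component calculation can be sketched as follows. This is an illustration of the arithmetic, not the tool's actual source: the `Rates` class, the `estimate_turn` function, and every price in it are assumptions, and a single text rate is used for simplicity even though input and output text may be priced differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical placeholder rates; replace with your model's published pricing."""
    tokens_per_second: float = 10.0  # approximate figure quoted in the text
    audio_in_per_m: float = 40.00    # USD per million input audio tokens (assumed)
    audio_out_per_m: float = 80.00   # USD per million output audio tokens (assumed)
    text_per_m: float = 2.50         # USD per million text tokens (assumed)


def estimate_turn(in_seconds: float, out_seconds: float, text_tokens: int,
                  rates: Rates = Rates()) -> dict[str, float]:
    """Return a USD cost breakdown for one speech-to-speech turn."""
    audio_in = in_seconds * rates.tokens_per_second * rates.audio_in_per_m / 1e6
    audio_out = out_seconds * rates.tokens_per_second * rates.audio_out_per_m / 1e6
    text = text_tokens * rates.text_per_m / 1e6
    return {"audio_in": audio_in, "audio_out": audio_out, "text": text,
            "total": audio_in + audio_out + text}


# 15 s of user speech, a 20 s spoken reply, 800 text tokens of prompt and tool output.
for leg, usd in estimate_turn(15, 20, 800).items():
    print(f"{leg:>9}: ${usd:.4f}")
```

Returning the breakdown rather than only the total makes it easy to see which leg dominates. With these placeholder rates the spoken reply is the largest share, even though the text leg carries the most tokens.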

Tips and notes

Trim silence and dead air before sending audio in — every second you send is billed, whether or not it carries speech.
For transcription-only workloads, a dedicated speech-to-text model is usually far cheaper than routing audio through a full multimodal chat model.
In a realtime voice loop, the model often re-hears prior audio context; cap the history window to avoid paying for the same seconds repeatedly.
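
The effect of capping history can be sketched with a deliberately simplified model: assume that on every turn the model is billed for the audio of all retained prior turns plus the new turn. That billing model, the `billed_input_seconds` function, and the sample conversation are assumptions for illustration; actual billing depends on how the API handles and prices retained context.

```python
from typing import Optional


def billed_input_seconds(turn_seconds: list[float],
                         window: Optional[int] = None) -> float:
    """Total seconds of audio billed as input across a conversation.

    Simplified model (an assumption): each turn re-bills the audio of the
    retained prior turns plus the new turn's own audio.
    `window` is the maximum number of prior turns kept; None keeps all.
    """
    total = 0.0
    for i, current in enumerate(turn_seconds):
        start = 0 if window is None else max(0, i - window)
        total += sum(turn_seconds[start:i]) + current
    return total


turns = [10.0] * 20  # twenty turns of 10 s each (made-up conversation)
print(billed_input_seconds(turns))            # full history: 2100.0
print(billed_input_seconds(turns, window=3))  # last 3 turns kept: 740.0
```

Under this model, 200 seconds of actual speech is billed as 2,100 seconds with unlimited history and 740 seconds with a three-turn window. Uncapped cost grows roughly with the square of the number of turns, while a fixed window keeps growth linear.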