How is voice API cost usually billed?

Speech-to-text is almost always billed per minute of audio. Text-to-speech is billed per character or per minute of generated speech depending on the provider. This tool normalises everything to a per-minute basis for easy comparison.

Why is ElevenLabs more expensive than OpenAI for TTS?

ElevenLabs prices premium, highly natural and clonable voices at a higher rate, while OpenAI's TTS targets a lower price point. The right choice depends on whether voice quality or cost matters more for your use case.

Do the premium tiers really cost more?

Yes. Premium STT models (higher accuracy, diarization) and premium TTS voices typically cost noticeably more per minute than standard tiers. The tool applies a multiplier so you can see the gap.

Is my data sent anywhere?

No. The calculator runs entirely in your browser. Nothing you enter is uploaded, stored or logged.

What is the Voice API Cost Calculator?

The Voice API Cost Calculator estimates text-to-speech and speech-to-text costs across OpenAI, ElevenLabs, Google and Deepgram for a given minutes-per-day workload. Pick a direction and a quality tier, then compare daily and monthly cost per provider side by side. It runs free in your browser on Gera Tools, with nothing uploaded.

Voice API Cost Calculator

Name: Voice API Cost Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Compare voice API costs across providers

Adding speech to an app means paying for text-to-speech (TTS), speech-to-text (STT), or both — and prices vary widely between OpenAI, ElevenLabs, Google and Deepgram. This calculator takes your minutes of audio per day, a direction, and a quality tier, and shows the daily and monthly cost for each provider so you can choose on real numbers instead of marketing pages.

How it works

Costs are normalised to a per-minute rate. STT providers already bill per minute of input audio. For TTS, character-based pricing is converted using an average of roughly 150 spoken words (about 900 characters) per minute. The daily cost is then:

daily_cost = minutes_per_day × per_minute_rate × tier_multiplier

When you select “both”, the tool sums the TTS and STT rates for each provider. Premium tiers apply a multiplier to reflect higher-accuracy STT models and more natural TTS voices.
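The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names, the example rates, the 2.0 premium multiplier and the 30-day month are all assumptions chosen for the example.

```python
# Assumptions (not taken from any provider's price list): the rates and the
# multiplier below are placeholders, and a month is treated as 30 days.
CHARS_PER_MINUTE = 900  # ~150 spoken words per minute, as stated in the text


def tts_rate_per_minute(price_per_million_chars: float) -> float:
    """Convert character-based TTS pricing to a per-minute rate."""
    return price_per_million_chars / 1_000_000 * CHARS_PER_MINUTE


def daily_cost(minutes_per_day: float, per_minute_rate: float,
               tier_multiplier: float = 1.0) -> float:
    """daily_cost = minutes_per_day x per_minute_rate x tier_multiplier"""
    return minutes_per_day * per_minute_rate * tier_multiplier


def monthly_cost(daily: float, days_per_month: int = 30) -> float:
    """Scale a daily figure to a month (30 days is an assumption)."""
    return daily * days_per_month


# Hypothetical provider: $15 per 1M characters (TTS), $0.006 per minute (STT).
tts_rate = tts_rate_per_minute(15.0)   # 0.0135 USD per minute
stt_rate = 0.006
both_rate = tts_rate + stt_rate        # "both" sums the TTS and STT rates

standard = daily_cost(100, both_rate)
premium = daily_cost(100, both_rate, tier_multiplier=2.0)
print(f"standard: ${standard:.2f}/day, ${monthly_cost(standard):.2f}/month")
print(f"premium:  ${premium:.2f}/day, ${monthly_cost(premium):.2f}/month")
```

Keeping the character-to-minute conversion in its own function makes it easy to swap in a different characters-per-minute figure if your content is unusually dense or sparse.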

Worked example

Suppose you run a customer-support voicebot that transcribes 60 minutes of incoming calls per day (STT, standard tier) and generates 30 minutes of spoken responses (TTS, premium voices). Because the two directions use different volumes and tiers, estimate each one separately and add the results per provider. The side-by-side daily and monthly totals then show whether the quality premium of a more expensive voice is worth the extra spend at your actual volume. A product doing 10 minutes a day barely notices the difference; one handling 500 minutes a day needs to model it carefully.
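To make the arithmetic of this example concrete, here is a small sketch with two made-up providers. The provider names, per-minute rates, premium multipliers and the 30-day month are placeholders for illustration only; substitute current figures from each provider's pricing page.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30  # assumption; the tool's own month length may differ


@dataclass
class Provider:
    name: str
    stt_per_min: float         # placeholder, USD per minute of input audio
    tts_per_min: float         # placeholder, USD per minute of generated speech
    premium_multiplier: float  # placeholder premium-tier multiplier


def voicebot_daily_cost(p: Provider, stt_minutes: float, tts_minutes: float) -> float:
    """STT at the standard tier plus TTS at the premium tier."""
    stt = stt_minutes * p.stt_per_min                          # standard: multiplier 1.0
    tts = tts_minutes * p.tts_per_min * p.premium_multiplier   # premium voices
    return stt + tts


# Hypothetical providers with invented rates.
providers = [
    Provider("Provider A", stt_per_min=0.006, tts_per_min=0.015, premium_multiplier=2.0),
    Provider("Provider B", stt_per_min=0.004, tts_per_min=0.100, premium_multiplier=1.5),
]

# 60 minutes of transcription and 30 minutes of generated speech per day.
daily = {p.name: voicebot_daily_cost(p, stt_minutes=60, tts_minutes=30) for p in providers}
for name, cost in daily.items():
    print(f"{name}: ${cost:.2f}/day, ${cost * DAYS_PER_MONTH:.2f}/month")
```

With these invented rates the premium TTS minutes dominate the total for Provider B, which is the kind of gap the side-by-side comparison is meant to surface.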

Billing models differ — here is what to watch

Different providers choose different billing units, which makes direct comparison tricky without normalisation:

Provider	TTS billing unit	STT billing unit
OpenAI	per character	per minute
ElevenLabs	per character (credits)	N/A (TTS-only)
Google Cloud	per character	per minute / per 15 s
Deepgram	per character	per minute

Character-based TTS pricing is sensitive to your content. Dense technical text uses longer words than conversational speech, so a minute of it contains more characters and costs slightly more than the 900-character average suggests.
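A rough way to see the size of this effect is to vary the average word length while holding the speaking rate fixed. The sketch below assumes a constant 150 words per minute regardless of word length (a simplification, since longer words also take longer to say) and uses a placeholder price of $15 per million characters.

```python
WORDS_PER_MINUTE = 150  # speaking rate used for normalisation in the text


def tts_cost_per_minute(price_per_million_chars: float, chars_per_word: float) -> float:
    """Per-minute TTS cost; chars_per_word includes the trailing space."""
    chars_per_minute = WORDS_PER_MINUTE * chars_per_word
    return price_per_million_chars / 1_000_000 * chars_per_minute


PRICE = 15.0  # hypothetical USD per 1M characters

conversational = tts_cost_per_minute(PRICE, 5.5)  # shorter words
baseline = tts_cost_per_minute(PRICE, 6.0)        # 900 characters per minute
technical = tts_cost_per_minute(PRICE, 7.5)       # longer words

for label, cost in [("conversational", conversational),
                    ("baseline", baseline),
                    ("technical", technical)]:
    print(f"{label}: ${cost:.6f} per minute ({cost / baseline:.0%} of baseline)")
```

The word lengths of 5.5 and 7.5 characters are illustrative guesses, not measurements; measuring your own scripts gives a better figure.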

Choosing between standard and premium tiers

The quality gap between standard and premium tiers has narrowed considerably, but premium voices still win for customer-facing audio where naturalness builds trust. Standard tiers are a better fit for internal workflows such as transcription pipelines, meeting notes, or back-office processing, where occasional errors are tolerable and the audio never reaches a user directly.

Tips for managing voice spend

Cache repeated TTS. Greetings, menu prompts and canned responses should be synthesised once and stored, not regenerated on every call.
Match the tier to the task. Use premium voices for customer-facing audio and standard tiers for internal transcription where occasional errors are tolerable.
Trim silence before STT. Voice-activity detection removes dead air so you are not billed for minutes of silence.
Batch transcription. Offline batch STT is frequently cheaper than real-time streaming for recordings that do not need an instant transcript.
Model volume growth. Costs that seem negligible at prototype scale can become the largest line item in a production app. Run the monthly estimate at 10× your current volume to check future headroom before you commit to a provider.
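Two of the tips above, caching repeated TTS and modelling growth, lend themselves to a short sketch. Everything here is hypothetical: `TTSCache`, `fake_synthesize` and `monthly_at_scale` are illustrative names, the synthesis function is a stand-in for a real provider call, and the 30-day month is an assumption.

```python
import hashlib


class TTSCache:
    """Synthesise each (voice, text) pair once and reuse the stored audio."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable(text, voice) -> bytes
        self._store = {}               # cache key -> audio bytes
        self.billed_chars = 0          # characters actually sent for synthesis

    def speak(self, text: str, voice: str = "default") -> bytes:
        key = hashlib.sha256((voice + "\x00" + text).encode("utf-8")).hexdigest()
        if key not in self._store:
            self.billed_chars += len(text)
            self._store[key] = self._synthesize(text, voice)
        return self._store[key]


def fake_synthesize(text: str, voice: str) -> bytes:
    """Stand-in for a real provider call; returns dummy bytes."""
    return f"<audio:{voice}:{text}>".encode("utf-8")


def monthly_at_scale(daily_cost: float, scale: float = 10,
                     days_per_month: int = 30) -> float:
    """Monthly cost if daily volume grows by `scale` (linear pricing assumed)."""
    return daily_cost * scale * days_per_month


cache = TTSCache(fake_synthesize)
greeting = "Thanks for calling. How can I help?"
for _ in range(1000):
    cache.speak(greeting)

print(f"characters billed for 1000 greetings: {cache.billed_chars}")
print(f"monthly cost at 10x a $1.95/day workload: ${monthly_at_scale(1.95):.2f}")
```

An in-memory dictionary keeps the example short; a production cache would more likely store audio files on disk or in object storage, keyed the same way. The growth projection assumes linear pricing and ignores volume discounts or plan tiers.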