Which TTS API sounds the most natural?

ElevenLabs is widely considered the most natural for English and offers expressive emotion control, with OpenAI TTS close behind. Google and Azure Neural voices are strong and far cheaper at high volume, while AWS Polly Neural is reliable but slightly less expressive.

Which voice APIs support voice cloning?

ElevenLabs offers instant and professional voice cloning, and Microsoft Azure offers Custom Neural Voice (gated behind an ethics review). OpenAI TTS, Google Cloud TTS, and AWS Polly Neural do not offer general-purpose cloning as of mid-2026.

How is price per character calculated?

TTS is billed per character of input text, usually quoted per 1 million characters. The prices in the table are list-price estimates and are labelled as such; tiers, free quotas, and per-voice surcharges vary, so confirm on each provider's pricing page.

What latency should I expect for real-time voice?

For conversational agents, aim for sub-300ms first-audio latency. ElevenLabs Flash and OpenAI's streaming TTS are built for low latency; standard high-quality voices have higher latency and are better suited to pre-rendered audio.

How many characters is an hour of speech?

English narration runs about 150–165 words per minute and prose averages roughly 6 characters per word including spaces, so one hour of continuous audio corresponds to roughly 54,000–60,000 input characters. An 80,000-word audiobook is about 480,000 characters.

Do SSML tags count toward TTS billing?

It varies by provider. Some bill every input character including SSML markup; others bill only the synthesized text. If your pipeline adds heavy prosody or say-as tags, confirm the billing treatment on the provider's pricing page before estimating costs from raw text length.

What is the AI Voice & TTS Model Comparison?

A side-by-side reference for text-to-speech and AI voice models, covering naturalness, language coverage, voice cloning support, latency, and price per million characters for ElevenLabs, OpenAI TTS, Google, AWS Polly, Azure, and more. It runs free in your browser on Gera Tools, with nothing uploaded.

AI Voice & TTS Model Comparison

Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Compare AI voice and TTS models at a glance

Choosing a text-to-speech model means trading off naturalness, language coverage, latency, price, and whether you need voice cloning. This table puts the major voice APIs side by side — ElevenLabs, OpenAI TTS, Google Cloud, AWS Polly, Azure, and more — so you can pick the right model for narration, IVR, audiobooks, or a real-time voice agent.

How to read the table

Naturalness is a 1–5 rating of how human the default voices sound.
Languages is the approximate number of supported languages and locales.
Cloning flags whether you can create a custom voice from a sample.
Latency is rough first-audio latency for streaming use; lower is better for conversational agents.
$/1M chars is a list-price estimate; high-volume tiers and free quotas vary.

Filter by cloning requirement and budget, search by model name, and click a column header to sort.
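
The filter, search, and sort behaviour described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the page's actual implementation: the `VoiceModel` fields mirror the table columns, and the rows are made-up placeholders, not real provider figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceModel:
    name: str
    naturalness: int              # 1-5 rating
    languages: int                # approximate languages and locales
    cloning: bool                 # custom voice from a sample?
    latency_ms: int               # rough first-audio latency
    usd_per_million_chars: float  # list-price estimate


# Placeholder rows for illustration only; NOT real provider data.
ROWS = [
    VoiceModel("model-a", 5, 30, True, 400, 150.0),
    VoiceModel("model-b", 4, 50, False, 250, 15.0),
    VoiceModel("model-c", 3, 40, False, 200, 16.0),
]


def shortlist(rows, *, need_cloning=False, max_price=None, search="",
              sort_by="usd_per_million_chars", descending=False):
    """Filter by cloning and budget, search by name, then sort by one column."""
    needle = search.strip().lower()
    kept = [
        r for r in rows
        if (not need_cloning or r.cloning)
        and (max_price is None or r.usd_per_million_chars <= max_price)
        and needle in r.name.lower()
    ]
    return sorted(kept, key=lambda r: getattr(r, sort_by), reverse=descending)


# Budget of $20 per 1M characters, fastest first.
for model in shortlist(ROWS, max_price=20, sort_by="latency_ms"):
    print(model.name, model.latency_ms)
```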

The naturalness-vs-cost trade-off in practice

The TTS market has a clear tiering. At the top end, ElevenLabs and OpenAI TTS produce output that most listeners cannot reliably distinguish from a professional voice actor on short samples. They achieve this through large neural models trained on diverse, high-quality audio. The trade-off is a higher price per character and, for real-time use, first-audio latency on the highest-quality voices that can be too slow for conversational interaction.

At the mid tier, Google Neural2 and Azure Neural voices are significantly more natural than older concatenative TTS, cover many languages, and are substantially cheaper per million characters. For high-volume applications — IVR systems, notification audio, e-learning at scale — the quality-to-cost ratio is often better than the premium options.

AWS Polly Neural occupies a similar position: reliable, scalable, well-integrated with AWS infrastructure, and less expressive than the premium neural models but appropriate for functional audio where naturalness is secondary to clarity.

The right tier depends on whether your use case puts the voice front and centre (branded content, audiobooks, consumer-facing narration) or treats it as infrastructure (IVR prompts, internal notifications, accessibility audio).

Voice cloning: what it means and what to check

Voice cloning creates a custom synthetic voice that sounds like a specific person, from a sample of their speech. It has legitimate uses — brand voice consistency, dubbing, accessibility for people who have lost their speaking voice — and serious misuse potential.

From a provider standpoint:

ElevenLabs offers instant cloning (from a short sample, lower quality) and professional cloning (longer sample, higher quality). Both are available on paid plans. The platform has use-policy restrictions on cloning voices without consent.

Azure Custom Neural Voice requires an explicit ethics review and approval process before it is unlocked. It is designed for enterprise brand voices and accessibility scenarios rather than general cloning.

OpenAI TTS, Google Cloud TTS, and AWS Polly Neural do not offer general-purpose voice cloning as of mid-2026. Google offers Custom Voice via a separate, gated programme not accessible to most developers.

If you need cloning, verify the provider’s consent requirements and terms before building on it — using a cloned voice for purposes the terms do not permit is both a legal and reputational risk.

Latency tiers for real-time use

For conversational voice agents, the target is sub-300ms first-audio latency: the time from sending text to the first audio chunk arriving. Above 500ms, conversations start to feel unnatural; above 1000ms, users notice and disengage.

Low-latency streaming models (such as ElevenLabs Flash variants and OpenAI’s streaming TTS) are built for this requirement. High-quality standard voices are typically higher latency and better suited to pre-rendering audio where the full file can be generated before playback starts — for example, generating audio for a video or a podcast episode.

Choosing a high-quality model for a real-time use case, or a low-latency model for pre-rendered content, is a common mismatch. The latency column helps you match the right tier to your architecture.
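
To check which tier a model falls into on your own workload, you can time the first audio chunk yourself. The sketch below is provider-agnostic and rests on two assumptions: `chunks` is whatever lazy iterator of audio bytes your client library returns (so the request is sent when iteration begins), and the "borderline" label for the 300–500ms band is a name chosen here, since the thresholds above leave that band unnamed.

```python
import time
from typing import Callable, Iterable


def first_audio_latency_ms(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Milliseconds from starting iteration to the first non-empty audio chunk."""
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0
    raise ValueError("stream ended without producing audio")


def classify_latency(ms: float) -> str:
    """Map a first-audio latency onto the bands described in this section."""
    if ms < 300:
        return "on target"
    if ms <= 500:
        return "borderline"  # assumption: this band is not named in the text
    if ms <= 1000:
        return "feels unnatural"
    return "users disengage"
```

If your client sends the request eagerly, before you iterate, start the clock before creating the stream instead. Measure over many requests and look at the slow tail, not just a single run.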

Tips for picking a voice model

For expressive narration and audiobooks, ElevenLabs and OpenAI TTS deliver the most lifelike output.
For high-volume IVR or notifications, Google Cloud and AWS Polly Neural cut cost dramatically with acceptable quality.
For real-time voice agents, prioritise the low-latency streaming tiers (ElevenLabs Flash, OpenAI streaming) over the highest-quality voices.
If you need a branded custom voice, only ElevenLabs and Azure Custom Neural Voice support it here — and Azure requires an ethics review.
Test on your actual text. Naturalness ratings are for typical prose; technical content, names, and abbreviations behave differently across engines.

Estimating what a TTS project actually costs

Because TTS is billed per input character, you can budget a project with simple arithmetic before touching any API. English prose averages roughly five letters per word plus a space, so 1,000 words ≈ 6,000 characters. Spoken narration runs around 150–165 words per minute, which means:

Content	Approx. words	Approx. characters
1 minute of narration	~150–165	~900–1,000
10-minute video voiceover	~1,500–1,650	~9,000–10,000
1 hour of continuous audio	~9,000–10,000	~54,000–60,000
80,000-word audiobook	80,000	~480,000
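
The table rows can be reproduced with a small helper. This is a minimal sketch using the same rules of thumb (6 characters per word, 150–165 words per minute); the function names are illustrative.

```python
CHARS_PER_WORD = 6       # roughly five letters plus a space
WPM_RANGE = (150, 165)   # typical narration pace, words per minute


def chars_for_minutes(minutes: float) -> tuple[int, int]:
    """Return the (low, high) estimate of input characters for a duration."""
    low, high = (round(minutes * wpm * CHARS_PER_WORD) for wpm in WPM_RANGE)
    return low, high


def chars_for_words(words: int) -> int:
    """Estimate input characters for a known word count."""
    return words * CHARS_PER_WORD


print(chars_for_minutes(1))     # (900, 990)
print(chars_for_minutes(60))    # (54000, 59400)
print(chars_for_words(80_000))  # 480000
```

The table rounds the upper bounds up (990 to ~1,000 and 59,400 to ~60,000), which is a sensible margin for a budget estimate.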

Scale a provider’s price per 1 million characters to your character count: at a list price of $15 per 1M characters, an 80,000-word audiobook costs about 480,000 ÷ 1,000,000 × $15 ≈ $7.20 in synthesis fees; at $150 per 1M it is about $72. That 10× spread between the cheap neural tiers and the premium expressive tiers is usually the biggest lever in a TTS budget, and it is why high-volume products usually reserve the premium voices for customer-facing audio only.
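
The same arithmetic as a function, as a minimal sketch. The $15 and $150 figures are the illustrative list prices from the paragraph above, not quotes for any specific provider.

```python
def synthesis_cost_usd(characters: int, usd_per_million_chars: float) -> float:
    """Synthesis fee for a character count at a given list price per 1M characters."""
    return characters / 1_000_000 * usd_per_million_chars


audiobook_chars = 80_000 * 6  # 80,000 words at about 6 characters per word

for price in (15.0, 150.0):
    cost = synthesis_cost_usd(audiobook_chars, price)
    print(f"${price:.0f} per 1M chars -> ${cost:.2f}")
# $15 per 1M chars -> $7.20
# $150 per 1M chars -> $72.00
```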

Two billing details catch people out. First, some providers count SSML markup as billable input characters while others bill only the spoken text — if your pipeline injects heavy <prosody> or <say-as> tags, check the billing page before estimating from raw text length. Second, regeneration is not free: iterating on pronunciation or pacing re-bills the whole passage each attempt, so budget several times the final character count for content you will polish.
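
Both effects are easy to put numbers on. The sketch below compares the two simplified billing models (markup billed or not billed) and applies a regeneration multiplier. It is an approximation: the tag-stripping regex is naive, entity escapes are ignored, and real providers may count characters differently, so treat the output as a planning estimate.

```python
import re

_TAG = re.compile(r"<[^>]+>")  # naive SSML tag matcher, adequate for an estimate


def billable_chars(ssml: str, *, markup_billed: bool) -> int:
    """Count characters under one of two simplified billing models."""
    return len(ssml) if markup_billed else len(_TAG.sub("", ssml))


def budgeted_chars(final_chars: int, attempts: int) -> int:
    """Every regeneration re-bills the whole passage."""
    return final_chars * attempts


ssml = '<speak>Total: <say-as interpret-as="cardinal">1024</say-as> items.</speak>'
spoken_only = billable_chars(ssml, markup_billed=False)  # 18, "Total: 1024 items."
with_markup = billable_chars(ssml, markup_billed=True)   # full string length
print(spoken_only, with_markup)

# A 480,000-character audiobook polished over three passes:
print(budgeted_chars(480_000, attempts=3))  # 1440000
```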

Failure modes to test before committing

Numbers, dates, and units. “1,024 KB” or “3/4” can be read as digits, fractions, or dates depending on the engine. Test your actual formats; SSML <say-as> fixes most cases but adds markup overhead.
Acronyms and product names. Engines differ on whether “SQL” is spelled out or pronounced “sequel”. Custom lexicons or phoneme tags are the reliable fix, and not every provider supports them.
Code-switching. Mixed-language text (an English sentence with a French name) is where naturalness ratings collapse. If your content mixes languages, test exactly that — a model’s headline language count says nothing about mid-sentence switching.
Long-form drift. Some voices sound excellent for a paragraph but develop pacing monotony over a 40-minute chapter. Evaluate on a full-length sample, not the provider’s demo widget.
Streaming chunk boundaries. In real-time use, sentence-boundary handling differs: some engines produce audible seams if you flush text mid-clause. Test with your actual token-streaming pattern from the LLM.
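
For the streaming case, one common mitigation (an assumption here, not something any provider mandates) is to buffer LLM tokens and send text to the TTS engine only at sentence boundaries. A minimal sketch with a deliberately simple boundary rule:

```python
import re
from typing import Iterable, Iterator

# Sentence end: terminal punctuation, optional closing quote or bracket, then whitespace.
_SENTENCE_END = re.compile(r"[.!?][\"')\]]*\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a token stream; flush the remainder at the end."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()


tokens = ["Hel", "lo the", "re. How", " are", " you? ", "Fine"]
print(list(sentence_chunks(tokens)))
# ['Hello there.', 'How are you?', 'Fine']
```

The rule splits after abbreviations such as "Dr. " and waits for whitespace after the punctuation, which adds a little latency. Tune it against your own token stream, and compare the audio with and without buffering on the engine you are evaluating.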

Sources and references

TTS capabilities and list prices move quickly. The feature flags and price ranges in the table are compiled from each provider’s own documentation and pricing pages, which are the authoritative source to confirm before you commit to a vendor.

Maintained by the Gera Tools editorial team. Naturalness and latency ratings reflect typical English prose on default voices and are directional, not benchmark scores; always trial a model on your own text and workload. Provider pricing and cloning policies change frequently; verify against each provider’s own pages. Last reviewed 2026-07-02.