Compare speech-to-text models at a glance
Choosing a transcription API means balancing accuracy, language coverage, real-time vs batch support, speaker diarization, and price per minute. This table puts the major speech-to-text models side by side — OpenAI Whisper, Deepgram, AssemblyAI, Rev.ai, Google, and Azure — so you can pick the right model for captions, meeting notes, call analytics, or media transcription.
How to read the table
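- WER is the approximate word error rate on clean English audio: substituted, deleted, and inserted words divided by the number of words in the reference transcript. Lower is better.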
- Languages is the number of supported languages.
- Real-time indicates whether the API supports low-latency streaming for live captioning or only batch (asynchronous) transcription.
- Diarization flags built-in speaker labelling.
- $/min is a list-price estimate per audio minute; add-on models can cost more.
Filter by real-time requirement and budget, search by model name, and click a column header to sort.
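To make the WER column concrete, here is a minimal sketch of how word error rate is commonly computed: a word-level edit distance divided by the reference length. The function name and the normalisation (lowercasing and whitespace splitting only) are assumptions for illustration; published benchmarks usually also strip punctuation and normalise numbers, so their figures will differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative only: normalisation here is just lowercasing and
    whitespace tokenisation, which is simpler than most benchmarks use.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (classic Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.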
Tips for picking a model
- For live captions and voice agents, prioritise providers with real-time streaming; Deepgram and AssemblyAI lead on latency and accuracy.
- For bulk media or podcast transcription, batch APIs such as Whisper and Rev.ai are cost-effective and highly accurate on clean audio.
- For meetings and call analytics, make sure diarization and word-level timestamps are supported.
- If privacy or cost at scale dominates, self-hosting the open-source Whisper model removes per-minute fees, but you pay for GPU compute and operate the pipeline yourself.
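The self-hosting trade-off in the last tip can be estimated with simple arithmetic. The sketch below is a rough model, not a pricing tool: every number in it (API price, GPU hourly rate, processing speed, fixed monthly overhead) is a hypothetical placeholder rather than a figure from the table, and the function names are made up for illustration.

```python
from typing import Optional


def api_cost(audio_minutes: float, price_per_min: float) -> float:
    """Monthly cost of a hosted API billed per audio minute."""
    return audio_minutes * price_per_min


def self_host_cost(audio_minutes: float, gpu_cost_per_hour: float,
                   speedup: float, fixed_overhead: float) -> float:
    """Monthly cost of self-hosting.

    speedup: audio minutes transcribed per wall-clock minute (assumed).
    fixed_overhead: monthly engineering/ops cost independent of volume (assumed).
    """
    gpu_hours = audio_minutes / 60 / speedup
    return gpu_hours * gpu_cost_per_hour + fixed_overhead


def break_even_minutes(price_per_min: float, gpu_cost_per_hour: float,
                       speedup: float, fixed_overhead: float) -> Optional[float]:
    """Audio minutes per month above which self-hosting is cheaper.

    Returns None when the GPU cost per audio minute is not below the
    API price, in which case self-hosting never pays off on cost alone.
    """
    gpu_cost_per_audio_min = gpu_cost_per_hour / 60 / speedup
    saving_per_min = price_per_min - gpu_cost_per_audio_min
    if saving_per_min <= 0:
        return None
    return fixed_overhead / saving_per_min


if __name__ == "__main__":
    # Hypothetical inputs: $0.006/min API, $1.00/h GPU,
    # 10x faster than real time, $500/month fixed overhead.
    minutes = break_even_minutes(0.006, 1.00, 10, 500)
    print(f"Break-even at roughly {minutes:,.0f} audio minutes per month")
```

The model ignores idle GPU time, storage, and retries, so it tends to flatter self-hosting; treat the result as a lower bound on the volume you would need.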