Question 1

What is automatic speech recognition?

Accepted Answer

Automatic speech recognition (ASR) is the task of converting spoken audio into written text. A system splits the waveform into short overlapping frames (typically around 25 ms each), converts each frame into acoustic features, maps those features to speech sounds or characters, and outputs the most likely sequence of words. Modern systems are usually trained end to end on large datasets and can approach human transcription accuracy on clean, clearly spoken audio.
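The framing step can be sketched in a few lines. This is a minimal illustration, not a production front end: it assumes 16 kHz audio, a 25 ms window (400 samples) and a 10 ms hop (160 samples), and it uses log energy as a crude stand-in for real acoustic features such as log-mel filterbanks. The function names `frame_signal` and `log_energy` are made up for this example.

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Split a 1-D list of samples into overlapping frames (no padding)."""
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // hop
    return [samples[i * hop : i * hop + frame_len] for i in range(n_frames)]

def log_energy(frame):
    """Log of the mean squared amplitude of one frame."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return math.log(mean_sq + 1e-10)  # small constant avoids log(0) on silence

# One second of a synthetic 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
wave = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

frames = frame_signal(wave)
features = [log_energy(f) for f in frames]
print(len(frames), "frames of", len(frames[0]), "samples each")
```

A real system would replace `log_energy` with a multi-dimensional feature vector per frame, but the shape of the data is the same: one feature entry per short slice of audio.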

Question 2

What is CTC loss and why is it used?

Accepted Answer

Connectionist Temporal Classification (CTC) is a training objective that lets a model align audio frames to text without needing a per-frame label for every sound. It introduces a blank symbol and defines a valid alignment as any frame-level sequence that yields the target text once repeated symbols are merged and blanks are removed. The loss sums the probability of all valid alignments, so the model can learn that many audio frames map to a single character. This solves the core problem that audio and text have different, unaligned lengths.
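To make "sums over all valid alignments" concrete, here is a small pure-Python sketch. It assumes symbol index 0 is the blank and that `probs[t][k]` holds the model's per-frame probability of symbol `k`; the numbers are invented. `ctc_prob` uses the standard forward recursion, and `brute_force_prob` enumerates every frame-level path as a sanity check that only works for tiny inputs.

```python
import math
from itertools import product

BLANK = 0

def collapse(path):
    """CTC collapse: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

def ctc_prob(probs, target):
    """Probability of `target` summed over all alignments (forward algorithm)."""
    ext = [BLANK]                      # target interleaved with blanks
    for s in target:
        ext += [s, BLANK]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][BLANK]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]                    # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1][s - 1]           # advance by one
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]           # skip the blank between
            alpha[t][s] = total * probs[t][ext[s]]
    result = alpha[T - 1][S - 1]
    if S > 1:
        result += alpha[T - 1][S - 2]
    return result

def brute_force_prob(probs, target):
    """Same quantity by enumerating every path; exponential, for checking only."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path) == list(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total

# Four frames, three symbols (blank, 1, 2); each row sums to 1.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
p_forward = ctc_prob(probs, [1, 2])
p_brute = brute_force_prob(probs, [1, 2])
ctc_loss = -math.log(p_forward)        # the training loss is the negative log
```

The forward recursion gives the same value as the enumeration but in time proportional to frames times target length, which is what makes CTC practical for real utterances.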

Question 3

How does OpenAI Whisper achieve such high accuracy?

Accepted Answer

Whisper is an encoder-decoder transformer trained on a very large, diverse dataset of around 680,000 hours of multilingual audio with weak supervision. That scale and diversity make it robust to accents, background noise, and many languages, and it handles transcription, translation into English, and timestamping in one model. Its accuracy on clean speech rivals human transcribers in many settings.
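As a rough illustration of the "one model, several tasks" point, the sketch below assumes the open-source `openai-whisper` Python package (which also needs ffmpeg installed). `whisper.load_model` and `model.transcribe` are that package's calls; the wrapper names `transcribe_file` and `format_segments` and the file name `meeting.wav` are placeholders invented for this example.

```python
def transcribe_file(path, model_name="base", translate=False):
    """Run Whisper on an audio file and return its result dictionary."""
    import whisper  # imported lazily so format_segments works without the package

    model = whisper.load_model(model_name)
    # The same model transcribes in the source language or translates to English.
    task = "translate" if translate else "transcribe"
    return model.transcribe(path, task=task)

def format_segments(segments):
    """Render timestamped segments as '[start -> end] text' lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}")
    return lines

# Example usage (requires the package, ffmpeg, and a real audio file):
# result = transcribe_file("meeting.wav")
# print(result["text"])
# print("\n".join(format_segments(result["segments"])))
```

Larger model names generally trade speed for accuracy, so the choice of `model_name` depends on the hardware and latency budget available.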

Question 4

What is language-model rescoring?

Accepted Answer

An acoustic model proposes several candidate transcriptions, some of which sound plausible but are grammatically unlikely. Language-model rescoring re-ranks those candidates using a model of how likely each word sequence is in real language, favouring fluent, sensible text. It corrects errors the acoustic model alone would make, such as confusing similar-sounding words.
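A toy version of rescoring might look like the following. Everything here is illustrative: the bigram probabilities are made up, the two hypotheses and their acoustic scores are invented, and `lm_weight` is a hypothetical tuning parameter. The combined score is simply the acoustic log-probability plus a weighted language-model log-probability.

```python
import math

# Toy bigram log-probabilities (invented numbers, not from a real corpus).
BIGRAM_LOGP = {
    ("<s>", "recognize"): math.log(0.2),
    ("recognize", "speech"): math.log(0.5),
    ("<s>", "wreck"): math.log(0.01),
    ("wreck", "a"): math.log(0.3),
    ("a", "nice"): math.log(0.1),
    ("nice", "beach"): math.log(0.05),
}
UNSEEN_LOGP = math.log(1e-4)  # flat penalty for bigrams the toy model has not seen

def lm_score(words):
    """Log-probability of a word sequence under the toy bigram model."""
    prev, total = "<s>", 0.0
    for w in words:
        total += BIGRAM_LOGP.get((prev, w), UNSEEN_LOGP)
        prev = w
    return total

def rescore(nbest, lm_weight=0.5):
    """Re-rank (text, acoustic_logp) pairs by the combined score, best first."""
    scored = [(am + lm_weight * lm_score(text.split()), text) for text, am in nbest]
    scored.sort(reverse=True)
    return [(text, score) for score, text in scored]

nbest = [
    ("wreck a nice beach", -10.0),  # slightly better acoustic score
    ("recognize speech", -10.5),
]
best_text, best_score = rescore(nbest)[0]
print(best_text)
```

With the language model switched off (`lm_weight=0`) the acoustically preferred hypothesis stays on top; with it on, the more fluent one wins. Summed log-probabilities penalise longer hypotheses, so practical systems often add a length or word-insertion term, which this sketch omits.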

What Is AI Speech Recognition? How Machines Transcribe Audio

What speech recognition is

From audio waveform to features

The alignment problem and CTC loss

Transformers, Whisper, and language-model rescoring

Where speech recognition is used