Measuring the Accuracy of Automatic Speech Recognition Solutions
Korbinian Kuhn, Verena Kersken, Benedikt Reuter, Niklas Egger, Gottfried Zimmermann · 2024 · ACM Transactions on Accessible Computing · doi:10.1145/3636513
Summary
This study provides independent, comprehensive benchmarking of 11 common automatic speech recognition (ASR) services to assess their real-world accuracy for accessibility purposes. It addresses a critical gap: vendors claim "state-of-the-art accuracy" and scientific publications report that ASR has reached "human parity," yet the d/Deaf and hard of hearing (DHH) community continues to report serious problems with caption quality. The researchers created a novel dataset of 120 audio samples (90 YouTube recordings of Higher Education lectures plus 30 LibriSpeech control samples), totaling 221 hours of audio and generating 3,840 individual transcriptions. They tested Amazon AWS, AssemblyAI, Deepgram, Google Cloud, IBM Watson, Microsoft Azure, Rev AI, Speechmatics, SpeechText.AI, Tencent, and OpenAI Whisper across multiple conditions: batch versus streaming transcription, with and without custom vocabularies, and English, German, and English as a Second Language (ESL) speech. The evaluation was fully automated with a Node.js script to keep results comparable across vendors, and transcripts were normalized extensively (using Whisper's normalizer) before Word Error Rate (WER) was calculated. The study deliberately used realistic Higher Education content rather than standard benchmark datasets, because models trained on public datasets can overfit to them and show artificially low error rates.
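To make the metric concrete, the following is a minimal sketch of how WER can be computed after normalization. It is not the study's pipeline: the study used a Node.js script and Whisper's normalizer, whereas the `normalize` function here is a deliberately simplified stand-in, and the function names are chosen only for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified normalizer: lowercase, drop punctuation, split on whitespace.

    Assumption: illustration only. Whisper's normalizer, which the study used,
    does considerably more (for example, standardizing numbers and spellings).
    """
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # punctuation -> space, keep apostrophes
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("The cat sat on the mat.", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. The choice of normalizer also changes the score, which is why a single normalizer applied to every vendor's output matters for comparability.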
Key findings
Average WER across all vendors for English content was 7.0%, but accuracy varied dramatically, from 0% to 53.8% WER across individual samples. OpenAI Whisper (open source, self-hosted) achieved the best results: 2.9% WER for English and 3.3% for LibriSpeech. Among commercial services, Speechmatics (3.3% English), Amazon (4.4%), Microsoft (4.4%), and AssemblyAI (4.5%) performed well, while Google performed poorly on English (20.1% WER) despite strong German results. Streaming ASR, which is used for live events, had a higher error rate (10.9% WER) than batch transcription (9.37%), and the difference was statistically significant (p < 0.01). This matters because live captioning for meetings and events is where DHH users most need reliable accuracy. Adding custom vocabularies (technical terms, names, abbreviations) did not significantly improve overall accuracy, although the supplied terms did appear more often in the transcripts. ASR confidence scores were not reliable indicators of actual accuracy: some services reported high confidence for incorrect transcriptions. No single vendor consistently achieved the lowest WER across all samples, and even the best-performing services showed high variance, so users cannot predict whether any given transcription will be accurate.
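The point about variance can be checked on any set of per-sample scores. The sketch below assumes per-sample WER values are already available for each vendor and reports the spread alongside the mean, plus how often each vendor is best on a sample. The vendor names and numbers are placeholders, not results from the study, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def summarize(wer_by_vendor: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Mean, spread, and range of per-sample WER for each vendor."""
    return {
        vendor: {
            "mean": mean(scores),
            "stdev": pstdev(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for vendor, scores in wer_by_vendor.items()
    }


def wins_per_vendor(wer_by_vendor: dict[str, list[float]]) -> dict[str, int]:
    """Count how often each vendor has the lowest WER on a sample.

    Assumes every vendor has a score for every sample, in the same order.
    Ties go to the vendor listed first.
    """
    vendors = list(wer_by_vendor)
    n_samples = len(wer_by_vendor[vendors[0]])
    wins = {vendor: 0 for vendor in vendors}
    for i in range(n_samples):
        best = min(vendors, key=lambda v: wer_by_vendor[v][i])
        wins[best] += 1
    return wins


# Placeholder values for illustration only, NOT data from the study.
example = {
    "vendor_a": [0.02, 0.10, 0.04],
    "vendor_b": [0.05, 0.03, 0.06],
}

if __name__ == "__main__":
    print(summarize(example))
    print(wins_per_vendor(example))
```

A vendor with the lowest mean can still lose on a meaningful share of individual samples, which is the pattern a single headline figure hides.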
Relevance
This research provides essential evidence for accessibility practitioners and organizations evaluating ASR solutions. The key finding that accuracy is unreliable, varying widely even for the same vendor across different samples, means ASR cannot be trusted as a standalone accessibility solution without human review. For live events, the significantly higher error rate of streaming ASR (used for real-time captioning) reinforces why the DHH community has advocated for professional captioners or human-monitored ASR rather than fully automated solutions. These data also support the National Association of the Deaf's petition to the FCC for ASR quality standards. Practically, organizations should: (1) not rely solely on vendor accuracy claims, which are often based on benchmark datasets that don't reflect real-world performance; (2) prioritize batch transcription with human editing over live ASR when possible; (3) consider open-source Whisper for cost-effective, high-accuracy batch transcription; and (4) recognize that even 5% WER means roughly one error every 20 words on average, so a typical sentence can contain one or more errors, which compounds comprehension difficulty for caption-dependent users.
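The arithmetic behind the 5% figure can be made explicit with a rough sketch. It rests on a simplifying assumption that is not from the study: each word is wrong independently with probability equal to the WER. Real errors tend to be unevenly distributed, and WER also counts insertions, so treat the output as an order-of-magnitude guide only.

```python
def expected_errors(wer: float, n_words: int) -> float:
    """Average number of word errors in a passage of n_words at the given WER."""
    return wer * n_words


def prob_at_least_one_error(wer: float, n_words: int) -> float:
    """Chance that a passage of n_words contains at least one error.

    Assumption: errors are independent per word, which real ASR output
    does not strictly satisfy.
    """
    return 1 - (1 - wer) ** n_words


if __name__ == "__main__":
    for words in (10, 20, 40):
        print(
            words,
            expected_errors(0.05, words),
            round(prob_at_least_one_error(0.05, words), 2),
        )
```

Under this assumption, a 20-word sentence at 5% WER contains one error on average, and roughly two out of three such sentences contain at least one error.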
Tags: automatic speech recognition · ASR · captions · deaf and hard of hearing · transcription · accessibility testing · word error rate · machine learning · higher education
Standards referenced: WCAG 2.2 · FCC Captioning Requirements