The Effects of Automatic Speech Recognition Quality on Human Transcription Latency
Yashesh Gaur · 2015 · ASSETS '15: Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility · doi:10.1145/2700648.2811331
Summary
This paper investigates a practical question for accessibility: when does providing automatic speech recognition (ASR) output help human captionists work faster, and when does it slow them down? Converting speech to text is fundamental for making audio content accessible to deaf and hard-of-hearing people, but human transcription is expensive (requiring 3-4 times the audio duration), while ASR alone remains unreliable in many real-world settings. A hybrid approach, in which humans edit ASR output rather than transcribe from scratch, seems promising, but its effectiveness depends critically on ASR accuracy. The researcher conducted a between-subjects study with 160 participants on Amazon Mechanical Turk. Using the TEDLIUM dataset and the Kaldi speech recognition toolkit, 16 one-minute audio clips were decoded at varying quality levels by adjusting the speech decoder's beam-width parameter, producing ASR transcripts with word error rates (WER) ranging from 15% to 55%. Ten groups of 16 workers each either transcribed clips from scratch (control) or edited ASR output at one of nine error-rate levels. A web-based interface logged keystrokes and timing, and quality control required submitted transcripts to have less than 10% WER compared to ground truth.
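The summary does not say how WER was computed in the study. As a minimal sketch, the standard definition is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The function names, the lowercasing step, and whitespace tokenization below are illustrative assumptions, not the paper's tooling; only the 10% cutoff comes from the text above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumption: case-insensitive comparison with whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def passes_quality_control(reference: str, submission: str,
                           max_wer: float = 0.10) -> bool:
    """Accept a submission only if its WER is strictly below max_wer."""
    return word_error_rate(reference, submission) < max_wer
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is one reason "accuracy" and "1 - WER" are only approximately interchangeable.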
Key findings
The critical threshold is 30% WER: when WER is under 30% (roughly, better than 70% word accuracy), editing ASR output is faster than transcribing from scratch. Above 30% WER, workers are faster typing everything themselves. Interestingly, latency does not keep rising with error rate; it peaks at approximately 45% WER and then begins to decrease. Log analysis revealed why: at very high error rates (above 50% WER), 42.5% of workers simply deleted large portions of the provided text and started over, effectively reverting to from-scratch transcription. In contrast, only 7.1% of workers cleared significant text when WER was under 30%, and 12.3% did so for WERs under 50%. This behavior suggests a psychological tipping point: when errors are sparse, workers engage in the cognitively demanding task of identifying and correcting individual errors; when errors are overwhelming, workers recognize the futility of editing and switch strategies. The 30-50% WER zone represents the worst of both worlds: too many errors to edit efficiently, but not enough to trigger abandonment of the editing approach.
Relevance
This research provides actionable guidance for designing hybrid human-AI captioning systems. The 30% WER threshold offers a practical decision point: systems could automatically assess ASR confidence and choose whether to present output to human editors or prompt them to transcribe from scratch. This could optimize both cost and latency in real-time captioning services. For accessibility practitioners, the finding reinforces that ASR quality matters enormously for practical deployment. A system achieving 25% WER provides genuine time savings for human editors, while a system at 35% WER actively harms productivity: a seemingly small accuracy difference has major operational implications. As ASR technology improves, periodic reassessment of these thresholds would be valuable. The study also illuminates the cognitive difference between transcription and error correction. Writing from scratch involves relatively automatic auditory-to-motor translation, while editing requires reading, comparing against audio, identifying discrepancies, and making targeted corrections. This more demanding process explains why mediocre ASR output is worse than no ASR at all.
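The decision point described above can be sketched as a small routing function. This is an illustration under stated assumptions, not a system from the paper: `Mode`, `choose_mode`, and the `estimated_wer` input are hypothetical names, and how a deployed system would estimate WER from recognizer confidence is left open. Only the 30% threshold comes from the study; routing a clip at exactly 30% to from-scratch transcription is an arbitrary choice here.

```python
from enum import Enum


class Mode(Enum):
    EDIT_ASR = "edit_asr"          # show the ASR transcript for correction
    FROM_SCRATCH = "from_scratch"  # show an empty box; worker types everything


# Threshold reported in the study: editing was faster only below 30% WER.
WER_THRESHOLD = 0.30


def choose_mode(estimated_wer: float,
                threshold: float = WER_THRESHOLD) -> Mode:
    """Pick the worker task for a clip from its estimated WER.

    estimated_wer is a fraction (0.25 means 25%), assumed to come from
    some upstream quality estimator that is not specified here.
    """
    if estimated_wer < 0.0:
        raise ValueError("estimated_wer must be non-negative")
    if estimated_wer < threshold:
        return Mode.EDIT_ASR
    return Mode.FROM_SCRATCH
```

With this rule, a clip estimated at 25% WER would be routed to editing and one at 35% WER to from-scratch transcription, matching the two cases contrasted in the text. Keeping the threshold as a parameter makes it easy to re-tune as ASR systems and worker populations change.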
Tags: deaf · hard of hearing · automatic speech recognition · ASR · captioning · crowdsourcing · transcription · human computation