Enhancing Accessibility through Correction of Speech Recognition Errors
John-Mark Bell · 2007 · SIGACCESS Accessibility and Computing · doi:10.1145/1328567.1328572
Summary
This paper investigates methods for automatically correcting errors in captions of university lectures generated by automatic speech recognition (ASR), aiming to improve accessibility for hearing-impaired students. The author notes that while ASR-based captioning can make lectures accessible by providing real-time text of what lecturers say, accuracy remains a significant barrier: research showed that only 40% of faculty participants reached the 85% accuracy benchmark, with a mean accuracy of 77% across all participants. At these error rates, captions can distort lecture content by inserting or removing critical words like "not," which makes them misleading rather than merely incomplete. The existing alternatives (sign-language interpreters, stenographers, and third-party note-takers) are limited by cost, availability, and quality. The paper identifies five approaches to automatic post-processing of ASR output: domain-specific statistical models that "translate" between ASR output and intended speech using lecture-specific language patterns; linguistic analysis that applies grammatical knowledge beyond the narrow trigram models used by ASR engines; candidate list analysis that leverages the alternative hypotheses already generated by the ASR engine; phonemic re-segmentation to address mis-segmentation errors; and contextual error detection using statistical relationships between surrounding words.
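To make the accuracy figures above concrete, the following is a minimal sketch of how a word error rate (WER) could be computed with a word-level edit distance. It assumes that accuracy is taken as 1 minus WER, which this summary does not confirm; the function name `word_error_rate` and the example sentences are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: the paper's exact scoring procedure is not
    described in this summary.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the caption drops the word "not".
spoken = "the reaction is not reversible"
caption = "the reaction is reversible"
wer = word_error_rate(spoken, caption)
accuracy = 1.0 - wer  # assumed definition of accuracy
```

In this toy case a single deleted word out of five still leaves most of the caption intact, yet the caption states the opposite of what was said, which is the kind of distortion the summary describes.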
Key findings
Three preliminary studies produced concrete results. First, an analysis of multiple human editors showed that a single editor could correct on average 24% of ASR errors in real time, but adding more editors yielded diminishing returns: two editors corrected 44% at best, and additional editors provided little extra benefit, likely because editors tend to catch the same obvious errors. This establishes a benchmark for automatic correction. Second, analysis of ASR candidate lists showed promising theoretical potential: for a mean initial word error rate of 22%, an absolute reduction of 7 percentage points (to 15%) was achievable, corresponding to correction of 32% of errors, which already exceeds single-editor human performance. Third, a machine translation approach using finite-state transducers trained on ASR input-output pairs corrected 27.7% of errors in the best model (type-ii extended symbols with trigram or four-gram models). The type-ii extended symbol approach consistently outperformed type-i, while trigram and four-gram models performed similarly, likely due to the small training corpus.
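The candidate list figures above can be checked with a short calculation, and the idea of an upper bound on candidate-list correction can be sketched on toy data. This is a rough illustration under stated assumptions: the helper names `relative_error_reduction` and `oracle_correctable` are hypothetical, the toy sentence is invented, and the sketch assumes one candidate list per reference word (substitution errors only), which may not match how the paper aligned its data.

```python
def relative_error_reduction(wer_before: float, wer_after: float) -> float:
    """Fraction of the original errors that were removed."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before


def oracle_correctable(reference, candidates):
    """Count top-choice errors and how many a perfect selector could fix.

    reference:  list of intended words
    candidates: one list per word; candidates[i][0] is the ASR top choice
                and the remaining entries are its alternative hypotheses
    """
    errors = 0
    fixable = 0
    for ref_word, cands in zip(reference, candidates):
        if cands[0] != ref_word:
            errors += 1
            if ref_word in cands[1:]:
                fixable += 1
    return errors, fixable


# Figures reported in the summary: WER of 22% reduced to 15%.
candidate_list_gain = relative_error_reduction(0.22, 0.15)
single_editor_gain = 0.24  # real-time human editor benchmark from the summary

# Invented toy data, not from the paper.
reference = ["the", "cell", "does", "not", "divide"]
candidates = [["the"], ["sell", "cell"], ["does"], ["knot", "not"], ["decide", "derive"]]
errors, fixable = oracle_correctable(reference, candidates)
```

The first helper reproduces the reported relationship: 7 points removed from 22 is roughly 32% of the errors, above the 24% single-editor benchmark. The second shows why such a figure is a theoretical ceiling: it counts an error as fixable whenever the intended word appears anywhere in the alternatives, without saying how a real system would choose it.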
Relevance
This research addresses a problem that remains highly relevant to accessibility practitioners today: the gap between ASR accuracy and the level needed for reliable comprehension. While ASR technology has improved dramatically since 2007, the core insight remains applicable: post-processing correction can meaningfully improve caption quality even when the underlying recognition engine cannot be modified. The finding that human editors have limited real-time correction capacity (24% for a single editor) provides a useful benchmark when evaluating whether automatic methods add value. For organizations implementing live captioning in educational or workplace settings, the paper highlights that error correction is a distinct and valuable layer in the captioning pipeline, separate from improving the speech recognizer itself. The candidate list approach is particularly elegant because it leverages information the ASR engine already generates but discards. The work also underscores the particular risk of caption errors in educational settings, where incorrect information can be worse than no information at all.
Tags: automatic speech recognition · captioning · deaf and hard of hearing · higher education · natural language processing · word error rate · caption quality metric