Reviewing Speech Input with Audio: Differences between Blind and Sighted Users

Jonggi Hong, Christine Vaing, Hernisa Kacorri, Leah Findlater · 2020 · ACM Transactions on Accessible Computing · doi:10.1145/3382039

Summary

This paper investigates how blind users identify automatic speech recognition (ASR) errors when reviewing dictated text through audio only, a critical yet understudied aspect of speech-based text entry. While speech input is a primary interaction method for blind mobile users (reported as 10 times faster than touchscreen keyboards with screen readers), the process of catching ASR errors without visual feedback poses unique challenges. The researchers hypothesized that blind users' extensive experience with synthesized speech and screen readers would give them an advantage in detecting ASR errors through audio.

The study recruited 12 blind screen reader users (ages 23-67) and 12 sighted participants (ages 19-31) for a two-part evaluation. First, semi-structured interviews (~30 minutes) explored participants' experiences with speech input, synthesized speech, and ASR errors across different devices and contexts. Second, a controlled speech dictation task had participants compose short text messages and emails in response to 30 scenario prompts, then identify any ASR errors by listening to the text-to-speech output of their dictated text. The screen remained blank for all participants to ensure audio-only review. Participants could listen to the synthesized speech only once per trial and reported errors verbally.

The study examined differences in usage patterns, concerns about errors, and actual ability to identify ASR errors. Blind participants used synthesized speech significantly more frequently (92% daily vs. 25% for sighted) and preferred faster speech rates (250-780 WPM compared to the default 200 WPM). Blind participants also used speech input for dictation more frequently and expressed deeper concerns about ASR errors, particularly in professional contexts or when communicating with colleagues.

Key findings

Counter to the researchers' hypothesis, blind participants were not significantly better at identifying ASR errors through audio. Both groups detected only about 40% of errors: mean recall was 0.42 (SD = 0.13) for blind participants and 0.38 (SD = 0.16) for sighted participants, with no statistically significant difference. This finding is striking given blind participants' substantially greater experience with synthesized speech and screen readers.

Error identification became significantly harder with longer messages. In short scenario trials (1-2 sentences), average recall was 0.40; in open question trials requiring longer responses, recall dropped to 0.25. This suggests that cognitive load increases substantially when reviewing longer dictated text through audio.

The study identified three distinct strategies participants used to report errors: finding the specific incorrect word(s) (the most common strategy, used in 156 trials by blind participants), counting the total number of errors, and indicating error locations. When messages were shorter and contained fewer errors, participants pointed to specific words; with longer, more error-prone messages, they switched to counting or location-based strategies.

Errors that sounded similar to the intended words were hardest to catch. Homophones ("owe you" vs. "OU"), similar-sounding words, and spacing errors ("prototype" vs. "proto type") were identified only 27-36% of the time.

Notably, blind participants spoke significantly slower than sighted participants (94.6 vs. 131.3 WPM), possibly as a compensatory strategy to reduce ASR errors, though this did not actually result in fewer errors.
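The recall figures above are the fraction of actual ASR errors that a participant reported. The sketch below shows that calculation for a single trial. It is only an illustration: the function name, the representation of errors as word positions, and the sample data are all assumptions, and this summary does not describe the paper's exact scoring procedure.

```python
def error_recall(actual_errors, reported_errors):
    """Return the fraction of actual ASR errors that were reported.

    Both arguments are collections of word positions (a hypothetical
    representation chosen for this sketch). Returns None when the
    transcript has no errors, since recall is undefined in that case.
    """
    actual = set(actual_errors)
    if not actual:
        return None
    # Only reported positions that match a real error count as found.
    found = actual & set(reported_errors)
    return len(found) / len(actual)


# Hypothetical trial: five misrecognized words, two of them reported.
actual = {2, 5, 9, 12, 14}
reported = {5, 12}
print(error_recall(actual, reported))  # 0.4
```

Under this definition, a reported position that is not a real error does not lower recall; it would affect precision, which this sketch does not compute.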

Relevance

This research reveals a significant gap in accessible speech interfaces: even experienced screen reader users miss more than half of ASR errors when reviewing through audio alone. The finding that blind users perform no better than sighted users, despite far greater experience with synthesized speech, challenges assumptions that expertise with TTS automatically transfers to error detection tasks.

For practitioners developing speech-based interfaces, several design implications emerge. First, simple audio playback is insufficient for accurate review; interfaces need mechanisms to support sentence-by-sentence review and explicit error highlighting. Second, message length significantly impacts error detection, suggesting dictation interfaces should encourage shorter segments or provide better navigation for longer texts. Third, the three strategies participants used (word-finding, counting, location-indicating) suggest different interface approaches may be needed depending on message length and error density.

The study has limitations: the relatively low word error rate (~4%) of modern ASR may have contributed to complacency, and the single-listen constraint was artificial. However, the core finding stands: blind users need better tools to confidently generate accurate, error-free text through speech input, a need that will only grow as voice interfaces become more prevalent.

Tags: speech recognition · ASR errors · screen readers · text entry · blind users · synthesized speech · dictation · audio interface