From User Perceptions to Technical Improvement: Enabling People Who Stutter to Better Use Speech Recognition

Colin Lea, Zifang Huang, Jaya Narain, Lauren Tooley, Dianna Yee, Dung Tien Tran, Panayiotis Georgiou, Jeffrey P. Bigham, Leah Findlater · 2023 · Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23) · doi:10.1145/3544548.3581224

Summary

This paper investigates how people who stutter (PWS) experience consumer speech recognition systems and demonstrates technical improvements that substantially reduce errors. The work combines user research with engineering interventions across the speech recognition pipeline.

Study 1 was an online survey of 61 PWS in the United States, screened by speech-language pathologists using the Andrews & Harris scale (12 mild, 31 moderate, 18 severe stuttering). All respondents were familiar with voice assistants (VAs) (100%) and most with dictation (90.2%), yet they use these technologies at lower rates than the general population. The most common challenges were being cut off before finishing speaking (61%), not being understood (57.6%), and having dysfluencies appear as errors in transcribed text. Almost half of participants reported that VAs cut them off always or often.

Study 2 collected speech data from 91 PWS (25 mild, 44 moderate, 22 severe), each of whom recorded 121 voice assistant commands and up to 10 dictation phrases. Detailed dysfluency annotations showed that 58.6% of utterances contained part-word repetitions, 36.4% had prolongations, 38.7% had blocks, 5.5% had whole-word repetitions, and 2.8% had interjections.

Three technical interventions were investigated: (1) tuning the endpointer model threshold to reduce premature cut-offs, (2) tuning ASR decoder parameters to better handle dysfluent speech, and (3) applying post-hoc dysfluency refinement to remove repeated words and filler words from transcriptions.
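The third intervention, removing repeated words and filler words from a transcript, can be illustrated with a few lines of Python. This is a minimal rule-based sketch, not the paper's refinement method, which this summary does not describe; the filler list and the function names are assumptions made for the example.

```python
import re

# Hypothetical filler list; the summary does not say which fillers are removed.
FILLERS = {"um", "uh", "er", "ah"}


def _key(token: str) -> str:
    """Normalize a token for comparison: strip punctuation, lowercase."""
    return re.sub(r"[^\w']", "", token).lower()


def refine_transcript(text: str) -> str:
    """Drop filler words and collapse immediately repeated words."""
    refined = []
    for token in text.split():
        key = _key(token)
        if key in FILLERS:
            continue  # filler word: remove
        if refined and _key(refined[-1]) == key:
            continue  # same word as the one just kept: remove the repeat
        refined.append(token)
    return " ".join(refined)


print(refine_transcript("um set set set a timer for uh five five minutes"))
```

A rule this simple also collapses intentional repetitions such as "very very", so a real system would need a more careful criterion.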

Key findings

The baseline endpointer truncated 23.8% of utterances from PWS, nearly 8 times the 3% target for the general population. Truncation rates correlated strongly with blocks (r=0.64, p<.001) and part-word repetitions (r=0.44, p=.004). Tuning the endpointer threshold reduced truncation by 79.1% for moderate stuttering severity, with a modest additional delay of 1.2-1.7 seconds.

Baseline word error rate (WER) was 25.4% for the Phase 2 evaluation set, compared to approximately 5% for the general population. WER varied sharply by severity: 4.8% for mild (comparable to the general population), 13.6% for moderate, and 49.2% for severe. The dominant error type was word insertions (80.9% of all errors), which correlated strongly with part-word repetitions (r=0.85, p<.001).

ASR decoder tuning reduced WER from 25.4% to 12.4% (51.2% relative improvement). Combining all three interventions reduced WER to 9.9%, a 61.2% improvement from baseline. The percentage of participants with WER under 10% improved from 48.8% to 65.9%.

A historical analysis of ASR models from 2017-2022 showed WER for PWS dropping from 29.5% to 19.9%, indicating that general improvements also benefit this population but that a gap persists.
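The WER figures and relative improvements above follow from standard definitions, which the sketch below spells out: WER is the number of word substitutions, deletions, and insertions in the best alignment divided by the number of reference words, and a relative improvement is the drop in WER divided by the baseline. The function names and the example utterance are assumptions for illustration, not taken from the paper.

```python
def word_errors(reference: str, hypothesis: str) -> tuple:
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] holds the error counts for aligning ref[:i] with hyp[:j].
    table = [[(0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (0, i, 0)  # hypothesis empty: every ref word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (0, 0, j)  # reference empty: every hyp word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match: no new error
                continue
            s, d, n = table[i - 1][j - 1]
            substitution = (s + 1, d, n)
            s, d, n = table[i - 1][j]
            deletion = (s, d + 1, n)
            s, d, n = table[i][j - 1]
            insertion = (s, d, n + 1)
            table[i][j] = min(substitution, deletion, insertion, key=sum)
    return table[len(ref)][len(hyp)]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: total errors divided by reference length."""
    return sum(word_errors(reference, hypothesis)) / len(reference.split())


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# A part-word or whole-word repetition transcribed literally shows up as insertions.
print(word_errors("set a timer", "set set set a timer"))
print(round(100 * relative_reduction(25.4, 12.4), 1))
print(round(100 * relative_reduction(25.4, 9.9), 1))
```

With the rounded WER values quoted here, the decoder-tuning step works out to 51.2%, matching the text, while the combined figure works out to about 61.0%, slightly below the 61.2% quoted; the difference may come from rounding of the underlying values.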

Relevance

This is one of the most comprehensive studies to date on speech recognition accessibility for people who stutter, combining rigorous user research with practical engineering solutions. The finding that baseline systems cut off nearly a quarter of utterances from PWS quantifies a severe usability barrier that affects daily technology use for approximately 1% of the global population. Critically, the three proposed interventions are designed to integrate with existing production systems rather than requiring entirely new models, making them realistic to deploy at scale. The paper demonstrates that relatively small adaptations (adjusting a threshold, tuning decoder parameters, and adding a post-processing step) can dramatically improve performance. For accessibility practitioners and platform developers, this work provides a blueprint for making speech interfaces more inclusive: offer adjustable endpointer sensitivity, ensure ASR models are evaluated on diverse speech patterns, and consider post-processing refinements. The survey findings about social barriers (37.3% not wanting others to hear them use VAs) are also a reminder that technical improvements alone are insufficient; the design context matters too.
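To make "adjustable endpointer sensitivity" concrete, the sketch below shows one way a threshold can control when a system stops listening. It assumes a hypothetical endpointer that emits a per-frame probability that the speaker has finished; the function name, the probabilities, and the thresholds are all invented for illustration and do not describe the model used in the paper.

```python
from typing import Optional, Sequence


def endpoint_frame(end_probs: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first frame whose end-of-speech probability reaches the threshold.

    Returns None if no frame reaches it, meaning the system keeps listening.
    """
    for index, prob in enumerate(end_probs):
        if prob >= threshold:
            return index
    return None


# Made-up probabilities: a block mid-utterance (frames 3-4) looks somewhat like
# the end of speech, while the true end of the utterance is at frames 8-9.
probs = [0.05, 0.1, 0.4, 0.6, 0.65, 0.2, 0.1, 0.3, 0.85, 0.95]

print(endpoint_frame(probs, threshold=0.5))  # stops during the block
print(endpoint_frame(probs, threshold=0.8))  # waits for the true end
```

Raising the threshold trades fewer premature cut-offs for a longer wait after the speaker finishes, which is the kind of trade-off reflected in the additional delay reported in the key findings.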

Tags: stuttering · speech recognition · voice assistants · dictation · speech accessibility · dysfluency · automatic speech recognition · speech input