Self-supervised learning using unlabeled speech with multiple types of speech disorder for disordered speech recognition
Ryoichi Takashima, Takeru Otani, Ryo Aihara, Tetsuya Takiguchi, Shinya Taguchi · 2024 · Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2024) · doi:10.1145/3663548.3688536
Summary
This paper tackles a critical barrier in accessible speech technology: automatic speech recognition (ASR) systems perform poorly for people with speech disorders because they are trained almost exclusively on typical speech. The authors, from Kobe University and Mitsubishi Electric, investigate a method to improve ASR accuracy for individuals with motor speech disorders caused by cerebral palsy, organic speech disorders from cleft lip and palate, and speech changes following tongue resection. The core challenge is that collecting labeled training data from people with speech disorders is extremely burdensome: speakers must read scripted sentences while their speech is recorded and transcribed. The researchers propose two strategies to reduce this burden. First, they use unlabeled speech (such as spontaneous daily conversation) that requires no transcription, incorporating it through self-supervised learning with the wav2vec 2.0 framework. This approach trains the model to learn speech representations by masking portions of the audio and inferring the masked content from the surrounding context, without needing text labels. Second, they pool unlabeled speech data from speakers with different types of speech disorders, hypothesizing that disordered speech shares common acoustic characteristics, such as longer durations and missing phonemes, regardless of the specific condition. The training procedure follows three steps: pre-training wav2vec 2.0 on 660 hours of typical speech, further training it on unlabeled disordered speech from multiple disorder types, and finally fine-tuning an ASR model on a small amount of labeled speech from the target user.
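To make the masking idea concrete, here is a minimal sketch of span masking in the style of wav2vec 2.0. It is an illustration, not the authors' code. The function name `sample_span_mask` is hypothetical. The defaults (start probability 0.065, span length 10) are the values reported for the original wav2vec 2.0 setup; whether this paper changed them is not stated in the summary. Drawing each start position independently is a simplification of the original sampling scheme.

```python
import random


def sample_span_mask(num_frames, start_prob=0.065, span_len=10, seed=None):
    """Return a list of booleans, True where a latent frame is masked.

    Assumption: each frame independently becomes a span start with
    probability start_prob, and the following span_len frames are masked.
    Spans may overlap and are clipped at the end of the utterance.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


if __name__ == "__main__":
    mask = sample_span_mask(num_frames=200, seed=0)
    print(f"masked {sum(mask)} of {len(mask)} frames")
```

During self-supervised training the model sees only the unmasked context and is trained to recover what belongs at the masked positions, which is why no transcription is needed at this stage.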
Key findings
The results show improvements from both proposed strategies. A baseline ASR model trained only on typical speech produced phoneme error rates (PER) ranging from 30.2% to 73.2% across nine speakers with disorders, confirming the severity of the recognition gap. When the model was fine-tuned with labeled disordered speech (but without self-supervised pre-training on disordered data), PERs dropped substantially to between 4.9% and 16.4%. The proposed method, which adds self-supervised pre-training on unlabeled disordered speech, reduced errors further, achieving PERs of 4.3% to 13.1%, with improvements across all nine speakers. Notably, pooling speech from speakers with different disorder types outperformed using only the target speaker's speech (8.7% vs. 9.2% PER for one cerebral palsy speaker), supporting the hypothesis that cross-disorder training data provides useful shared characteristics. The study validates that unlabeled, easily collected speech can meaningfully improve recognition accuracy without imposing additional burden on users.
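For readers unfamiliar with the metric, PER is conventionally computed as the edit distance (substitutions, deletions, insertions) between the recognized and reference phoneme sequences, divided by the number of reference phonemes. The sketch below assumes that standard definition; the paper's exact scoring setup is not described in this summary, and the function names and the example phoneme strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER: total edits divided by total reference phonemes."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    return total_edits / total_ref


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to the starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Made-up example: one deleted phoneme and one substituted phoneme.
    ref = ["k", "o", "N", "n", "i", "ch", "i", "w", "a"]
    hyp = ["k", "o", "n", "i", "t", "i", "w", "a"]
    print(f"PER = {phoneme_error_rate([ref], [hyp]):.3f}")
    # Pooled vs. target-only pre-training for the speaker cited above.
    print(f"relative reduction = {relative_reduction(9.2, 8.7):.3f}")
```

Read this way, the 9.2% to 8.7% difference for the cited speaker is an absolute gain of 0.5 points, or roughly a 5% relative reduction in phoneme errors.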
Relevance
This research addresses a significant accessibility gap: voice-controlled devices and speech-to-text systems remain largely unusable for people with speech disorders. As voice interfaces become more prevalent in everyday technology — from smart assistants to dictation software — people who cannot produce typical speech patterns are effectively excluded. The practical contribution here is a training methodology that reduces the data collection burden on users with disabilities while improving recognition accuracy. The finding that speech data can be shared across different disorder types is particularly promising for building more inclusive ASR systems, as it suggests that a relatively small pool of diverse disordered speech could benefit many users. Limitations include the small sample size (ten speakers), restriction to Japanese phoneme-level recognition, and focus on read speech rather than spontaneous conversation. Future work extending to character-level recognition and additional disorder types would strengthen the practical applications.
Tags: speech recognition · speech disorders · machine learning · self-supervised learning · assistive technology · cerebral palsy · dysarthria