CLARIS: Clear and Intelligible Speech from Whispered and Dysarthric Voices

Neil Shah, Yash Sonkar, Shirish Subhash Karande, Vineet Gandhi · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791734

Summary

This CHI 2026 paper introduces CLARIS (Clear and Accessible Restoration of Impaired Speech), a compact end-to-end neural speech-to-speech restoration system that converts whispered and dysarthric speech into clear, natural, intelligible voice output. The authors position voice restoration as an accessibility problem: people who cannot produce fluent voiced speech (because of impaired vocal folds, laryngectomy, dysarthria, Parkinson's disease, or stuttering, or because speaking aloud is socially inappropriate or impractical in the situation) are excluded from mainstream voice interfaces that assume clean, fluent input. CLARIS uses a unified autoregressive transformer architecture with four components: an Atypical Speech-to-Unit Transformer (AS2UT) encoder operating on mel-spectrograms, a unit-prediction decoder producing HuBERT-style speech units, auxiliary character and CTC decoders that supervise linguistic content during training, and a unit-to-speech HiFi-GAN renderer that synthesises a fixed target voice. A key innovation is the Real-Synthetic Alignment Discriminator (RSAD), which uses a gradient reversal layer to keep the encoder from overfitting to the TTS-synthesised whispers used for data augmentation. The model has 40.71M parameters, far fewer than the baselines FreeVC (354M), WESPER (142M), and QuickVC (134M), and runs in real time, taking 32 ms on GPU or 170 ms on CPU per second of input. The system is evaluated on wTIMIT English whispers, newly collected Hindi and Indian-accent English whispers, and the TORGO dysarthric corpus, using objective metrics (WER, CER, BLEU, ROUGE-L) and 20-listener MOS studies.
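The gradient reversal idea behind RSAD can be illustrated with a toy scalar model. This is a minimal sketch, not the authors' implementation: the names `GradReverse` and `domain_loss_grads`, the one-weight "encoder" and "discriminator", and the squared-error loss are all assumptions made for illustration. The point it shows is that the layer is an identity in the forward pass, while in the backward pass the gradient is sign-flipped before it reaches the encoder, so the discriminator learns to tell real from synthetic inputs while the encoder is pushed to make them indistinguishable.

```python
from dataclasses import dataclass


@dataclass
class GradReverse:
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""
    lam: float = 1.0  # hypothetical reversal strength

    def forward(self, h: float) -> float:
        return h

    def backward(self, grad_out: float) -> float:
        return -self.lam * grad_out


def domain_loss_grads(x: float, w: float, v: float, target: float, lam: float = 1.0):
    """Toy model (assumed, not from the paper):
    encoder feature h = w * x, discriminator score p = v * h,
    loss = 0.5 * (p - target)^2, where target marks real (1) vs synthetic (0).
    Returns (loss, gradient for discriminator weight v, gradient delivered to encoder weight w).
    """
    grl = GradReverse(lam)
    h = w * x                      # encoder output
    p = v * grl.forward(h)         # discriminator sees the feature unchanged
    loss = 0.5 * (p - target) ** 2

    dp = p - target                # dL/dp
    grad_v = dp * h                # discriminator receives the ordinary gradient
    grad_h = grl.backward(dp * v)  # gradient is reversed on its way to the encoder
    grad_w = grad_h * x
    return loss, grad_v, grad_w
```

With `x=2.0, w=0.5, v=3.0, target=1.0`, the discriminator gradient is positive while the encoder gradient carries the opposite sign to what plain backpropagation would give, which is the adversarial effect the layer exists to produce.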

Key findings

CLARIS achieved state-of-the-art intelligibility across all three conditions. On wTIMIT English whispers it reached 12.22% WER, compared with 23.40% for raw whispers, 24.38% for an ASR-TTS pipeline, 45.18% for WESPER, and 55.08% for DistillW2N; on unseen wTIMIT speakers it generalised to 12.04% WER. On Hindi whispers (10 speakers, ~8.9 hours) it reached 29.21% WER, compared with 43.95% raw and 91.11% for WESPER, demonstrating cross-lingual transfer without a language-specific architecture. On TORGO dysarthric speech it cut average WER from 69.17% (raw) to 31.43%, with the most dramatic gain on speaker M04 (250% → 31.43% WER; WER can exceed 100% when the transcript contains many inserted words). Personalisation with just 15-30 minutes of speaker-specific whispered audio cut WER from 76.78% (zero-shot) to 12.63%, a finding the authors frame as central to equitable voice technology: 'one-size-fits-all' models systematically fail atypical speakers. Listener MOS ratings confirmed CLARIS output as higher in quality (4.50-4.64), more intelligible (4.38-4.59), more natural, and more prosodically consistent than all baselines. Ablations showed that RSAD and the auxiliary character/CTC decoders each contributed; for example, removing CTC supervision raised WER from 12.22% to 14.85%.
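Since most of these results are reported as WER, a short sketch of the standard definition may help when reading them: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names below are hypothetical, and the paper's exact scoring setup (ASR model, text normalisation) is not described here, so this shows only the general metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed, e.g. for comparing WER before and after restoration."""
    return (before - after) / before
```

For example, `word_error_rate("go", "oh go on now")` returns 3.0 (a WER of 300%), because three insertions are counted against a one-word reference; this is how a raw score such as 250% can arise. Applied to the wTIMIT figures above, `relative_reduction(23.40, 12.22)` comes to roughly 0.48, i.e. close to half of the raw-whisper errors removed.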

Relevance

For accessibility practitioners working on speech, AAC, or voice interfaces, this paper is a compelling technical demonstration that user-specific adaptation, not just bigger models, is the lever for equity in speech technology. The finding that 15-30 minutes of personalised data can unlock usable intelligibility for speakers with atypical voices has immediate implications for clinical deployment and for products that currently exclude such speakers from voice assistants, dictation, captioning, and videoconferencing. The software-only design (a standard microphone and mobile-capable inference, with no extra hardware) is a meaningful accessibility win over prior silent speech interfaces that required EMG, ultrasound, or bone-conduction sensors. Practitioners should note the limitations the authors acknowledge: subjective evaluations used non-disabled listeners rather than members of the target community; the system does not preserve the speaker's original vocal identity (outputs use a normalised target voice), which raises identity and self-representation concerns; autoregressive inference imposes a latency floor; and longitudinal usability studies with dysarthric speakers remain future work.
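The latency point can be made concrete with the per-second costs quoted in the summary (32 ms on GPU, 170 ms on CPU per second of input). The sketch below is illustrative arithmetic only. It assumes that processing cost scales linearly with audio length and that a whole utterance is processed before playback starts; neither assumption is confirmed by the summary, and the function names are hypothetical.

```python
def real_time_factor(ms_per_audio_second: float) -> float:
    """Processing time divided by audio duration; values below 1.0 are faster than real time."""
    return ms_per_audio_second / 1000.0


def processing_delay_ms(ms_per_audio_second: float, utterance_seconds: float) -> float:
    """Estimated compute time for one utterance, assuming cost grows linearly with its length."""
    return ms_per_audio_second * utterance_seconds


GPU_MS_PER_SECOND = 32.0   # figure quoted in the summary
CPU_MS_PER_SECOND = 170.0  # figure quoted in the summary

gpu_rtf = real_time_factor(GPU_MS_PER_SECOND)
cpu_rtf = real_time_factor(CPU_MS_PER_SECOND)
cpu_delay_5s = processing_delay_ms(CPU_MS_PER_SECOND, 5.0)  # hypothetical 5-second utterance
```

Under these assumptions both devices are well below a real-time factor of 1.0, yet a 5-second utterance would still add 850 ms of compute on CPU before any output could be played, which is one way a system can be "real time" in throughput while still having a noticeable latency floor in conversation.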

Tags: speech accessibility · dysarthria · voice conversion · whispered speech · silent speech · speech disorders · transfer learning · voice user interface · machine learning · AAC