Speech AI for All: The What, How, and Who of Measurement

Kimi Wenzel, Alisha Pradhan, Maria Teleki, Tobias M. Weinberg, Robin Netzorg, Alyssa Hillary Zisk, Anna Seo Gyeong Choi, Jingjin Li, Raja Kushalnagar, Colin Lea, Abraham Glasser, Christian Vogler, Ly Xinzhen M. Zhangsun Brown, Nan Bernstein Ratner, Allison Koenecke, Karen Nakamura, Shaomei Wu · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) — Workshop · doi:10.1145/3772363.3778768

Summary

This CHI 2026 workshop proposal, the second in the organisers' 'Speech AI for All' series, assembles 17 researchers, practitioners, and community advocates to tackle a specific downstream problem in fair and accessible speech AI: measurement. The motivating claim is that today's speech AI systems (voice assistants, meeting-transcription services, medical transcription, speech output integrated with augmentative and alternative communication (AAC), customer-service agents) are optimised for typical fluent speech from majority socioeconomic and vernacular backgrounds, and systematically underperform for people with stuttering, dysarthria, d/Deaf and Hard-of-Hearing (DHH) speech, aphasia, age-related voice changes, accented second-language speech, gendered speech patterns, and racial/ethnic vernaculars and dialects. These underperformances translate into concrete harms: worse medical transcription for people with aphasia, adverse hiring outcomes for DHH speakers, psychological harms when one's voice is systematically misheard, and AAC users being excluded from AI-mediated daily-life infrastructure. The workshop argues that standard metrics, principally Word Error Rate (WER), collapse disparate error types and experiences into a single number that can show 'good' average performance (say, 7% WER) while masking catastrophic performance for specific user populations. Further, WER was designed for modular automatic speech recognition (ASR) and does not straightforwardly extend to end-to-end speech language models (SLMs) that integrate recognition, understanding, and generation in one framework. The workshop is structured as a 180-minute, two-session event with a 30-minute keynote, a 60-minute poster session, and two 30-minute guided breakout discussions: (1) connecting existing metrics to downstream user impact, and (2) developing new holistic metrics for different stages of the speech-AI lifecycle.
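The masking effect of a pooled WER can be illustrated with a minimal Python sketch. This is not code from the workshop proposal: it assumes the common definition of WER as word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and the function names `word_errors` and `wer_by_group`, the group labels, and the utterances are all hypothetical.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) triples.

    Returns (pooled WER, {group: WER}). Assumes every group has at least
    one non-empty reference transcript.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    per_group = {g: errors[g] / words[g] for g in errors}
    overall = sum(errors.values()) / sum(words.values())
    return overall, per_group


# Hypothetical toy data: nine utterances recognised perfectly for group "A",
# one utterance badly misrecognised for group "B".
reference = "turn on the kitchen lights please"
samples = [("A", reference, reference)] * 9
samples.append(("B", reference, "turn the chicken fights"))

overall, per_group = wer_by_group(samples)
print(f"pooled WER: {overall:.3f}")          # 4 errors / 60 words
for group, wer in sorted(per_group.items()):
    print(f"group {group} WER: {wer:.3f}")
```

On this toy data the pooled WER is about 7% while group B alone sits near 67%, which is the kind of gap a single headline number would hide. The grouping key is the design choice that matters here; real audits would need consented, representative speech data for each population.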

Key findings

As a workshop proposal, the paper contributes a shared research agenda rather than empirical findings. The organisers identify three measurement gaps. First, WER is unidimensional and user-blind: it does not answer who experiences errors, how those errors manifest in interaction, or what emotional, social, and material costs follow. Promising alternative measures have surfaced in the user-studies literature (interaction-error rates, retries-to-success, abandonment rates, types of breakdown, psychological-impact scales, and subgroup-specific metrics), but none has been systematically adopted into mainstream speech-AI evaluation. Second, speech language models (SLMs) require different auditing strategies than modular ASR: errors can emerge in understanding or generation rather than transcription, and only a handful of studies have explicitly addressed fairness in SLMs. Third, synthetic voices and voice cloning raise new representational questions for AAC and gender-affirming voice work, where timing, rhythm, prosody, and cultural nuance matter as much as phonetic accuracy. Expected outputs include a research agenda for the CHI community, groundwork for new papers on speech-AI measurement, and a diversity-centred benchmark suite for external evaluators, intended to fill a gap where current benchmarks are 'sparsely populated leaderboards that overindex on singular metrics'.
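Two of the interaction-level measures named above, retries-to-success and abandonment rate, can be sketched in Python as follows. The proposal names these metrics without defining them, so the definitions here are assumptions: a session counts as abandoned if the request was never understood, and retries-to-success is the number of attempts beyond the first in sessions that eventually succeeded. The `Session` record, the `interaction_metrics` function, the group labels, and the data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    group: str       # hypothetical subgroup label
    attempts: int    # how many times the user voiced the request
    succeeded: bool  # whether the system eventually understood it


def interaction_metrics(sessions):
    """Return {group: {"abandonment_rate", "mean_retries_to_success"}}."""
    by_group = {}
    for s in sessions:
        by_group.setdefault(s.group, []).append(s)

    report = {}
    for group, items in by_group.items():
        successes = [s for s in items if s.succeeded]
        report[group] = {
            # Share of sessions that ended without a successful interaction.
            "abandonment_rate": 1 - len(successes) / len(items),
            # Extra attempts needed in sessions that did succeed;
            # None when no session in the group succeeded.
            "mean_retries_to_success": (
                mean(s.attempts - 1 for s in successes) if successes else None
            ),
        }
    return report


# Hypothetical toy log, for illustration only.
sessions = [
    Session("fluent", 1, True),
    Session("fluent", 1, True),
    Session("fluent", 2, True),
    Session("stuttering", 3, True),
    Session("stuttering", 4, False),
]

report = interaction_metrics(sessions)
for group, metrics in sorted(report.items()):
    print(group, metrics)
```

Reporting these per subgroup rather than pooled follows the same reasoning as the WER critique: an average over all sessions would blur exactly the differences the workshop wants surfaced.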

Relevance

This workshop sits at the centre of several converging accessibility risks. As voice-enabled AI (home assistants, in-car interfaces, meeting platforms, medical transcription, hiring tools) becomes unavoidable daily infrastructure, its systematic failure for people with speech diversities moves from annoyance to exclusion from employment, healthcare, education, and civic participation. Practitioners procuring or evaluating speech-AI products should internalise the paper's central methodological critique: a vendor-reported WER of 5-10% tells you nothing about whether the system works for the specific user populations in your context. Audits should require subgroup-specific measurements, particularly for stuttering, DHH, aphasia, dysarthria, and accented speech; should include interaction-level metrics (abandonment, retries) and psychological-impact measures; and should be repeated for SLMs rather than assuming transfer from modular ASR benchmarks. The author roster also serves as a practical 'who to cite' directory for work on Deaf-accented speech (Glasser, Kushalnagar, Vogler), stuttering (Wu, Li at AImpower.org), AAC voices (Weinberg, Zisk at AssistiveWare), ASR fairness for Black speakers and speakers with aphasia (Koenecke), and disability-justice scholarship grounded in lived experience (Brown, Nakamura). Limitations: this is a workshop proposal, not a validated benchmark or empirical study; outputs will emerge after CHI 2026.

Tags: speech AI · automatic speech recognition · speech diversity · augmentative and alternative communication · disfluency · stuttering · deaf and hard of hearing · algorithmic fairness · accessibility research · workshop · AI FATE