Disability-First AI Dataset Annotation: Co-designing Stuttered Speech Annotation Guidelines with People Who Stutter

Xinru Tang, Jingjin Li, Shaomei Wu · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3790405

Summary

Tang, Li, and Wu present the first study to push the 'disability-first' principle beyond dataset collection and into the dataset annotation stage of the AI pipeline. Their case is stuttered speech: despite a growing number of stuttering datasets (FluencyBank, UCLASS, KSoF, LibriStutter, AS-70, Boli, Sep-28k), annotation is usually performed by crowdworkers or listeners trained by speech-language pathologists (SLPs) who are not themselves people who stutter (PWS), and inter-annotator agreement on core stuttering events in Sep-28k is strikingly low (0.11 for prolongations, 0.25 for blocks, 0.39 for disfluency presence).

The study unfolds in three phases. Phase 1 (Formative) reviews Sep-28k against its re-annotated subset Sep-28k-SW, documenting disagreements on 2,621 clips (12.32% prolongation, 12.13% block, 8.74% sound repetition, 5.46% word repetition, 10.61% interjection) and interviewing two PWS AI professionals. Phase 2 (Co-design) runs three iterative Zoom workshops with two PWS AI professionals and one follow-up session with two SLPs (one PWS, one non-PWS), using Praat and a shared preliminary guideline built on FluencyBank, Sep-28k, and AS-70 conventions. Phase 3 (Evaluation) brings in four PWS speech-data contributors who review annotations of their own recordings and discuss five downstream scenarios - captioning, voice commands, smart speakers, speech-to-text, and therapy/education.

The resulting guidelines ship in Appendix A and centre three qualities: contextualised annotation in speech flow; prioritisation of PWS embodied knowledge; and reflexive acknowledgement of the trade-offs in subjective speech perception. Markup uses /r (word/phrase repetition), /s (sound repetition), /b (block), /p (prolongation), /i (interjection), and explicit 'mixed stuttering events' notation.
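To make the markup concrete, here is a minimal Python sketch that tallies the five event tags in an annotated transcript. Only the tag letters and their meanings come from the summary above. How the tags attach to words, the example transcript, and the names `TAGS`, `TAG_PATTERN`, and `count_events` are assumptions made for illustration; the paper's Appendix A defines the actual syntax, including the mixed-event notation, which this sketch does not attempt to parse.

```python
import re
from collections import Counter

# Tag letters and meanings as listed in the summary.
TAGS = {
    "r": "word/phrase repetition",
    "s": "sound repetition",
    "b": "block",
    "p": "prolongation",
    "i": "interjection",
}

# Assumption: a tag is a slash plus one tag letter, not followed by
# another word character. The real placement rules may differ.
TAG_PATTERN = re.compile(r"/([rsbpi])(?!\w)")


def count_events(transcript: str) -> Counter:
    """Count each stuttering-event tag found in an annotated transcript."""
    return Counter(TAGS[m.group(1)] for m in TAG_PATTERN.finditer(transcript))


# Hypothetical annotated utterance, invented for this example.
example = "I/r I want to g/s go to the um/i s/p store"
counts = count_events(example)
for name in TAGS.values():
    print(f"{name}: {counts[name]}")
```

A per-clip tally like this is only a starting point: the guidelines stress annotating events in the context of speech flow, which a count of isolated tags cannot capture.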

Key findings

PWS annotators bring embodied cues that non-PWS listeners systematically miss: 'accessory signals' (breathing, tone of speech, speed of talking) and kinematic signals (teeth or vocal cord tension audible before a block). One participant, Charan, used a door metaphor to distinguish a block ('a jam door you bump into again and again') from intentional back-tracking ('hiding' a stutter by pausing or substituting a word), a distinction nearly invisible in the audio alone. Non-PWS annotators of Sep-28k over-applied stuttering labels to natural disfluencies (ums, filler words, thinking pauses), producing substantially more false positives than false negatives relative to the PWS-led Sep-28k-SW.

Three annotation challenges persist and cannot be fully eliminated: (1) the complex, organic, and contested nature of stuttering (avoidance behaviours, covert stuttering, individualised filler words); (2) subjectivity in human speech perception even among PWS - another participant, Rong, estimated consistent agreement in 'roughly 80% of cases'; (3) socio-technical distortions introduced by annotation tools - Praat's segment-and-edit workflow causes annotators to miss prolongations that sit just outside the highlighted clip, and Sep-28k's strict 3-second segmentation obscures conversational flow.

Evaluation-phase participants preferred professional annotators familiar with stuttering and welcomed verbatim stuttering transcription for identity and advocacy reasons (Tatianna cited a job interview where her stutter was mistaken for incompetence), but they wanted situated labels: accuracy expectations shift depending on whether annotations feed captioning, voice commands, therapy, or education. The team argues for a stewardship model that replaces majority-vote ground truth with PWS-led continuous re-review, and for multiplicity-aware pipelines that carry disagreement downstream rather than erasing it.
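The contrast between majority-vote ground truth and a pipeline that carries disagreement downstream can be illustrated with a small Python sketch. This is not the authors' method. Soft labels (per-label proportions) are one common way to preserve annotator disagreement, and they are used here only as an assumed stand-in. The function names and the example labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Collapse annotator labels to the single most common one."""
    return Counter(labels).most_common(1)[0][0]


def soft_label(labels):
    """Keep the full distribution of annotator labels as proportions."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels from three annotators for one clip.
clip_labels = ["block", "block", "prolongation"]

hard = majority_vote(clip_labels)  # the minority reading is discarded
soft = soft_label(clip_labels)     # the minority reading is retained

print(hard)
print(soft)
```

With the hard label, a downstream model never sees that one annotator heard a prolongation. With the soft label, that information remains available to training, evaluation, or later re-review.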

Relevance

For accessibility practitioners working on AI, this paper reframes 'label noise' as a symptom of epistemic injustice rather than a technical defect, and gives concrete ammunition for arguing that disabled subject-matter experts must be involved at the annotation stage of AI pipelines - not only data collection. The Sep-28k versus Sep-28k-SW disagreement figures (up to about 12% of clips changing label) are directly citable when pushing back on benchmark claims for stuttering ASR, captioning systems, or voice-command platforms. The published annotation guidelines, markup syntax, and iterative co-design protocol are reusable templates for sign-language video, aphasia, AAC, and other non-normative communication datasets. Limitations are candidly noted: a small, technically literate, primarily US-based participant pool; reliance on existing SLP and ML frameworks as the starting point; English podcast data only; and no downstream evaluation of whether models trained on PWS-annotated data perform better in production. The paper's most actionable contribution for practitioners is less the specific stuttering labels and more the argument that any accessibility AI dataset should document who annotated it, with what embodied knowledge, and how disagreement was resolved - or deliberately preserved.
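One way a practitioner might act on that last recommendation is to attach a small provenance record to each dataset. The Python sketch below is an assumption about what such a record could hold. The paper does not prescribe a schema, and the class name, field names, allowed policy values, and example values are all hypothetical.

```python
from dataclasses import dataclass, asdict

# Hypothetical policy vocabulary, mirroring "resolved - or deliberately preserved".
ALLOWED_POLICIES = {"resolved", "preserved"}


@dataclass
class AnnotationProvenance:
    """Who annotated a dataset, with what knowledge, and how disagreement was handled."""

    dataset: str
    annotator_background: str
    embodied_knowledge: str
    disagreement_policy: str

    def __post_init__(self):
        # Reject records that leave the handling of disagreement unstated.
        if self.disagreement_policy not in ALLOWED_POLICIES:
            raise ValueError(
                f"disagreement_policy must be one of {sorted(ALLOWED_POLICIES)}"
            )


# Example values are invented for illustration.
card = AnnotationProvenance(
    dataset="example-stuttering-corpus",
    annotator_background="professional annotators familiar with stuttering",
    embodied_knowledge="annotators are people who stutter",
    disagreement_policy="preserved",
)

print(asdict(card))
```

Serialising the record with `asdict` makes it easy to ship alongside the data, for example inside a dataset card or a metadata file.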

Tags: AI dataset annotation · stuttering · speech recognition · disability-first design · embodied knowledge · data labeling · accessibility · speech technology · co-design · epistemic injustice · AI ethics · AI fairness