Behavioral Changes in Speakers who are Automatically Captioned in Meetings with Deaf or Hard-of-Hearing Peers
Matthew Seita, Khaled Albusays, Sushant Kafle, Michael Stinson, Matt Huenerfauth · 2018 · Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2018) · doi:10.1145/3234695.3236355
Summary
This study from Rochester Institute of Technology investigates a largely unexplored question: how does using an ASR-based captioning tool in meetings with deaf or hard of hearing (DHH) colleagues change the speaking behavior of hearing participants? While prior work has focused on DHH users' experiences with automatic captions, this paper examines the other side of the interaction: how the technology and social context together influence the hearing speaker. The researchers conducted an experiment with 21 participants (12 hearing, 9 DHH) at the National Technical Institute for the Deaf. Small groups of 2-3 people (always including one DHH participant) held collaborative discussions under three conditions: (1) without ASR technology, using whatever communication methods they preferred, such as speech-reading, writing, or gesturing; (2) with an ASR-based chat application in which spoken words appeared as text messages; and (3) with a "Markup" version of that application, which italicized and underlined words the ASR had recognized with low confidence. The ASR application worked like a chat room: hearing participants pressed a microphone button to speak, their words were transcribed by cloud-based ASR, and the resulting messages appeared as text visible to all participants. DHH participants communicated by typing. Audio recordings were annotated using ELAN software and analyzed with the Praat speech analysis tool to extract acoustic features including intensity, pitch, formant frequencies, harmonics-to-noise ratio (HNR), speech rate, and Speech Intelligibility Index.
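To make the Markup condition concrete, the following is a minimal Python sketch of confidence-based markup. It assumes the recognizer returns a confidence score between 0 and 1 for each word. The function name `render_markup`, the 0.5 threshold, and the HTML output format are hypothetical choices for illustration; the summary above does not state how the study's application implemented this.

```python
from html import escape


def render_markup(words, threshold=0.5):
    """Render (word, confidence) pairs as HTML, italicizing and underlining
    every word whose confidence falls below the threshold.

    The threshold and the HTML tags are illustrative assumptions, not
    values taken from the study.
    """
    rendered = []
    for text, confidence in words:
        safe = escape(text)  # avoid injecting raw markup from the transcript
        if confidence < threshold:
            safe = f"<i><u>{safe}</u></i>"
        rendered.append(safe)
    return " ".join(rendered)


# Hypothetical recognizer output for one spoken message.
message = [("please", 0.95), ("send", 0.90), ("the", 0.88), ("agenda", 0.31)]
print(render_markup(message))
```

In this sketch only "agenda" falls below the threshold, so it is the only word shown with emphasis. This is the kind of visual cue that tells the speaker which words the recognizer was unsure about.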
Key findings
The analysis revealed statistically significant changes in hearing participants' speech when using ASR technology. Speakers talked significantly louder in both the ASR and Markup conditions than with no ASR (p=0.004 and p=0.016 respectively), likely because they were aware they were dictating into a microphone. Voice quality improved in the Markup condition, with a significantly higher harmonics-to-noise ratio (HNR) than with no ASR (p=0.008). This ran contrary to the hypothesis that louder speech would decrease HNR, and it suggests that speakers actively improved their voice clarity when given visual feedback about ASR confidence. Both F1 and F2 formant frequencies changed significantly between the ASR conditions and no ASR (p<0.014 for all), indicating hyperarticulation: speakers physically changed how they positioned their tongue and mouth to produce clearer vowel sounds. Surprisingly, speech rate was significantly faster in the Markup condition than with no ASR (p=0.002), while speakers talked more slowly when directly addressing DHH participants without ASR. No significant differences were found in pitch or Speech Intelligibility Index across conditions, and Word Error Rate did not differ between the ASR and Markup conditions.
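For readers unfamiliar with the Word Error Rate metric mentioned above, here is a minimal Python sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The function name, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the study's exact text normalization is not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Tokenization here is plain lowercasing plus whitespace splitting,
    which is an illustrative assumption rather than the study's procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: two of five reference words are misrecognized.
print(word_error_rate("the meeting starts at noon", "the meeting start at new"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the recognizer inserts many extra words, so it is not strictly a percentage of words that were wrong.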
Relevance
This research has direct implications for how ASR captioning systems should be designed and trained. The central finding is that hearing speakers produce acoustically different speech when using ASR tools with DHH colleagues. This means that ASR training datasets built from standard telephone or lecture recordings may not be representative of the speech that ASR actually encounters in this accessibility context. ASR systems intended for DHH communication should be trained on data that includes this type of hyperarticulated, higher-intensity speech. For designers of captioning tools, the study suggests that visual confidence indicators (like the markup feature) can positively influence speaker behavior, potentially creating a beneficial feedback loop in which speakers modify their speech to be more recognizable by the system. The finding that speakers naturally slow down when talking directly to DHH peers but speed up with ASR is also important: it suggests that ASR tools may need to encourage pacing that benefits both the technology and the DHH user, who may be speech-reading while also reading captions.
Tags: deaf and hard of hearing · automatic speech recognition · captioning · communication accessibility · speech behavior · workplace accessibility · ASR
Standards referenced: ANSI/ASA S3.5-1997