Preferred Appearance of Captions Generated by Automatic Speech Recognition for Deaf and Hard-of-Hearing Viewers

Larwan Berke, Khaled Albusays, Matthew Seita, Matt Huenerfauth · 2019 · Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (CHI EA '19) · doi:10.1145/3290607.3312921

Summary

This CHI 2019 Late-Breaking Work (6 pages) investigates a practical question that has received surprisingly little research: when Automatic Speech Recognition (ASR) is used to caption small-group meetings for Deaf and Hard-of-Hearing (DHH) viewers, how should those captions be styled? Prior research had established that DHH users are receptive to ASR captioning in workplaces, classrooms, and small-group meetings now that ASR accuracy has improved, but styling guidance inherited from broadcast TV and movie subtitles had not been validated against the specific needs of ASR-captioned business-meeting video.

The authors recruited a large sample of 105 DHH adults (69 Deaf, 36 Hard-of-Hearing; 58 male, 47 female; ages 18-30, mean 22.1) at Rochester Institute of Technology for a 60-minute, in-person, IRB-approved lab study. Participants were screened on self-identified DHH status and caption use and received $40 compensation. They watched mock one-on-one business meeting videos with ASR captions at a 23.2% word error rate, a realistic ASR accuracy level at the time, and answered five caption-appearance questions:

Q1 (type): TV-CC black box vs movie-style white text vs movie-style black text
Q2 (appearance): one-word-at-a-time TV-CC scrolling vs full-line movie-subtitle style
Q3 (location): inside the video at the bottom, top, left, or right; or outside the video, above or below
Q4 (number of lines): 1-5
Q5 (font): chosen from a list including Arial, Helvetica, Comic Sans, Tiresias, etc.

Quantitative answers were analysed with chi-square goodness-of-fit tests, and 116 open-ended comments were thematically analysed by two DHH researchers using two-round consensus coding.
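For readers unfamiliar with the 23.2% figure, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the ASR output and a reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition only. It is not the authors' scoring pipeline, and the function name and the example sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from ASR output)
                curr[j - 1] + 1,                  # insertion (extra word in ASR output)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deleted "the"
# against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {example_wer:.1%}")
```

Real scoring tools usually normalise case and punctuation before comparing, which this sketch deliberately leaves out.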

Key findings

All five questions showed statistically significant preferences (all p < 0.001). Participants were almost evenly split between traditional TV-CC style (n=39; black box, one-word-at-a-time scroll) and movie-subtitle style (n=40; white text with black outline, full-line display), so no single style dominated. Location preferences were clearer: 72 participants wanted captions inside the video at the bottom and 21 wanted them outside and below, with very few wanting captions above or to the side. Two lines of captions were strongly preferred (n=64), notably fewer than the three or more lines observed in Kushalnagar et al.’s classroom study; the authors attribute this to small-group meetings needing more video real estate for the speaker’s face. Arial (n=35) and Times New Roman (n=29) led the font preferences, with specialised accessibility fonts such as Tiresias much less popular. The open-ended themes revealed a core tension: DHH viewers want captions that are both readable (a black box increases text visibility) and non-occluding (that same box hides the speaker’s face, expressions, and gestures, which carry meaning). Participants also wanted captions positioned near the speaker when possible, to pick up tone from facial expressions, and they expressed a strong desire for user customisation rather than a one-size-fits-all default.
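To make the chi-square goodness-of-fit test concrete, the sketch below compares observed answer counts with the equal counts expected if participants had no preference. It is an illustration, not a reproduction of the authors' analysis. The location answers are collapsed into three groups (72 inside-bottom, 21 outside-below, and the remaining 12 obtained by assuming all 105 participants answered), whereas the study offered six options. The function name is hypothetical, and the p-value uses the closed form that holds only for an even number of degrees of freedom.

```python
import math


def chi_square_gof(observed):
    """Chi-square goodness-of-fit against a uniform (no-preference) null.

    Returns (statistic, degrees_of_freedom, p_value). The p-value uses the
    exact closed form for even degrees of freedom only.
    """
    k = len(observed)
    total = sum(observed)
    if k < 2 or total <= 0:
        raise ValueError("need at least two categories and a positive total")

    expected = total / k  # equal share per category under the null
    statistic = sum((o - expected) ** 2 / expected for o in observed)

    df = k - 1
    if df % 2 != 0:
        raise ValueError("this sketch only handles an even number of degrees of freedom")

    # For df = 2m, P(X >= x) = exp(-x/2) * sum_{i<m} (x/2)^i / i!
    half = statistic / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    return statistic, df, p_value


# Collapsed location counts: inside-bottom, outside-below, all other answers.
# The third count (12) is derived, not reported, and is an assumption.
statistic, df, p_value = chi_square_gof([72, 21, 12])
print(f"chi2({df}) = {statistic:.2f}, p = {p_value:.2e}")
```

Collapsing categories changes the degrees of freedom, so the statistic here will not match the value reported in the paper; the point is only to show why a strongly lopsided distribution falls well below p = 0.001.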

Relevance

For accessibility practitioners building video-conferencing tools, captioning plug-ins, or live-meeting transcription services, the headline message is that there is no single correct caption style for DHH users: preferences split roughly evenly on type and appearance. The design response recommended by the authors is to let users customise caption appearance rather than dictate one default, and to offer at minimum the two dominant styles (TV-CC black-box vs movie-subtitle outlined). Two lines at the bottom of the video, in Arial or Times New Roman, is the most defensible default for small-group meeting contexts. The readability-vs-occlusion tension highlighted by participants is a tractable design problem: dynamic captions that follow or avoid the speaker (tracked captions), speech-bubble styles, or user-adjustable opacity all offer paths forward. Limitations include a young RIT-based sample (ages 18-30, mean 22.1), a small-group business-meeting context that may not generalise to classrooms, public kiosks, or entertainment video, and a short Late-Breaking Work format (6 pages) that left little room for longer-form analysis. The paper is a solid entry point into the under-researched area of ASR caption styling for DHH viewers.
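One way a captioning tool might act on these recommendations is to model caption appearance as a small settings object with the two dominant styles as presets and per-user overrides on top. The sketch below is a hypothetical illustration, not something described in the paper: every class, field, and preset name is invented, and the choice of which preset serves as the out-of-the-box default is arbitrary given the near-even split.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Background(Enum):
    BLACK_BOX = "black_box"   # TV-CC style
    OUTLINE = "outline"       # movie-subtitle style (white text, black outline)


class Reveal(Enum):
    WORD_BY_WORD = "word_by_word"  # TV-CC scrolling
    FULL_LINE = "full_line"        # movie-subtitle display


class Position(Enum):
    INSIDE_BOTTOM = "inside_bottom"
    INSIDE_TOP = "inside_top"
    INSIDE_LEFT = "inside_left"
    INSIDE_RIGHT = "inside_right"
    OUTSIDE_ABOVE = "outside_above"
    OUTSIDE_BELOW = "outside_below"


@dataclass(frozen=True)
class CaptionStyle:
    """Hypothetical caption settings; defaults follow the summary's suggested default."""
    background: Background = Background.OUTLINE
    reveal: Reveal = Reveal.FULL_LINE
    position: Position = Position.INSIDE_BOTTOM
    max_lines: int = 2
    font_family: str = "Arial"
    box_opacity: float = 1.0  # only meaningful when background is BLACK_BOX

    def __post_init__(self):
        # The study asked about 1 to 5 lines, so the sketch keeps that range.
        if not 1 <= self.max_lines <= 5:
            raise ValueError("max_lines must be between 1 and 5")
        if not 0.0 <= self.box_opacity <= 1.0:
            raise ValueError("box_opacity must be between 0.0 and 1.0")


# The two dominant styles offered as presets.
TV_CC = CaptionStyle(background=Background.BLACK_BOX, reveal=Reveal.WORD_BY_WORD)
MOVIE_SUBTITLE = CaptionStyle(background=Background.OUTLINE, reveal=Reveal.FULL_LINE)

# A user who likes the black box but wants to see more of the speaker's face.
user_style = replace(TV_CC, box_opacity=0.5, font_family="Times New Roman")
print(user_style)
```

Keeping the settings immutable and deriving user variants with `replace` means each preset stays intact while every override is validated again on construction.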

Tags: captioning · deaf and hard of hearing · automatic speech recognition · user interface design · typography · readability · videoconferencing accessibility

Standards referenced: CEA-608 · CEA-708