Usability Evaluation of Captions for People Who Are Deaf or Hard of Hearing

Sushant Kafle, Matt Huenerfauth · 2018 · SIGACCESS Accessibility and Computing Newsletter (Issue 122) · doi:10.1145/3386410.3386411

Summary

This SIGACCESS Newsletter article summarizes a line of research by Kafle and Huenerfauth on building a caption-quality evaluation metric that reflects the experience of Deaf and Hard-of-Hearing (DHH) readers rather than simply counting speech-recognition errors. The authors argue that the standard Automatic Speech Recognition (ASR) evaluation metric, Word Error Rate (WER), is a poor proxy for caption usability: WER treats every substitution, deletion, and insertion as equally bad, ignoring both the semantic importance of the affected word and how misleading the erroneous word is. They draw on prior reading research suggesting that deaf readers often use a 'keyword' comprehension strategy, extracting meaning from a sparse set of high-importance content words and largely skipping the rest. Under that model, an ASR error on a low-importance function word barely affects comprehension, while an error on a keyword can derail understanding entirely. They introduce an Automatic Caption Evaluation (ACE) framework that combines two sub-scores: (a) a word-importance score for the reference (correct) word, estimated via n-gram and neural word-predictability language models, and (b) a semantic-distance score between the reference word and the ASR-produced word, computed from the cosine similarity of word2vec embeddings. The two sub-scores are combined via a weighted sum whose tuning parameter α is fit to a dataset of comprehension-test scores collected from 30 DHH participants (ages 20-32; 26 Deaf and 4 Hard-of-Hearing) who answered information-retention questions about ASR-transcribed text passages.
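
The contrast between WER and an importance-weighted score can be illustrated with a minimal Python sketch. It is not the authors' implementation: the importance values, the two-dimensional "embeddings", the exact form of the weighted sum, and the names IMPORTANCE, EMBED, error_impact and score_caption are all assumptions made for illustration. The reference and hypothesis are assumed to be already word-aligned, and insertions are left out for brevity.

```python
import math

# Hypothetical toy values for illustration only; they are not from the article.
# IMPORTANCE stands in for a word-importance estimate in [0, 1].
IMPORTANCE = {"the": 0.05, "meeting": 0.80, "is": 0.05, "cancelled": 0.95}

# EMBED stands in for pre-trained word embeddings (real ones have far more dimensions).
EMBED = {
    "the": (1.0, 0.1),
    "a": (1.0, 0.2),
    "cancelled": (0.1, 1.0),
    "candled": (1.0, -0.2),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity, a value in [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def error_impact(ref_word, hyp_word, alpha):
    """Assumed weighted sum of word importance and semantic distance for one error."""
    importance = IMPORTANCE.get(ref_word, 0.5)  # assumed default for unknown words
    if ref_word in EMBED and hyp_word in EMBED:
        distance = cosine_distance(EMBED[ref_word], EMBED[hyp_word])
    else:
        distance = 1.0  # assumption: deletions and unknown words get a fixed distance
    return alpha * importance + (1.0 - alpha) * distance


def score_caption(aligned_pairs, alpha=0.5):
    """Return (WER, mean weighted impact) for pre-aligned (reference, hypothesis) pairs.

    A hypothesis word of None marks a deletion. Insertions are not modelled here.
    """
    errors = [(ref, hyp) for ref, hyp in aligned_pairs if ref != hyp]
    wer = len(errors) / len(aligned_pairs)
    impact = sum(error_impact(ref, hyp, alpha) for ref, hyp in errors) / len(aligned_pairs)
    return wer, impact


reference = ["the", "meeting", "is", "cancelled"]
hyp_a = ["a", "meeting", "is", "cancelled"]    # error on a function word
hyp_b = ["the", "meeting", "is", "candled"]    # error on a keyword

wer_a, impact_a = score_caption(list(zip(reference, hyp_a)))
wer_b, impact_b = score_caption(list(zip(reference, hyp_b)))
print(f"A: WER={wer_a:.2f} impact={impact_a:.3f}")
print(f"B: WER={wer_b:.2f} impact={impact_b:.3f}")
```

Both hypotheses contain one error in four words, so WER cannot tell them apart, while the weighted score penalises the keyword error far more heavily under these toy values.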

Key findings

In the underlying ASSETS 2017 Best Paper on which this newsletter article reports, the ACE metric correlated more strongly than WER with DHH participants' subjective preference ratings of caption text. The authors found that users' subjective judgements of caption quality are effectively uncorrelated with the raw count of ASR errors. Word predictability (higher-predictability words being less important to comprehension) and semantic distance between the correct and erroneous word emerged as the two useful predictors. The word-importance sub-score can be computed from n-gram language models or neural word-prediction models, with performance comparisons pointing toward neural models as promising; the semantic-distance sub-score is computed from pre-trained word2vec embeddings using cosine similarity. The α parameter is tuned empirically against DHH comprehension data rather than being set a priori. The framework is designed both for retrospective evaluation of captioning systems and as a training objective that ASR developers could optimise instead of WER when their downstream application is real-time captioning for DHH users. The authors flag future work on alternative word-importance and semantic-distance estimators and on further empirical validation studies, and note that the ACE framework complements rather than replaces caption-placement and caption-occlusion metrics.
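
Tuning α empirically can be pictured as a simple search over candidate values. The sketch below is an assumption about how such a fit might look, not the procedure the authors report: it grid-searches α in [0, 1] and keeps the value whose weighted scores are most negatively correlated with comprehension scores. The PASSAGES data are invented placeholders, and the names pearson and fit_alpha are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Invented placeholder data, NOT results from the study. Each tuple is
# (mean importance of the erroneous words, mean semantic distance of the errors,
#  comprehension score in [0, 1]) for one hypothetical passage.
PASSAGES = [
    (0.9, 0.8, 0.30),
    (0.2, 0.9, 0.70),
    (0.8, 0.3, 0.45),
    (0.1, 0.2, 0.90),
    (0.5, 0.6, 0.55),
]


def fit_alpha(passages, steps=20):
    """Grid-search alpha; a higher impact score should predict lower comprehension."""
    comprehension = [score for _, _, score in passages]
    best_alpha, best_r = None, None
    for i in range(steps + 1):
        alpha = i / steps
        impact = [alpha * imp + (1.0 - alpha) * dist for imp, dist, _ in passages]
        r = pearson(impact, comprehension)
        if best_r is None or r < best_r:  # keep the most negative correlation
            best_alpha, best_r = alpha, r
    return best_alpha, best_r


alpha, r = fit_alpha(PASSAGES)
print(f"alpha={alpha:.2f} correlation={r:.3f}")
```

A grid search is used only because it is easy to read; with real data, a regression fit or cross-validation would be a more defensible choice, and five invented passages say nothing about what α should be.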

Relevance

For captioning practitioners, ASR developers, and broadcasters procuring automated captioning services, this article is a clear, accessible summary of why WER alone is an inadequate procurement or QA yardstick when the end users are DHH. It makes the case, backed by comprehension-study data, that caption quality metrics should be weighted by word importance and by the semantic impact of each error, not just by error count. The ACE framework is directly actionable: vendors could report ACE scores alongside WER for DHH-facing deployments, and procurement teams could require that automated captioning services be evaluated against DHH-calibrated metrics. For researchers, it is a useful entry point to a body of work that also includes Kafle and Huenerfauth's LREC 2018 corpus for modeling word importance and the ASSETS 2017 paper on usability evaluation. Limitations include the modest underlying participant pool (n=30 young DHH adults), reliance on English-language assumptions in word predictability modelling, and the fact that word-importance and semantic-distance estimators are proxies that can themselves encode biases.

Tags: automatic speech recognition · captioning · captions · caption quality · accessibility metrics · deaf and hard of hearing · real-time captioning · accessibility research · natural language processing