Are Synthesized Video Descriptions Acceptable?

Masatomo Kobayashi, Trisha O'Connell, Bryan Gould, Hironobu Takagi, Chieko Asakawa · 2010 · Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2010) · doi:10.1145/1878803.1878833

Summary

This paper from IBM Research Tokyo and the WGBH National Center for Accessible Media investigates whether narrations synthesised with text-to-speech (TTS) are an acceptable alternative to human-narrated audio descriptions for online videos. Although accessibility standards such as WCAG 2.0, Section 508, and JIS X 8341-3 recommend video descriptions, almost no online videos have them because of three barriers: writing description scripts requires expertise, matching human narration quality requires professional narrators and recording equipment, and popular video platforms offer no tools for adding descriptions. TTS-based descriptions could address the last two barriers. The researchers conducted studies in both Japan and the U.S. to reduce cultural bias. The Japan study included an informal survey of 115 visually impaired attendees at an assistive technology exhibit, followed by in-depth interviews with three blind or low-vision participants using 18 video samples with three voice types (human, standard TTS, prototype TTS) and two description levels (normal, detailed/extended). The U.S. study comprised an online survey of 236 blind and low-vision respondents (197 blind, 39 low-vision) and in-depth interviews with eight participants. A follow-up study in Japan further examined long-term listening, emotional voices, describer expertise, and objective understandability.

Key findings

Across both countries, synthesised video descriptions were generally accepted. In the Japan survey, the human voice was ranked first, but about a quarter of participants liked TTS voices as much as or more than the human voice, partly because TTS clearly distinguished descriptions from original dialogue. In the U.S. online survey, 68-80% rated TTS descriptions as neutral or comfortable. Comprehension rates were close between human (87%) and TTS (83%) descriptions. The follow-up study revealed several nuanced findings. For emotional content, mismatched emotional TTS voices seriously damaged the experience: a happy voice on tragic drama was rated significantly worse (p<0.001), while a neutral voice performed comparably to a human voice for most content. Crucially, novice describers using a prototype script editor could produce extended descriptions as effective as those of expert describers (novice 100% vs expert 83% comprehension at the extended level), though expertise mattered more for normal descriptions, which had to fit within limited gaps. Extended descriptions, where the video pauses while the description plays, were significantly more effective than normal descriptions (F(1,22)=17.29, p<0.001). Participants strongly desired customisation of TTS parameters (speed, volume, voice gender) and the ability to control playback.

Relevance

This cross-cultural study makes a compelling case that TTS-generated video descriptions are a viable, cost-effective way to dramatically increase the number of described online videos. The key practical insight is that extended descriptions (pausing the video) combined with TTS narration lower the bar enough that non-expert describers can produce effective descriptions, which has major implications for crowdsourced and user-generated video description at scale. For accessibility practitioners, the research offers three design principles for TTS description platforms: support extended descriptions to allow pausing for complex scenes, allow both describers and listeners to customise voice parameters, and enable iterative improvement of descriptions over time. The finding that emotional TTS voices can damage the experience cautions against naive application of expressive synthesis; neutral voices are safer for most content. The study remains highly relevant as online video continues to dominate digital content and the vast majority still lacks any description.
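
The distinction between normal and extended descriptions can be illustrated with a minimal sketch. It is not the paper's prototype editor or player: the `Cue` type, the `plan_descriptions` function, and the word-count estimate of speech duration are all hypothetical, and the default rate of 180 words per minute is an assumed value standing in for a listener-adjustable TTS speed. A description that fits in the available gap is played as a normal description; otherwise the video is paused for the duration of the description, as in the extended level described above.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One description cue (hypothetical structure, for illustration only)."""
    start: float  # seconds into the video where the description begins
    gap: float    # seconds of free time available at that point
    text: str     # description script to be spoken by TTS


def estimate_speech_seconds(text: str, words_per_minute: float = 180.0) -> float:
    """Rough duration estimate from word count; the rate is an assumption."""
    return len(text.split()) * 60.0 / words_per_minute


def plan_descriptions(cues, words_per_minute: float = 180.0):
    """Return a list of (start, mode, pause_seconds) tuples, ordered by start.

    mode is "normal" when the estimated speech fits in the gap (no pause),
    and "extended" when it does not; in that case the video is paused for
    the whole estimated duration of the description.
    """
    plan = []
    for cue in sorted(cues, key=lambda c: c.start):
        spoken = estimate_speech_seconds(cue.text, words_per_minute)
        if spoken <= cue.gap:
            plan.append((cue.start, "normal", 0.0))
        else:
            plan.append((cue.start, "extended", spoken))
    return plan


if __name__ == "__main__":
    cues = [
        Cue(start=5.0, gap=3.0, text="A man enters"),
        Cue(start=20.0, gap=1.0,
            text="He opens the old wooden box and smiles slowly"),
    ]
    for start, mode, pause in plan_descriptions(cues):
        print(f"{start:6.1f}s  {mode:8s}  pause={pause:.1f}s")
```

Because the speech rate is a parameter, a faster listener setting can turn an extended cue back into a normal one, which is one way the customisation requested by participants interacts with the description level.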

Tags: audio description · video accessibility · text-to-speech · speech synthesis · web accessibility · blind and low vision · online video · extended description

Standards referenced: WCAG 2.0 · Section 508 · JIS X 8341-3