Audio Description Customization

Rosiana Natalie, Ruei-Che Chang, Smitha Sheshadri, Anhong Guo, Kotaro Hara · 2024 · Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2024) · doi:10.1145/3663548.3675617

Summary

This paper investigates how audio descriptions (AD) for video content can be customized to meet the diverse preferences of blind and low-vision (BLV) users. Traditional ADs are fixed narratives created by sighted describers, offering no ability for users to adjust what information is conveyed or how it is presented. The researchers conducted two studies to understand and address this gap. Study 1 interviewed 15 BLV participants about their AD experiences and customization desires, revealing that users want control over multiple dimensions including detail level, emphasis on specific visual elements (facial expressions, scene settings, actions, on-screen text, colors), playback speed, voice characteristics, format (inline descriptions during dialogue pauses versus extended descriptions that pause the video), and tone. Based on these findings, the researchers developed CustomAD, a web-based prototype that enables BLV users to customize both content and presentation of audio descriptions. The system uses GPT-4 to generate AD scripts at five detail levels and with different emphasis categories, and ElevenLabs text-to-speech for multiple voice options. Study 2 evaluated CustomAD with 12 BLV participants watching videos across different genres, comparing the customizable system against traditional fixed AD and no-AD conditions.
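
As an illustration only, the customization dimensions listed above could be modeled as a small settings object that is turned into instructions for a text-generation model. The sketch below is not CustomAD's actual schema or prompt: the names ADPreferences, ADFormat, and build_prompt, the default values, and the prompt wording are all assumptions made for this example. Only the five detail levels, the emphasis categories, and the inline/extended formats come from the summary.

```python
from dataclasses import dataclass
from enum import Enum


class ADFormat(Enum):
    INLINE = "inline"      # spoken during dialogue pauses
    EXTENDED = "extended"  # video pauses while the description plays


# Emphasis categories named in the Study 1 findings.
EMPHASIS_CATEGORIES = (
    "facial expressions",
    "scene settings",
    "actions",
    "on-screen text",
    "colors",
)


@dataclass
class ADPreferences:
    """Hypothetical container for one viewer's AD settings."""
    detail_level: int = 3               # 1 (least) to 5 (most); default is an assumption
    emphasis: tuple[str, ...] = ()      # subset of EMPHASIS_CATEGORIES
    ad_format: ADFormat = ADFormat.INLINE
    playback_speed: float = 1.0         # assumed multiplier, 1.0 = normal speed
    voice: str = "default"              # hypothetical voice identifier

    def __post_init__(self) -> None:
        if not 1 <= self.detail_level <= 5:
            raise ValueError("detail_level must be between 1 and 5")
        unknown = [e for e in self.emphasis if e not in EMPHASIS_CATEGORIES]
        if unknown:
            raise ValueError(f"unknown emphasis categories: {unknown}")


def build_prompt(prefs: ADPreferences, scene_notes: str) -> str:
    """Assemble an illustrative instruction string for a language model."""
    lines = [
        "Write an audio description for the scene below.",
        f"Detail level: {prefs.detail_level} of 5.",
        f"Format: {prefs.ad_format.value}.",
    ]
    if prefs.emphasis:
        lines.append("Emphasize: " + ", ".join(prefs.emphasis) + ".")
    lines.append("Scene: " + scene_notes)
    return "\n".join(lines)
```

Keeping the settings in one validated object means a player could store them per viewer or per genre and regenerate descriptions whenever a setting changes; the playback speed and voice fields would be passed to the speech layer rather than to the text prompt.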

Key findings

CustomAD significantly improved video understanding compared to both traditional AD and no-AD conditions (p<.001), and significantly enhanced immersion compared to no AD (p=.004). Participants navigated information more efficiently with customizable descriptions. Detail level was the most frequently adjusted feature, with users preferring lower detail for casual viewing and higher detail for important or complex content. Emphasis customization was highly valued, with facial expressions being the most selected category across genres, followed by scene and setting descriptions. The extended AD format, which pauses video to deliver longer descriptions, was preferred for visually dense content despite interrupting viewing flow. Participants appreciated voice selection for comfort during extended viewing sessions. However, the study also revealed challenges: managing multiple customization options increased cognitive load, particularly for users less familiar with technology. Some participants found switching between settings disruptive. The AI-generated descriptions occasionally contained inaccuracies, especially for facial expressions and nuanced visual details, highlighting limitations of current large language model capabilities for this application.

Relevance

This research has significant implications for how video accessibility is implemented in practice. The finding that BLV users have highly diverse and context-dependent AD preferences challenges the one-size-fits-all approach that dominates current AD production. For practitioners, the study suggests that AD systems should offer progressive disclosure of customization options, starting with sensible defaults and allowing users to gradually explore more settings. The use of AI to generate customizable descriptions at scale could dramatically reduce the cost and time barriers that currently limit AD availability, though human review remains important for accuracy. The emphasis categories identified in this study provide a practical framework for organizing AD content priorities. Organizations producing video content should consider how customization features can be integrated into their media players and AD workflows.
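
The progressive-disclosure recommendation can be sketched as settings grouped into tiers, with only the first tier shown initially and defaults applied to everything still hidden. This is a minimal sketch under stated assumptions: the tier grouping, the default values, and the function names are hypothetical and are not taken from the paper. Placing detail level in the first tier reflects the finding that it was the most frequently adjusted feature.

```python
# Hypothetical grouping of settings into disclosure tiers (an assumption,
# not the paper's design). Earlier tiers are shown to the user first.
TIERS = [
    ("basic", ["detail_level"]),
    ("intermediate", ["emphasis", "ad_format"]),
    ("advanced", ["playback_speed", "voice", "tone"]),
]

# Assumed defaults applied to any setting the user has not yet been shown.
DEFAULTS = {
    "detail_level": 3,
    "emphasis": [],
    "ad_format": "inline",
    "playback_speed": 1.0,
    "voice": "default",
    "tone": "neutral",
}


def visible_settings(unlocked_tiers: int) -> list[str]:
    """Return the setting names exposed when the first N tiers are unlocked."""
    # Always show at least the first tier; never exceed the number of tiers.
    count = max(1, min(unlocked_tiers, len(TIERS)))
    names: list[str] = []
    for _, tier_settings in TIERS[:count]:
        names.extend(tier_settings)
    return names


def effective_settings(user_choices: dict, unlocked_tiers: int) -> dict:
    """Merge user choices over defaults, ignoring settings still hidden."""
    settings = dict(DEFAULTS)
    for name in visible_settings(unlocked_tiers):
        if name in user_choices:
            settings[name] = user_choices[name]
    return settings
```

For example, effective_settings({"detail_level": 5}, 1) keeps every default except the detail level, so a new user makes one decision at first while the remaining options stay available in later tiers.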

Tags: audio description · blind and low vision · customization · video accessibility · assistive technology · large language models

Standards referenced: Web Content Accessibility Guidelines (WCAG) 2.1