Beyond Subtitles: Captioning and Visualizing Non-speech Sounds to Improve Accessibility of User-Generated Videos

Oliver Alonzo, Hijung Valentina Shin, Dingzeyu Li · 2022 · Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '22) · doi:10.1145/3517428.3544808

Summary

This paper investigates a significant gap in video captioning: the representation of non-speech sounds. While automatic speech recognition (ASR) has become widely available for generating captions on platforms like YouTube, TikTok, and Zoom, these systems focus exclusively on spoken content. True captions, as distinct from subtitles, should also include non-speech sounds such as music, environmental noises, laughter, and sound effects, which carry important narrative and contextual information.

The researchers conducted two studies. First, formative semi-structured interviews with 11 Deaf and hard of hearing (DHH) participants (5 Deaf, 6 hard of hearing, mean age 30) explored their experiences with online videos, captions, and non-speech sounds. Participants were shown three sample videos (sports, news, entertainment) in three conditions: without non-speech sound captions, with text-based captions for non-speech sounds (using the conventional bracket notation, such as [laughter]), and with graphic captions (GIFs and animated stickers overlaid on the video).

Second, the formative findings informed the design of a prototype authoring tool that uses automatic sound event detection to help video creators add both text-based and graphic captions. The prototype was evaluated with 10 hearing video creators, who used it under three conditions: manual-only, a Wizard-of-Oz error-free automatic system, and a real automatic sound detection system with errors.

Key findings

DHH participants expressed strong interest in having important non-speech sounds captioned, but emphasized selectivity: sounds should be captioned when they are relevant to the storyline, not exhaustively catalogued. The criteria for which sounds to include varied by video content, length, number of speakers, and individual viewer preferences, reflecting the diversity within the DHH community.

DHH participants identified distinct trade-offs between text-based and graphic captions. Text-based captions were familiar, unobtrusive, and better suited to serious content, while graphic captions provided richer detail and were more appropriate for entertainment and social media content. Several participants suggested that graphic captions should be optional and overlaid on demand, analogous to closed versus open captioning. Text-based captions of non-speech sounds can also help DHH viewers distinguish captioned videos without dialogue from uncaptioned videos. This is a genuine usability problem: when no captions appear, viewers cannot tell whether the video is uncaptioned or simply lacks spoken content.

In the creator study, hearing video creators wanted automatic systems to identify important sounds and provide accurate timestamps, which they found more valuable than perfect sound descriptions. Creators also found it challenging to describe non-speech sounds in text, since they had to balance completeness against concision, and they suggested that guidance for structuring descriptions would be helpful.

Relevance

This research has practical implications for video captioning, content creation, and platform accessibility. The distinction between captions and subtitles is often misunderstood: many "auto-caption" systems actually produce subtitles only, leaving DHH viewers without access to crucial non-speech audio information. For accessibility practitioners, the study provides evidence-based guidance on captioning non-speech sounds: caption only the sounds that matter, include details about sound source and location, balance richness against the potential for distraction, and consider the video type and audience when choosing between text and graphic representations.

The concept of graphic captions as an alternative or complement to text-based bracket notation opens new design possibilities for making audio content accessible. DHH participants also compared unimportant sounds to decorative images in alt-text guidelines, which shows how accessibility principles can transfer across modalities. For platform developers, the study highlights the need for captioning technologies that go beyond ASR to include sound event detection, and for authoring tools that help creators add non-speech sound information efficiently.

Tags: deaf and hard of hearing · captioning · non-speech sounds · automatic captions · video accessibility · user-generated content · sound event detection · graphic captions