SoundNarratives: Rich Auditory Scene Descriptions to Support Deaf and Hard of Hearing People

Liang-Yuan Wu, Dhruv Jain · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746341

Summary

This paper introduces SoundNarratives, a real-time system that generates rich, contextual auditory scene descriptions tailored to deaf and hard of hearing (DHH) users. Existing sound recognition technologies typically classify sounds into predefined categories such as "door opening" or "speech," which fail to capture the complexity of real-world auditory environments, including temporal variations, overlapping sounds, emotional cues, and spatial dynamics.

The researchers conducted a two-phase study. First, a formative study with 10 DHH participants (ages 21-66, with varying levels of hearing loss and identities including Deaf, deaf, and hard of hearing) identified nine sound parameters that DHH individuals need for comprehensive auditory scene understanding: sound class, loudness, speaker dynamics, spatial dynamics, emotion, pace, prominence, pattern, and semantic description. These parameters were derived from semi-structured interviews in which participants reflected on their current use of sound awareness technologies and on what additional information would improve their situational awareness.

The team then built SoundNarratives on AudioFlamingo, a state-of-the-art audio-language model, with systematically designed prompts for each parameter. A sound processing engine extracts information across all nine parameters from real-time audio, and GPT-4 then compresses and synthesizes these outputs into coherent, user-friendly descriptions. A React-based web interface exposes the customization options: sensing length, description word count, writing style (Narrative, Essential, or Storyline), and which sound parameters to include.
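To make the two-stage flow concrete, here is a minimal Python sketch that assumes the audio-language model and the summarizing LLM can be passed in as plain callables. The names `Settings` and `describe_scene`, the prompt wording, and the default values are hypothetical illustrations. They are not the SoundNarratives code or its actual prompts; the paper's prompts were designed per parameter, e.g. hierarchical prompts for sound class.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# The nine sound parameters identified in the formative study.
SOUND_PARAMETERS: List[str] = [
    "sound class", "loudness", "speaker dynamics", "spatial dynamics",
    "emotion", "pace", "prominence", "pattern", "semantic description",
]

# The three writing styles offered by the interface.
STYLES = {"Narrative", "Essential", "Storyline"}


@dataclass
class Settings:
    """Hypothetical user settings; default values are illustrative only."""
    sensing_seconds: float = 10.0  # audio buffered per description (consumed upstream; unused here)
    word_count: int = 40
    style: str = "Essential"
    parameters: List[str] = field(default_factory=lambda: list(SOUND_PARAMETERS))

    def __post_init__(self) -> None:
        # Reject styles or parameters the interface does not offer.
        if self.style not in STYLES:
            raise ValueError(f"unknown style: {self.style}")
        unknown = set(self.parameters) - set(SOUND_PARAMETERS)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")


def describe_scene(
    audio_clip: bytes,
    settings: Settings,
    audio_model: Callable[[bytes, str], str],  # stand-in for the audio-language model
    text_model: Callable[[str], str],          # stand-in for the LLM compression step
) -> str:
    # Stage 1: one narrow, constrained prompt per selected parameter.
    findings: Dict[str, str] = {}
    for param in settings.parameters:
        prompt = f"Describe only the {param} of the sounds in this clip, in one short phrase."
        findings[param] = audio_model(audio_clip, prompt)

    # Stage 2: merge the per-parameter outputs into one description
    # that respects the chosen style and word budget.
    lines = "\n".join(f"- {p}: {v}" for p, v in findings.items())
    summary_prompt = (
        f"Combine these observations into a {settings.style}-style description "
        f"of at most {settings.word_count} words:\n{lines}"
    )
    return text_model(summary_prompt)
```

Issuing one narrow prompt per parameter mirrors the reported finding that constrained prompts beat open-ended ones. In this sketch, the word count and style settings take effect only in the compression step.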

Key findings

The prompt engineering experiments showed that AudioFlamingo performed well on concrete, perceptually grounded parameters, namely sound class (93% accuracy with hierarchical prompts), emotion (80%), and prominence (85%), but struggled with more abstract attributes such as spatial dynamics (53%) and sound pattern (64%). Structured, constrained prompts consistently outperformed open-ended prompts across all parameters. In the user evaluation with 10 DHH participants, SoundNarratives was significantly preferred over raw AudioFlamingo output: 84 of 100 comparisons favored the structured descriptions (p < .05). Participants rated accuracy at 4.1-4.2 out of 5, usefulness at 3.9-4.2, and overall satisfaction at 4.0-4.2. All participants reported increased confidence and situational awareness. A notable cultural difference emerged: Deaf participants (N=6) preferred the more abstract Narrative writing style, while hard of hearing participants (N=4) favored the concise Essential style, which lists key auditory elements without elaborate sentence structure. Nearly all participants (N=9) said SoundNarratives complemented their visual cues, filling gaps that purely visual strategies missed. Concerns included occasionally ambiguous wording, information overload during fast-paced scenes, and the privacy implications of cloud-based audio processing.
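The summary reports only p < .05 for the 84-of-100 preference. One way to check that figure is an exact two-sided sign (binomial) test against a no-preference null of 50%. The summary does not name the test used, so this choice is an assumption. The sketch below also treats all 100 comparisons as independent, even though they come from only 10 participants. It is therefore an illustration, not the paper's analysis; `sign_test_p_value` is a hypothetical helper.

```python
from math import comb


def sign_test_p_value(successes: int, trials: int) -> float:
    """Exact two-sided binomial (sign) test against a 50/50 null."""
    # Under p = 0.5 the distribution is symmetric, so the two-sided
    # p-value is twice the tail beyond the more extreme count.
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


p = sign_test_p_value(84, 100)
print(f"p = {p:.1e}")  # a value far below .05
```

Under these assumptions the preference is well beyond the reported threshold. A mixed-effects model or a per-participant test would handle the clustering of comparisons within participants more faithfully.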

Relevance

SoundNarratives advances auditory accessibility by moving beyond simple sound classification toward holistic, context-aware scene understanding. For accessibility practitioners, the work shows that DHH users need far more than sound labels. They also need information about loudness, spatial dynamics, emotional tone, temporal patterns, and semantic context to achieve real situational awareness. The nine-parameter framework provides a structured vocabulary for reasoning about which auditory information matters to DHH users. The finding that Deaf and hard of hearing participants prefer different description styles reinforces the value of customizable outputs over one-size-fits-all solutions. The system is open-sourced on GitHub, making it available for further development. Participants' privacy concerns about continuous audio capture are important considerations for any real-world deployment of sound awareness technology. The paper suggests selective audio filtering and on-device processing as potential mitigations.

Tags: deaf and hard of hearing · sound awareness · generative AI · audio-language models · prompt engineering · situational awareness · auditory scene analysis