Text-to-Speeches: Evaluating the Perception of Concurrent Speech by Blind People

João Guerreiro, Daniel Gonçalves · 2014 · ASSETS '14: Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility · doi:10.1145/2661334.2661367

Summary

This paper investigates whether blind people can leverage the Cocktail Party Effect (the human ability to focus on one speech source among several while still detecting relevant content in the background) to scan digital information more efficiently. Screen readers present content sequentially, which is a major bottleneck for blind users trying to skim or scan content the way sighted users do visually. The authors built Text-to-Speeches, a Java framework that positions pre-recorded audio files in 3D space using Head-Related Transfer Functions (HRTFs) for spatial localisation, allowing multiple speech sources to play simultaneously from different directions. Twenty-three visually impaired participants (17 male, 6 female, aged 22 to 62) listened to two, three, or four concurrent news snippets in Portuguese and were asked to identify the relevant source based on keyword cues, report its content, and answer comprehension questions. Voice characteristics were varied across three conditions: same voice, small separation (human-like pitch/formant variation), and large separation.
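
To make the spatial layout concrete, the sketch below shows one way concurrent speech sources could be assigned directions before HRTF rendering. It is an illustration only: the even spread across the frontal half-plane (from -90° on the left to +90° on the right), the function names, and the coordinate convention are assumptions, not details taken from the paper or from the Text-to-Speeches framework.

```python
import math


def frontal_azimuths(n_sources: int) -> list[float]:
    """Spread n_sources evenly from -90 (left) to +90 (right) degrees.

    Assumption for illustration: an even frontal spread; the summary
    does not state the angles actually used in the study.
    """
    if n_sources < 1:
        raise ValueError("need at least one source")
    if n_sources == 1:
        return [0.0]  # a single source sits straight ahead
    step = 180.0 / (n_sources - 1)
    return [-90.0 + i * step for i in range(n_sources)]


def azimuth_to_xy(azimuth_deg: float, distance: float = 1.0) -> tuple[float, float]:
    """Convert an azimuth to (x, y) on the horizontal plane.

    Hypothetical convention: x points to the listener's right,
    y points straight ahead, and 0 degrees is straight ahead.
    """
    rad = math.radians(azimuth_deg)
    return (distance * math.sin(rad), distance * math.cos(rad))


if __name__ == "__main__":
    # The study used two, three, or four concurrent talkers.
    for n in (2, 3, 4):
        angles = frontal_azimuths(n)
        positions = [azimuth_to_xy(a) for a in angles]
        print(n, angles, positions)
```

Each (x, y) position would then be handed to whatever spatial audio engine applies the HRTF filtering; that step is outside the scope of this sketch.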

Key findings

Participants successfully identified the relevant speech source 82% of the time across all conditions. With two simultaneous talkers, identification was nearly perfect: 20 of 23 participants identified the source correctly in all six trials. With three talkers, 15 participants still identified it in at least five of six trials. Four talkers proved too challenging, with no participant achieving perfect identification. Spatial location was the dominant cue for identification, preferred by participants over voice characteristics. For content comprehension, seven participants could report more than half the sentence content with two talkers, and three maintained this with three talkers. Working memory (digit span scores) strongly correlated with comprehension performance for two and three talkers (significance levels ranging from p<0.05 to p<0.01), suggesting it is a key factor in determining how many concurrent sources a user can handle. Early-blind participants (congenital blindness or onset before age 18) were the only ones able to complete four-talker conditions, suggesting neuroplasticity enhances speech segregation abilities. Participants preferred and felt more confident with different voices, though voice variation did not significantly improve identification or intelligibility.

Relevance

This research proposes a fundamentally different approach to screen reader interaction: instead of presenting content sequentially, present multiple items simultaneously using spatial audio, allowing blind users to scan content more like sighted users skim visually. For accessibility practitioners, the practical implications are significant: two or three concurrent speech channels could accelerate tasks like scanning search results, news feeds, or email headings. The finding that working memory is a key limiting factor means that implementations should consider user characteristics and task demands when choosing between two and three simultaneous sources. The finding that sound location was the primary identification cue suggests that future screen reader interfaces could use spatialised audio to create an "audio landscape" of content. The neuroplasticity finding also highlights that early-blind users may have enhanced auditory processing capabilities that technology could better exploit.

Tags: visual impairment · blindness · screen readers · speech perception · cocktail party effect · spatial audio · information scanning · neuroplasticity