Comparing Effects of Navigational Interface Modalities on Speaker Prosodics

Julie Baca · 1998 · Proceedings of the Third International ACM Conference on Assistive Technologies (Assets '98) · doi:10.1145/274497.274499

Summary

This paper investigates whether speech-only (displayless) interfaces impose a measurable cognitive burden on users compared with multimodal interfaces that include visual or tactile components. Rather than relying on subjective workload questionnaires, the author measures changes in speech prosodics (the nonverbal aspects of speech, including pauses, pitch, and intonation) as an objective indicator of cognitive load. The experiment used the U.S. Army Corps of Engineers WES Auto Travel prototype, a navigational system that allowed users to query a spatial database about routes, traffic, and landmarks. Participants used the system in two conditions: a displayless (speech-only) interface and a multimodal interface augmented with an interactive audio-tactile map display. Over 90 subjects participated: over 60 individuals with vision loss (split between congenital and adventitious) and over 30 sighted subjects, recruited from universities and rehabilitation agencies in Mississippi, Arkansas, and Louisiana over approximately three months. Speech was recorded during the sessions and post-processed with ToBI (Tones and Break Indices) labeling. The prosodic analysis examined pauses, hesitation patterns, fundamental frequency (F0), and boundary tones; speech recognition errors were analyzed alongside these measures.

Key findings

Statistical analysis revealed significant prosodic differences between the displayless and multimodal conditions in all three subject groups (congenital vision loss, adventitious vision loss, and sighted). In the displayless condition, all groups produced significantly more hesitation pauses (labeled "2p" in ToBI) at non-phrase-boundary locations, indicating increased cognitive processing demands. Sighted subjects and those with adventitious vision loss showed significantly lower minimum pitch (F0) values in the displayless condition. The speech recognition system made significantly more substitution errors during displayless sessions for the sighted and adventitious vision loss groups, and more rejection errors for subjects with congenital vision loss, which suggests that cognitively loaded speech is harder for recognition systems to process. Subjects with congenital vision loss showed some distinct patterns, including more pauses at phrase boundaries ("3p" in ToBI) in addition to mid-phrase hesitations. This suggests different cognitive processing strategies, possibly related to their adaptation to non-visual information processing. Subjects with adventitious vision loss produced results most similar to those of sighted subjects, which is consistent with the idea that visual memory influences how they process spatial information.
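The hesitation-pause comparison described above can be illustrated with a minimal sketch. It assumes each session has been reduced to a list of ToBI break-index labels, one per word boundary. The label sequences, the dictionary layout, and the function name disfluency_rates are hypothetical and chosen only for illustration; they are not data or tooling from the paper, and the paper's own statistical tests are not reproduced here.

```python
from collections import Counter

# Hypothetical break-index tiers, one label per word boundary.
# These sequences are invented for illustration; they are NOT study data.
sessions = {
    "displayless": ["1", "2p", "1", "4", "1", "2p", "3p", "1", "4"],
    "multimodal": ["1", "1", "4", "1", "3", "1", "1", "4"],
}


def disfluency_rates(break_indices):
    """Return the share of boundaries labeled 2p and 3p in one tier."""
    counts = Counter(break_indices)
    total = len(break_indices)
    # Counter returns 0 for labels that never occur, so absent labels are safe.
    return {label: counts[label] / total for label in ("2p", "3p")}


# Per-condition rates, ready to be compared across subjects or groups.
rates = {condition: disfluency_rates(tier) for condition, tier in sessions.items()}

for condition, by_label in rates.items():
    print(condition, by_label)
```

Normalizing by the number of boundaries keeps sessions of different lengths comparable. A real analysis would aggregate these rates per subject and then test the difference between conditions within each group.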

Relevance

This research has implications for accessibility that extend well beyond its 1998 context. The finding that speech-only interfaces measurably increase cognitive load, detectable through involuntary changes in speech patterns, offers objective evidence for a burden that blind and visually impaired users may experience when interacting with voice-only systems. This is directly relevant to the design of modern voice assistants, phone-based IVR systems, and any interface where spatial or structured information must be conveyed without visual support. The methodology itself is valuable: prosodic analysis as an objective measure of cognitive burden could complement or replace subjective workload assessments in accessibility evaluations. The differences between the congenital and adventitious vision loss groups, which point to different cognitive strategies, remind practitioners that the blind user population is not homogeneous. For interface designers, the results argue strongly for multimodal approaches (adding tactile, spatial, or other non-speech channels alongside voice) rather than relying on speech alone for complex navigational or spatial tasks.

Tags: speech technology · cognitive load · non-visual interaction · navigation · prosody · displayless interface · multimodal interaction · blind and low vision · spatial data