SoundSpace: What and Where Through Sound

Amber Maimon, Iddo Yehoshua Wald, Rahaf Sobh, Carol Sliman, Yarah Nassar, Joel Lanir · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) · doi:10.1145/3772363.3798944

Summary

SoundSpace is a real-time sensory substitution system designed to give blind and visually impaired users simultaneous awareness of what objects are present in a scene and where they are located, without relying exclusively on verbal scene descriptions. The authors argue that existing accessibility tools fall into two camps: speech-based systems (screen readers, scene describers like SnapStick, wayfinding tools like NavCog) that interrupt ambient hearing and impose high cognitive load, and spatialized audio systems (Microsoft Soundscape, SWAN, Personal Guidance System) that convey orientation cues but not object identity. SoundSpace bridges the gap by combining brief spoken object names with continuous cross-modal audio mappings grounded in prior work on the TopoSpeech and TopoLanguageDepth algorithms. Horizontal position maps to stereo panning, vertical position to pitch, and depth to loudness with a low-pass filter (distant objects sound quieter and muffled). The implementation uses YOLO-World for open-vocabulary object detection and MiDaS for monocular depth estimation, running on commodity hardware (Apple M1, 200-400 ms per frame) with a React/TypeScript frontend using Tone.js and a FastAPI/PyTorch backend. Users configure 'environment profiles' (kitchen, office, outdoor, navigation, or custom LLM-generated vocabularies) to restrict what gets announced. A dual-interval timing model separates scene scanning (every 1-2 s) from audio readout sweeps (every 4-6 s), with event-driven interrupts when objects shift more than 10% of frame width. This is a design-and-implementation paper; no user evaluation is reported.
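
As an illustration of the position-to-sound mapping described above, the following is a minimal sketch and not the authors' implementation. It assumes normalized coordinates in the range 0 to 1, with x = 0 at the left edge, y = 0 at the top of the frame, and depth = 0 for the nearest object. The pitch range, gain floor, filter cutoffs, and the direction of the pitch mapping (higher in the frame sounds higher) are assumptions chosen for readability; the paper summary does not give these values. The names `AudioParams` and `map_to_audio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AudioParams:
    pan: float        # -1.0 (full left) to +1.0 (full right)
    pitch_hz: float   # tone frequency in Hz
    gain: float       # linear loudness, 0..1
    cutoff_hz: float  # low-pass filter cutoff in Hz


def map_to_audio(x, y, depth,
                 low_hz=220.0, high_hz=880.0,         # assumed pitch range
                 min_gain=0.2,                        # assumed gain for the farthest object
                 min_cutoff=500.0, max_cutoff=8000.0  # assumed filter range
                 ):
    """Map a normalized object position to audio parameters (illustrative only)."""
    # Clamp inputs so out-of-range detections cannot produce invalid audio values.
    x, y, depth = (min(max(v, 0.0), 1.0) for v in (x, y, depth))

    # Horizontal position -> stereo pan.
    pan = 2.0 * x - 1.0

    # Vertical position -> pitch, interpolated on a log-frequency scale
    # so equal steps in height sound like equal musical intervals.
    height = 1.0 - y  # image y grows downward
    pitch_hz = low_hz * (high_hz / low_hz) ** height

    # Depth -> loudness: farther objects are quieter.
    gain = 1.0 - (1.0 - min_gain) * depth

    # Depth -> low-pass cutoff: farther objects are more muffled.
    cutoff_hz = max_cutoff * (min_cutoff / max_cutoff) ** depth

    return AudioParams(pan, pitch_hz, gain, cutoff_hz)


centre_near = map_to_audio(0.5, 0.5, 0.0)
top_left_far = map_to_audio(0.0, 0.0, 1.0)
```

In a browser frontend, values like these would be handed to an audio library's panner, oscillator, gain, and filter nodes; that wiring is omitted here.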

Key findings

This is a system description rather than an empirical study, so the 'findings' are design contributions and implementation results rather than measured outcomes. Key contributions: (1) a sonification approach that integrates object naming with continuous spatial audio encoding of horizontal, vertical, and depth position, extending the prior 2D TopoSpeech/TopoLanguageDepth work into 3D real-time use; (2) a periodic-sweep-plus-event-interrupt timing model that the authors argue balances scene awareness against cognitive load and preserves ambient hearing; (3) context-aware filtering via open-vocabulary detection and user-editable environment profiles, with an optional LLM-powered vocabulary generator (Anthropic Claude Sonnet 4) that produces object lists from natural-language scene descriptions like 'busy coffee shop'; (4) a working real-time implementation on consumer hardware. The paper grounds its design choices in cross-modal correspondence research (the SMARC effect for pitch-height, ecological panning cues) and in neuroplasticity evidence for auditory-to-visual cortical recruitment during sensory substitution. The authors explicitly acknowledge that per-frame depth normalization forces users to re-establish reference points in each new scene, and flag this as a limitation that metric depth models could address.
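
The depth-normalization limitation can be made concrete with a small sketch. It assumes that "per-frame normalization" means min-max rescaling of each frame's depth values, which the summary does not confirm, and it uses made-up metric distances purely for illustration (a relative depth estimator does not output metres). The function name `normalize_per_frame` is hypothetical.

```python
def normalize_per_frame(depths):
    """Rescale one frame's depth values to 0..1 (assumed min-max scheme)."""
    lo, hi = min(depths), max(depths)
    if hi == lo:
        # All objects at the same depth: nothing to distinguish.
        return [0.0 for _ in depths]
    return [(d - lo) / (hi - lo) for d in depths]


# Hypothetical distances in metres for two scenes.
kitchen = [0.5, 1.0, 2.0]
street = [2.0, 10.0, 30.0]

kitchen_norm = normalize_per_frame(kitchen)
street_norm = normalize_per_frame(street)

# The object 2.0 m away is the farthest thing in the kitchen (1.0)
# but the nearest thing on the street (0.0), so it would sound
# quiet and muffled in one scene and loud and clear in the other.
same_object_kitchen = kitchen_norm[2]
same_object_street = street_norm[0]
```

Under this assumption, loudness only ranks objects within the current frame, which is why a listener has to find a new reference point after each scene change and why a metric depth model would remove the ambiguity.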

Relevance

For accessibility practitioners building assistive tools for blind and low vision users, SoundSpace is a useful reference point for how to combine modern computer vision (open-vocabulary detection, monocular depth) with principled sensory substitution design rather than defaulting to verbose verbal scene descriptions. The dual-interval timing model and the environment-profile pattern for bounding what the system reports are both directly applicable to other AI-powered assistive tools struggling with output verbosity and hallucination. The major caveat is that the system has not been evaluated with blind or low vision participants: usability, learning curves, real-world cognitive load, and the trade-off between spatialized audio and environmental awareness all remain open. Practitioners should also note the desktop form factor; any deployment will require mobile hardware and careful headphone choice to avoid masking safety-critical environmental sounds.
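
For readers who want to reuse the two patterns mentioned above, the following is a rough sketch of how a profile filter and an event-driven interrupt check might fit together. It is not taken from the paper: the names `Detection`, `filter_by_profile`, `moved_objects`, and `should_sweep` are hypothetical, objects are matched between scans by label alone (which assumes at most one instance per label), and the 5-second sweep interval is simply a value inside the 4-6 s range given in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    x: float  # horizontal centre as a fraction of frame width, 0..1


def filter_by_profile(detections, vocabulary):
    """Keep only detections whose label is in the active environment profile."""
    allowed = {word.lower() for word in vocabulary}
    return [d for d in detections if d.label.lower() in allowed]


def moved_objects(previous, current, threshold=0.10):
    """Return detections that shifted horizontally by more than `threshold`
    (10% of frame width by default) since the previous scan.
    Newly appeared objects are left for the next periodic sweep."""
    last_x = {d.label: d.x for d in previous}
    return [d for d in current
            if d.label in last_x and abs(d.x - last_x[d.label]) > threshold]


def should_sweep(seconds_since_last_sweep, sweep_interval=5.0):
    """Periodic readout: true once the assumed sweep interval has elapsed."""
    return seconds_since_last_sweep >= sweep_interval


kitchen_profile = ["cup", "kettle"]
previous_scan = [Detection("cup", 0.20), Detection("dog", 0.50)]
current_scan = [Detection("cup", 0.45), Detection("dog", 0.90),
                Detection("kettle", 0.70)]

visible = filter_by_profile(current_scan, kitchen_profile)
interrupts = moved_objects(filter_by_profile(previous_scan, kitchen_profile),
                           visible)
```

Here the dog moves a long way but is never announced because it is outside the kitchen profile, while the cup triggers an interrupt; filtering before the movement check is one way to keep out-of-profile objects from generating audio at all.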

Tags: blind and low vision · sensory substitution · spatial audio · sonification · assistive technology · computer vision · cognitive load · wearables