Check Now, Can You See It?: Exploring Voice and Video-Capable Language Models for Identifying and Spatially Locating Items of Interest for Blind and Low-Vision Travelers
Aziz N Zeidieh, JooYoung Seo · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3749833
Summary
This experience report documents the lived experiences of two blind travelers, Aziz (28, blind in left eye, 20/2200 in right) and JooYoung (35, blind in right eye, limited vision in left), as they adapted commercially available voice and video-capable language models (VVLMs) for real-world spatial orientation and navigation (SON) tasks. Both authors are seasoned blind travelers with over 30 combined years of independent travel, formal orientation and mobility training, and prior experience with guide dogs. Over six months of weekly Zoom meetings and a culminating three-hour in-person session on the University of Illinois Urbana-Champaign campus, they tested four VVLMs: OpenAI's ChatGPT 4o (Advanced Voice Mode), Google's Gemini 2.5 Pro (Live with Video), xAI's Grok (Voice Mode with Video), and Meta's Live AI on Ray-Ban smart glasses. They performed three representative SON tasks: identifying a point of interest (the School of Information Sciences building), a landmark of interest (a recycling bin), and a vehicle of interest (a specific bus route). The paper situates this work in a historical progression: from crowdsourced visual question answering (the VizWiz era), to AI-driven static image analysis (the TapTapSee, Seeing AI, and Be My Eyes era), to the current paradigm of dynamic, conversational video-stream interaction with VVLMs.
Key findings
The authors developed two novel prompting techniques for adapting VVLMs to SON tasks. "Just-in-time prompting" involves pre-prompting the VVLM with a role, context about the item being sought, and a trigger phrase (e.g., "check now") that the traveler speaks when they want the AI to analyze their current view. This technique produced reliable results when users stopped, pointed the camera toward the item of interest, and tilted it slightly upward. Continuous walking-and-checking, however, was unreliable, and the models exhibited temporal confusion: they described scenes from moments earlier rather than the current view, which suggests they sample individual frames rather than analyze a continuous video stream. "Guide-by-pointing" involves extending a hand into the camera's view and asking the VVLM to identify what the hand is pointing at and to provide spatial directions (left, right, etc.). While sometimes effective, this technique exposed critical problems with egocentric spatial reasoning: the models gave directions pointing the wrong way, hallucinated with confidence (insisting an elevator was behind trash bins when it was not), and offered no mechanism to help users correct their camera aim. Across both techniques, xAI's Grok 3 and Google's Gemini 2.5 Pro provided the most accurate responses, while Meta's Ray-Ban glasses offered the best hands-free form factor. Key limitations included models analyzing sampled frames rather than true real-time video, difficulty entering detailed system prompts by voice, hallucinated responses delivered with high confidence, and the absence of interactive feedback to guide camera positioning.
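The three ingredients of just-in-time prompting (role, context about the item, trigger phrase) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the class name, the helper function, and the exact prompt wording are hypothetical and are not taken from the paper, which describes the prompts as being spoken to commercial apps rather than assembled in code.

```python
from dataclasses import dataclass


@dataclass
class JustInTimePrompt:
    """Hypothetical container for the three parts of a just-in-time pre-prompt."""
    role: str                      # who the model should act as
    item_description: str          # what the traveler is looking for
    trigger_phrase: str = "check now"  # example trigger phrase from the summary

    def render(self) -> str:
        # Illustrative wording only; the authors' actual phrasing is not given here.
        return (
            f"You are {self.role}. "
            f"I am looking for {self.item_description}. "
            f'Stay silent until I say "{self.trigger_phrase}". '
            "Each time I say it, look at the current camera view and tell me "
            "whether the item is visible and, if so, where it is relative to me."
        )


def is_trigger(utterance: str, trigger_phrase: str = "check now") -> bool:
    """Return True if the spoken utterance contains the trigger phrase.

    Matching is case-insensitive and collapses repeated whitespace.
    """
    normalized = " ".join(utterance.lower().split())
    return trigger_phrase.lower() in normalized


if __name__ == "__main__":
    prompt = JustInTimePrompt(
        role="a sighted guide for a blind traveler",
        item_description="a recycling bin",
    )
    print(prompt.render())
    print(is_trigger("Okay, check now!"))
```

Keeping the trigger phrase as a separate field mirrors the technique's central idea: the standing instructions are given once, and the short phrase is all the traveler has to say at the moment they have stopped and aimed the camera.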
Relevance
This paper provides essential first-person evidence about the current state of multimodal AI for blind navigation, a topic surrounded by considerable hype but little evaluation by actual BLV travelers. The observation that VVLMs appear to sample individual frames rather than process continuous video is a critical insight for developers building navigation aids, because real-time spatial accuracy is paramount for safety. The authors' proposed future direction, a centralized application that queries multiple VVLMs simultaneously for verification and consensus, represents a practical architecture for building trustworthy AI navigation assistance. For accessibility practitioners, the key takeaways are that VVLMs are not yet reliable enough for independent SON tasks, that confidently delivered hallucinations are a safety concern, that egocentric spatial reasoning remains a fundamental weakness, and that hands-free form factors (smart glasses) offer clear advantages over phone-based interaction for travelers. The paper also raises important questions about hardware variability (different phone cameras producing different results) and the impact of user height on camera field of view. As a lived-experience report from expert blind travelers, it carries authority that lab-based evaluations cannot match.
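The multi-model consensus idea can be sketched as a simple majority vote. This is a minimal illustration under strong assumptions, not the authors' design: the function name, the agreement threshold, and the placeholder model names are hypothetical, and each model's reply is assumed to have already been reduced to a short label such as "left" or "right", which real free-form spoken responses would not be.

```python
from collections import Counter
from typing import Dict, Optional


def consensus(answers: Dict[str, str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the answer given by more than `min_agreement` of the models.

    `answers` maps a model name to its (already simplified) answer.
    Returns None when there are no answers or no answer clears the threshold,
    so the caller can tell the traveler that the models disagree.
    """
    if not answers:
        return None
    # Normalize trivially different spellings such as "Left" and "left ".
    normalized = [a.strip().lower() for a in answers.values()]
    top_answer, votes = Counter(normalized).most_common(1)[0]
    if votes / len(normalized) > min_agreement:
        return top_answer
    return None


if __name__ == "__main__":
    # Placeholder model names; no real API is called in this sketch.
    replies = {"model_a": "Left", "model_b": "left ", "model_c": "right"}
    print(consensus(replies))
```

Returning None on disagreement, rather than picking a winner, reflects the safety concern raised above: a confident but unverified direction may be worse for a traveler than an explicit statement that the models do not agree.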
Tags: artificial intelligence · navigation · blindness and visual impairment · multimodal AI · large language models · orientation and mobility · wearable technology · lived experience