SceneScout: Towards AI-Driven Access to Street Level Imagery for Blind Users

Gaurav Jain, Leah Findlater, Cole Gleason · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3790449

Summary

Jain, Findlater and Gleason present SceneScout, a prototype web interface that uses a multimodal large language model (GPT-4o) to make street-level imagery (the panoramic pedestrian-height photography behind Apple Maps Look Around and Google Street View) directly usable by blind and low-vision (BLV) screen-reader users. The authors frame the problem as a gap in pre-travel planning: existing BLV navigation tools (BlindSquare, Soundscape, GoodMaps, Oko, tactile maps) emphasise turn-by-turn directions and landmarks but rarely convey the environmental details that sighted travellers glance at on Street View, such as sidewalk continuity, curb cuts, tactile paving, crosswalk layout, signage, and bus shelters. SceneScout addresses this through two interaction modes. Route Preview fuses successive panoramas along a planned walking route into a sequential short/medium/long narrative with a detailed destination description to support last-few-meters wayfinding. Virtual Exploration lets users move freely through a neighbourhood using natural language: they specify an intent ('quiet residential area with parks') and keywords, then choose among intent-ranked directional suggestions at each intersection. The computational pipeline slices each 360° panorama into orientation-labelled views (north/south/east/west plus a forward-facing 180° pedestrian crop), integrates map metadata (geocodes, heading vectors, POIs) from Apple Maps APIs, and prompts GPT-4o with few-shot examples, chain-of-thought reasoning, prompt chaining, and a three-description movement history to maintain spatial coherence. The web interface follows W3C accessibility guidelines and has been tested with VoiceOver.
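
The slicing and history steps in the pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes an equirectangular panorama, a 90° field of view for each cardinal view, and that the "three-description movement history" means keeping the three most recent descriptions; the function names, parameters, and placeholder description strings are all hypothetical.

```python
from collections import deque

# Assumed compass headings in degrees, clockwise from north.
CARDINAL_HEADINGS = {"north": 0, "east": 90, "south": 180, "west": 270}


def view_columns(width, center_heading, view_heading, fov):
    """Return half-open pixel column ranges covering one view.

    Assumes an equirectangular panorama `width` pixels wide whose centre
    column faces `center_heading`. Two ranges are returned when the view
    wraps around the panorama's seam.
    """
    left_edge_heading = center_heading - 180
    view_left_heading = view_heading - fov / 2
    offset_deg = (view_left_heading - left_edge_heading) % 360
    start = round(offset_deg / 360 * width) % width
    end = start + round(fov / 360 * width)
    if end <= width:
        return [(start, end)]
    return [(start, width), (0, end - width)]


def label_views(width, center_heading, travel_heading, cardinal_fov=90):
    """Map each view label to its column ranges (hypothetical helper)."""
    views = {
        name: view_columns(width, center_heading, heading, cardinal_fov)
        for name, heading in CARDINAL_HEADINGS.items()
    }
    # Forward-facing 180° crop along the direction of travel.
    views["forward"] = view_columns(width, center_heading, travel_heading, 180)
    return views


# Rolling context: only the three most recent descriptions are kept
# (an assumed reading of the "three-description movement history").
history = deque(maxlen=3)
for description in ["step 1", "step 2", "step 3", "step 4"]:
    history.append(description)

# Example: a 3600-pixel-wide panorama centred on north, walking east.
views = label_views(width=3600, center_heading=0, travel_heading=90)
```

With these inputs the south view straddles the panorama seam, so it comes back as two column ranges that would need to be stitched before being sent to the model. The bounded deque discards the oldest description automatically, which keeps the prompt context at a fixed size as the user moves.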

Key findings

A mixed-methods study (N = 10 screen-reader users, 9 totally blind, 1 with low vision; all employees of a large US tech company) evaluated both modes across familiar and unfamiliar US downtown areas. Participants rated relevance (3.9 Route Preview, 4.4 Virtual Exploration) and usefulness (4.1 / 4.2) positively and said descriptions 'shortened the learning curve significantly' for unfamiliar areas and surfaced content sighted friends often omit. Virtual Exploration was valued for building confidence to visit future neighbourhoods, relocation planning, and assessing hometowns or friends' new homes. Four concerns dominated: (1) Route Preview lacked spatial precision — participants wanted curb-cut angles ('90° vs 45°') explicitly described; (2) vague adjectives ('landscaped area,' 'quiet') felt unverifiable; (3) descriptions sometimes made unwarranted assumptions about users' abilities or explicitly named disabilities (e.g. 'meant for blind people'), which participants found paternalistic and reductive; (4) trust was fragile when anything sounded off, with participants advocating physical verification ('I don't trust nothing until I touch that pedestrian signal'). An output-quality analysis of 550+ sentences across 40 logs found 72% fully correct and 8% incorrect, with errors evenly split among hallucinations, factual errors, spatial inaccuracies, and plausible-but-unverifiable additions; 95% of described elements were likely to stay temporally consistent, making older imagery still useful for stable features like buildings and streets.

Relevance

This is a concrete demonstration of how MLLMs can unlock a previously inaccessible data source for BLV users, with transferable implications for any accessibility tool built on generative AI. Practitioners should note several design lessons: spatial precision matters more than verbosity for navigation-critical details (curb cuts, crossing geometry, tactile paving placement); uncertainty should be surfaced, not smoothed over, to align user trust with actual model reliability; AI-generated descriptions must avoid prescriptive or disability-naming language that imposes a single 'correct' way to navigate; and map metadata (high-precision but sparse) and street-level imagery (rich but variable) should be clearly distinguished in output so users can weight each appropriately. Limitations worth flagging: participants were tech-company employees with high AI familiarity, only one had low vision, no one physically navigated using the descriptions, backtracking was not supported in Virtual Exploration, and computational cost currently blocks scaled deployment. Useful reading for anyone designing AI-powered descriptions, O&M tools, or digital maps.
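
Two of the design lessons above, labelling the source of each statement and surfacing uncertainty, could be prototyped along the following lines. This is a hypothetical sketch, not something the paper specifies: the `Statement` type, the confidence score, the 0.6 threshold, and the example sentences are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    text: str
    source: str        # "map" or "imagery"
    confidence: float  # hypothetical 0.0-1.0 score, not from the paper


# Spoken prefixes that tell the listener where a statement came from.
LABELS = {
    "map": "From map data",
    "imagery": "From street-level imagery",
}


def render(statements, uncertain_below=0.6):
    """Prefix each statement with its source and hedge low-confidence ones."""
    lines = []
    for s in statements:
        hedge = ""
        if s.confidence < uncertain_below:
            hedge = " (uncertain, verify in person)"
        lines.append(f"{LABELS[s.source]}: {s.text}{hedge}")
    return "\n".join(lines)


# Illustrative, made-up statements.
output = render([
    Statement("Bus stop at the corner.", "map", 0.95),
    Statement("The curb cut appears to face the crossing at an angle.",
              "imagery", 0.4),
])
```

Plain-text prefixes are used rather than visual styling so that the distinction survives screen-reader output.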

Tags: accessibility · navigation · screen readers · AI · multimodal AI · blind and low vision · maps · wayfinding · orientation and mobility · image description · LLM

Standards referenced: W3C Web Accessibility Initiative (WAI)