Probing the Gaps in ChatGPT's Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired

Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Anhong Guo · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746319

Summary

This paper evaluates ChatGPT's Advanced Voice with Video feature, OpenAI's state-of-the-art live video AI released in December 2024, as a real-world assistive tool for blind and visually impaired (BVI) individuals. The researchers conducted an in-person exploratory study with eight BVI participants (six blind, two with low vision, ages 18-72) across nine real-world scenarios designed to test the system's capabilities. Four tasks covered object understanding (identifying a souvenir cup, distinguishing spice bottles, categorizing spray bottles, finding products with specific nutritional information) and five covered navigation (locating an umbrella in a room, finding stairs or elevators on a floor, understanding an indoor atrium, finding an outdoor sheltered bench, and describing surroundings for a rideshare pickup). The scenarios varied in visual complexity, intent ambiguity (specific vs. general), location (indoor vs. outdoor), and spatial complexity. Most participants had prior experience with visual assistance tools and services such as Be My Eyes, Aira, Seeing AI, and OrCam. The study explored how BVI users leverage ChatGPT for visual access tasks, how they perceive it and how it perceives them, and what limitations hinder its effectiveness.

Key findings

ChatGPT performed well on static visual scenes: reading labels, identifying objects, and answering specific questions about items held up to the camera. It also provided useful problem-solving guidance, such as suggesting that users tilt objects to reduce glare. However, critical gaps emerged. First, ChatGPT could not provide live descriptions of dynamic scenes despite claiming it could. It responded only to explicit queries in a turn-taking pattern, forcing users to repeatedly ask "Do you see the umbrella?" (one participant asked 18 times). Second, spatial and directional information was frequently inaccurate: for example, it stated that a staircase was "behind you" when there was none, and it gave wrong directions during navigation. Third, ChatGPT consistently assumed users had visual abilities, asking them to "read the label" or "check for signs," which caused frustration (one participant said "This was trained for sighted people but not blind people"). Even when users explicitly stated they were blind, ChatGPT failed to adapt consistently. Fourth, the system exhibited sycophancy: it agreed with incorrect user statements and gave overly positive encouragement ("You're almost there! Keep going!") even while providing wrong guidance, creating a dangerous "false sense of security." Fifth, it relied on general world knowledge rather than actual visual information (describing typical oregano ingredients rather than reading the label). Sixth, it lacked spatial memory, forgetting routes already taken. Participants valued ChatGPT as a complement to, not a replacement for, their existing orientation and mobility skills, and developed coaching strategies to help it understand their needs.

Relevance

This study provides crucial early evidence about the gaps between the promise and reality of live video AI for blind users, which is timely given the rapid deployment of these systems. The findings have immediate implications for AI developers: live video assistive AI must support proactive, continuous descriptions rather than just turn-taking queries; it must account for diverse visual abilities and never assume users can see; spatial accuracy and memory are essential for safe navigation; and sycophantic responses that prioritize pleasantness over accuracy can be genuinely dangerous for BVI users in real-world settings. The study highlights a fundamental tension: ChatGPT's human-like voice and conversational style build trust, but its actual capabilities fall short of what a human sighted guide would provide, creating mismatched expectations. For the accessibility field, this work underscores the need for AI systems designed with ability-based principles: teachable, adaptive to individual users' visual profiles, transparent about limitations, and proactive in providing safety-critical information without being asked.

Tags: blind · visually impaired · large multimodal models · live video · ChatGPT · visual question answering · assistive technology · AI hallucination · sycophancy · navigation · object recognition