How Multimodal Large Language Models Support Access to Visual Information: A Diary Study With Blind and Low Vision People

Ricardo E. Gonzalez Penuela, Crescentia Jung, Sharon Lin, Ruiying Hu, Shiri Azenkot · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3793266

Summary

This CHI 2026 paper reports a two-week diary study with 20 Blind and Low Vision (BLV) participants (ages 19–75, 11 female/9 male, 13 blind/7 low vision) investigating how multimodal large language models (MLLMs) support real-world access to visual information. The authors built VisionPal, a custom iOS application mirroring the interaction flow of Seeing AI and Be My AI, powered by GPT-4o (model version gpt-4o-2024-08-06). Users open the app, capture a photo, receive an initial one-sentence description, then engage the MLLM in follow-up chat to get goal-relevant details. Participants were recruited through the LightHouse for the Blind and Visually Impaired in San Francisco; all were existing users of Seeing AI or Be My AI. They were asked to use VisionPal as often as they wanted across daily life and submit a six-question survey (satisfaction, trust, location, goal, issues, feedback) after each use. The study collected 551 diary entries, 375 follow-up conversations, and 626 follow-up questions, plus post-study semi-structured interviews. Analysis combined three stages: coding diary context (user goals, locations), categorising user questions using Chen et al.'s VQA taxonomy (identification, verification, instruction, localization, advice, plus visual-fact extraction types), and assessing MLLM response correctness (correct, partially correct, partially incorrect, incorrect, abstained, no response, ignored). Initial descriptions were scored for hallucinations on a 3-point scale. Interviews were analysed via inductive open coding and affinity diagramming. The paper frames its core contribution as introducing the 'visual assistant' skill — a set of nine behaviours beyond accurate captioning that visual interpretation systems must exhibit to serve BLV users well — alongside three intervention points (model training, application prompting, user-facing controls) for surfacing those behaviours.
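The diary-entry survey and the response-correctness coding described above can be pictured as a small record structure. The following is a minimal sketch for illustration only: the class names, field names, and the `tally_correctness` helper are hypothetical and are not taken from the paper or from VisionPal. Only the seven correctness labels, the six survey fields, and the 5-point rating scale come from the summary.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Correctness(Enum):
    """The seven response labels listed in the summary (names are hypothetical)."""
    CORRECT = "correct"
    PARTIALLY_CORRECT = "partially correct"
    PARTIALLY_INCORRECT = "partially incorrect"
    INCORRECT = "incorrect"
    ABSTAINED = "abstained"
    NO_RESPONSE = "no response"
    IGNORED = "ignored"


@dataclass
class FollowUp:
    question: str
    question_type: str  # e.g. "identification", "verification", "localization"
    correctness: Correctness


@dataclass
class DiaryEntry:
    participant: str   # e.g. "P4"
    satisfaction: int  # rating on a 5-point scale
    trust: int         # rating on a 5-point scale
    location: str
    goal: str
    issues: str = ""
    feedback: str = ""
    follow_ups: list = field(default_factory=list)


def tally_correctness(entries):
    """Count follow-up responses per correctness label across all entries."""
    return Counter(f.correctness for e in entries for f in e.follow_ups)


# Toy data invented for illustration; these are NOT entries from the study.
toy_entries = [
    DiaryEntry(
        "P0", 4, 4, "kitchen", "read a label",
        follow_ups=[
            FollowUp("What does the label say?", "identification", Correctness.CORRECT),
            FollowUp("Where is the expiry date?", "localization", Correctness.ABSTAINED),
        ],
    ),
    DiaryEntry("P0", 5, 4, "office", "check a screen"),  # entry with no follow-up chat
]
counts = tally_correctness(toy_entries)
```

Keeping the correctness labels in an enum rather than free-text strings makes it harder for two coders to drift into slightly different spellings of the same category.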

Key findings

Participants rated VisionPal 'somewhat satisfying' (mean 4.13/5, SD 1.07) and 'somewhat trustworthy' (mean 3.76/5, SD 0.96). Initial photo descriptions were highly accurate: 91.8% (505/550) had zero hallucinations and only 0.7% had two or more. Accuracy dropped during follow-up conversations, however: of 549 answerable user questions, 56.6% were answered correctly, 6.4% partially correctly, 7.8% partially incorrectly, and 14.4% incorrectly, while 10.8% drew an abstention, 0.7% were ignored, and 3.2% failed due to API errors. Text and graphics extraction, the second most common user goal (13.79%) after identification (21.47%), had the worst outcomes: 34.6% of responses contained hallucinations (e.g., a reported cooking temperature of 145°F that did not match the actual value, medication dosages of 100mg vs 200mg, and fabricated addresses). Localization questions also performed poorly (31.8% incorrect or partially incorrect). Abstentions were inconsistent: the app refused to read some sensitive content (envelope addresses for P4) while fully transcribing similar content for others (P11's salary letter); gender and eye-colour questions were refused for P10 but answered for P11 and P8. Participants used VisionPal primarily in living spaces (66%), followed by transit (9.79%), work (9.39%), and leisure (7.99%). Engagement varied widely across the 20 participants (4 to 46 conversations; 12 to 65 questions each). Qualitative findings surfaced the 'visual assistant' skill, which comprises nine behaviours: neutral factual communication, adaptive communication protocols, goal-oriented collaboration, content quality guidance, comprehensive information provision, contextual self-awareness, privacy protection, transparent uncertainty handling, and graceful hand-off. The authors argue that MLLMs currently over-optimise for plausible-sounding responses because RLHF rewards helpful-seeming answers over calibrated uncertainty.
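The headline percentages above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this summary; the variable names are illustrative, and the derived shares (the combined incorrect share and the not-fully-correct share) are simple sums of the reported values rather than numbers stated in the paper.

```python
# Figures as quoted in the summary above.
zero_hallucination = 505
scored_descriptions = 550

# Share of initial descriptions with no hallucinations, to one decimal place.
zero_hallucination_pct = round(100 * zero_hallucination / scored_descriptions, 1)

# Reported outcome shares for the 549 answerable follow-up questions.
outcome_pct = {
    "correct": 56.6,
    "partially correct": 6.4,
    "partially incorrect": 7.8,
    "incorrect": 14.4,
    "abstained": 10.8,
    "ignored": 0.7,
    "api error": 3.2,
}

# The shares should add up to roughly 100%; any gap is rounding in the reported values.
total_pct = round(sum(outcome_pct.values()), 1)

# Derived shares (simple sums, not figures stated in the paper).
wrong_pct = round(outcome_pct["incorrect"] + outcome_pct["partially incorrect"], 1)
not_fully_correct_pct = round(100 - outcome_pct["correct"], 1)

print(f"zero-hallucination descriptions: {zero_hallucination_pct}%")
print(f"sum of outcome shares: {total_pct}%")
print(f"incorrect or partially incorrect: {wrong_pct}%")
print(f"not fully correct: {not_fully_correct_pct}%")
```

The outcome shares sum to 99.9%, which is consistent with each value having been rounded to one decimal place, and 505/550 reproduces the quoted 91.8%.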

Relevance

For practitioners building or evaluating AI visual interpretation tools (Seeing AI, Be My AI, Envision, Aira, and successors), this paper offers foundational empirical evidence of how MLLMs actually perform in BLV users' everyday lives rather than on lab benchmarks. The 56.6% conversational accuracy figure and the 34.6% hallucination rate on text extraction are particularly consequential: users cannot safely rely on these systems for medication dosages, cooking temperatures, or financial documents without sighted verification, yet the systems' confident tone discourages that verification. The 'visual assistant' skill framework gives teams a concrete checklist for product review beyond captioning accuracy, especially its calls for transparent uncertainty, consistent abstention, proactive capture guidance, and user-configurable privacy filters. The limitations the authors acknowledge matter for generalisation: participants were all existing visual-interpretation-app users (missing novice perspectives), primarily US-based, recruited through one nonprofit, and studied during a US holiday season. The paper is essential reading for accessibility engineers, AI safety researchers working on multimodal systems, and standards bodies considering reliability thresholds for disability-facing AI.

Tags: AI · accessibility · multimodal large language models · MLLM · visual question answering · blindness and low vision · diary study · visual interpretation · hallucination · conversational AI · trust · Seeing AI · Be My AI · assistive technology