Say It My Way: Exploring Control in Conversational Visual Question Answering with Blind Users

Farnaz Zamiri Zeraati, Yang Cao, Yuehan Qiao, Hal Daumé III, Hernisa Kacorri · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791834

Summary

This CHI 2026 paper investigates how blind users can exert control over responses generated by conversational visual question answering (VQA) systems built on vision-language models. While prompting and steering techniques are well established in general-purpose generative AI, assistive VQA tools for blind users still follow rigid interaction patterns with limited opportunities for customization. The authors argue that user control becomes especially consequential for blind users who may rely on these systems for access in everyday and high-stakes situations. The study recruits 11 blind participants (four totally blind, seven legally blind) and runs a three-phase multi-day protocol: an in-lab session where participants are introduced to five customization techniques (binary feedback, zero-shot style prompting, zero-shot intention prompting, chain-of-thought prompting, and image-as-text prompting) through engineered scenarios; a 10-day diary study using Be My AI in daily life; and a remote post-study interview. The techniques are drawn from Schulhoff et al.'s taxonomy of prompting strategies and adapted to short, mobile-based, blind-user interactions. The authors analyse 418 interactions across lab and diary sessions, triangulated with reflections and interview transcripts, and release the resulting Ask&Prompt dataset of images, dialogues, context metadata, and task-success annotations. The paper contributes empirical findings on real-world blind user adaptation to VQA systems, and design implications for future live VQA tools.

Key findings

Interactions were often lengthy and imbalanced: participants averaged 3 turns (up to 21), with user input typically about one-tenth the length of system responses (median 242 words in lab, 160 in diary). k-means clustering surfaced three interaction patterns: short successful dialogues (median 2 rounds, n=286), short-to-medium unsuccessful dialogues where participants gave up (n=53), and long, effortful dialogues with mixed outcomes (median 10.5 rounds, n=24), where customization was heaviest. Customization was used more in familiar environments (55%) than in unfamiliar ones (34%), with familiar items than with unfamiliar ones (54% vs 47%), around familiar people (60%), and in leisurely (54%) rather than hurried (46%) contexts. The most common strategy remained simply asking questions without any prompting; 77% of lab and 75% of diary interactions included at least one direct question. Failures clustered around inaccessible or missing image content, model limitations (the system avoided distance and time estimates), and model errors (outdated expiration dates, wrong prices). Participants developed new techniques not introduced in the lab, including decomposition, self-criticism, and action-oriented prompting ("I'm gonna take a picture of me touching a bottle..."). Trust was task-dependent: high for clear description tasks, low for localization under time pressure, and very low for high-stakes medication reading, where participants preferred sighted verification or cross-checking across multiple AI tools.
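To illustrate the kind of clustering mentioned above, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the authors' pipeline. The feature choice (number of rounds plus a weighted success flag), the weight of 3, the farthest-first initialization, and the eleven toy interactions are all assumptions made up for this example; the summary does not say which features or settings the paper used.

```python
from statistics import median

# Hypothetical toy data: (rounds, success flag). NOT taken from the paper.
interactions = [
    (1, 1), (2, 1), (2, 1), (3, 1),      # short, successful
    (2, 0), (3, 0), (4, 0),              # short, unsuccessful
    (9, 1), (10, 0), (12, 1), (11, 0),   # long, mixed outcomes
]

# Assumed weight so the 0/1 success flag is comparable in scale to round counts.
SUCCESS_WEIGHT = 3
points = [(rounds, SUCCESS_WEIGHT * ok) for rounds, ok in interactions]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(pts, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [pts[0]]
    while len(centroids) < k:
        centroids.append(max(pts, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return [list(c) for c in centroids]


def kmeans(pts, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_first_init(pts, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in pts]
        if new_labels == labels:
            break  # assignments unchanged, so the centroids are stable too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


centroids, labels = kmeans(points, k=3)

# Per-cluster summary: (median rounds, cluster size).
summary = {}
for c in range(3):
    rounds = [r for (r, _), lab in zip(interactions, labels) if lab == c]
    summary[c] = (median(rounds), len(rounds))

for c, (med, n) in summary.items():
    print(f"cluster {c}: median rounds = {med}, n = {n}")
```

On this toy input the three groups separate cleanly because the long dialogues sit far from the short ones on the rounds axis. With real interaction logs, the result would depend heavily on feature scaling, the choice of k, and initialization.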

Relevance

For accessibility practitioners designing or evaluating AI visual-assistance tools, this paper is a grounded look at how real blind users adapt to current VQA systems and where those systems fall short. The documented needs translate directly into concrete design requirements for products like Be My AI, Seeing AI, Aira, and emerging live VQA systems: on-demand verbosity control (a "hurry mode"), user-centred spatial references (relative to the user's body or a landmark rather than the image frame), proactive camera-framing guidance, persistent memory of preferences across sessions, and verification support for high-stakes tasks like medication identification. The finding that customization rises in familiar contexts and drops under time pressure should shape default behaviours. Limitations include a small U.S.-based sample of 11 participants, the fixed platform (Be My AI), which could not retain preferences across sessions, and the absence of low-vision participants using residual sight alongside the system.

Tags: blind users · generative AI · visual question answering · VQA · personalization · customization · prompt engineering · Be My AI · large language models · assistive technology