Toward Independent Online Shopping of the Visually Impaired Through Voice-based Computer-Using Agent

Subin Shin, Jeesun Oh, Suhyun Kim, Seoyeon Eom, Sangwon Lee · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791681

Summary

This CHI 2026 paper investigates how visually impaired users might shop online independently by interacting with a voice-based Computer-Using Agent (CUA), an AI agent built on a Large Multimodal Model (LMM) that can perceive a screen, reason about its contents, and manipulate a graphical interface on the user's behalf. The authors frame online shopping as an especially difficult accessibility context because product information is conveyed through images, patterns, and unstructured layouts, while alternative text is often missing or inadequate. Screen readers linearise this content but cannot interpret aesthetic or trend-based attributes, leaving blind shoppers dependent on sighted assistance or guesswork. To explore how a voice-based CUA could change this, the researchers ran a Semi-Automatic Wizard-of-Oz study with 12 Korean adults who were either totally blind or had severe low vision (all with acquired visual impairment). Participants completed a real shopping task on Naver Shopping using OpenAI's Operator CUA, with a human wizard transcribing spoken utterances to bypass speech-to-text errors. Each session included a 20-minute briefing, a 50-minute shopping task, and a 60-minute debriefing interview. The authors collected 226 communication logs and 13 hours of interviews, then applied inductive thematic analysis to produce a codebook of 13 codes across seven themes. The contribution is qualitative: rather than benchmarking CUA performance, it maps the strategies, expectations, and frustrations of blind users engaging with an agent that replaces, rather than narrates, the visual interface.

Key findings

Participants used the CUA to compensate for the absence of visual and social cues in four consistent ways: looking through others' eyes (asking about trends and socially appropriate colour combinations), relying on social proof (all 12 participants requested popularity-ranked recommendations and review summaries), guessing the unseen (seeking visual translation, aesthetic impressions, and nuanced colour descriptions such as 'melange gray'), and striving for the best deal (comparing prices across sellers). To manage the cognitive load of ephemeral voice output, users preferred tightly constrained interactions: the practical limit for simultaneously presented options was three products (lower than the five suggested by prior VUI research), and users wanted key information (name, price, rating) delivered first, with details on request. Participants struggled to formulate open-ended queries and valued proactive choice previews from the CUA (e.g., 'What jean styles can I choose from?'). Because voice interactions leave no visual trace, users repeatedly requested double-checks before adding items to the cart. Two risks emerged: users could not verify LMM outputs and feared hallucinations or misinterpreted product attributes, and delegating payment-stage decisions to the agent raised trust and privacy concerns. On coding reliability, Fleiss' kappa across three coders on the 12 participants' datasets was 0.865 (almost perfect agreement).
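
For readers unfamiliar with the reliability statistic quoted above, the standard definition of Fleiss' kappa is given below. This is the textbook formula, not a reconstruction of the authors' calculation: the summary reports three coders (n = 3) but not the number of coded items N or the number of categories k, so those stay symbolic. Here n_{ij} is the number of coders who assigned item i to category j.

```latex
p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij},
\qquad
P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^{2} - n \right)

\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i,
\qquad
\bar{P}_e = \sum_{j=1}^{k} p_j^{2}

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
```

A value of 0.865 falls in the 0.81 to 1.00 band that the commonly used Landis and Koch scale labels "almost perfect" agreement.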

Relevance

For accessibility practitioners, this study reframes AI agents from 'supplementary assistive add-ons' to primary interfaces that could guarantee an equal user experience rather than minimal WCAG compliance. The seven design implications (provide trend context, convey aesthetic feel, give fine-grained colour descriptions with combination guidance, cap delivered information at three items, summarise reviews, guide proactively, and double-check selections as statements, not questions) are directly actionable for teams building voice commerce or agent-mediated shopping. The paper also sharpens the conversation about hallucination and human-in-the-loop verification when users cannot see what the agent did. Several limitations apply: all participants had acquired (not congenital) visual impairment and prior assistive-tech familiarity, the study was conducted in Korean on one shopping platform, and actual payment was excluded, so findings about trust in high-stakes transactional steps remain speculative.
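
As an illustration only, three of these implications (cap options at three, lead with name, price, and rating, and confirm selections as statements) might be sketched as follows. Nothing here comes from the paper's system: the `Product` fields, the function names, the five-point rating scale, and the exact spoken wording (including the 'more' prompt) are all hypothetical.

```python
from dataclasses import dataclass

MAX_OPTIONS = 3  # practical limit reported in the study; everything else is assumed


@dataclass
class Product:
    # Hypothetical record; field names are illustrative, not from the paper.
    name: str
    price_krw: int
    rating: float  # assumed to be on a 5-point scale
    details: str = ""


def present_options(products, limit=MAX_OPTIONS):
    """Build a spoken summary: at most `limit` items, key facts first."""
    shown = products[:limit]
    lines = [
        f"Option {i}: {p.name}, {p.price_krw:,} won, rated {p.rating:.1f} out of 5."
        for i, p in enumerate(shown, start=1)
    ]
    remaining = len(products) - len(shown)
    if remaining > 0:
        # Hypothetical prompt: the rest is offered on request, not read out.
        lines.append(f"{remaining} more available. Say 'more' to hear them.")
    return " ".join(lines)


def describe(product):
    """Return the longer description only when the user asks for it."""
    return product.details or "No further details are available."


def confirm_selection(product):
    """Double-check phrased as a statement rather than a question."""
    return f"Adding {product.name}, {product.price_krw:,} won, to your cart."
```

Calling `present_options` with four products would read out the first three and mention that one more is available, while `confirm_selection` restates the item and price so the user can catch a mismatch before it reaches the cart.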

Tags: visual impairment · blindness · low vision · voice interface · conversational user interfaces · AI agents · large multimodal models · online shopping · e-commerce accessibility · human-AI interaction · wizard-of-oz · qualitative research

Standards referenced: WCAG