ImageExplorer Deployment: Understanding Text-Based and Touch-Based Image Exploration in the Wild

Ruolin Xu, Yuxuan Cai, Shuying Hou, Yu-Jung Chang, Anhong Guo · 2024 · Proceedings of the 21st International Web for All Conference (W4A) · doi:10.1145/3677846.3677861

Summary

This paper presents the real-world deployment and evaluation of ImageExplorer, an iOS application that enables blind and low-vision users to explore images through two complementary modalities: text-based sequential exploration and touch-based spatial exploration. The app was deployed on the Apple App Store for 12 months, attracting 371 users who uploaded 651 images and conducted 694 explorations. ImageExplorer uses a pipeline of computer vision models including Mask R-CNN for instance segmentation, Google Cloud Vision and AWS Rekognition for object labeling, and DenseCap for generating natural language descriptions of image regions. The text-based mode presents detected objects as a sequential list navigable with VoiceOver gestures, while the touch-based mode allows users to directly touch the screen to discover objects spatially across multiple layers of detail.

The research addresses a critical gap in image accessibility: while alt text provides a single static description, many images contain rich spatial information that blind users cannot access. Previous evaluations of image exploration tools were limited to controlled lab studies with small participant pools. This deployment study provides the first large-scale understanding of how blind users interact with image exploration tools in everyday contexts, revealing usage patterns, preferences, and challenges that controlled studies cannot capture.
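To make the two modalities described above concrete, the following is a minimal Python sketch, not the app's actual implementation (ImageExplorer is an iOS app and its code is not shown in this summary). The `Region` structure, the `hit_test` and `text_order` helpers, the smallest-box tie-break, and the top-to-bottom reading order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """A detected object with a bounding box in screen coordinates (hypothetical structure)."""
    label: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float
    # Finer-grained parts that form the next layer of detail (assumed nesting).
    children: List["Region"] = field(default_factory=list)

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def hit_test(regions: List[Region], px: float, py: float) -> Optional[Region]:
    """Touch mode: return the smallest region in this layer under the touch point, or None."""
    hits = [r for r in regions if r.contains(px, py)]
    return min(hits, key=lambda r: r.width * r.height) if hits else None


def text_order(regions: List[Region]) -> List[str]:
    """Text mode: flatten one layer into a top-to-bottom, left-to-right list of labels."""
    return [r.label for r in sorted(regions, key=lambda r: (r.y, r.x))]


# Made-up example data: one person with two parts, plus a dog.
person = Region("person", 100, 50, 200, 400, children=[
    Region("face", 160, 60, 80, 80),
    Region("shirt", 120, 160, 160, 150),
])
dog = Region("dog", 320, 300, 150, 150)
layer1 = [dog, person]

first = hit_test(layer1, 180, 100)            # touch lands on the person
second = hit_test(first.children, 180, 100)   # same point, one layer deeper
print(first.label, second.label, text_order(layer1))
```

In this sketch the same detections feed both modes: the text mode only needs labels in a stable order, while the touch mode needs the geometry and the nesting between layers.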

Key findings

Person images were the most commonly uploaded category at 27.7%, followed by settings (18.6%) and documents (14.9%). Users showed clear modality preferences based on image type: touch exploration was preferred for person images, settings, and documents, where spatial layout matters, while text exploration was preferred for objects and animals, where identifying what is present matters more than where it is. Touch exploration patterns revealed that users consistently start from the center of the screen, then fan outward, slowing down significantly when they encounter objects of interest. The mean first-layer object discovery rate was 39.1%, with a general discovery rate of 30.3% across all layers, indicating that while users find some objects, many remain undiscovered. Caption accuracy emerged as a critical factor for user retention: users who stopped using the app experienced a 37.8% inaccuracy rate in captions, compared to 28.4% for users who remained active. The study also found that 57.9% of explorations used the text-based modality, suggesting that while touch exploration offers richer spatial understanding, the simplicity and familiarity of sequential text navigation remain appealing for many use cases.
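As a rough illustration of how a discovery rate like the ones above could be computed from interaction logs, here is a short Python sketch. It assumes the rate is the fraction of detected objects in an exploration that the user touched at least once, averaged over explorations; the paper's exact definition is not given in this summary, and the log format and object names below are invented.

```python
from statistics import mean
from typing import Dict, List, Set


def discovery_rate(available: Set[str], touched: Set[str]) -> float:
    """Fraction of available objects that were touched at least once (assumed definition)."""
    if not available:
        raise ValueError("exploration has no detected objects")
    # Intersect so that touches on unknown identifiers are ignored.
    return len(available & touched) / len(available)


def mean_discovery_rate(explorations: List[Dict[str, Set[str]]]) -> float:
    """Average the per-exploration rates, so each exploration counts equally."""
    return mean(discovery_rate(e["available"], e["touched"]) for e in explorations)


# Hypothetical logs for two explorations (not data from the study).
logs = [
    {"available": {"person", "dog", "tree", "car"}, "touched": {"person", "dog"}},
    {"available": {"table", "cup", "plate", "fork"}, "touched": {"cup"}},
]

rates = [discovery_rate(e["available"], e["touched"]) for e in logs]
print(rates, mean_discovery_rate(logs))  # [0.5, 0.25] 0.375
```

Averaging per exploration, as done here, weights an image with three objects the same as one with thirty; pooling all objects before dividing would give a different number, so the choice needs to be stated when reporting such a rate.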

Relevance

This research has direct implications for developers building image accessibility features. The finding that different image types benefit from different exploration modalities suggests that a one-size-fits-all approach to image descriptions is insufficient. Developers should consider offering multiple ways to access image content, particularly for complex images where spatial relationships matter. The association between caption accuracy and user retention suggests that AI-generated descriptions must meet a quality threshold to be useful: users who left the app had encountered noticeably more inaccurate captions, which is consistent with inaccurate descriptions eroding trust. The touch exploration patterns identified can inform the design of tactile interfaces and spatial audio descriptions. For organizations working toward WCAG conformance, this study indicates that even well-crafted alt text may not fully convey the information in spatially complex images, pointing toward richer, interactive alternatives as a next step in image accessibility.
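One way a developer might act on the modality preferences reported in the key findings is to pick a default exploration mode per image category while keeping both modes available. The Python sketch below is an illustration only: the category names, the `default_modality` helper, and the fallback to text for unknown categories (chosen because text accounted for the majority of explorations) are assumptions, not a design from the paper.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    TOUCH = "touch"


# Preferences as reported in the study's findings; the keys are hypothetical category names.
PREFERRED = {
    "person": Modality.TOUCH,
    "setting": Modality.TOUCH,
    "document": Modality.TOUCH,
    "object": Modality.TEXT,
    "animal": Modality.TEXT,
}


def default_modality(category: str) -> Modality:
    """Suggest a starting mode; the user should still be able to switch at any time."""
    # Fallback to text for unrecognized categories is an assumption of this sketch.
    return PREFERRED.get(category.strip().lower(), Modality.TEXT)


print(default_modality("Document").value)  # touch
print(default_modality("animal").value)    # text
print(default_modality("chart").value)     # text
```

The default is only a starting point: the deployment data describes aggregate preferences, so individual users may still want the other mode for any given image.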

Tags: image accessibility · screen readers · touch exploration · blind users · image description · mobile accessibility · computer vision · object detection

Standards referenced: WCAG 2.2