TouchScribe: Augmenting Non-Visual Hand-Object Interactions with Automated Live Visual Descriptions

Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Tiange Luo, Venkatesh Potluri, Anhong Guo · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791308

Summary

TouchScribe is a wearable, camera-based assistive system that delivers live, hierarchical visual descriptions of physical objects in response to a blind or low vision (BLV) user's hand-object interactions. The authors argue that existing AI assistants such as Seeing AI, Be My AI, and Aira rely on deliberate photo capture and turn-taking dialogue, an approach that struggles to identify the specific object of interest, generates verbose descriptions, and disrupts the natural flow of tactile exploration. TouchScribe instead treats the hands themselves as information cursors. A neck-mounted iPhone with a wide field-of-view camera streams video to a local server, where a fine-tuned MediaPipe-based gesture recognizer classifies four hand states (hold, touch, point, out of view) and a finger-motion model detects swipe-up gestures. Six gestures are supported, drawn from common sighted behaviors (hold, touch, hold side-by-side) and from discreet gestures preferred by BLV users (hold-and-point to read color, hold-and-swipe-up to read text). A keyframe extraction layer with temporal smoothing identifies stable gesture states and object changes, then dispatches cropped images to multiple models in parallel: Hands23 for hand-object contact, Moondream for low-latency brief object labels, and GPT-4o for detailed descriptions, comparisons, color labels, text reading, and free-form visual question answering. Feedback is delivered hierarchically: first a hand-state confirmation, then a brief object label, then richer details only as the user lingers.
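The temporal smoothing step can be illustrated with a minimal sketch. The summary does not specify how TouchScribe smooths per-frame predictions, so the following swaps in a generic sliding-window majority vote. The class name, the window length of 5 frames, and the 0.8 agreement threshold are all hypothetical choices made for illustration, not values from the paper.

```python
from collections import Counter, deque


class StableStateDetector:
    """Sliding-window majority vote over per-frame hand-state labels.

    Assumption: this is a generic smoothing scheme, not the paper's method.
    """

    def __init__(self, window=5, min_share=0.8):
        self.frames = deque(maxlen=window)  # most recent per-frame labels
        self.min_share = min_share          # agreement needed to call a state stable
        self.stable = None                  # last state reported as stable

    def update(self, label):
        """Feed one label; return the new stable state on a change, else None."""
        self.frames.append(label)
        if len(self.frames) < self.frames.maxlen:
            return None  # not enough history yet
        top, count = Counter(self.frames).most_common(1)[0]
        if count / len(self.frames) >= self.min_share and top != self.stable:
            self.stable = top
            return top
        return None


def stable_transitions(labels, window=5, min_share=0.8):
    """Return the sequence of stable states reached over a stream of labels."""
    detector = StableStateDetector(window, min_share)
    return [s for s in map(detector.update, labels) if s is not None]


if __name__ == "__main__":
    # A single misclassified "touch" frame inside a run of "hold" is ignored.
    stream = ["hold"] * 5 + ["touch"] + ["hold"] * 4 + ["touch"] * 6
    print(stable_transitions(stream))
```

In a design like this, only the returned transitions would trigger downstream work such as cropping a keyframe and requesting a description, so a single flickering frame does not interrupt the user with spurious feedback. A larger window trades responsiveness for stability.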

Key findings

In a lab study with eight participants (3 male, 5 female, ages 18-72, six fully blind and two with low vision), participants completed 27 of 32 object-understanding tasks (84%) across cup exploration, distinguishing similar spice bottles, sorting four spray bottles by brand and scent, and locating snacks by nutritional information. Likert ratings were positive: coverage M=6.5 (SD=1.07), effectiveness M=6, accuracy M=5.5, intuitiveness M=5.63, usefulness M=5.5, and agency M=5.13. In the technical evaluation, the gesture recognizer achieved an overall F1 of 0.77, with 'touch' best (F1=0.84) and 'point' weakest (F1=0.44, often confused with 'out of view'). VLM accuracy was strong for object labels (Moondream 91.59%), detailed descriptions (GPT-4o 93.27%), and comparisons (91.43%), but text reading on curved bottle surfaces dropped to 67.83%. End-to-end latency ranged from 0.09s for color labels and 0.56s for hand-state feedback to 14s for comparative descriptions. Participants reported moderate cognitive load (NASA-TLX mental demand M=3.19) and a noticeable learning curve, and they rated the hold-and-swipe-up gesture less intuitive than hold-and-point.

Relevance

For teams building or evaluating camera-based AI assistants for BLV users, this paper directly confronts a usability gap in tools like Seeing AI, Be My Eyes, and Aira: their reliance on deliberate photo capture and chat-style turn-taking breaks the rhythm of tactile exploration. By using hand gestures as natural intent cues, TouchScribe demonstrates a more responsive interaction model that practitioners should consider when designing wearable assistive technology, smart-glasses experiences, or in-store shopping aids. The work also surfaces hard limitations worth flagging: hand occlusion, motion blur, social acceptability of neck-mounted cameras, the labor of learning a six-gesture vocabulary, and weak text recognition on curved surfaces. Future deployments should pair gesture recognition with complementary sensors (IMUs, contact microphones), offer customization for individual gesture preferences, and consider gesture-driven proxies for objects beyond physical reach.

Tags: blind and low vision · assistive technology · visual descriptions · hand-object interactions · gestures · large language models · vision-language models · egocentric vision · tactile exploration · wearable cameras