Revisiting Blind Photography in the Context of Teachable Object Recognizers

Kyungjun Lee, Jonggi Hong, Simone Pimento, Ebrima Jarjue, Hernisa Kacorri · 2019 · Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS) · doi:10.1145/3308561.3353799

Summary

This paper introduces a real-time audio-haptic feedback system that helps people with visual impairments frame objects in their smartphone camera when training teachable object recognizers. Teachable recognizers let users train personalized models to identify their own objects, but they require high-quality training photos, and blind users cannot see whether the object of interest is properly framed. Current object recognition apps provide no feedback on photo quality. The system uses a two-stage deep learning pipeline. First, a hand segmentation model (FCN-8s trained on four egocentric hand datasets totaling 9,241 examples) identifies the user's hand in the camera frame. Then an object localization model (fine-tuned from the hand segmentation model on 6,239 object center annotations) estimates the center location of the object based on its proximity to the detected hand. The key insight is that blind users naturally hold or touch objects they want to photograph, making hand position a reliable proxy for object location. The estimated object center is mapped to a 3x3 grid overlaid on the camera frame, which drives combined audio-haptic feedback: stereophonic sinusoidal waves at 200 Hz, panned left, middle, or right, indicate horizontal position; a higher frequency (500 Hz) indicates a well-centered object; and vibration reinforces center detection. Silence indicates that the object is not detected. Nine legally blind participants (ages 29-69, all female) used the system to train recognizers for 15 commercial products in both a plain (vanilla) and a cluttered (wild) environment.
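The grid-to-feedback mapping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Feedback`, `grid_cell`, and `feedback_for` are hypothetical, pixel coordinates are assumed to start at the top-left of the frame, and "well-centered" is assumed to mean the middle cell of the 3x3 grid. Only the frequencies, the panning, the vibration, and the silence rule come from the summary.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Feedback:
    frequency_hz: Optional[int]  # None means silence (object not detected)
    pan: Optional[str]           # "left", "middle", "right", or None
    vibrate: bool


def grid_cell(center: Tuple[float, float], width: int, height: int) -> Tuple[int, int]:
    """Map a pixel coordinate to a (row, col) cell of a 3x3 grid."""
    x, y = center
    # Clamp to 0..2 so points on or just past the frame edge stay in the grid.
    col = max(0, min(int(3 * x / width), 2))
    row = max(0, min(int(3 * y / height), 2))
    return row, col


def feedback_for(center: Optional[Tuple[float, float]], width: int, height: int) -> Feedback:
    """Choose audio-haptic feedback for an estimated object center."""
    if center is None:
        # No object detected: stay silent.
        return Feedback(None, None, False)
    row, col = grid_cell(center, width, height)
    pan = ("left", "middle", "right")[col]
    if (row, col) == (1, 1):
        # Assumption: the middle cell counts as "well-centered".
        return Feedback(500, pan, True)
    # Otherwise a 200 Hz tone panned by horizontal position.
    return Feedback(200, pan, False)


print(feedback_for((320, 240), 640, 480))
print(feedback_for((10, 240), 640, 480))
print(feedback_for(None, 640, 480))
```

In this sketch the vertical position only matters for deciding whether the object is in the middle cell, which matches the summary's statement that panning conveys horizontal position.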

Key findings

The feedback was highly effective at ensuring objects were included in photos: only 2% of photos in the vanilla environment and 8% in the cluttered environment missed the object entirely, even for participants who had never taken a photo before. Each participant took 375 training photos and 150 test photos across the two environments. Recognition performance was promising, with participants achieving around 50% accuracy on a 15-way classification task in the vanilla environment (compared to about 7% for random chance), though performance varied significantly with photography experience. Cluster analysis of interaction streams revealed three distinct patterns: C1 (no feedback before the photo, because objects were out of frame or undetected), C2 (center feedback before the photo, because participants waited for well-centered feedback), and C3 (some feedback, with objects detected but not centered). Critically, photos taken in the C2 pattern, where participants waited for center feedback, had the highest ratio of fully included objects. A negative correlation between age and model performance (r=-0.74, p<0.05) appeared to be driven by photography experience rather than age itself. Participants tended to trust the feedback even when aware it could be wrong, saying things like "I just had to trust the feedback I had at the last moment." Several participants who had never taken photos found the task feasible and expressed willingness to train their own recognizers. The cluttered environment degraded both feedback accuracy and recognition performance, highlighting the challenge of real-world deployment.
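To make the three interaction patterns concrete, the sketch below labels a feedback stream with simple rules. This is a rule-based simplification and not the cluster analysis the paper performed. The event names ("none", "detected", "center"), the function `label_stream`, and the rule that C2 requires center feedback as the last event before the photo are all assumptions made for illustration.

```python
from typing import List


def label_stream(events: List[str]) -> str:
    """Label the feedback events that preceded one photo as C1, C2, or C3.

    Each event is assumed to be one of:
      "none"     - silence (object out of frame or undetected)
      "detected" - object detected but not centered
      "center"   - well-centered feedback
    """
    if all(event == "none" for event in events):
        # Also covers an empty stream: no feedback before the photo.
        return "C1"
    if events[-1] == "center":
        # Assumption: the user waited for center feedback before shooting.
        return "C2"
    # Some feedback was heard, but the photo was not taken on center feedback.
    return "C3"


print(label_stream(["none", "none"]))
print(label_stream(["none", "detected", "center"]))
print(label_stream(["center", "detected"]))
```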

Relevance

This research addresses a critical bottleneck in making AI-powered object recognition truly accessible to blind users: the training data quality problem. While teachable recognizers offer independence from pre-trained models and remote sighted help, they are only as good as the photos used to train them. Without camera guidance, blind users produce photos that are blurred, poorly framed, or missing the target object entirely, and these poor-quality training images directly degrade recognition accuracy. The hand-proximity approach to object localization is particularly clever because it exploits a natural behavior (holding or touching the object) rather than requiring the user to learn a new interaction technique. For accessibility practitioners, this work highlights several important design considerations: feedback should leverage existing user behaviors; imperfect feedback is still valuable (participants benefited even though the localization model had a 39% error rate); users trust automated feedback even when they know it is fallible, which creates both opportunities and responsibilities; and the gap between controlled and cluttered environments shows that lab results may overestimate real-world performance. The finding that participants with no prior camera experience could still produce usable training data suggests that the barriers to blind photography are primarily about feedback, not capability.

Tags: blind photography · teachable object recognizer · computer vision · deep learning · visual impairment · sonification · haptic feedback · camera guidance · object recognition · machine learning · assistive technology