ReCog: Supporting Blind People in Recognizing Personal Objects
Dragan Ahmetovic, Daisuke Sato, Uran Oh, Tatsuya Ishihara, Kris Kitani, Chieko Asakawa · 2020 · Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems · doi:10.1145/3313831.3376143
Summary
ReCog is a smartphone application designed to help blind users recognize their own personal objects: items such as specific clothing, handmade goods, medicines, or family photos that general-purpose recognizers such as Seeing AI or TapTapSee cannot identify. The authors argue that while pre-trained deep networks can identify common categories (a bottle, a cup), they fail on personal or intra-class distinctions (which cup? which flavor of k-cup?), and that crowdsourcing alternatives like Be My Eyes introduce latency, privacy concerns, and dependence on human workers. ReCog addresses this by letting users train a personalized deep convolutional neural network (built on Inception-v3 via transfer learning) using photos of their own objects, uploaded from an iPhone app to a backend recognition server. A central contribution is the camera-aiming guidance module: using ARKit's SLAM-based positional tracking, the system computes two novel object-framing metrics (a proximity score and a center-offset score), then steers the user toward correctly framed shots through combined sonification (pulsating tones that resolve when the target is centered) and verbal cues ('left', 'right', 'closer', 'farther'). The paper reports a two-session study with 10 blind participants, combining a controlled in-lab training/recognition protocol with a three-day in-the-wild deployment that evaluated behavior, preferences, and recognition accuracy after prolonged autonomous use.
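The guidance logic described in the summary can be illustrated with a short sketch. This is not the authors' implementation: the summary does not give the formulas behind the two framing metrics, so the definitions below (center offset as the normalized distance of the object's projected center from the image center, proximity as the ratio of the object's apparent size to a target size), the thresholds, the 'up'/'down' cues, the pulse mapping, and every name are assumptions chosen only to show how such scores could drive verbal cues and sonification.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    """Hypothetical per-frame input, e.g. derived from positional tracking."""
    cx: float      # object center x in normalized image coords (0 = left, 1 = right)
    cy: float      # object center y in normalized image coords (0 = top, 1 = bottom)
    extent: float  # fraction of the shorter image side spanned by the object


def center_offset(frame: Frame) -> float:
    """Assumed metric: 0.0 when centered, 1.0 when the center sits on a corner."""
    dx, dy = frame.cx - 0.5, frame.cy - 0.5
    return math.hypot(dx, dy) / math.hypot(0.5, 0.5)


def proximity(frame: Frame, target_extent: float = 0.6) -> float:
    """Assumed metric: below 1.0 the object looks too small, above 1.0 too large."""
    return frame.extent / target_extent


def verbal_cue(frame: Frame, offset_tol: float = 0.2,
               prox_tol: float = 0.25) -> Optional[str]:
    """Return one spoken cue, or None when the shot is acceptably framed."""
    dx, dy = frame.cx - 0.5, frame.cy - 0.5
    if center_offset(frame) > offset_tol:
        # Correct the larger of the two axis errors first.
        if abs(dx) >= abs(dy):
            return "right" if dx > 0 else "left"
        return "down" if dy > 0 else "up"  # vertical cues are an assumption
    p = proximity(frame)
    if p < 1 - prox_tol:
        return "closer"
    if p > 1 + prox_tol:
        return "farther"
    return None


def pulse_interval(frame: Frame, max_interval: float = 1.0,
                   offset_tol: float = 0.2) -> float:
    """Assumed sonification mapping: seconds between pulses; 0.0 means the
    pulsing resolves into a steady tone because the target is centered."""
    off = center_offset(frame)
    if off <= offset_tol:
        return 0.0
    return max_interval * min(off, 1.0)
```

In this sketch centering takes priority over distance, so a user would first hear directional cues and faster pulses, and only receive 'closer' or 'farther' once the pulsing has resolved. Whether ReCog orders its cues this way is not stated in the summary.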
Key findings
Photos captured with camera-aiming guidance were significantly better framed than those captured without it: 64.2% were better centered (vs 18.8% without, p<.001), and they received higher scaling scores (M=65.3% vs 22.9%, p<.01). Recognition accuracy for the fully trained model was significantly higher with guidance (M=0.94) than without (M=0.83, p<.05), though the benefit disappeared for the faster 'quick' training mode. Capturing multiple testing photos (up to five) did not improve recognition accuracy, since consecutively captured photos tended to be highly similar; the authors recommend dropping this feature. Over three days of autonomous use, participants shifted their preferences: while guidance was still unanimously seen as essential for training, the share who preferred guidance-free recognition for expert users rose from 30% to 75% (p<.05), and self-reported confidence in using the app without guidance rose significantly (3.9 to 5.14, p<.01). Participants also reported learning transferable photography skills ('I sort of knew how to center it'), suggesting the guidance serves as a teaching tool as well as a real-time aid. System Usability Scale scores were 81.9 (session 1) and 72.2 (session 2), both in the 'good' range.
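For readers unfamiliar with how System Usability Scale figures such as those above are derived, the standard scoring rule can be sketched as follows. The questionnaire has ten items answered on a 1 to 5 scale; odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a score from 0 to 100. The function name is hypothetical and the responses in the usage note are made up for illustration, not study data.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's ten SUS responses (each 1..5) on a 0..100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for item, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("each response must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5
```

For example, a participant who answers 3 to every item scores 50.0, and the per-session figure reported in a study is typically the mean of such individual scores.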
Relevance
ReCog illustrates a growing shift in AI tools for blind users from 'recognize-anything' generic models toward personalized recognizers that acknowledge users' real domestic environments, where objects are often handmade, culturally specific, or visually near-identical to the items around them (two k-cup flavors, two remote controls). For accessibility practitioners, the more durable contribution may be the camera-aiming guidance pattern itself: combining sonification with verbal cues, anchored by AR-based positional tracking, produced measurably better photos and taught users new skills. Limitations include reliance on sighted assistance for labelling unfamiliar objects, the cost of training a model per user, and the lack of evaluation with low-vision (rather than totally blind) users at scale. The study also predates the current generation of multimodal AI assistants, which changes the competitive baseline; nevertheless, the core insight that well-framed input matters more than model sophistication remains relevant to any vision-based assistive app.
Tags: visual impairment · blindness · object recognition · computer vision · deep learning · sonification · camera-aiming guidance · mobile accessibility · assistive technology · personal object recognizer · transfer learning · augmented reality · photography