Real Time Object Scanning Using a Mobile Phone and Cloud-based Visual Search Engine

Yu Zhong, Pierre J. Garrigues, Jeffrey P. Bigham · 2013 · Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2013) · doi:10.1145/2513383.2513443

Summary

This paper presents Scan Search, an iPhone application that lets blind users identify everyday objects in real time by continuously scanning with the phone camera rather than taking individual photos. The core problem it addresses is that blind people struggle with the photo-snapping interface used by most assistive object identification apps: because they cannot see what they are photographing, they often capture poorly framed images that recognition engines fail to process. Scan Search addresses this by treating object identification as a continuous scanning task rather than a discrete photo-taking task. A key frame extraction algorithm based on Lucas-Kanade optical flow tracking automatically selects high-quality, information-rich frames from the live camera video stream. These frames are sent to IQ Engines, a cloud-based visual search engine whose database contains millions of images of packaged goods, logos, and print media. The algorithm rates frame quality by camera stability and the richness of visual features, segments the video stream into scenes, and extracts at most one key frame per scene to balance thoroughness against efficiency. The application gives real-time audio and visual feedback as matches are found and stores results in an accessible history table for later review.
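The scene segmentation and key frame selection steps can be sketched as follows. This is a simplified illustration of the idea, not the paper's algorithm. It assumes a tracker (for example a pyramidal Lucas-Kanade tracker such as OpenCV's `calcOpticalFlowPyrLK`) has already reduced each frame to two numbers: the mean feature displacement since the previous frame, as a fraction of frame width, and the count of trackable features. The names `FrameStats` and `select_key_frames` and the thresholds `stable_motion`, `scene_motion`, and `min_features` are hypothetical, and their defaults are illustrative rather than a claimed match for the paper's 2% movement and 10% initialization thresholds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameStats:
    """Per-frame measurements a tracker might report (hypothetical fields)."""
    index: int
    motion: float       # mean tracked-point displacement vs. previous frame, as a fraction of frame width
    num_features: int   # number of trackable corner features in the frame


def select_key_frames(frames: List[FrameStats],
                      stable_motion: float = 0.02,
                      scene_motion: float = 0.5,
                      min_features: int = 50) -> List[int]:
    """Return the indices of at most one key frame per scene.

    A new scene starts once the motion accumulated since the current scene
    began exceeds scene_motion. Within a scene, only stable frames
    (motion <= stable_motion) with at least min_features features are
    candidates, and the candidate with the most features is kept.
    """
    key_frames: List[int] = []
    accumulated = 0.0
    best: Optional[FrameStats] = None
    for f in frames:
        accumulated += f.motion
        if accumulated > scene_motion:
            # Scene boundary: emit the best frame of the scene that just ended;
            # the current frame becomes the first frame of the next scene.
            if best is not None:
                key_frames.append(best.index)
            accumulated = 0.0
            best = None
        stable = f.motion <= stable_motion
        rich = f.num_features >= min_features
        if stable and rich and (best is None or f.num_features > best.num_features):
            best = f
    # Flush the final, still-open scene.
    if best is not None:
        key_frames.append(best.index)
    return key_frames


frames = [
    FrameStats(0, 0.30, 80),    # camera sweeping: too much motion
    FrameStats(1, 0.01, 120),   # stable and feature-rich
    FrameStats(2, 0.015, 200),  # stable, richest frame of scene 1
    FrameStats(3, 0.40, 90),    # large move ends scene 1
    FrameStats(4, 0.005, 150),  # stable and feature-rich
    FrameStats(5, 0.01, 30),    # stable but too few features
]
print(select_key_frames(frames))  # [2, 4]
```

Because only the best frame per scene is emitted, holding the phone still over one product yields a single upload rather than a stream of near-duplicate frames, which is one way to trade thoroughness against bandwidth.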

Key findings

In a user study with 8 blind participants identifying food products, Scan Search achieved a 91.67% success rate compared with 62.5% for the standard photo-snapping interface, a statistically significant improvement. The average identification time was 73.2 seconds with Scan Search versus 126.4 seconds with the control, a 42% reduction that was not statistically significant because of high variance. Of the 11 failed trials, 9 occurred with the photo-snapping interface. The key frame extraction algorithm was tuned through experiments on both private and public image datasets, which found that a 2% movement threshold and a 10% initialization threshold gave the best balance between thoroughness and quality. Identification rates exceeded 70% on the private dataset but fell below 45% on the public dataset, reflecting the real-world difficulty of matching against large, uncontrolled image collections. The algorithm needed little bandwidth (under 50 KB/s) and ran efficiently on devices from the iPhone 3GS through the iPhone 5, extracting roughly one frame every 2 seconds. All participants preferred Scan Search, saying it was easier to use and reduced the frustration of repeatedly trying to frame objects correctly.
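The reported percentages can be cross-checked with a little arithmetic. The sketch below assumes that both conditions had the same number of trials; this is an inference from the figures above, not a number stated in this summary. With 9 photo-snapping failures at a 62.5% success rate, that gives 24 trials per condition, and the remaining 2 failures reproduce Scan Search's 91.67%.

```python
PHOTO_RATE = 0.625                  # reported photo-snapping success rate
PHOTO_FAILURES, TOTAL_FAILURES = 9, 11

# Assumption: both conditions had the same number of trials.
trials = round(PHOTO_FAILURES / (1 - PHOTO_RATE))   # 9 / 0.375 = 24
scan_failures = TOTAL_FAILURES - PHOTO_FAILURES     # 2

scan_rate = 100 * (trials - scan_failures) / trials
photo_rate = 100 * (trials - PHOTO_FAILURES) / trials
time_reduction = 100 * (126.4 - 73.2) / 126.4       # mean seconds: control vs. Scan Search

print(f"trials per condition: {trials}")             # 24
print(f"Scan Search success: {scan_rate:.2f}%")      # 91.67%
print(f"photo-snapping success: {photo_rate:.2f}%")  # 62.50%
print(f"time reduction: {time_reduction:.1f}%")      # 42.1%
print(f"success-rate ratio: {scan_rate / photo_rate:.2f}")  # 1.47
```

The success rate therefore rose by a factor of about 1.5, while the failure rate fell from 37.5% to about 8.3%, roughly a 4.5-fold drop.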

Relevance

Scan Search represents an important shift in how assistive technology approaches the blind photography problem, moving from discrete photo-taking to continuous scanning. This design philosophy anticipated the direction later taken by visual assistance tools such as Be My Eyes and Seeing AI, in which continuous camera input rather than single snapshots became the standard interaction model. The key frame extraction algorithm shows that lightweight pre-processing on the device can substantially improve the quality of images sent for recognition, a principle that remains relevant to any mobile assistive application. For accessibility practitioners, the research highlights that the camera interface itself is a critical accessibility barrier: even when recognition engines work well, blind users fail if they cannot capture useful input. The finding that continuous scanning raised the success rate from 62.5% to 91.67%, cutting the failure rate from 37.5% to about 8%, underscores the importance of designing input methods that accommodate the specific challenges blind users face when interacting with cameras.

Tags: visual accessibility · object recognition · blind users · mobile accessibility · computer vision · assistive technology