Crowd-AI Camera Sensing in the Real World

Anhong Guo, Anuraag Jain, Shomiron Ghose, Gierad Laput, Chris Harrison, Jeffrey P. Bigham · 2018 · Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies · doi:10.1145/3264921

Summary

This paper presents Zensors++, a hybrid crowd-AI camera sensing system. A user points a networked camera at a scene, poses a natural language question about it (such as "Is the coffee machine in use?" or "How many people are in the room?"), and receives continuous, real-time answers. The system combines crowdsourced human labeling with machine learning classifiers that progressively take over as training data accumulates. The research addresses a core limitation of pure computer vision systems: they require extensive pre-training and do not generalize well across diverse, uncontrolled environments. By having crowd workers from Amazon Mechanical Turk label the initial camera images, Zensors++ bootstraps a custom classifier for each user-defined question on the fly. The architecture includes several mechanisms for managing scale and cost: perceptual image hashing to avoid re-labeling redundant images (74.4% of images were hash hits), dynamic management of the HIT (Human Intelligence Task) queue to minimize crowd latency, majority voting with disagreement queues for quality control, and automatic face blurring for privacy. The researchers conducted two deployments: a 10-week discovery deployment with 13 users that generated 661,090 answers, followed by a 4-week evaluation deployment in which 17 participants created 63 question sensors across homes, offices, labs, cafes, parking lots, and classrooms, generating 937,228 answers at a cost of roughly 0.6 cents per answer.
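To make the hashing step concrete, here is a minimal sketch of how a perceptual hash can let a new frame reuse an earlier label instead of going back to the crowd. The summary does not say which hash the authors used, so this uses a simple average hash as a stand-in. The names `average_hash`, `hamming_distance`, and `LabelCache`, and the distance threshold of 5 bits, are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale pixels (0-255), row-major


def average_hash(image: Image, size: int = 8) -> int:
    """Return a size*size-bit average hash; image must be at least size x size."""
    height, width = len(image), len(image[0])
    cells: List[float] = []
    for r in range(size):
        rows = range(r * height // size, (r + 1) * height // size)
        for c in range(size):
            cols = range(c * width // size, (c + 1) * width // size)
            # Average the block of pixels that maps onto this grid cell.
            block = [image[y][x] for y in rows for x in cols]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if the cell is brighter than the overall mean.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


class LabelCache:
    """Hypothetical cache that reuses a stored answer for near-duplicate frames."""

    def __init__(self, max_distance: int = 5) -> None:
        self.max_distance = max_distance  # assumed threshold, not from the paper
        self.entries: List[Tuple[int, str]] = []

    def lookup(self, image: Image) -> Optional[str]:
        """Return a cached answer on a hash hit, or None if the crowd is needed."""
        h = average_hash(image)
        for stored, answer in self.entries:
            if hamming_distance(h, stored) <= self.max_distance:
                return answer
        return None

    def store(self, image: Image, answer: str) -> None:
        """Remember the answer that was produced for this frame."""
        self.entries.append((average_hash(image), answer))
```

In this sketch, a frame that misses the cache would be sent for labeling and then stored, so later frames of an unchanged scene become hash hits. A real deployment would likely need a sturdier hash and an indexed lookup rather than a linear scan.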

Key findings

Zensors++ achieved an average accuracy of 79.5% for yes/no question sensors (35% exceeded 90% accuracy, 56% exceeded 80%), with count questions averaging 0.2 units of error. The system delivered 1.6 million total answers across both deployments. Perceptual image hashing proved highly effective, matching 74.4% of incoming images to previously labeled ones at 99% accuracy and saving approximately $17,500 in crowd costs. Machine learning classifiers reached an average accuracy of 67% for yes/no questions after 4 weeks of training data, with 43% of classifiers exceeding crowd accuracy. The biggest source of sensing errors was neither the crowd workers nor the AI but the quality of user-defined questions: poor image cropping, ambiguous language, missing context, and insufficient camera resolution. The median crowd latency was 120 seconds for a first answer and 355 seconds for a consensus answer of record. Average daily cost was $2.41 for yes/no sensors and $4.50 for count sensors, with 60% of sensors costing under $2/day. Participants in professional roles (facility managers, lab managers) were willing to pay hundreds to thousands of dollars annually for monitoring capabilities, while personal users valued the system at $1-10/month.
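The gap between a first answer and a consensus answer of record follows from majority voting: the first worker's response arrives quickly, but it only becomes the answer of record once enough workers agree. A minimal sketch of that idea follows. The function names and the `required=2` agreement threshold are assumptions for illustration, not the paper's actual voting rule.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def consensus(
    votes: List[Hashable], required: int = 2
) -> Tuple[Optional[Hashable], Optional[Hashable]]:
    """Return (first_answer, answer_of_record) for the votes received so far.

    answer_of_record stays None until some label has `required` matching votes.
    The threshold is an assumed parameter, not a value taken from the paper.
    """
    if not votes:
        return None, None
    first_answer = votes[0]
    label, count = Counter(votes).most_common(1)[0]
    answer_of_record = label if count >= required else None
    return first_answer, answer_of_record


def needs_more_votes(votes: List[Hashable], required: int = 2) -> bool:
    """True while workers disagree or too few votes have arrived."""
    return consensus(votes, required)[1] is None
```

Under these assumptions, an image whose first two votes disagree stays unresolved and would be routed back for another vote, which is one way a disagreement queue could lengthen the time to consensus compared with the time to a first answer.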

Relevance

While not exclusively an accessibility paper, this research has significant implications for assistive technology. The Zensors++ approach builds on the same crowd-AI paradigm used in accessibility tools like VizWiz (visual question answering for blind users) and VizLens (accessible screen reader for interfaces). The ability to turn any camera into a customizable sensor through natural language questions has direct applications for people with disabilities: monitoring home environments, detecting events, and answering visual questions that would otherwise require sighted assistance. The system demonstrates that hybrid crowd-AI approaches can scale to real-world deployment with acceptable cost and accuracy, providing a template for future assistive sensing systems. The privacy considerations explored in the paper (face blurring, region-of-interest cropping, consent signage) are also relevant to any camera-based assistive technology deployed in shared spaces. For practitioners, the key insight is that crowd-AI systems can bridge the gap between limited computer vision and the diverse, contextual questions real users want to ask about their environments.

Tags: crowdsourcing · computer vision · human computation · machine learning · smart environments · IoT · camera sensing