Tradeoffs in the Efficient Detection of Sign Language Content in Video Sharing Sites

Caio D. D. Monteiro, Frank M. Shipman, Satyakiran Duggina, Ricardo Gutierrez-Osuna · 2019 · ACM Transactions on Accessible Computing (TACCESS) · doi:10.1145/3325863

Summary

This paper addresses the problem of finding sign language (SL) content on video sharing sites like YouTube, where such videos serve as de facto digital libraries of Deaf community knowledge, experiences, and culture. Metadata-based search is inadequate: prior work showed only 43% precision when searching for sign language videos by text queries, because metadata is inconsistent, incomplete, or misleading. The authors develop and optimize a visual content-based approach that detects sign language in videos by analysing motion patterns around detected faces using Polar Motion Profiles (PMPs), a translation- and scale-invariant feature that captures the spatial distribution of hand and arm activity relative to the signer's face. The core challenge is computational scalability: the original high-accuracy approach required 896 seconds of processing per minute of video using an ensemble of five face detectors, making it impractical for YouTube-scale deployment. The paper systematically evaluates three optimizations: replacing the ensemble face detector with a single detector (reducing face detection time by 5-10x), sub-sampling frames rather than processing every frame (performance remains stable when sampling as sparsely as every 20th frame), and shortening the video segment analysed. A recommended configuration combining these optimizations reduced computation time by 96% (from 896 s to 31 s per minute of video) while losing only 0.01 in F1 score (from 0.78 to 0.77). Beyond continuous video processing, the authors also develop a keyframe-based approach that analyses only a small set of sampled frames, achieving comparable recall but lower precision.
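
The idea of a face-relative motion feature described above can be illustrated with a minimal sketch. It is not the authors' PMP implementation: the function name, the bin counts, and the `max_radius` cutoff are hypothetical choices made for illustration. It only shows how centring on the face gives translation invariance and how measuring distance in face widths gives scale invariance.

```python
import math


def polar_motion_histogram(motion_points, face_center, face_width,
                           n_angle_bins=8, n_radius_bins=3, max_radius=4.0):
    """Bin motion pixels by angle and distance relative to a detected face.

    Hypothetical simplification: the bin counts and max_radius (in face
    widths) are illustrative, not values taken from the paper.
    """
    cx, cy = face_center
    hist = [[0.0] * n_angle_bins for _ in range(n_radius_bins)]
    kept = 0
    for x, y in motion_points:
        # Translation invariance: positions are measured from the face center.
        dx, dy = x - cx, y - cy
        # Scale invariance: distance is expressed in units of face width.
        r = math.hypot(dx, dy) / face_width
        if r >= max_radius:
            continue  # ignore motion far away from the face
        theta = math.atan2(dy, dx) % (2 * math.pi)
        a = min(int(theta / (2 * math.pi) * n_angle_bins), n_angle_bins - 1)
        b = min(int(r / max_radius * n_radius_bins), n_radius_bins - 1)
        hist[b][a] += 1.0
        kept += 1
    if kept:
        # Normalize so the profile does not depend on how many pixels moved.
        hist = [[v / kept for v in row] for row in hist]
    return hist


# The same motion pattern, shifted and magnified 2x, yields the same profile.
near = polar_motion_histogram([(110, 100), (100, 130), (80, 100)], (100, 100), 20)
far = polar_motion_histogram([(270, 230), (250, 290), (210, 230)], (250, 230), 40)
print(near == far)
```

Frame sub-sampling would then amount to computing such a profile only on every k-th frame (for example `frames[::20]`), which is where much of the reported time saving comes from.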

Key findings

The recommended optimized approach achieved 83% precision, 71% recall, and 0.77 F1 score, nearly matching the original ensemble's 85% precision, 71% recall, and 0.78 F1, while reducing computation time by 96%. A three-stage cascading classifier was then designed to further improve efficiency: stage 1 uses the fast keyframe approach combined with metadata to quickly classify easy cases, stage 2 applies frame-sampling with metadata for intermediate cases, and stage 3 applies full video analysis for the most ambiguous videos. The multimodal classifier combining video features and metadata outperformed either modality alone, achieving 0.80 F1. The cascading approach at a 20% decision threshold reduced average computation time by approximately half compared to the single-stage recommended approach, with only a 2% drop in F1 score. However, the accompanying drop in recall of approximately 7% is concerning because missed sign language videos represent lost content for the Deaf community. Symmetry of motion proved to be the most discriminative feature for distinguishing signing from other gesturing, as sign language produces characteristically symmetric hand and arm movements around the face that other forms of human gesture typically do not.
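
The cascade logic can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the stage costs are placeholders, the scoring functions are stand-ins for trained classifiers, and reading the "20% decision threshold" as a confidence margin around 0 and 1 is an interpretation of this sketch rather than something the summary confirms. A small helper also shows how F1 follows from the reported precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def cascade_classify(video, stages, margin=0.2):
    """Run stages from cheapest to most expensive until one is confident.

    stages: list of (name, cost_seconds, score_fn), where score_fn returns
    an estimated probability that the video contains sign language.
    margin: assumed meaning of the decision threshold; a stage decides only
    if its score is within `margin` of 0 or 1, otherwise it defers.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    total_cost = 0.0
    for i, (name, cost, score_fn) in enumerate(stages):
        total_cost += cost
        p = score_fn(video)
        is_last = i == len(stages) - 1
        confident = p <= margin or p >= 1.0 - margin
        if confident or is_last:  # the final stage always decides
            return {"is_sign_language": p >= 0.5,
                    "stage": name,
                    "cost": total_cost}


# Hypothetical stages; costs and scores are placeholders, not measurements.
stages = [
    ("keyframes + metadata", 1.0, lambda v: v["s1"]),
    ("sampled frames + metadata", 5.0, lambda v: v["s2"]),
    ("full video", 30.0, lambda v: v["s3"]),
]

easy = {"s1": 0.95, "s2": 0.50, "s3": 0.50}
hard = {"s1": 0.50, "s2": 0.60, "s3": 0.40}
print(cascade_classify(easy, stages))  # decided at the first, cheapest stage
print(cascade_classify(hard, stages))  # falls through to full video analysis
print(round(f1_score(0.83, 0.71), 2))  # 0.77, matching the reported F1
```

The design trade-off is visible in `margin`: a wider margin lets the cheap stages decide more often and lowers the average cost, but it also lets more borderline videos be misclassified early, which is consistent with the recall loss noted above.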

Relevance

This research addresses a significant information access barrier for the Deaf community: the inability to efficiently find sign language content on platforms that have become primary repositories of community knowledge and cultural expression. For Deaf users who rely on sign language as their primary language, text-based search is doubly inadequate — it fails to find relevant SL videos and it requires proficiency in a written language that may not be their first language. The scalable detection approach could enable video platforms to automatically tag sign language content, making it discoverable and enabling features like SL-specific browsing categories or recommendation algorithms. For platform developers, the cascading classifier architecture provides a practical model for deploying computationally expensive accessibility features at scale by applying progressively more detailed analysis only where needed. The work also lays groundwork for subsequent sign language identification — determining which sign language is used — which would further improve content discovery for specific signing communities worldwide.

Tags: sign language · video analysis · content detection · deaf and hard of hearing · computer vision · information retrieval · digital library · ASL · cascading classifier