Speed-Accuracy Tradeoffs for Detecting Sign Language Content in Video Sharing Sites

Frank M. Shipman, Satyakiran Duggina, Caio D.D. Monteiro, Ricardo Gutierrez-Osuna · 2017 · Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS) · doi:10.1145/3132525.3132559

Summary

This paper addresses the problem of automatically detecting sign language (SL) content in videos on sharing platforms such as YouTube and Vimeo. For many deaf and hard-of-hearing people, sign language is the primary means of communication, and they rely on online video content to stay informed. However, sign language videos are difficult to find because search on video sharing sites depends on text metadata, which is applied inconsistently: studies have found only 43% precision when searching by metadata for sign language videos on specific topics. As a result, many in the sign language community simply do not search for content.

The researchers, at Texas A&M University, had previously developed a technique that uses Polar Motion Profiles (PMPs) to detect sign language in video from visual features alone, without relying on metadata. PMPs model the quantity of motion relative to detected faces in polar coordinates, capturing the characteristic hand and arm movements of signing. The approach uses face detection (an ensemble of five OpenCV Haar-feature detectors) and background subtraction (an adaptive Gaussian mixture model) to identify foreground activity, then computes a 460-element feature vector, which is reduced to six features via PCA and classified with an SVM.

While effective, this approach was computationally expensive: with the ensemble, face detection alone took over 896 seconds for one minute of video. This paper evaluates three optimization strategies to reduce computation: (1) replacing the five-detector ensemble with a single face detector, (2) sub-sampling frames for face detection instead of processing every frame, and (3) analyzing shorter video segments. Additionally, the paper explores a keyframe-based approach that eliminates the need for background modelling entirely.
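To make the polar-coordinate idea behind PMPs concrete, the following is a minimal sketch and not the authors' implementation. It assumes the foreground pixels of a single frame have already been found by background subtraction and that a face detector has returned a face center and width. The function name `polar_motion_histogram`, the bin counts (8 angular, 4 radial), and the cut-off radius are illustrative assumptions; they do not reproduce the paper's 460-element feature layout, and how the paper aggregates such measurements over a video is not described in this summary.

```python
import math


def polar_motion_histogram(foreground, face_center, face_width,
                           n_angle_bins=8, n_radius_bins=4, max_radius=4.0):
    """Fraction of foreground pixels per (angle, radius) bin around a face.

    Distances are measured in units of face width so the result is roughly
    independent of how large the signer appears in the frame.
    Bin counts and max_radius are illustrative, not values from the paper.
    """
    hist = [[0] * n_radius_bins for _ in range(n_angle_bins)]
    cx, cy = face_center
    for x, y in foreground:
        dx, dy = x - cx, y - cy
        radius = math.hypot(dx, dy) / face_width
        if radius >= max_radius:
            continue  # ignore activity far from the detected face
        # atan2 returns (-pi, pi]; shift into [0, 2*pi) before binning.
        # In image coordinates y grows downward, so angles increase
        # clockwise on screen.
        angle = math.atan2(dy, dx) % (2 * math.pi)
        a_bin = min(int(angle / (2 * math.pi) * n_angle_bins), n_angle_bins - 1)
        r_bin = min(int(radius / max_radius * n_radius_bins), n_radius_bins - 1)
        hist[a_bin][r_bin] += 1
    total = sum(map(sum, hist))
    # Normalize so frames with different amounts of foreground are comparable.
    return [[count / total if total else 0.0 for count in row] for row in hist]


# Toy frame: three foreground pixels near the face, one far away (ignored).
pixels = [(110, 105), (95, 110), (95, 90), (200, 200)]
profile = polar_motion_histogram(pixels, face_center=(100, 100), face_width=10)
```

Normalizing distance by face width is one plausible way to make the feature scale-invariant; whether the original work normalizes in exactly this way is an assumption of the sketch.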

Key findings

The recommended optimized configuration, which uses the alt2 frontal face detector and runs face detection on every 20th frame, achieved a 96% reduction in computation time (from 896 seconds to 31 seconds per minute of video) while losing only one percentage point of F1 score (77% vs. 78%) and maintaining identical recall (71%). Precision dropped marginally, from 85% to 83%. Given sufficient training data, the single alt2 detector performed comparably to the ensemble, and running face detection on only every 20th frame had minimal impact because signers' faces and bodies tend to stay relatively stationary in video sharing site content. The keyframe-based approach, which analyzes only 10 frames from a 60-second video instead of all 1,800, achieved comparable recall (74% vs. 71%) but substantially lower precision (69% vs. 85%), yielding an F1 of 71%. Although it is less accurate overall, its low computational cost makes it suitable as the first-stage filter in a staged classification pipeline, quickly eliminating clearly non-sign-language (non-SL) videos before more expensive analysis. Shortening the analyzed segment below 60 seconds degraded performance more than the first two optimizations did, and 60-second segments consistently outperformed 30-second ones for the keyframe approach. The current system cannot handle edited videos with changing backgrounds or videos that mix SL and non-SL segments.
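The figures above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the time reduction, the number of face-detector calls per minute implied by the every-20th-frame setting, and F1 from the quoted precision and recall. The configuration labels and helper names are chosen here for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_reduction(before, after):
    """Fraction of the original cost that was eliminated."""
    return (before - after) / before


# Figures quoted in the summary above.
baseline_seconds, optimized_seconds = 896, 31
frames_per_minute = 1800      # implies 30 frames per second
face_detection_stride = 20    # detector runs on every 20th frame

time_saved = relative_reduction(baseline_seconds, optimized_seconds)
detector_calls = frames_per_minute // face_detection_stride

# (precision, recall) pairs; the labels are illustrative names.
configs = {
    "ensemble, every frame": (0.85, 0.71),
    "alt2, every 20th frame": (0.83, 0.71),
    "keyframes only": (0.69, 0.74),
}
recomputed_f1 = {name: f1_score(p, r) for name, (p, r) in configs.items()}

print(f"time saved: {time_saved:.1%}, detector calls per minute: {detector_calls}")
for name, value in recomputed_f1.items():
    print(f"{name}: F1 = {value:.3f}")
```

The time saving works out to roughly 96.5%, and the detector runs on 90 frames per minute instead of 1,800. The recomputed F1 values land within about one point of the reported ones; small gaps are expected because the inputs here are already rounded, and the reported scores may have been averaged differently (an assumption, since the summary does not say).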

Relevance

This work addresses a fundamental information access barrier for the deaf community: the inability to efficiently find sign language content among the billions of videos online. For accessibility practitioners, the key insight is that metadata-based approaches to content accessibility are inherently unreliable — automatic content analysis is needed to supplement user-generated tags and descriptions. The 96% reduction in computation time with minimal accuracy loss demonstrates that practical, scalable sign language detection is achievable with relatively straightforward optimizations. The staged classifier concept — a fast, high-recall filter followed by more expensive precise analysis — offers a model applicable to many large-scale accessibility classification problems. However, the overall F1 scores (71-78%) indicate that fully reliable automatic detection remains a challenge, particularly given the diverse quality, backgrounds, and editing styles found on video sharing sites. As video platforms grow, integrating such detection systems could dramatically improve content discovery for deaf users who currently rely on social sharing rather than search.
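The staged classifier concept can be sketched in a few lines. This is an illustration of the general pattern, not the authors' system: `cheap_score` and `expensive_score` are hypothetical callables standing in for a keyframe-style filter and a full PMP-style classifier, and both thresholds and all toy scores are made-up placeholders.

```python
def staged_classify(videos, cheap_score, expensive_score,
                    cheap_threshold=0.3, final_threshold=0.5):
    """Two-stage filter: run the costly model only on survivors of the cheap one.

    Both scoring callables are hypothetical and return a value in [0, 1].
    A low cheap_threshold favors first-stage recall, at the cost of passing
    more non-SL videos on to the expensive second stage.
    """
    accepted, expensive_calls = [], 0
    for video in videos:
        if cheap_score(video) < cheap_threshold:
            continue  # rejected by the fast filter; skip expensive analysis
        expensive_calls += 1
        if expensive_score(video) >= final_threshold:
            accepted.append(video)
    return accepted, expensive_calls


# Toy data with made-up scores for each stage.
videos = [
    {"id": "a", "cheap": 0.9, "full": 0.8},
    {"id": "b", "cheap": 0.1, "full": 0.7},
    {"id": "c", "cheap": 0.4, "full": 0.2},
]
accepted, calls = staged_classify(
    videos,
    cheap_score=lambda v: v["cheap"],
    expensive_score=lambda v: v["full"],
)
```

In this toy run the expensive stage is called for two of the three videos and only video "a" is accepted. Video "b" would have passed the second stage but is dropped by the first, which illustrates why recall of the fast filter bounds recall of the whole pipeline.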

Tags: sign language · computer vision · video classification · information retrieval · deaf and hard of hearing · machine learning · video accessibility · content detection