Spatial and Temporal Pyramids for Grammatical Expression Recognition of American Sign Language

Nicholas Michael, Dimitris Metaxas, Carol Neidle · 2009 · Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '09) · doi:10.1145/1639642.1639657

Summary

This paper presents a computer vision framework for recognizing grammatical facial expressions and head gestures in American Sign Language (ASL) video. Most sign language recognition research has focused on the manual components (hand shapes and movements). This work instead addresses the non-manual component: the facial expressions and head gestures that carry essential grammatical information in ASL. For example, the sign sequence JOHN BUY HOUSE could mean "John bought the house," "John did not buy the house," "Did John buy the house?" or other variations, depending entirely on the accompanying facial expressions and head movements.

The framework uses an Active Shape Model (ASM) face tracker to localize 79 facial landmarks and estimate 3D head pose in real time from a single uncalibrated camera. It then extracts SIFT (Scale Invariant Feature Transform) descriptors from the eye and eyebrow regions to characterize facial appearance, and builds spatial pyramids to model these features at multiple resolutions. For temporal information, the system tracks changes in head pose (particularly the derivative of the yaw angle) and builds temporal pyramids to capture the head-shaking pattern characteristic of negative expressions. The system was tested on 42 videos from the Boston University American Sign Language Linguistic Research Project (ASLLRP) dataset, which contains spontaneous and elicited utterances from multiple native ASL signers recorded with synchronized stereoscopic cameras.
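The temporal pyramid idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bin edges, the number of levels, and the toy yaw sequence are all assumptions made for illustration, and the summary does not say how the paper quantizes or weights its features. The sketch takes the frame-to-frame yaw derivative, then concatenates histograms computed over 1, 2, and 4 equal temporal segments.

```python
from typing import List, Sequence


def yaw_derivative(yaw: Sequence[float]) -> List[float]:
    """Frame-to-frame change in yaw angle (first difference)."""
    return [b - a for a, b in zip(yaw, yaw[1:])]


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per half-open bin [edges[i], edges[i+1]).

    Values outside the outermost edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def temporal_pyramid(values: Sequence[float],
                     edges: Sequence[float],
                     levels: int = 3) -> List[int]:
    """Concatenate histograms over 1, 2, 4, ... equal temporal segments."""
    feature: List[int] = []
    n = len(values)
    for level in range(levels):
        segments = 2 ** level
        for s in range(segments):
            start = s * n // segments
            end = (s + 1) * n // segments
            feature.extend(histogram(values[start:end], edges))
    return feature


# Hypothetical bins (degrees per frame): turning one way, still, turning the other way.
EDGES = [-15.0, -3.0, 3.0, 15.0]

# Toy yaw track (degrees) imitating a head shake; not data from the paper.
shake = [0.0, 5.0, -5.0, 5.0, -5.0, 5.0, -5.0, 5.0, 0.0]
feature = temporal_pyramid(yaw_derivative(shake), EDGES)
print(len(feature), feature[:3])
```

The coarsest level records that the head moved equally in both directions over the whole clip, while the finer levels record that this alternation persists within each sub-interval, which is what distinguishes a sustained shake from a single turn. A spatial pyramid follows the same pattern with image sub-regions in place of time segments.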

Key findings

The system achieved recognition accuracy of 95% or higher for two grammatically distinct classes of expressions. For wh-questions (who, what, when, where, why, how), a stacked SVM classifier combining spatial pyramid features (facial appearance from SIFT descriptors) with temporal pyramid features (head pose changes) achieved 95.5% accuracy with 91.7% precision and 100% recall. Spatial pyramids alone yielded 90.9% accuracy, while pose information alone achieved only 63.6%, indicating that facial appearance is more discriminative than head pose for wh-questions. For negative expressions, the temporal pyramid approach using head pose derivatives achieved 95% accuracy with 90.9% precision and 100% recall, with only one false positive in the confusion matrix. A key technical contribution is that the system is signer-independent: the training and test sets contained data from different signers, so the reported results reflect generalization across individuals. The authors also found a correlation between head pose and the appearance of the eyes and eyebrows, which the stacked classifier exploited to correct the one false negative produced by the SIFT-only classifier.
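As a sanity check on how the reported accuracy, precision, and recall relate to one another, the following sketch computes all three from confusion-matrix counts. The counts used here are hypothetical: the summary does not reproduce the paper's confusion matrices, and these values were chosen only because they yield the reported percentages (with zero false negatives and one false positive in each case).

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def as_percent(m: dict) -> dict:
    """Round each metric to one decimal place, in percent."""
    return {name: round(value * 100, 1) for name, value in m.items()}


# Hypothetical counts consistent with the reported figures (assumed, not from the paper).
wh_questions = as_percent(metrics(tp=11, fp=1, tn=10, fn=0))
negation = as_percent(metrics(tp=10, fp=1, tn=9, fn=0))

print(wh_questions)
print(negation)
```

With 100% recall, a single false positive is enough to account for both the precision and the accuracy shortfall, which matches the statement that the negative-expression confusion matrix contains only one false positive.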

Relevance

This work addresses a fundamental challenge in making sign language computationally accessible: recognizing the non-manual grammatical markers that determine the meaning of signed utterances. A recognition system that cannot tell whether a signer is asking a question or making a negative statement will produce ambiguous or incorrect translations. The research has implications for several accessibility applications: automatic sign-language-to-text translation, sign language video archiving and retrieval, and tools for sign language learners. The signer-independent approach is particularly important for practical deployment, since a system that works only for specific signers would have limited real-world utility. The paper also highlights the linguistic complexity of sign languages: they are not simply manual gestures but full languages whose grammatical structures are expressed through multiple simultaneous channels, including the face, head, and body. This understanding is essential for anyone developing sign language technology or accessible communication tools.

Tags: sign language recognition · American Sign Language · non-manual markers · computer vision · facial expression recognition · machine learning · deaf accessibility