Isolated Sign Language Recognition with Grassmann Covariance Matrices

Hanjie Wang, Xiujuan Chai, Xiaopeng Hong, Guoying Zhao, Xilin Chen · 2016 · ACM Transactions on Accessible Computing · doi:10.1145/2897735

Summary

This paper proposes a novel method for isolated sign language recognition that uses Grassmann Covariance Matrices (GCM) to fuse multimodal features captured by Microsoft Kinect. With 360 million people worldwide affected by hearing loss, 21 million of them in China alone, automatic sign language recognition is critical for bridging communication gaps between deaf and hearing communities. The researchers address a fundamental challenge in sign language recognition: effectively combining information from multiple feature sources (hand shape and body skeleton) across the temporal dimension of a sign sequence. They use covariance matrices to naturally fuse features from both RGB images (hand shape via HOG descriptors) and depth data (skeleton joint positions), then project the resulting matrices onto the Grassmann manifold for more accurate distance measurement. This overcomes a limitation of traditional Riemannian metrics, which poorly preserve topological structure when measuring distances between covariance matrices.
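
The fusion idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each frame yields one vector concatenating hand-shape and skeleton features, that the sequence covariance is mapped to the Grassmann manifold by keeping the span of its top-k eigenvectors, and that subspaces are compared with the projection metric. The function names, the subspace dimension k, the feature sizes, and the random stand-in data are all hypothetical; the paper's exact projection and distance may differ.

```python
import numpy as np


def sequence_covariance(frames):
    """Covariance (d x d) of a (T, d) array holding one fused feature vector per frame."""
    frames = np.asarray(frames, dtype=float)
    centered = frames - frames.mean(axis=0, keepdims=True)
    return centered.T @ centered / (frames.shape[0] - 1)


def grassmann_point(cov, k):
    """Orthonormal (d, k) basis spanning the top-k eigenvectors of cov (assumed mapping)."""
    _, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, -k:]


def projection_distance(u1, u2):
    """Projection-metric distance between two k-dimensional subspaces."""
    k = u1.shape[1]
    overlap = np.linalg.norm(u1.T @ u2, ord="fro") ** 2
    return float(np.sqrt(max(k - overlap, 0.0)))  # clamp tiny negative round-off


# Random stand-ins for two sign sequences of different lengths (hypothetical sizes).
rng = np.random.default_rng(0)
fused_a = np.hstack([rng.normal(size=(40, 6)),   # stand-in for hand-shape descriptors
                     rng.normal(size=(40, 4))])  # stand-in for skeleton features
fused_b = np.hstack([rng.normal(size=(55, 6)),
                     rng.normal(size=(55, 4))])

u_a = grassmann_point(sequence_covariance(fused_a), k=3)
u_b = grassmann_point(sequence_covariance(fused_b), k=3)
dist_ab = projection_distance(u_a, u_b)
print(dist_ab)
```

Because the covariance matrix has a fixed size regardless of sequence length, the two sequences above (40 and 55 frames) become directly comparable, and the distance depends only on the subspaces, not on the particular basis chosen for them.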

Key findings

The GCM method achieved 96% accuracy on a 370-sign signer-dependent dataset, 92.4% on a challenging 1,000-sign vocabulary (outperforming HMM by 9.2 percentage points), and 70.9% on a signer-independent dataset with seven signers, significantly exceeding all baseline methods, including HMM, DTW, and LED-SVM. Statistical analysis confirmed that the results were statistically significant (p < 0.05). The method also proved computationally efficient at 174 ms per sign during testing, nearly twice as fast as competing approaches. When tested on the ChaLearn multimodal gesture dataset, GCM achieved 94.02% recall and 93.07% precision, outperforming the competition's top-ranked methods. The researchers released three publicly available Chinese Sign Language datasets to facilitate further research, a significant contribution given the scarcity of SLR evaluation resources.

Relevance

This work advances practical sign language recognition toward real-world deployment. The high accuracy on large-vocabulary tasks (1,000 signs) and the fast processing time make the system viable for interactive applications. The signer-independent results, while lower (70.9%), represent the more realistic scenario in which a system must work with users whose signing was not in the training data, a crucial requirement for general accessibility tools. The public release of three datasets addresses a major research bottleneck. For practitioners, the Kinect-based approach offers an affordable path to SLR systems. Limitations include a focus on isolated signs rather than continuous signing, and a gap between signer-dependent and signer-independent accuracy that remains an open challenge across the field.

Tags: sign language recognition · Chinese sign language · computer vision · machine learning · deaf and hard of hearing · Kinect · multimodal features