Modeling the Speed and Timing of American Sign Language to Generate Realistic Animations

Sedeeq Al-khazraji, Larwan Berke, Sushant Kafle, Peter Yeung, Matt Huenerfauth · 2018 · Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '18) · doi:10.1145/3234695.3236356

Summary

This paper addresses the challenge of generating realistic computer animations of American Sign Language (ASL) by automatically modeling three critical timing parameters: where prosodic pauses should be inserted, how long those pauses should last, and how the signing speed of individual words should vary within a passage (e.g., slower at phrase endings). The motivation is that many Deaf and Hard of Hearing (DHH) individuals prefer ASL as their primary language, and over half of deaf adults in the U.S. have English reading literacy at the fourth-grade level. Even so, very few websites offer sign language content, because producing and updating sign language videos is expensive. Animated signing avatars could make web content more accessible, but current systems produce unnaturally timed animations, with uniform speed and pauses only at sentence boundaries. The researchers trained three cascaded machine-learning models on a motion-capture corpus of multi-sentence ASL recordings from three native signers (83 passages, 7,138 words): a Linear-Chain Conditional Random Field for pause insertion, a Gradient Boosting Regressor for differential signing rate, and a Gradient Boosting Regressor for pause duration. The models use linguistic features such as proximity to syntactic boundaries, phrase length, and a complexity index.
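
To make the cascade concrete, here is a minimal Python sketch of how the outputs of three such models might be combined into a timed sequence of signs. It is an illustration under stated assumptions, not the authors' implementation: the names `TimedSign`, `apply_timing`, and `total_duration` are hypothetical, the three models are passed in as plain callables, the signing rate is treated as a speed multiplier on a base duration, and all times are assumed to be in seconds.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TimedSign:
    gloss: str
    duration: float     # time spent on the sign itself (assumed seconds)
    pause_after: float  # pause inserted after the sign, 0.0 if none


# Hypothetical stage signatures. In the paper the stages are a linear-chain
# CRF and two gradient boosting regressors; here they are opaque callables.
PauseModel = Callable[[Sequence[str]], List[bool]]
RateModel = Callable[[Sequence[str], Sequence[bool]], List[float]]
PauseLengthModel = Callable[[Sequence[str], Sequence[bool]], List[float]]


def apply_timing(
    glosses: Sequence[str],
    base_durations: Sequence[float],
    insert_pauses: PauseModel,
    predict_rates: RateModel,
    predict_pause_lengths: PauseLengthModel,
) -> List[TimedSign]:
    """Run the three stages in order and merge their outputs per sign."""
    # Stage 1: decide after which signs a pause occurs.
    has_pause = insert_pauses(glosses)
    # Stage 2: per-sign rate. Assumption: a multiplier on speed, so a value
    # below 1.0 slows the sign down and lengthens its duration.
    rates = predict_rates(glosses, has_pause)
    # Stage 3: pause length, used only where stage 1 placed a pause.
    pause_lengths = predict_pause_lengths(glosses, has_pause)

    timeline = []
    for gloss, base, pause, rate, length in zip(
        glosses, base_durations, has_pause, rates, pause_lengths
    ):
        timeline.append(TimedSign(gloss, base / rate, length if pause else 0.0))
    return timeline


def total_duration(timeline: Sequence[TimedSign]) -> float:
    """Total running time of the passage, including pauses."""
    return sum(sign.duration + sign.pause_after for sign in timeline)
```

Passing the stage-1 pause decisions into the later stages is one plausible reading of "cascaded"; the exact features each model consumes are described in the paper itself.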

Key findings

The new ASL-Speed model outperformed both simple baselines and the prior state-of-the-art 2008 Model across all three timing tasks. For pause insertion, the model achieved an F1-score of 80%, versus 77% for the baseline and 63% for the 2008 Model. For differential signing rate, it achieved an RMSE of 0.45, versus 0.50 for the baseline and 0.84 for the 2008 Model (lower is better). For pause duration, RMSE was 2.77, versus 4.47 for the baseline and 5.31 for the 2008 Model. Crucially, the new model requires less input from human authors: it needs only sentence boundaries and basic phrase structure, whereas the 2008 Model required a full syntactic parse tree for every sentence. In a user study with 8 native ASL signers (all of whom learned ASL before age 3), 6 of 8 participants preferred animations generated with the ASL-Speed model over the baseline. Participants commented that the new model produced signing that felt more natural: "It [new] is normal, almost like a real person signing." The paper also thoughtfully addresses Deaf community concerns, referencing a 2018 joint statement by the World Federation of the Deaf and the World Association of Sign Language Interpreters cautioning against signing avatars replacing human interpreters in critical contexts.
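
For readers less familiar with the two metrics quoted above, the following Python sketch shows their standard definitions: F1 for a binary decision such as "pause or no pause after this sign", and RMSE for a numeric prediction such as a rate or a pause length. The function names are illustrative, and the sketch makes no claim about how the authors computed their scores (for example, how results were averaged across passages or folds).

```python
import math
from typing import Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # pauses correctly predicted
    fp = sum(1 for t, p in pairs if not t and p)    # pauses predicted in error
    fn = sum(1 for t, p in pairs if t and not p)    # pauses that were missed
    if tp == 0:
        return 0.0  # no correct positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error; lower means predictions are closer."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

F1 is a higher-is-better score, while RMSE is an error measure, so the ASL-Speed model's higher F1 and lower RMSE values both point in the same direction.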

Relevance

This research tackles a fundamental barrier to sign language accessibility on the web: the cost and effort of producing sign language content. By automating the subtle timing decisions that make ASL animations look natural — pauses, speed variations, phrase-final lengthening — the work moves closer to a system where an ASL-fluent human author can write a script and have software generate a realistic animation without manually tuning hundreds of timing parameters. For accessibility practitioners, this highlights that sign language accessibility requires much more than translating words — the prosodic structure of signed languages is essential for comprehensibility. The ethical discussion is equally important: the researchers explicitly position their work as a tool to increase the volume of sign language content on the web, not to replace human interpreters in critical settings like healthcare or education. The involvement of DHH researchers and participants throughout the project demonstrates responsible development practices for technology that serves the Deaf community.

Tags: sign language · ASL · animation · machine learning · Deaf accessibility · natural language processing · prosody · motion capture