Beyond Technical Metrics: Understanding the Gap Between AI Performance and Deaf User Experience in Chinese Natural Sign Language Generation

Yang Liu, Hui Kang, Yurun He, Jiahui Li · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791429

Summary

This CHI 2026 paper investigates the disconnect between technical performance metrics and actual Deaf user experience in AI-generated Chinese Natural Sign Language (CNSL). The authors argue that existing sign language generation research, dominated by hearing researchers, relies on metrics borrowed from machine translation and computer vision (BLEU, SSIM, symbol accuracy, length deviation) that measure surface similarity to reference videos rather than whether Deaf viewers actually comprehend the output. This gap is especially acute for CNSL, which differs fundamentally from Chinese Sign Language (CSL, a manually constructed system mirroring Mandarin) through its scene-dependent expressions, clause-final interrogatives, synchronized non-manual negation, and classifier-based spatial grammar. The team built a three-stage CNSL generation prototype (Chinese text + scene label → HamNoSys → pose keypoints → diffusion-rendered video) as a controlled evaluation stimulus, then partnered with four Deaf co-researchers across five co-design workshops to develop a seven-dimension evaluation framework. The framework covers comprehension (subjective ratings paired with objective paraphrase tests), temporal fluency, spatial grammar, non-manual markers, expressiveness, mental demand (modified NASA-TLX), and acceptance. Study 2 applied the framework in a 2×3 within-subjects evaluation with 24 native CNSL signers, comparing AI-generated against human signing across declarative, interrogative, and negative sentence types. Mixed-effects models, interview analysis, and convergence tests between co-researchers and participants validated the framework's generalizability.

Key findings

Technical benchmarks diverged sharply from comprehension. The prototype achieved 80.6% HamNoSys symbol accuracy and high smoothness (0.74), yet objective paraphrase tests showed only 24.5% correct comprehension for AI-generated clips versus 37.6% for human videos. Subjective and objective comprehension also diverged: participants consistently rated AI clips higher than their paraphrases warranted, a pattern the co-researchers named 'pseudo-understanding'. Cognitive load for AI signing rose with linguistic complexity (declaratives 32.7 → negatives 43.0 on a 0-100 scale) while human signing showed the opposite trend, revealing an inverse complexity effect. Interrogatives and negatives produced the largest gaps because of failures in synchronized non-manual markers and spatial binding. Post-hoc contrasts showed that human signing significantly outperformed AI on temporal fluency (+0.86), spatial grammar (+0.82), non-manual markers (+0.53), expressiveness (+0.44), and acceptance (+9.91), with mental demand 9.54 points higher for AI. Acceptance was scene-dependent: 18 of 24 participants accepted AI signing for low-stakes informational contexts (mall announcements, transit) but rejected it for medical instructions or personal communication.
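To make the size of these gaps concrete, the short Python sketch below does nothing more than subtract the figures quoted above. The dictionary keys and the helper name `gap` are illustrative assumptions, not names from the paper, and the differences are simple point differences, not statistics the authors report.

```python
# Figures quoted in this summary; key names are illustrative assumptions.
reported = {
    "symbol_accuracy_pct": 80.6,      # HamNoSys symbol accuracy of the prototype
    "ai_comprehension_pct": 24.5,     # objective paraphrase score, AI clips
    "human_comprehension_pct": 37.6,  # objective paraphrase score, human videos
    "ai_load_declarative": 32.7,      # mental demand, 0-100 scale
    "ai_load_negative": 43.0,         # mental demand, 0-100 scale
}


def gap(a: float, b: float) -> float:
    """Return a - b rounded to one decimal place (a point difference)."""
    return round(a - b, 1)


# Technical metric vs. what viewers could actually paraphrase (percentage points).
metric_vs_comprehension = gap(
    reported["symbol_accuracy_pct"], reported["ai_comprehension_pct"]
)

# Human vs. AI objective comprehension (percentage points).
human_vs_ai = gap(
    reported["human_comprehension_pct"], reported["ai_comprehension_pct"]
)

# Rise in mental demand for AI signing from declaratives to negatives (scale points).
load_increase = gap(reported["ai_load_negative"], reported["ai_load_declarative"])

print(metric_vs_comprehension, human_vs_ai, load_increase)
```

Read this way, symbol accuracy sits more than 50 percentage points above objective comprehension for the same AI-generated clips, which is the divergence the paper's title refers to.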

Relevance

For accessibility practitioners building or evaluating AI-generated sign language, this paper is a sharp warning that improvements on technical metrics can mask, or even worsen, actual communicative effectiveness. It operationalizes 'with, not for' Deaf communities into a reusable seven-dimension evaluation framework, a modified NASA-TLX for signed content, and a dual subjective/objective comprehension protocol that exposes overconfidence bias. The scene-dependent acceptance finding reframes deployment questions: AI signing may be acceptable as a supplement in low-risk informational contexts but not as a replacement for interpreters in high-stakes domains. The study has three main limitations: it covers single sentences rather than discourse, it tests only one CNSL pipeline, and its online-recruited sample skews younger and digitally engaged. The framework's principles (non-manual synchrony, referential stability, spatial cohesion, cognitive effort) are transferable to ASL, BSL, and other signed-language generation work.

Tags: sign language generation · deaf accessibility · AI accessibility · participatory design · human-centered evaluation · cognitive load · Chinese Natural Sign Language · empirical evaluation