CuCap: Comparative Analysis of Customized Captioning between North American and South Korean d/Deaf and Hard-of-Hearing Users

Caluã de Lacerda Pataca, SooYeon Ahn, Suhyeon Yoo, JooYeong Kim, Khai N. Truong, Jin-Hyuk Hong, Roshan L Peiris, Matt Huenerfauth · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746400

Summary

This paper introduces CuCap, a customizable captioning system that lets d/Deaf and Hard-of-Hearing (DHH) users personalize how paralinguistic speech features are visually represented in captions. Traditional captions convey only the words spoken, losing information about tone, emotion, loudness, and rhythm that hearing people perceive naturally. CuCap addresses this gap by extracting five speech features (valence, arousal, loudness, pitch, and rhythm) and letting users map them to seven typographic styles: font color, size, weight, background color, baseline shift, opacity, and letter spacing. The system uses Praat to extract the prosodic features (loudness, pitch, rhythm), an emotion recognition toolkit based on the circumplex model for the affective features (valence and arousal), OpenAI Whisper for transcription with Dynamic Time Warping for per-word timestamps, and GPT-4 for generating compressed descriptions. The researchers conducted a cross-cultural study with 49 DHH participants, 28 from North America (English-speaking) and 21 from South Korea (Korean-speaking), to understand how cultural and linguistic context influences captioning preferences. Participants watched video clips and set their preferred speech-feature-to-typography mappings through CuCap's interface, then took part in semi-structured interviews about their experiences and preferences.
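The feature-to-typography mapping described above can be sketched as a small function that turns a normalized feature value into a CSS declaration. This is a minimal illustration, not CuCap's implementation: the names `STYLE_RANGES`, `style_for`, and `word_css`, the numeric ranges, and the assumption that each feature arrives normalized to the interval 0 to 1 are all hypothetical. Only numeric styles are covered; color mappings would need a separate interpolation step.

```python
# Hypothetical output ranges for a few numeric typographic styles.
# The bounds and units are illustrative assumptions, not values from the paper.
STYLE_RANGES = {
    "font-size": (0.8, 1.6, "em"),
    "font-weight": (300, 900, ""),
    "opacity": (0.4, 1.0, ""),
    "letter-spacing": (-0.05, 0.3, "em"),
}


def clamp01(x: float) -> float:
    """Clamp a feature value to the assumed normalized range [0, 1]."""
    return max(0.0, min(1.0, x))


def style_for(feature_value: float, css_property: str) -> str:
    """Linearly interpolate a normalized feature value into a CSS declaration."""
    low, high, unit = STYLE_RANGES[css_property]
    value = low + clamp01(feature_value) * (high - low)
    if css_property == "font-weight":
        # CSS font weights are conventionally given in steps of 100.
        value = round(value / 100) * 100
    else:
        value = round(value, 3)
    return f"{css_property}: {value:g}{unit}"


def word_css(features: dict, mapping: dict) -> str:
    """Build the inline style for one word.

    features: per-word feature values, e.g. {"loudness": 0.9, "pitch": 0.2}
    mapping:  the user's chosen feature-to-style mapping,
              e.g. {"loudness": "font-size"}
    Features the user did not map, or that are missing for a word, are skipped.
    """
    declarations = [
        style_for(features[name], css_property)
        for name, css_property in mapping.items()
        if name in features
    ]
    return "; ".join(declarations)
```

For example, a user who maps loudness to font size and arousal to font weight would call `word_css({"loudness": 0.9, "arousal": 0.5}, {"loudness": "font-size", "arousal": "font-weight"})` once per word, using the per-word timestamps to decide when each styled word appears.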

Key findings

Emotion visualization was valued in both cultures: valence was chosen by 64% of North American (NA) and 52% of Korean (KOR) participants, primarily mapped to font color, and arousal by 61% of NA and 84% of KOR participants. Loudness was popular in both groups (63% NA, 76% KOR) and was most commonly mapped to font size. The most striking cross-cultural difference was in pitch: only 21% of NA participants selected it, compared to 52% of KOR participants, a statistically significant difference (p=0.04) potentially explained by Korean's emerging tonal contrasts and Hangul's articulatory encoding of speech sounds. Rhythm preferences also diverged, with NA users preferring letter spacing and KOR users preferring font size. Participants reported improved comprehension of speaker emotions and intentions (N=35) and increased enjoyment and immersion (N=10), and saw potential for the captions as a backup for assistive listening devices. Concerns included visual complexity and distraction (particularly among Korean participants, N=8), a learning curve for interpreting new visual encodings (N=14), and some preference for conventional unmodified captions (N=6). Sign language background also influenced preferences: ASL users connected font-size changes for loudness to how ASL uses larger signs for emphasis.
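To see how a group difference like the pitch result can be tested, the sketch below runs a two-sided Fisher's exact test on a 2x2 table. Two things here are assumptions rather than facts from the paper: the counts are back-computed from the reported percentages (21% of 28 is roughly 6 participants, 52% of 21 is roughly 11), and the summary does not say which test produced p=0.04, so this procedure may differ from the authors' and need not reproduce that exact value. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups, columns are "selected" / "did not select".
    Returns the probability, under fixed margins, of all tables that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(k: int) -> int:
        # Unnormalized hypergeometric weight of the table whose top-left cell is k.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Integer weights keep the "no more likely than observed" comparison exact.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / total


# Assumed counts, back-computed from the reported percentages:
# NA: about 6 of 28 selected pitch; KOR: about 11 of 21 selected pitch.
p_value = fisher_exact_two_sided(6, 22, 11, 10)
print(f"two-sided p = {p_value:.3f}")
```

With samples this small, an exact test avoids relying on the large-sample approximation behind a chi-square test, which is one reason to reach for it here.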

Relevance

This research has direct implications for caption designers and developers building media accessibility tools. The finding that captioning preferences vary meaningfully across cultures challenges the assumption that one-size-fits-all captions are sufficient for global DHH audiences. The cross-cultural pitch difference shows how linguistic structure shapes accessibility needs: caption systems serving Korean users should consider giving pitch visualization more prominence than those serving English speakers. The paper's design recommendations are actionable: offer progressive disclosure to manage customization complexity, provide sensible defaults while accommodating individual variation, and explore genre-specific caption templates with machine-learning-based personalization. For organizations creating captioning solutions, this work shows that user customization is not just a nice-to-have feature but a meaningful accessibility improvement that enhances comprehension, engagement, and emotional connection to media content.

Tags: deaf and hard of hearing · captioning · customization · cross-cultural study · speech prosody · typographic design · emotion recognition · personalization