Like, Comment & Caption: A Decade of Social Media Video Caption Research (2015-2025)

Huong Nguyen, Emma J. McDonnell, Lloyd May, Alexander Druzenko, Zoobia Saifullah Syeda, Mark Cartwright, Sooyeon Lee · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791868

Summary

This CHI 2026 paper is a systematic literature review of 36 peer-reviewed studies on Social Media Video Captions (SMVC) published between 2015 and 2025, spanning HCI, accessibility, media studies, education, and language learning. The authors use 'SMVC' as an umbrella term for textual or symbolic elements (platform-generated captions, creator-edited captions, user-generated captions, and non-speech information) that are temporally aligned with video on platforms such as YouTube, TikTok, and Instagram. Using a two-stage Google Scholar + SerpAPI search, PRISMA-style screening that narrowed 662 records to 36 papers, and collaborative grounded-theory coding (open, axial, selective), the authors map the SMVC research landscape across three research questions: how captioning systems have evolved, how viewers and creators engage with them, and what design and infrastructure gaps remain. The paper situates social media captioning within a historical arc running from U.S. broadcast captioning (NCI, 1976 FCC line-21 authorisation, the 1996 Telecommunications Act), through streaming-era policy (the Twenty-First Century Communications and Video Accessibility Act, or CVAA, and the European Accessibility Act) and file-based formats (WebVTT, TTML), into the post-COVID videoconferencing moment and the current creator economy of over 300 million creators. Its central contribution is conceptual: the authors propose 'Participatory Captioning', a framework characterising SMVC as a collective infrastructure co-produced by viewers, creators, and platforms, together with design opportunities and critiques drawn from participatory-design theory and the Collective Communication Access (CCA) framework.

Key findings

Across four key themes, the review surfaces consistent patterns. (1) SMVC types: automatic captions remain error-prone (error rates vary with speaker accent, function words, and content domain; auto-translation compounds these issues); user-generated captions show high coverage but recurring inconsistency (e.g., only 72.7% of sampled videos applied captions consistently across audio sources; TikTok captions frequently deviated from linguistic norms); non-speech information is sparse and limited to generic tags ('[music]', '[applause]'); and captions for sign language are nearly absent from major platforms. (2) Viewer perspectives: Deaf and hard of hearing (DHH) viewers report skipping uncaptioned content, spending roughly half of their visual attention reading captions, and using hashtags like #CaptionYourVideos as workarounds. Captions also benefit ADHD viewers, language learners, and general audiences (~85% of Facebook video is watched without sound; captioned videos sustain 80% higher engagement). Expectations vary by platform and genre, and viewers increasingly demand adjustable display, error reporting, and feedback channels. (3) Creator perspectives: creators caption for mixed motivations (accessibility, engagement, branding, monetisation) but face inadequate tooling, opaque moderation, and misclassification of Deaf-aware gesture captions as violations. (4) Systems, techniques, and datasets: 36 empirical papers, 2 datasets, and 4 artifacts are catalogued, including BandCaption, OnomaCap, SiAMP, and the YouTube NSI Captioning Dataset (715,000 videos). Studies with DHH participants (19 papers) and U.S. or China samples dominate; neurodivergent participants (4 papers) and non-English contexts remain underserved.
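
As a rough illustration of the non-speech finding, the sketch below shows one way a reader might measure how much of a caption track consists of generic bracketed tags such as '[music]'. It is not the method of the review or of any study it covers: the function names, the tag heuristic, and the sample track are all assumptions made for illustration, and the parser handles only simple WebVTT files with blank-line-separated cues.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:04.000".
TIMING = re.compile(
    r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)
# Assumed heuristic: a cue whose entire text is one bracketed tag, e.g. "[music]".
NSI_TAG = re.compile(r"^\[[^\[\]]+\]$")


def parse_cues(vtt_text):
    """Return (start, end, text) tuples for each cue in a simple WebVTT string."""
    cues = []
    # Cues are separated by blank lines; the "WEBVTT" header block has no timing line.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((match.group(1), match.group(2), text))
                break
    return cues


def nsi_share(cues):
    """Fraction of cues that consist only of a bracketed non-speech tag."""
    if not cues:
        return 0.0
    tagged = sum(1 for _, _, text in cues if NSI_TAG.match(text))
    return tagged / len(cues)


# Hypothetical caption track, invented for this example.
SAMPLE = """WEBVTT

00:00:00.000 --> 00:00:02.000
[music]

00:00:02.000 --> 00:00:05.000
Welcome back to the channel.

00:00:05.000 --> 00:00:06.500
[applause]

00:00:06.500 --> 00:00:09.000
Today we are testing captions.
"""

cues = parse_cues(SAMPLE)
share = nsi_share(cues)
print(f"{len(cues)} cues, {share:.0%} are bare non-speech tags")
```

A count like this says nothing about whether the tags are informative; a track could score high while still offering only '[music]', which is the limitation the review describes.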

Relevance

For practitioners, policy-makers, and platform designers, this review reframes social media captioning as a socio-technical, community-sustained infrastructure rather than a compliance feature delivered top-down, with direct implications for how tools, feedback loops, and moderation should be built. The Participatory Captioning framework (shared labour, accessibility-first, platform support) gives teams a vocabulary for designing creator recognition systems, AI-assisted editing, structured viewer feedback, contributor badges, and expertise-prioritisation mechanisms. The authors also name uncomfortable governance questions: how to avoid exploiting invisible DHH and multilingual volunteer labour, how to prevent over-optimised AI from homogenising expressive caption styles (TikTok 'vibe text', emoji captions, hand-sign lipsync gestures), and how to preserve cultural specificity when auto-translate systems flatten linguistic nuance. Limitations to note: the corpus is English-only and relies on Google Scholar alone, so non-English and non-Western platforms (Moj, Likee, BIGO Live, Kuaishou) are under-represented, and the shift toward LLM/multimodal-AI captioning pipelines, which is likely to dominate the next decade, sits largely outside the 36-paper window. Still, this is the strongest current synthesis of SMVC research and a high-leverage starting point for caption-related design and policy work.

Tags: captioning · captions · video accessibility · social media accessibility · Deaf and hard of hearing · neurodivergence · language learners · participatory design · systematic literature review · automatic speech recognition · non-speech information · TikTok · YouTube

Standards referenced: WebVTT · TTML · CVAA · European Accessibility Act