Introducing Game Elements in Crowdsourced Video Captioning by Non-Experts

Hernisa Kacorri, Kaoru Shinkawa, Shin Saito · 2014 · Proceedings of the 11th Web for All Conference (W4A) · doi:10.1145/2596695.2596713

Summary

This paper from CUNY Graduate Center and IBM Research Tokyo presents a gamified crowdsourcing platform for video captioning that combines automatic speech recognition (ASR) output with non-expert human transcription to improve caption accuracy without monetary rewards. The system builds on the Collaborative Caption Editing System (CCES) framework, which automatically segments videos into clips of 2 to 10 seconds based on phrase boundaries in the audio signal. Non-expert users work on these short segments in two modes: "Type" mode (listen and transcribe from scratch) and "Fix" mode (edit ASR-generated text). Game-like elements include a scoring system and a countdown timer set at 9x the clip duration, a limit based on prior research showing that non-experts need approximately 8x the video length for captioning. Scoring in Type mode compares submissions against the ASR result, while scoring in Fix mode compares them against another user's Type-mode transcription. This is a deliberate design choice: scoring Fix mode against ASR would discourage users from making corrections. Transcriptions from multiple users on the same segment are aligned and merged using majority voting, where a word is accepted if it appears at least twice across the ASR and user submissions.
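The timer rule and the majority-vote merge can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the transcriptions have already been aligned word by word (the summary does not describe the alignment step), and the names `time_limit_seconds`, `merge_aligned`, and `MIN_VOTES`, as well as the example sentences, are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence

# From the summary: a word is accepted if it appears at least twice
# across the ASR output and the user submissions.
MIN_VOTES = 2


def time_limit_seconds(clip_seconds: float, factor: float = 9.0) -> float:
    """Countdown length for a clip: 9x the clip duration, per the summary."""
    return factor * clip_seconds


def merge_aligned(columns: Sequence[Sequence[Optional[str]]]) -> List[str]:
    """Majority-vote merge over pre-aligned word columns.

    Assumption: each column holds the word each source (ASR, user A,
    user B, ...) produced at one aligned position, with None for a gap.
    """
    merged: List[str] = []
    for column in columns:
        votes = Counter(word for word in column if word is not None)
        if not votes:
            continue  # every source has a gap at this position
        # On a tie, most_common returns the word seen first in the column.
        word, count = votes.most_common(1)[0]
        if count >= MIN_VOTES:
            merged.append(word)
    return merged


# Hypothetical segment; column order is (ASR, Type-mode user, Fix-mode user).
columns = [
    ("i", "i", "i"),
    ("read", "read", "read"),
    ("many", "many", "many"),
    ("blocks", "blogs", "blogs"),  # two users outvote the ASR error
    ("today", None, "today"),      # one user dropped the word; it still has 2 votes
]
merged = merge_aligned(columns)
print(" ".join(merged))            # i read many blogs today
print(time_limit_seconds(10.0))    # 90.0
```

With only the ASR output and two users per segment, a correction survives the vote only when both users agree on it, which is one way to see why more contributors per segment could help.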

Key findings

A pilot study with 42 voluntary participants across 578 English video segments (2-10 seconds each, spanning narrations, tutorials, and speeches) showed that merging just two user transcriptions per segment reduced the overall Word Error Rate (WER) from 20.7% (ASR alone) to 16.0% across 6,501 words. The merged results had notably smaller variance in both WER and error counts than individual submissions, suggesting that additional users per segment would yield further improvements. An interesting finding concerns the complementary value of the two modes: although Type-mode transcriptions were individually worse than ASR (mean WER 33% vs. 22%), they served a critical function in the merging process by catching ASR errors that Fix-mode users missed. When ASR output is nearly correct, Fix-mode users tend to trust it and miss subtle errors such as "blocks" instead of "blogs", a finding that echoes the ASR quality threshold research by Gaur et al. Participants in Fix mode completed segments with substantially more time remaining (mean 19.76 seconds) than in Type mode (mean 6.32 seconds), reflecting the lower effort of editing versus transcribing from scratch. Twenty percent of participants contributed to at least 12 segments. Segments were skipped roughly equally across modes (40 Type skips, 39 Fix skips across 71 segments).
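For readers unfamiliar with the metric, the sketch below computes WER in the conventional way, as word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The summary does not state how the authors computed WER, so this is an assumption, and the function names `word_error_rate` and `relative_reduction` are hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# The "blocks" vs. "blogs" confusion in a hypothetical five-word segment:
print(word_error_rate("i read many blogs today", "i read many blocks today"))  # 0.2

# Reported overall figures: 20.7% (ASR alone) to 16.0% (merged).
print(round(relative_reduction(20.7, 16.0), 3))  # 0.227
```

By this arithmetic, the reported drop of 4.7 percentage points corresponds to a relative WER reduction of roughly 23%.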

Relevance

This research addresses the persistent and expensive problem of video captioning for deaf and hard of hearing accessibility. While ASR has improved dramatically since 2014, the core insight remains relevant wherever ASR accuracy is insufficient (noisy environments, accented speech, technical vocabulary): combining imperfect ASR with lightweight human editing can achieve usable caption quality at scale without professional captionists. The gamification approach attempts to solve a fundamental crowdsourcing challenge: captioning has inherently limited entertainment value compared to tasks like image labeling, which makes non-monetary engagement difficult. For accessibility practitioners, the dual-mode design (Type + Fix) offers a practical architecture for human-AI captioning collaboration, and the finding that Type-mode users catch errors Fix-mode users miss supports the value of fresh transcription alongside ASR editing. The work connects to the broader Gaur et al. finding about the 30% WER threshold: at a baseline ASR WER of 20.7%, Fix mode is productive, but the system wisely includes Type mode to guard against over-trust in ASR.

Tags: captioning · crowdsourcing · deaf and hard of hearing · gamification · automatic speech recognition · human computation