Crowdsourcing Correction of Speech Recognition Captioning Errors

M. Wald · 2011 · Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1969289.1969318

Summary

This paper describes tools built around Synote, an award-winning web-based application from the University of Southampton, that enable crowdsourced correction of automatic speech recognition (ASR) captioning errors so that video content can be made accessible at scale. The author frames the problem clearly: professional manual captioning is too expensive for routine use (e.g., university lectures), while ASR word error rates range from under 10% with trained speakers in good conditions to over 30% with untrained speakers in poor acoustic environments. Conversational speech is particularly challenging because speakers run words together, use fillers ("um", "ah"), hesitate mid-word, and do not dictate punctuation. Synote already provided synchronized captions, user notes, and slide images alongside video recordings, using speaker-independent speech recognition for automatic captioning. However, editing the full synchronized transcript was impractical for collaborative correction: the whole transcript had to be saved at once, so concurrent editors would overwrite each other's work. The paper presents two new tools: a Caption Creation Tool that investigates how best to split transcripts into individually editable utterances (captions), and a Crowdsourcing Correction Tool that lets multiple users independently correct individual caption segments.
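To make the error-rate figures above concrete, the following is a minimal sketch of the standard word error rate calculation: word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The summary does not say how the paper computed its figures, so this is the conventional definition rather than the author's procedure; the function name `word_error_rate` and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: words are compared case-insensitively after whitespace
    splitting; real evaluations usually also normalize punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one dropped word in six words.
wer = word_error_rate("the lecture starts at ten today",
                      "the lecture start at ten")
print(f"WER = {wer:.1%}")
```

On this definition, a 30% error rate means roughly one word in three needs fixing, which is the scale of correction work the tools are meant to distribute.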

Key findings

The Crowdsourcing Correction Tool stores all edits from all users and uses a matching algorithm to verify corrections through agreement: when enough users independently make the same correction to an utterance, it is accepted as correct. Administrator settings control how closely edits must match and how many users must agree before an edit is accepted. The interface uses visual indicators: red bars with ticks mark utterances where sufficient agreement has been reached, while green bars mark those still needing correction. Users can be assigned specific transcript sections or left free to correct any utterance. A points system rewards matching edits and can penalize corrections that do not match those of other users, providing a gamification incentive. The Caption Creation Tool investigates segmentation strategies (splitting by word count, utterance duration, or silence length between words) and automatic formatting, including punctuation insertion and capitalization. The system outputs both standard text caption files and XML for Synote. The author notes that previous studies showed that students who edit lecture transcripts perform better on content tests than those who only watch, which suggests that crowdsourced caption correction could be incentivized through academic credit while also improving learning outcomes.
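The agreement check and points system described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the summary does not specify how match closeness is measured, so the sketch swaps in the similarity ratio from Python's standard `difflib.SequenceMatcher`, and the names `accepted_correction`, `score_users`, `match_threshold`, `min_agreeing_users`, `reward`, and `penalty` are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similar(a: str, b: str, threshold: float) -> bool:
    """Assumed closeness measure: difflib's ratio in [0, 1] versus a threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def accepted_correction(edits, match_threshold=1.0, min_agreeing_users=2):
    """Return the first user's edit that enough users agree with, else None.

    edits maps a user name to that user's corrected text for one utterance.
    A user's own edit counts as one of the agreeing edits.
    """
    for candidate in edits.values():
        supporters = sum(
            1 for other in edits.values()
            if similar(candidate, other, match_threshold)
        )
        if supporters >= min_agreeing_users:
            return candidate
    return None


def score_users(edits, accepted, match_threshold=1.0, reward=1, penalty=0):
    """Award points for edits matching the accepted text; optionally penalize others."""
    if accepted is None:
        return {user: 0 for user in edits}  # no agreement yet, so no points
    return {
        user: reward if similar(text, accepted, match_threshold) else -penalty
        for user, text in edits.items()
    }


# Hypothetical edits to a single utterance from three users.
edits = {
    "ann": "The cat sat on the mat.",
    "bob": "the cat sat on the mat.",
    "cy": "The cat sat on a mat.",
}
accepted = accepted_correction(edits, match_threshold=1.0, min_agreeing_users=2)
print(accepted, score_users(edits, accepted, penalty=1))
```

Lowering `match_threshold` below 1.0 would let near-identical edits count as agreement, and raising `min_agreeing_users` trades slower acceptance for more confidence, which mirrors the two administrator settings the summary mentions.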

Relevance

This paper addresses a problem that was central to video accessibility in 2011 and remains relevant: the gap between imperfect ASR output and the high-quality captions needed for accessibility. The crowdsourcing approach, which uses agreement between multiple independent correctors as a quality signal, anticipated the human-in-the-loop correction workflows that are now standard in captioning services. YouTube's community contributions feature (now discontinued) and platforms like Amara used similar collaborative editing models. The insight that caption correction can double as a learning activity for students is particularly clever, because it creates a sustainable incentive structure in which the effort of correction benefits both the corrector and future viewers. The matching-algorithm approach to quality assurance avoids the need for expert review while still catching errors, though it requires enough correctors per utterance to reach agreement. ASR has improved dramatically since 2011, but auto-generated captions still require correction to meet professional or legal accessibility requirements, so these crowdsourcing approaches remain valuable in educational and organizational contexts.

Tags: captioning · speech recognition · crowdsourcing · deaf and hard of hearing · video accessibility · education accessibility · automatic speech recognition