Real-Time Captioning with the Crowd

Walter S. Lasecki, Jeffrey P. Bigham · 2014 · Interactions · doi:10.1145/2594459

Summary

This article presents Scribe, a crowdsourced real-time captioning system that lets groups of non-expert typists collectively produce captions at the speed of natural speech, a task that normally requires highly trained professional stenographers. The authors motivate the work with the limitations of the two existing options. Professional stenographers cost $120-$300/hour, must be scheduled days in advance, and are in short supply. Automatic speech recognition (ASR) at the time still produced too many errors for reliable classroom use, particularly with unfamiliar speakers, fast speech, and domain-specific vocabulary. Scribe streams audio to multiple crowd workers simultaneously, and each worker captures roughly one-fifth to one-third of the spoken words. Because 3-5 workers can together capture everything that is said, the system merges their partial, overlapping captions into a single coherent output using a multiple sequence alignment (MSA) algorithm borrowed from computational biology, where it is used to align DNA sequences. An A*-search-based approach finds the optimal interleaving of captions in under a second, keeping total turnaround time below five seconds. The system also automatically assigns roles to workers and adjusts audio playback speed, slowing down the segments each worker is responsible for while slightly speeding up the surrounding audio, which improved both coverage and precision while reducing worker stress.
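
The idea of merging partial, overlapping captions can be illustrated with a small sketch. This is not the MSA and A* search method the article describes: it is a much simpler greedy pairwise merge built on Python's standard `difflib.SequenceMatcher`, and it assumes workers type words without errors. The function names `merge_pair` and `merge_captions` and the sample sentence are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from functools import reduce


def merge_pair(a, b):
    """Merge two word lists: shared runs are kept once, unshared words from both are kept."""
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Both workers typed these words; emit them a single time.
            merged.extend(a[i1:i2])
        else:
            # 'delete': only a has words here; 'insert': only b does;
            # 'replace': both differ, so keep a's words followed by b's (a simplification).
            merged.extend(a[i1:i2])
            merged.extend(b[j1:j2])
    return merged


def merge_captions(partials):
    """Fold the partial captions left to right into one word list."""
    return reduce(merge_pair, partials, [])


# Hypothetical partial captions from three workers hearing the same sentence.
partials = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["fox", "jumps", "over", "the"],
    ["over", "the", "lazy", "dog"],
]
print(" ".join(merge_captions(partials)))
# prints "the quick brown fox jumps over the lazy dog"
```

A pairwise merge like this depends on the order in which captions are combined and has no way to resolve typos or conflicting words, which is part of why a true multiple sequence alignment over all workers at once is the stronger approach.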

Key findings

Scribe demonstrated that coordinated groups of non-expert workers can perform real-time captioning that matches or exceeds the quality of individual expert captionists. The average untrained person can capture one-fifth to one-third of spoken words, so 3-5 workers can collectively capture everything a speaker says. The MSA merging algorithm combines partial captions into a single output in under one second, for a total turnaround of less than five seconds from speech to displayed caption. Slowing audio playback during the segments a worker is responsible for counterintuitively reduced latency, because workers could type words as they heard them rather than listening to a full segment and then typing from memory. It also improved coverage and precision while reducing worker stress. The system can use crowd workers from Amazon Mechanical Turk or other sources, requires no advance scheduling, and can run on any platform, including Google Glass. Even four work-study students paid $15/hour would cost $60/hour in total, half the rate of the least expensive professional stenographer, making the approach significantly more affordable and accessible.
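
The figures above can be checked with a short arithmetic sketch. The coverage and cost calculations use only numbers from the summary. The playback calculation rests on an assumption not stated here: that each worker must stay in step with the live audio, so time lost while a segment is slowed has to be recovered by speeding up the rest of the period. The 12-second period, 3-second segment, and 0.75x slow rate are illustrative values, and all function names are hypothetical.

```python
import math
from fractions import Fraction


def workers_needed(fraction_captured):
    """Minimum workers for full coverage, assuming their captured portions never overlap."""
    return math.ceil(1 / fraction_captured)


def crowd_cost_per_hour(num_workers, hourly_wage):
    """Total hourly cost of a group of workers paid the same wage."""
    return num_workers * hourly_wage


def fast_rate(period_s, own_segment_s, slow_rate):
    """Playback rate for the rest of a period so the worker ends the period in real time.

    Assumption: the slowed segment plus the sped-up remainder must fit in period_s.
    """
    wall_time_left = period_s - own_segment_s / slow_rate
    if wall_time_left <= 0:
        raise ValueError("slowed segment already fills the whole period")
    return (period_s - own_segment_s) / wall_time_left


print(workers_needed(Fraction(1, 5)), workers_needed(Fraction(1, 3)))  # prints "5 3"
print(crowd_cost_per_hour(4, 15) / 120)  # prints "0.5"
print(fast_rate(12, 3, 0.75))  # prints "1.125"
```

Under these assumed values, slowing a worker's own 3 seconds to 0.75x requires the other 9 seconds to play at only 1.125x, which is consistent with the summary's description of a slight speed-up.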

Relevance

This work challenged the assumption that real-time captioning must be performed by a single highly skilled individual, demonstrating a powerful alternative model where collective non-expert effort can match expert performance. For accessibility practitioners, the key insight is that crowdsourcing can democratize the provision of accommodations — anyone who can hear and type can contribute to providing captioning access for deaf and hard of hearing people. While ASR technology has improved dramatically since this 2014 article, the underlying principle remains relevant: hybrid human-AI approaches can fill gaps where automation alone falls short, and designing systems that coordinate partial human contributions into high-quality outputs opens new possibilities for on-demand accessibility services. The work also illustrates a broader vision where access technology is not solely the domain of specialized professionals or AI systems, but something communities can collectively provide.

Tags: real-time captioning · crowdsourcing · deaf and hard of hearing · speech-to-text · human computation · accessibility