Legion Scribe: Real-Time Captioning by the Non-Experts

Walter S. Lasecki, Christopher D. Miller, Raja Kushalnagar, Jeffrey P. Bigham · 2013 · Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/2461121.2461151

Summary

This demo paper introduces Legion Scribe, a system that enables real-time captioning of speech by having 3-5 ordinary typists work simultaneously, rather than relying on expensive professional stenographers. Real-time captioning provides a text equivalent of spoken language within approximately 5 seconds, serving deaf and hard of hearing people in classrooms, live events, courtrooms, and television. Professional stenographers, the only reliable existing solution, cost $100-300 per hour, must be scheduled days in advance, and can only be booked in hour-long blocks, which makes them impractical in many situations. Scribe addresses this by streaming audio to multiple non-expert captionists through a web-based interface, where each person types as much of the audio as they can. Because no individual non-expert can type at natural speaking rates (approximately 150-180 words per minute), each captionist captures only a partial transcript. The system then merges these overlapping partial captions into a single coherent caption stream using a multiple sequence alignment algorithm and forwards the merged output to the deaf or hard of hearing user's mobile device.

Key findings

The system uses a merging server that applies multiple sequence alignment to stitch the partial, overlapping transcripts from individual captionists into a final output stream. A crowd correction step lets captionists review and fix errors in the merged output before it reaches the end user. The authors report that Scribe's caption accuracy approaches that of professional stenographers, with dramatically lower latency and cost. Because the architecture is web-based, captionists can participate remotely from anywhere with an internet connection, and the system can be started on demand without advance scheduling. The approach decomposes a task that requires rare expert skill (stenography) into smaller subtasks that ordinary people can perform, demonstrating the viability of crowdsourced real-time accessibility services. The mobile-first design lets deaf and hard of hearing users read captions on their own devices in any setting.
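The merging idea can be illustrated with a small sketch. This is not the authors' algorithm: it stands in for multiple sequence alignment with a much simpler progressive pairwise alignment built on Python's standard difflib.SequenceMatcher, and it ignores timestamps, typos, and voting. The function names merge_pair and merge_captions and the sample sentence are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from functools import reduce


def merge_pair(a, b):
    """Align two word lists and keep every word from both, in aligned order.

    Simplifying assumption: words that align as equal are emitted once;
    words present in only one list are kept; where the lists disagree
    ('replace'), both sides are kept, a's words first. A real system
    would resolve disagreements, for example by voting or timestamps.
    """
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(a[i1:i2])
        else:
            # 'delete' has words only in a, 'insert' only in b,
            # 'replace' has words in both.
            merged.extend(a[i1:i2])
            merged.extend(b[j1:j2])
    return merged


def merge_captions(partials):
    """Progressively fold partial transcripts into one caption string."""
    word_lists = [p.lower().split() for p in partials]
    return " ".join(reduce(merge_pair, word_lists, []))


# Three hypothetical captionists, each catching only part of the speech.
partials = [
    "real time captioning helps deaf and",
    "captioning deaf and hard of hearing students",
    "hard hearing students follow lectures",
]
print(merge_captions(partials))
```

With these three inputs the sketch recovers the full sentence "real time captioning helps deaf and hard of hearing students follow lectures". The example sentence deliberately contains no repeated words: a purely text-based pairwise alignment can anchor on the wrong occurrence of a repeated word such as "the", which is one reason a production system would need a more careful alignment than this.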

Relevance

Legion Scribe tackles a major practical barrier to communication accessibility: the cost and availability of real-time captioning services. Many deaf and hard of hearing students, employees, and event attendees go without captions simply because professional stenographers are too expensive or unavailable on short notice. By crowdsourcing the captioning task to non-experts, Scribe dramatically lowers the cost and eliminates scheduling constraints. This work is an early example of using human computation for real-time accessibility, a concept that has influenced subsequent systems, including automated speech recognition (ASR) approaches that now provide lower-cost captioning. While ASR technology has advanced significantly since 2013, the hybrid human-computation model remains relevant where high accuracy is required, or where accented speech, technical vocabulary, or noisy environments still cause ASR to struggle. The paper is a short demo description and does not include detailed accuracy metrics or user study results.

Tags: captioning · deaf and hard of hearing · crowdsourcing · real-time captioning · communication accessibility · assistive technology · human computation