Enhancing Learning Accessibility through Fully Automatic Captioning

Maria Federico, Marco Furini · 2012 · Proceedings of the International Cross-Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/2207016.2207053

Summary

This paper proposes an architecture for automatically generating synchronized captions for video lectures using off-the-shelf automatic speech recognition (ASR) software. The goal is to make educational content accessible to hearing-impaired students, dyslexic students, ESL (English as a Second Language) learners, and students with motor impairments who have difficulty taking notes. The core challenge the authors address is that commercial ASR products such as Dragon NaturallySpeaking produce plain-text transcripts without timing information, so their output cannot be synchronized with video playback directly. Their solution is an audio markup insertion mechanism: before sending the audio to the ASR, the system automatically injects a marker word ("Goofy", chosen because it is unlikely to occur in a lecture) into silence periods within the audio stream. When the ASR transcribes the modified audio, the marker word appears in the transcript at known time positions, so the Caption Alignment module can replace each marker occurrence with its insertion timestamp, producing a timecoded transcript. The architecture is designed to be technology-transparent: any ASR engine, multimedia format, or networking technology can be substituted without changing the overall approach. The system was tested with Computer Science and Linguistics professors at the University of Modena and Reggio Emilia, recording real classroom lectures with live audiences.
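The alignment step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `align_captions` and `to_timecode`, the timecode format, and the choice to reject a transcript whose marker count does not match the number of recorded insertion times are all assumptions made for illustration. Only the marker word "Goofy" and the idea of replacing each marker with its insertion timestamp come from the paper.

```python
import re
from typing import List, Tuple

MARKER = "Goofy"  # marker word reported in the paper


def to_timecode(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (format chosen for this sketch)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def align_captions(
    transcript: str, marker_times: List[float], marker: str = MARKER
) -> List[Tuple[float, str]]:
    """Split a marker-laden transcript into (start_time, text) captions.

    Assumption: the i-th marker in the transcript corresponds to the i-th
    recorded insertion time. Text before the first marker starts at 0.0.
    """
    # Match the marker as a whole word, plus any punctuation the ASR attached.
    pattern = re.compile(rf"\b{re.escape(marker)}\b[.,!?]?", re.IGNORECASE)
    segments = pattern.split(transcript)
    if len(segments) - 1 != len(marker_times):
        # Hypothetical policy: a real system would need a recovery strategy
        # for markers the ASR missed or misrecognized.
        raise ValueError("marker count does not match recorded insertion times")
    starts = [0.0] + list(marker_times)
    captions = []
    for start, text in zip(starts, segments):
        text = text.strip()
        if text:  # skip empty segments, e.g. a marker at the very beginning
            captions.append((start, text))
    return captions


if __name__ == "__main__":
    demo = "welcome to the course Goofy today we cover sorting Goofy any questions"
    for start, text in align_captions(demo, [12.4, 31.0]):
        print(to_timecode(start), text)
```

Because the sketch only needs a plain-text transcript and a list of insertion times, it is independent of the ASR engine that produced the transcript, in line with the technology-transparent design the paper describes.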

Key findings

The experimental evaluation focused on two parameters: the minimum silence length for markup insertion (tested at 30, 60, 90, 120, and 150 ms) and the minimum distance between consecutive markups (tested at 10, 20, 30, and 40 seconds). ASR accuracy stayed at around 80% across all configurations, varying by less than 1% across both the silence length settings and the markup distance settings, which indicates that markup injection does not meaningfully degrade ASR performance. Caption readability, however, imposes practical constraints: with a 1024x80 pixel subtitle area and 16pt Arial font (a maximum of 375 characters per caption), only silence lengths of 30-90 ms and markup distances of 10-20 seconds produced captions that fit within the display area. The optimal configuration was a 90 ms silence length with a 20-second minimum markup distance, balancing accuracy with readable caption length. The 80% accuracy rate, while imperfect, reflects the inherent challenges of classroom speech (variable speed, emphasis changes, filler words, hesitations, and lack of punctuation) compared with the 99% accuracy that ASR vendors claim for controlled dictation scenarios.
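To make the two parameters concrete, the following Python sketch shows one way they could interact when choosing markup positions, along with the caption-length check. It rests on several assumptions not stated in the summary: silence intervals are already detected and given as (start, end) pairs in seconds, the marker is placed at the start of a qualifying silence, and the distance is measured between consecutive insertion times. The function names `pick_markup_points` and `captions_fit` are hypothetical. The defaults (90 ms, 20 s) and the 375-character limit are the values reported above.

```python
from typing import List, Tuple

MAX_CAPTION_CHARS = 375  # limit reported for a 1024x80 area with 16pt Arial


def pick_markup_points(
    silences: List[Tuple[float, float]],
    min_silence_ms: float = 90.0,
    min_distance_s: float = 20.0,
) -> List[float]:
    """Choose markup insertion times from detected silence intervals.

    silences: (start_s, end_s) pairs sorted by start time (assumed input).
    A silence qualifies if it lasts at least min_silence_ms and starts at
    least min_distance_s after the previously chosen insertion time.
    """
    points: List[float] = []
    for start, end in silences:
        if (end - start) * 1000.0 < min_silence_ms:
            continue  # pause too short to hold a marker
        if points and start - points[-1] < min_distance_s:
            continue  # too close to the previous marker
        points.append(start)  # assumption: insert at the start of the silence
    return points


def captions_fit(captions: List[str], max_chars: int = MAX_CAPTION_CHARS) -> bool:
    """Return True if every caption fits within the character limit."""
    return all(len(text) <= max_chars for text in captions)


if __name__ == "__main__":
    detected = [(5.0, 5.05), (8.0, 8.2), (15.0, 15.3),
                (30.0, 30.1), (33.0, 33.5), (52.0, 52.12)]
    print(pick_markup_points(detected))
```

Under these assumptions, a larger minimum distance yields fewer markers and therefore longer captions between them, which is consistent with the finding that only distances of 10-20 seconds kept captions within the display area.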

Relevance

This research tackled the cost barrier that prevented many educational institutions from providing captioned video lectures: manual captioning and "shadow speaking" (where a human repeats the speech slowly for the ASR) were prohibitively expensive for most schools. The fully automatic approach, while producing imperfect captions at 80% accuracy, offered a scalable alternative. Since 2012, ASR technology has improved dramatically. Services like YouTube auto-captions, Otter.ai, and cloud ASR APIs now routinely achieve 90%+ accuracy on clear speech, and this industry adoption has validated the fundamental premise of the work. The audio markup technique for time alignment has been largely superseded by modern ASR engines that natively provide word-level timestamps. However, the paper's framing of captioning as benefiting not just deaf and hard of hearing users but also dyslexic students, ESL learners, and students who have difficulty taking notes remains an important reminder that captions serve a far wider audience than is commonly assumed. The synchronized-caption requirements under WCAG Guideline 1.2 (Time-based Media) continue to be among the most resource-intensive accessibility provisions for educational institutions.

Tags: captioning · speech recognition · education accessibility · deaf and hard of hearing · automatic speech recognition · video accessibility · learning disabilities