Slidecho: Flexible Non-Visual Exploration of Presentation Videos

Yi-Hao Peng, Jeffrey P. Bigham, Amy Pavel · 2021 · Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '21) · doi:10.1145/3441852.3471234

Summary

This paper presents Slidecho, a system that makes recorded presentation videos accessible to blind and visually impaired learners by automatically extracting slide content and synchronizing it with the presenter's speech. The core problem is that most presentation videos, including TED talks, course lectures, and conference presentations, contain significant visual information on slides that speakers never describe aloud. Neither TED videos nor ACM SIGCHI conference videos provide corresponding accessible slides. Slidecho addresses this through a computational pipeline that: (1) identifies slide frames in the video using the Google Video Intelligence API; (2) extracts text via OCR and images via segmentation; (3) converts slide elements into accessible HTML with an appropriate reading order; (4) aligns slide elements to the transcribed speech using sentence embeddings (RoBERTa) and cosine similarity to determine which elements the speaker describes and which remain "undescribed"; and (5) inserts audio notifications at slide boundaries. The system provides three synchronized interface panes: a video player with optional audio notifications for slide changes and undescribed content, a slides pane showing the current slide's full text and image elements in accessible HTML, and an undescribed elements pane that shows only the content the speaker did not mention. An edit mode allows presentation authors or third parties to correct OCR errors, adjust image descriptions, fix slide boundaries, and toggle whether an element is marked as described. The name "Slidecho" is a portmanteau of "slide" and "echo": the system echoes slide transitions and content as the presenter speaks.
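Step (4) of the pipeline above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses RoBERTa sentence embeddings, whereas this sketch swaps in a simple bag-of-words vector so that it runs with the standard library alone. The function names (embed, cosine, align) and the threshold of 0.3 are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts (assumption, NOT RoBERTa)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align(elements, sentences, threshold=0.3):
    """Map each slide element to the index of its best-matching transcript
    sentence, or to None when no sentence reaches the (hypothetical) threshold,
    i.e. the element counts as undescribed."""
    sentence_vectors = [embed(s) for s in sentences]
    result = {}
    for element in elements:
        element_vector = embed(element)
        scores = [cosine(element_vector, sv) for sv in sentence_vectors]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        described = best is not None and scores[best] >= threshold
        result[element] = best if described else None
    return result


# Toy slide elements and transcript sentences, invented for this example.
elements = ["Slide boundary detection", "Funding: NSF grant"]
sentences = [
    "First we run slide boundary detection on the video.",
    "Then we extract the text with OCR.",
]

alignment = align(elements, sentences)
undescribed = [el for el, idx in alignment.items() if idx is None]
print(alignment)
print(undescribed)
```

In this toy input the first element shares its words with the first sentence and is aligned to it, while the funding line matches nothing in the transcript and ends up in the undescribed list. A real sentence-embedding model would also capture paraphrases that share no words, which a bag-of-words stand-in cannot.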

Key findings

A technical evaluation with 20 in-the-wild presentation videos (88.8 minutes, 158 unique slides, 574 elements) showed that presenters neglected to mention 1 in 5 text elements and 85% of images. Slidecho provides access to an additional 20% of total text elements and 30% of total image elements not described by speakers. The pipeline achieved strong performance: slide boundary detection F1-score of 97.2%, OCR character error rate of only 1.6%, image segmentation F1-score of 91%, slide element grouping F1-score of 87.8%, and element-to-speech alignment F1-score of 84.3%. A user study with 10 blind and visually impaired participants (8 blind, 2 low vision; using JAWS, NVDA, VoiceOver, and braille displays) compared Slidecho's synchronized interface against a non-synchronized side-by-side approach. With the synchronized interface, participants read significantly fewer redundant slide elements (mean 3.90 vs 8.50), spent less total time (5.46 vs 7.30 minutes), and rated their ability to identify undescribed elements significantly higher (6.60/7 vs 5.10/7). Surprisingly, participants also rated the more complex synchronized interface as significantly less mentally demanding (3.00/7 vs 5.30/7) because the audio notifications eliminated the need to manually search for missing content. Eight of ten participants preferred the synchronized interface, and participants rated it significantly higher for improving accessibility (6.90/7 vs 6.00/7).
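For readers unfamiliar with the two metrics reported above, the following sketch shows their standard textbook definitions. It does not reproduce the paper's evaluation procedure (for example, how a detected slide boundary is matched to a ground-truth boundary), and the function names and sample values are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Invented example: OCR misreads the letter "l" as the digit "1".
print(character_error_rate("Slidecho", "S1idecho"))
# Invented counts: 8 correct detections, 2 spurious, 2 missed.
print(f1_score(8, 2, 2))
```

A character error rate of 1.6% therefore corresponds to roughly 16 character edits per 1,000 reference characters.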

Relevance

Slidecho addresses a pervasive accessibility gap in online education and professional development: the inaccessibility of presentation videos that rely heavily on visual slide content. As video-based learning continues to grow across education, conference talks, and professional training, tools that can automatically bridge the gap between visual slide content and audio narration become increasingly critical. For accessibility practitioners, the system demonstrates a practical approach to augmenting existing content rather than requiring presenters to change their behaviour. The finding that speakers fail to describe 85% of images and 20% of text on their slides quantifies the scale of the problem. The concept of identifying "undescribed elements" (content present on slides but absent from speech) offers a useful framework for prioritizing what additional description is most needed. The research also has implications for conference and educational platform accessibility: organizations could run Slidecho on their video archives to provide synchronized, navigable slide content alongside recordings, significantly improving access without requiring manual audio description authoring for each video.

Tags: video accessibility · blind and low vision · audio description · presentations · screen reader · OCR · computer vision · educational technology · multimedia accessibility

Standards referenced: WCAG 1.2.7