The Potential of a Visual Dialogue Agent In a Tandem Automated Audio Description System for Videos

Abigale Stangl, Shasta Ihorn, Yue-Ting Siu, Aditya Bodi, Mar Castanon, Lothar D Narins, Ilmi Yoon · 2023 · Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2023) · doi:10.1145/3597638.3608402

Summary

This paper presents and evaluates a tandem AI-based audio description (AD) system for videos that combines two complementary tools: NarrationBot, which delivers automated minimum viable descriptions (MVD) of video content, and InfoBot, a visual dialogue agent that allows users to request on-demand descriptions and ask visual questions. The research addresses a critical accessibility gap: the explosive growth of video content online vastly outpaces the availability of human-authored audio descriptions, leaving blind and low vision (BLV) individuals excluded from significant amounts of digital media. Previous automated AD systems have offered either baseline descriptions or on-demand question answering, but not both.

The authors conducted a mixed-methods user study with 26 BLV participants who watched six five-minute animal rescue videos under six different conditions: no support, InfoBot only, AI-NarrationBot only, human-revised NarrationBot only, AI-NarrationBot+InfoBot, and human-revised NarrationBot+InfoBot. Participants rated their comprehension and enjoyment on a 6-point Likert scale after each video and provided qualitative feedback. A detailed multimodal case study analysis of one video examined the timing, accuracy, and content of the system outputs, including when and how participants activated InfoBot to supplement NarrationBot descriptions.

NarrationBot used scene segmentation, object detection, keyframe selection, image captioning, text summarization, OCR, and text-to-speech to generate baseline AD automatically. InfoBot used a Visual Dialog model trained on 120K question-answer dialogues to provide interactive responses grounded in keyframe images and dialogue history.
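The six viewing conditions listed above can be read as a crossing of two factors: the narration source (none, AI, or human-revised) and whether InfoBot was available. The sketch below enumerates them under that reading. The factor names and the `label` helper are illustrative assumptions, not terminology or code from the paper.

```python
from itertools import product

# Assumed factor levels, inferred from the six condition names in the summary.
NARRATION = ["none", "AI-NarrationBot", "human-revised NarrationBot"]
INFOBOT = [False, True]


def label(narration: str, infobot: bool) -> str:
    """Build the condition name used in the summary from the two factors."""
    if narration == "none":
        return "InfoBot only" if infobot else "no support"
    return f"{narration}+InfoBot" if infobot else f"{narration} only"


# Cross the two factors: 3 narration levels x 2 InfoBot levels = 6 conditions.
conditions = [label(n, i) for n, i in product(NARRATION, INFOBOT)]

for name in conditions:
    print(name)
```

Laying the conditions out this way makes the key comparison explicit: the two tandem conditions differ only in the narration factor, with InfoBot held on.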

Key findings

When used in isolation, AI-only tools scored significantly lower than human-revised descriptions for both enjoyment and comprehension. However, the fully AI-based tandem system (AI-NarrationBot+InfoBot) matched the performance of the human-revised tandem system — participants reported no significant differences in enjoyment or comprehension between the two tandem conditions. This is the study's most striking finding: combining two imperfect AI tools produced outcomes comparable to having a human in the loop. Quantitatively, comprehension scores ranged from 2.27 (no support) to 4.58 (human-revised NarrationBot only), with the AI tandem system scoring 4.15 and the human-revised tandem scoring 4.54. InfoBot alone did not improve comprehension over no support (p = 0.363), but when paired with NarrationBot it made a significant contribution. InfoBot had a 48.2% accuracy rate on the 625 questions asked, yet still added value in the tandem configuration. The case study revealed that AI-NarrationBot missed 21 of the scene changes identified by human reviewers, and only 19 of 39 generated descriptions were fully accurate. Participants used InfoBot strategically — clustering their requests around gaps in NarrationBot coverage, inaccurate descriptions, or moments when they wanted more detail about a subject.
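The figures above can be put side by side with a little arithmetic. The sketch below uses only the numbers reported in this summary. The count of correct InfoBot answers is derived from the reported percentage rather than stated in the source, and the score differences are raw gaps, not statistical tests.

```python
# Figures taken from the summary above.
questions_asked = 625
infobot_accuracy = 0.482  # 48.2% of questions answered accurately

# Derived, not reported: approximate number of accurate InfoBot answers.
approx_correct_answers = round(questions_asked * infobot_accuracy)

# Case study: share of generated descriptions judged fully accurate.
descriptions_total = 39
descriptions_accurate = 19
accurate_share_pct = round(100 * descriptions_accurate / descriptions_total, 1)

# Mean comprehension ratings on the 6-point scale.
comprehension = {
    "no support": 2.27,
    "AI tandem": 4.15,
    "human-revised tandem": 4.54,
    "human-revised NarrationBot only": 4.58,
}

# Raw differences only; the study reports no significant difference
# between the two tandem conditions.
tandem_gap = round(comprehension["human-revised tandem"] - comprehension["AI tandem"], 2)
gain_over_no_support = round(comprehension["AI tandem"] - comprehension["no support"], 2)

print(approx_correct_answers)  # about 301 accurate answers
print(accurate_share_pct)      # 48.7 (percent)
print(tandem_gap)              # 0.39
print(gain_over_no_support)    # 1.88
```

Seen this way, both AI components were accurate a little under half the time, yet the AI tandem's comprehension rating sat well above no support and within 0.39 points of the human-revised tandem.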

Relevance

This research has important implications for scaling video accessibility. With over 500 hours of video uploaded to YouTube every minute, human-authored AD cannot keep pace. The finding that a tandem AI system can match human-in-the-loop quality suggests a viable path toward automated AD at scale. For practitioners, the study reinforces that no single AI tool is sufficient — combining passive baseline descriptions with active user-driven querying creates a mutually compensatory system. The concept of minimum viable description (MVD) offers a practical framework for prioritizing what information to deliver automatically versus what to make available on demand. The study also highlights ongoing challenges: AI systems still struggle with accurate content recognition, contextual information delivery, and appropriate timing of descriptions. Organizations implementing automated AD should consider tandem approaches rather than relying on a single tool, and should build in transparency about the AI-generated nature of descriptions.

Tags: audio description · blind and low vision · visual question answering · visual dialogue · AI · video accessibility · minimum viable description · human-in-the-loop

Standards referenced: WCAG 2.1