DescribePro: Collaborative Audio Description with Human-AI Interaction

Maryam S Cheema, Sina Elahimanesh, Samuel Martin, Pooyan Fazli, Hasti Seifi · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746320

Summary

This paper presents DescribePro, a web-based platform that combines human expertise with AI capabilities to create and refine audio descriptions (AD) for video content. The system addresses a fundamental tension in AD production: human-crafted descriptions are high quality but time-consuming to produce, while AI-generated descriptions are fast to produce but lack nuance and stylistic sensitivity. DescribePro integrates four key components: an AD Timing module that automatically identifies optimal insertion points by detecting silence, non-speech segments, and scene changes; a Description Generation module that uses GPT-4o with 42 curated AD guidelines to produce initial descriptions; an AI Prompting and Editing interface that lets describers refine descriptions through natural language prompts or manual editing; and a Forking interface that enables community collaboration by allowing describers to create alternative variations of existing descriptions. Each variation is tagged with customizable labels describing its style and focus, so that blind and low vision (BLV) users can choose descriptions that match their preferences.

The system was evaluated with 18 participants: 9 professional describers averaging 12 years of experience and 9 novices with training but no professional experience. Participants completed three AD creation tasks across short entertainment and instructional videos. The study used a mixed-methods approach combining quantitative metrics (SUS usability scores, prompt analysis, edit distances, similarity measures) with qualitative semi-structured interviews.
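To make the AD Timing idea concrete, here is a minimal sketch of one part of it: finding stretches without speech that are long enough to hold a description. It is an illustration under stated assumptions, not the authors' implementation. The function name `find_description_gaps`, the `(start, end)` interval format, and the 1.5-second `min_gap` threshold are all hypothetical, and the sketch ignores the scene-change and non-speech signals that the summary says the real module also uses.

```python
from typing import List, Tuple

# A speech segment as (start_seconds, end_seconds). Hypothetical format;
# in practice these would come from some speech or silence detector.
Interval = Tuple[float, float]


def find_description_gaps(
    speech: List[Interval],
    video_length: float,
    min_gap: float = 1.5,  # assumed threshold, not taken from the paper
) -> List[Interval]:
    """Return the speech-free intervals that last at least min_gap seconds."""
    gaps: List[Interval] = []
    cursor = 0.0  # end of the latest speech seen so far
    for start, end in sorted(speech):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        # max() keeps the cursor correct when segments overlap or nest
        cursor = max(cursor, end)
    # Trailing gap between the last speech segment and the end of the video
    if video_length - cursor >= min_gap:
        gaps.append((cursor, video_length))
    return gaps


if __name__ == "__main__":
    segments = [(2.0, 5.0), (5.5, 9.0), (12.0, 14.0)]
    for gap_start, gap_end in find_description_gaps(segments, video_length=20.0):
        print(f"candidate slot: {gap_start:.1f}s to {gap_end:.1f}s")
```

A gap list like this would only give candidate slots; a describer or a later stage would still decide which slots actually need a description and how much text fits in each.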

Key findings

The system received an average SUS usability score of 72.6, above the benchmark of 68 for web-based interfaces. Forking was rated the most useful feature overall, while AI-generated baseline descriptions and the prompting interface were valued differently by the two groups. Novices rated AI-generated descriptions as more useful than professionals did, finding them helpful for overcoming the initial challenge of starting from scratch. Professionals used AI prompting 82 times compared to 65 for novices, and achieved a higher acceptance rate of AI revisions (72.8% vs 60.1%), suggesting greater effectiveness in translating editing intentions into AI prompts. Professional describers focused on language refinement (replacing passive voice, removing emotional language, creating concise descriptions), while novices concentrated on adding detail and context, often out of worry about missing visual elements. Participants made greater modifications to human-written variations than to AI-generated ones, with significantly higher Levenshtein distances. Both groups shortened descriptions overall. Professionals viewed AI as a research assistant that could handle tedious tasks like transcribing on-screen text, while novices saw it more as a collaborative agent that could offer suggestions and corrective feedback. Concerns emerged about AI hallucinations, potential flattening of descriptive styles, and the risk of ideational convergence, where AI homogenizes creative output.
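The Levenshtein distance mentioned above counts the minimum number of single-unit insertions, deletions, and substitutions needed to turn one text into another. The sketch below computes it at the character level and adds a length-normalized variant so that long and short descriptions can be compared. Both choices are assumptions: this summary does not say whether the study measured characters or words, or how it normalized. The function names and the example sentences are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, in the range 0.0 to 1.0."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # Made-up example texts, not data from the study
    baseline = "A woman is seen walking slowly into the kitchen."
    revised = "A woman walks into the kitchen."
    print(levenshtein(baseline, revised))
    print(round(normalized_edit_distance(baseline, revised), 2))
```

A higher normalized value means the describer changed a larger share of the starting text, which is the sense in which edits to human-written variations were larger than edits to AI-generated ones.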

Relevance

DescribePro demonstrates a practical model for scaling audio description production while maintaining quality, a critical need given that most online video content lacks AD. The collaborative forking mechanism is particularly innovative: it enables multiple description variations for the same video that can serve different BLV user preferences, moving beyond the traditional binary of "has AD" or "doesn't have AD." For accessibility practitioners, the system's 42 curated AD guidelines (included in the appendix) provide a comprehensive reference for both human and AI-assisted description writing. The finding that professionals and novices interact with AI tools differently has implications for designing adaptive interfaces in accessibility workflows. The paper also raises important questions about AI's role in creative accessibility work, particularly the tension between efficiency gains and preserving the artistic, subjective qualities that make descriptions engaging for BLV audiences.

Tags: audio description · video accessibility · human-AI collaboration · authoring tools · blind and low vision · large language models · personalization · collaborative editing · multimodal AI

Standards referenced: WCAG 2.1