ADCanvas: Accessible and Conversational Audio Description Authoring for Blind and Low Vision Creators

Franklin Mingzhe Li, Michael Xieyang Liu, Cynthia L. Bennett, Shaun K. Kane · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791158

Summary

Li and colleagues tackle a rarely examined corner of accessibility: the tools used to produce Audio Description (AD) are themselves largely inaccessible to the blind and low-vision (BLV) creators who are often its most skilled practitioners. Professional AD authoring happens in graphical waveform and timeline interfaces (Ooona, Subtitle Edit Pro, non-linear video editors) that depend on visual metaphors such as playheads, scrubbing, draggable bars, and waveforms. Screen readers cannot render these metaphors meaningfully, which forces BLV AD writers to depend on sighted collaborators. The paper introduces ADCanvas, a screen-reader-first AD authoring tool built as a web application on top of Google's Gemini 2.5 multimodal LLM. ADCanvas combines three surfaces: a plain-text WebVTT script editor, a bank of keyboard-first media controls with audio-cued timestamp feedback, and a context-aware conversational agent that supports visual question answering, AD-gap identification, full-script generation, line-specific TTS preview, and local or global script edits. The authors position ADCanvas as a technology probe rather than a production replacement for industry DAWs. They evaluate it through a qualitative user study with 12 BLV participants, all with more than five years of AD experience and all daily users of JAWS, NVDA, or VoiceOver, across three scaffolded tasks: quality-control (QC) refinement of an AI-generated script, authoring from AI-identified gaps, and free-form authoring from scratch.
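
To make the WebVTT script format and the idea of an AD gap concrete, here is a minimal sketch in Python. It is not the authors' method: the summary says ADCanvas delegates gap identification to the Gemini-based agent, whereas this example uses a timestamp-only heuristic. It assumes the video's dialogue timings are available as WebVTT cues, and the names parse_cues, find_gaps, and the min_gap threshold are hypothetical.

```python
import re
from typing import List, Tuple

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:04.500".
# The hours field is optional in WebVTT, so "00:01.000" is also accepted.
TIMING = re.compile(
    r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
)

Cue = Tuple[float, float, str]  # (start seconds, end seconds, cue text)


def to_seconds(h, m, s, ms) -> float:
    """Convert the captured timestamp fields to seconds."""
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cues(vtt: str) -> List[Cue]:
    """Parse cue timings and text from a WebVTT document (hypothetical helper)."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                fields = match.groups()
                text = " ".join(lines[i + 1:]).strip()
                cues.append((to_seconds(*fields[:4]), to_seconds(*fields[4:]), text))
                break  # blocks without a timing line (e.g. the header) are skipped
    return cues


def find_gaps(cues: List[Cue], video_end: float, min_gap: float = 2.0):
    """Return (start, end) intervals with no dialogue lasting at least min_gap seconds.

    Assumption: a dialogue-free interval is a candidate slot for a description.
    """
    gaps = []
    cursor = 0.0
    for start, end, _ in sorted(cues):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if video_end - cursor >= min_gap:
        gaps.append((cursor, video_end))
    return gaps


DIALOGUE = """WEBVTT

00:00:03.000 --> 00:00:05.000
Hello there.

00:00:05.500 --> 00:00:08.000
How are you?

00:00:15.000 --> 00:00:17.250
Fine.
"""

cues = parse_cues(DIALOGUE)
gaps = find_gaps(cues, video_end=20.0)
```

Because the script is plain text with explicit timestamps, a screen reader can read every cue line by line, which is the property the editor surface relies on in place of a visual timeline.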

Key findings

Participants overwhelmingly endorsed the tool: 10/12 rated ADCanvas 7/7 for usefulness and 11/12 rated it 7/7 for likelihood of future use. The conversational agent served as a three-layered aide: informational conduit (summaries, object listings, property queries), structural drafter (gap identification, timestamped script generation), and revision partner (local rephrasing, global find-and-replace across script entries). Creators did not treat the model as an oracle. They adopted a 'trust but verify' stance, using AI output as a starting point and then checking it against audio cues, on-screen text, and professional AD guidelines (use present tense, be objective, describe actions over settings). Participants organised their queries into distinct categories: visual object properties (color, texture, text), events and timing (character action, pace), and interpretation (emotion, causality). They often asked the agent to describe the same scene multiple times to triangulate detail. Of 202 coded prompts, 13.4% drew incongruent agent responses (unsolicited edits were the most common frustration) and 5.9% involved visual question answering (VQA) factual errors. Three key tensions emerged: agent proactivity vs. user agency, simplified UI vs. professional-grade millisecond timing control, and objective AD tone vs. AI-injected interpretation.
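
The global find-and-replace role described above can be sketched as follows. This is an illustration under stated assumptions, not ADCanvas's implementation: the Cue class and the global_replace function are hypothetical names. The sketch rewrites cue text only, leaves timestamps untouched, and reports which entries changed so that a creator can verify each one in the 'trust but verify' spirit.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Cue:
    """One script entry; timestamps are kept as WebVTT strings (hypothetical type)."""
    start: str
    end: str
    text: str


def global_replace(cues: List[Cue], old: str, new: str) -> Tuple[List[Cue], List[int]]:
    """Replace whole-word, case-insensitive matches of `old` in every cue's text.

    Returns the updated cues plus the indexes of the cues that changed,
    so each edit can be reviewed before it is trusted.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b", flags=re.IGNORECASE)
    updated, changed = [], []
    for index, cue in enumerate(cues):
        # A function replacement keeps `new` literal (no backslash escapes).
        new_text = pattern.sub(lambda _match: new, cue.text)
        if new_text != cue.text:
            changed.append(index)
        updated.append(replace(cue, text=new_text))
    return updated, changed


script = [
    Cue("00:00:01.000", "00:00:03.000", "The man opens a door."),
    Cue("00:00:08.000", "00:00:10.000", "A woman waves."),
    Cue("00:00:17.500", "00:00:19.500", "The man sits."),
]

updated, changed = global_replace(script, "the man", "Jordan")
```

Returning the changed indexes rather than silently rewriting the script is a design choice that matches the reported frustration with unsolicited edits.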

Relevance

For practitioners, this is a concrete demonstration that multimodal LLMs can dissolve the long-standing dependency on sighted collaborators in accessible media authoring — a category of tool where BLV domain expertise has historically been blocked by interface design, not by skill. The paper's design implications translate directly to accessible creative tools beyond AD: separate conversation from editing (Ask-Only vs. Review-Suggestions modes), surface a suggestion drawer rather than auto-editing user text, offer configurable agent rules ('my voice, my style, my priorities'), provide toggleable complexity layers (basic / intermediate / professional), and support audio-cued timestamp feedback and hotkey shortcuts in place of visual timelines. The study also surfaces ethical concerns worth attending to in any AI-authoring product for disabled users: job-displacement anxieties among professional AD writers, the risk of erasing collaborative voices in industry AD production, and the responsibility to centre BLV professional judgment rather than replace it. Limitations include a 120-minute session per participant, English-only videos without overlapping dialogue, and no systematic model accuracy benchmarking.
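
As one possible reading of the 'separate conversation from editing' and 'suggestion drawer' implications, the sketch below models a session in which the agent can never write to the script directly. It is a hypothetical data model, not an interface from the paper: the names Mode, Suggestion, Session, agent_proposes, accept, and reject are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Mode(Enum):
    ASK_ONLY = "ask-only"                      # agent answers questions, proposes nothing
    REVIEW_SUGGESTIONS = "review-suggestions"  # agent may queue edits for review


@dataclass
class Suggestion:
    line_index: int      # which script line the agent wants to change
    proposed_text: str   # the replacement text awaiting a decision


@dataclass
class Session:
    script: List[str]
    mode: Mode = Mode.ASK_ONLY
    drawer: List[Suggestion] = field(default_factory=list)

    def agent_proposes(self, line_index: int, proposed_text: str) -> bool:
        """Queue an agent edit in the drawer; return False if the mode forbids edits."""
        if self.mode is Mode.ASK_ONLY:
            return False
        self.drawer.append(Suggestion(line_index, proposed_text))
        return True

    def accept(self, drawer_index: int) -> None:
        """Apply one suggestion to the script; only the creator calls this."""
        suggestion = self.drawer.pop(drawer_index)
        self.script[suggestion.line_index] = suggestion.proposed_text

    def reject(self, drawer_index: int) -> None:
        """Discard one suggestion and leave the script untouched."""
        self.drawer.pop(drawer_index)


session = Session(script=["A man walks in."])
blocked = session.agent_proposes(0, "A tall man strides in.")  # ask-only: ignored

session.mode = Mode.REVIEW_SUGGESTIONS
queued = session.agent_proposes(0, "A tall man strides in.")
text_before_accept = session.script[0]  # still the creator's wording
session.accept(0)
```

The script changes only inside accept, so agent proactivity is confined to the drawer and the creator keeps final say over every line.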

Tags: audio description · blind and low vision · conversational agent · multimodal LLM · visual question answering · human-AI collaboration · accessible authoring · screen reader · content creation · technology probe

Standards referenced: WebVTT · Section 508