TaskAudit: Detecting Functiona11ity Errors in Mobile Apps via Agentic Task Execution

Mingyuan Zhong, Xia Chen, Davin Win Kyi, Chen Li, James Fogarty, Jacob O. Wobbrock · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791415

Summary

This CHI 2026 paper introduces TaskAudit, an automated accessibility evaluation system for mobile apps that detects what the authors term 'functiona11ity errors': accessibility barriers that only manifest through interaction, where a UI's static state looks accessible but its dynamic behaviour fails to meet user expectations or WCAG criteria. The authors argue that existing automated checkers (Accessibility Scanner, ScreenAudit, Groundhog, AXNav) cover only a fraction of real-world accessibility problems because they rely on static snapshots, mechanical crawling, or heuristic rules, and so miss errors that depend on semantic understanding of interaction outcomes.

TaskAudit comprises three components. The Task Generator takes a screenshot of an app screen and, using PaddleOCR plus OmniParser for visual UI parsing and GPT-4o for captioning, produces a task specification for each interactive element, consisting of a description, prerequisites, a target element, and a success criterion. The Task Executor runs these tasks through a multi-agent loop (Decision agent, Reflection agent, and shared Memory) on an Android emulator, acting via a modified TalkBack 'screen reader proxy' that captures transcripts and accepts swipe, double-tap, back, and type gestures. The agents deliberately work without vision, to mirror a screen reader user's experience. The Accessibility Analyzer ingests the execution trace and uses a two-stage GPT-4o prompting pipeline (chain-of-thought root-cause analysis) to decide whether any step failed due to an accessibility error rather than agent error or app state change.

The authors define five functiona11ity error categories, each mapped to WCAG: Locatability (cannot focus), Actionability (can focus but cannot activate), Label (semantic mismatch between label and function), Feedback (missing or uninformative announcement after action), and Navigation (structural barriers to efficient traversal).

Three performance experiments answer RQ1 (task identification), RQ2 (execution success on accessible screens), and RQ3 (functiona11ity error detection).
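The task specification and the five error categories described in this summary can be pictured as simple data types. The sketch below is illustrative only: the class names, field names, and the sample task are assumptions made here, not the authors' actual schema or code.

```python
from dataclasses import dataclass, field
from enum import Enum


class ErrorCategory(Enum):
    """The five functiona11ity error categories named in the summary."""
    LOCATABILITY = "cannot focus the element"
    ACTIONABILITY = "can focus but cannot activate"
    LABEL = "semantic mismatch between label and function"
    FEEDBACK = "missing or uninformative announcement after action"
    NAVIGATION = "structural barrier to efficient traversal"


@dataclass
class TaskSpec:
    """Hypothetical container for one generated task (field names assumed)."""
    description: str
    target_element: str
    success_criterion: str
    prerequisites: list[str] = field(default_factory=list)


# Illustrative task, loosely based on the 'search' tab labelled 'explore'
# pattern mentioned in the key findings; the wording is invented here.
task = TaskSpec(
    description="Open the search tab",
    target_element="Tab announced as 'explore'",
    success_criterion="A search field is announced after activation",
)
suspected = ErrorCategory.LABEL
print(task.description, "->", suspected.name)
```

Keeping the success criterion as an explicit field reflects the summary's point that errors are judged against the expected outcome of an interaction, not against a static snapshot.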

Key findings

Task Generator coverage: detected 911 of 1,226 tappable UI elements (74.3%) from the RICO dataset, with 93.3% of text inputs captured; 93.4% of generated captions matched crowdsourced captions.

Task Executor reliability: on 299 tasks across 36 accessible screens from 26 known-accessible apps (Android in the Wild, Google-developed only), agents succeeded on 287 tasks (96.0%). The remaining 4.0% of tasks failed: 2.3% because of incomplete task descriptions and 1.7% because of content refreshing.

Functiona11ity error detection on 54 screens from 14 real-world apps (a 78-error ground-truth set): TaskAudit correctly identified 48 errors (recall 0.615), with precision 0.676 and F1 0.644. By comparison, Accessibility Scanner detected only 4 (recall 0.052), ScreenAudit detected 10 (recall 0.130), and Groundhog detected 20 (recall 0.256) with much lower precision (0.142, F1 0.183). TaskAudit uniquely covered the Label, Feedback, and Navigation categories, which fall entirely outside the scope of Groundhog and most prior tools. Of TaskAudit's 23 false positives, 12 stemmed from visual UI parsing errors (OCR, misinterpreted elements), 7 from agent execution issues, and 4 from misreading non-clickable elements such as ads. Of its 30 false negatives, 21 were Task Generator failures to produce a relevant task.

Cost and time: TaskAudit averaged 1,129 seconds per screen (vs Groundhog's 766 s), consuming ~280k input and ~16k output tokens per screen, or roughly US$0.61 per screen at August 2025 GPT-4o pricing.

Qualitative analysis: the authors illustrate eight concrete error patterns, including label-functionality mismatch (a 'search' tab labelled 'explore'), cluttered navigation (60 focusable airplane background elements trapping focus), and inappropriate feedback (a dropdown menu that changes visually but produces no audio announcement).
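The TaskAudit detection metrics above can be reproduced from the reported counts: 48 true positives, 23 false positives, and 30 false negatives (78 ground-truth errors minus 48 detected). The short check below uses the standard definitions of precision, recall, and F1; the function name is chosen here for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported for TaskAudit in the key findings.
p, r, f1 = precision_recall_f1(tp=48, fp=23, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.676 recall=0.615 f1=0.644
```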

Relevance

For accessibility engineers, QA teams, and tool builders, TaskAudit goes beyond the static 'missing label' class of checks to catch the interaction-dependent mobile errors that blind screen reader users encounter every day. It raises the ceiling for what automated testing can plausibly find and sets a new reference baseline against Accessibility Scanner, Groundhog, AXNav, and ScreenAudit. Practical takeaways: teams running CI accessibility gates should treat TaskAudit-style agentic audits as complementary to rule-based linters and crawlers, not as replacements. The authors explicitly recommend a hybrid pipeline in which cheap mechanical crawlers pre-screen large app surfaces and TaskAudit is deployed as a targeted check on high-impact flows. Cost (~$0.61 per screen) and runtime (~19 minutes per screen) mean it is not yet suited to exhaustive regression sweeps, but it is viable for release-gate review of critical journeys.

Limitations matter: the system cannot perform touch exploration (common among screen reader users), depends heavily on visual UI parsing quality (the source of 12 of its 23 false positives), and was evaluated only on Android/TalkBack, so iOS/VoiceOver behaviour may differ. The authors also do not construct a complete functiona11ity error taxonomy, so the five categories should be treated as an initial, extensible framework rather than a canonical one. Read this paper before scoping any internal mobile-accessibility test automation project.
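To get a feel for what the reported per-screen averages imply when scoping a release-gate audit, a rough budget can be estimated as below. This is a back-of-the-envelope sketch that assumes cost and time scale linearly with the number of screens and that screens are audited one after another; neither assumption comes from the paper, and the function name and the 20-screen example are hypothetical.

```python
SECONDS_PER_SCREEN = 1129  # reported TaskAudit average
USD_PER_SCREEN = 0.61      # reported estimate at August 2025 GPT-4o pricing


def audit_budget(num_screens: int) -> tuple[float, float]:
    """Return (hours, usd) for a sequential audit, assuming linear scaling."""
    hours = num_screens * SECONDS_PER_SCREEN / 3600
    usd = num_screens * USD_PER_SCREEN
    return hours, usd


# Hypothetical example: a critical journey spanning 20 screens.
hours, usd = audit_budget(20)
print(f"{hours:.1f} h, ${usd:.2f}")
# 6.3 h, $12.20
```

Under these assumptions the dollar cost is modest, while wall-clock time is the tighter constraint, which is consistent with reserving the tool for critical journeys.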

Tags: mobile accessibility · accessibility auditing · automated accessibility testing · generative agents · large language models · screen readers · TalkBack · WCAG · Android · agentic AI

Standards referenced: WCAG 2.2 · WCAG-EM 1.0 · EN 301 549