LLMs for Accessibility in Mobile Apps: Detection and Repair
Wajdi Aljedaani, Ahmed Aljohani, Marcelo M. Eler, Abdulrahman Habib, Hyunsook Do · 2025 · Proceedings of the 22nd International Web for All Conference (W4A 2025) · doi:10.1145/3744257.3744270
Summary
This study evaluates how well three large language models (GPT-4o, Gemini 1.0 Pro, and Llama 3) detect, classify, and remediate accessibility violations in Android mobile applications. Prior LLM accessibility research has focused primarily on web applications; this work addresses the underexplored domain of mobile app accessibility. The researchers selected 108 open-source Android apps from the F-Droid repository across categories including education, weather, social networking, finance, and entertainment. Using Google's Accessibility Scanner, they identified accessibility violations in XML layout files; two expert evaluators then manually validated these violations (inter-rater agreement kappa = 0.84), yielding 404 confirmed violations. The five categories reported are color contrast (123), image alt text (75), hardcoded text (54), spacing issues (41), and touch targets (43); these counts sum to 336, the denominator used for the detection rates under Key findings. Each LLM was prompted with a hybrid strategy combining Few-Shot Prompting with Chain-of-Thought reasoning, using low temperature (0.1) and top-p (0.75) settings to keep output as deterministic as possible. The LLMs were asked to detect violations from XML snippets, identify the affected UI element, explain the issue, and propose code fixes in structured JSON format.
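The prompting setup described above can be sketched roughly as follows. This is a minimal illustration and not the authors' prompt: the few-shot example, the JSON field names (`violation_type`, `element`, `explanation`, `fix`), the "Findings:" reply marker, and the provider-neutral request dictionary are all assumptions made for the sketch. Only the temperature (0.1) and top-p (0.75) values come from the summary.

```python
import json

# Hypothetical few-shot example; the paper's actual prompt text is not reproduced here.
FEW_SHOT_EXAMPLES = [
    {
        "xml": '<ImageView android:id="@+id/logo" android:layout_width="48dp" '
               'android:layout_height="48dp" />',
        "reasoning": "The ImageView has no android:contentDescription, "
                     "so a screen reader cannot announce it.",
        "answer": {
            "violation_type": "image alt text",
            "element": "ImageView @+id/logo",
            "explanation": "Missing content description.",
            "fix": 'android:contentDescription="@string/logo_description"',
        },
    },
]

# Assumed JSON schema for one finding (field names are illustrative).
REQUIRED_KEYS = {"violation_type", "element", "explanation", "fix"}


def build_request(xml_snippet: str) -> dict:
    """Assemble few-shot examples, a chain-of-thought cue, and sampling settings."""
    parts = ["You are an Android accessibility reviewer."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Layout:\n{ex['xml']}")
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Findings: {json.dumps([ex['answer']])}")
    parts.append(f"Layout:\n{xml_snippet}")
    parts.append("Reasoning: think step by step, then end with "
                 "'Findings:' followed by a JSON list.")
    # Sampling values taken from the summary; the dict layout is not tied to any SDK.
    return {"prompt": "\n\n".join(parts), "temperature": 0.1, "top_p": 0.75}


def parse_findings(raw_reply: str) -> list:
    """Extract the JSON list after the last 'Findings:' marker; drop malformed items."""
    _, marker, tail = raw_reply.rpartition("Findings:")
    if not marker:
        return []
    try:
        data = json.loads(tail.strip())
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [f for f in data if isinstance(f, dict) and REQUIRED_KEYS.issubset(f)]
```

Validating the reply before using it matters here because, as the findings below report, models do not always return well-formed or complete output.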
Key findings
Detection rates varied considerably but were low overall: Llama detected the most violations (126 of 336, 38%), followed by GPT (100, 30%) and Gemini (89, 26%), so no model found more than 38% of the issues flagged by the automated Accessibility Scanner. All models struggled badly to pinpoint exact XML line locations: GPT identified only 8 correct locations, Gemini 5, and Llama none. For remediation, GPT demonstrated the best overall performance: it correctly described fixes for 68% of detected issues, generated correct fixing code for 55%, and produced syntactically valid code for 38%. Gemini followed with 62% correct fix descriptions, 47% correct code, and 34% compilable fixes. Llama, despite detecting the most violations, produced no syntactically correct code fixes. Performance varied by violation type: GPT achieved a 100% fix rate for missing alt text; Gemini excelled at hardcoded text fixes (84.6%) and spacing issues (89%); Llama led in color contrast detection (46.3%) but had the worst fix rates. No model exceeded 50% success in fixing color contrast violations. The models frequently provided general recommendations rather than actionable, standards-specific guidance, and sometimes merged distinct violations into single responses.
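Color contrast was the hardest category to fix, partly because a correct fix has to satisfy a numeric threshold rather than just add an attribute. As a hedged illustration, the sketch below computes the WCAG 2.1 contrast ratio from relative luminance and checks it against the Level AA thresholds (4.5:1 for normal text, 3:1 for large text). It assumes opaque six-digit `#RRGGBB` colors; Android's `#AARRGGBB` form and theme-resolved colors are out of scope. It is not the checker used in the study.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per the WCAG 2.1 definition."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of an opaque '#RRGGBB' color (assumption: no alpha channel)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), from 1.0 to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """Check the WCAG 2.1 Level AA minimum contrast for text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

For example, `passes_aa("#767676", "#FFFFFF")` holds while `passes_aa("#777777", "#FFFFFF")` does not, which shows how narrow the margin can be: a proposed color that looks plausible may still fail the threshold.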
Relevance
This research provides the first comprehensive evaluation of multiple LLMs for Android mobile accessibility, an area where accessibility tooling lags behind web development. The key practical finding is that LLMs are currently best suited as assistive tools that augment developer capabilities rather than as replacements for human accessibility expertise or dynamic analysis tools like Google's Accessibility Scanner. The models show promise for early-stage static analysis during development—identifying potential issues in XML layout files before runtime—which is faster and cheaper than dynamic testing that requires running the app on a device or emulator. However, the inability to produce consistently compilable fixes means human intervention remains essential. For mobile development teams, the study suggests a workflow where LLMs flag potential issues and describe violations (where they perform well at 84-99% accuracy) while developers handle the actual code remediation. The significant performance differences across violation types indicate that future accessibility-specific fine-tuning of LLMs should prioritize the most challenging categories like color contrast and touch targets.
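To make the idea of early static analysis on XML layout files concrete, here is a small rule-based sketch that flags three of the violation categories directly from a layout file, without running the app. It is neither the study's method nor how Accessibility Scanner works. The rules are simplifying assumptions: images need `android:contentDescription`, literal `android:text` values count as hardcoded, and clickable elements with a literal size under 48dp count as small touch targets (48dp follows Android's touch target guidance). Decorative images and sizes defined through `@dimen` resources would need extra handling.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"


def attr(element, name):
    """Read an android:-namespaced attribute, or None if it is absent."""
    return element.get(f"{{{ANDROID_NS}}}{name}")


def dp_value(raw):
    """Return the number in a literal like '32dp'; None for wrap_content, @dimen/..., etc."""
    if raw and raw.endswith("dp"):
        try:
            return float(raw[:-2])
        except ValueError:
            return None
    return None


def flag_layout(xml_text):
    """Return (element label, category) pairs for heuristic violations in a layout."""
    findings = []
    for el in ET.fromstring(xml_text).iter():
        label = attr(el, "id") or el.tag
        # Heuristic 1: image widgets without a content description.
        if el.tag in ("ImageView", "ImageButton") and not attr(el, "contentDescription"):
            findings.append((label, "image alt text"))
        # Heuristic 2: literal text instead of a resource reference such as @string/...
        text = attr(el, "text")
        if text and not text.startswith("@"):
            findings.append((label, "hardcoded text"))
        # Heuristic 3: clickable widgets with a literal dimension below 48dp.
        if attr(el, "clickable") == "true" or el.tag in ("Button", "ImageButton"):
            sizes = (dp_value(attr(el, "layout_width")), dp_value(attr(el, "layout_height")))
            if any(s is not None and s < 48 for s in sizes):
                findings.append((label, "touch target"))
    return findings


# Hypothetical layout used only to exercise the checks.
SAMPLE = """<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android">
    <ImageButton android:id="@+id/send"
        android:layout_width="32dp" android:layout_height="32dp" />
    <TextView android:id="@+id/title" android:text="Welcome" />
    <TextView android:id="@+id/subtitle" android:text="@string/subtitle" />
</LinearLayout>"""

if __name__ == "__main__":
    for label, category in flag_layout(SAMPLE):
        print(f"{label}: {category}")
```

Deterministic checks like these could complement an LLM in the workflow described above: the rules catch the mechanical cases cheaply, while the model's descriptions help developers understand and fix what was flagged.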
Tags: mobile accessibility · large language models · Android accessibility · automated accessibility testing · accessibility remediation · code generation
Standards referenced: WCAG 2.1 · Android Accessibility Guidelines