When LLM-Generated Code Perpetuates User Interface Accessibility Barriers, How Can We Break the Cycle?

Alexandra-Elena Gurita, Radu-Daniel Vatavu · 2025 · Proceedings of the 22nd International Web for All Conference (W4A 2025) · doi:10.1145/3744257.3744266

Summary

This paper evaluates the ability of large language models (LLMs) to generate accessible web user interfaces, comparing ChatGPT (GPT-4-turbo) and Claude (3.5 Haiku) across two prompting strategies: accessibility-agnostic prompts ("Design the homepage of a banking app") and accessibility-oriented prompts that include explicit WCAG 2.1 requirements such as 200% zoom support, 4.5:1 contrast ratio, semantic structure with ARIA landmarks, labeled form controls, and 44x44px touch targets. The researchers generated 80 UIs total (10 per LLM-prompt combination, plus self-reflected improvements) and evaluated them using a three-pronged approach: automated testing with MAUVE++, expert evaluation by two accessibility specialists using a 5-point severity scale, and LLM self-reflection where the models were asked to identify and fix accessibility issues in their own output. The study builds on prior work by Aljedaani et al. (2024), which found that 84% of ChatGPT-generated websites exhibited accessibility violations in text resizing, contrast, and semantic relationships. The authors situate their work within the broader context of persistent web accessibility stagnation, noting that 97.5% of web homepages had detectable WCAG 2 failures in 2024, creating a self-reinforcing cycle where LLMs trained on inaccessible code reproduce inaccessible patterns.
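The 4.5:1 contrast requirement in the accessibility-oriented prompt can be checked mechanically. The sketch below follows the relative luminance and contrast ratio definitions from WCAG 2.1; it is not the tooling used in the paper, and the function names are illustrative assumptions.

```python
# Minimal sketch of the WCAG 2.1 contrast ratio check (not the paper's tooling).
# Function names are hypothetical; colors are 6-digit hex strings like "#1a2b3c".

def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance as defined by WCAG 2.1."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Ratio of the lighter to the darker color, each offset by 0.05."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa_normal_text(foreground: str, background: str) -> bool:
    """True if the pair reaches the 4.5:1 minimum for normal-size text."""
    return contrast_ratio(foreground, background) >= 4.5
```

Black text on a white background yields the maximum ratio of 21:1, while mid-gray text near #777777 on white falls just short of the 4.5:1 threshold.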

Key findings

Accessibility-oriented prompts reduced the violation rate in expert evaluation from 58% to 19% and lowered average violation severity from 1.53 to 0.30 on a 0-4 scale. The strongest improvements appeared in semantic structure: landmark roles increased from 11% to 94%, heading hierarchy from 33% to 89%, skip links from 0% to 100%, and ARIA labels from 28% to 95%. Interactive element accessibility also improved: keyboard access rose from 48% to 94%, focus indicators from 56% to 98%, and touch targets reached the 44px minimum, up from 32px. Automated testing told a different story, however: the accessibility-oriented prompt showed a slightly higher violation rate than the accessibility-agnostic prompt (17.32% vs. 15.93%), which suggests that automated tools detect different issues than expert evaluators do. Neither LLM demonstrated clear superiority across all metrics. LLM self-reflection achieved 76% alignment with expert evaluations and 94% accuracy in identifying issues; it performed best on contrast and semantic structure problems but underreported keyboard navigation barriers. Regardless of prompt type, problems persisted in language-of-page specification, name/role/value implementation, and text alternatives for non-text content.
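Several of the structural features named above (page language, a main landmark, a skip link, labeled form controls, image alternatives) can be smoke-tested statically. The sketch below is a crude heuristic built on Python's standard html.parser; it is neither MAUVE++ nor the authors' evaluation procedure, and the class name, function name, and rules are assumptions made for illustration. It does not recognize labels that wrap their controls, and it cannot judge keyboard access or focus behavior.

```python
# Hypothetical static smoke check for generated HTML; a heuristic, not an audit.
from html.parser import HTMLParser


class A11ySmokeCheck(HTMLParser):
    """Collects a few structural facts from one HTML document."""

    def __init__(self):
        super().__init__()
        self.has_lang = False
        self.has_main = False
        self.has_skip_link = False
        self.label_targets = set()   # ids referenced by <label for="...">
        self.controls = []           # (id, has_aria_name) per form control
        self.images_missing_alt = 0
        self._seen_anchor = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "html" and a.get("lang"):
            self.has_lang = True
        if tag == "main" or a.get("role") == "main":
            self.has_main = True
        if tag == "a" and not self._seen_anchor:
            # Assumption: a skip link is the first anchor and targets a fragment.
            self._seen_anchor = True
            href = a.get("href") or ""
            self.has_skip_link = href.startswith("#") and len(href) > 1
        if tag == "label" and a.get("for"):
            self.label_targets.add(a["for"])
        if tag in ("input", "select", "textarea"):
            if a.get("type") not in ("hidden", "submit", "button"):
                named = bool(a.get("aria-label") or a.get("aria-labelledby"))
                self.controls.append((a.get("id"), named))
        if tag == "img" and "alt" not in a:
            self.images_missing_alt += 1


def check(html):
    """Return a list of human-readable problems found in the document."""
    parser = A11ySmokeCheck()
    parser.feed(html)
    parser.close()
    problems = []
    if not parser.has_lang:
        problems.append("missing lang attribute on <html>")
    if not parser.has_main:
        problems.append("no main landmark")
    if not parser.has_skip_link:
        problems.append("no skip link as first anchor")
    unlabeled = [cid for cid, named in parser.controls
                 if not named and cid not in parser.label_targets]
    if unlabeled:
        problems.append(f"{len(unlabeled)} form control(s) without a label")
    if parser.images_missing_alt:
        problems.append(f"{parser.images_missing_alt} image(s) without alt")
    return problems
```

A check like this can only flag missing markup. Given the gap the study reports between automated and expert results, it complements manual review rather than replacing it.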

Relevance

This research provides evidence for practitioners who increasingly rely on LLMs for UI code generation. The key takeaway is that explicit accessibility requirements in prompts make a large difference but are not sufficient on their own. LLMs can implement technical accessibility features when instructed, yet they lack a deeper understanding of why those features matter for users and often produce solutions that are technically correct but practically inadequate. The authors propose four actionable recommendations: explicitly include WCAG success criteria in prompts, perform supplementary testing for keyboard navigation and focus management, use LLM self-reflection as an initial accessibility check, and recognize that LLMs and visual design tools have complementary strengths. Because 97.5% of web homepages had detectable WCAG 2 failures in 2024, LLM training data is dominated by inaccessible code, which sustains the self-reinforcing cycle and makes curated accessible training datasets essential for future improvement. Organizations using AI-assisted development should treat LLM output as a starting point that requires human accessibility review, not as a finished product.
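The first recommendation, stating WCAG success criteria explicitly in the prompt, can be sketched as a small prompt builder. The wording below is not the paper's actual prompt. The pairing of each requirement with a WCAG 2.1 success criterion is an assumption made for this example, and the names WCAG_REQUIREMENTS and build_prompt are hypothetical.

```python
# Hypothetical prompt builder; the requirement text paraphrases the summary above,
# and the success-criterion pairing is an assumption, not taken from the paper.

WCAG_REQUIREMENTS = [
    ("1.4.4 Resize Text", "Text must remain usable at 200% zoom."),
    ("1.4.3 Contrast (Minimum)", "Text must have a contrast ratio of at least 4.5:1."),
    ("1.3.1 Info and Relationships", "Use semantic structure with ARIA landmarks."),
    ("3.3.2 Labels or Instructions", "Every form control must have a label."),
    ("2.5.5 Target Size", "Touch targets must be at least 44x44px."),
]


def build_prompt(task, requirements=WCAG_REQUIREMENTS):
    """Append explicit WCAG 2.1 requirements to an accessibility-agnostic task."""
    lines = [
        task.rstrip(".") + ".",
        "",
        "The generated UI must satisfy these WCAG 2.1 requirements:",
    ]
    lines += [f"- {criterion}: {text}" for criterion, text in requirements]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Design the homepage of a banking app"))
```

Keeping the requirements in a list makes it easy to extend the prompt with the criteria the study found persistently weak, such as language of page or text alternatives.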

Tags: large language models · WCAG compliance · automated accessibility · prompt engineering · code generation · AI accessibility

Standards referenced: WCAG 2.1 · ARIA · European Accessibility Act