AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code

Nadeen Fathallah, Daniel Hernández, Steffen Staab · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746360

Summary

This paper introduces AccessGuru, a novel method that combines traditional automated accessibility testing tools with large language models (LLMs) to both detect and correct web accessibility violations in HTML code. The work addresses a persistent gap in accessibility tooling: existing automated tools such as Axe, WAVE, and AChecker can detect syntactic and layout violations (missing alt attributes, insufficient color contrast), but they cannot evaluate semantic quality, that is, whether an alt text actually describes its image, whether a button label meaningfully conveys its function, or whether heading text accurately reflects its section content.

The authors propose a three-part taxonomy of web accessibility violations that structures their approach. Syntactic violations involve missing or malformed HTML elements and attributes required for accessibility (e.g., missing alt text, absent table headers, missing ARIA attributes). Semantic violations occur when accessibility-enhancing elements are present but fail to convey meaningful content (e.g., alt text set to "image" instead of describing the actual image). Layout violations are visual or structural barriers that impede interaction (e.g., insufficient color contrast, viewport settings that prevent zooming). The taxonomy covers more than 112 distinct violation types drawn from real-world web pages.

AccessGuru operates in two stages. AccessGuruDetect uses Axe-Playwright for syntactic and layout detection, and GPT-4o with multimodal capabilities (analyzing both HTML and page screenshots) for semantic detection. AccessGuruCorrect then generates corrections using a prompting strategy that integrates role-play prompting (adopting an accessibility expert persona), contextual prompting (enriching the prompt with violation-specific data and WCAG guidelines), and metacognitive prompting (structured self-reflection through five stages: comprehension, preliminary judgment, critical evaluation, decision confirmation, and confidence assessment). A corrective re-prompting step provides feedback when an initial correction still contains violations.

The authors also created the first comprehensive publicly available dataset of 3,500 real-world web accessibility violations across all three categories, sourced from 448 URLs guided by the WebAIM 2025 study.
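
The prompting strategy described in this summary (role-play, contextual, and metacognitive prompting, followed by corrective re-prompting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Violation` fields, the prompt wording, the `llm` and `check` callables, and the `max_rounds` limit are all hypothetical stand-ins chosen for illustration. Only the five stage names come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# The five self-reflection stages named in the summary.
METACOGNITIVE_STAGES = [
    "comprehension",
    "preliminary judgment",
    "critical evaluation",
    "decision confirmation",
    "confidence assessment",
]


@dataclass
class Violation:
    """Hypothetical record; field names are illustrative, not the dataset schema."""
    rule_id: str    # e.g. "image-alt"
    category: str   # "syntactic", "semantic", or "layout"
    html: str       # the offending HTML snippet
    guideline: str  # relevant WCAG guidance, as plain text


def build_prompt(v: Violation, feedback: Optional[str] = None) -> str:
    """Assemble one prompt from the three prompting ingredients."""
    parts = [
        # Role-play prompting: adopt an expert persona.
        "You are a web accessibility expert.",
        # Contextual prompting: violation-specific data and guideline text.
        f"Violation: {v.rule_id} ({v.category})",
        f"Guideline: {v.guideline}",
        f"Affected HTML:\n{v.html}",
        # Metacognitive prompting: structured self-reflection stages.
        "Work through these stages before answering:",
    ]
    parts += [f"{i}. {stage}" for i, stage in enumerate(METACOGNITIVE_STAGES, 1)]
    parts.append("Return only the corrected HTML.")
    if feedback:
        # Corrective re-prompting: report what the last attempt left unfixed.
        parts.append(f"Your previous correction still failed: {feedback}")
    return "\n".join(parts)


def correct(
    v: Violation,
    llm: Callable[[str], str],          # assumed: prompt in, HTML out
    check: Callable[[str], List[str]],  # assumed: HTML in, remaining rule ids out
    max_rounds: int = 2,                # assumed limit, not from the paper
) -> str:
    """Ask for a correction, re-prompting with feedback while violations remain."""
    feedback = None
    candidate = v.html
    for _ in range(max_rounds):
        candidate = llm(build_prompt(v, feedback))
        remaining = check(candidate)
        if not remaining:
            break
        feedback = ", ".join(remaining)
    return candidate
```

In this sketch `check` stands in for a detector re-run on the candidate HTML, and `llm` for whichever model is being evaluated. Keeping both as plain callables makes the loop easy to exercise with stubs before any real model or detector is wired in.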

Key findings

AccessGuru with GPT-4 achieved up to an 84% average decrease in violation score on the authors' benchmark dataset, significantly outperforming three baseline methods: contextual prompting (46%), ReAct prompting (50%), and zero-shot prompting (19%). On semantic violations specifically, AccessGuru achieved a 96% violation score decrease, resolving 53 out of 55 semantic violations.

A cross-LLM comparison tested GPT-4, Mistral-7B, and Qwen2.5-Coder. GPT-4 consistently outperformed both smaller models across all correction categories. Mistral-7B achieved 82% with AccessGuru's prompting (versus 13% with ReAct), and Qwen2.5 reached 74% (versus 44% with ReAct), demonstrating that AccessGuru's prompting strategy substantially improves even smaller models.

An ablation study confirmed the value of corrective re-prompting: performance dropped from 84% to 72% without it. A human developer correction study comparing LLM corrections with those from three full-stack developers found an average Sentence-BERT semantic similarity score of 0.77, indicating strong alignment between AI and human corrections.

Key reliability issues emerged: LLMs sometimes produced incomplete HTML, hallucinated unrelated content, or generated textual advice instead of code. Certain violation types proved persistently difficult: page-has-heading-one, color-contrast, and link-name were the most commonly uncorrected violations across all three LLMs. The system also only adjusts color values to meet contrast thresholds without considering the semantic use of color (e.g., red for errors), and it cannot handle dynamic content such as dropdown menus or pop-ups because detection relies on static screenshots.

Baseline methods often took an "Occam's Razor" approach: they removed problematic elements rather than properly correcting them, or changed both foreground and background to black and white for contrast violations, which distorts the visual design. AccessGuru's taxonomy-driven prompts prevented these issues.
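
To make the contrast discussion concrete, the following Python sketch computes the WCAG 2.x contrast ratio between two colors, using the relative luminance formula and the Level AA thresholds (4.5:1 for normal text, 3:1 for large text) from the WCAG definition. It is an illustration only, not code from the paper; the function names and the restriction to six-digit hex colors are choices made here.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = channel / 255
    # 0.03928 is the threshold printed in WCAG 2.1; for 8-bit values it
    # behaves the same as the 0.04045 used in the sRGB specification.
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color (0.0 for black, 1.0 for white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, always expressed as lighter over darker."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """Check the Level AA minimum: 4.5:1 for normal text, 3:1 for large text."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


if __name__ == "__main__":
    # Black on white is the maximum possible ratio (21:1), which is why it
    # always passes, whatever it does to the original design.
    print(round(contrast_ratio("#000000", "#ffffff"), 2))
    # A much smaller change can also pass: a mid gray on white.
    print(meets_aa("#767676", "#ffffff"))
```

Black on white yields the maximum ratio of 21:1, so it trivially satisfies any threshold. A correction that instead moves the existing foreground color just far enough to cross 4.5:1 preserves more of the design, which is the behavior the summary contrasts with the baselines.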

Relevance

This research represents a significant advance in automated accessibility remediation, moving beyond detection-only tools toward systems that can both identify and fix violations. The three-part taxonomy (syntactic, semantic, layout) provides a practical framework for understanding different categories of accessibility issues and their distinct correction requirements.

For accessibility practitioners and development teams, the key takeaway is that LLM-based tools can meaningfully assist with accessibility remediation, particularly for semantic violations that traditional automated tools miss entirely, but they are not yet reliable enough for unsupervised use. Even at an 84% decrease, roughly 16% of the measured violation score remains, and the system occasionally introduces new issues. The finding that baseline methods sometimes "fix" contrast by converting everything to black and white is a cautionary tale about naive application of AI to accessibility.

The publicly available dataset of 3,500 violations and the open-source code provide valuable resources for the accessibility research community. The prompting strategies, particularly the metacognitive approach with structured self-reflection stages, offer transferable techniques for anyone using LLMs for code generation or correction tasks. However, the system currently processes violations individually rather than producing fully corrected pages, and it cannot handle dynamic content or CSS-defined styles. These gaps must be addressed before such tools can be deployed in production workflows.

Tags: automated testing · web accessibility · large language models · HTML remediation · prompt engineering · WCAG · accessibility violations · machine learning

Standards referenced: WCAG 2.1 · WCAG Success Criterion 4.1.2 (Name, Role, Value)