The Transparency of Automatic Web Accessibility Evaluation Tools: Design Criteria, State of the Art, and User Perception

Marco Manca, Vanessa Palumbo, Fabio Paternò, Carmen Santoro · 2023 · ACM Transactions on Accessible Computing · doi:10.1145/3556979

Summary

This paper investigates a critical but often overlooked problem in web accessibility practice: the lack of transparency in automated accessibility evaluation tools. While these tools are widely used to identify WCAG violations, users frequently struggle to understand why different tools produce divergent results, which techniques each tool actually checks, and how to interpret the findings. The authors argue that transparency, meaning clear communication of a tool's inputs, capabilities, limitations, and outputs, is essential for tools to be trusted and used effectively.

The research develops five transparency design criteria:

C1: which standards, success criteria, and techniques are supported
C2: how accessibility issues are categorized, using EARL terminology such as passed, failed, and cannot tell
C3: how validation results are presented at different levels of granularity
C4: whether practical guidance for fixing problems is provided
C5: whether the tool discloses its limitations, such as an inability to test dynamic content

These criteria were derived from prior research, collaboration with national accessibility agencies, and direct observation of accessibility practitioners.

The authors analyzed 11 free tools that require no license, drawn from the W3C Web Accessibility Evaluation Tools list, against these criteria. Tools examined include WAVE, MAUVE++, QualWeb, Lighthouse, IBM Equal Access Accessibility Checker, and others. The analysis revealed significant gaps: only 5 of the 11 tools explicitly list which techniques they implement, few clearly communicate their limitations, and support for dynamic content (crucial for modern single-page applications) is rarely documented.
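To make criteria C2 and C3 more concrete, the sketch below models check results with the outcome values defined by the EARL 1.0 Schema and derives two views from them: a coarse count per success criterion and a detailed list of checks that need human judgment. It is an illustration only, not taken from the paper or from any of the tools analyzed. The `CheckResult` fields, the function names, and the sample data are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """The five outcome values defined by the EARL 1.0 Schema."""
    PASSED = "earl:passed"
    FAILED = "earl:failed"
    CANT_TELL = "earl:cantTell"        # the tool could not decide on its own
    INAPPLICABLE = "earl:inapplicable"
    UNTESTED = "earl:untested"


@dataclass(frozen=True)
class CheckResult:
    """One check on one element (field names are illustrative assumptions)."""
    success_criterion: str  # WCAG success criterion, e.g. "1.1.1"
    technique: str          # WCAG technique ID, e.g. "H37"
    element: str            # hypothetical locator for the checked element
    outcome: Outcome


def summarize(results):
    """Coarse view (C3): count outcomes per success criterion."""
    summary = {}
    for result in results:
        counts = summary.setdefault(result.success_criterion, Counter())
        counts[result.outcome] += 1
    return summary


def needs_manual_review(results):
    """Detailed view (C3): the individual checks the tool could not decide."""
    return [r for r in results if r.outcome is Outcome.CANT_TELL]


# Hypothetical sample data, not output from any real tool.
SAMPLE = [
    CheckResult("1.1.1", "H37", "img#logo", Outcome.PASSED),
    CheckResult("1.1.1", "H37", "img.banner", Outcome.FAILED),
    CheckResult("1.4.3", "G18", "p.intro", Outcome.CANT_TELL),
]

if __name__ == "__main__":
    for criterion, counts in summarize(SAMPLE).items():
        print(criterion, {o.value: n for o, n in counts.items()})
    for r in needs_manual_review(SAMPLE):
        print("review manually:", r.element, r.technique)
```

Keeping the technique ID on every result is what allows a report to show which techniques were actually checked (C1), and keeping "cannot tell" separate from "failed" preserves the distinction between issues the tool is sure about and issues that need a human reviewer.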

Key findings

A survey of 138 accessibility professionals (36% accessibility experts, 25% web developers, 11% web commissioners) found strong support for all transparency criteria. On a 5-point scale, respondents rated "suggestions to solve errors" highest (M=4.67), followed by "standards/techniques supported" (M=4.62); "tool limitations information" was rated lowest but still high (M=4.22). Notably, 64% of respondents reported having had difficulty understanding results from automated tools, citing mismatches between tool output and actual page state, divergent results across tools, unclear error messages, and a lack of remediation guidance. Frequency of tool use significantly affected preferences: frequent users valued detailed technical information and explicit limitation disclosure more than infrequent users, who preferred summary accessibility scores. The most commonly used tools were MAUVE++ (16%), WAVE (14%), SiteImprove (11%), and Lighthouse (6%).

A user test with 18 participants (accessibility experts, web developers, web commissioners) evaluated MAUVE++, QualWeb, and Lighthouse against the transparency criteria. None achieved full transparency. Transparency ratings (1-5 scale) were: MAUVE++ (mean 3.88, median 4), QualWeb (mean 4, median 3), Lighthouse (mean 2.44, median 2). Key gaps included: Lighthouse does not explicitly reference WCAG guidelines; most users could not correctly distinguish between errors and warnings; and half of the participants could not find information about tool limitations. Users appreciated MAUVE++'s multiple views (developer vs. end-user) and QualWeb's filtering capabilities, but criticized Lighthouse's lack of explicit standards references.

Relevance

This research provides a practical framework for both evaluating and designing accessibility tools. Organizations selecting tools should assess them against the five transparency criteria, not just coverage statistics, to ensure the tool's output will be interpretable by their team. The finding that frequent users need detailed technical information while occasional users prefer summaries suggests tools should support multiple presentation modes.

For tool developers, the paper identifies specific improvements: explicitly list supported techniques (not just guidelines), clearly distinguish errors from warnings, disclose limitations around dynamic content and JavaScript frameworks, provide contextual remediation guidance beyond links to W3C documentation, and communicate how accessibility scores are calculated.

The research highlights a broader issue: even accessibility experts struggle to interpret automated tool results. This has implications for organizational accessibility programs: tool outputs cannot be treated as definitive, and manual testing remains essential. The parallel drawn to AI explainability is apt: as tools become more complex, transparency about their decision-making becomes more critical. The criteria proposed here could inform procurement requirements, tool certifications, or future W3C guidance on evaluation tool design.
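One way the five criteria might be turned into a procurement check is sketched below. This is an illustrative assumption, not a procedure from the paper: the short criterion descriptions paraphrase C1 to C5, and the function names, the yes/no assessment format, and the example tool are hypothetical.

```python
# Short paraphrases of the paper's five transparency criteria.
CRITERIA = {
    "C1": "lists supported standards, success criteria, and techniques",
    "C2": "categorizes issues with clear outcome terms (e.g. EARL)",
    "C3": "presents results at several levels of granularity",
    "C4": "gives practical guidance for fixing problems",
    "C5": "discloses its limitations (e.g. dynamic content)",
}


def unmet_criteria(assessment):
    """Return the IDs of criteria the tool does not satisfy.

    `assessment` maps criterion IDs to True/False; a criterion with no
    answer is treated as unmet, so undocumented behavior counts as a gap.
    """
    return [cid for cid in CRITERIA if not assessment.get(cid, False)]


def report(tool_name, assessment):
    """Build a short plain-text summary of a tool's transparency gaps."""
    gaps = unmet_criteria(assessment)
    if not gaps:
        return f"{tool_name}: meets all five transparency criteria"
    lines = [f"{tool_name}: {len(gaps)} transparency gap(s)"]
    lines += [f"  {cid}: {CRITERIA[cid]}" for cid in gaps]
    return "\n".join(lines)


# Hypothetical assessment of an imaginary tool; C5 was never answered.
example = {"C1": True, "C2": True, "C3": True, "C4": False}

if __name__ == "__main__":
    print(report("ExampleChecker", example))
```

Treating an unanswered criterion as unmet reflects the paper's point that undisclosed limitations are themselves a transparency problem. A real assessment would likely need graded ratings rather than a yes/no answer per criterion.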

Tags: automated testing · accessibility evaluation tools · WAVE · axe · Lighthouse · MAUVE++ · transparency · usability · WCAG compliance

Standards referenced: WCAG 2.0 · WCAG 2.1 · EARL · ACT Rules · EN 301 549 · Section 508