Is Accessibility Conformance an Elusive Property? A Study of Validity and Reliability of WCAG 2.0
Giorgio Brajnik, Yeliz Yesilada, Simon Harper · 2012 · ACM Transactions on Accessible Computing · doi:10.1145/2141943.2141946
Summary
This landmark study investigates whether WCAG 2.0's definition of "Reliably Human Testable"—that at least 80% of knowledgeable evaluators would agree on an audit conclusion—is actually achievable in practice. The researchers recruited 25 experienced accessibility evaluators (professionals with publications or consulting experience) and 27 trained novices (students who completed 14 hours of accessibility coursework) to independently audit four web pages against all 61 WCAG 2.0 success criteria. The four pages (a Facebook group, IMDB movie page, Bloomberg homepage, and Scientific American article) were deliberately chosen to vary in layout, complexity, and accessibility support. Each evaluator rated every success criterion as "pass," "fail," or "not applicable," using whatever evaluation tools they preferred. The study measured both reliability (agreement between evaluators) and validity (accuracy of their judgments, with "correct" answers determined by majority consensus among experienced evaluators). This experimental design directly tests the foundational assumption underlying WCAG conformance claims: that accessibility can be objectively determined through expert inspection. The results challenge this assumption fundamentally, with implications for how organizations approach accessibility audits, procurement requirements, and legal compliance.
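The 80% criterion described above can be made concrete with a small sketch. It assumes one simple way to operationalize agreement: the share of evaluators who chose the most common rating for a success criterion. The function names and sample ratings are hypothetical illustrations, not data or code from the study.

```python
from collections import Counter

THRESHOLD = 0.80  # the "Reliably Human Testable" bar quoted in the summary


def max_agreement(ratings):
    """Share of evaluators who chose the most common rating.

    Assumption: agreement is measured as the modal rating's share.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    _, top_count = Counter(ratings).most_common(1)[0]
    return top_count / len(ratings)


def is_reliably_testable(ratings, threshold=THRESHOLD):
    """True when the modal rating's share meets or exceeds the threshold."""
    return max_agreement(ratings) >= threshold


# Hypothetical ratings from ten evaluators for one criterion on one page.
sample = ["fail"] * 7 + ["pass"] * 2 + ["n/a"]
print(max_agreement(sample))         # 0.7
print(is_reliably_testable(sample))  # False
```

Under this reading, a criterion rated "fail" by 7 of 10 evaluators falls short of the bar even though a clear majority agrees.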
Key findings
The 80% agreement threshold was almost never achieved. Mean max-agreement among experienced evaluators was only 73% (SD=18%), meaning 25-30% of audit results would be contested. Only 5 of 61 success criteria consistently reached 80% agreement across all four pages, while 9 success criteria never reached this threshold on any page. Critically, expertise had no effect on reliability: novices showed the same agreement levels as experienced evaluators. Validity was also problematic. Experienced evaluators achieved 76% accuracy, while novices reached only 66%. The mean F-measure (balancing false positives and missed problems) was 0.70 for experienced evaluators and 0.52 for novices. On average, experienced evaluators produced 26-35% false positives while missing 26-35% of true accessibility problems. The primary benefit of expertise was reducing false positives by about 19%, not improving detection of real problems. The best-performing strategy was pooling results from two independent experienced evaluators, which captures at most 76% of true problems with 24% false positives. Adding more evaluators does not improve results: with 10 experienced evaluators, performance actually matches that of just 2 novices. Novice evaluators also took about 1.6 times as long as experienced ones (170 vs 106 minutes) while producing less accurate results.
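To show how false positives and missed problems combine into a single score, here is a minimal sketch of precision, recall, and the balanced F-measure (F1). It assumes the standard harmonic-mean definition, which may differ in detail from the weighting used in the paper. The function name and the example sets are hypothetical.

```python
def precision_recall_f1(reported, true_problems):
    """Score one evaluator's reported failures against a reference set.

    precision = share of reported problems that are real
                (1 - precision is the false-positive share)
    recall    = share of real problems that were reported
                (1 - recall is the share missed)
    """
    reported, true_problems = set(reported), set(true_problems)
    hits = reported & true_problems
    precision = len(hits) / len(reported) if reported else 0.0
    recall = len(hits) / len(true_problems) if true_problems else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: success criteria that truly fail on a page
# versus the criteria one evaluator reported as failing.
true_problems = {"1.1.1", "1.3.1", "1.4.3", "2.4.4"}
reported = {"1.1.1", "1.3.1", "1.4.3", "2.4.6"}

print(precision_recall_f1(reported, true_problems))  # (0.75, 0.75, 0.75)
```

In this made-up case the evaluator raises one false alarm and misses one real problem, so a quarter of the report is wrong in each direction.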
Relevance
This research has profound implications for accessibility practice. It demonstrates that conformance claims are inherently uncertain: even highly experienced evaluators working independently disagree on roughly 25-30% of success criteria, and even the best pooled pair misses about a quarter of real problems while a quarter of what it flags is not actually an issue. Organizations relying on a single accessibility audit for compliance assurance are operating with significant hidden uncertainty. For procurement and legal contexts, the findings suggest that pass/fail conformance statements should be treated with appropriate skepticism. A site that "passes" one audit might well "fail" another conducted by equally qualified evaluators. The authors conclude that accessibility should be understood more like usability: contextual, dependent on user characteristics, and not reducible to a binary property. Practically, organizations should engage two independent experienced evaluators and pool their findings, which represents the best available balance of coverage and accuracy. Training accessibility evaluators should focus on reducing false positives rather than improving problem detection. Perhaps most importantly, untrained developers, designers, or QA testers cannot be expected to reliably determine WCAG conformance. The validity gap between trained novices and experts is substantial, and untrained staff, who were not part of the study, would presumably perform even worse.
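As a rough illustration of the pooling recommendation, the sketch below merges two independent audits and lists the criteria on which they disagree. It assumes pooling means taking the union of criteria that either evaluator failed. That is one plausible reading, not necessarily the exact procedure used in the paper. The audit dictionaries and function names are hypothetical.

```python
def pool_failures(*audits):
    """Union of success criteria rated "fail" by at least one audit.

    Assumption: "pooling" is modeled as a simple union of failures.
    """
    pooled = set()
    for audit in audits:
        pooled |= {sc for sc, rating in audit.items() if rating == "fail"}
    return pooled


def contested(audit_a, audit_b):
    """Criteria rated by both audits but with different outcomes."""
    shared = audit_a.keys() & audit_b.keys()
    return {sc for sc in shared if audit_a[sc] != audit_b[sc]}


# Hypothetical audits of the same page by two independent evaluators.
audit_a = {"1.1.1": "fail", "1.4.3": "pass", "2.4.4": "fail"}
audit_b = {"1.1.1": "fail", "1.4.3": "fail", "2.4.4": "pass"}

print(sorted(pool_failures(audit_a, audit_b)))  # ['1.1.1', '1.4.3', '2.4.4']
print(sorted(contested(audit_a, audit_b)))      # ['1.4.3', '2.4.4']
```

A union catches problems that either evaluator found, at the cost of also inheriting each evaluator's false positives, so the contested list is a natural starting point for a follow-up review.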
Tags: WCAG · conformance testing · accessibility evaluation · evaluator effect · reliability · validity · expert review
Standards referenced: WCAG 2.0