OPTIMAL-EM: Complexity-Driven Clustering for Optimised Web Accessibility Evaluation
Alexander Hambley, Yeliz Yesilada, Markel Vigo, Simon Harper · 2026 · ACM Transactions on the Web, Vol. 20, No. 2 · doi:10.1145/3799797
Summary
Hambley, Yesilada, Vigo, and Harper (University of Manchester and METU-NCC) extend their OPTIMAL-EM methodology for large-scale web accessibility conformance evaluation. The problem they address is fundamental to professional auditing: the W3C's Website Accessibility Conformance Evaluation Methodology (WCAG-EM) instructs auditors to assess a 'representative sample' of pages from a target site, but the sampling method is non-probabilistic and offers no statistical basis for generalising findings to the rest of the site. Because manual WCAG evaluation is time-consuming, especially for criteria like 1.1.1 Non-text Content that automated tools cannot fully assess, a poorly chosen sample can either waste auditor time on redundant templated pages or miss barrier-rich sections entirely. OPTIMAL-EM reframes sampling as a clustering problem. Each page is represented as a vector of HTML tag frequencies (either all tags or only block-level tags), dimensionality is reduced with t-SNE, structurally similar pages are clustered using DBSCAN, and two cluster-level metrics are then measured: average complexity (the ratio of 'rich' embedded/interactive HTML elements to total elements) and complexity *variance* (how much pages inside a cluster differ from each other in complexity). The authors ran the pipeline against four real sites: the University of Manchester's StudentNet (388 pages, the training set) and three validation sites (University of Cambridge, Goodreads, and The Eclipse Foundation), each sampled at 500 pages. Accessibility barriers were measured with axe-core via Pa11y and categorised by severity (critical, serious, moderate, minor). The three research questions ask whether within-cluster complexity variance predicts accessibility barriers, whether variance is a stronger predictor than average complexity, and how clustering can guide more representative sampling.
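The two cluster-level metrics can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the set of 'rich' tags is an assumed, illustrative list (the paper's exact list is not given in this summary), the function names are hypothetical, population variance is an assumption, and the t-SNE and DBSCAN steps are skipped by taking cluster labels as given.

```python
from collections import defaultdict
from html.parser import HTMLParser
from statistics import mean, pvariance

# ASSUMPTION: an illustrative set of "rich" embedded/interactive tags.
# The paper's actual list is not stated in this summary.
RICH_TAGS = {"img", "svg", "video", "audio", "iframe", "canvas", "object",
             "embed", "form", "input", "select", "textarea", "button"}


class TagCounter(HTMLParser):
    """Counts every opening (or self-closing) tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this hook.
        self.counts[tag] += 1


def tag_frequencies(html):
    """Return a {tag: count} mapping, i.e. the page's tag-frequency vector."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


def complexity(html):
    """Ratio of rich elements to all elements; 0.0 for a page with no tags."""
    counts = tag_frequencies(html)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    rich = sum(n for tag, n in counts.items() if tag in RICH_TAGS)
    return rich / total


def cluster_metrics(pages, labels):
    """Average complexity and complexity variance per cluster.

    pages  -- list of HTML strings
    labels -- one cluster label per page (assumed to come from a prior
              clustering step such as DBSCAN, which is not shown here)
    """
    by_cluster = defaultdict(list)
    for html, label in zip(pages, labels):
        by_cluster[label].append(complexity(html))
    # ASSUMPTION: population variance; the summary does not say which
    # variance estimator the authors use.
    return {label: {"mean": mean(values), "variance": pvariance(values)}
            for label, values in by_cluster.items()}
```

A call such as `cluster_metrics(pages, labels)` returns one `mean` and one `variance` per cluster label. A single-page cluster always has variance 0, so very small clusters may need separate handling in practice.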
Key findings
On three of the four sites, higher within-cluster complexity variance correlated positively, and often strongly, with total accessibility barrier counts: r=0.61 for StudentNet, 0.37 for Cambridge, 0.76 for Goodreads. Critical barrier correlations with variance were particularly striking on Cambridge (0.91) and Goodreads (0.72). By contrast, *average* complexity correlated only weakly or negatively with barriers (r=-0.36 on StudentNet), meaning that dense, rich-content pages are not inherently more inaccessible; it is inconsistency *between* pages in a cluster that predicts problems. Moderate-severity barriers drove most of the signal (variance-to-moderate-barriers r=0.76 on StudentNet). The Eclipse Foundation was an instructive outlier (r=-0.23) because site-wide shared components propagate identical barriers across all pages, flattening the variance signal; there, the correlation between total and serious barriers was 1.00, showing that site-wide template issues dominate. Restricting the representation to block-level HTML preserved the overall pattern (variance-to-total r=0.49, against 0.61 with all tags) but reduced signal on inline-element barriers like missing alt text on <svg>. Most pages fell into large, templated clusters with low variance and few barriers; a long tail of small, ad-hoc clusters had high variance and disproportionate barrier counts. The authors make two actionable recommendations: target manual evaluation at clusters with high complexity variance, and treat consistency as a stronger accessibility predictor than simplicity.
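The r values above are correlation coefficients between per-cluster variance and barrier counts. As a rough sketch of that calculation and of the first recommendation, the code below computes a Pearson coefficient and ranks clusters by variance. Pearson's r is assumed here (the summary does not name the coefficient), the function names are hypothetical, and the toy numbers are invented for illustration and are not taken from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # The coefficient is undefined when either variable is constant.
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


def rank_clusters_by_variance(cluster_variance):
    """Cluster labels ordered from highest to lowest complexity variance."""
    return sorted(cluster_variance, key=cluster_variance.get, reverse=True)


# MADE-UP toy values, purely illustrative; NOT results from the paper.
toy_variance = {"template-a": 0.001, "template-b": 0.004, "ad-hoc": 0.030}
toy_barriers = {"template-a": 2, "template-b": 5, "ad-hoc": 19}

labels = list(toy_variance)
r = pearson_r([toy_variance[c] for c in labels],
              [toy_barriers[c] for c in labels])
audit_order = rank_clusters_by_variance(toy_variance)
```

Here `audit_order` lists clusters in the order a manual audit might visit them under the variance-first recommendation. A case like The Eclipse Foundation, where shared components flatten the variance signal, would not be caught by this ranking alone.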
Relevance
For professional accessibility auditors and procurement teams, this work offers a statistically defensible alternative to WCAG-EM's ad-hoc sampling. That is particularly relevant given the EU Web Accessibility Directive and the UK Public Sector Bodies accessibility regulations, which require conformance evaluation across potentially thousands of pages on public-sector sites. The core insight, 'design consistency matters more than design simplicity', is practically useful for design systems and templating teams: fixing a single shared template can cascade accessibility improvements across a cluster, while bespoke ad-hoc pages deserve disproportionate audit attention. The methodology also fits naturally with modern front-end architectures built from shared components. Limitations: the complexity metric is a coarse structural ratio and ignores visual aesthetics, perceived complexity, and content semantics; only axe-core is used, and prior work shows automated tools cover only ~50% of WCAG criteria, so the 'barriers' count is partial; DBSCAN parameter tuning requires domain expertise; and the Eclipse Foundation result shows the method can mislead when site-wide templates carry shared barriers. Future work could explore hierarchical DBSCAN and Affinity Propagation, which naturally surface cluster exemplars for sampling.
Tags: web accessibility · accessibility evaluation · WCAG-EM · OPTIMAL-EM · representative sampling · web page complexity · clustering · DBSCAN · t-SNE · machine learning · audit methodology · automated testing
Standards referenced: WCAG 2.0 · WCAG 2.1 · WCAG 2.2 · WCAG-EM · EU Web Accessibility Directive · ISO/IEC 40500:2012