Web Accessibility Evaluation in a Crowdsourcing-Based System with Expertise-Based Decision Strategy

Shuyi Song, Jiajun Bu, Ye Wang, Zhi Yu, Andreas Artmeier, Lianjun Dai, Can Wang · 2018 · Proceedings of the 15th International Web for All Conference (W4A 2018) · doi:10.1145/3192714.3192827

Summary

This short paper presents a crowdsourcing-based web accessibility evaluation system that addresses the scalability bottleneck in manual accessibility evaluation: expert evaluators are scarce (training takes years) and expensive, yet many WCAG checkpoints cannot be evaluated automatically and require human judgement. The system, developed through a collaboration between Zhejiang University and the China Disabled Persons’ Federation, introduces two decision strategies for synthesising reliable evaluation results from workers with heterogeneous expertise levels.

The architecture has four modules: Sampling (selecting representative web pages from a site using random walk or stratified sampling), Automatic Evaluation (checking structural/technical checkpoints such as alt text presence), Manual Evaluation (assigning subjective checkpoints to multiple human workers), and Measuring (aggregating results into accessibility scores). For manual evaluation, each task is assigned to multiple workers who independently judge whether a checkpoint is accessible or inaccessible on a given web page.

The key challenge is that simple Majority Vote treats all workers’ opinions equally, which is problematic when workers have very different expertise levels: an expert’s correct opinion can be outvoted by several non-experts’ incorrect ones. The Golden Set Strategy (GSS) addresses this by interspersing known-answer "golden tasks" among regular tasks to estimate each worker’s accuracy, then weighting opinions by that estimated expertise when merging. The Time-Based Golden Set Strategy (T-GSS) extends GSS by also considering the time workers spend on each task. It rests on the observation that a worker who marks a page "accessible" after very little time is probably answering carelessly, whereas a short time on an "inaccessible" judgement is expected, because the evaluator can submit as soon as a barrier is found.
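The difference between the three decision strategies can be illustrated with a minimal Python sketch. It is not the paper's formulation: the summary does not give the exact weighting formulas, so the signed sum of accuracy weights, the tie-breaking rule, the `min_seconds` threshold, and the `penalty` factor are all assumptions chosen for illustration, as are the worker names and toy data.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    worker: str
    accessible: bool  # True = "accessible", False = "inaccessible"
    seconds: float    # time the worker spent on the task


def golden_accuracy(golden_answers, truth):
    """Estimate each worker's accuracy from known-answer golden tasks."""
    return {
        worker: sum(answers[t] == truth[t] for t in answers) / len(answers)
        for worker, answers in golden_answers.items()
    }


def majority_vote(votes):
    """Every opinion counts equally; ties go to 'inaccessible' (assumption)."""
    return sum(v.accessible for v in votes) * 2 > len(votes)


def weighted_vote(votes, accuracy, min_seconds=None, penalty=0.5):
    """GSS-style merge: weight each opinion by estimated accuracy.

    If min_seconds is set, this becomes a T-GSS-style merge: an
    'accessible' vote cast faster than min_seconds is discounted by
    `penalty`. Both parameters are hypothetical, not from the paper.
    """
    score = 0.0
    for v in votes:
        weight = accuracy.get(v.worker, 0.5)  # unknown worker: neutral weight
        if min_seconds is not None and v.accessible and v.seconds < min_seconds:
            weight *= penalty
        score += weight if v.accessible else -weight
    return score > 0


# Toy golden set: ten known-answer tasks.
truth = {f"g{i}": i % 2 == 0 for i in range(10)}
expert = dict(truth)
expert["g1"] = not truth["g1"]          # one mistake: 9 of 10 correct
novice = {t: True for t in truth}       # biased toward "accessible"
novice["g0"] = False                    # 4 of 10 correct
accuracy = golden_accuracy(
    {"expert": expert, "n1": novice, "n2": novice, "n3": novice}, truth
)

# Task A: the expert finds a barrier, two novices miss it.
task_a = [Vote("expert", False, 8), Vote("n1", True, 20), Vote("n2", True, 15)]
# Task B: three novices outweigh the expert, but two of them answered very fast.
task_b = task_a[:2] + [Vote("n2", True, 2), Vote("n3", True, 3)]

result_a_mv = majority_vote(task_a)
result_a_gss = weighted_vote(task_a, accuracy)
result_b_gss = weighted_vote(task_b, accuracy)
result_b_tgss = weighted_vote(task_b, accuracy, min_seconds=5)
```

In this toy data, Majority Vote labels task A accessible while the accuracy-weighted merge sides with the expert. On task B the accuracy weights alone are not enough to overturn three novices, and only the time discount on the two hurried "accessible" votes flips the result. The "inaccessible" vote is never discounted for being fast, mirroring the asymmetry described above.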

Key findings

The system was evaluated on 98 Chinese government websites providing services for people with disabilities, generating 23,901 manual evaluation tasks from 4,617 sampled web pages across 27 WCAG checkpoints that could not be automatically evaluated. Fifty non-expert volunteers (trained in accessibility evaluation with varying expertise, ages 23-36, mean 27) and five expert evaluators (7+ years of accessibility experience) participated. Each task was assigned to 4-7 workers, with golden tasks randomly interspersed. Expert evaluation of all tasks served as ground truth. The T-GSS strategy achieved 80.15% accuracy compared to 72.94% for standard Majority Vote, an improvement of 7.21 percentage points. GSS alone achieved 78.61% accuracy. Both strategies significantly outperformed Majority Vote, confirming that expertise weighting improves result quality. Crucially, the crowdsourcing approach completed the evaluation in approximately 31 hours, less than half the estimated 66 hours that would be needed if only the five experts worked in parallel at approximately 10 seconds per task. The evaluation also revealed a bias toward reporting pages as accessible: workers with lower expertise were more likely to miss accessibility barriers and mark pages as accessible, so weighting all opinions equally (as in Majority Vote) systematically underestimates the number of accessibility problems.
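The headline figures can be cross-checked with a few lines of Python. This is a back-of-the-envelope sketch using only the numbers reported above. Reading the 66-hour estimate as each expert working through all 23,901 tasks at about 10 seconds per task is an assumption, made because it reproduces the reported figure.

```python
# Figures as reported in the summary above.
TASKS = 23_901
SECONDS_PER_TASK = 10          # approximate expert time per task
CROWD_HOURS = 31               # approximate crowdsourced completion time

# Assumption: each expert evaluates every task, so wall-clock time for
# the expert-only scenario equals one expert's total time.
expert_hours = TASKS * SECONDS_PER_TASK / 3600

accuracy = {"Majority Vote": 72.94, "GSS": 78.61, "T-GSS": 80.15}
baseline = accuracy["Majority Vote"]

# Absolute gain in percentage points over Majority Vote.
gain_points = {name: round(acc - baseline, 2) for name, acc in accuracy.items()}

# Relative gain, as a percentage of the Majority Vote accuracy.
gain_relative = {
    name: 100 * (acc - baseline) / baseline for name, acc in accuracy.items()
}

print(f"expert-only estimate: {expert_hours:.1f} h")
print(f"crowd / expert time:  {CROWD_HOURS / expert_hours:.2f}")
for name in ("GSS", "T-GSS"):
    print(f"{name}: +{gain_points[name]} points, "
          f"+{gain_relative[name]:.1f}% relative")
```

The distinction between percentage points and relative improvement matters here: the 7.21 figure is a difference between two accuracies, and the relative gain over the Majority Vote baseline is somewhat larger.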

Relevance

This paper addresses a critical scalability challenge in web accessibility: the growing demand for evaluation far exceeds the supply of qualified experts. The crowdsourcing approach with expertise-weighted decision strategies offers a practical middle ground between expensive expert-only evaluation and unreliable automated-only checking. For accessibility practitioners and organisations managing large-scale evaluations, the key insights are: (1) non-experts can meaningfully contribute to manual accessibility evaluation when their opinions are weighted by demonstrated accuracy; (2) time spent on tasks provides a useful signal about answer reliability, particularly for distinguishing careless from thoughtful responses; (3) the bias toward "accessible" judgements among less experienced evaluators means that simple vote counting will miss real barriers; and (4) the golden set approach for estimating worker reliability is practical and effective. The system’s modular architecture means its decision strategies could be integrated into existing accessibility evaluation platforms. The collaboration with the China Disabled Persons’ Federation and the focus on Chinese government disability service websites make this one of the few accessibility evaluation studies conducted in a Chinese context, expanding the geographic diversity of the research base.

Tags: web accessibility · crowdsourcing · accessibility evaluation · WCAG · automated testing · manual evaluation · expertise · quality assurance · China

Standards referenced: WCAG · Section 508