On the Identification of Accessibility Bug Reports in Open Source Systems

Wajdi Aljedaani, Mohamed Wiem Mkaouer, Stephanie Ludi, Ali Ouni, Ilyes Jenhani · 2022 · Proceedings of the 19th International Web for All Conference (W4A) · doi:10.1145/3493612.3520471

Summary

This paper addresses a gap in software engineering research: the automated identification of accessibility-related bug reports in open-source projects. Bug tracking systems such as Bugzilla and Monorail accumulate vast numbers of reports, and manually sifting through them to find accessibility-specific issues is time-consuming and error-prone. The authors frame the task as a binary classification problem: given a bug report, determine whether or not it relates to accessibility. The study draws on an existing dataset of manually curated accessibility bug reports from seven popular open-source projects spanning three major platforms: Mozilla (Firefox, Firefox Core, Mac), Google Chromium (Windows, Chrome, Android), and Apache (NetBeans). In total, 2,567 accessibility bug reports and 256,700 non-accessibility bug reports, collected between 1997 and 2020, were analyzed.

The methodology follows a three-step pipeline: data collection from the Bugzilla and Monorail repositories; text preprocessing using NLP techniques (tokenization, special character removal, stop-word removal, and lemmatization); and data transformation via feature hashing, which converts textual descriptions into numerical feature vectors. Five machine learning classifiers were evaluated: Decision Tree, Random Forest, Decision Jungle, Support Vector Machine (SVM), and Neural Network. The researchers used 10-fold stratified cross-validation to account for the severe class imbalance in the dataset.
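To make the preprocessing, feature hashing, and stratified splitting steps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the stop-word list, the bucket count, the MD5-based hash, and the function names are all assumptions chosen for illustration, lemmatization is omitted, and the fold assignment is an unshuffled round-robin rather than a full cross-validation routine.

```python
import hashlib
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and"}


def preprocess(text):
    """Lowercase, drop special characters, tokenize, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def hash_features(tokens, n_buckets=32):
    """Map tokens to a fixed-length count vector using the hashing trick."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1  # colliding tokens share a bucket
    return vector


def stratified_folds(labels, k=10):
    """Assign sample indices to k folds so each fold keeps the class ratio."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    for indices in by_class.values():
        # Round-robin within each class; a real run would shuffle first.
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


report = "Screen reader does not announce the label of the Save button!"
tokens = preprocess(report)
vector = hash_features(tokens)

# Toy labels with a 1:100 imbalance (1 = accessibility report).
labels = [1] * 20 + [0] * 2000
folds = stratified_folds(labels, k=10)
```

Because the vector length is fixed in advance, no vocabulary has to be stored, at the cost of occasional collisions between unrelated tokens. The stratified assignment keeps the rare accessibility class represented in every fold, which a purely random split would not guarantee.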

Key findings

The approach achieved F1-scores of 93%, demonstrating that automated identification of accessibility bug reports is feasible. Decision Tree outperformed the other classifiers on most projects, with accuracy scores of 0.89-0.92 and strong precision and recall. Neural Network performed comparably and did particularly well on the Firefox project, with 0.91 accuracy, 0.94 precision, and 0.97 AUC. SVM was the weakest performer overall, because its kernel trick is less suited to small, imbalanced datasets. The study also investigated minimum training data requirements and found that a single fold of bug reports (as few as 10 accessibility and 100 non-accessibility reports) was enough for Random Forest to reach performance equivalent to a 93% F1-score. NetBeans and Android bug reports contained the most discriminative keywords, yielding the highest classification accuracy. Inter-rater agreement during manual validation reached a Cohen's kappa coefficient of 0.83, indicating almost perfect agreement. The researchers also identified discriminative keywords, organized by accessibility guideline category, including terms related to screen readers, visual impairment, audio descriptions, form labels, and text alternatives.
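For readers less familiar with the metrics reported above, the following sketch shows how precision, recall, F1-score, and Cohen's kappa are computed for a binary labeling task. The labels are made-up toy data, not results from the paper, and the function names are illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cohen_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Toy data: 1 = accessibility report, 0 = other report.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

F1 is the harmonic mean of precision and recall, so it stays low when a classifier ignores the rare class, which is why it is more informative than raw accuracy on imbalanced data. Kappa discounts the agreement two raters would reach by chance alone, so it is lower than the raw agreement rate.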

Relevance

For accessibility practitioners and development teams, this research offers a practical path toward automating one of the most tedious aspects of accessibility quality assurance: finding accessibility bugs in large repositories. Organizations maintaining open-source projects could deploy similar classifiers to automatically flag and prioritize accessibility-related bug reports, ensuring they receive timely attention rather than being lost in the volume of general defect reports. The finding that relatively small training datasets suffice makes this approach viable for teams without large labeled datasets. However, the study is limited to English-language bug reports and open-source systems, and the models have not been validated on commercial or industrial projects, where bug report quality and terminology may differ. The reliance on text descriptions alone also means the classifier cannot catch accessibility issues in reports that are poorly described or that use non-standard terminology.

Tags: accessibility bug reports · machine learning · open source · bug classification · automated testing · text classification · software quality

Standards referenced: WCAG 2.1 · BBC Mobile Accessibility Guidelines