Dataset Bias

Also known as: Training Data Bias, Data Representation Bias, Sampling Bias

A systematic skew in the composition of training data used to build machine learning models, resulting in models that perform well for overrepresented groups but poorly for underrepresented ones. In accessibility contexts, dataset bias is a pervasive problem: activity recognition datasets typically represent younger adults aged 18-48, image recognition datasets may not include images taken by blind users, and speech recognition datasets often exclude people with speech impairments. Addressing dataset bias requires deliberate inclusion of diverse populations in data collection, including older adults, people with various disabilities, and individuals from different cultural and socioeconomic backgrounds.
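A simple audit can make the skew described above visible before a model ships. The following is a minimal sketch, not a method taken from the cited source: the helper names `group_shares` and `accuracy_by_group`, the age bands, and the toy labels are all hypothetical and chosen only for illustration. It compares each group's share of the data with the model's accuracy on that group.

```python
from collections import Counter, defaultdict


def group_shares(groups):
    """Return each group's fraction of the dataset."""
    counts = Counter(groups)
    total = len(groups)
    return {g: n / total for g, n in counts.items()}


def accuracy_by_group(groups, y_true, y_pred):
    """Return accuracy computed separately for each group."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for g, t, p in zip(groups, y_true, y_pred):
        seen[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / seen[g] for g in seen}


# Hypothetical toy data: age bands in an activity recognition dataset.
groups = ["18-48"] * 8 + ["65+"] * 2
y_true = ["walk", "sit", "walk", "sit", "walk", "sit", "walk", "sit", "walk", "sit"]
y_pred = ["walk", "sit", "walk", "sit", "walk", "sit", "walk", "walk", "sit", "sit"]

shares = group_shares(groups)                      # how much of the data each group supplies
acc = accuracy_by_group(groups, y_true, y_pred)    # how well the model does on each group

for g in shares:
    print(f"{g}: share={shares[g]:.2f}, accuracy={acc[g]:.3f}")
```

In this made-up example the smaller group also has the lower accuracy, which is the pattern the definition describes. An overall accuracy figure alone would hide that gap, so reporting results per group is one plausible way to surface it.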

Category: Machine Learning · Ethics · AI Fairness · Data Science

Related: Machine Learning · Disability-first Design · AI Fairness

Sources

https://doi.org/10.1145/3441852.3476475