Interface Support for Evaluating Disability Bias in AI-Generated Images

Kelly Avery Mack, Lucy Jiang, Lotus Zhang, Leah Findlater · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791922

Summary

Mack and colleagues investigate whether interface-level interventions can help users of generative text-to-image (T2I) tools recognise and avoid disability stereotypes in AI-generated images. The authors frame the work around a gap in AI safety: while model-side debiasing is an active research area, end users — especially nondisabled prompters with limited disability familiarity — are a largely untapped site for mitigating bias at the point of use. Drawing on prior disability representation research, they consolidate 17 previously identified T2I stereotypes into four groups: pity (sad, lonely, or idle portrayals), extraordinary (superhero, bionic assistive technology, horror aesthetics), medical/mortality imagery, and inaccurate assistive technology. Two interventions are prototyped: an Education module that explains each stereotype with examples, and an AI Feedback component that uses GPT-4o-mini to flag stereotypes in a generated image. Both are evaluated alone and in combination through a controlled online experiment (N = 103 on Prolific) using pre-generated DALL-E 3 images, followed by a 90-minute qualitative Zoom study (N = 10) with an interactive ReactJS/Flask prototype that let participants generate images end-to-end. The work positions itself within Design Justice, crip technoscience, and "access as friction" traditions — treating friction as a pro-social design move that encourages critical review of model outputs rather than frictionless consumption.

Key findings

The Education intervention significantly reduced participants' self-reported likelihood of using images containing stereotypes (β = -0.62, p = 0.01 in a cumulative link mixed model), corresponding to roughly 85% higher odds of giving a lower use rating after exposure. The AI Feedback intervention did not produce a statistically significant effect on either representation quality or likelihood of use. AI Feedback accuracy against the researchers' codebook averaged 80% agreement across the four stereotype groups, but with a 34% false-negative rate and only 4% false positives: the model missed stereotypes more often than it invented them. Over-reliance was pronounced, with participants changing their ratings to match AI Feedback more than half the time (false positives 50.7%; false negatives 54.5%). The Education intervention also significantly reduced participants' confidence in their ability to assess disability representation (β = -1.65, p = 0.007), suggesting it surfaced the complexity of the task. Qualitatively, many participants wanted the image subject to visibly "look disabled" (through assistive technologies, metaphorical cues, or exaggerated symptoms), which sometimes reintroduced stereotypes even when participants had identified them. Disabled participants (N = 35) tended to rate stereotypical images lower than nondisabled participants.
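
For readers less familiar with cumulative link models, the short Python sketch below shows how a coefficient such as β = -0.62 can be read as a change in odds. It assumes a logit link (proportional odds), which this summary does not state explicitly, and the function name is hypothetical rather than taken from the paper. Under that assumption, the odds of a lower rating are multiplied by exp(-β).

```python
import math


def odds_ratio_for_lower_rating(beta: float) -> float:
    """Multiplicative change in the odds of giving a LOWER rating.

    Assumption: the coefficient comes from a cumulative link model with a
    logit link, so a negative beta shifts responses toward lower ratings
    and the odds of a lower rating scale by exp(-beta).
    """
    return math.exp(-beta)


# Coefficient for the Education intervention, as reported in the summary.
beta_education = -0.62

ratio = odds_ratio_for_lower_rating(beta_education)
percent_increase = (ratio - 1.0) * 100.0

print(f"odds ratio: {ratio:.2f}, increase: {percent_increase:.0f}%")
# prints "odds ratio: 1.86, increase: 86%"
```

The rounded coefficient gives about 86%, in line with the roughly 85% figure reported above; a small difference of this size is what rounding of β to two decimals would produce.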

Relevance

For accessibility practitioners working with or advising teams that deploy generative AI, this paper offers concrete, testable interface patterns and a sobering picture of their limits. User-facing education appears genuinely helpful for raising awareness and shifting behaviour, and is implementable today without model retraining; LLM-based feedback, however, is not yet reliable enough to carry the debiasing load and risks being taken at face value by users. The finding that prompters often want "realistic-looking disabled" subjects points to a real tension between disability visibility and stereotype avoidance that content guidelines, stock libraries, and AI tool defaults will all have to confront. Limitations include a DALL-E 3-only evaluation, a US sample recruited through Prolific that skewed nondisabled, and no blind or low-vision participants in the prompting studies — leaving open questions about other T2I models, other cultural contexts, and T2I use by disabled prompters themselves.

Tags: AI bias · generative AI · text-to-image · disability representation · disability stereotypes · AI auditing · end-user auditing · prompt engineering · AI over-reliance · DALL-E