What's in an ALT Tag? Exploring Caption Content Priorities through Collaborative Captioning

Annika Muehlbradt, Shaun K. Kane · 2022 · ACM Transactions on Accessible Computing · doi:10.1145/3507659

Summary

This paper investigates what makes a good image caption through a novel collaborative captioning methodology. Six pairs of blind and sighted partners (married couples, friends, and roommates) worked together to create and refine captions for 15 social media photographs across three categories: people, places, and events. The pairs worked in Google Docs with JAWS screen reader support, so both partners could contribute to a shared document in real time. The research addresses a fundamental tension in image captioning: sighted captioners do not know what information blind users need, while blind users cannot judge how complete a caption is without seeing the image. Making captioning collaborative let the researchers observe the negotiation process, capture the questions blind users ask about images, and document how caption content priorities differ across image types and between individuals. Each session lasted approximately two hours: participants first wrote detailed captions together, then shortened them to identify which elements were most essential. The study yielded 72 initial captions averaging 44.9 words each, plus shortened versions that showed which content participants considered expendable and which essential. The analysis combined conversation coding, parts-of-speech analysis, cosine similarity measures, and post-task interviews to understand both the process and the products of collaborative captioning.

Key findings

Caption content varied dramatically by image category. For people images, participants emphasized clothing details, accessories, colors (mentioned 126 times), gender, and age. For places, they focused on architectural structures, materials, and shapes, but surprisingly used few directional or positional words to describe spatial relationships. For events, captions described both settings and activities, with more directional language to explain composition.

Blind participants asked an average of four follow-up questions per image, totaling 198 questions across the study. The most common question types concerned object attributes such as color and size (36.3%), image composition (18.3%), aesthetics (11.1%), and people's actions or intentions (10.2%). Questions about emotional content ("Is it a happy picture?") were least common at 9%. These questions reveal information gaps that captions typically fail to address.

The shortening task showed that participants converged on essential content: shortened captions were significantly more similar to each other than initial captions (cosine similarity 0.11 vs 0.08). Participants preferred restructuring sentences to removing content, and when forced to cut, they removed redundant adjectives, implied context, and accessory details while preserving core descriptions. Sighted partners were more reluctant to remove content, with some insisting that all details were integral to understanding.

The collaborative process also revealed important dynamics. Sighted partners generated 69.3% of caption text, and blind participants rated collaboration as more difficult (average 3.5 on a 5-point scale) than sighted partners did (2.83). Cross-ability collaboration faced accessibility barriers: screen readers struggled with real-time document editing, leading many blind participants to dictate rather than type.
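The caption-similarity comparison above can be illustrated with a small sketch. This summary does not say how the authors turned captions into vectors, so the version below assumes plain bag-of-words term counts; the function names and the sample captions are hypothetical and are not taken from the study.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(caption):
    """Lowercase a caption and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", caption.lower())


def cosine_similarity(caption_a, caption_b):
    """Cosine similarity between bag-of-words count vectors.

    Assumption: raw term counts with no stop-word removal or weighting.
    The study may have used a different vectorization.
    """
    counts_a = Counter(tokenize(caption_a))
    counts_b = Counter(tokenize(caption_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty caption shares nothing with any other
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(captions):
    """Average cosine similarity over every unordered pair of captions."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Invented captions for one image, written by three hypothetical pairs.
    initial = [
        "A woman in a red coat and a wool hat stands outside a brick cafe.",
        "Smiling lady wearing a long coat, holding a cup near a doorway.",
        "Person outside a coffee shop on a cold day with a scarf and gloves.",
    ]
    shortened = [
        "A woman in a red coat outside a cafe.",
        "A woman in a coat near a cafe doorway.",
        "A woman outside a coffee shop.",
    ]
    print("initial:  ", round(mean_pairwise_similarity(initial), 3))
    print("shortened:", round(mean_pairwise_similarity(shortened), 3))
```

A higher mean for the shortened set than for the initial set would correspond to the kind of convergence the study reports, although the absolute values depend heavily on the tokenization and weighting chosen.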

Relevance

This research has immediate implications for anyone writing alt text or developing captioning guidelines. The finding that blind users ask about aesthetics, emotions, and composition, and not only about object attributes, challenges the common assumption that captions should focus only on "objective" visual facts. The question categories identified (attributes, composition, aesthetics, actions, emotions) could serve as a checklist for caption authors. The study also demonstrates that caption quality is inherently contextual: what matters depends on the image type, the audience, and the relationship between captioner and consumer. Generic captioning guidelines may be insufficient; the paper suggests that future captioning systems could incorporate question prompts customized by image category to help authors include commonly requested information. For practitioners, the shortening task findings are particularly useful: when captions must be brief, prioritize core subject descriptions over accessories and implied context. The convergence in shortened captions suggests there is some consensus about essential content, even when initial approaches vary widely. However, the significant individual differences are a reminder that personalization matters: what one blind user considers essential, another may find irrelevant.
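As a rough illustration of the checklist and category-specific prompt ideas, the sketch below combines the five question categories with the content each image category tended to emphasize. The prompt wording, the names `QUESTION_PROMPTS`, `CATEGORY_FOCUS`, and `prompts_for`, and the way the two lists are combined are all assumptions made for illustration; the paper proposes the idea but this summary does not describe a concrete design.

```python
# Question categories reported in the study, each paired with a
# hypothetical prompt wording (the wording is not from the paper).
QUESTION_PROMPTS = {
    "attributes": "What are the colors and sizes of the main objects?",
    "composition": "How are the elements arranged in the image?",
    "aesthetics": "What does the image look like overall?",
    "actions": "What are the people doing, and what do they seem to intend?",
    "emotions": "What mood does the picture convey? Is it a happy picture?",
}

# Content that participants emphasized for each image category,
# as listed in the key findings.
CATEGORY_FOCUS = {
    "people": ["clothing", "accessories", "colors", "gender", "age"],
    "places": ["architectural structures", "materials", "shapes"],
    "events": ["setting", "activities"],
}


def prompts_for(category):
    """Return caption-author prompts for one image category.

    The first prompt covers the category-specific focus; the rest are
    the five general question categories, in the order defined above.
    """
    if category not in CATEGORY_FOCUS:
        raise ValueError(f"unknown image category: {category!r}")
    focus = ", ".join(CATEGORY_FOCUS[category])
    prompts = [f"Have you described the {focus}?"]
    prompts.extend(QUESTION_PROMPTS.values())
    return prompts


if __name__ == "__main__":
    for line in prompts_for("places"):
        print("-", line)
```

Because the study found substantial individual differences, a real system would probably let each reader or author reorder or disable prompts instead of fixing one list per category.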

Tags: image descriptions · alt text · collaborative accessibility · cross-ability collaboration · screen readers · social media accessibility

Standards referenced: WCAG