Expanding Perspectives to Improve Access to Visual Archives through Multimodal Image Enrichment
Karina Rodriguez Echavarria, Myrsini Samaroudi · 2026 · ACM Journal on Computing and Cultural Heritage · doi:10.1145/3771993
Summary
This paper addresses a pervasive challenge in the Galleries, Libraries, Archives and Museums (GLAM) sector: large-scale visual collections that have been digitised but remain undiscoverable because they lack descriptive metadata. The authors, from the University of Brighton, propose a multimodal AI-driven workflow for automated metadata enrichment and demonstrate it on the Design Council Photographic Library's glass plate negatives dataset: approximately 9,300 digitised black-and-white images of 20th-century British industrial design, which had almost no metadata beyond their physical box identifiers.
The workflow operates in four stages. First, all images are hosted through an IIIF (International Image Interoperability Framework) server, supporting compliance with the FAIR principles (Findable, Accessible, Interoperable, Reusable). Second, expert labels are generated by fine-tuning the DINOv2 foundation model on ~2,000 manually labelled images covering the Design Council's 35-category taxonomy, producing top-three category predictions for every image. Third, non-expert labels (free-text captions) are generated using the BLIP vision-language model, providing natural language descriptions that go beyond the formal taxonomy. Fourth, a Search and Browse interface combines a 3D scatter plot (UMAP dimensionality reduction of image embeddings) with a text-based semantic search (GloVe word embeddings), allowing users to query the collection in natural language and explore it spatially.
The paper explicitly frames the accessibility goal as dual: enabling access for both expert researchers (using the domain taxonomy) and non-expert users (using everyday language, personal associations, or emotional descriptors such as "cosy" or "minimalistic design"). Two domain-expert archivists evaluated the system, providing structured feedback via the short version of the User Experience Questionnaire (UEQ-S) and the NASA Task Load Index.
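The text-based search in the fourth stage can be illustrated with a minimal sketch. The summary states only that GloVe word embeddings are used, so everything else here is an assumption made for illustration: captions and queries are embedded by averaging per-word vectors and ranked by cosine similarity, the three-dimensional vectors are hand-made stand-ins for real GloVe vectors, and the names `embed`, `cosine`, and `search` are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np

# Toy 3-dimensional word vectors standing in for real GloVe embeddings
# (hypothetical values chosen for illustration only).
WORD_VECTORS = {
    "chair":  np.array([0.9, 0.1, 0.0]),
    "wooden": np.array([0.7, 0.3, 0.1]),
    "cosy":   np.array([0.6, 0.4, 0.2]),
    "lamp":   np.array([0.1, 0.9, 0.1]),
    "metal":  np.array([0.0, 0.3, 0.9]),
    "kettle": np.array([0.1, 0.2, 0.8]),
}

# Hypothetical machine-generated captions keyed by image identifier.
CAPTIONS = {
    "img_001": "a wooden chair",
    "img_002": "a metal kettle",
    "img_003": "a lamp",
}


def embed(text):
    """Average the vectors of all known words; return None if none are known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def search(query, captions=CAPTIONS, top_k=2):
    """Return up to top_k (image_id, similarity) pairs, most similar first."""
    query_vec = embed(query)
    if query_vec is None:
        return []  # no query word has an embedding
    scored = []
    for image_id, caption in captions.items():
        caption_vec = embed(caption)
        if caption_vec is not None:
            scored.append((image_id, cosine(query_vec, caption_vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # An everyday descriptor retrieves images whose captions never contain it.
    for image_id, score in search("cosy"):
        print(image_id, round(score, 3))
```

In this toy setup the query "cosy" ranks the wooden chair first even though the word appears in no caption, which is the behaviour that lets non-expert vocabulary reach images labelled in other terms.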
Key findings
The fine-tuned DINOv2 classifier achieved 73% accuracy on the expert taxonomy, and the correct category appeared among the top three predictions in 88% of cases, a practical result for a dataset of approximately 9,300 previously uncatalogued images. The AI approach is estimated to reduce a 62-working-day manual cataloguing effort to a tractable automated workflow, democratising access to collections that would otherwise remain invisible. The BLIP captioning model successfully generated descriptive non-expert labels, though it showed sensitivity to image rotation (the glass plates were digitised at arbitrary orientations). Even low-confidence classifications yielded useful descriptive terms. The dual expert/non-expert labelling strategy is a key contribution: users can retrieve content through formal Design Council categories or through everyday descriptors, mental-health associations, emotional responses, or stylistic terms.
The 3D scatter-plot visualisation offered more spatial affordances than equivalent 2D views but created usability problems in practice. Expert users struggled to maintain orientation within the 3D space, lost track of which data points they had already explored, and did not discover or use the Modebar toolbar. The researchers identified a need for visual wayfinding aids, such as markers for visited versus unvisited points and colour and shape coding, to reduce cognitive load and support systematic exploration of large-scale datasets. Text-based semantic search was found to be complementary to the spatial visualisation and more immediately usable.
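The two headline classifier figures correspond to top-1 and top-3 accuracy. The sketch below shows one way such figures could be computed from per-category scores; the score matrix and labels are made-up toy values (four images, five categories), and `top_k_accuracy` is a hypothetical helper, not code from the paper.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Column indices of the k largest scores in each row (order is irrelevant).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy per-category scores for four images over five categories.
scores = [
    [0.50, 0.20, 0.15, 0.10, 0.05],  # true label 0: ranked first
    [0.30, 0.40, 0.20, 0.05, 0.05],  # true label 0: ranked second
    [0.10, 0.15, 0.05, 0.30, 0.40],  # true label 1: ranked third
    [0.40, 0.30, 0.20, 0.06, 0.04],  # true label 4: outside the top three
]
labels = [0, 0, 1, 4]

top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
print(top1, top3)  # 0.25 0.75
```

Reporting the top three predictions rather than one trades precision for recall: an archivist reviewing suggestions sees the correct category more often, at the cost of checking up to three candidates per image.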
Relevance
For digital accessibility practitioners, this paper's significance lies at the intersection of two familiar concerns: making content discoverable for diverse audiences, and designing interfaces that do not overwhelm users cognitively. The FAIR principles, particularly the "Accessible" dimension, provide a shared framework between accessibility practice and digital heritage, and the paper operationalises FAIR accessibility as a technical pipeline rather than merely a policy aspiration. The finding that non-expert, emotionally or experientially motivated search terms can retrieve culturally significant content has direct relevance to inclusive design: taxonomies built by domain specialists often fail users who lack specialist vocabulary, including users with cognitive disabilities, people from non-Western cultural contexts, or simply members of the general public. The 3D visualisation findings are a case study in cognitive load and spatial accessibility: the interface had sufficient features but insufficient wayfinding support, echoing WCAG guidance on predictable navigation and orientation. Practitioners working on digital archive accessibility, museum interface design, or AI-assisted content description will find the methodology and the failure modes of the 3D interface particularly instructive.
Tags: cultural heritage · metadata enrichment · AI image classification · FAIR principles · information discovery · visual archives · multimodal AI · accessibility · museum accessibility · cognitive load · IIIF · GLAM
Standards referenced: FAIR Principles · IIIF (International Image Interoperability Framework)