Vision-Language Model

Also known as: VLM, Multimodal AI Model, Large Multimodal Model

An artificial intelligence model that processes and reasons jointly over visual input (images or video) and text. Vision-language models such as GPT-4o, Claude, and Gemini can describe images, answer questions about visual content, and generate text from visual input. In accessibility, these models enable new forms of visual feedback for blind users, including aesthetic evaluation of images, detailed scene descriptions, and interactive question answering about visual content. Open challenges include limited accuracy, bias, and the difficulty of conveying subjective visual qualities through language.

Category: artificial intelligence · assistive technology

Related: Generative AI · Image Recognition · Alt Text · Visual Question Answering

Sources

https://doi.org/10.1145/3663547.3746345