Multimodal Large Language Model

Also known as: MLLM, Vision-Language Model, VLM

A deep learning model that can process and generate content across multiple modalities, including text, images, audio, and video. In accessibility contexts, MLLMs such as GPT-4o, Gemini, and Claude have become transformative tools for blind and low vision users, enabling on-demand description of visual content through services such as Be My AI and Seeing AI. MLLMs can generate detailed, context-aware image descriptions that cover scene content, text reading, chart interpretation, and subjective assessments. However, they are prone to hallucination (fabricating content that is not in the image), misinterpretation, and omission. These reliability problems are particularly dangerous for users who cannot visually verify the output.

Category: artificial intelligence · assistive technology

Related: AI Hallucination · Image Description · Visual Access Technology · AI Trust Calibration

Sources

https://dl.acm.org/doi/10.1145/3663547.3746393