Voice and Video-Capable Language Model
Also known as: VVLM, Multimodal AI Assistant, Video-Capable LLM
A large language model that processes real-time or near-real-time video and audio input alongside text, enabling conversational interaction about the visual world. VVLMs mark a shift from static image analysis (question answering about a single photo) toward dynamic, continuous visual understanding. In accessibility, VVLMs have potential as navigation and environmental-awareness tools for blind and low-vision users. Current limitations include reliance on sampled frames rather than true continuous video analysis, hallucinated responses, and weak egocentric spatial reasoning (judging where objects are relative to the user's own position).
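As a rough illustration of the frame-sampling limitation, the sketch below assumes a hypothetical pipeline that forwards only a fixed number of frames per second to the model. The function names, the 30 fps camera rate, and the 1 fps sampling rate are illustrative assumptions, not details of any particular system.

```python
def sample_frame_indices(total_frames: int, native_fps: float, sample_fps: float) -> list[int]:
    """Return the indices of the frames a sampling pipeline would forward.

    Hypothetical helper: real systems may sample adaptively or by timestamp.
    """
    if total_frames < 0 or native_fps <= 0 or sample_fps <= 0:
        raise ValueError("frame count must be non-negative and rates positive")
    # Keep every `step`-th frame; never skip fewer than one frame at a time.
    step = max(1, round(native_fps / sample_fps))
    return list(range(0, total_frames, step))


def event_is_seen(event_start: int, event_end: int, sampled: list[int]) -> bool:
    """True if at least one sampled frame falls inside the event (inclusive)."""
    return any(event_start <= i <= event_end for i in sampled)


# Three seconds of 30 fps video, sampled at 1 frame per second (assumed rates).
sampled = sample_frame_indices(total_frames=90, native_fps=30, sample_fps=1)

# A brief event spanning frames 10-20 falls entirely between sampled frames.
missed = not event_is_seen(10, 20, sampled)
```

Under these assumptions only three of the ninety frames reach the model, so anything that appears and disappears between two sampled frames is invisible to it.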
Category: artificial intelligence · assistive technology · visual impairment
Related: Visual Question Answering · AI Hallucination · Visual Grounding