Voice and Video-Capable Language Model

Also known as: VVLM, Multimodal AI Assistant, Video-Capable LLM

A large language model that can process real-time or near-real-time video and audio input alongside text, enabling conversational interaction about the visual world. VVLMs represent a shift from static image analysis (single-photo question answering) toward dynamic, continuous visual understanding. In accessibility, VVLMs show potential as navigation and environmental-awareness tools for blind and low-vision users. Current limitations include reliance on frame sampling rather than true continuous video analysis, hallucinated responses, and weak egocentric spatial reasoning.
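The frame-sampling limitation can be illustrated with a minimal sketch. It assumes a hypothetical sampler that forwards frames to the model at a fixed rate. The function names and the 1 frame-per-second rate are illustrative assumptions, not details from the cited source or from any particular VVLM. The sketch shows that an event falling entirely between two sampled frames never reaches the model.

```python
import math


def sample_timestamps(duration_s: float, sample_fps: float) -> list[float]:
    """Return the timestamps (in seconds) of the frames a fixed-rate
    sampler would forward to the model. Hypothetical helper."""
    if duration_s < 0 or sample_fps <= 0:
        raise ValueError("duration must be >= 0 and sample_fps must be > 0")
    # Frames are taken at t = 0, 1/fps, 2/fps, ... up to the clip duration.
    count = math.floor(duration_s * sample_fps) + 1
    return [i / sample_fps for i in range(count)]


def is_event_seen(start_s: float, end_s: float, timestamps: list[float]) -> bool:
    """True if at least one sampled frame falls inside the event window."""
    return any(start_s <= t <= end_s for t in timestamps)


# Assumed rate of 1 frame per second over a 3-second clip.
frames = sample_timestamps(3.0, 1.0)

# A half-second event between two samples is never observed.
missed = is_event_seen(1.2, 1.7, frames)

# The same event is observed if the (assumed) rate is doubled.
caught = is_event_seen(1.2, 1.7, sample_timestamps(3.0, 2.0))
```

Under these assumptions, a brief hazard such as a cyclist crossing the camera's view between two samples would be invisible to the model, which is one reason sampled input differs from true continuous video analysis.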

Category: artificial intelligence · assistive technology · visual impairment

Related: Visual Question Answering · AI Hallucination · Visual Grounding

Sources

https://doi.org/10.1145/3663547.3749833