Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

Search results

Variation Summary: A concise presentation format for AI-generated image descriptions that explicitly organizes information into three categories: agreements (claims supported by all or most models), disagreements (claims where models conflict), and unique mentions (information provided by only one…
Variation Surfacing (also: Variation Display, Surfacing Variations): A technique for helping users assess AI reliability by generating multiple responses from one or more AI models and systematically presenting the differences, agreements, and unique mentions across those responses. In the context of image descriptions for blind and low vision…
Variation-Aware Description: A presentation format for AI-generated image descriptions that aggregates multiple model responses into a single coherent, hierarchical description while highlighting variations inline. When multiple AI models describe the same image, a variation-aware description combines their…
Video Summarization (also: Video Summary, Video Condensation): The process of creating a shortened version of a video that captures its key content, either through extractive methods (selecting key segments) or abstractive methods (generating new condensed content). Video summarization is an emerging accessibility tool that can make…
Vision Language Model (also: VLM, Vision-Language Model, Multimodal Large Language Model): A machine-learning model trained to take both images and natural-language text as input and to produce natural-language output. Modern VLMs, such as GPT-4o, Gemini, and Claude, can describe a photo, read text inside an image, answer questions about a scene, identify objects,…
Vision-Language Model (also: VLM, Multimodal AI Model, Large Multimodal Model): An artificial intelligence model that can process and reason about both visual (image/video) and textual information simultaneously. Vision-language models like GPT-4o, Claude, and Gemini can describe images, answer questions about visual content, and generate text based on…
Vision-and-Language Navigation (also: VLN): Vision-and-language navigation is a task setup in which an agent follows natural-language instructions to move through a visual environment, grounding words like 'turn left at the blue sofa' onto what it sees in real time. Research in VLN has moved from small indoor simulators…
Visual Access Technology (also: Visual Assistance Technology, Visual Access Tools): Technologies that help blind and low vision people understand visual content in both digital and physical environments. Traditional visual access technologies include screen readers, magnification software, and human-powered description services (like Be My Eyes with volunteer…
Visual Dialogue (also: Visual Dialog, VisDial): Visual dialogue is an AI task that involves holding a multi-turn natural language conversation about visual content such as an image or video frame. Unlike single-turn visual question answering (VQA), visual dialogue systems maintain context across multiple exchanges, using…
Visual Document Understanding (also: VDU, Document Understanding): A field of AI research focused on the interpretation and analysis of visually-rich digital documents such as forms, tables, menus, reports, receipts, and academic papers. Visual document understanding goes beyond basic OCR text extraction by comprehending the spatial layout,…
Visual Grounding (also: Grounded Visual Understanding): The ability of an AI model to connect its language output to specific elements actually present in the visual input, ensuring that descriptions and responses are anchored to real objects and scenes rather than generated from learned patterns or assumptions. Poor visual grounding…
Visual Interpreter (also: Visual Interpreter Service, Visual Description Service, VIDS): A visual interpreter or description service (VIDS) is a technology or human-powered service that provides people who are blind or have low vision with descriptions of their visual surroundings, typically by receiving camera feeds from the user's smartphone or smart glasses.…
Visual Language Model (also: VLM, Vision-Language Model): AI models that can process and reason about both visual and textual information, combining computer vision with large language model capabilities. VLMs could potentially enhance assessment descriptors by providing contextually rich and customizable descriptions of visual…
Visual Question Answering (also: VQA): A task in which a system receives an image and a natural language question about that image, then generates a natural language answer. VQA emerged as a key accessibility paradigm through services like VizWiz, where blind users could submit photos with questions and receive…
Visual Verification (also: Visual Fact-Checking): The process of confirming the accuracy of information by visually inspecting the original source material. In accessibility contexts, visual verification represents a fundamental challenge for blind and low vision users who cannot directly compare AI-generated descriptions…
Visual question answering (also: VQA, Visual QA): A computer vision and natural language processing task in which a system answers natural language questions about the content of an image or video. In accessibility contexts, VQA enables blind and visually impaired users to query visual content interactively, asking specific…
VizWiz: A mobile application and research platform that allows blind people to take photos with their phones and receive answers to visual questions from human workers or AI systems. VizWiz originated as a research project at the University of Rochester and has generated important…
Voice Assistant (also: Virtual Assistant, Smart Speaker): An AI-powered system that responds to voice commands to perform tasks, answer questions, and control devices, such as Amazon Alexa, Google Assistant, and Apple Siri. Voice assistants have accessibility potential for people with vision impairments by providing hands-free,…
Voice Cloning (also: Voice Synthesis Cloning, Personalized Text-to-Speech): The use of machine-learning models to synthesise a target speaker's voice from a short reference recording, enabling text-to-speech output that sounds like that specific person. For accessibility, voice cloning has transformative potential: people whose voices are at risk of…
Voice and Video-Capable Language Model (also: VVLM, Multimodal AI Assistant, Video-Capable LLM): A large language model that can process real-time or near-real-time video and audio input alongside text, enabling conversational interaction about the visual world. VVLMs represent a shift from static image analysis (single photo question-answering) to dynamic, continuous…
Voice-Activated Personal Assistant (also: VAPA, Voice Assistant, Virtual Assistant): AI-powered software that responds to spoken commands to perform tasks such as scheduling, setting reminders, searching information, and controlling devices. Examples include Siri, Alexa, Google Assistant, and Cortana. For blind and low vision users, VAPAs offer hands-free…
Voice-activated personal assistant (also: VAPA, Smart assistant, Virtual assistant): An AI-powered software agent that responds to voice commands to perform tasks such as answering questions, controlling smart home devices, managing schedules, and reading content aloud. For people with visual impairments, VAPAs like Amazon Alexa, Google Assistant, and Apple Siri…

22 results.
