Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

Search results

Time-Causal Model (also: Temporal Causal Model, Sequential Logic Model): A computational model that enforces temporal coherence in predictions by ensuring that the sequence of recognized events follows a logical causal order. In recipe tracking, a time-causal model prevents the system from predicting that an earlier step is currently happening after…
Transfer Learning: A machine learning technique where a model trained on a large general dataset is adapted to perform a new, more specific task using a much smaller amount of new training data. Rather than training a model from scratch, transfer learning leverages patterns already learned by an…
UI Detection (also: User Interface Detection, GUI Element Detection): The use of computer vision and machine learning to automatically identify and classify user interface elements (buttons, text fields, icons, toggles, etc.) from screenshots or screen pixels. In accessibility contexts, UI detection enables systems to generate accessibility…
VQA (also: Visual Question Answering): VQA (Visual Question Answering) is an AI task in which a system answers natural-language questions about the content of an image. In assistive contexts, VQA systems such as Be My AI, Seeing AI, and Aira let blind and low-vision users ask about their visual surroundings - from…
Video Inpainting (also: Video Fill, Content-Aware Video Fill): A computer vision technique that fills in removed or missing regions of a video frame with plausible content generated based on surrounding visual information. Video inpainting is used in accessibility applications to seamlessly remove distracting visual elements (overlays,…
Video Segmentation (also: Scene Segmentation, Video Scene Detection): The process of dividing a video into meaningful segments or scenes based on visual changes, content shifts, or thematic transitions. Video segmentation enables granular customization and navigation, allowing viewers to apply different settings to different parts of a video or…
Vision Language Model (also: VLM, Vision-Language Model, Multimodal Large Language Model): A machine-learning model trained to take both images and natural-language text as input and to produce natural-language output. Modern VLMs, such as GPT-4o, Gemini, and Claude, can describe a photo, read text inside an image, answer questions about a scene, identify objects,…
Visual Dialogue (also: Visual Dialog, VisDial): Visual dialogue is an AI task that involves holding a multi-turn natural language conversation about visual content such as an image or video frame. Unlike single-turn visual question answering (VQA), visual dialogue systems maintain context across multiple exchanges, using…
Visual Document Understanding (also: VDU, Document Understanding): A field of AI research focused on the interpretation and analysis of visually-rich digital documents such as forms, tables, menus, reports, receipts, and academic papers. Visual document understanding goes beyond basic OCR text extraction by comprehending the spatial layout,…
Visual Grounding (also: Grounded Visual Understanding): The ability of an AI model to connect its language output to specific elements actually present in the visual input, ensuring that descriptions and responses are anchored to real objects and scenes rather than generated from learned patterns or assumptions. Poor visual grounding…
Visual Inertial Odometry (also: VIO): A motion tracking technique that combines camera-based visual tracking with inertial sensor data (gyroscopes and accelerometers) to estimate a device's position and orientation in 3D space with high accuracy. VIO works by tracking salient visual features across consecutive video…
Visual Interpreter (also: Visual Interpreter Service, Visual Description Service, VIDS): A visual interpreter or description service (VIDS) is a technology or human-powered service that provides people who are blind or have low vision with descriptions of their visual surroundings, typically by receiving camera feeds from the user's smartphone or smart glasses.…
Visual Language Model (also: VLM, Vision-Language Model): AI models that can process and reason about both visual and textual information, combining computer vision with large language model capabilities. VLMs could potentially enhance assessment descriptors by providing contextually rich and customizable descriptions of visual…
Visual Layout Analysis (also: Layout Analysis, Document Layout Analysis): The automated process of examining the spatial arrangement and visual properties of elements within a document to infer meaningful structural relationships between them. In accessibility contexts, visual layout analysis is used to automatically generate metadata about how…
Visual Question Answering (also: VQA): A task in which a system receives an image and a natural language question about that image, then generates a natural language answer. VQA emerged as a key accessibility paradigm through services like VizWiz, where blind users could submit photos with questions and receive…
Visual Saliency (also: Saliency Detection, Visual Attention Prediction): A computer vision concept referring to the degree to which visual elements attract attention compared to their surroundings. Saliency detection models predict which parts of an image or video frame will draw the viewer's eye first, based on factors like contrast, color, motion,…
Visual Saliency (also: Saliency, Saliency Detection, Saliency Map): A computational measure of how much a particular region of an image or video stands out from its surroundings and attracts visual attention. Saliency models predict where people are most likely to look based on factors such as contrast, colour, motion, and semantic content. In…
Visual question answering (also: VQA, Visual QA): A computer vision and natural language processing task in which a system answers natural language questions about the content of an image or video. In accessibility contexts, VQA enables blind and visually impaired users to query visual content interactively, asking specific…
Visual-Inertial Odometry (also: VIO): A computer vision technique that combines camera imagery with motion sensor data (accelerometer and gyroscope) to track a device's position and orientation in 3D space. In accessibility applications, VIO enables smartphones to maintain awareness of object positions even when…
Watershed Algorithm (also: Watershed Segmentation, Watershed Transform): An image segmentation technique inspired by geographical hydrology, where the gradient magnitude of an image is treated as a topographical surface. The algorithm simulates water flowing downhill from each pixel to local minima, forming catchment basins that define segmented…
Wearable Camera (also: Body-worn Camera, Head-mounted Camera, Egocentric Camera): A camera worn on the body (typically mounted on glasses, a hat, or the chest) that captures images or video from the wearer's perspective (egocentric view). In assistive technology for blind and low vision users, wearable cameras coupled with computer vision can provide…
YOLO (also: You Only Look Once): YOLO (You Only Look Once) is a real-time object detection algorithm that identifies and locates objects within images or video frames in a single pass through a neural network. In accessibility applications, YOLO enables systems to automatically detect objects, people, and…
YOLO (You Only Look Once) (also: YOLO, YOLOv8, YOLO Object Detector): A family of real-time object detection neural networks that predict bounding boxes and class labels in a single forward pass over an image, rather than using a two-stage propose-then-classify pipeline. YOLO has become a workhorse detector for accessibility research and assistive…
