Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

Search results

Pattern Recognition: A branch of machine learning and artificial intelligence focused on identifying regularities, patterns, and structures in data such as images, sounds, or sensor readings. In accessibility, pattern recognition is fundamental to technologies like sign language recognition systems…
Perceptual Linear Prediction(also: PLP, PLP Coefficients): Perceptual Linear Prediction (PLP) is an acoustic feature extraction technique used in speech processing that models human auditory perception. PLP analysis applies psychoacoustic principles including critical band frequency resolution, equal-loudness pre-emphasis, and…
Perplexity(also: Language Model Perplexity): A standard metric for evaluating language models that measures how well the model predicts a sample of text. Mathematically, perplexity is the inverse probability of the test set, normalised by the number of words — a lower perplexity indicates that the model assigns higher…
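As a rough illustration of the Perplexity entry above, the sketch below computes perplexity from the probabilities a model assigned to each word of a test text. The function name `perplexity` and the toy probability lists are assumptions made for this example, not part of any particular toolkit. It works in log space, which is equivalent to taking the N-th root of the inverse probability of the test set.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test text, given the probability the model gave each word.

    Equivalent to (1 / P(w_1 ... w_N)) ** (1 / N), computed in log space
    to avoid underflow on long texts. Assumes every probability is > 0.
    """
    if not word_probs:
        raise ValueError("need at least one word probability")
    n = len(word_probs)
    total_log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-total_log_prob / n)


# Hypothetical per-word probabilities from two models on the same 4-word text.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

# The model that assigns higher probability to the text gets lower perplexity.
print(perplexity(confident_model) < perplexity(uncertain_model))  # True
```

A model that spreads its probability uniformly over k choices at every step has a perplexity of about k, which is why perplexity is often read as an effective branching factor.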
Pointwise Mutual Information(also: PMI): A statistical measure used in natural language processing to quantify the strength of association between two words based on how much more frequently they co-occur in a corpus than would be expected by chance. PMI is calculated as the logarithm of the ratio of the observed…
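A minimal sketch of the Pointwise Mutual Information entry above, using the standard definition PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The helper name `pmi`, the toy corpus, and the choice to count two words as co-occurring when they appear in the same sentence are all assumptions for illustration. Real systems often use a sliding window instead.

```python
import math


def pmi(word_a, word_b, contexts):
    """PMI in bits, where each context is a list of tokens (e.g. one sentence).

    Probabilities are estimated as the fraction of contexts containing the
    word (or both words). Returns -inf when the words never co-occur.
    """
    n = len(contexts)
    count_a = sum(1 for c in contexts if word_a in c)
    count_b = sum(1 for c in contexts if word_b in c)
    count_ab = sum(1 for c in contexts if word_a in c and word_b in c)
    if count_ab == 0:
        return float("-inf")
    p_a, p_b, p_ab = count_a / n, count_b / n, count_ab / n
    return math.log2(p_ab / (p_a * p_b))


# Hypothetical toy corpus of tokenised sentences.
corpus = [
    ["screen", "reader"],
    ["screen", "reader", "user"],
    ["braille", "display"],
    ["screen", "magnifier"],
]

# Positive: "screen" and "reader" co-occur more often than chance predicts.
print(pmi("screen", "reader", corpus) > 0)  # True
```

A PMI of zero means the two words co-occur exactly as often as independence would predict. Estimates for rare words are noisy, so counts are usually thresholded or smoothed in practice.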
Polar Motion Profile(also: PMP): A Polar Motion Profile (PMP) is a computational technique used in sign language detection that models the quantity and distribution of motion relative to a detected face using polar coordinates. The method captures the characteristic hand and arm movements associated with…
Pose estimation(also: Body pose estimation, Human pose estimation): The computational process of determining the position and orientation of a person's body joints and limbs from sensor data such as cameras, depth sensors, or inertial measurement units. In accessibility contexts, pose estimation enables applications like gesture-based…
Principal Component Analysis(also: PCA): A statistical technique that reduces the dimensionality of data by identifying the principal axes of variation in a dataset. In accessibility and assistive technology contexts, PCA is commonly used in face recognition systems (as the basis of the Eigenfaces method), gesture…
Privacy-Enhancing Data Filters(also: Privacy Filters, Data Obfuscation Filters): Visual or data modifications applied to training datasets that obscure the identity of contributors while preserving the information needed for machine learning tasks. In the context of sign language video, these filters may include face blurring, cel shading, avatar…
Project Sidewalk: An open-source web-based crowdsourcing tool developed at the University of Washington that enables volunteers to virtually audit sidewalk accessibility using Google Street View panoramas. Contributors label four types of accessibility features and problems: curb ramps, missing…
Prompt engineering(also: Prompt design, Prompt crafting): The practice of designing and iteratively refining natural language inputs to large language models to elicit more accurate, relevant, or useful responses. In accessibility contexts, prompt engineering is an emerging skill that enables disabled users to customise AI interactions…
Random Forest Classifier(also: Random Forest): A machine learning algorithm that creates multiple decision trees during training and outputs the class that is the mode of the individual trees' predictions. Random forests are widely used in gesture recognition, activity recognition, and other classification tasks in assistive…
Recurrent Neural Network(also: RNN): A recurrent neural network (RNN) is a type of artificial neural network designed to process sequential data by maintaining an internal state (memory) that captures information from previous inputs in the sequence. Unlike feedforward networks, RNNs have connections that loop…
SHAP(also: SHapley Additive exPlanations): A unified framework for feature-importance explanations of machine-learning models, introduced by Lundberg and Lee in 2017, grounded in Shapley values from cooperative game theory. For any model and input, SHAP assigns each feature a value representing its contribution to that…
SMOTE(also: Synthetic Minority Over-sampling Technique): A data augmentation technique that addresses class imbalance in machine learning datasets by generating synthetic examples of the minority class rather than simply duplicating existing ones. SMOTE creates new instances by interpolating between existing minority class samples and…
Scene Classification(also: Scene Recognition, Scene Understanding): Scene classification is a computer vision task that categorizes images or video frames into predefined scene types such as indoor/outdoor, kitchen, office, or street. For accessibility, scene classification helps automated systems provide context about environments in image…
Semantic Segmentation(also: Pixel-Level Classification, Scene Parsing): A computer vision technique that classifies every pixel in an image into a predefined category, producing a detailed map of what objects are present and where they are located. Unlike object detection (which draws bounding boxes around objects), semantic segmentation provides…
Sequence-to-Sequence(also: Seq2Seq, Encoder-Decoder): A neural network architecture designed for tasks where both input and output are sequences of variable length, such as machine translation, speech recognition, and video captioning. A seq2seq model consists of an encoder that processes the input sequence into a fixed-length…
Sign Language Machine Translation(also: English-to-ASL Translation, Sign Language MT, Text-to-Sign Translation): The automatic translation of written or spoken text into a signed language (or vice versa) using computational methods, typically producing output as an animated signing avatar or, less commonly, as recorded video clips. Because signed languages such as American Sign Language…
Sign language translation(also: SLT, Sign-to-text translation): The automatic conversion of sign language video into written or spoken language text using machine learning. Unlike sign language recognition, which identifies individual signs or glosses, sign language translation produces fluent natural language output that accounts for the…
Signer-Independent Recognition(also: signer-independent SLR): A sign language recognition approach designed to work with signers whose data was not included in the training set. Similar to speaker-independent speech recognition, signer-independent systems must handle variations in signing style, hand size, speed, and regional signing…
Singular Value Decomposition(also: SVD): A mathematical technique that decomposes a matrix into three component matrices, used to reduce high-dimensional data to its most important features while preserving essential relationships. In accessibility research, SVD is a core component of Latent Semantic Analysis and has…
Sound Recognition(also: Sound Classification, Audio Event Detection, Environmental Sound Recognition): Technology that automatically identifies and classifies sounds in a user's environment, typically using machine learning models trained on audio datasets. In accessibility contexts, sound recognition systems help deaf and hard of hearing people become aware of environmental…
Speaker Adaptation(also: Voice Adaptation, Speaker-Adaptive Training, Voice Personalization): Speaker adaptation is the process of adjusting an existing automatic speech recognition (ASR) system — usually one trained on a large, demographically broad corpus of able-bodied speakers — to a particular individual's voice using a relatively small amount of that person's…
Speech Emotion Recognition(also: SER, Vocal Emotion Recognition): A class of machine-learning techniques that infers a speaker's emotional state from acoustic features of speech — pitch contour, intensity, rhythm, spectral properties, voice quality — usually producing a label (happy/sad/angry/calm) or continuous values on valence and arousal…
Speech Language Model(also: SLM, Audio Language Model, Speech Foundation Model): A class of large neural models that processes both speech and text in a single end-to-end framework, integrating tasks — automatic speech recognition, spoken language understanding, dialogue, speech generation — that traditionally required separate modular systems. Examples…
Stable Diffusion: An open-weights latent text-to-image diffusion model released by Stability AI in 2022. It operates by iteratively denoising a random latent tensor, conditioned on text embeddings produced by a frozen CLIP encoder, until the latent can be decoded by a VAE into a coherent image.…
Supervector(also: GMM Supervector): A supervector is a high-dimensional feature representation created by concatenating the mean vectors from all components of a Gaussian Mixture Model (GMM) adapted to a specific speaker or utterance. This concatenation transforms variable-length speech into a fixed-length vector…
Support Vector Machine(also: SVM): A supervised machine learning algorithm used for classification and regression tasks. SVMs work by finding the optimal hyperplane that separates data points into distinct categories in a high-dimensional feature space. In accessibility research, SVMs have been used to detect…
Target Sound Extraction(also: Target Sound Separation, TSE): A machine-learning task in which a model isolates a specific target sound (or class of sounds) from a complex acoustic mixture, conditioned on some specification of the target - a text label, a reference recording, or an embedding. Distinct from blind source separation (which…
Teachable AI(also: Teachable Machine Learning, Interactive Machine Learning): Teachable AI refers to artificial intelligence systems that allow end users to personalize the system by providing their own training examples, high-level constraints, or prompts — without requiring programming or machine learning expertise. In the accessibility context,…
Teachable Object Recognizer(also: Teachable Machine, Personalized Object Recognizer): A machine learning application that allows end users to train custom object recognition models by providing their own example images, rather than relying on pre-trained models with fixed categories. In accessibility contexts, teachable object recognizers empower blind and…
Text-to-Audio(also: Text-to-Audio Generation, TTA): A class of generative AI models that synthesise non-speech sound (environmental sounds, sound effects, music stems) from a text prompt - for example producing the sound of 'leaves rustling in wind' or 'church bells ringing'. Distinct from text-to-speech, which produces spoken…
Topic Modeling(also: LDA, Latent Dirichlet Allocation): A machine learning technique that automatically discovers abstract themes or topics within a collection of documents by analyzing patterns of word co-occurrence. Latent Dirichlet Allocation (LDA) is the most widely used topic modeling algorithm. In accessibility research, topic…
Training Data(also: Training Set, Training Dataset): The collection of labeled examples used to teach a machine learning model to perform a specific task. The quality, quantity, and diversity of training data directly determine how well a model will perform. In accessibility contexts, training data quality is especially important…
Transfer Learning: A machine learning technique where a model trained on a large general dataset is adapted to perform a new, more specific task using a much smaller amount of new training data. Rather than training a model from scratch, transfer learning leverages patterns already learned by an…
Transformer(also: Transformer Model, Transformer Architecture): A deep learning architecture introduced by Vaswani et al. in 2017 that relies entirely on attention mechanisms rather than recurrence (RNNs) or convolution for sequence modeling tasks. Transformers process entire input sequences in parallel using "self-attention" to weigh the…
Trigram(also: 3-gram): A sequence of three consecutive words used in statistical language modeling for word prediction. Trigram models predict the next word based on the two preceding words, capturing more context than simpler unigram (single word) or bigram (two word) models. In AAC word prediction,…
Universal Background Model(also: UBM): A Universal Background Model (UBM) is a large Gaussian Mixture Model trained on speech from many speakers to represent speaker-independent acoustic characteristics. The UBM serves as a reference distribution against which individual speaker models are compared, typically using…
Vision Language Model(also: VLM, Vision-Language Model, Multimodal Large Language Model): A machine-learning model trained to take both images and natural-language text as input and to produce natural-language output. Modern VLMs — such as GPT-4o, Gemini, and Claude — can describe a photo, read text inside an image, answer questions about a scene, identify objects,…
Visual assistance technology(also: VAT, AI visual assistance, Visual interpretation service): Technology that uses artificial intelligence, computer vision, or human volunteers to provide visual information to blind and low-vision users. Examples include apps like Seeing AI, Be My Eyes, and Lookout, which can identify objects, read text, describe scenes, and recognise…
Viterbi Algorithm: The Viterbi algorithm is a dynamic-programming procedure for finding the most likely sequence of hidden states in a Hidden Markov Model given a sequence of observations. It is the standard solution to part-of-speech tagging, many speech-recognition tasks, and decoding problems…
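The dynamic-programming idea in the Viterbi Algorithm entry above can be sketched as follows. The function signature, the dictionary-based HMM layout, and the two-state toy model with its probabilities are assumptions chosen for illustration. The sketch also assumes all probabilities are non-zero so that logarithms are defined.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden-state path, its probability) for an HMM."""
    if not observations:
        raise ValueError("need at least one observation")
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[t][s]: predecessor of s on that best path (unused at t = 0).
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(best[-1][last])


# Hypothetical two-state HMM (toy numbers, not taken from any dataset).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path)  # ['Healthy', 'Healthy', 'Fever']
```

Working in log space turns products of probabilities into sums, which keeps long observation sequences from underflowing. The running time is proportional to the sequence length times the square of the number of states.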
Wav2Vec(also: Wav2Vec2, Wav2Vec 2.0): A family of self-supervised speech representation models from Meta AI that learn rich acoustic embeddings directly from raw waveform audio without requiring transcribed training data. Wav2Vec 2.0, introduced in 2020, became a backbone for low-resource automatic speech…
Whisper(also: OpenAI Whisper, Whisper ASR): An open-source automatic speech recognition (ASR) model released by OpenAI in 2022, trained on 680,000 hours of multilingual and multitask supervised audio data. Whisper supports transcription in dozens of languages, translation into English, language identification, and…
YOLO(also: You Only Look Once): YOLO (You Only Look Once) is a real-time object detection algorithm that identifies and locates objects within images or video frames in a single pass through a neural network. In accessibility applications, YOLO enables systems to automatically detect objects, people, and…
YOLO (You Only Look Once)(also: YOLO, YOLOv8, YOLO Object Detector): A family of real-time object detection neural networks that predict bounding boxes and class labels in a single forward pass over an image, rather than using a two-stage propose-then-classify pipeline. YOLO has become a workhorse detector for accessibility research and assistive…
iVector(also: Identity Vector, i-vector): A low-dimensional representation of voice characteristics widely used in speaker recognition and verification systems. iVectors capture many acoustic aspects of a speaker's voice in a compact form, making them useful for automatically estimating speech intelligibility in people…
