Speech Emotion Recognition

Also known as: SER, Vocal Emotion Recognition

A class of machine-learning techniques that infers a speaker's emotional state from acoustic features of speech (pitch contour, intensity, rhythm, spectral properties, voice quality), usually producing either a categorical label (happy/sad/angry/calm) or continuous values on valence and arousal axes. Modern SER typically uses transformer-based models trained on labelled emotional-speech corpora such as RAVDESS or IEMOCAP. SER is used in affective computing, voice assistants, call-centre analytics, and accessibility research (e.g. driving expressive captions or emotional haptic feedback for d/DHH viewers). Known limitations include demographic bias (performance varies across gender, age, language, and cultural background), poor generalisation from acted to naturalistic speech (models trained on acted recordings often perform worse on spontaneous speech), and high computational cost for real-time use.
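To make the two ends of that pipeline concrete, the sketch below computes two of the acoustic features named above (intensity and pitch) from a synthetic tone, and maps a continuous valence/arousal pair onto the four example labels. It is an illustration only, not how a transformer-based SER model works internally: the function names, the autocorrelation pitch estimator, the 75-400 Hz search range, the [-1, 1] scale with 0 as neutral, and the quadrant-to-label mapping are all assumptions made for this example.

```python
import math


def rms_intensity(samples):
    """Root-mean-square amplitude of a frame: a crude proxy for intensity."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak.

    The fmin/fmax search range is an assumed typical speaking range.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the frame and a copy of itself shifted by `lag`.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def quadrant_label(valence, arousal):
    """Map valence/arousal (assumed in [-1, 1], 0 = neutral) to a coarse label.

    The quadrant assignment is an illustrative assumption, not a standard.
    """
    if arousal >= 0:
        return "happy" if valence >= 0 else "angry"
    return "calm" if valence >= 0 else "sad"


# Synthetic stand-in for a voiced frame: 0.1 s of a 200 Hz tone at 8 kHz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

intensity = rms_intensity(tone)
pitch = estimate_pitch_hz(tone, SAMPLE_RATE)
label = quadrant_label(valence=-0.6, arousal=0.7)
```

In a real system, features like these (or a learned representation of the raw waveform) would feed a trained model that predicts the label or the valence and arousal values. The hard-coded quadrant rule here only shows how the two output styles relate to each other.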

Category: machine learning · affective computing

Related: Affective Computing · Arousal · Valence · Prosody

Sources