Speaker-Aware Affective Captioning for Multi-Speaker STEM Talk in Inclusive Classrooms

Sunday David Ubur, Denis Gracanin, Stephanie P DeHart, Enoch Katey Akli, Fatemeh Sarshartehrani, Sikiru Adewale · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) · doi:10.1145/3772363.3798718

Summary

Ubur and colleagues at Virginia Tech address a specific failure mode of live captioning in classroom and meeting settings: multi-speaker discourse is collapsed into a single text stream that obscures who said what and how it was said. They argue this is especially consequential for Deaf and hard-of-hearing (DHH) students in STEM, where rapid turn-taking, dense terminology, and frequent clarification mean that speaker attribution and pragmatic intent carry real comprehension weight. The paper presents Speaker-Aware Affective Captioning (SAAC), a captioning front end that takes diarization as a given backend and adds three layers on top of speaker-attributed turns: (1) confidence-gated affect tags that combine an emoji and a text label such as [Calm] with the classifier's softmax confidence, and that are suppressed below a 0.4 threshold and replaced with a neutral 'Listening...' status; (2) an on-demand 'AI Describe' micro-summary that uses GPT-4 to paraphrase the current turn, clarifying intent and helping readers recover from ASR errors; and (3) session-bound speaker sign-in for stable name labels. The implementation streams 16 kHz audio to Google Cloud Streaming ASR, runs Wav2Vec2-based emotion classification (8 classes from Russell's circumplex model) on rolling 3-second windows every ~4 s, and broadcasts captions via Socket.IO. The design is grounded in calibrated-trust theory for uncertain AI outputs, signaling principles from multimedia learning, and conversation analysis. The evaluation was a within-subjects pilot (n=16; 10 DHH, 8 hearing) comparing SAAC against a baseline of speaker-attributed captions only, using a 5-item multiple-choice comprehension quiz and NASA-TLX workload ratings, with sessions delivered both in person and over Zoom.
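
The confidence gating in layer (1) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the 0.4 threshold, the [Calm] label, and the 'Listening...' fallback come from the summary above, while the function names, the other two labels, the emoji mapping, and the exact tag format are assumptions made for the example.

```python
import math

CONFIDENCE_THRESHOLD = 0.4       # from the summary: tags below this are suppressed
NEUTRAL_STATUS = "Listening..."  # neutral fallback shown instead of a tag

# Hypothetical label-to-emoji mapping; only "Calm" is named in the summary.
EMOJI = {"Calm": "🙂", "Excited": "😃", "Frustrated": "😠"}


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def affect_tag(labels, logits, threshold=CONFIDENCE_THRESHOLD):
    """Return an emoji + label + confidence tag, or the neutral status
    when the top-class confidence falls below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < threshold:
        return NEUTRAL_STATUS
    label = labels[best]
    return f"{EMOJI.get(label, '')} [{label}] {confidence:.0%}".strip()


labels = ["Calm", "Excited", "Frustrated"]
print(affect_tag(labels, [3.0, 0.5, 0.2]))  # confident: a [Calm] tag is shown
print(affect_tag(labels, [0.0, 0.0, 0.0]))  # uniform scores: falls back to neutral
```

Gating on the top-class probability keeps low-certainty guesses off the screen, which matches the calibrated-trust motivation: the reader sees either a tag with its confidence or an explicit neutral state, never an unqualified guess.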

Key findings

SAAC produced a small but statistically significant comprehension advantage over the speaker-attributed-only baseline: median quiz score 5.0 vs 4.0 (Wilcoxon signed-rank S=31, p=.031, two-tailed), with 15 of 16 participants performing the same or better with SAAC. NASA-TLX workload ratings were broadly comparable across conditions, with effort and perceived performance trending in SAAC's favor, which suggests that the comprehension gain came from improved interpretability rather than added cognitive cost. Speaker attribution itself was rated more reliable in SAAC (S=35, p=.043) and less incorrect or confusing (S=-39, p=.047), even though the underlying diarization was held constant. This indicates that stable session-bound names and the broader supporting layout shape perceived reliability. Trust calibration differed across the layered cues. AI Describe was consistently perceived as helpful (15/16 agreed it clarified meaning) and was used as a recovery scaffold when participants felt lost (11/16). Affect tags drew more cautious trust: most utterances in this academic dialogue were calm, so tone/intent ratings were mixed (4/16 found the tags informative for tone), but 12/16 wanted to keep the feature as a glanceable 'gist' layer for more expressive scenarios. The authors infer a practical hierarchy of dependence in multi-layered captioning: speaker attribution first, then on-demand clarification, then affect.
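
For readers unfamiliar with the paired test used here, the following Python sketch shows how a Wilcoxon signed-rank comparison of two conditions works on small samples. The scores are made up for illustration and are not the study's data, and the function names are hypothetical. Statistical packages also define the reported statistic differently (sum of positive ranks, the smaller of the two rank sums, or a centered version), so the values below should not be read as reproducing the paper's S.

```python
from itertools import product


def signed_ranks(baseline, treatment):
    """Return (W+, W-, ranks) for paired scores.
    Zero differences are dropped; tied magnitudes share an average rank."""
    diffs = [t - b for b, t in zip(baseline, treatment) if t != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over the run of tied absolute differences
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, ranks


def exact_two_sided_p(w_plus, ranks):
    """Exact two-sided p-value by enumerating all sign assignments.
    Feasible only for small samples (2**n assignments)."""
    mean = sum(ranks) / 2
    observed = abs(w_plus - mean)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - mean) >= observed - 1e-9:
            hits += 1
    return hits / 2 ** len(ranks)


# Made-up quiz scores (0-5) for eight hypothetical participants.
baseline = [4, 4, 3, 5, 4, 3, 4, 5]
saac = [5, 5, 4, 5, 5, 4, 3, 5]
w_plus, w_minus, ranks = signed_ranks(baseline, saac)
print(w_plus, w_minus, exact_two_sided_p(w_plus, ranks))
```

Because the test ranks paired differences rather than assuming normality, it suits bounded, tie-heavy outcomes such as a 5-item quiz, although many ties and zero differences reduce its power.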

Relevance

For practitioners building captioning, conferencing, or classroom-AI products, the paper offers a concrete reference architecture for layering diarization, ASR, emotion recognition, and on-demand LLM paraphrase into a usable DHH-facing interface, plus interaction patterns worth borrowing: confidence-gated affect tags with a neutral fallback, paraphrase summaries triggered as a recovery action rather than displayed automatically, and stable session-bound speaker names rather than anonymous turn arrows. The trust hierarchy finding (attribution > paraphrase > affect) is a useful default for prioritizing engineering effort: speaker attribution is foundational, AI summaries are most valuable when the reader is uncertain, and emotion tags are an optional 'gist' layer to be designed conservatively. The caveats are substantial, and the authors flag them: a pilot with n=16, scripted live stimuli with confederate speakers (so prosody and pacing vary across sessions), in-person vs Zoom modality confounded with hearing status, no ablation of individual SAAC layers, and largely calm academic dialogue that limits what can be said about affect tags. Backend metrics (diarization accuracy, ASR word error rate, emotion classifier accuracy) are not reported. The paper is best read as a promising design pattern rather than a definitive evaluation.

Tags: captioning · deaf and hard of hearing · speaker diarization · speech emotion recognition · STEM education · classroom accessibility · automatic speech recognition