CARTGPT: Real-Time Correction of CART Captions Using Large Language Models
Liang-Yuan Wu, Andrea Kleiver, Dhruv Jain · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746326
Summary
This paper introduces CARTGPT, a real-time system that enhances Communication Access Realtime Translation (CART) captions by combining human-generated CART transcripts with automatic speech recognition (ASR) output and using GPT-4 to detect and correct transcription errors. CART captioning uses stenographic keyboards to produce near-verbatim transcripts valued by deaf and hard of hearing (DHH) users for their accuracy and inclusion of speaker cues and contextual sounds, but performance degrades under challenging conditions like background noise, technical jargon, or rapid speech. The research began with a formative study interviewing 10 professional CART captioners who identified four categories of errors: omissions from inaudible or unclear speech (marked with "[inaudible]" or "[indiscernible]"), other word omissions marked with "(?)", untranslate errors from wrong stenographic key presses producing garbled text, and mistranslate errors where incorrect but valid words appear. The CARTGPT pipeline processes two parallel input streams — the CART transcript and a Whisper ASR transcript — aligning them at the clause level using MiniLM sentence embeddings with greedy monotonic matching. When error markers are detected, the system replaces them with a placeholder and prompts GPT-4 with two paragraphs of CART context plus the corresponding ASR text to generate corrections. A post-processing step using WordPiece tokenization reverts any unintended changes to preserve transcript fidelity, ensuring only flagged errors are modified.
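The pipeline steps above (marker masking, clause alignment, and reverting unflagged changes) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. `difflib.SequenceMatcher` stands in for MiniLM embedding similarity, and whitespace tokens stand in for WordPiece tokens. The placeholder string, window size, threshold, and all function names are hypothetical. The GPT-4 call is omitted, so `corrected` below is a hand-written stand-in for a model response.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Markers named in the summary; the regex and placeholder string are assumptions.
MARKER_RE = re.compile(r"\[(?:inaudible|indiscernible)\]|\(\?\)", re.IGNORECASE)
PLACEHOLDER = "<GAP>"


def mask_markers(clause: str) -> Tuple[str, int]:
    """Swap each error marker for the placeholder; return (text, marker count)."""
    return MARKER_RE.subn(PLACEHOLDER, clause)


def similarity(a: str, b: str) -> float:
    """Character-level ratio in [0, 1]; a stand-in for embedding cosine similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_greedy_monotonic(
    cart: List[str], asr: List[str], window: int = 3, threshold: float = 0.3
) -> List[Tuple[int, Optional[int]]]:
    """Pair each CART clause with its best ASR clause, never moving backwards."""
    pairs = []
    start = 0  # earliest ASR index still available for matching
    for i, clause in enumerate(cart):
        candidates = range(start, min(start + window, len(asr)))
        best = max(candidates, key=lambda j: similarity(clause, asr[j]), default=None)
        if best is not None and similarity(clause, asr[best]) >= threshold:
            pairs.append((i, best))
            start = best + 1  # later CART clauses may only match later ASR clauses
        else:
            pairs.append((i, None))  # no sufficiently similar ASR clause in the window
    return pairs


def revert_unflagged(masked: str, corrected: str) -> str:
    """Keep model output only where the diff span covers a placeholder."""
    src, out = masked.split(), corrected.split()
    result: List[str] = []
    matcher = SequenceMatcher(None, src, out, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if any(PLACEHOLDER in tok for tok in src[i1:i2]):
            # Span contains a flagged gap: accept the model's tokens.
            # Caveat: an edit directly adjacent to the gap shares this span
            # and is accepted with it; finer handling is left out of the sketch.
            result.extend(out[j1:j2])
        else:
            # Unflagged span: restore the original tokens, dropping any
            # insertions, deletions, or substitutions the model made here.
            result.extend(src[i1:i2])
    return " ".join(result)


cart = ["the patient has [inaudible] in the left lung",
        "we will schedule a follow up (?)"]
asr = ["the patient has pneumonia in the left lung",
       "we will schedule a follow up next week"]

masked, n_markers = mask_markers(cart[0])
pairs = align_greedy_monotonic(cart, asr)
corrected = "the patient has pneumonia in the left lungs today"  # mock model output
print(revert_unflagged(masked, corrected))
```

In this sketch the mock response changes "lung" to "lungs today" outside the flagged gap, and the revert step restores the original wording there while keeping the fill for the gap itself.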
Key findings
Evaluated on a 39.7-hour dataset spanning medical interviews, computer science lectures, phone conversations, and general talks with added environmental noise, CARTGPT achieved 89.0% word accuracy, compared with 83.4% for standard CART and 71.7% for Whisper ASR alone. The gain over CART is 5.6 percentage points and is statistically significant (p < .001). Improvements were more pronounced for technical content (+6.9% over CART for medical and computer science topics) than for casual conversation (+4.1%). The text alignment module achieved 91.2% accuracy with 96.7% inter-annotator agreement. A post-hoc hallucination analysis found that the model inserted ungrounded content in 6% of corrected segments, typically for ambiguous or incomplete utterances. In a user study with 16 DHH participants, CARTGPT captions received significantly higher comprehension ratings (M = 4.4, SD = 0.5) than traditional CART captions (M = 3.7, SD = 0.7), with p < .001. All 16 participants identified at least one instance where CARTGPT clarified unclear technical content. Despite a correction delay of 300-400 ms per segment, participants universally perceived the system as real-time. Participants valued the hybrid approach, in which AI fills gaps rather than replacing the human captioner.
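The accuracy figures above can be made concrete with a small Python sketch. The summary does not state how the paper defines word accuracy, so the definition here (one minus word error rate, using word-level edit distance) is an assumption, and the function names are hypothetical. The last lines show the arithmetic behind the headline gain: 89.0 minus 83.4 is an absolute difference in percentage points, which differs from the relative improvement over the CART baseline.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # delete a reference word
                           cur[j - 1] + 1,           # insert a hypothesis word
                           prev[j - 1] + (r != h)))  # match or substitute
        prev = cur
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - WER. Can go negative with many insertions."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - word_errors(reference, hypothesis) / n


# One unresolved marker in an eight-word clause counts as one substitution.
acc = word_accuracy("the patient has pneumonia in the left lung",
                    "the patient has [inaudible] in the left lung")

# Reported figures from the evaluation, in percent.
cartgpt, cart = 89.0, 83.4
absolute_gain = round(cartgpt - cart, 1)            # percentage points
relative_gain = round(100 * (cartgpt - cart) / cart, 1)  # percent of the baseline
print(acc, absolute_gain, relative_gain)
```

Under these assumptions the reported gain is 5.6 percentage points in absolute terms, or roughly 6.7% relative to the CART baseline.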
Relevance
CARTGPT demonstrates a practical model for AI-augmented accessibility services where machine intelligence enhances rather than replaces human expertise. For organizations providing captioning accommodations, this approach could meaningfully improve caption quality in the most challenging scenarios — technical meetings, medical consultations, and fast-paced lectures — precisely where DHH users need accuracy most. The finding that participants wanted transparency features like toggle views between original and corrected captions, visual indicators for AI-modified text, and dual-display modes has broad implications for any AI-assisted accessibility tool: users need agency and visibility into AI modifications. The 6% hallucination rate, while low, underscores the importance of confidence indicators and user control in high-stakes communication contexts. The system is open-sourced on GitHub, making it available for integration into existing captioning workflows.
Tags: deaf and hard of hearing · real-time captioning · CART · large language models · automatic speech recognition · caption accuracy · hybrid captioning