Automatic Generation and Evaluation of Usable and Secure Audio reCAPTCHA

Mohit Jain, Rohun Tripathi, Ishita Bhansali, Pratyush Kumar · 2019 · Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2019) · doi:10.1145/3308561.3353777

Summary

This paper presents reCAPGen, a system that automatically generates usable and secure audio CAPTCHAs by leveraging the gap between human and machine speech recognition abilities. Visual CAPTCHAs — the dominant form of online human verification — are inherently inaccessible to people who are blind or visually impaired, yet existing audio CAPTCHA alternatives suffer from extremely poor usability (success rates as low as 39-52%) and are insecure against modern automatic speech recognition (ASR) attacks (broken at >90% accuracy by Google ASR). reCAPGen works by selecting audio clips from old radio programs, podcasts, and YouTube lectures, then using IBM Watson Speech-to-Text to identify pairs of consecutive words where one has high transcription confidence (the "control word" for authentication) and the other has low confidence (the "suspicious word" for crowd-sourced transcription). Background noise is calibrated using a binary search algorithm to find the minimum noise level that prevents both IBM Watson and Google Speech from correctly transcribing the control word, ensuring security while minimizing impact on human usability. The system incorporates six filters to remove clips with offensive content, multiple speakers, short words, non-dictionary words, extreme lengths, or false ASR confidence scores. Four audio CAPTCHA schemes were evaluated: Random Digits (the state-of-the-art), Two Words, Last Two Words (novel), and Full Phrase.
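The two generation steps described above, picking a control/suspicious word pair and searching for the minimum noise level, can be illustrated with a minimal sketch. It is not the authors' code: the confidence thresholds (0.9 and 0.5), the integer noise scale of 0 to 100, and the names pick_word_pairs, min_blocking_noise and asr_recognises are all assumptions made for illustration. In a real pipeline the asr_recognises callback would mix noise into the clip at the given level and report whether any attacking ASR service still transcribes the control word; here it is left as a caller-supplied function. The binary search also assumes that once recognition fails at some noise level, it keeps failing at every higher level.

```python
from typing import Callable, List, Optional, Tuple

# One ASR result per word: (word, confidence in [0, 1]).
WordConf = Tuple[str, float]


def pick_word_pairs(
    words: List[WordConf],
    high: float = 0.9,  # hypothetical "high confidence" threshold
    low: float = 0.5,   # hypothetical "low confidence" threshold
) -> List[Tuple[str, str]]:
    """Return (control, suspicious) pairs taken from consecutive words.

    A pair qualifies when one neighbour has confidence >= high and the
    other has confidence <= low, in either order.
    """
    pairs = []
    for (w1, c1), (w2, c2) in zip(words, words[1:]):
        if c1 >= high and c2 <= low:
            pairs.append((w1, w2))
        elif c2 >= high and c1 <= low:
            pairs.append((w2, w1))
    return pairs


def min_blocking_noise(
    asr_recognises: Callable[[int], bool],
    max_level: int = 100,  # hypothetical upper bound of the noise scale
) -> Optional[int]:
    """Binary-search the smallest noise level at which ASR fails.

    asr_recognises(level) should return True if any attacking ASR still
    transcribes the control word at that noise level. Returns None when
    even max_level does not stop recognition.
    """
    if asr_recognises(max_level):
        return None
    lo, hi = 0, max_level  # invariant: recognition fails at hi
    while lo < hi:
        mid = (lo + hi) // 2
        if asr_recognises(mid):
            lo = mid + 1   # still recognised: need more noise
        else:
            hi = mid       # already blocked: try less noise
    return lo
```

Searching for the smallest blocking level, rather than simply adding heavy noise, reflects the trade-off stated in the summary: the clip should defeat the ASR systems while staying as easy as possible for a human listener.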

Key findings

A user study with 60 sighted participants (Amazon Mechanical Turk) and 19 visually impaired participants (recruited in person from three vocational training centres for blind people in India) evaluated 4,740 audio CAPTCHAs across the four schemes. The novel Last Two Words (LTW) scheme achieved the best balance of usability and security: a 78.2% success rate with sighted users and 81.3% with visually impaired users, with response times of 9.6s and 14.5s respectively. This is comparable to visual CAPTCHA performance (87% success, 9.8s) and vastly superior to Random Digits for blind users (26.7% success, 73.6s). Random Digits failed dramatically for visually impaired participants, who had to memorise 10 spoken digits by listening repeatedly without pen and paper, whereas sighted users, who could write the digits down, reached 89.0% success. The contrast starkly illustrates how the state-of-the-art audio CAPTCHA design disadvantages blind users. Visually impaired participants rated LTW as the most preferred scheme, appreciating the contextual words preceding the target ("has less issues of chopped words") and the low memory demand of needing only two words. Security evaluations showed reCAPGen CAPTCHAs were resistant to both ASR attacks (0.7% success for state-of-the-art ASR systems from Microsoft and Amazon) and supervised learning attacks (<0.1% success), because words are drawn from a large open vocabulary rather than a fixed set. Because reCAPGen works as a reCAPTCHA, solving its CAPTCHAs also yields transcriptions of words that ASR systems fail on, at >82% accuracy, which is useful for making audio media accessible to people with hearing impairments.

Relevance

This paper addresses one of the most persistent accessibility barriers on the web: CAPTCHAs that block blind users from accessing online services. The current state-of-the-art audio CAPTCHA (Random Digits) has a 26.7% success rate for visually impaired users, meaning roughly three out of four attempts fail, while being breakable by ASR at >90% accuracy; existing solutions are therefore simultaneously unusable and insecure. The proposed LTW scheme achieves usability comparable to visual CAPTCHAs for the first time, suggesting it could be deployed as a genuine alternative rather than a frustrating afterthought. The dual benefit of the reCAPTCHA approach (security verification plus crowd-sourced audio transcription) means solving CAPTCHAs can improve media accessibility for deaf and hard of hearing users. For web developers and accessibility practitioners, the paper makes a clear case against Random Digits audio CAPTCHAs and provides an open system (reCAPGen) for generating more accessible alternatives. The speech-input modality (speaking answers instead of typing) is also significant, as it reduces the interaction complexity for screen reader users who otherwise must navigate between audio playback and text input fields.

Tags: CAPTCHA · audio accessibility · blind · visual impairment · web accessibility · speech recognition · security · crowdsourcing · screen readers