Automating lived experience accessibility testing using AI

When we think about automated accessibility testing of websites and apps, we tend to expect the test tool to inspect the content and structure of the page directly and report WCAG (Web Content Accessibility Guidelines) success criteria failures against it. But that is only one small part of web/app testing. A much larger part involves manual inspection of the pages by auditors and testers, and usability testing of the page by disabled users, usually with their own assistive technology.

I've previously demonstrated some AI prompts that can help extend that "classic" accessibility testing. The question is: can we apply AI to help detect and report accessibility issues when people are actually interacting with the website or app? Here, I report on a tool I developed which does exactly that.

Reports

The first question to ask is what type of issues we are trying to report. Are they strictly WCAG success criteria fails, or do we include best practice issues, usability issues, or user experience, and do we recommend remediation approaches? I decided to try for all of them; it's an experiment after all.

With this in mind, we perhaps need a range of reports:

Explicit failures against WCAG success criteria and related best practice guidance with suggested remediation where possible.
A detailed list of user pain-points.
A detailed list of user assertions about their experience using the website/app.
An overview of the testing, say a list of key takeaways.

Capturing User Experience

If we are to capture and analyze user experience then we need ways of recording it. We have the pages themselves as rendered over time, and potentially the source code they are built from, images of the user interacting with the page and any audio related to that activity. That stream of information may be of varying length, but probably no longer than two hours given that testers become fatigued.

In my case, the immediately available sources were screen-recordings of lived experience accessibility testers interacting with websites and apps. Each recording has an accompanying live commentary as the tester "thinks aloud" about their experience together with a test supervisor (these are all supervised tests, not necessarily scripted, but with a test supervisor in attendance), plus potentially the audio generated by the user's assistive technology. So, in practice, all the reports need to collate their results from listening to testers and potentially watching the screen as it changes. Because that is what I had.

Current Commercial AI Models

Given limited time and resources, the goal here was to be a consumer of existing commercial AI models, not to train them, and in the end three different AI APIs were used. Deepgram (https://deepgram.com) was used to transcribe the audio track of the screen recordings, and both OpenAI's GPT API (https://openai.com) and later Anthropic's Claude 3 API (https://www.anthropic.com) were used for text and image queries. I moved to Claude for one simple reason: I needed to analyze large quantities of data and that is better supported by Claude 3 than by GPT. GPT gives 4K input tokens; Claude gives 200K input tokens. 200K input tokens is enough to analyze up to 90 minutes of audio transcription in one go.

Approach

I decided to explore four text-based AI prompts, one for each report type needed, with the source input coming from a transcript of the audio. The expectation was for the prompts to build upon each other, so that the key takeaways prompt would utilize the outputs of the user pain-points and WCAG analysis.
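The chaining described above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: `runReports` and the injected `askModel` function are hypothetical names, and the one-line instruction strings are placeholders rather than the real prompts.

```javascript
// Hypothetical sketch: run the four report prompts in sequence so that the
// key takeaways prompt can build on the earlier outputs.
// `askModel(instructions, attachments)` is an assumed helper that sends one
// prompt to the AI API and resolves to the parsed JSON reply.
async function runReports(vtt, askModel) {
  // Placeholder instructions; the real prompts are far longer.
  const wcag = await askModel("Identify WCAG issues at risk.", { vtt });
  const painPoints = await askModel("List user pain-points.", { vtt });
  const assertions = await askModel("List user assertions.", { vtt });

  // The final prompt receives the transcript plus the earlier results.
  const takeaways = await askModel("List the key takeaways.", {
    vtt,
    wcag,
    painPoints,
  });

  return { wcag, painPoints, assertions, takeaways };
}
```

Passing `askModel` in as a parameter keeps the sequencing independent of which vendor API is behind it.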

Those text prompts required an initial transcript of the audio, and I chose to create this by auto-generating a VTT caption file from the MP4 screen-recording. The problem here is that most auto-captioning tools, in addition to making mistakes (and they make many mistakes), do not always identify the speaker well. I needed a quality diarizing captioning tool. Diarized captions identify the current speaker, and in my case there could be three speakers: the user/tester, the test supervisor, and the synthetic speech of the screen-reader. Potentially, the screen-reader could be speaking at a different rate in comparison to the recorded human speech.

Platform

I chose to build the application in JavaScript/ES6 using NodeJS (https://nodejs.org/en). This provides for creation of a command-line tool with a wide range of available library modules, and is a platform I use for my "classic" test automation tools. It is not an ideal platform for sharing the application with others, and a commercial solution would undoubtedly take a different approach, but it was sufficient for my needs.

Pipeline

The raw input was in the form of a high resolution MP4 file with a mono audio track. This was fed into Deepgram via an API call requesting diarized transcription into VTT format. Deepgram accepts a regional variant for the language, e.g. en-GB for the UK or en-CA for Canada. However, this applies to all speakers; one cannot apply the region to individual diarized speakers. en-GB was chosen as it seemed to make the fewest mistakes with a mix of users and synthetic speech.
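As a rough sketch of what such a request might look like, assuming Deepgram's pre-recorded `/v1/listen` endpoint with its `diarize`, `utterances`, `punctuate`, and `language` query parameters: the function names, the `video/mp4` content type, and the choice of parameters are my assumptions, not a record of the call the tool makes. The conversion of the returned transcript into VTT is left out.

```javascript
// Sketch only: build a Deepgram pre-recorded transcription request with
// diarization and a single language region for the whole recording.
function buildDeepgramUrl({ language = "en-GB" } = {}) {
  const url = new URL("https://api.deepgram.com/v1/listen");
  url.searchParams.set("diarize", "true"); // label each word with a speaker
  url.searchParams.set("utterances", "true"); // group words into utterances
  url.searchParams.set("punctuate", "true");
  url.searchParams.set("language", language); // applies to every speaker
  return url.toString();
}

// Assumed usage: `apiKey` and `mp4Bytes` are supplied by the caller.
// Not invoked here because it needs network access and a real key.
async function transcribe(apiKey, mp4Bytes) {
  const response = await fetch(buildDeepgramUrl(), {
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "video/mp4", // assumption: the MP4 is sent as-is
    },
    body: mp4Bytes,
  });
  if (!response.ok) {
    throw new Error(`Deepgram request failed: ${response.status}`);
  }
  return response.json(); // JSON transcript; VTT conversion is a separate step
}
```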

That VTT file was held in memory, and attached to each prompt to GPT/Claude as required.

Each prompt was constructed to return its answers in JSON format. JSON has become the de facto standard for data interchange between web applications, and is easy to create, load, and parse in JavaScript. The prompts, which were quite long, consisted largely of instructions aimed at producing reliable, consistent responses suitable for machine parsing. Without that very explicit level of instruction, the results returned can be variable, to say the least.
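A minimal sketch of that pattern, assuming the transcript is simply appended to the instructions and that the reply may still arrive wrapped in stray prose: `buildPrompt` and `parseJsonReply` are hypothetical helper names, and the "Reply with JSON only." wording is a placeholder, not the actual prompt text.

```javascript
// Hypothetical helper: combine fixed instructions with the transcript.
function buildPrompt(instructions, vtt) {
  return `${instructions}\n\nReply with JSON only.\n\n${vtt}`;
}

// Defensive parsing: models sometimes wrap JSON in prose or code fences,
// so take the text between the first "{" and the last "}" before parsing.
function parseJsonReply(reply) {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { ok: false, error: "no JSON object found" };
  }
  try {
    return { ok: true, value: JSON.parse(reply.slice(start, end + 1)) };
  } catch (err) {
    // Truncated or otherwise broken JSON ends up here.
    return { ok: false, error: err.message };
  }
}
```

Returning an `ok` flag rather than throwing lets the caller decide whether to retry the prompt or report the failure.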

A Corpus of Knowledge

I constructed each prompt to have three elements: a summary of the audio conversation, actions on that understanding, and the application of a particular query. To summarize and understand a conversation, the AI needs a deep corpus of knowledge of the subject matter being discussed, and must understand multiple intertwined ontologies in English that is perhaps less than perfectly expressed, particularly in this case where only the audio track is available (video is discussed below). The result was perhaps as good as one would expect of an AI trained on publicly available content that focusses on identifying common and popular opinion. Consequently, some of the prompts required a "training block" to explicitly instruct the AI to ignore some of its core knowledge in this specific instance, and to utilize the corrections provided. This list of replacement digital accessibility heuristics grew as I tested the prompts against the multiple screen-recordings of both digital audits and lived experience testing that I had access to.

Identification Of Issues

The first prompt attempted was to summarize the key issues discussed in the diarized audio transcript, looking to see how well understood the conversations were. So, not looking explicitly at the detailed analysis of pain-points and WCAG success criteria, simply analyzing the conversation.

The key part of the initial prompt was:

The following text is a VTT file of a conversation between a disabled tester and a test supervisor with the sound of a screen-reader as it is used by the tester. Reporting using JSON, identify the key takeways from the conversation in terms of accessibility and usability. I do not require any other output other than the JSON

An example VTT fragment is shown below.

WEBVTT

NOTE
Transcription provided by Deepgram
Request Id: 3a45517e-186e-4774-8d17-684ccea208ff
Created: 2024-04-06T15:24:48.738Z
Duration: 372.048
Channels: 1

00:00:00.719 --> 00:00:02.800
<v Speaker 0>And now let me describe the page that

00:00:02.800 --> 00:00:04.720
<v Speaker 0>we have here. So I'm Bob Dodd. I'm

The comments are at the beginning, followed by two lines of captions, each with a timecode on one line followed by diarized text on the next.

Each speaker is identified numerically as "Speaker n" where n is the speaker id.
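For illustration, a transcript in this shape can be split into cues with a few lines of JavaScript. This is a minimal sketch that assumes the layout shown in the fragment above (a timecode line followed by one or more "<v Speaker n>" lines); `parseVtt` is a hypothetical helper, not a general-purpose WebVTT parser.

```javascript
// Minimal sketch: extract diarized cues from a Deepgram-style VTT transcript.
// Assumes each cue is a timecode line followed by one or more
// "<v Speaker n>text" lines, with a blank line between cues.
function parseVtt(vtt) {
  const timecode = /^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})/;
  const voice = /^<v Speaker (\d+)>(.*)$/;
  const cues = [];
  let current = null;

  for (const line of vtt.split(/\r?\n/)) {
    const time = timecode.exec(line);
    if (time) {
      current = { start: time[1], end: time[2], speaker: null, text: "" };
      cues.push(current);
      continue;
    }
    if (line.trim() === "") {
      current = null; // a blank line ends the cue
      continue;
    }
    if (!current) continue; // header, NOTE block, or other non-cue text
    const spoken = voice.exec(line);
    if (spoken) {
      current.speaker = Number(spoken[1]);
      current.text = (current.text + " " + spoken[2]).trim();
    } else {
      current.text = (current.text + " " + line).trim();
    }
  }
  return cues;
}
```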

OpenAI's GPT fought against reviewing this content: it believed that the content might be copyrighted, even though it was my own content, and refused to analyze more than a few lines at a time. Anthropic's Claude 3 AI was more lenient and a full VTT file could be added to the prompt (up to 200K input tokens).

When creating these prompts it is necessary to remember that execution is not free: the degree of compute required means that it will seldom be a free service, and one needs to maximize the usage of the information in the prompt (it cannot be referenced in future prompts). Consequently, I played with the idea of running all the prompts as a single prompt to reuse the VTT content. In practice this didn't work because whilst we have 200K input tokens, we only get 2K output tokens (same on GPT and Claude) so there is limited ability to report issues and results.

Variability In Responses

What quickly became evident is that the quality of the individual responses could vary wildly. This was especially the case with GPT, where the same prompt on different days produced answers that differed greatly in richness and completeness. I was using the preview version of what is now GPT-4o and it felt as if OpenAI were doing A/B testing: sometimes one got the "clever" version of GPT, and sometimes the "dunce", and the results did seem to swing between those two extremes.

Even with Claude 3, I still get variations in quality that seem related to nearing the maximum allowed tokens for either input or processing, or (I suspect) the busyness of their servers. Certainly the quality of the response can be worse mid-morning Eastern Standard Time on a work day than at weekends or in the evening. It is as if Anthropic deliberately varies the quality of the models to manage server workloads. I don't know if that is true, but it would certainly match my experience.

The variability shows in broken JSON, missing parts of the result, significantly less rich descriptions of issues and solutions, sometimes ignored instructions, and sometimes in reduced knowledge. The most obvious case of reduced knowledge was with GPT which sometimes failed to understand the format of the timecode in the VTT file and would report the correct issue at the wrong timecode.
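One cheap guard against the timecode problem is to check each reported timecode against the recording itself. The sketch below is an assumption about how that might be done, not part of the tool as described: it presumes the model returns issues as objects with a `timecode` field in the VTT "HH:MM:SS.mmm" format, and it only catches malformed or out-of-range values, not a well-formed timecode that points at the wrong moment.

```javascript
// Convert "HH:MM:SS.mmm" to seconds; returns NaN for any other format.
function timecodeToSeconds(timecode) {
  const match = /^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$/.exec(timecode);
  if (!match) return NaN;
  const [, h, m, s, ms] = match.map(Number);
  return h * 3600 + m * 60 + s + ms / 1000;
}

// Flag issues whose timecode is malformed or beyond the end of the recording.
// `issues` is an assumed array of { title, timecode } objects from the model;
// `durationSeconds` can be read from the "Duration:" note in the VTT header.
function findSuspectTimecodes(issues, durationSeconds) {
  return issues.filter((issue) => {
    const seconds = timecodeToSeconds(issue.timecode);
    return Number.isNaN(seconds) || seconds > durationSeconds;
  });
}
```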

There is often talk of hallucination in AI model responses, and I have yet to experience that with the prompts. GPT came close on occasion where it would describe solutions to issues that simply do not exist, including referencing non-existent API calls in commercial packages, but it would always reference this as an example of how something might work. The more common issue is a lack of understanding of a subject: it is only as clever as the training data. Mistakes in the data appear in the responses.

Accuracy in Summarizing

Overall, I found the accuracy of Claude's summarizing extremely good, and usually more succinct than GPT produced for the same content. In terms of issues in the audio, it is only as good as the speakers. I discovered early on in experimenting that I speak Robot very well in my recordings. I'm a professional auditor, and I have a great deal of experience of explaining the issues I find in a way that the AI liked. Claude has never missed a single issue I have raised when using this tool to listen to just me talking through my manual inspections.

Claude also worked well with the audio of other auditor colleagues who are more discursive in talking about the content in front of them, but it did occasionally miss smaller issues that were not well announced.

Once the conversation involves users with lived experience, the richness of the issues captured by Claude depends on the type of testing. Less scripted, less directed testing depends a great deal on having the user "think aloud" and on the supervisor responding to and querying the user. This is where giving the AI access to images of the page being discussed would help scope the context and issues reported. However, even with only the audio track, Claude does remarkably well.

With fully scripted and directed lived experience testing that has simpler step-by-step tasks rather than a general instruction e.g. to buy a product on a website, the AI shines. It hears the description of the task as it is given in context to the user, it hears the user and sometimes their screen-reader as they talk about what they are doing and how they are interacting, and as a result has a well structured understanding of the task and the issues encountered.

WCAG Success Criteria / Conformance Testing

Summarizing and extracting pain-points is, to some extent, the core of what Large Language Models do, and it's perhaps not so surprising that with time and effort, prompts can be engineered to identify pain-points and key takeaways. What was asked of AI in terms of WCAG is much greater.

I asked Claude to identify which issues found/announced in the audio track were related to WCAG success criteria, or to the WCAG requirements for conformance. Specifically, I asked Claude to identify issues with the use of assistive technology that would cause a web page to fail section 5.2.4 of WCAG, which requires a page to properly support a user's assistive technology in addition to meeting the general success criteria.

For each issue identified, I required:

A succinct title for the issue.
What the issue was.
Why it was important from an accessibility perspective.
Who it impacts from the disability community.
The level of impact the issue has on the user (low to high).
Ways to remediate the issue.
The WCAG success criteria and conformance issues at risk.
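Because responses can come back incomplete, it helps to check each returned issue against that list before using it. The following is a sketch under stated assumptions: the field names are hypothetical stand-ins for the seven requirements, and the "medium" impact level is my own addition, since only "low to high" is specified.

```javascript
// Hypothetical field names mirroring the seven requirements listed above.
const REQUIRED_FIELDS = [
  "title",
  "what",
  "why",
  "who",
  "impact",
  "remediation",
  "wcagAtRisk",
];

// Assumed scale: only "low to high" is specified, "medium" is a guess.
const IMPACT_LEVELS = ["low", "medium", "high"];

// Return a list of problems with one issue object; empty means it looks complete.
function validateIssue(issue) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = issue[field];
    const empty =
      value === undefined ||
      value === null ||
      value === "" ||
      (Array.isArray(value) && value.length === 0);
    if (empty) problems.push(`missing ${field}`);
  }
  if (issue.impact && !IMPACT_LEVELS.includes(issue.impact)) {
    problems.push(`unexpected impact level: ${issue.impact}`);
  }
  return problems;
}
```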

Note that last requirement: "at risk". It is hard enough to ask automated testing tools to identify WCAG fails when the test tool has access to the source code of the page, let alone when you only hear about what is happening and need to infer why that may be occurring. In this situation it seems reasonable to let the AI paint potential scenarios rather than to demand exact answers.

Having run the best part of 3 years' worth of audit screen recordings through the tool, I found that Claude missed very little: only 4 of my audit issues. It also reported on two issues identified in the recordings that somehow didn't make it into my reports.

It is harder to put numbers to the accuracy with lived experience accessibility testing because Claude only had access to the audio transcript and couldn't see the screen. Supervisors can see both and can therefore identify more issues, including those that users don't necessarily know or understand. A blind screen-reader user, for example, only knows what the screen-reader reports of the page, whilst the supervisor can visually identify content that is incorrectly marked up and visually present. Where tested, those issues obvious in the audio track got identified, but I have yet to compare each page of every test. So far, so good is all I can currently claim.

Video

All of the work described has concentrated on the audio track of the testing, and not the video track. This is partly because you have to start somewhere, and partly because identifying the appropriate frames of video at issue is difficult, as users may be talking about issues no longer on screen; there is a lack of context. In theory, it should be possible to inspect video frames up to the point where an issue is identified and match on-screen behaviour with the subject matter, but that is not easy. It would also not be cheap. Even at 25 frames per second, that is a lot of data to process.
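To put a rough number on that, here is the arithmetic as a small worked example. The 90-minute and two-hour session lengths are taken from the figures mentioned earlier in this article; `frameCount` is just an illustrative helper.

```javascript
// Worked arithmetic: how many frames a recording contains at a given rate.
function frameCount(minutes, framesPerSecond = 25) {
  return minutes * 60 * framesPerSecond;
}

console.log(frameCount(90)); // 135000 frames in a 90-minute session
console.log(frameCount(120)); // 180000 frames in a two-hour session
```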

Future Steps

My next step will be with video. I am already creating "classic" automated tests that compare rendered source code against full-page screenshots; now I want to bring that to lived experience.

Rather than completely separating the two forms of automated testing, I want to build improved capture tools that test against the page content each time the page substantially changes or scrolls. For web pages, I want to take a full-page screenshot and pull the live source code.

This won't fully link the spoken word with the screen content (see 'Video' above) but we'll at least know what automated test results we had around the time of the issue.

Added to this will be capture of mouse, keyboard, and gesture events, so that, again, we have a richer set of inputs to give to AI. Quite what we can get out of that I'm not yet sure, but it will be a continuation of this adventure into current out-of-the-box AI.