A More Accessible Web with Natural Language Interface

Xiang Deng · 2023 · Proceedings of the 20th International Web for All Conference (W4A) · doi:10.1145/3587281.3587700

Summary

This extended abstract proposes a general natural language interface (NLI) for the Web that lets users state their needs in plain language and have the system carry out the required actions on any website. The goal is to lower the complexity barrier that makes the modern web difficult for people with disabilities and for less technologically proficient users. Unlike existing smart assistants (Siri, Google Assistant, Alexa), which are constrained by predefined APIs and limited to specific services, the proposed system would operate on any website through a browser, translating natural language commands into sequences of UI actions. It would consist of two key components: a web page encoder that comprehends the structure, text content, and multimedia of diverse web pages without relying on site-specific human annotations, and a semantic parser that maps natural language commands to grounded action sequences executable on the target page. The author has explored pre-training methods for the web page encoder that generalize across websites, as well as new methods that strengthen the semantic parser on complex queries. A benchmark dataset is being curated with a crowdsourcing annotation tool that collects diverse tasks and action demonstrations across 145 target websites spanning shopping, travel, job searching, and government services.
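To make the two-component design concrete, here is a minimal Python sketch of the data flow from page to grounded actions. It is an illustration under stated assumptions, not the author's method: the `Element` and `Action` types, the `encode_page` and `parse_command` functions, and the single "search for ..." rule are all hypothetical stand-ins for the learned encoder and semantic parser described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Element:
    """A simplified interactive page element (hypothetical schema)."""
    element_id: str
    role: str   # e.g. "textbox", "button", "link"
    label: str  # visible text or accessible name


@dataclass(frozen=True)
class Action:
    """One grounded UI action: an operation bound to a concrete element."""
    op: str  # "type" or "click" in this sketch
    element_id: str
    value: str = ""


def encode_page(elements: List[Element]) -> Dict[str, List[Element]]:
    """Toy stand-in for the web page encoder: index elements by label token."""
    index: Dict[str, List[Element]] = {}
    for el in elements:
        for token in el.label.lower().split():
            index.setdefault(token, []).append(el)
    return index


def parse_command(command: str, page_index: Dict[str, List[Element]]) -> List[Action]:
    """Toy stand-in for the semantic parser: handles only 'search for <query>'."""
    prefix = "search for "
    text = command.strip()
    if not text.lower().startswith(prefix):
        raise ValueError("this sketch only understands 'search for <query>'")
    query = text[len(prefix):]
    candidates = page_index.get("search", [])
    box = next((e for e in candidates if e.role == "textbox"), None)
    button = next((e for e in candidates if e.role == "button"), None)
    if box is None or button is None:
        raise LookupError("no search box and search button found on this page")
    # Ground the command: every action refers to an element present on the page.
    return [Action("type", box.element_id, query), Action("click", button.element_id)]


# Hypothetical page with three elements.
page = [
    Element("e1", "textbox", "Search products"),
    Element("e2", "button", "Search"),
    Element("e3", "link", "Careers"),
]
actions = parse_command("Search for wireless headphones", encode_page(page))
```

The point of the sketch is the separation of concerns: the encoder output depends only on the page, while the parser combines that output with the user's command, so the same parser logic can in principle be reused on a page it has never seen.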

Key findings

Preliminary results show that pre-training the web page encoder with self-supervision produces generalizable representations that are effective for web page understanding tasks such as information extraction and question answering. The pre-trained encoder generalizes well in both few-shot and zero-shot settings, which makes it suitable for interfaces that must work across diverse websites without site-specific training. The research also evaluated large language models such as ChatGPT and Codex for both web page encoding and action parsing from user requests, and used them to generate seed tasks for annotators. The annotation tool has been developed and tested in small-scale in-house trials, which showed that it can collect the task-action demonstrations needed to train the system. The work adapts a model from the vision-and-language navigation literature, originally designed to guide a robot through a simulated visual environment, to navigate the Web instead, highlighting the parallels between physical navigation and web task completion.
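The abstract does not specify how a task-action demonstration is stored. As a purely hypothetical sketch, one record could pair a natural language task with the ordered UI steps an annotator performed, serialized as one JSON object per line. The field names, the `Step` and `Demonstration` types, and the example website are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Step:
    """One demonstrated UI step (hypothetical fields)."""
    op: str      # e.g. "click", "type", "select"
    target: str  # a selector or element description
    value: str = ""


@dataclass
class Demonstration:
    """A task description paired with the steps that complete it."""
    website: str
    domain: str  # e.g. "shopping", "travel"
    task: str
    steps: List[Step] = field(default_factory=list)


def to_json_line(demo: Demonstration) -> str:
    """Serialize one demonstration as a single JSON line."""
    return json.dumps(asdict(demo), sort_keys=True)


def from_json_line(line: str) -> Demonstration:
    """Rebuild a demonstration, including its nested steps, from a JSON line."""
    raw = json.loads(line)
    steps = [Step(**s) for s in raw.pop("steps")]
    return Demonstration(steps=steps, **raw)


# Hypothetical record; the site and selectors are invented for the example.
demo = Demonstration(
    website="example.com",
    domain="travel",
    task="Find a one-way flight from Columbus to Seattle",
    steps=[
        Step("click", "#one-way"),
        Step("type", "#from", "Columbus"),
        Step("type", "#to", "Seattle"),
        Step("click", "#search"),
    ],
)
restored = from_json_line(to_json_line(demo))
```

Keeping the task text and the step sequence in one record is what lets the same data serve both as parser training examples and as an evaluation benchmark.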

Relevance

This research anticipates the direction in which web accessibility is heading with the rise of AI agents and large language models. A general natural language interface for the Web could fundamentally change how people with disabilities interact with online content and services by removing the need to navigate complex visual interfaces, understand site-specific layouts, or perform multi-step form interactions. For screen reader users, who currently must move through DOM structures sequentially, a natural language interface could provide a more direct path to task completion. However, this is early-stage work presented as an extended abstract with preliminary results only, and significant challenges remain around reliability, security, and handling the full diversity of real-world websites. Practitioners should follow this area as LLMs and web automation converge; such interfaces are more likely to complement traditional accessibility approaches like semantic HTML and ARIA than to replace them.

Tags: natural language processing · web accessibility · web automation · semantic parsing · large language models · digital divide