VisualAid: Enhancing Accessibility for Visually Impaired Users Through AI

Wajdi Aljedaani, Sijo Rejigeorge, Priya Jha, Srija Yadavalli, Manikanta Kothakota, Marcelo M. Eler, Abdulrahman Habib · 2025 · Proceedings of the 22nd International Web for All Conference (W4A) · doi:10.1145/3744257.3744277

Summary

This technical note presents VisualAid, an AI-powered Android application designed to help visually impaired users understand and navigate their physical surroundings. The app integrates four AI components into a single mobile interface: YOLO11x for real-time object detection (trained on the COCO dataset with 80 object classes, with a reported 54.5% mAP at 13 ms latency), BLIP (Bootstrapped Language-Image Pretraining) for generating descriptive image captions, ViLT (Vision-and-Language Transformer) for visual question answering, and Tesseract for optical character recognition. The system architecture consists of an Android front end connected to a Flask back end. Users capture images through the phone's camera and interact entirely through voice: the app uses Android's built-in Text-to-Speech (TTS) and SpeechRecognizer for hands-free operation. Queries are processed according to their type. Object-specific queries use YOLO bounding boxes with BLIP captioning, text extraction queries use cropped images with Tesseract OCR, and general queries are handled by ViLT combined with BLIP captions. The app supports English and Spanish and includes a voice-guided onboarding module for first-time users. This is a system description paper presenting the conceptual framework and features; no user evaluation has been conducted yet.
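The three query types described above can be illustrated with a small routing sketch. This is not the authors' implementation: the summary does not say how VisualAid decides which type a query belongs to, so the keyword list, the `QueryType` enum, and the `classify_query` function are all hypothetical names chosen for illustration. The sketch assumes the back end already has the set of labels YOLO detected in the current image.

```python
import re
from enum import Enum
from typing import Iterable


class QueryType(Enum):
    OBJECT = "object"    # YOLO bounding box + BLIP caption
    TEXT = "text"        # cropped image + Tesseract OCR
    GENERAL = "general"  # ViLT + BLIP caption


# Hypothetical keyword list: an assumption, not taken from the paper.
TEXT_KEYWORDS = frozenset({"read", "text", "written", "say", "says", "word", "words"})


def classify_query(query: str, detected_labels: Iterable[str]) -> QueryType:
    """Pick a processing route for a transcribed voice query.

    detected_labels is assumed to hold the class names the detector
    found in the current image (for example {"cup", "person"}).
    Only single-word labels are matched in this simplified sketch.
    """
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & TEXT_KEYWORDS:
        return QueryType.TEXT
    labels = {label.lower() for label in detected_labels}
    if words & labels:
        return QueryType.OBJECT
    return QueryType.GENERAL
```

For example, `classify_query("What color is the cup?", {"cup", "person"})` returns `QueryType.OBJECT`, while a query that mentions neither a text keyword nor a detected label falls through to `QueryType.GENERAL`. A production system would likely need something more robust than keyword matching, particularly for Spanish queries and multi-word class names.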

Key findings

The paper describes the technical architecture rather than presenting evaluation results. The system combines four distinct AI capabilities (object detection, image captioning, question answering, and text extraction) into a unified accessible interface, which the authors position as more comprehensive than existing single-function tools. YOLO11x was selected for object detection due to its efficiency, using 22% fewer parameters than prior versions while maintaining accuracy. For captioning, BLIP is reported to achieve a CIDEr score of 133.3 with 14M pre-training images, outperforming models like Enc-Dec and VinVL. ViLT handles visual question answering and is described as running 10x faster than other vision-language pretraining models. The query processing pipeline routes each request by type: when a user asks about a specific object, the system first uses YOLO to locate it, crops the relevant region, and passes the crop to BLIP for a detailed description; for text-related queries, it routes to Tesseract OCR instead. The authors plan future work to evaluate the application with visually impaired users to assess usability, accessibility, and effectiveness.
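The object-specific path (locate, crop, describe) can be sketched as follows. This is a minimal illustration under stated assumptions, not the VisualAid code. The detector and captioner are passed in as plain callables standing in for YOLO and BLIP, the image is a toy nested list rather than a real bitmap, and the names `Detection`, `crop`, and `describe_object` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with x2 and y2 exclusive


@dataclass
class Detection:
    label: str
    box: Box


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image inside the bounding box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def describe_object(
    image: Image,
    target: str,
    detect: Callable[[Image], List[Detection]],  # stand-in for YOLO
    caption: Callable[[Image], str],             # stand-in for BLIP
) -> Optional[str]:
    """Caption the first detected region whose label matches target.

    Returns None when the target is not detected; what the real app
    does in that case is not stated in the summary.
    """
    for det in detect(image):
        if det.label == target:
            return caption(crop(image, det.box))
    return None
```

Passing the models in as arguments keeps the routing logic testable without loading any model weights; in a real back end, `detect` and `caption` would wrap the actual YOLO and BLIP inference calls.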

Relevance

VisualAid reflects a growing trend of combining multiple AI models into comprehensive assistive tools for visually impaired users, moving beyond single-purpose apps toward integrated environmental understanding. For accessibility practitioners, the paper illustrates how modern computer vision and language models can be combined into a practical mobile tool with voice-first interaction design. The multi-modal approach, which combines object detection, captioning, OCR, and question answering, reflects the complexity of real-world information needs when navigating without sight. However, the paper's main limitation is the absence of any user evaluation; the system is described only at a conceptual and technical level. Without testing with visually impaired users, it is unclear whether the described features address real needs or whether the interaction design is genuinely accessible. The planned user study will be critical for validating the approach.

Tags: visual impairment · object detection · image captioning · OCR · voice interaction · mobile accessibility · AI-driven accessibility · Android