VisionAid: A Multimodal Assistive Application Supporting Safe Road Navigation for Visually Impaired People in Bangladesh

Asif Mahbub, Nabil Bin Hannan · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA ’26) · doi:10.1145/3772363.3798801

Summary

VisionAid is an Android pedestrian-safety app for blind and low vision (BLV) users in Bangladesh, where road crossing is dangerous because traffic is unstructured: lane discipline is rare, drivers run signals, and crosswalks are inconsistent. The authors argue that Western-centric assistive navigation tools (NavCog, PathFinder, LineChaser, Seeing AI, Oko) assume infrastructure or behavior that does not exist in Dhaka, and that hardware-heavy systems (ALVU, WOAD, LiDAR-equipped wearables, LLM-Glasses) are unaffordable in the Global South. They opt instead for a software-only mobile solution that runs on commodity smartphones. The system has three stages: a CameraX preprocessing pipeline, a quantized YOLOv11 detector consolidated into a single unified model trained on 9,500 images including locally relevant classes (rickshaws, buses, potholes), and a multimodal feedback layer combining Android TTS, intensity-based vibration patterns (short pulses for warnings, sustained vibration for immediate hazards), and beep tones whose density tracks traffic volume. The core algorithmic contribution is a Vehicle Occupancy Heuristic: rather than alerting on every detected vehicle, which produces useless noise on a busy Dhaka street, the system alerts only when vehicle bounding boxes cumulatively cover more than 60% of the camera frame for at least one second, which the authors take as a proxy for an immediate, blocking hazard. A pilot field test was run across four urban locations including high-traffic intersections and a university campus.
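The Vehicle Occupancy Heuristic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The 60% threshold and one-second hold come from the summary; the names `occupancy_fraction` and `OccupancyAlert`, the pixel box format, and the choice to sum clipped box areas (which double-counts overlapping boxes, hence the cap at 1.0) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels, vehicle classes only.
Box = Tuple[float, float, float, float]


def occupancy_fraction(boxes: Iterable[Box], frame_w: int, frame_h: int) -> float:
    """Summed area of vehicle boxes as a fraction of the frame, capped at 1.0.

    Assumption: overlapping boxes are simply summed, so overlap is
    double-counted; a union-of-boxes area would be stricter.
    """
    total = 0.0
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the frame so off-screen parts do not count.
        w = max(0.0, min(x2, frame_w) - max(x1, 0.0))
        h = max(0.0, min(y2, frame_h) - max(y1, 0.0))
        total += w * h
    return min(1.0, total / float(frame_w * frame_h))


@dataclass
class OccupancyAlert:
    threshold: float = 0.60    # from the summary: more than 60% of the frame
    hold_seconds: float = 1.0  # from the summary: for at least one second
    _since: Optional[float] = field(default=None, init=False, repr=False)

    def update(self, fraction: float, now: float) -> bool:
        """Feed one frame; return True once the condition has held long enough."""
        if fraction <= self.threshold:
            # Condition broken (camera shake, passing vehicle): restart the timer.
            self._since = None
            return False
        if self._since is None:
            self._since = now
        return (now - self._since) >= self.hold_seconds
```

In use, each analyzed frame would pass its vehicle boxes through `occupancy_fraction` and feed the result, with a monotonic timestamp in seconds, to `update`. Resetting the timer whenever the fraction falls back to or below the threshold is what filters out transient detections.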

Key findings

Consolidating multiple specialized YOLOv8 detectors (one for crosswalks at 50 ms, one for traffic lights at 45 ms) into a single unified YOLOv11 model cut end-to-end inference from roughly 400 ms to 120 ms, sustaining 8–10 FPS on mid-range Android phones, at the cost of lower precision (94.7% down to 82.4%) and recall (87.8% down to 68.5%). The authors argue the recall drop is acceptable because the Vehicle Occupancy Heuristic only requires high confidence on vehicles that breach the 60% spatial threshold, which is also where missed detections are most dangerous; the heuristic also filters transient noise (camera shake, vehicles passing the periphery) by requiring the spatial-dominance condition to hold for about one second. Battery use was around 15% over a 16-minute continuous stress test with the screen on, which the authors plan to reduce by using the accelerometer to drop the frame rate when the user is stationary. Field tests confirmed that signal violations make traffic-light detection alone insufficient and that the occupancy heuristic successfully suppressed alerts for distant traffic. Importantly, the field study recruited six sighted participants aged 18–24, not BLV users.
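The latency and battery figures above can be turned into rough budgets with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the frame rate is an upper bound that assumes inference is the sole per-frame cost, and the runtime projection assumes battery drain stays linear, which the paper summary does not claim.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Upper-bound frame rate if each frame costs `latency_ms` end to end."""
    return 1000.0 / latency_ms


def drain_per_minute(percent_used: float, minutes: float) -> float:
    """Average battery drain in percentage points per minute."""
    return percent_used / minutes


def projected_runtime_minutes(percent_used: float, minutes: float) -> float:
    """Minutes to drain a full battery, assuming the drain rate stays constant."""
    return 100.0 / drain_per_minute(percent_used, minutes)


# Figures reported in the summary above.
multi_model_fps = fps_from_latency(400.0)          # 2.5 FPS
unified_fps = fps_from_latency(120.0)              # about 8.3 FPS
drain = drain_per_minute(15.0, 16.0)               # 0.9375 points per minute
runtime = projected_runtime_minutes(15.0, 16.0)    # about 107 minutes
```

Under these assumptions, 120 ms per frame gives about 8.3 FPS, at the lower end of the reported 8–10 FPS range. The stress-test drain extrapolates to a little under two hours of continuous use on a full charge.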

Relevance

The paper makes a concrete case that pedestrian-safety assistive technology for the Global South is not just a port of Western-centric tools; it requires different algorithmic choices, because the assumptions baked into structured-traffic systems (predictable lanes, signal compliance, well-marked crosswalks) simply do not hold. The Vehicle Occupancy Heuristic is a small, transferable idea that practitioners can lift directly: filter detector output by spatial dominance plus temporal persistence rather than by raw confidence, with the fail-safe pivoting toward false positives at distance rather than false negatives in the crossing zone. The phone-only, no-extra-hardware design is also significant for affordability. That said, the evaluation has a serious gap that the authors acknowledge only obliquely: the actual user population, BLV pedestrians in Dhaka, was not included in field testing. The participants were sighted university students, so the data speaks to model latency, battery use, and algorithmic noise filtering, not to whether BLV users can act on the feedback in time, whether haptic and TTS patterns are interpretable under the cognitive load of crossing a live road, or whether holding up a phone is feasible alongside a white cane. The system also degrades in low light and rain, and 2D bounding-box heuristics give no real distance estimate.

Tags: blind and low vision · pedestrian navigation · mobile accessibility · computer vision · haptic feedback · object detection · Global South accessibility · assistive technology · Android