DiG-Net: Enhancing Human–Robot Interaction through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics
Eran Bamani Beeri, Eden Nissinman, Avishai Sintov · 2026 · ACM Transactions on Human-Robot Interaction · doi:10.1145/3803864
Summary
DiG-Net (Distance-aware Gesture Network) addresses a fundamental limitation in gesture-controlled assistive robotics: existing dynamic gesture recognition systems work reliably only within about seven metres of the camera, severely constraining their usefulness in real-world environments such as large homes, rehabilitation gyms, warehouses, or outdoor spaces. For people with mobility impairments who rely on gesture-based control of assistive robots, this range restriction is a direct accessibility barrier. The authors propose a framework that combines three technical modules. A Depth-Conditioned Deformable Alignment (DADA) block compensates for the visual degradation — reduced resolution, defocus blur, and physical signal attenuation — that occurs when a person is far from the camera. Spatio-Temporal Graph (STG) modules model local motion dynamics across frames. A Graph Transformer encoder then captures long-range temporal dependencies, linking early and late phases of a gesture to resolve ambiguity. Only a standard RGB camera is required; no depth camera or wearable sensor is needed, which is a deliberate design choice to keep the system accessible and deployable without specialist hardware. A novel loss function, RSTDAL (Radiometric Spatio-Temporal Depth Attenuation Loss), incorporates physical models of signal attenuation (Beer-Lambert law) and optical defocus to adaptively penalise misclassification of gestures at greater distances, pushing the model to develop distance-robust representations. The gesture vocabulary covers 13 classes: eight dynamic (go-back, go-up, go-down, move-right, move-left, turn-around, beckoning, follow-me) and four static (pointing, thumbs-up, thumbs-down, stop) plus a null class. A dataset of 3,240 video samples was collected from 16 participants across indoor and outdoor environments at distances from 2 to 30 metres, augmented to 4,790 training samples. 
A 10-person user study measured human recognition performance at the same distances, providing a behavioural reference for the model.
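The summary does not give the RSTDAL formula, so the following is only a rough sketch of the general idea of distance-aware loss weighting, not the authors' loss. It assumes a Beer-Lambert transmittance of the form exp(-alpha * d) and scales an ordinary cross-entropy term by its inverse, so that errors on distant samples cost more. The function names, the coefficient `alpha`, and the inverse-transmittance weighting are all illustrative assumptions; the optical defocus term mentioned above is omitted.

```python
import math


def attenuation(distance_m, alpha=0.05):
    """Beer-Lambert transmittance: fraction of signal left after distance_m.

    alpha is a hypothetical attenuation coefficient (per metre), chosen
    only for illustration.
    """
    return math.exp(-alpha * distance_m)


def distance_weighted_cross_entropy(probs, label, distance_m, alpha=0.05):
    """Cross-entropy for one sample, scaled up as transmittance falls.

    probs      -- predicted class probabilities (sums to 1)
    label      -- index of the true class
    distance_m -- subject-to-camera distance in metres

    Assumption: the weight is 1 / transmittance. This is an illustrative
    choice, not the weighting used in the paper.
    """
    weight = 1.0 / attenuation(distance_m, alpha)
    return -weight * math.log(probs[label])
```

With this weighting, the same prediction incurs a larger loss at 30 m than at 2 m, which is the qualitative behaviour the summary attributes to RSTDAL.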
Key findings
DiG-Net achieves 97.3% recognition accuracy on the test set, 9.4 percentage points ahead of the next-best compared method, MViT, at 87.9%. On the two novel distance-specific metrics introduced by the authors, DiG-Net scored 0.92 on Distance-Weighted Accuracy (emphasising correct recognition at greater distances) and 0.96 on Gesture Stability Score (measuring frame-to-frame consistency of predictions). Inference runs at 15–25 FPS on an NVIDIA Jetson Orin Nano embedded board using full-precision (FP32) weights without compression or quantisation, confirming real-time capability on the kind of low-power hardware commonly found in mobile robots. Robustness testing shows 90.1% accuracy under severe background clutter, 91.5% in overcast outdoor conditions, and 88.3% under severe blur/fog. Ablation results indicate that each module contributes materially: removing the DADA block dropped accuracy to 88.9%, removing the Graph Transformer to 87.5%, and replacing RSTDAL with standard cross-entropy to 90.1%. The human user study provided a comparative baseline: participants achieved 84% accuracy for dynamic gestures at long range (25–30 m), while DiG-Net maintained 94.9% under equivalent conditions. Human performance on static gestures dropped sharply with distance (68% at long range), consistent with the model's design emphasis on temporal motion cues rather than fine spatial detail. New gestures can be added to the classifier with as few as 15 video examples via fine-tuning.
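The summary describes the two distance-specific metrics only in words, so the sketch below is one plausible reading, not the authors' definitions. It assumes Distance-Weighted Accuracy weights each sample in proportion to its distance, and that Gesture Stability Score is the fraction of consecutive frame pairs with the same predicted label. Both function names and both formulas are assumptions made for illustration.

```python
def distance_weighted_accuracy(correct, distances_m):
    """Accuracy in which each sample counts in proportion to its distance.

    correct     -- iterable of booleans, one per sample
    distances_m -- matching iterable of distances in metres

    Assumption: linear weighting by distance (hypothetical).
    """
    correct = list(correct)
    distances_m = list(distances_m)
    total = sum(distances_m)
    if total <= 0:
        raise ValueError("distances must sum to a positive value")
    hit = sum(d for ok, d in zip(correct, distances_m) if ok)
    return hit / total


def gesture_stability_score(frame_predictions):
    """Fraction of consecutive frame pairs that share the same label.

    Assumption: stability means agreement between adjacent frames
    (hypothetical). A clip with fewer than two frames scores 1.0.
    """
    preds = list(frame_predictions)
    if len(preds) < 2:
        return 1.0
    same = sum(a == b for a, b in zip(preds, preds[1:]))
    return same / (len(preds) - 1)
```

Under this reading, a correct sample at 10 m and a missed sample at 30 m give a Distance-Weighted Accuracy of 0.25 against a plain accuracy of 0.5, which shows how the metric penalises long-range failures more heavily.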
Relevance
DiG-Net is directly relevant to accessibility practitioners working with assistive robotics and augmentative communication for people with mobility impairments. The paper frames long-range gesture recognition as an accessibility challenge: people who rely on gestures to control robots, particularly those who cannot approach equipment closely due to motor limitations, wheelchair use, or safety constraints, are excluded by the short operational range of existing systems. By achieving reliable recognition at 30 metres using only a standard camera, DiG-Net removes hardware cost barriers (no depth camera, no wearables) and spatial barriers simultaneously. The application domains cited (home healthcare, industrial safety, and emergency response) map directly to scenarios where people with disabilities often require assistive robot support. The paper also applies user-centred design principles to AI: the user study grounds the model's performance in human perceptual data, and people with motor impairments are explicitly identified as the primary target population. Limitations include a small and demographically narrow dataset (16 participants, single geographic region, aged 25–44) and a user study with only 10 participants, which limits generalisability to diverse populations, including older adults and users with atypical motor patterns.
Tags: assistive robotics · gesture recognition · human-robot interaction · mobility impairment · accessibility · computer vision · deep learning · nonverbal communication