HapticLens: Interactive Vibrotactile Haptic Generation from Spatially Localized Video Motion

Kevin John, Hasti Seifi · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3790269

Summary

HapticLens is an interactive method for generating single-actuator vibrotactile signals from arbitrary video content by letting a designer select a spatial region of interest and converting motion within that region into a vibration waveform in real time. The authors motivate the work by noting that prior video-to-haptics approaches depend on specific inputs (e.g., first-person camera motion, predefined actions like a basketball being caught) and on spatial haptic hardware such as motion chairs and vests that very few people own. By targeting the vibrotactile actuator found in almost every smartphone, VR controller, and game controller, HapticLens shifts video-to-haptics from a lab-hardware niche toward commodity devices. The pipeline has three steps: a GPU-accelerated vision algorithm computes a dynamic visual feature over a spatiotemporal volume; the user-selected subvolume is averaged to a one-dimensional feature signal; that signal is then upsampled to 2000 Hz and used to drive both the amplitude envelope and a 220 Hz carrier through first-order frequency modulation. Two complementary vision algorithms are implemented and released as open source on GitHub: a Phase-Based method that adapts Zhang et al.'s phase-acceleration motion magnification to detect sub-pixel motion, and a Saliency-Based method that adapts Kim et al.'s real-time spatiotemporal saliency estimation. The authors evaluate the system through a motion sensitivity experiment on 120 synthetic videos; robustness tests under white noise, H.264 compression, and resolution reduction on 50 MotionBench clips; runtime benchmarks; and a within-subjects user study in which 22 novice participants designed haptic feedback for five diverse real-world clips (handgun firing, drone hovering, Formula 1 race, Ace Combat dogfight, archery shot) using both algorithms.
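The last two pipeline steps (region averaging, then upsampling and modulation) can be illustrated with a minimal NumPy sketch. This is not the authors' released code: the function names `roi_mean` and `feature_to_vibration`, the min-max normalization, the linear interpolation, and the `fm_depth_hz` parameter are all assumptions made for illustration. Only the 2000 Hz output rate and the 220 Hz carrier come from the summary above.

```python
import numpy as np

def roi_mean(volume, x0, y0, x1, y1):
    """Average a (frames, height, width) feature volume over a rectangular region.

    Returns one value per frame, i.e. a 1-D feature signal.
    """
    return volume[:, y0:y1, x0:x1].mean(axis=(1, 2))

def feature_to_vibration(feature, fps=30.0, fs=2000, carrier_hz=220.0, fm_depth_hz=40.0):
    """Turn a per-frame feature signal into a vibration waveform.

    fm_depth_hz (how far the feature shifts the carrier frequency) is a
    hypothetical parameter; the paper's exact mapping may differ.
    """
    feature = np.asarray(feature, dtype=float)

    # Assumed normalization: rescale the feature to [0, 1].
    span = feature.max() - feature.min()
    norm = (feature - feature.min()) / span if span > 0 else np.zeros_like(feature)

    # Upsample from the video frame rate to the output rate by linear interpolation.
    n_out = int(round(len(feature) / fps * fs))
    t = np.arange(n_out) / fs
    frame_times = np.arange(len(feature)) / fps
    envelope = np.interp(t, frame_times, norm)

    # The same signal drives the amplitude and the instantaneous frequency.
    inst_freq = carrier_hz + fm_depth_hz * envelope
    phase = 2.0 * np.pi * np.cumsum(inst_freq) / fs  # integrate frequency to get phase
    return envelope * np.sin(phase)
```

For a 30-frame clip at 30 fps, `feature_to_vibration(roi_mean(volume, x0, y0, x1, y1))` yields 2000 samples whose amplitude rises and falls with the motion feature inside the selected region.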

Key findings

Phase-Based extraction tracked ground-truth motion best (cosine similarity ~0.84 with acceleration, ~0.82 with velocity, ~0.73 with displacement), while Saliency-Based aligned most closely with displacement (~0.87) but was noisier under fine fluctuations. Saliency-Based proved more robust to white noise and performed better on videos with complex camera motion, whereas Phase-Based was more robust to resolution loss (holding >20 dB SNR down to roughly 0.5x resolution) and to low-level video compression. Both algorithms ran comfortably at interactive rates: vibration generation stayed under ~8 ms per frame (equivalent to more than ~125 fps) irrespective of region size, and end-to-end controller latency was under 20 ms, well below the 50-100 ms visual-haptic asynchrony threshold. In the user study, both algorithms received high ratings, but Phase-Based was rated significantly higher for Overall Quality (mean 87.73 vs 82.56 on a 0-100 scale, p = .021), with no significant difference for Relevance to Video. Fourteen of 22 participants preferred Phase-Based, 7 preferred Saliency-Based, and 1 had no preference; designers who preferred Phase-Based cited its 'crisper' feel, while those preferring Saliency-Based cited its greater predictability. Most participants completed each design in under 90 seconds, indicating that novices with no haptic background could produce satisfying signals quickly through the interactive region-selection workflow.
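The two metrics quoted above, cosine similarity against ground-truth motion and SNR in dB under degradation, can be computed along the following lines. This is a sketch under assumptions: the summary does not say whether the authors mean-center or rectify signals before comparing them, nor which signal serves as the SNR reference, so `snr_db` here treats the signal extracted from the clean video as the reference. Both function names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length 1-D signals."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def snr_db(reference, degraded):
    """Signal-to-noise ratio in dB, with `reference` as the clean signal (assumed)."""
    reference = np.asarray(reference, dtype=float)
    noise = np.asarray(degraded, dtype=float) - reference
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        return float("inf")
    return float(10.0 * np.log10(np.mean(reference ** 2) / noise_power))
```

Under this definition, the 20 dB level cited for Phase-Based corresponds to the degradation-induced error carrying 1% of the reference signal's power.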

Relevance

HapticLens is not framed as an accessibility paper, but its contribution matters directly to blind and low-vision access to visual media. Today, non-visual access to video relies almost entirely on audio description, which is labor-intensive to produce, frequently missing, and unable to convey texture, rhythm, or the physicality of on-screen motion. By showing that a commodity vibrotactile actuator driven by motion extracted from arbitrary video can produce perceptually meaningful, content-aware vibrations that novices rate highly, the paper opens a practical path to haptic transcoding of visual content for BLV audiences on phones and game controllers they already own. Limitations relevant to accessibility practice: the signal is perceptually grounded rather than physically accurate, 30 fps input caps faithful high-frequency reconstruction at ~15 Hz, region selection is currently manual, and the study did not include BLV participants. Extending this workflow with automatic region selection, BLV user evaluation, and integration with audio description would be the natural next steps for accessibility research.
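The ~15 Hz ceiling mentioned among the limitations follows from the Nyquist limit of a feature signal sampled once per video frame. As a short worked check (the symbols f_s and f_max are notation introduced here, not taken from the paper):

```latex
f_{\max} = \frac{f_s}{2} = \frac{30\ \text{Hz}}{2} = 15\ \text{Hz}
```

Motion components faster than this alias in the per-frame feature signal, so upsampling to 2000 Hz smooths the signal but cannot recover them. The 220 Hz carrier is synthesized, not measured from the video.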

Tags: haptics · vibrotactile feedback · video-to-haptics · computer vision · multimodal interaction · sensory substitution · blind and low vision · multimedia accessibility · haptic design · non-visual access