Multi-Perspective Visual Contrastive Decoding for Reliable Assistance

Bocheng Pan, Hailong Shi, Xingyu Gao · 2026 · ACM Transactions on Internet of Things · doi:10.1145/3785360

Summary

This technical paper presents MPVCD (Multi-Perspective Visual Contrastive Decoding), a framework for improving the reliability of AI-generated visual descriptions for people who are blind or have low vision (BLV). The core problem it tackles is that when BLV users photograph their environment with smartphones or wearable IoT devices, the resulting images frequently suffer from three specific challenges: quality degradation (motion blur, poor lighting, and incorrect focus, because the user cannot visually verify the capture), object incompleteness (partial captures caused by framing difficulties), and spatial misalignment (objects too close or too distant). These image quality issues cause multimodal large language models (MLLMs) to generate hallucinated descriptions that fabricate non-existent objects, misinterpret spatial relationships, or misidentify attributes. Since BLV users cannot visually verify AI output, hallucinations pose direct safety risks.

MPVCD is built on Visual Contrastive Decoding (VCD), which compares the token probability distributions the model produces for the original image and for a transformed version of it. The insight is that the probability of a token genuinely grounded in visual content changes significantly when the image is transformed, while the probability of a hallucinated token (driven by language priors) remains stable. Three specialized modules are implemented: Noise Contrastive Decoding compares original and noise-injected images to address quality degradation; Retrieval Contrastive Decoding retrieves semantically similar images from a CLIP-encoded memory bank to provide context for incomplete objects; and Focus Contrastive Decoding uses a DINO object detector to crop and analyze detected regions, addressing spatial misalignment. The three modules are dynamically balanced through Adaptive Perspective Integration, which uses confidence-based Dynamic Weight Adjustment.

The system is deployed end-to-end across a voice-command client interface, encrypted communication middleware, and Kubernetes-orchestrated cloud infrastructure.
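To make the contrastive decoding idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used VCD-style contrast, (1 + alpha) * original - alpha * contrast, applied to per-token logits, and it treats all three perspectives uniformly as contrast views, which is a simplification. The function names, the `alpha` and `temperature` parameters, and the use of top-token probability as the confidence score are all hypothetical choices made for illustration; the summary does not specify how Dynamic Weight Adjustment computes its weights.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, contrast, alpha=1.0):
    """VCD-style contrast (assumed form): boost tokens whose score
    drops when the image is transformed, relative to the original."""
    return [(1 + alpha) * o - alpha * c for o, c in zip(original, contrast)]


def confidence(logits):
    """Hypothetical confidence score: probability of the top token."""
    return max(softmax(logits))


def integrate(original, perspectives, alpha=1.0, temperature=1.0):
    """Combine per-perspective contrasted logits with confidence-based
    weights (an assumed stand-in for Dynamic Weight Adjustment)."""
    contrasted = [contrastive_logits(original, p, alpha) for p in perspectives]
    weights = softmax([confidence(c) / temperature for c in contrasted])
    combined = [
        sum(w * c[i] for w, c in zip(weights, contrasted))
        for i in range(len(original))
    ]
    return combined, weights


if __name__ == "__main__":
    vocab = ["dog", "cat", "frisbee"]
    # Toy logits: "frisbee" is prior-driven (stable across views),
    # "dog" is visually grounded (drops when the view is transformed).
    original = [2.0, 1.0, 2.2]
    views = [
        [0.5, 1.0, 2.2],  # noise-injected view
        [1.0, 1.0, 2.0],  # retrieval-based view
        [0.8, 0.9, 2.1],  # focus (cropped) view
    ]
    combined, weights = integrate(original, views)
    print("weights:", [round(w, 3) for w in weights])
    print("chosen token:", vocab[combined.index(max(combined))])
```

In this toy example the original logits alone would favor "frisbee", while the contrasted and integrated logits favor "dog", because only the "dog" score depends on the image content. Practical VCD implementations typically also restrict the contrast to tokens that are already plausible under the original distribution; that step is omitted here for brevity.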

Key findings

MPVCD was evaluated against two baselines (InstructBLIP and Pensieve) on three datasets, each representing a different aspect of the BLV assistance problem.

On MSCOCO (ideal photography), MPVCD outperformed the baselines on all metrics: BLEU-4 of 0.420 vs 0.404 (Pensieve), CIDEr of 1.429 vs 1.377, and SPICE of 0.251 vs 0.243. This indicates that multi-perspective contrastive decoding benefits even well-composed images.

On WHOOPS (hallucination resistance, commonsense-violating images), the improvement was the most dramatic: CIDEr rose from 0.906 (Pensieve) to 1.300, and SPICE from 0.186 to 0.214. This is the most safety-critical finding, since it shows MPVCD is substantially more reliable when inputs are designed to induce hallucinations.

On VizWiz Caption (authentic blind photography), BLEU-4 improved from 0.302 to 0.316, CIDEr from 0.874 to 0.935, and SPICE from 0.159 to 0.166, confirming real-world benefits under genuine BLV photography conditions.

An ablation study on VizWiz found that Focus Contrastive Decoding contributed most (CIDEr fell by 0.048 when it was removed), followed by Retrieval (-0.032) and Noise (-0.021).

The key tradeoff is latency: MPVCD requires about 2.5x the processing time of the InstructBLIP baseline (7.3s vs 2.9s), primarily due to retrieval and object detection overhead.
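The absolute scores above are easier to compare across datasets as relative gains. The short sketch below recomputes those gains and the latency ratio from the figures reported in this summary. The data layout and function names are illustrative, and because the summary does not name the VizWiz baseline, it is treated here simply as "the baseline".

```python
# Scores transcribed from the summary: metric -> (baseline, MPVCD).
SCORES = {
    "MSCOCO": {
        "BLEU-4": (0.404, 0.420),
        "CIDEr": (1.377, 1.429),
        "SPICE": (0.243, 0.251),
    },
    "WHOOPS": {
        "CIDEr": (0.906, 1.300),
        "SPICE": (0.186, 0.214),
    },
    "VizWiz Caption": {
        "BLEU-4": (0.302, 0.316),
        "CIDEr": (0.874, 0.935),
        "SPICE": (0.159, 0.166),
    },
}

# Processing time per image in seconds, as reported in the summary.
LATENCY_SECONDS = {"InstructBLIP": 2.9, "MPVCD": 7.3}


def relative_gain(baseline, improved):
    """Percentage improvement of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


def gains_table(scores):
    """Map each dataset and metric to its relative gain in percent."""
    return {
        dataset: {
            metric: relative_gain(base, mpvcd)
            for metric, (base, mpvcd) in metrics.items()
        }
        for dataset, metrics in scores.items()
    }


def overhead_ratio(latency):
    """How many times slower MPVCD is than the InstructBLIP baseline."""
    return latency["MPVCD"] / latency["InstructBLIP"]


if __name__ == "__main__":
    for dataset, metrics in gains_table(SCORES).items():
        for metric, gain in metrics.items():
            print(f"{dataset:15s} {metric:7s} +{gain:.1f}%")
    print(f"latency overhead: {overhead_ratio(LATENCY_SECONDS):.2f}x")
```

Viewed this way, the WHOOPS CIDEr gain (roughly 43%) is several times larger than the single-digit relative gains on MSCOCO and VizWiz Caption, which is consistent with the summary's reading that hallucination resistance is where MPVCD helps most.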

Relevance

AI-generated image descriptions are increasingly central to BLV independence and are used in apps like Seeing AI, Be My Eyes AI, and smart glasses. This paper provides evidence that current MLLMs frequently hallucinate under real-world BLV photography conditions, and its results on the VizWiz dataset (built from authentic blind user photography) confirm that this gap persists in practice. For accessibility practitioners, the key implication is that description reliability cannot be assumed, and that system design must account for the specific image quality patterns that characterize BLV photography. MPVCD demonstrates a principled approach to this challenge. Its privacy-by-design features (ephemeral image storage, accessible multi-modal consent, granular user control via voice) offer a model of accessibility-first practice for IoT deployments. A significant limitation is the absence of user studies with actual BLV participants: performance gains are measured via automated metrics against sighted-annotated reference captions, which may not fully capture what BLV users find accurate or useful. The computational overhead (2.5x) also limits feasibility for on-device deployment, though cloud offloading mitigates this for connected scenarios.

Tags: blindness and low vision · multimodal AI · image captioning · visual hallucination · assistive technology · IoT · visual description · machine learning