Multi-view Mouth Renderization for Assisting Lip-reading

Andrea Britto Mattos, Dario Augusto Borges Oliveira · 2018 · Proceedings of the 15th International Web for All Conference (W4A) · doi:10.1145/3192714.3192824

Summary

This paper presents an assistive tool that uses Generative Adversarial Networks (GANs) to enhance video for people who rely on lip-reading. The core problem is that lip-readers generally prefer a frontal view of a speaker's face, but in real-world video the speaker may be captured from any angle; in addition, certain lip gestures (like protrusion and rounding) are more visible from a profile view. The system takes unconstrained video of a speaker captured at an arbitrary angle, detects the mouth region using a face detector (DLib with HOG features), and then uses three independent Pix2Pix GANs to generate augmented views of the lips at fixed angles: a frontal view (0 degrees) and two profile views (45 and 60 degrees). These rendered mouth views are overlaid onto the original video. A key innovation is the training approach: because collecting large-scale paired datasets of real faces at multiple angles is impractical, the authors generate a synthetic training dataset using realistic 3D face models created with FaceGen software. They produced 2,550 synthetic subjects (gender-balanced and racially diverse) displaying 16 US-English visemes under varied lighting and rotation conditions, totaling 40,800 synthetic images. The approach is both speaker-independent and language-independent, since it operates on visemes (visual mouth shapes) rather than words or phonemes, and US-English visemes encompass nearly all visemes from Dutch, Portuguese, Spanish, Italian, and French.
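
The detect, crop, and render steps described above can be sketched roughly as follows. This is an illustration only, not the authors' code. It assumes the face stage yields landmarks in the common 68-point annotation scheme (where indices 48 to 67 outline the mouth), which the summary does not state. The names `mouth_crop_box`, `render_views`, and the `margin` parameter are hypothetical, and the three generators are stand-in callables rather than trained Pix2Pix models.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]

# Assumption: 68-point landmark layout, mouth at indices 48..67.
MOUTH_START, MOUTH_END = 48, 68
# Target angles named in the summary: one frontal and two profile views.
TARGET_ANGLES = (0, 45, 60)


def mouth_crop_box(landmarks: Sequence[Point], frame_w: int, frame_h: int,
                   margin: float = 0.25) -> Box:
    """Return a square (left, top, right, bottom) box around the mouth.

    The box is centred on the mouth landmarks, padded by `margin` on each
    side, and clamped to the frame boundaries.
    """
    mouth = landmarks[MOUTH_START:MOUTH_END]
    xs = [x for x, _ in mouth]
    ys = [y for _, y in mouth]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * (1 + 2 * margin)
    half = side / 2
    left = max(0, int(round(cx - half)))
    top = max(0, int(round(cy - half)))
    right = min(frame_w, int(round(cx + half)))
    bottom = min(frame_h, int(round(cy + half)))
    return left, top, right, bottom


def render_views(crop, generators: Dict[int, Callable]) -> Dict[int, object]:
    """Apply one independent generator per target angle to the same crop."""
    return {angle: generators[angle](crop) for angle in TARGET_ANGLES}
```

In a full system, each rendered view would then be overlaid on the original frame; that compositing step is omitted here.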

Key findings

The GANs trained on synthetic data transferred successfully to real video, producing visually compelling multi-view mouth renderings. Quantitative evaluation using the Structural Similarity Index (SSIM) showed that, for all three fixed angles, GAN-generated outputs were consistently closer to the ground-truth target views than the original random-angle inputs were. The frontal view (0 degrees) produced the best results, while profile views (especially 60 degrees) showed some artifacts, particularly in the teeth region, because less visual information is available at extreme angles. When tested on real video, the system tracked lip movements accurately for frontal input videos, while profile input videos showed potential but suffered several tracking failures at the 60-degree rotation. The authors deliberately addressed racial and gender bias in their synthetic dataset, a notable consideration given that benchmarks like Labeled Faces in the Wild are approximately 83% white and 78% male. The system differs fundamentally from captioning approaches: rather than converting speech to text, it enhances the visual information available to lip-readers, supporting their existing skills and autonomy rather than replacing them with a different modality.
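
To make the SSIM comparison above concrete, the sketch below computes a simplified, single-window SSIM over flattened grayscale pixel lists and checks whether a generated view scores closer to the target than the raw input does. It is a hedged illustration: standard SSIM averages the same statistic over local sliding windows, and the summary does not say which implementation or settings the authors used. The function name `global_ssim` and the toy pixel values are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized, flattened images.

    Uses the usual stabilising constants C1 = (0.01 L)^2 and
    C2 = (0.03 L)^2, where L is the pixel dynamic range.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and the same size")
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


# Toy data: the "generated" view is a slightly perturbed copy of the target,
# while the "input" view has the same pixels in a very different layout.
target = [10, 50, 90, 130, 170, 210]
generated = [12, 48, 93, 128, 171, 208]
raw_input = list(reversed(target))

improved = global_ssim(generated, target) > global_ssim(raw_input, target)
```

With real data, the same comparison would be run per target angle (0, 45, and 60 degrees) on image pairs from a held-out set.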

Relevance

This research applies deep learning to accessibility, addressing a gap in assistive tools for deaf and hard of hearing people who rely on lip-reading rather than (or in addition to) captioning. The practical implications are significant: captioning services are expensive and not always available on demand, and automatic speech recognition (ASR) systems still produce unacceptable error rates in many situations. Moreover, from an educational perspective, captioning can hinder the development of lip-reading skills, as students tend to read captions rather than practice watching the speaker. The synthetic training data approach is particularly practical, as it avoids the massive cost of collecting real paired datasets while enabling racial and gender diversity. Because processing is viseme-based and therefore language-independent, the tool could in principle work across many languages without retraining. Key limitations include the lack of user studies with deaf and hard of hearing participants and reduced quality at extreme profile angles, both of which the authors identify as future work.

Tags: lip-reading · hearing impairment · Deaf and hard of hearing · deep learning · generative adversarial networks · computer vision · assistive technology · video accessibility · synthetic data · 3D modeling · image generation · viseme · speechreading