An Empirical Study of Visual Features for Deep Learning based Audio-Visual Speech Enhancement

Abstract

Speech enhancement is the task of extracting the speech of a target speaker from background interference or noise. Recent approaches formulate speech enhancement as a supervised learning problem in which discriminative features of speech, speakers, and interference are learned during training. Whereas traditional audio-only speech enhancement approaches cannot solve the cocktail party problem, audio-visual approaches have shown remarkable success in such scenarios. In this work, we re-implemented the audio-visual multi-modal deep neural network presented in Google's Looking to Listen project. We further explored different visual features commonly discussed in the literature. Our study shows that using visual features can lead to a significant improvement over audio-only methods when enhancing the visually present speaker in a multi-talker situation. We also demonstrate that even raw lip images can be used successfully as visual features for a speaker-independent AV speech enhancement model. Our newly proposed AV-Lips model, which uses raw lip images as visual features, achieves a 6.9 dB SI-SDR gain in the single-interferer case, compared to the 7.2 dB SI-SDR gain achieved by the baseline AV-faceNy model, which uses face embeddings as the visual feature. Further analysis with multiple interferers indicates that AV models trained on a sufficiently large dataset with a single interferer can also moderately enhance the target speaker in heavily interfered noisy speech. Our study also suggests that visual features do not help in traditional speech denoising tasks; rather, corrupted or unreliable visual features can degrade the desired speech.
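The SI-SDR gains quoted above are measured with the scale-invariant signal-to-distortion ratio, which projects the estimate onto the reference so that overall gain differences do not affect the score. A minimal NumPy sketch of how this metric is typically computed (function name and the small `eps` stabilizer are our own choices, not from the thesis):

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between an enhanced signal and its reference.

    Both signals are zero-meaned, then the target is rescaled by the optimal
    projection coefficient so the metric is invariant to overall gain.
    """
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Optimal scaling of the target toward the estimate
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # the "signal" component
    noise = estimate - projection        # everything else is distortion
    return 10.0 * np.log10((np.sum(projection ** 2) + eps)
                           / (np.sum(noise ** 2) + eps))
```

An SI-SDR *gain* (often written SI-SDRi) is then simply `si_sdr(enhanced, clean) - si_sdr(mixture, clean)`, i.e. how many dB the model improves over the unprocessed noisy mixture.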

 

Comparison of different visual features against the audio-only model

GRID as clean and LRS as interfering speaker

Noisy mixture | Audio-only model | AV-faceNy (face embeddings as visual feature) | AV-Lips (raw lip images as visual feature)

LRS as clean and GRID as interfering speaker

Noisy mixture | Audio-only model | AV-faceNy (face embeddings as visual feature) | AV-Lips (raw lip images as visual feature)

More Examples

Video #1

Noisy mixture | Audio-only model | AV-faceNy (face embeddings as visual feature) | AV-Lips (raw lip images as visual feature)

Video #2

Noisy mixture | Audio-only model | AV-faceNy (face embeddings as visual feature) | AV-Lips (raw lip images as visual feature)


Please contact shetu.nitjsr13@gmail.com for further details or to request a copy of the master's thesis report.