
Attention based Multi Modal Learning for Audio Visual Speech Recognition

Publication Type : Conference Paper

Publisher : IEEE

Source : 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-4, doi: 10.1109/AIST55798.2022.10065019 (IEEE Xplore)

Url : https://ieeexplore.ieee.org/document/10065019

Campus : Coimbatore

School : School of Computing

Year : 2022

Abstract : In recent years, multimodal fusion using deep learning has proliferated across tasks such as emotion recognition and speech recognition, drastically enhancing overall system performance. However, existing unimodal audio speech recognition systems struggle with ambient noise and varied pronunciations, and remain inaccessible to hearing-impaired people. To address these limitations of audio-only speech recognizers, this paper exploits an intermediate-level fusion framework that draws on multimodal information from audio as well as visual movements. We analyzed the performance of a transformer-based audio-visual model on noisy audio, assessing it across two benchmark datasets, LRS2 and GRID. Overall, we found that multimodal learning for speech yields a lower WER than other baseline systems.
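The abstract does not detail the fusion architecture, so the sketch below is a rough illustration only: one common form of attention-based intermediate fusion, in which projected audio features attend over visual features before a shared transformer encoder. All module names, feature dimensions, and the residual design are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of attention-based intermediate fusion for AVSR
# (hypothetical; not the paper's exact architecture).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=512, d_model=256, n_heads=4):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Cross-modal attention: audio frames query the visual stream.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer encoder over the fused representation.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio, visual):
        # audio:  (batch, T_a, audio_dim), e.g. log-mel filterbank frames
        # visual: (batch, T_v, visual_dim), e.g. lip-region CNN features
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        fused, _ = self.cross_attn(query=a, key=v, value=v)
        # Residual connection preserves the audio stream when the
        # visual stream is uninformative.
        return self.encoder(a + fused)

# Usage: 100 audio frames vs. 25 video frames; attention handles the
# differing frame rates by soft-aligning the two sequences.
model = AttentionFusion()
out = model(torch.randn(2, 100, 80), torch.randn(2, 25, 512))
print(out.shape)  # torch.Size([2, 100, 256])
```

In this style of intermediate fusion, each modality is still encoded separately up to the attention layer, which is what lets the model fall back on lip features when the audio is noisy.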

Cite this Research Publication : A. Kumar, D. K. Renuka, S. L. Rose and M. C. Shunmugapriya, "Attention based Multi Modal Learning for Audio Visual Speech Recognition," 2022 4th International Conference on Artificial Intelligence and Speech Technology (AIST), Delhi, India, 2022, pp. 1-4, doi: 10.1109/AIST55798.2022.10065019.
