
Hybrid CNN-BiLSTM Architecture With Multiple Attention Mechanisms To Enhance Speech Emotion Recognition

Publication Type : Journal Article

Publisher : Elsevier BV

Source : Biomedical Signal Processing and Control

Url : https://doi.org/10.1016/j.bspc.2024.106967

Keywords : SER

Campus : Amritapuri

School : School of Computing

Year : 2025

Abstract : During recent years, the concept of attention in deep learning has been increasingly used to boost the performance of Speech Emotion Recognition (SER) models. However, these models for SER exhibit shortcomings in jointly emphasizing the time-frequency and dynamic sequential variations, often under-utilizing the rich contextual emotion-related information. We propose a hybrid deep learning model for SER using Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory Networks (BiLSTM) with multiple attention mechanisms. Our model utilizes features derived from the speech waveform, namely Mel spectrograms and Mel Frequency Cepstral Coefficients (MFCC) along with their time derivatives, as input to the CNN and BiLSTM modules, respectively. A Time–Frequency Attention (TFA) mechanism, optimally incorporated into the CNN, helps to selectively focus on emotion-related energy–time–frequency variations in the Mel spectrograms. An attention-based BiLSTM uses the MFCC and its time derivatives to identify the positional information of emotion, addressing the dynamic sequential variations. Finally, we fuse the attention-learned features from the CNN and BiLSTM modules and feed them to a Deep Neural Network (DNN) for SER. The experiments were carried out using three different datasets: Emo-DB and IEMOCAP, which are public datasets, and Amritaemo_Arabic, a private dataset. The hybrid model exhibited superior performance on both the public and private datasets, achieving average SER accuracies of 94.62%, 67.85%, and 95.80% on the Emo-DB, IEMOCAP, and Amritaemo_Arabic datasets, respectively, effectively outperforming several state-of-the-art models.
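To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of the hybrid CNN-BiLSTM architecture with dual attention. All layer sizes, feature dimensions (64-bin Mel spectrograms, 120 MFCC-plus-delta coefficients, 4 emotion classes), and the exact formulations of the time-frequency and temporal attention blocks are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the hybrid CNN-BiLSTM SER model with dual attention.
# Dimensions and attention formulations are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeFrequencyAttention(nn.Module):
    """Simplified TFA: separate attention weights over the frequency and time
    axes of a CNN feature map, used to rescale the map."""
    def __init__(self, channels):
        super().__init__()
        self.freq_fc = nn.Conv2d(channels, channels, kernel_size=1)
        self.time_fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, F, T)
        freq_ctx = x.mean(dim=3, keepdim=True)   # pool over time -> (B, C, F, 1)
        time_ctx = x.mean(dim=2, keepdim=True)   # pool over freq -> (B, C, 1, T)
        freq_att = torch.sigmoid(self.freq_fc(freq_ctx))
        time_att = torch.sigmoid(self.time_fc(time_ctx))
        return x * freq_att * time_att


class CNNBranch(nn.Module):
    """CNN over Mel spectrograms with time-frequency attention."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.tfa = TimeFrequencyAttention(64)
        self.proj = nn.Linear(64, out_dim)

    def forward(self, mel):                      # mel: (B, 1, F, T)
        h = self.tfa(self.conv(mel))             # attention-weighted feature map
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)
        return self.proj(h)                      # (B, out_dim)


class BiLSTMBranch(nn.Module):
    """BiLSTM over MFCC (+ delta) frames with additive temporal attention."""
    def __init__(self, n_feats=120, hidden=64, out_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)      # scores each time step
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, mfcc):                     # mfcc: (B, T, n_feats)
        h, _ = self.lstm(mfcc)                   # (B, T, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)    # per-frame attention weights
        ctx = (w * h).sum(dim=1)                 # weighted sum over time
        return self.proj(ctx)                    # (B, out_dim)


class HybridSER(nn.Module):
    """Fuses the attention-learned CNN and BiLSTM features in a small DNN."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.cnn = CNNBranch()
        self.rnn = BiLSTMBranch()
        self.dnn = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

    def forward(self, mel, mfcc):
        fused = torch.cat([self.cnn(mel), self.rnn(mfcc)], dim=1)
        return self.dnn(fused)                   # emotion logits


if __name__ == "__main__":
    model = HybridSER(n_classes=4)
    mel = torch.randn(2, 1, 64, 300)             # batch of Mel spectrograms
    mfcc = torch.randn(2, 300, 120)              # MFCC + delta + delta-delta frames
    print(model(mel, mfcc).shape)                # torch.Size([2, 4])
```

In this sketch the two feature streams are processed independently, each with its own attention mechanism, and only the attention-pooled embeddings are concatenated before the DNN classifier, mirroring the late-fusion strategy described in the abstract.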

Cite this Research Publication : Poorna S.S., Vivek Menon, Sundararaman Gopalan, Hybrid CNN-BiLSTM architecture with multiple attention mechanisms to enhance speech emotion recognition, Biomedical Signal Processing and Control, Elsevier BV, 2025, https://doi.org/10.1016/j.bspc.2024.106967
