
Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN

Publication Type : Journal Article

Publisher : International Journal of Speech Technology

Source : International Journal of Speech Technology, Springer, Volume 24, Issue 2, p. 303-314 (2021)

Url : https://doi.org/10.1007/s10772-020-09792-x

Campus : Bengaluru

School : Department of Computer Science and Engineering, School of Engineering

Department : Computer Science

Year : 2021

Abstract : Emotions play a significant role in human life, and recognizing them from speech depends on identifying the emotional features of speech signals. Speech Emotion Recognition (SER) has applications in education, health, forensics, defense, robotics, and scientific research. However, SER is limited by data labeling, misinterpretation of speech, annotation of audio, and time complexity. This work evaluates SER using features extracted as Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC) to study emotions in different versions of audio signals. In the feature extraction stage, the sound signals are segmented by extracting and parametrizing each frequency call using MFCC, GFCC, and combined features (M-GFCC). Building on recent advances in deep learning, this paper proposes a Deep Convolutional-Recurrent Neural Network (Deep C-RNN) approach to learn and classify emotion variations in the classification stage. A fusion of Mel–Gammatone filters in the convolutional layers first extracts high-level spectral features; recurrent layers are then adopted to learn the long-term temporal context from those high-level features. The proposed work also differentiates emotions from neutral speech, with suitable binary tree diagrams as illustration. The methodology is applied to a large dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Finally, the proposed approach, which obtains an accuracy of more than 80% with lower loss, is compared with state-of-the-art approaches, and the experimental results provide evidence that the fused features perform better at recognizing emotions from speech signals.
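
The abstract does not say how the combined M-GFCC features are formed. One common reading is frame-wise concatenation of the two coefficient vectors, and the sketch below illustrates only that assumption. The function names `fuse_frames` and `standardize` and the toy values are hypothetical and are not taken from the paper; the MFCC and GFCC extraction itself is assumed to have happened upstream.

```python
from statistics import mean, pstdev


def fuse_frames(mfcc, gfcc):
    """Concatenate per-frame MFCC and GFCC vectors (assumed fusion rule)."""
    if len(mfcc) != len(gfcc):
        raise ValueError("MFCC and GFCC must cover the same number of frames")
    # Each fused frame holds the MFCC coefficients followed by the GFCC ones.
    return [list(m) + list(g) for m, g in zip(mfcc, gfcc)]


def standardize(frames):
    """Scale each coefficient to zero mean and unit variance across frames."""
    columns = list(zip(*frames))
    means = [mean(col) for col in columns]
    # A constant coefficient has zero deviation; divide by 1.0 in that case.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mu) / sd for value, mu, sd in zip(row, means, stds)]
        for row in frames
    ]


if __name__ == "__main__":
    # Toy input: 2 frames, 2 MFCC and 1 GFCC coefficient per frame.
    toy_mfcc = [[1.0, 2.0], [3.0, 4.0]]
    toy_gfcc = [[10.0], [20.0]]
    fused = fuse_frames(toy_mfcc, toy_gfcc)
    print(standardize(fused))
```

Standardizing after fusion keeps coefficients from the two filterbanks on a comparable scale before they reach a classifier; whether the paper applies such a step is not stated in the abstract.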

Cite this Research Publication : Kumaran U., S. Rammohan, R., Nagarajan, S. Murugan, and Prathik, A., “Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN”, International Journal of Speech Technology, vol. 24, no. 2, pp. 303 - 314, 2021.
