
Speaker-independent Expressive Voice Synthesis using Learning-based Hybrid Network Model

Publication Type : Journal Article

Publisher : International Journal of Speech Technology

Source : International Journal of Speech Technology, pp. 1-17 (2020)

Url : https://link.springer.com/article/10.1007/s10772-020-09691-1

Campus : Bengaluru

School : Department of Computer Science and Engineering, School of Engineering

Department : Computer Science, Electronics and Communication

Year : 2020

Abstract : Emotional voice conversion systems formulate mapping functions that transform the neutral speech output of text-to-speech systems into speech carrying a target emotion appropriate to the context. In this work, a learning-based hybrid model is proposed for speaker-independent emotional voice conversion, combining a deep belief network (DBN-DNN) with a general regression neural network (GRNN). The main acoustic features considered for mapping are the vocal tract shape, represented by line spectral frequencies (LSF); the glottal excitation, given by the LP residual; and long-term prosodic features, viz. pitch contour and energy. The GRNN learns the transformation function between source and target LSFs. Source and target LP residuals are wavelet-transformed before DBN-DNN training; this removes phase-change-induced distortions that can degrade neural network performance when the time-domain residual is transformed directly. The low-dimensional pitch (intonation) contour is mapped by a feed-forward artificial neural network (ANN). Energy modification is achieved by applying the average transformation scale across the entire utterance. The system is tested on three datasets, viz. EmoDB (German), IITKGP (Telugu) and SAVEE (English), and the proposed model is compared with a constrained-variance GMM (CV-GMM) using objective and subjective metrics. The results show a significant improvement of 41% in RMSE (Hz) and 9.72% in Pearson's correlation coefficient for fundamental frequency (F0) in the Fear category compared with CV-GMM across all three datasets. Subjective results indicate a maximum MOS of 3.85 (Fear) and CMOS of 3.9 (Happiness) across the three datasets considered.
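A GRNN of the kind the abstract describes is, at its core, Gaussian-kernel regression over stored training pairs: each target LSF frame is predicted as a distance-weighted average of the training targets. The following is a minimal sketch, not the authors' implementation, of such a spectral mapping; the DTW-aligned frame pairs, the toy data, and the spread parameter sigma are illustrative assumptions.

```python
# Minimal GRNN sketch for LSF mapping (illustration only, not the paper's code).
# Assumes src/tgt training frames are already time-aligned (e.g. via DTW).
import numpy as np

def grnn_predict(src_lsf_train, tgt_lsf_train, src_lsf, sigma=0.05):
    """Map one neutral-speech LSF frame into the target (emotional) LSF space.

    src_lsf_train : (N, D) aligned neutral LSF frames
    tgt_lsf_train : (N, D) corresponding emotional LSF frames
    src_lsf       : (D,)   LSF frame to convert
    sigma         : Gaussian spread (smoothing) parameter -- assumed value
    """
    # Squared Euclidean distance from the input frame to every training frame
    d2 = np.sum((src_lsf_train - src_lsf) ** 2, axis=1)
    # Gaussian kernel weights (Parzen-window style)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Prediction = weighted average of stored target frames
    return w @ tgt_lsf_train / (np.sum(w) + 1e-12)

# Hypothetical usage with toy data standing in for real LSF extraction:
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, np.pi, size=(200, 10))            # neutral LSFs
Y_train = X_train + 0.05 * rng.standard_normal((200, 10))    # emotional LSFs
utterance = rng.uniform(0.0, np.pi, size=(50, 10))           # frames to convert
converted = np.array([grnn_predict(X_train, Y_train, f) for f in utterance])
print(converted.shape)  # (50, 10)
```

In the full pipeline described above, this spectral mapping would sit alongside the DBN-DNN residual model (trained on wavelet-transformed LP residuals), the ANN pitch-contour mapping, and the utterance-level energy scaling.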

Cite this Research Publication : Vekkot, S. and Gupta, D., “Speaker-independent Expressive Voice Synthesis using Learning-based Hybrid Network Model”, International Journal of Speech Technology, pp. 1-17, 2020.
