Qualification: Ph.D.
d_govind@cb.amrita.edu

Dr. Govind D. joined the Center for Computational Engineering & Networking at Amrita School of Engineering, Amritanagar, Coimbatore, as Assistant Professor in August 2012. He completed his PhD at the Indian Institute of Technology Guwahati, and his core area of research is speech signal processing. He also worked as Project Lead on the UK-India Education and Research Initiative project (2007-2011) titled “Study of source features for speech synthesis and speaker recognition”, a collaboration between IIT Guwahati and the University of Edinburgh. He is currently investigating the ongoing DST-sponsored project titled “Analysis, Processing and Synthesis of Emotions in Speech” at Amrita Vishwa Vidyapeetham, Coimbatore. He has more than 25 research publications in reputed conferences and journals, including prestigious speech processing venues such as INTERSPEECH and SPEECH PROSODY, organized by the International Speech Communication Association (ISCA).

Achievements

Dr. Govind secured an outstanding grade of "AS" in the MHRD-sponsored Global Initiative of Academic Networks (GIAN) course on the topic "Advanced Sinusoidal Modelling of Speech and Applications", organized at Indian Institute of Technology Guwahati during December 2016. The course was handled by Prof. Yannis Stylianou from the University of Crete, Greece.

Sponsored Projects

Title: Analysis, processing and synthesis of emotions in speech
Funding Agency: Science and Engineering Research Board (SERB), DST, New Delhi
Duration: 3 years (Feb 2013 - Dec 2016)

PhD students

Completed PhD Students

  1. Dr. Sowmya V. (working in the area of color image processing)
    PhD thesis title: Significance of Incorporating Chrominance Information
    for Scene Classification (PhD defended successfully on 6/7/2018)
  2. Dr. Pravena D. (working in the area of speech emotion recognition)
    PhD thesis title: Significance of Incorporating Excitation Source Features for Improved
    Speech Emotion Recognition (thesis submitted in March 2018)

Ongoing PhD students

  • Divya Pankaj (Co-guidance with Prof. K. A. Narayanankutty) (Area: Image Processing)
  • B. Ganga Gowri (Co-guidance with Prof. K. P. Soman) (Area: Speech Processing)
  • G. Jyothish Lal (Co-guidance with Dr. E. A. Gopalakrishnan) (Area: Speech Processing)

Publications

Publication Type: Journal Article


J. G. Lal, Dr. E. A. Gopalakrishnan, and Dr. Govind D., “Accurate Estimation of Glottal Closure Instants and Glottal Opening Instants from Electroglottographic Signal Using Variational Mode Decomposition”, Circuits, Systems, and Signal Processing, vol. 37, pp. 810–830, 2018.

Abstract: The objective of the proposed work is to accurately estimate the glottal closure instants (GCIs) and glottal opening instants (GOIs) from electroglottographic (EGG) signals. This work also addresses the issues with existing EGG-based GCI/GOI detection methods. GCIs are the instants at which excitation to the vocal tract is maximum; GOIs, on the other hand, have minimum excitation compared to GCIs. Both instants recur in each glottal cycle at the fundamental frequency defined for that cycle of a given EGG signal. Accurate detection of these instants from the EGG signal is essential for the performance evaluation of GCIs and GOIs estimated directly from the speech signal. This work proposes a new method for accurate detection of GCIs and GOIs from the EGG signal using the variational mode decomposition (VMD) algorithm. The EGG signal is decomposed into sub-signals using the VMD algorithm. It is shown that VMD captures, through one of its modes, a center frequency close to the fundamental frequency of the EGG signal, and this property of the corresponding mode helps to estimate GCIs and GOIs from it. Besides, the instantaneous pitch frequency is estimated from the obtained GCIs. The proposed method has been evaluated on the CMU-Arctic database for GCI/GOI estimation and the Keele pitch extraction reference database for instantaneous pitch frequency estimation. The effectiveness of the proposed method is confirmed by comparison with state-of-the-art methods. Experimental results show that the proposed method has better accuracy and identification rate compared to state-of-the-art methods.

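To make the mode-selection idea above concrete, here is a minimal sketch of GCI detection from a short EGG segment. It assumes the third-party vmdpy package for VMD; the parameter values, the band-based mode selection and the peak-picking thresholds are illustrative assumptions rather than the paper's exact procedure.

    import numpy as np
    from scipy.signal import find_peaks
    from vmdpy import VMD  # third-party VMD implementation, assumed installed

    def gci_from_egg(egg, fs, f0_lo=60.0, f0_hi=400.0, K=4):
        """Illustrative GCI detection from an EGG segment via VMD mode selection."""
        egg = np.asarray(egg, dtype=float)
        egg = egg[:len(egg) - len(egg) % 2]   # vmdpy prefers even-length input
        # Decompose into K modes; omega holds the normalized center frequencies
        # per iteration, so the last row gives the converged values.
        modes, _, omega = VMD(egg, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)
        fc = omega[-1] * fs                   # final center frequencies in Hz
        # Pick the mode whose center frequency lies nearest the expected F0 region.
        idx = int(np.argmin(np.abs(fc - np.sqrt(f0_lo * f0_hi))))
        mode = modes[idx]
        # GCIs appear as sharp negative slopes of the EGG: peaks of the negated derivative.
        degg = -np.diff(mode)
        peaks, _ = find_peaks(degg, distance=int(fs / f0_hi), height=0.3 * degg.max())
        return peaks                          # sample indices of GCI candidates
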
J. G. Lal, Dr. E. A. Gopalakrishnan, and Dr. Govind D., “Epoch Estimation from Emotional Speech Signals Using Variational Mode Decomposition”, Circuits, Systems, and Signal Processing, vol. 37, pp. 3245–3274, 2018.

Abstract: This paper presents a novel approach for the estimation of epochs from the emotional speech signal. Epochs are the locations of significant excitation of the vocal tract during the production of voiced sound by the vibration of the vocal folds. The estimation of epoch locations is essential for deriving instantaneous pitch contours for accurate emotion analysis. Many well-known algorithms for epoch extraction show degraded performance due to the varying nature of excitation characteristics in the emotional speech signal. The proposed approach exploits the effectiveness of a new adaptive time series decomposition technique called variational mode decomposition (VMD) for the estimation of epochs. The VMD algorithm is applied to the emotional speech signal to decompose it into various sub-signals. Analysis of these signals shows that the VMD algorithm captures, through its modes, a center frequency close to the fundamental frequency defined for each glottal cycle of the emotional speech utterance. This center frequency characteristic of the corresponding mode signal helps in the accurate estimation of epoch locations from the emotional speech signal. The performance evaluation of the proposed method is carried out on six different emotions taken from the German emotional speech database with simultaneous electroglottographic signals. Experimental results on clean emotive speech signals show that the proposed method provides identification rate and accuracy comparable to those of the best-performing algorithm. Besides, the proposed method provides better reliability in epoch estimation from emotive speech signals degraded by the presence of noise.

D. Pravena and Dr. Govind D., “Development of simulated emotion speech database for excitation source analysis”, International Journal of Speech Technology, pp. 1-12, 2017.

Abstract: The work presented in this paper is focused on the development of a simulated emotion database, particularly for excitation source analysis. The presence of simultaneous electroglottogram (EGG) recordings for each emotion utterance helps to accurately analyze the variations in the source parameters across different emotions. The paper describes the development of a comparatively large simulated emotion database for three emotions (Anger, Happy and Sad), along with neutrally spoken utterances, in three languages (Tamil, Malayalam and Indian English). Emotion utterances in each language are recorded from 10 speakers, in multiple sessions for Tamil and Malayalam. Unlike the existing simulated emotion databases, emotionally biased utterances are used for recording instead of emotionally neutral ones. Based on the emotion recognition experiments, the emotions elicited from emotionally biased utterances are found to show more emotion discrimination than those from emotionally neutral utterances. Based on comparative experimental analysis, the speech and EGG utterances of the proposed simulated emotion database are also found to preserve the general trend in the excitation source characteristics (instantaneous F0 and strength of excitation parameters) for different emotions, as in the classical German emotion speech-EGG database (EmoDb). Finally, the emotion recognition rates obtained for the proposed speech-EGG emotion database, using a conventional mel frequency cepstral coefficients and Gaussian mixture model based emotion recognition system, are found to be comparable with those of the existing German (EmoDb) and IITKGP-SESC Telugu speech emotion databases.

D. Pravena and Dr. Govind D., “Significance of incorporating excitation source parameters for improved emotion recognition from speech and electroglottographic signals”, International Journal of Speech Technology, vol. 20, pp. 787–797, 2017.

Abstract: The work presented in this paper explores the effectiveness of incorporating the excitation source parameters, the strength of excitation and the instantaneous fundamental frequency (F0), for the emotion recognition task from speech and electroglottographic (EGG) signals. The strength of excitation (SoE) is an important parameter indicating the pressure with which the glottis closes at the glottal closure instants (GCIs). The SoE is computed by the popular zero frequency filtering (ZFF) method, which accurately estimates the glottal signal characteristics by attenuating or removing the high frequency vocal tract interactions in speech. The arbitrary impulse sequence, obtained from the estimated GCIs, is used to derive the instantaneous F0. The SoE and instantaneous F0 parameters are combined with the conventional mel frequency cepstral coefficients (MFCC) to improve the recognition rates of distinct emotions (Anger, Happy and Sad) using Gaussian mixture models as the classifier. The performance of the proposed combination of SoE, instantaneous F0 and their dynamic features with MFCC coefficients is evaluated on emotion utterances (4 emotions and neutral) from the classical German full blown emotion speech database (EmoDb), having simultaneous speech and EGG signals, and on the Surrey Audio Visual Expressed Emotion database (3 emotions and neutral), for both speaker dependent and speaker independent emotion recognition scenarios. To reinforce the effectiveness of the proposed features and for better statistical consistency of the emotion analysis, a fairly large emotion speech database of 220 utterances per emotion in the Tamil language, with simultaneous EGG recordings, is used in addition to EmoDb. The effectiveness of SoE and instantaneous F0 in characterizing different emotions is also confirmed by the improved emotion recognition performance on the Tamil speech-EGG emotion database.

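As a reference point for the baseline this paper improves on, a minimal MFCC-GMM emotion recognizer can be sketched as below, assuming librosa and scikit-learn. The file lists are placeholders, and column-stacking frame-level F0/SoE values onto the MFCC matrix is where the excitation source features described above would enter.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_feats(path, sr=16000, n_mfcc=13):
        """Frame-level MFCCs, shape (frames, n_mfcc); excitation features (F0, SoE)
        would be column-stacked here in the augmented systems described above."""
        y, _ = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    def train_models(train_files, n_components=32):
        """train_files: dict mapping emotion label -> list of wav paths (placeholder)."""
        models = {}
        for emotion, paths in train_files.items():
            X = np.vstack([mfcc_feats(p) for p in paths])
            models[emotion] = GaussianMixture(n_components, covariance_type="diag").fit(X)
        return models

    def classify(path, models):
        """Pick the emotion whose GMM gives the highest average frame log-likelihood."""
        X = mfcc_feats(path)
        return max(models, key=lambda e: models[e].score(X))
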
Sowmya V., Dr. Govind D., and Soman, K. Padanyl, “Significance of perceptually relevant image decolorization for scene classification”, Journal of Electronic Imaging, 2017.

Abstract: A color image contains luminance and chrominance components representing the intensity and color information, respectively. The objective of the work presented in this paper is to show the significance of incorporating the chrominance information for the task of scene classification. An improved color-to-grayscale image conversion algorithm that effectively incorporates the chrominance information is proposed using the color-to-gray structure similarity index (C2G-SSIM) and singular value decomposition (SVD) to improve the perceptual quality of the converted grayscale images. The experimental result analysis, based on the image quality assessment for image decolorization (C2G-SSIM) and the success rate (Cadik and COLOR250 datasets), shows that the proposed image decolorization technique performs better than 8 existing benchmark algorithms for image decolorization. In the second part of the paper, the effectiveness of incorporating the chrominance component in the scene classification task is demonstrated using a deep belief network (DBN) based image classification system developed using dense scale invariant feature transform (SIFT) features. The level of chrominance information incorporated by the proposed image decolorization technique is confirmed by the improvement in the overall scene classification accuracy. The overall scene classification performance is also improved by combining the models obtained using the proposed and the conventional decolorization methods.

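As one concrete reading of the SVD-based decolorization mentioned above (a hedged illustration, not the paper's exact formulation), the dominant right singular vector of the N x 3 pixel matrix can supply data-driven channel weights for the color-to-grayscale mapping:

    import numpy as np

    def svd_decolorize(rgb):
        """rgb: H x W x 3 float array in [0, 1]; returns an H x W grayscale image."""
        h, w, _ = rgb.shape
        pixels = rgb.reshape(-1, 3)                  # N x 3 pixel matrix
        # The dominant right singular vector points along the direction of
        # maximum variance across the R, G, B channels.
        _, _, vt = np.linalg.svd(pixels - pixels.mean(axis=0), full_matrices=False)
        weights = np.abs(vt[0])
        weights /= weights.sum()                     # normalize weights to sum to 1
        return (pixels @ weights).reshape(h, w)
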
R. Surya, Ashwini, R., Pravena, D., and Dr. Govind D., “Issues in formant analysis of emotive speech using vowel-like region onset points”, Advances in Intelligent Systems and Computing, vol. 384, pp. 139-146, 2016.

Abstract: Emotions carry crucial extra-linguistic information in speech. A preliminary study on the significance of, and issues in, processing emotive speech anchored around vowel-like region onset points (VLROPs) is presented in this paper. The onset of each vowel-like region (VLR) in a speech signal is termed the VLROP. VLROPs are estimated by exploiting the impulse-like characteristics in the excitation components of speech signals. The work also identifies the issue of false estimation of VLROPs in emotional speech. Despite the falsely estimated VLROPs, the formant based vocal tract characteristics are analyzed around the correctly estimated VLROPs from the emotional speech. The VLROPs retained for the emotion analysis are selected from those syllables which have uniquely estimated VLROPs, without false detection, from each emotion of the same text and speaker. Based on the formant analysis performed around the VLROPs, there are significant variations in the locations of the formant frequencies for the emotion utterances with respect to the neutral speech utterances. This paper presents a formant frequency analysis performed on 20 syllables selected from 10 texts and 10 speakers across 4 emotions (Anger, Happy, Fear and Boredom) and neutral speech signals of the German emotion speech database. The experiments presented in this paper suggest, firstly, the need for devising a new robust VLROP estimation method for emotional speech and, secondly, the need for further exploring the formant characteristics for emotion speech analysis.

Dr. Govind D. and Joy, T. T., “Improving the Flexibility of Dynamic Prosody Modification Using Instants of Significant Excitation”, International Journal of Circuits, Systems, and Signal Processing, pp. 1-26, 2015.

Abstract: Modification of supra-segmental features such as the pitch and duration of original speech by fixed scaling factors is referred to as static prosody modification. In dynamic prosody modification, the prosodic scaling factors (time-varying modification factors) are defined for all the pitch cycles present in the original speech. The present work is focused on improving the naturalness of the prosody modified speech by reducing the generation of piecewise constant segments in the modified pitch contour. The prosody modification is performed by anchoring around the accurate instants of significant excitation estimated from the original speech. The division of longer pitch intervals into many equal intervals over long speech segments introduces step-like discontinuities, in the form of piecewise constant segments, in the modified pitch contours. The effectiveness of the proposed dynamic modification method is initially confirmed from the smooth modified pitch contour plots obtained for finer static prosody scaling factors, waveforms, spectrogram plots and comparative subjective evaluations. The average F0 jitter, computed from the pitch segments of each glottal activity region in the modified speech, is also proposed as an objective measure for prosody modification. The naturalness of the prosody modified speech using the proposed method is objectively and subjectively compared with that of the existing zero frequency filtered signal based dynamic prosody modification. The proposed algorithm also effectively preserves the dynamics of the prosodic patterns in singing voices, wherein the F0 parameters fluctuate rapidly and continuously within a higher F0 range.

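The epoch-anchored modification described above can be illustrated with a toy routine that rescales successive epoch intervals by a time-varying pitch factor; the names and the per-interval scaling convention are assumptions for illustration, not the paper's exact algorithm.

    import numpy as np

    def modified_epochs(epochs, scale):
        """epochs: sample indices of estimated epochs (one per pitch cycle).
        scale: pitch scale factor per epoch interval;
               interval_new = interval_old / scale, so scale > 1 raises pitch."""
        intervals = np.diff(epochs).astype(float)
        new_intervals = intervals / np.asarray(scale, dtype=float)
        # Cumulatively place new epochs starting from the first original epoch.
        new_epochs = np.concatenate(([epochs[0]], epochs[0] + np.cumsum(new_intervals)))
        return new_epochs.round().astype(int)

    # Example: gradually raise pitch by up to 20% over a synthetic utterance.
    epochs = np.arange(0, 16000, 160)               # ~100 Hz pitch at 16 kHz
    scale = np.linspace(1.0, 1.2, len(epochs) - 1)  # time-varying scale factors
    print(modified_epochs(epochs, scale)[:5])
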
D. Pravena and Dr. Govind D., “Expressive Speech Synthesis: A Review”, International Journal of Speech Technology, vol. 16, pp. 237–260, 2013.

Abstract: The objective of the present work is to provide a detailed review of expressive speech synthesis (ESS). Among the various approaches for ESS, the present paper focuses on the development of ESS systems by explicit control. In this approach, ESS is achieved by modifying the parameters of the neutral speech which is synthesized from text. The paper reviews the works addressing various issues related to the development of ESS systems by explicit control, including the various approaches for text-to-speech synthesis, studies on the analysis and estimation of expressive parameters, and studies on methods to incorporate expressive parameters. Finally, the review concludes by mentioning the scope of future work for ESS by explicit control.

Dr. Govind D. and Prasanna, S. R. Mahadeva, “Dynamic prosody modification using zero frequency filtered signal”, International Journal of Speech Technology, vol. 16, pp. 41–54, 2013.

Abstract: Modifying prosody parameters like the pitch, duration and strength of excitation by a desired factor is termed prosody modification. The objective of this work is to develop a dynamic prosody modification method based on the zero frequency filtered signal (ZFFS), a byproduct of zero frequency filtering (ZFF). The existing epoch based prosody modification techniques use epochs as pitch markers, and the required prosody modification is achieved by the interpolation of the epoch intervals plot. Alternatively, this work proposes a method for prosody modification by the resampling of the ZFFS. The existing epoch based prosody modification method is also further refined for modifying the prosodic parameters at every epoch level, thus providing more flexibility for prosody modification. The general framework for deriving the modified epoch locations can also be used for obtaining dynamic prosody modification from the existing PSOLA and epoch based prosody modification methods. The quality of the prosody modified speech is evaluated using waveforms, spectrograms and subjective studies. The usefulness of the proposed dynamic prosody modification is demonstrated for the neutral to emotional conversion task. The subjective evaluations performed for the emotion conversion indicate the effectiveness of dynamic prosody modification over fixed prosody modification for emotion conversion. The dynamic prosody modified speech files synthesized using the proposed, epoch based and TD-PSOLA methods are available at http://www.iitg.ac.in/eee/emstlab/demos/demo5.php.

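The zero frequency filtering step underlying the ZFFS can be sketched directly from its published formulation: pass the differenced speech twice through a resonator at 0 Hz and repeatedly subtract a local mean computed over roughly one to two average pitch periods; positive-going zero crossings of the result approximate the epochs. A minimal numpy/scipy version, with the window length left as a tunable assumption:

    import numpy as np
    from scipy.signal import lfilter

    def zff_epochs(x, fs, win_ms=15.0):
        """Zero frequency filtering: returns (ZFF signal, epoch sample indices)."""
        x = np.diff(np.asarray(x, dtype=float))      # remove any slowly varying DC bias
        # Cascade of two zero frequency resonators: y[n] = x[n] + 2y[n-1] - y[n-2].
        y = lfilter([1.0], [1.0, -2.0, 1.0], x)
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)
        # Repeated trend removal with a local-mean window of ~1-2 pitch periods.
        w = int(fs * win_ms / 1000) | 1              # odd window length in samples
        kernel = np.ones(w) / w
        zffs = y
        for _ in range(3):
            zffs = zffs - np.convolve(zffs, kernel, mode="same")
        # Epochs are the positive-going zero crossings of the ZFF signal.
        epochs = np.where((zffs[:-1] < 0) & (zffs[1:] >= 0))[0]
        return zffs, epochs
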

Dr. Govind D., “Epoch based dynamic prosody modification for neutral to expressive conversion”, PhD thesis, Indian Institute of Technology Guwahati, 2013.

Abstract: The objective of this thesis is to address the issues in the analysis, estimation and incorporation of prosodic parameters for neutral to expressive speech conversion. The prosodic parameters like instantaneous pitch, duration and strength of excitation are used as the expression dependent parameters. For the expressive speech analysis, refinements to the conventional methods are proposed to accurately estimate the prosodic parameters from different expressions. The variations in the prosodic parameters for different expressions

Dr. Govind D. and Prasanna, S. R. M., “Expressive speech synthesis using prosodic modification and dynamic time warping”, NCC 2009, pp. 285-289, 2009.

Abstract: This work proposes a method for synthesizing expressive speech from given neutral speech. The neutral speech is processed by linear prediction (LP) analysis to extract LP coefficients (LPCs) and the LP residual. The LP residual is subjected to prosodic modification using the pitch, duration and amplitude parameters of the target expression. The LPCs of the neutral speech are replaced with those of the target expression using dynamic time warping (DTW). The speech synthesized using the prosody modified LP residual and the replaced LPCs sounds like the target expression speech. This is also confirmed by waveform, spectrogram and objective measures.

Soman K. P., Peter, R., Dr. Govind D., and Sathian, S. P., “Simplified Framework for Designing Biorthogonal and Orthogonal Wavelets”, International Journal of Recent Trends in Engineering, vol. 13, 2009.

Abstract: We initially discuss a new and simple method of parameterization of compactly supported biorthogonal wavelet systems with more than one vanishing moment. To this end, we express both primal and dual scaling function filters (low pass) as products of two Laurent polynomials. The first factor ensures the required vanishing moments, and the second factor is parameterized and adjusted to provide the required length and other low pass filter requirements. We then impose double shift orthogonality conditions on the resulting two sets of filter coefficients, which make them ‘perfect reconstruction’ filters. This modification avoids the use of Diophantine equations and the associated spectral factorization method [1,2,3,4] for its derivation. The method is then modified for the parametric and non-parametric orthogonal cases, which includes the derivation of Daubechies filters.

Publication Type: Conference Proceedings


Dr. Soman K. P., B. Gowri, G., and Dr. Govind D., “Improved Epoch Extraction from Telephonic Speech Signals using Chebfun and Zero Frequency Filtering”, INTERSPEECH 2018, Hyderabad, India, 2018 (accepted for publication).

I. Chandra Yadav, Shahnawazuddin, S., Dr. Govind D., and Pradhan, G., “Spectral Smoothing by Variational Mode Decomposition and its Effect on Noise and Pitch Robustness of ASR System”, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018, Vancouver, Canada, 2018.

Abstract: A novel front-end speech parameterization technique that is robust towards ambient noise and pitch variations is proposed in this paper. In the proposed technique, the short-time magnitude spectrum obtained by the discrete Fourier transform is first decomposed into several components using variational mode decomposition (VMD). For sufficiently smoothing the spectrum, the higher-order components are discarded. The smoothed spectrum is then obtained by reconstructing the spectrum using the first two modes only. The Mel-frequency cepstral coefficients computed using the VMD-based smoothed spectra are observed to be less affected by ambient noise and pitch variations. To validate this, an automatic speech recognition system is developed on clean speech from adult speakers and evaluated under noisy test conditions. Furthermore, experimental evaluations are also performed on another test set consisting of speech data from children, to simulate large pitch differences. The experimental evaluations as well as the signal domain analyses presented in this paper support these claims.

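A short sketch of the smoothing step as described in the abstract: each frame's magnitude spectrum is treated as a one-dimensional signal over frequency, decomposed with VMD, and rebuilt from its two lowest center-frequency modes before Mel filtering. It reuses the vmdpy package assumed earlier; K and alpha are illustrative values.

    import numpy as np
    from vmdpy import VMD  # third-party VMD implementation, assumed installed

    def smooth_spectrum(frame, n_fft=512, K=4):
        """Smooth one frame's magnitude spectrum by keeping the first two VMD modes."""
        mag = np.abs(np.fft.fft(frame, n_fft))[:n_fft // 2]   # even-length half spectrum
        modes, _, omega = VMD(mag, alpha=2000, tau=0.0, K=K, DC=1, init=1, tol=1e-7)
        # Keep the two modes with the lowest final center frequencies; the discarded
        # higher-order modes carry the pitch harmonics and noise ripple.
        order = np.argsort(omega[-1])
        smooth = modes[order[:2]].sum(axis=0)
        return np.clip(smooth, 0.0, None)                     # magnitudes stay non-negative
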
Dr. Govind D., Pravena, D., and Ajay, G., “Improved Epoch Extraction Using Variational Mode Decomposition Based Spectral Smoothing of Zero Frequency Filtered Emotive Speech Signals”, National Conference on Communications (NCC) 2018, Indian Institute of Technology Hyderabad, 2018 (accepted).

M. Aiswarya, Pravena, D., and Dr. Govind D., “Identifying Issues in Estimating Parameters from Speech Under Lombard Effect”, International Symposium on Signal Processing and Intelligent Recognition Systems, vol. 678, Springer, Cham, pp. 252-262, 2018.

Abstract: The Lombard effect (LE) is the phenomenon in which a person tends to speak louder in the presence of loud noise, due to the obstruction of the self-auditory feedback. The main objective of this work is to develop a dataset for the study of LE on speech parameters. The proposed dataset, comprising 230 utterances each from 10 speakers, consists of simultaneous recordings of speech and the ElectroGlottoGram (EGG) for speech under LE as well as for neutral speech recorded in a noise free condition. The speech under LE is recorded at 5 different levels (30 dB, 15 dB, 5 dB, 0 dB and −20 dB) of babble noise. The level of LE in the developed dataset is demonstrated by comparing (a) the source parameters, (b) speaker recognition rates and (c) epoch extraction performance. For the comparison of source parameters like pitch and strength of excitation (SoE), the neutral speech and the speech under LE are compared; based on this comparison, higher pitch and lower SoE are observed for the speech under LE. Also, lower recognition performance is observed when a Mel frequency cepstral coefficient (MFCC) - Gaussian mixture model (GMM) based speaker recognition system, built using the neutral speech, is tested with the speech under LE obtained from the same set of speakers. Finally, on the basis of the comparison of epoch extraction from neutral speech and speech under LE, the utterances with LE are observed to have higher epoch deviation than the neutral speech. All these experiments confirm the level of LE in the prepared database and also reinforce the issues in processing speech under LE for different speech processing tasks.

M. Srikanth, Pravena, D., and Dr. Govind D., “Tamil Speech Emotion Recognition Using Deep Belief Network (DBN)”, International Symposium on Signal Processing and Intelligent Recognition Systems, pp. 319-327, 2018.

Abstract: The proposed system shows the effectiveness of the deep belief network (DBN) over the Gaussian mixture model (GMM). The proposed GMM-DBN system is developed by modeling a GMM for each emotion independently, using Mel frequency cepstral coefficient (MFCC) features extracted from speech. The minimum distance between the distribution of features for each utterance and each emotion model is derived as a bag of acoustic features (BoF) and plotted as a histogram, in which the count represents the number of feature distributions that are close to each emotion model. The BoF is passed into a DBN for developing the trained models. The effectiveness of emotion recognition using the DBN is empirically observed by increasing the number of restricted Boltzmann machine (RBM) layers and further by tuning the available parameters. Testing the classical German speech emotion database (EmoDb) with the proposed GMM-DBN system gives a performance increase of 5% over the conventional MFCC-GMM system. Further testing of the proposed system on the recently developed simulated speech emotion database for the Tamil language gives a comparable result for emotion recognition.

D. Pravena, Dr. Govind D., Pradeep, D., and Ajay, S. G., “Exploring the Significance of Low Frequency Regions in Electroglottographic Signals for Emotion Recognition”, International Symposium on Signal Processing and Intelligent Recognition Systems, 2017.

Abstract: Electroglottographic (EGG) signals are acquired directly from the glottis; hence, EGG signals effectively represent the excitation source part of the human speech production system. Compared to speech signals, EGG signals are smooth and carry perceptually relevant emotional information. The work presented in this paper includes a sequence of experiments conducted on an emotion recognition system developed by Gaussian mixture modeling (GMM) of perceptually motivated Mel frequency cepstral coefficient (MFCC) features extracted from the EGG. The conclusions drawn from these experiments are twofold. (1) The 13 static MFCC features showed better emotion recognition performance than the 39 MFCC features with dynamic coefficients (obtained by adding Δ and ΔΔ). (2) Emphasizing the low frequency regions of the EGG, by increasing the number of Mel filters used for MFCC computation, is found to improve the emotion recognition performance for EGG. These experimental results are verified on the EGG data available in the classic German emotional speech database (EmoDb) for four emotions (Anger, Happy, Boredom and Fear) apart from Neutral signals.

Dr. Govind D., Sowmya, V., Sachin, R., and Dr. Soman K. P., “Dependency of Various Color and Intensity Planes on CNN Based Image Classification”, International Symposium on Signal Processing and Intelligent Recognition Systems, 2017.

Abstract: Scene classification systems have become an integral part of computer vision. Recent developments have seen the use of deep scene networks based on convolutional neural networks (CNN), trained using millions of images, to classify scenes into various categories. This paper proposes the use of one such pre-trained network to classify specific scene categories. The pre-trained network is combined with simple classifiers, namely random forest and extra tree classifiers, to classify scenes into 8 different scene categories. The effect of different color spaces such as RGB, YCbCr, CIEL*a*b* and HSV on the performance of the proposed CNN based scene classification system is analyzed based on the classification accuracy. In addition, various intensity planes extracted from the said color spaces, coupled with color-to-gray image conversion techniques such as weighted average and singular value decomposition (SVD), are also taken into consideration, and their effects on the performance of the proposed CNN based scene classification system are analyzed based on the classification accuracy. The experiments are conducted on the standard Oliva Torralba (OT) scene data set, which comprises 8 classes. The analysis of the classification accuracy obtained for the experiments conducted on the OT scene data shows that the different color spaces, the intensity planes extracted from various color spaces and the color-to-gray image conversion techniques do affect the performance of the proposed CNN based scene classification system.

A. Chandran, Pravena, D., and Dr. Govind D., “Development of speech emotion recognition system using deep belief networks in Malayalam language”, 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, Udupi, India, 2017.

Abstract: The goal of this work is to validate the impact of natural elicitation of emotions by the speakers during the development of speech emotion databases for the Malayalam language. The work also proposes a Gaussian mixture model - deep belief network (GMM-DBN) based speech emotion recognition system. To test the effect of emotion elicitation by the speakers, two independent datasets, with emotionally biased and emotionally neutral utterances, are recorded in three emotions (Angry, Happy and Sad) in the Malayalam language. The speech utterances of both datasets are recorded in two sessions. To develop the GMM-DBN systems, a GMM for each emotion is independently developed using Mel frequency cepstral coefficient (MFCC) features, and the distribution of these features for each utterance with respect to each emotion model is derived in terms of histograms with the mean vectors as the frequency bins. The mean of the histograms obtained in this manner from each emotion model is used as a feature to train the DBN. The performance of the proposed GMM-DBN system is evaluated on the developed emotionally biased and emotionally neutral datasets for the Malayalam language. Based on the comparison of the emotion recognition rates obtained, a higher emotion recognition rate is observed for utterances in the emotionally biased dataset, which implies that the use of emotionally biased prompts during recording identifies emotions more effectively. The dependency of the contextual prompts on the language is also observed.

A. Vishakh, Dr. Govind D., and Pravena, D., “Preliminary Studies towards Improving the Isolated Digit Recognition Performance of Dysarthric Speech by Prosodic Analysis”, Proceedings of Symposium of Computer Vision and Internet (VisionNet), Procedia Computer Science, vol. 58, pp. 395–400, 2015.

Abstract: The objective of the present work is to improve the digit recognition performance for speech signals affected by dysarthria. The paper presents preliminary studies performed on the universal access dysarthric speech recognition (UADSR) database. The work presented in the paper is organized in three stages. Firstly, the degradation in the digit recognition performance is demonstrated by testing the dysarthric digits against acoustic models built using digit samples spoken by control speakers. Secondly, prosodic analysis is performed on the dysarthric isolated digits available in the database. Finally, the prosodic parameters of the dysarthric speech are manipulated to match the normal speech used to build the acoustic models. Based on the experiments conducted, the manipulation of duration parameters using the state-of-the-art time-domain pitch synchronous overlap add (TD-PSOLA) method is observed to significantly improve the recognition rates, in contrast to the other prosodic parameters. The improvement in the word recognition rates is also found to be in accordance with the intelligibility of the dysarthric speakers, which proves the significance of using customized prosodic scaling factors according to the intelligibility level of each subject.

Dr. Govind D., Hisham, P. M., and Pravena, D., “A Robust Algorithm for Speech Polarity Detection Using Epochs and Hilbert Phase Information”, Proceedings of Symposium of Computer Vision and Internet (VisionNet), Procedia Computer Science, vol. 58, pp. 524-529, 2015.

Abstract: The aim of the work presented in this paper is to determine the speech polarity using the knowledge of epochs and the cosine phase information derived from the complex analytic representation of the original speech signal. The work is motivated by the observation that the cosine phase of speech around the Hilbert envelope (HE) peaks varies according to the polarity changes. As the HE peaks represent approximate epoch locations, the phase analysis is performed using algorithms which provide better resolution and accuracy of the estimated epochs. In the present work, accurate epoch locations are initially estimated, and only the significant HE peaks in the near vicinity of the epoch locations are selected for phase analysis. The cosine phase of the speech signal is then computed as the ratio of the signal to the HE of speech. The trend in the cosine phase around the selected significant HE peaks is observed to vary according to the speech polarity. The proposed polarity detection algorithm shows better results compared with the state-of-the-art residual skewness based speech polarity detection (RESKEW) method. The improvement in the polarity detection rates thus confirms that significant polarity information is present in the excitation source characteristics around the epoch locations in speech. The polarity detection rates are also found to be less affected by different levels of added noise, which indicates the robustness of the approach against noise. Based on the analysis of mean execution time, the proposed polarity detection algorithm is also confirmed to be 10 times faster than the RESKEW algorithm.

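For contrast with the Hilbert-phase method above, the RESKEW baseline it is compared against can be illustrated in a few lines: the linear prediction residual of voiced speech is asymmetric around zero, and inverting the waveform flips the sign of its skewness. A hedged sketch assuming librosa for the LP analysis; the mapping from skewness sign to polarity should be calibrated on a known-polarity reference.

    import librosa
    from scipy.signal import lfilter
    from scipy.stats import skew

    def detect_polarity(path, sr=16000, order=16):
        """Return a +1/-1 polarity estimate from the skewness of the LP residual."""
        y, _ = librosa.load(path, sr=sr)
        a = librosa.lpc(y, order=order)          # LPC coefficients [1, a1, ..., ap]
        residual = lfilter(a, [1.0], y)          # inverse filtering -> excitation residual
        # Glottal pulses make the residual asymmetric; flipping the waveform
        # flips this skewness, so its sign serves as the polarity cue.
        return 1 if skew(residual) >= 0 else -1
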
T. T. Joy and Dr. Govind D., “Analysis of Segmental Durations and Significance of Dynamic Duration Modification for Emotion Conversion”, International Conference on Speech and Signal Processing (ICSSP 2014), Kollam, Kerala, 2014.

Abstract: The objective of the present work is to demonstrate the need for dynamically incorporating segmental durations for emotion conversion. Emotion conversion is the task of converting speech in one emotion to another. Most of the existing techniques incorporate static variations in the prosodic parameters according to the target emotion to achieve emotion conversion. The present work analyzes the segmental durations of various phonemes in a large emotion speech corpus and demonstrates the dynamic variations in the duration of various phonetic segments across emotions. The CSTR emotional speech corpus, having two emotions (Angry and Happy) other than neutral and 400 utterances per emotion for one speaker, is used as the database for the experimental studies. The segmental durations of the phonemes are statistically obtained by classification and regression tree (CART) modeling of each emotion in the database.

Dr. Govind D., Biju, A. Susan, and Smily, A., “Empirical Selection of Scaling Factors for Prosody Modification Applications”, International Conference on Speech and Signal Processing (ICSSP 2014), Kollam, Kerala, 2014.

Abstract: Prosody modification is the process of manipulating the pitch and duration of given speech. The objective of the present work is to empirically determine the extent to which the prosody of the original speech can be modified without affecting the intelligibility. The intelligibility of the prosody modified speech is estimated from the word error rates obtained by listening to the prosody modified speech. The recorded utterances of phonetically balanced nonsense text materials, generated using a random set of 200 sentences selected from the CMU-Arctic database, form the data set used for the present study. The subjective evaluations resulted in a range of pitch and duration scale factors which can be used for improving the effectiveness of prosody modification without hampering the intelligibility of the original speech.


Dr. Govind D., Prasanna, S. R. M., and Ramesh, K., “Improved method for epoch extraction in high pass filtered speech”, IEEE INDICON 2013, IIT Bombay, Mumbai, 2013.

Abstract: The objective of the present work is to improve the epoch estimation performance in high pass filtered (HPF) speech using the conventional zero frequency filtering (ZFF) approach. The strength of the impulse at zero frequency is significantly attenuated in HPF speech, and hence the ZFF approach shows significant degradation in epoch estimation performance. Since the linear prediction (LP) residual of speech is characterized by sharper impulse discontinuities at the epoch locations compared to the speech waveform, the present work uses the LP residual of HPF speech for epoch estimation using the ZFF method. Gabor filtering of the LP residual is carried out to further increase the strength of the impulses at the epoch locations of the LP residual. The epoch locations are estimated by ZFF of the Gabor filtered LP residual. The performance of the proposed method is better than that of the existing Hilbert envelope based ZFF approach, with improved epoch identification accuracy.

K. Ramesh, Prasanna, S. R. M., and Dr. Govind D., “Detection of Glottal Opening Instants Using Hilbert Envelope”, INTERSPEECH 2013, Lyon, France, pp. 44-48, 2013.

Abstract: The objective of this work is to develop an automatic method for estimating glottal opening instants (GOIs) using the Hilbert envelope (HE). The GOIs are the secondary major excitations, after the glottal closure instants (GCIs), during the production of voiced speech. The HE is defined as the magnitude of the complex time function (CTF) of a given signal. The unipolar property of the HE is exploited for picking the second largest peak present in a given glottal cycle, which is hypothesized as the glottal opening instant (GOI). The electroglottogram (EGG) / speech signal is first passed through the zero frequency filtering (ZFF) method to extract GCIs. With the help of the detected GCIs, the secondary peaks present in the HE of the dEGG / residual are hypothesized as GOIs. The hypothesized GOIs are compared with the secondary peaks estimated from the dEGG / residual, and the GOIs hypothesized by the proposed method show less variance compared to peak picking from the dEGG / residual.

Dr. Santhosh Kumar C., Dr. Govind D., C., N., and Narwaria, M., “Grapheme to Phone Conversion for Hindi”, Oriental COCOSDA, Penang, Malaysia, 2006.

Publication Type: Conference Paper


Sowmya V., Dr. Govind D., and Dr. Soman K. P., “Significance of contrast and structure features for an improved color image classification system”, in 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 2017, pp. 12-14.

Abstract: In general, the three main modules of color image classification systems are color-to-grayscale image conversion, feature extraction and classification. The color-to-grayscale image conversion is an important pre-processing step which must incorporate in the converted grayscale images the significant and discriminative contrast and structure information present in the original color image. The existing techniques for color-to-grayscale image conversion preserve this information in different manners. Hence, the present work analyzes the significant and discriminative contrast and structure information preserved in grayscale images converted using two different decolorization techniques, rgb2gray and singular value decomposition (SVD) based color-to-grayscale image conversion, applied in color image classification systems using the three different proposed features. The three features proposed for color image classification systems are based on combinations of the existing dense SIFT features and the contrast and structure content computed using the color-to-gray structure similarity index (C2G-SSIM) metric.

Sowmya V., Ajay, A., Dr. Govind D., and Dr. Soman K. P., “Improved color scene classification system using deep belief networks and support vector machines”, in 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 2017, pp. 12-14.

Abstract: In general, the three main modules of color scene classification systems are image decolorization, feature extraction and classification. The work presented in this paper focuses on image decolorization and classification in two stages. The first stage, or objective, of this paper is to improve the performance of the color scene classification system using deep belief networks (DBN) and support vector machines (SVM). Therefore, a color scene classification system termed AGMM-DBN-SVM is proposed using the existing feature extraction technique called bags of visual words (BoW), derived from the dense scale-invariant feature transform (SIFT) and adapted Gaussian mixture models (AGMM). The second stage of the presented work is to combine the proposed AGMM-DBN-SVM classification models obtained for the two different image decolorization methods, rgb2gray and singular value decomposition (SVD) based color-to-grayscale image mapping, to significantly increase the performance of the proposed color scene classification system. The effectiveness of the proposed framework is evaluated on the Oliva Torralba (OT) scene dataset containing 8 different classes. The classification rate of the proposed color scene classification system applied on the OT 8 scene dataset is significantly greater than that of the existing benchmark color scene classification system developed using AGMM and SVM.

A. Pooja, Pravena, D., and Dr. Govind D., “Significance of exploring pitch only features for the recognition of spontaneous emotions from speech signals”, in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017.

Abstract: Emotional databases can be classified into spontaneous and simulated emotions. Spontaneous emotions can be identified based on two parameters, (1) arousal and (2) valence, represented in a two dimensional plane. Arousal measures how calming or exciting the information is, whereas valence measures the positive or negative affectivity of the information. The objective of the paper is to predict the arousal and valence values from the speech signal, which in turn provides the spontaneous emotion information. The paper also demonstrates the significance of using pitch contours, obtained from speech signals carrying spontaneous emotion information, as features for predicting the arousal and valence values using deep learning based LSTM models. During testing, pitch contours are extracted from the speech signals and used as features for predicting the arousal and valence values, and hence the emotions. In this work, a spontaneous database, the REmote COLlaborative Affective interaction (RECOLA) database, is used. The arousal and valence values predicted in this work have low RMSE. The effectiveness of using pitch to predict the 2D values of the emotional wheel is also compared on the full blown simulated German emotional database (German EmoDb), and better results were obtained on the simulated database as well.

K. S. G. Krishnan and Dr. Govind D., “Comparison of glottal closure instant estimation algorithms for singing voices in Indian context”, in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017.

Abstract: The glottal closure instant (GCI) is an important feature in many speech processing applications. Many algorithms have been proposed for GCI estimation from speech signals. The objective of the proposed work is to provide a comprehensive analysis of the performance of various GCI estimation algorithms for singing voices in the Indian context. GCI estimation algorithms such as the Dynamic Programming Phase Slope Algorithm (DYPSA), the zero frequency filtering based technique (ZFF), Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and a modified version of the ZFF technique (modified ZFF) have been used for the analysis. The accuracy and reliability of the GCI estimation algorithms are analyzed for different regions based on pitch variation, rapid pitch variation, transitions of laryngeal mechanisms and singing styles in Indian classical music. The modified ZFF shows significantly better performance than the other GCI estimation algorithms. The robustness of the algorithms to noise is also discussed.

D. Mudatkar, S., A., and Dr. Govind D., “Robust pitch estimation in distant speech signals collected from vehicle”, in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017.

Abstract: Due to significant signal attenuation, speech signals collected at different distances show degradation in the estimation of speech parameters. The work presented in this paper therefore proposes an alternate method for improving F0 parameter estimation from distant speech (DS) signals collected through microphones at various distances. The proposed method achieves improved F0 estimation by adding a post-processing module to the conventional zero frequency filtering (ZFF) method used for F0 estimation. The conventional ZFF method estimates the F0 parameters from the impulse-like discontinuities at the glottal closure instants (GCIs). Since the strength of the impulses at the GCIs is heavily attenuated in DS signals, F0 estimation from these signals becomes unreliable and inaccurate. Hence, the present work proposes an alternate method for enhancing the impulse strength in DS signals by applying one more level of zero frequency filtering to the conventional zero frequency filtered signal over fixed windows of 25 ms length. The smoothed zero frequency filtered signal windows are further refined by low pass filtering, using the locally estimated F0 as the cutoff frequency. The effectiveness of the proposed refined ZFF method for F0 estimation is confirmed by a comparative performance analysis with the other existing zero frequency post-filtering approaches on the state-of-the-art SPEECON database. The performance of the proposed method is evaluated based on the gross error rates obtained by comparing the F0 values estimated from the DS signals with those of the ground truth close-speaking speech signals.

D. Pravena and Dr. Govind D., “Expressive Speech Analysis for Epoch Extraction Using Zero Frequency Filtering Approach”, in Proc. IEEE Tech Symposium, IIT Kharagpur, 2016.

Abstract: The present work discusses the issues of epoch extraction from expressive speech signals. Epochs represent the accurate glottal closure instants in voiced speech, which in turn give the accurate instants of maximum excitation of the vocal tract. Even though there are many existing methods for epoch extraction which provide near perfect epoch estimation from clean or neutral speech, these methods show a significant drop in epoch extraction performance for expressive speech signals. The occurrence of uncontrolled and rapid pitch variations in expressive speech signals causes degradation in the epoch extraction performance. The objective of the present work is to improve the epoch extraction performance for speech signals with various perceptually distinct expressions, compared to neutral speech, using the zero frequency filtering (ZFF) approach. In order to capture the rapid and uncontrolled variations in expressive speech utterances, trend removal is performed on short segments (25 ms) of the output obtained from the cascade of three zero frequency resonators (ZFR). The epoch estimation performance of the proposed method is compared with the conventional ZFF method, the existing refined ZFF method proposed for expressive speech and the recently proposed zero band filtering (ZBF) approach. The effectiveness of the approach is confirmed by the improved epoch identification rate and the reduced miss and false alarm rates compared with those of the existing methods.

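The identification rate and the miss and false alarm rates quoted above follow the usual larynx-cycle convention: each reference cycle should contain exactly one estimated epoch. A small sketch of that evaluation, where the cycle boundaries (midpoints between reference epochs) are the only assumption:

    import numpy as np

    def epoch_metrics(ref, est):
        """ref, est: sorted arrays of reference and estimated epoch sample indices.
        Returns identification, miss and false-alarm rates over reference cycles."""
        ident = miss = false_alarm = 0
        # Each larynx cycle spans the midpoints around one reference epoch.
        bounds = (ref[:-1] + ref[1:]) / 2.0
        for i, r in enumerate(ref):
            lo = bounds[i - 1] if i > 0 else -np.inf
            hi = bounds[i] if i < len(bounds) else np.inf
            n = np.sum((est > lo) & (est <= hi))   # estimates falling in this cycle
            if n == 1:
                ident += 1
            elif n == 0:
                miss += 1
            else:
                false_alarm += 1
        total = len(ref)
        return ident / total, miss / total, false_alarm / total
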
D. Pravena, Nandakumar, S., and Dr. Govind D., “Significance of Natural Elicitation in Developing Simulated Full Blown Speech Emotion Databases”, in Proc. IEEE Tech Symposium, IIT Kharagpur, 2016.

Abstract: The work presented in this paper investigates the significance of natural elicitation of emotions during the development of simulated full blown emotion speech databases for emotion analysis. A subset of primary emotions, anger, happy and sad, along with neutral utterances, is used in the present work. The first part of the work discusses the development of a simulated full blown emotion database by selecting 50 emotionally biased prompts for recording the emotional speech data in the Tamil language. For the comparative study, another simulated emotion database is developed by recording 50 neutral utterances from the same speakers. The second part of the work is the comparison of the emotion recognition performance on the two simulated emotion speech databases using a basic Gaussian mixture model (GMM) based system with mel frequency cepstral coefficients (MFCC). Significant variations in the recognition rates of different emotions are observed for both databases, with the emotionally biased utterances observed to be more effective in discriminating emotions than the emotionally neutral utterances. The emotion recognition rates obtained for the simulated emotionally neutral utterances also follow the same trend as those of the classical German full blown simulated emotion database.

Dr. Govind D., Hisham, M., and Pravena, D., “Effectiveness of polarity detection for improved epoch extraction from speech”, in Proc. 2016 22nd National Conference on Communication (NCC 2016), 2016.

Abstract: The objective of the present work is to demonstrate the significance of speech polarity detection in improving the accuracy of the estimated epochs in speech. The paper also proposes a method to extract the speech polarity information using the properties of the Hilbert transform. The Hilbert transform of the speech is computed as the imaginary part of the complex analytic signal representation of the original speech. The Hilbert envelope (HE) is then computed as the magnitude of the analytic signal. The average slopes of the signal amplitudes of the speech and of the Hilbert transform of the speech around the peaks in the HE are observed to vary in accordance with the polarity of the speech signal. The effectiveness of the proposed approach is confirmed by performance evaluation over 7 voices of the phonetically balanced CMU-Arctic database and the German emotional speech database. The performance of the proposed approach is also observed to be comparable with that of existing algorithms such as residual skewness based polarity detection and Hilbert phase based speech polarity detection. Finally, a significant improvement in the identification accuracy of the epochs estimated using the popular zero frequency filtering (ZFF) method is demonstrated as an application of speech polarity detection.

2015

Conference Paper

Dr. Govind D., Vishnu, R., and Pravena, D., “Improved Method for Epoch Estimation in Telephonic Speech Signals Using Zero Frequency Filtering”, in International Conference on signal and image processing applications (ICSIPA), 2015.[Abstract]


Epochs are the locations correspond to glottal closure instants for voiced speech segments and onset of bursts or frication in unvoiced segments. In the recent years, the zero frequency filtering (ZFF) based epoch estimation has received a growing attention for clean or studio speech signals. The ZFF based epoch estimation exploits the impulse like excitation characteristics at the zero frequency (DC) region in speech. As the lower frequency regions in telephonic speech are significantly attenuated, ZFF approach gives degraded epoch estimation performance. Therefore, the objective of the present work is to propose refinements to the existing ZFF based epoch estimation algorithm for improved epoch estimation in telephonic speech. The strength of the impulses at the zero frequency region are enhanced by computing the Hilbert envelope (HE) of the speech which in turn improve the epoch estimation performance. The resonators located at the approximate F0 locations of the short term blocks of conventional zero frequency filtered signal, are also found to improve the epoch estimation performance in telephonic speech. The performance of the refined ZFF method is evaluated on 3 speaker voices (JMK, SLT and BDL) of CMU Arctic database having simultaneous speech and EGG recordings. The telephonic version of CMU Arctic database is simulated using tools provided by the international telecommunication union (ITU).

More »»
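Several entries on this page build on zero frequency filtering, so a minimal ZFF epoch extractor may help fix ideas. The sketch below follows the standard formulation (a cascade of two zero-frequency resonators followed by moving-average trend removal); the default pitch value and the number of trend-removal passes are illustrative assumptions.

```python
# Minimal zero frequency filtering (ZFF) epoch extractor, following the
# standard formulation; parameters are illustrative.
import numpy as np
from scipy.signal import lfilter

def zff_epochs(x, sr, avg_pitch_hz=120.0, trend_passes=2):
    x = np.diff(x, prepend=x[0])               # difference to remove DC offset
    # Two cascaded zero-frequency resonators: y[n] = x[n] + 2y[n-1] - y[n-2]
    y = lfilter([1.0], [1.0, -2.0, 1.0], x)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    win = int(sr / avg_pitch_hz)               # window ~ one average pitch period
    kernel = np.ones(2 * win + 1) / (2 * win + 1)
    for _ in range(trend_passes):              # successive mean subtraction
        y = y - np.convolve(y, kernel, mode="same")
    # Positive-going zero crossings of the trend-removed signal are epochs
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
```

In the spirit of the refinement proposed in this paper, one would feed the Hilbert envelope of the telephonic speech into such a filter rather than the raw waveform, so that the attenuated low-frequency band is reinforced before filtering.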

2015

Conference Paper

R. Surya, Ashwini, R., Pravena, D., and Dr. Govind D., “Issues in the Formant Analysis of Emotive Speech Using Vowel-like Region Onset Points”, in Proceedings of the International Symposium on Intelligent Systems Technologies and Applications (ISTA), 2015.

2015

Conference Paper

P. M. Hisham, Pravena, D., Pardhu, Y., Gokul, V., Abhitej, B., and Dr. Govind D., “Improved Phone Recognition Using Excitation Source Features”, in Proceedings of the International Symposium on Intelligent Systems Technologies and Applications (ISTA), 2015.

2015

Conference Paper

B. Deepak and Dr. Govind D., “Significance of implementing polarity detection circuits in audio preamplifiers”, in 2015 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2015, SCMS Group of Institutions, Prathap Nagar, Muttom, Aluva, Kochi (Ernakulam), Kerala, India, 2015.[Abstract]


The reversal of current directions in audio circuit elements causes polarity inversion of the acquired audio signal with respect to the reference input signal. The objective of the work presented in this paper is to implement a simple polarity detection circuit in audio preamplifiers which provides an indication of signal polarity inversion. The present work also demonstrates the possibilities of polarity inversion in the audio circuits of audio data acquisition devices. Inputs fed to the inverting/noninverting terminals of audio operational amplifiers (Op-Amps) cause polarity reversal of the amplitude values of speech/audio signals. Even though polarity inversion in audio circuits is perceptually indistinguishable, it leads to inaccurate values of speech parameters estimated by processing the speech. The work presented in this paper discusses how polarity inversion is introduced at the circuit level and proposes a polarity detection circuit which provides an indication of polarity reversal after preamplification. The effectiveness of the proposed polarity detection circuit is confirmed by a 100% polarity detection rate for 100 randomly selected audio files of the CMU-Arctic database when simulated using Proteus 8.0. The paper concludes by discussing the significance of a VLSI implementation of the proposed polarity detection circuit in commonly used audio preamplifier systems. © 2015 IEEE. More »»

2014

Conference Paper

Dr. Govind D., Biju, A. S., and Smily, A., “Automatic speech polarity detection using phase information from complex analytic signal representations”, in 2014 International Conference on Signal Processing and Communications, SPCOM 2014, Indian Institute of Science, Bangalore, India, 2014.[Abstract]


The objective of the present work is to propose an automatic polarity detection algorithm for speech or electro-glottogram (EGG) signals using the phase information obtained from their complex analytic signals. The analytic signal (sa(n)) is the complex-time representation of a given signal derived using the Hilbert transform. The polarity of the signal is determined from the nature of the slope in the cosine phase of sa(n) corresponding to the peaks in the magnitude of sa(n) (the Hilbert envelope). The effectiveness of the proposed algorithm is evaluated for speech and EGG utterances of the CMU-Arctic database and the German emotional speech database (Emo-DB). Also, the performance of the proposed method is found to be comparable with the recently proposed polarity detection algorithm based on residual excitation skewness. © 2014 IEEE. More »»
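A compact sketch of the cosine-phase rule is given below; deciding by the sign of the mean cosine-phase slope at HE peaks is a simplified reading of the method, not its exact formulation.

```python
# Sketch: polarity from the slope of the cosine phase of the analytic
# signal at Hilbert envelope peaks (simplified decision rule).
import numpy as np
from scipy.signal import hilbert, find_peaks

def polarity_from_cos_phase(x, sr):
    sa = hilbert(x)                            # analytic signal sa(n)
    he = np.abs(sa)                            # Hilbert envelope
    cos_phase = np.real(sa) / np.maximum(he, 1e-8)
    peaks, _ = find_peaks(he, distance=int(0.005 * sr), height=0.3 * he.max())
    slope = np.gradient(cos_phase)
    return 1 if np.mean(slope[peaks]) >= 0 else -1
```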

2014

Conference Paper

N. Adiga, Dr. Govind D., and Prasanna, S. R. M., “Significance of epoch identification accuracy for prosody modification”, in 2014 International Conference on Signal Processing and Communications, SPCOM 2014, Indian Institute of Science, Bangalore, India, 2014.[Abstract]


Epochs refer to the instants of significant excitation in speech [1]. Prosody modification is the process of manipulating the pitch and duration of speech by fixed or dynamic modification factors. In epoch based prosody modification, the prosodic features of the speech signal are modified by anchoring around the epoch locations in speech. The objective of the present work is to demonstrate the significance of epoch identification accuracy for prosody modification. Epoch identification accuracy is defined as the standard deviation of the identification timing error between the estimated epochs and the reference epochs. Initially, the epoch locations of the original speech are randomly perturbed by arbitrary time factors and the corresponding prosody modified speech is generated. The perceptual quality of the prosody modified speech is evaluated from mean opinion scores (MOS) and an objective measure. The issues in the prosody modification of telephonic speech signals are also presented. © 2014 IEEE. More »»
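The two evaluation measures named in this abstract can be stated in a few lines. The sketch below uses a simplified nearest-neighbour matching with an assumed 1 ms tolerance; published evaluations typically match within one larynx cycle instead.

```python
# Sketch of epoch evaluation: identification rate (fraction of reference
# epochs matched) and identification accuracy (std. dev. of timing error).
import numpy as np

def epoch_metrics(est, ref, sr, tol_ms=1.0):
    est, ref = np.asarray(est), np.asarray(ref)
    if est.size == 0 or ref.size == 0:
        return 0.0, float("nan")
    errors = []
    for r in ref:
        err_ms = (est[np.argmin(np.abs(est - r))] - r) * 1000.0 / sr
        if abs(err_ms) <= tol_ms:
            errors.append(err_ms)
    rate = len(errors) / len(ref)              # identification rate
    accuracy = float(np.std(errors)) if errors else float("nan")
    return rate, accuracy
```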

2013

Conference Paper

S. R. M. Prasanna and Dr. Govind D., “Unified pitch markers generation method for pitch and duration modification”, in 2013 National Conference on Communications (NCC), 2013.[Abstract]


This paper proposes a modified pitch markers generation method that can be used for both pitch and duration modification. Except for changing some input parameters, the method remains common for both. The original pitch markers, modification and scaling factors are the inputs to the method. The modified pitch markers are the output, generated according to the given modification and scaling factors, thus providing a simplified and modular approach for pitch and duration modification. The proposed method is illustrated for both static and dynamic pitch and duration modification cases. The experimental results indicate that the method can be used without any modification and with equal ease in both cases.

More »»
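As a rough illustration of what a unified markers generator does, the sketch below derives modified pitch markers from the original ones: pitch scaling shrinks or stretches each pitch period, and duration scaling resamples the period sequence. This is a hypothetical simplification, not the paper's method.

```python
# Illustrative unified pitch-marker generation: the same routine handles
# pitch and duration modification via two scaling factors.
import numpy as np

def modify_pitch_markers(markers, pitch_factor=1.0, dur_factor=1.0):
    markers = np.asarray(markers, dtype=float)
    periods = np.diff(markers) / pitch_factor         # scale each pitch period
    # Resample the period sequence to change the overall duration
    n_new = max(1, int(round(len(periods) * dur_factor)))
    idx = np.linspace(0, len(periods) - 1, n_new)
    new_periods = np.interp(idx, np.arange(len(periods)), periods)
    return (markers[0] + np.concatenate(([0.0], np.cumsum(new_periods)))).astype(int)
```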

2012

Conference Paper

Dr. Govind D. and Prasanna, S. R. M., “Epoch extraction from emotional speech”, in 2012 International Conference on Signal Processing and Communications (SPCOM), 2012.[Abstract]


This work proposes a modified zero frequency filtering (ZFF) method for epoch extraction from emotional speech. Epochs refer to the instants of maximum excitation of the vocal tract. In the conventional ZFF method, the epochs are estimated by trend removing the output of the zero frequency resonator (ZFR) using a window length equal to the average pitch period of the utterance. Use of this fixed window length for epoch estimation causes spurious or missed estimates for speech signals having rapid pitch variations, as in emotional speech. This work therefore proposes a refined ZFF method for epoch estimation by trend removing the output of the ZFR using variable windows, obtained by finding the average pitch period for every fixed block of speech, and low pass filtering the resulting trend removed signal segments using the estimated pitch as the cutoff frequency. The epoch estimation performance is evaluated for five different emotions in the German emotional speech corpus having simultaneous electro-glottograph (EGG) recordings. The improved epoch estimation performance indicates the robustness of the proposed method against rapid pitch variations in emotional speech signals. The effectiveness of the proposed method is also confirmed by the improved epoch estimation performance on the Hindi emotional speech database. More »»
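The variable-window refinement can be sketched as block-wise trend removal, where each block's window length is tied to its own average pitch period. The block length and the fallback pitch below are illustrative assumptions; a real system would plug in an actual per-block pitch estimator.

```python
# Block-wise trend removal for ZFF: each block uses a mean-subtraction
# window matched to its own estimated average pitch period.
import numpy as np

def blockwise_trend_removal(y, sr, block_s=0.5, pitch_estimator=None):
    out = np.zeros_like(y)
    hop = int(block_s * sr)
    for start in range(0, len(y), hop):
        block = y[start:start + hop]
        # Per-block average pitch; 120 Hz is only a fallback here
        f0 = pitch_estimator(block, sr) if pitch_estimator else 120.0
        win = max(1, int(sr / f0))
        kernel = np.ones(2 * win + 1) / (2 * win + 1)
        out[start:start + hop] = block - np.convolve(block, kernel, mode="same")
    return out
```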

2012

Conference Paper

Dr. Govind D., Prasanna, S. R. Mahadeva, and Yegnanarayana, B., “Significance of Glottal Activity Detection for Duration Modification”, in Speech Prosody 2012, 2012.[Abstract]


The objective of the present work is to demonstrate the significance of glottal activity (GA) detection for duration modification. The accurate GA regions of the speech are derived using the zero frequency filtered signal (ZFFS) obtained from zero frequency filtering (ZFF) of the speech. The duration of the speech is modified according to the desired scaling factors using the epochs estimated by the ZFF method. Initially, the duration modified speech is synthesized using the existing epoch based fast duration modification method by processing all the epochs present in the original speech. The final duration modified speech is derived by retaining the duration modified speech samples in the GA regions and the original speech samples in the non-GA regions. The improved perceptual quality of the duration modified speech is confirmed from waveforms, spectrograms and subjective evaluations. More »»
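Glottal activity detection from the ZFF signal can be approximated by simple frame-energy thresholding, as sketched below; the frame size and threshold ratio are assumed values, not those of the paper.

```python
# Sketch: mark frames where the ZFF-signal energy is high as glottal
# activity (GA) regions; thresholds are illustrative.
import numpy as np

def glottal_activity(zffs, sr, frame_ms=20, thresh_ratio=0.1):
    frame = int(sr * frame_ms / 1000)
    n_frames = len(zffs) // frame
    energy = np.array([np.mean(zffs[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    return energy > thresh_ratio * energy.max()   # boolean GA mask per frame
```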

2012

Conference Paper

Dr. Govind D., Mahanta, S., and Prasanna, S. R. Mahadeva, “Significance of Duration in the Prosodic Analysis of Assamese”, in Proceedings of Speech Prosody, 2012.[Abstract]


The objective of the present work is to demonstrate the significance of duration in the context of phonological focus in Assamese. Focus refers to the part of a sentence which expresses assertion, putting more emphasis on the part that introduces new information. The present work considers subject-object-verb (SOV) type declarative sentences in wide, object and subject focus cases for the study. Speech data were collected from native Assamese speakers in all three types of focus. Manual duration analysis was carried out for all the speech data. It was observed that, compared to wide focus, the duration reduces in the object and subject focus cases. Even though the overall duration reduction in object and subject focus is nearly the same, the amount of reduction is different for the subject (S), object (O) and verb (V) parts. The duration modification of wide focus speech according to the duration modification factors of either object or subject focus confirms that duration indeed influences the realization of focus. More »»

2012

Conference Paper

Dr. Govind D., Sarmah, P., and Prasanna, S. R. Mahadeva, “Role of pitch slope and duration in synthesized Mizo tones”, in Speech Prosody 2012, 2012.[Abstract]


This paper reports the results of an attempt to synthesize the lexical tones of the Mizo language. Firstly, the study reported in this paper attempts to confirm the findings of previous acoustic studies on Mizo tones. Secondly, using the parameters defined in the previous acoustic studies, the work reported in this paper synthesized Mizo tones and then confirmed the acceptability of the synthesized tones from native speakers of Mizo. The work reported in this paper confirms that (a) mean fundamental frequency (F0) alone cannot be a parameter to recognize Mizo tones; (b) mean F0 and tone slope (Fd) information integrated into synthesized Mizo tones elicit better identification and acceptance and (c) durational information is important for correct identification of rising tones in Mizo. More »»

2011

Conference Paper

Dr. Govind D., Prasanna, S. R. Mahadeva, and Yegnanarayana, B., “Neutral to Target Emotion Conversion Using Source and Suprasegmental Information”, in INTERSPEECH, Florence, Italy, 2011.[Abstract]


This work uses instantaneous pitch and strength of excitation, along with the duration of syllable-like units, as the parameters for emotion conversion. The instantaneous pitch and duration of the syllable-like units of the neutral speech are modified by prosody modification of its linear prediction (LP) residual using the instants of significant excitation. The strength of excitation is modified by scaling the Hilbert envelope (HE) of the LP residual. The target emotion speech is then synthesized using the prosody and strength modified LP residual. The pitch, duration and strength modification factors for emotion conversion are derived using the syllable-like units of the initial, middle and final regions of an emotion speech database having different speakers, texts and emotions. The effectiveness of the region-wise modification of source and suprasegmental features over gross level modification is confirmed by waveforms, spectrograms and subjective evaluations. More »»
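One step of this pipeline, region-wise scaling of the Hilbert envelope of the LP residual, is sketched below. The LP analysis via librosa.lpc and the three region factors are assumptions for illustration; the paper derives its modification factors from an emotion database.

```python
# Sketch: modify excitation strength by scaling the Hilbert envelope of
# the LP residual region-wise (initial/middle/final), then resynthesize.
import numpy as np
import librosa
from scipy.signal import hilbert, lfilter

def regionwise_strength(speech, order=12, factors=(1.2, 1.0, 0.8)):
    a = librosa.lpc(speech.astype(float), order=order)  # LP coefficients
    residual = lfilter(a, [1.0], speech)                # inverse filter -> LP residual
    he = np.abs(hilbert(residual))
    carrier = residual / np.maximum(he, 1e-8)           # unit-envelope carrier
    he_mod = he.copy()
    n = len(residual)
    bounds = [0, n // 3, 2 * n // 3, n]
    for (s, e), f in zip(zip(bounds[:-1], bounds[1:]), factors):
        he_mod[s:e] *= f                                # scale strength per region
    return lfilter([1.0], a, carrier * he_mod)          # resynthesize speech
```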

2011

Conference Paper

Dr. Govind D., Prasanna, S. R. Mahadeva, and Pati, D., “Epoch Extraction in High Pass Filtered Speech Using Hilbert Envelope”, in INTERSPEECH, Florence, Italy, 2011.[Abstract]


The Hilbert envelope (HE) is defined as the magnitude of the analytic signal. This work proposes a HE based zero frequency filtering (ZFF) approach for the extraction of epochs in high pass filtered speech. Epochs in speech correspond to instants of significant excitation such as glottal closure instants. The ZFF method for epoch extraction is based on the signal energy around the impulse at zero frequency, which is significantly attenuated in the case of high pass filtered speech. The low frequency nature of the HE reinforces the signal energy around the impulse at zero frequency. This work therefore processes the HE of high pass filtered speech, or of its residual, by zero frequency filtering for epoch extraction. The proposed approach shows significant improvement in performance for high pass filtered speech compared to the conventional ZFF of speech.

More »»
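The proposed pipeline reduces to two steps, sketched below with a ZFF epoch extractor (such as the zff_epochs sketch earlier on this page) passed in as a parameter: take the HE of the high pass filtered speech, then run ZFF on the low-frequency envelope.

```python
# Pipeline sketch: the Hilbert envelope restores zero-frequency energy
# that high pass filtering removed, after which ordinary ZFF applies.
import numpy as np
from scipy.signal import hilbert

def epochs_from_highpass(speech_hp, sr, zff_epochs):
    he = np.abs(hilbert(speech_hp))        # low-frequency Hilbert envelope
    return zff_epochs(he - np.mean(he), sr)
```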

2010

Conference Paper

S. R. Mahadeva Prasanna and Dr. Govind D., “Analysis of excitation source information in emotional speech”, in INTERSPEECH, 2010.[Abstract]


The objective of this work is to analyze the effect of emotions on the excitation source of speech production. The neutral, angry, happy, boredom and fear emotions are considered for the study. Initially, the electroglottogram (EGG) and its derivative signals are compared across different emotions. The mean, standard deviation and contour of the instantaneous pitch and strength of excitation parameters are derived by processing the derivative of the EGG, and also the speech, using the zero-frequency filtering (ZFF) approach. The comparative study of these features across different emotions reveals that the effect of emotions on the excitation source is distinct and significant. The comparative study of the parameters from the derivative of the EGG and from the speech waveform indicates that both have the same trend and range, implying that either may be used. Use of the computed parameters is found to be effective in the prosodic modification task. Index Terms: source, emotion, pitch, strength.

More »»
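The pitch parameters compared in this study follow directly from epoch locations; a small helper makes this concrete (the epoch source, e.g. ZFF of the dEGG, is assumed to be available).

```python
# Instantaneous pitch statistics from epoch locations: one F0 value per
# glottal cycle, plus its mean and standard deviation.
import numpy as np

def pitch_stats(epochs, sr):
    f0 = sr / np.diff(np.asarray(epochs, dtype=float))
    return {"mean_f0": float(np.mean(f0)),
            "std_f0": float(np.std(f0)),
            "contour": f0}
```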

2010

Conference Paper

S. R. M. Prasanna, Dr. Govind D., K Rao, S., and Yegnanarayana, B., “Fast prosody modification using instants of significant excitation”, in Proc Speech Prosody, Chicago, USA, 2010.[Abstract]


The objective of this work is to propose a fast method for prosody modification using the instants of significant excitation. The proposed method is significantly faster than the existing method based on finding the instants using group-delay and using the LP residual for incorporating the desired prosody features. This is achieved by (i) using the zero frequency filtering (ZFF) method for finding the instants of significant excitation instead of group-delay, and (ii) direct manipulation of the speech waveform rather than the Linear Prediction (LP) residual. Subjective studies indicate that the modified speech is of good quality with minimum distortion. More »»
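The direct-waveform idea can be illustrated by epoch-anchored duration modification that duplicates or drops whole pitch cycles, as below. This is a deliberate simplification for illustration; the paper's method also covers pitch modification and smoothing at cycle boundaries.

```python
# Sketch: epoch-anchored duration modification by resampling the sequence
# of pitch cycles delimited by consecutive epochs.
import numpy as np

def modify_duration(speech, epochs, dur_factor):
    epochs = np.asarray(epochs)
    cycles = [speech[s:e] for s, e in zip(epochs[:-1], epochs[1:])]
    n_new = max(1, int(round(len(cycles) * dur_factor)))
    idx = np.round(np.linspace(0, len(cycles) - 1, n_new)).astype(int)
    return np.concatenate([cycles[i] for i in idx])
```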