Qualification: Ph.D.
Email: d_govind@cb.amrita.edu

Dr. Govind D. joined the Center for Computational Engineering & Networking at Amrita School of Engineering, Amritanagar, Coimbatore, as Assistant Professor in August 2012. He completed his PhD at the Indian Institute of Technology Guwahati, and his core area of research is speech signal processing. He worked as Project Lead in the UK-India Education and Research Initiative project (2007-2011) titled “Study of source features for speech synthesis and speaker recognition” between IIT Guwahati and the University of Edinburgh. He is currently the investigator of the ongoing DST-sponsored project titled “Analysis, Processing and Synthesis of Emotions in Speech” at Amrita Vishwa Vidyapeetham, Coimbatore. He has more than 25 research publications in reputed conferences and journals, including prestigious speech processing venues such as INTERSPEECH and Speech Prosody organized by the International Speech Communication Association (ISCA).

Achievements

Dr. Govind secured an outstanding grade of "AS" in the MHRD-sponsored Global Initiative of Academic Networks (GIAN) course on "Advanced Sinusoidal Modelling of Speech and Applications", organized at the Indian Institute of Technology Guwahati in December 2016. The course was conducted by Prof. Yannis Stylianou of the University of Crete, Greece.

Publications

Publication Type: Journal Article


2017

Journal Article

D. Pravena and Dr. Govind D., “Development of simulated emotion speech database for excitation source analysis”, International Journal of Speech Technology, pp. 1-12, 2017.[Abstract]


The work presented in this paper is focused on the development of a simulated emotion database, particularly for excitation source analysis. The presence of simultaneous electroglottogram (EGG) recordings for each emotion utterance helps to accurately analyze the variations in the source parameters according to different emotions. The paper describes the development of a comparatively large simulated emotion database for three emotions (Anger, Happy and Sad) along with neutrally spoken utterances in three languages (Tamil, Malayalam and Indian English). Emotion utterances in each language are recorded from 10 speakers, in multiple sessions for Tamil and Malayalam. Unlike the existing simulated emotion databases, emotionally biased utterances are used for recording instead of emotionally neutral utterances. Based on the emotion recognition experiments, the emotions elicited from emotionally biased utterances are found to show more emotion discrimination than those from emotionally neutral utterances. Also, based on the comparative experimental analysis, the speech and EGG utterances of the proposed simulated emotion database are found to preserve the general trend in the excitation source characteristics (instantaneous F0 and strength of excitation parameters) for different emotions, as in the classical German emotion speech-EGG database (EmoDb). Finally, the emotion recognition rates obtained for the proposed speech-EGG emotion database using the conventional mel frequency cepstral coefficients and Gaussian mixture model based emotion recognition system are found to be comparable with those of the existing German (EmoDb) and IITKGP-SESC Telugu speech emotion databases. © 2017 Springer Science+Business Media New York

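A conventional MFCC and GMM based emotion recognition system of the kind referred to above is typically built by training one Gaussian mixture model per emotion on MFCC frames and assigning a test utterance to the emotion whose model yields the highest average log-likelihood. The following is a minimal sketch of such a baseline, assuming librosa for MFCC extraction and scikit-learn for the GMMs; the parameter choices (13 coefficients, 16 diagonal-covariance mixtures) are illustrative assumptions, not the configuration used in the paper.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    # (n_frames, n_mfcc) matrix of MFCCs for one utterance
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_emotion_gmms(train_files, n_components=16):
    # train_files: dict mapping emotion label -> list of wav paths (hypothetical layout)
    models = {}
    for emotion, paths in train_files.items():
        feats = np.vstack([mfcc_frames(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', max_iter=200)
        models[emotion] = gmm.fit(feats)
    return models

def classify(wav_path, models):
    # pick the emotion whose GMM gives the highest average frame log-likelihood
    feats = mfcc_frames(wav_path)
    return max(models, key=lambda emo: models[emo].score(feats))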

2016

Journal Article

R. Surya, Ashwini, R., Pravena, D., and Dr. Govind D., “Issues in formant analysis of emotive speech using vowel-like region onset points”, Advances in Intelligent Systems and Computing, vol. 384, pp. 139-146, 2016.[Abstract]


Emotions carry crucial extra-linguistic information in speech. This paper presents a preliminary study on the significance and issues of processing emotive speech anchored around vowel-like region onset points (VLROPs). The onset of each vowel-like region (VLR) in a speech signal is termed the VLROP. VLROPs are estimated by exploiting the impulse-like characteristics in the excitation components of speech signals. The work also identifies the issue of falsified estimation of VLROPs in emotional speech. Despite the falsely estimated VLROPs, formant-based vocal tract characteristics are analyzed around the correctly estimated VLROPs from the emotional speech. The VLROPs retained for the emotion analysis are selected from those syllables which have uniquely estimated VLROPs without false detection, from each emotion of the same text and speaker. Based on the formant analysis performed around the VLROPs, there are significant variations in the locations of the formant frequencies for the emotion utterances with respect to neutral speech utterances. The paper presents a formant frequency analysis performed on 20 syllables selected from 10 texts and 10 speakers across 4 emotions (Anger, Happy, Fear and Boredom) and neutral speech signals of the German emotion speech database. The experiments suggest, firstly, the need for devising a new robust VLROP estimation method for emotional speech and, secondly, the need for further exploring formant characteristics for emotion speech analysis. © Springer International Publishing Switzerland 2016.
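
The formant analysis around VLROPs described above requires formant-frequency estimates from short voiced frames. One generic way to obtain formant candidates, sketched below, is to fit a linear prediction model to a windowed frame and convert the angles of the complex LPC roots into frequencies; this is a textbook illustration under assumed settings (LP order 12, simple bandwidth pruning), not the specific formant analysis procedure used in the paper.

import numpy as np
import librosa

def formant_candidates(frame, sr, lpc_order=12, min_hz=90, max_bw_hz=400):
    # return sorted formant-frequency candidates (Hz) for one voiced frame
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))  # taper the frame
    a = librosa.lpc(frame, order=lpc_order)           # LPC polynomial A(z)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                 # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)        # pole angle -> frequency
    bws = -(sr / np.pi) * np.log(np.abs(roots))       # pole radius -> bandwidth
    keep = (freqs > min_hz) & (bws < max_bw_hz)       # discard implausible poles
    return np.sort(freqs[keep])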

2015

Journal Article

Dr. Govind D. and Joy, T. T., “Improving the Flexibility of Dynamic Prosody Modification Using Instants of Significant Excitation”, International Journal of Circuits, Systems, and Signal Processing, pp. 1-26, 2015.[Abstract]


Modification of supra-segmental features such as the pitch and duration of the original speech by fixed scaling factors is referred to as static prosody modification. In dynamic prosody modification, the prosodic scaling factors (time-varying modification factors) are defined for all the pitch cycles present in the original speech. The present work is focused on improving the naturalness of the prosody modified speech by reducing the generation of piecewise constant segments in the modified pitch contour. The prosody modification is performed by anchoring around the accurate instants of significant excitation estimated from the original speech. The division of longer pitch intervals into many equal intervals over long speech segments introduces step-like discontinuities in the form of piecewise constant segments in the modified pitch contours. The effectiveness of the proposed dynamic modification method is initially confirmed from the smooth modified pitch contour plots obtained for finer static prosody scaling factors, waveforms, spectrogram plots and comparative subjective evaluations. Also, the average F0 jitter computed from the pitch segments of each glottal activity region in the modified speech is proposed as an objective measure for prosody modification. The naturalness of the prosody modified speech using the proposed method is objectively and subjectively compared with that of the existing zero frequency filtered signal-based dynamic prosody modification. Also, the proposed algorithm effectively preserves the dynamics of the prosodic patterns in singing voices, wherein the F0 parameters fluctuate rapidly and continuously within a higher F0 range.

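The average F0 jitter proposed above as an objective measure is derived from the pitch periods (epoch intervals) within each glottal activity region. A minimal sketch of one standard jitter definition, the mean absolute difference between consecutive pitch periods normalised by the mean period, is given below; the exact formulation used in the paper may differ, so treat this as an assumption.

import numpy as np

def relative_jitter(epochs_sec):
    # epochs_sec: increasing epoch instants (seconds) within one glottal activity region
    periods = np.diff(np.asarray(epochs_sec, dtype=float))
    if len(periods) < 2:
        return 0.0
    # mean absolute difference of consecutive pitch periods, normalised by mean period
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)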

2013

Journal Article

D. Pravena and Dr. Govind D., “Expressive Speech Synthesis: A Review”, International Journal of Speech Technology, vol. 16, pp. 237–260, 2013.[Abstract]


The objective of the present work is to provide a detailed review of expressive speech synthesis (ESS). Among the various approaches to ESS, the present paper focuses on the development of ESS systems by explicit control. In this approach, ESS is achieved by modifying the parameters of the neutral speech synthesized from the text. The paper reviews the works addressing various issues related to the development of ESS systems by explicit control: the various approaches to text-to-speech synthesis, studies on the analysis and estimation of expressive parameters, and studies on methods to incorporate expressive parameters. Finally, the review is concluded by mentioning the scope of future work for ESS by explicit control.

2013

Journal Article

Dr. Govind D. and Prasanna, S. R. Mahadeva, “Dynamic prosody modification using zero frequency filtered signal”, International Journal of Speech Technology, vol. 16, pp. 41–54, 2013.[Abstract]


Modifying prosody parameters like pitch, duration and strength of excitation by a desired factor is termed prosody modification. The objective of this work is to develop a dynamic prosody modification method based on the zero frequency filtered signal (ZFFS), a byproduct of zero frequency filtering (ZFF). The existing epoch based prosody modification techniques use epochs as pitch markers, and the required prosody modification is achieved by interpolation of the epoch intervals plot. Alternatively, this work proposes a method for prosody modification by resampling the ZFFS. The existing epoch based prosody modification method is also further refined to modify the prosodic parameters at every epoch level, thus providing more flexibility for prosody modification. The general framework for deriving the modified epoch locations can also be used for obtaining dynamic prosody modification from the existing PSOLA and epoch based prosody modification methods. The quality of the prosody modified speech is evaluated using waveforms, spectrograms and subjective studies. The usefulness of the proposed dynamic prosody modification is demonstrated for the neutral to emotional conversion task. The subjective evaluations performed for the emotion conversion indicate the effectiveness of dynamic prosody modification over fixed prosody modification for emotion conversion. The dynamic prosody modified speech files synthesized using the proposed, epoch based and TD-PSOLA methods are available at http://www.iitg.ac.in/eee/emstlab/demos/demo5.php.



2013

Journal Article

Dr. Govind D., “Epoch based dynamic prosody modification for neutral to expressive conversion”, Indian Institute of Technology Guwahati, 2013.[Abstract]


The objective of this thesis is to address the issues in the analysis, estimation and incorporation of prosodic parameters for neutral to expressive speech conversion. The prosodic parameters like instantaneous pitch, duration and strength of excitation are used as the expression dependent parameters. For the expressive speech analysis, refinements in the conventional methods are proposed to accurately estimate the prosodic parameters from different expressions. The variations in the prosodic parameters for different expressions


2009

Journal Article

Dr. Govind D. and Prasanna, S. R. M., “Expressive speech synthesis using prosodic modification and dynamic time warping”, NCC 2009, pp. 285 - 289, 2009.[Abstract]


This work proposes a method for synthesizing expressive speech from the given neutral speech. The neutral speech is processed by the Linear Prediction (LP) analysis to extract LP coefficients (LPCs) and LP residual. The LP residual is subjected to prosodic modification using the pitch, duration and amplitude parameters of the target expression. The LPCs of the neutral speech are replaced with that of the target expression using the Dynamic Time Warping (DTW). The synthesized speech using prosody modified LP residual and replaced LPCs sounds like the target expression speech. This can also be observed by the waveform, spectrogram and objective measures.

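The synthesis scheme above rests on standard LP analysis: the LPCs model the vocal-tract envelope, and the LP residual obtained by inverse filtering carries the excitation that is then prosody-modified. A minimal whole-signal sketch of the inverse filtering step is given below, assuming librosa for LPC estimation and scipy for filtering; the paper's frame-wise analysis, DTW-based LPC replacement and prosody modification are not shown.

import librosa
from scipy.signal import lfilter

def lp_analysis(wav_path, sr=16000, lpc_order=10):
    y, _ = librosa.load(wav_path, sr=sr)
    a = librosa.lpc(y, order=lpc_order)   # A(z) = 1 + a1 z^-1 + ... + ap z^-p
    residual = lfilter(a, [1.0], y)       # e[n] = A(z) applied to s[n] (inverse filter)
    return a, residual

def lp_synthesis(a, residual):
    # pass a (possibly prosody-modified) residual back through the all-pole filter 1/A(z)
    return lfilter([1.0], a, residual)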

2009

Journal Article

Soman K. P., Peter, R., Dr. Govind D., and Sathian, S. P., “Simplified Framework for Designing Biorthogonal and Orthogonal Wavelets”, International Journal of Recent Trends in Engineering, vol. 13, 2009.[Abstract]


We initially discuss a new and simple method of parameterization of compactly supported biorthogonal wavelet systems with more than one vanishing moment. To this end we express both primal and dual scaling function filters (low pass) as products of two Laurent polynomials. The first factor ensures the required vanishing moments and the second factor is parameterized and adjusted to provide the required length and other low pass filter requirements. We then impose double shift orthogonality conditions on the resulting two sets of filter coefficients that make them ‘Perfect Reconstruction’ filters. This modification avoids the use of Diophantine equations and the associated spectral factorization method [1, 2, 3, 4] for its derivation. The method is then modified for the parametric and non-parametric orthogonal cases, which includes the derivation for Daubechies filters.


Publication Type: Conference Paper


2016

Conference Paper

D. Pravena and Dr. Govind D., “Expressive Speech Analysis for Epoch Extraction Using Zero Frequency Filtering Approach”, in Proc. IEEE Tech Symposium, IIT Kharagpur, 2016.[Abstract]


The present work discusses the issues of epoch extraction from expressive speech signals. Epochs represent the accurate glottal closure instants in voiced speech, which in turn give the accurate instants of maximum excitation of the vocal tract. Even though there are many existing methods for epoch extraction that provide near-perfect epoch estimation from clean or neutral speech, these methods show a significant drop in epoch extraction performance for expressive speech signals. The occurrence of uncontrolled and rapid pitch variations in expressive speech signals causes degradation in the epoch extraction performance. The objective of the present work is to improve the epoch extraction performance for speech signals with various perceptually distinct expressions, compared to neutral speech, using the zero frequency filtering (ZFF) approach. In order to capture the rapid and uncontrolled variations in expressive speech utterances, trend removal is performed on short segments (25 ms) of the output obtained from the cascade of three zero frequency resonators (ZFR). The epoch estimation performance of the proposed method is compared with the conventional ZFF method, the existing refined ZFF method proposed for expressive speech, and the recently proposed zero band filtering (ZBF) approach. The effectiveness of the approach is confirmed by the improved epoch identification rate and reduced miss and false alarm rates compared with those of the existing methods.

2016

Conference Paper

D. Pravena, Nandakumar, S., and Dr. Govind D., “Significance of Natural Elicitation in Developing Simulated Full Blown Speech Emotion Databases”, in Proc. IEEE Tech Symposium, IIT Kharagpur, 2016.[Abstract]


The work presented in this paper investigates the significance of natural elicitation of emotions during the development of simulated full-blown emotion speech databases for emotion analysis. A subset of primary emotions (anger, happy and sad) along with neutral utterances is used in the present work. The first part of the work discusses the development of a simulated full-blown emotion database by selecting 50 emotionally biased prompts for recording the emotional speech data in the Tamil language. For the comparative study, another simulated emotion database is developed by recording emotional speech for 50 neutral utterances from the same speakers. The second part of the work compares the emotion recognition performance of the simulated emotion speech databases using a basic Gaussian mixture model (GMM) based system with mel frequency cepstral coefficients (MFCC). Significant variations in the recognition rates of different emotions are observed for both databases, with the emotionally biased utterances observed to be more effective in discriminating emotions than the emotionally neutral simulated emotion database. Also, the emotion recognition rates obtained for the simulated emotionally neutral utterances follow the same trend as that of the classical German full-blown simulated emotion database.

2016

Conference Paper

Dr. Govind D., Hisham, M., and Pravena, D., “Effectiveness of polarity detection for improved epoch extraction from speech”, in 2016 22nd National Conference on Communication, NCC 2016, 2016.[Abstract]


The objective of the present work is to demonstrate the significance of speech polarity detection in improving the accuracy of the estimated epochs in speech. The paper also proposes a method to extract the speech polarity information using the properties of the Hilbert transform. The Hilbert transform of the speech is computed as the imaginary part of the complex analytic signal representation of the original speech. The Hilbert envelope (HE) is then computed as the magnitude of the analytic signal. The average slope of the signal amplitudes of the speech and the Hilbert transform of the speech around the peaks in the HE is observed to vary in accordance with the polarity of the speech signal. The effectiveness of the proposed approach is confirmed by the performance evaluation over 7 voices of the phonetically balanced CMU-Arctic database and the German emotional speech database. The performance of the proposed approach is also observed to be comparable with that of existing algorithms such as residual skewness based polarity detection and Hilbert phase based speech polarity detection. Finally, a significant improvement in the identification accuracies of the estimated epochs in speech using the popular zero frequency filtering (ZFF) method is demonstrated as an application of speech polarity detection. © 2016 IEEE.

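The signal-level ingredients described above, namely the analytic signal, its magnitude (the Hilbert envelope, HE) and the behaviour of the speech waveform around HE peaks, can be assembled into a rough polarity check. The sketch below only approximates the paper's decision rule: it averages the local slope of the speech signal just before each prominent HE peak and reads the polarity from the sign of that average; the peak-picking thresholds and window lengths are assumptions.

import numpy as np
from scipy.signal import hilbert, find_peaks

def detect_polarity(speech, sr):
    analytic = hilbert(speech)                    # complex analytic signal
    he = np.abs(analytic)                         # Hilbert envelope
    # prominent HE peaks roughly mark strong excitation instants
    peaks, _ = find_peaks(he, height=0.3 * he.max(), distance=int(0.002 * sr))
    win = int(0.001 * sr)                         # 1 ms window before each peak
    slopes = []
    for p in peaks:
        if p >= win:
            seg = speech[p - win:p + 1]
            slopes.append(np.polyfit(np.arange(len(seg)), seg, 1)[0])
    if not slopes:
        return 'unknown'
    return 'positive' if np.mean(slopes) > 0 else 'negative'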

2015

Conference Paper

Dr. Govind D., Vishnu, R., and Pravena, D., “Improved Method for Epoch Estimation in Telephonic Speech Signals Using Zero Frequency Filtering”, in International Conference on signal and image processing applications (ICSIPA), 2015.[Abstract]


Epochs are the locations corresponding to glottal closure instants in voiced speech segments and to the onset of bursts or frication in unvoiced segments. In recent years, zero frequency filtering (ZFF) based epoch estimation has received growing attention for clean or studio speech signals. The ZFF based epoch estimation exploits the impulse-like excitation characteristics in the zero frequency (DC) region of speech. As the lower frequency regions in telephonic speech are significantly attenuated, the ZFF approach gives degraded epoch estimation performance. Therefore, the objective of the present work is to propose refinements to the existing ZFF based epoch estimation algorithm for improved epoch estimation in telephonic speech. The strength of the impulses in the zero frequency region is enhanced by computing the Hilbert envelope (HE) of the speech, which in turn improves the epoch estimation performance. Resonators located at the approximate F0 locations of the short-term blocks of the conventional zero frequency filtered signal are also found to improve the epoch estimation performance in telephonic speech. The performance of the refined ZFF method is evaluated on 3 speaker voices (JMK, SLT and BDL) of the CMU Arctic database having simultaneous speech and EGG recordings. The telephonic version of the CMU Arctic database is simulated using tools provided by the International Telecommunication Union (ITU).


2015

Conference Paper

R. Surya, Ashwini, R., Pravena, D., and Dr. Govind D., “Issues in the Formant Analysis of Emotive Speech Using Vowel-like Region Onset Points”, in Proceedings of the International Symposium on Intelligent Systems Technologies and Applications (ISTA), 2015.

2015

Conference Paper

P. M. Hisham, Pravena, D., Pardhu, Y., Gokul, V., Abhitej, B., and Dr. Govind D., “Improved Phone Recognition Using Excitation Source Features”, in Proceedings of the International Symposium on Intelligent Systems Technologies and Applications (ISTA), 2015.

2015

Conference Paper

B. Deepak and Dr. Govind D., “Significance of implementing polarity detection circuits in audio preamplifiers”, in 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI 2015), SCMS Group of Institutions, Corporate Office Campus, Prathap Nagar, Muttom, Aluva, Kochi (Ernakulam), Kerala, India, 2015.[Abstract]


The reversal of the current directions in audio circuit elements causes polarity inversion of the acquired audio signal with respect to the reference input signal. The objective of the work presented in this paper is to implement a simple polarity detection circuit in audio preamplifiers which provides an indication of signal polarity inversion. The present work also demonstrates the possibilities of polarity inversion in the audio circuits of audio data acquisition devices. Inputs fed to the inverting/non-inverting terminals of audio operational amplifiers (Op-Amps) cause polarity reversal of the amplitude values of the speech/audio signals. Even though polarity inversion in audio circuits is perceptually indistinguishable, it yields inaccurate values for speech parameters estimated by processing the speech. The work presented in this paper discusses how polarity inversion is introduced at the circuit level and proposes a polarity detection circuit which provides an indication of polarity reversal after preamplification. The effectiveness of the proposed polarity detection circuit is confirmed by a 100% polarity detection rate for 100 randomly selected audio files of the CMU-Arctic database when simulated using Proteus 8.0. The paper is concluded by discussing the significance of a VLSI implementation of the proposed polarity detection circuit in the most commonly used audio preamplifier systems. © 2015 IEEE.

2014

Conference Paper

Dr. Govind D., Biju, A. S., and Smily, A., “Automatic speech polarity detection using phase information from complex analytic signal representations”, in 2014 International Conference on Signal Processing and Communications (SPCOM 2014), Indian Institute of Science, Bangalore, India, 2014.[Abstract]


The objective of the present work is to propose an automatic polarity detection algorithm for speech or electro-glottogram (EGG) signals using the phase information obtained from the complex analytic signal. The analytic signal (sa(n)) is the complex time representation of the given signal derived using the Hilbert transform. The polarity of the signal is determined from the nature of the slope in the cosine phase of sa(n) corresponding to the peaks in the magnitude of sa(n) (Hilbert envelope). The effectiveness of the proposed algorithm is evaluated for speech and EGG utterances of the CMU-Arctic database and the German emotional speech database (Emo-DB). Also, the performance of the proposed method is found to be comparable with the recently proposed polarity detection algorithm based on residual excitation skewness. © 2014 IEEE.

2014

Conference Paper

N. Adiga, Dr. Govind D., and Prasanna, S. R. M., “Significance of epoch identification accuracy for prosody modification”, in 2014 International Conference on Signal Processing and Communications (SPCOM 2014), Indian Institute of Science, Bangalore, India, 2014.[Abstract]


Epoch refers to the instant of significant excitation in speech [1]. Prosody modification is the process of manipulating the pitch and duration of speech by fixed or dynamic modification factors. In epoch based prosody modification, the prosodic features of the speech signal are modified by anchoring around the epoch locations in speech. The objective of the present work is to demonstrate the significance of epoch identification accuracy for prosody modification. Epoch identification accuracy is defined as the standard deviation of the timing error between the estimated and reference epochs. Initially, the epoch locations of the original speech are randomly perturbed by arbitrary time factors and the corresponding prosody modified speech is generated. The perceptual quality of the prosody modified speech is evaluated from mean opinion scores (MOS) and an objective measure. The issues in the prosody modification of telephonic speech signals are also presented. © 2014 IEEE.
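
The evaluation quantities mentioned here, namely identification rate, misses, false alarms and identification accuracy (the standard deviation of the timing error), can be computed by matching each reference epoch to the estimated epochs within a small tolerance window. A minimal sketch under assumed conventions is given below; the ±1 ms tolerance is an illustration, and published evaluations often use the larynx cycle around each reference epoch instead.

import numpy as np

def epoch_metrics(ref_epochs, est_epochs, tol=0.001):
    ref = np.asarray(ref_epochs, dtype=float)
    est = np.asarray(est_epochs, dtype=float)
    errors, identified, missed = [], 0, 0
    matched = set()
    for r in ref:
        hits = np.where(np.abs(est - r) <= tol)[0]
        if len(hits) == 1:                        # exactly one estimate nearby
            identified += 1
            errors.append(est[hits[0]] - r)
            matched.add(int(hits[0]))
        elif len(hits) == 0:
            missed += 1
        # multiple hits count as neither identified nor missed in this sketch
    return {
        'identification_rate': identified / len(ref),
        'miss_rate': missed / len(ref),
        'false_alarm_rate': (len(est) - len(matched)) / max(len(est), 1),
        'identification_accuracy_std': float(np.std(errors)) if errors else None,
    }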

2013

Conference Paper

S. R. M. Prasanna and Dr. Govind D., “Unified pitch markers generation method for pitch and duration modification”, in Communications (NCC), 2013 National Conference on, 2013.[Abstract]


This paper proposes a modified pitch markers generation method that can be used for both pitch and duration modification. Except for changing some input parameters, the method remains common to both. The original pitch markers, modification and scaling factors are the inputs to the method. The modified pitch markers are the output, generated according to the given modification and scaling factors, thus providing a simplified and modular approach for pitch and duration modification. The proposed method is illustrated for both static and dynamic pitch and duration modification cases. The experimental results indicate that the method can be used without any modification and with equal ease in both cases.


2012

Conference Paper

Dr. Govind D. and Prasanna, S. R. M., “Epoch extraction from emotional speech”, in Signal Processing and Communications (SPCOM), 2012 International Conference on, 2012.[Abstract]


This work proposes a modified zero frequency filtering (ZFF) method for epoch extraction from emotional speech. Epochs refer to the instants of maximum excitation of the vocal tract. In the conventional ZFF method, the epochs are estimated by trend removing the output of the zero frequency resonator (ZFR) using a window length equal to the average pitch period of the utterance. Use of this fixed window length for epoch estimation causes spurious or missed estimates for speech signals having rapid pitch variations, as in emotional speech. This work therefore proposes a refined ZFF method for epoch estimation by trend removing the output of the ZFR using variable windows obtained by finding the average pitch period for every fixed block of speech, and low pass filtering the resulting trend removed signal segments using the estimated pitch as the cutoff frequency. The epoch estimation performance is evaluated for five different emotions in the German emotional speech corpus having simultaneous electro-glottograph (EGG) recordings. The improved epoch estimation performance indicates the robustness of the proposed method against rapid pitch variations in emotional speech signals. The effectiveness of the proposed method is also confirmed by the improved epoch estimation performance on the Hindi emotional speech database.
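
The conventional ZFF pipeline summarised above (difference the speech, pass it through a cascade of zero-frequency resonators, remove the slowly varying trend with a window close to the average pitch period, and read the epochs from the positive zero crossings) can be sketched compactly as below. The fixed trend-removal window supplied by the caller is exactly the quantity the proposed refinement replaces with block-wise windows derived from local pitch estimates; the window length and the number of trend-removal passes here are assumptions.

import numpy as np
from scipy.signal import lfilter

def zff_epochs(speech, sr, avg_pitch_period_s=0.005):
    x = np.diff(speech, prepend=speech[0])             # remove any DC offset
    # cascade of two zero-frequency resonators: y[n] = 2*y[n-1] - y[n-2] + x[n]
    y = lfilter([1.0], [1.0, -2.0, 1.0], x)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # trend removal with a window of about one average pitch period (two passes)
    win = max(3, int(avg_pitch_period_s * sr) | 1)     # odd-length window
    kernel = np.ones(win) / win
    for _ in range(2):
        y = y - np.convolve(y, kernel, mode='same')
    # epochs: positive-going zero crossings of the zero frequency filtered signal
    zc = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
    return zc / sr                                     # epoch instants in seconds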

2012

Conference Paper

Dr. Govind D., Prasanna, S. R. Mahadeva, and Yegnanarayana, B., “Significance of Glottal Activity Detection for Duration Modification”, in Speech Prosody 2012, 2012.[Abstract]


The objective of the present work is to demonstrate the significance of glottal activity (GA) detection for duration modification. The accurate GA regions of the speech are derived using the zero frequency filtered signal (ZFFS) obtained from zero frequency filtering (ZFF) of speech. The duration of the speech is modified according to the desired scaling factors from the epochs estimated using the ZFF method. Initially, the duration modified speech is synthesized using the existing epoch based fast duration modification method by processing all the epochs present in the original speech. The final duration modified speech is derived by retaining the duration modified speech samples in the GA regions and the original speech samples in the non-GA regions. The improved perceptual quality of the duration modified speech is confirmed from the waveforms, spectrograms and subjective evaluations.

2012

Conference Paper

Dr. Govind D., Mahanta, S., and Prasanna, S. R. Mahadeva, “Significance of Duration in the Prosodic Analysis of Assamese”, in Proceedings of Speech Prosody, 2012.[Abstract]


The objective of the present work is to demonstrate the significance of duration in the context of phonological focus in Assamese. Focus refers to that part of a sentence which expresses assertion, putting more emphasis on the part of the sentence that introduces new information. The present work considers subject-object-verb (SOV) type declarative sentences in wide, object and subject focus cases for the study. Speech data was collected from native Assamese speakers for all three types of focus. Manual duration analysis was carried out for all the speech data. It was observed that, compared to wide focus, the duration reduces in the object and subject focus cases. Even though the overall duration reduction in object and subject focus is nearly the same, the amount of reduction is different for the subject (S), object (O) and verb (V) parts. The duration modification of wide focus speech according to the duration modification factors of either object or subject focus confirms that duration indeed influences the realization of focus.

2012

Conference Paper

Dr. Govind D., Sarmah, P., and Prasanna, S. R. Mahadeva, “Role of pitch slope and duration in synthesized Mizo tones”, in Speech Prosody 2012, 2012.[Abstract]


This paper reports the results of an attempt to synthesize the lexical tones of the Mizo language. Firstly, the study reported in this paper attempts to confirm the findings of previous acoustic studies on Mizo tones. Secondly, using the parameters defined in the previous acoustic studies, the work reported in this paper synthesized Mizo tones and then confirmed the acceptability of the synthesized tones from native speakers of Mizo. The work reported in this paper confirms that (a) mean fundamental frequency (F0) alone cannot be a parameter to recognize Mizo tones; (b) mean F0 and tone slope (Fd) information integrated into synthesized Mizo tones elicit better identification and acceptance and (c) durational information is important for correct identification of rising tones in Mizo.

2011

Conference Paper

Dr. Govind D., Prasanna, S. R. Mahadeva, and Yegnanarayana, B., “Neutral to Target Emotion Conversion Using Source and Suprasegmental Information.”, in Interspeech, Florence, Italy, 2011.[Abstract]


This work uses instantaneous pitch and strength of excitation, along with the duration of syllable-like units, as the parameters for emotion conversion. The instantaneous pitch and duration of the syllable-like units of the neutral speech are modified by prosody modification of its linear prediction (LP) residual using the instants of significant excitation. The strength of excitation is modified by scaling the Hilbert envelope (HE) of the LP residual. The target emotion speech is then synthesized using the prosody and strength modified LP residual. The pitch, duration and strength modification factors for emotion conversion are derived using the syllable-like units of the initial, middle and final regions from an emotion speech database having different speakers, texts and emotions. The effectiveness of the region-wise modification of source and suprasegmental features over gross-level modification is confirmed by the waveforms, spectrograms and subjective evaluations.

2011

Conference Paper

Dr. Govind D., Prasanna, S. R. Mahadeva, and Pati, D., “Epoch Extraction in High Pass Filtered Speech Using Hilbert Envelope”, in INTERSPEECH, Florence, Italy, 2011.[Abstract]


The Hilbert envelope (HE) is defined as the magnitude of the analytic signal. This work proposes an HE based zero frequency filtering (ZFF) approach for the extraction of epochs in high pass filtered speech. Epochs in speech correspond to instants of significant excitation like glottal closure instants. The ZFF method for epoch extraction is based on the signal energy around the impulse at zero frequency, which is significantly attenuated in the case of high pass filtered speech. The low frequency nature of the HE reinforces the signal energy around the impulse at zero frequency. This work therefore processes the HE of high pass filtered speech, or of its residual, by zero frequency filtering for epoch extraction. The proposed approach shows a significant improvement in performance for high pass filtered speech compared to the conventional ZFF of speech.


2010

Conference Paper

S. R. Mahadeva Prasanna and Dr. Govind D., “Analysis of excitation source information in emotional speech.”, in INTERSPEECH, 2010.[Abstract]


The objective of this work is to analyze the effect of emotions on the excitation source of speech production. The neutral, angry, happy, boredom and fear emotions are considered for the study. Initially the electroglottogram (EGG) and its derivative signals are compared across different emotions. The mean, standard deviation and contour of the instantaneous pitch and strength of excitation parameters are derived by processing the derivative of the EGG and also the speech using the zero-frequency filtering (ZFF) approach. The comparative study of these features across different emotions reveals that the effect of emotions on the excitation source is distinct and significant. The comparative study of the parameters from the derivative of the EGG and from the speech waveform indicates that both cases have the same trend and range, inferring that either may be used. Use of the computed parameters is found to be effective in the prosodic modification task.


2010

Conference Paper

S. R. M. Prasanna, Dr. Govind D., Rao, K. S., and Yegnanarayana, B., “Fast prosody modification using instants of significant excitation”, in Proc. Speech Prosody, Chicago, USA, 2010.[Abstract]


The objective of this work is to propose a fast method for prosody modification using the instants of significant excitation. The proposed method is significantly faster than the existing method based on finding the instants using group-delay and using the LP residual for incorporating the desired prosody features. This is achieved by (i) using the zero frequency filtering (ZFF) method for finding the instants of significant excitation instead of group-delay, and (ii) direct manipulation of the speech waveform rather than the Linear Prediction (LP) residual. Subjective studies indicate that the modified speech is of good quality with minimum distortion.

Publication Type: Conference Proceedings


2015

Conference Proceedings

A. Vishakh, Dr. Govind D., and Pravena, D., “Preliminary Studies towards Improving the Isolated Digit Recognition Performance of Dysarthric Speech by Prosodic Analysis”, Proceedings of Symposium of Computer vision and internet (VisionNet), Procedia Computer Science, vol. 58. pp. 395–400, 2015.[Abstract]


The objective of the present work is to improve the digit recognition performance for speech signals affected by dysarthria. The paper presents preliminary studies performed on the universal access dysarthric speech recognition (UADSR) database. The work presented in the paper is organized into three stages. Firstly, the degradation in the digit recognition performance is demonstrated by testing the dysarthric digits against acoustic models built using digit samples spoken by control speakers. Secondly, prosodic analysis is performed on the dysarthric isolated digits available in the database. Finally, the prosodic parameters of the dysarthric speech are manipulated to match the normal speech used to build the acoustic models. Based on the experiments conducted, the manipulation of duration parameters using the state-of-the-art time-domain pitch synchronous overlap add (TD-PSOLA) method is observed to significantly improve the recognition rates, in contrast to the other prosodic parameters. The improvement in the word recognition rates is also found to be in accordance with the intelligibility of the dysarthric speakers, which demonstrates the significance of using customized prosodic scaling factors according to the intelligibility level of each subject.


2015

Conference Proceedings

Dr. Govind D., Hisham, P. M., and Pravena, D., “A Robust Algorithm for Speech Polarity Detection Using Epochs and Hilbert Phase Information”, Proceedings of Symposium of Computer vision and internet (VisionNet), Procedia Computer Science, vol. 58. pp. 524 - 529, 2015.[Abstract]


The aim of the work presented in this paper is to determine the speech polarity using the knowledge of epochs and the cosine phase information derived from the complex analytic representation of the original speech signal. The work is motivated by the observation that the cosine phase of speech around the Hilbert envelope (HE) peaks varies according to polarity changes. As the HE peaks represent approximate epoch locations, the phase analysis in the present work is performed using algorithms which provide better resolution and accuracy of the estimated epochs. Accurate epoch locations are initially estimated, and only significant HE peaks from the near vicinity of the epoch locations are selected for phase analysis. The cosine phase of the speech signal is then computed as the ratio of the signal to the HE of the speech. The trend in the cosine phase around the selected significant HE peaks is observed to vary according to the speech polarity. The proposed polarity detection algorithm shows better results compared with the state-of-the-art residual skewness based speech polarity detection (RESKEW) method. Thus, the improvement in the polarity detection rates confirms that significant polarity information is present in the excitation source characteristics around epoch locations in speech. The polarity detection rates are also found to be less affected by different levels of added noise, which indicates the robustness of the approach against noise. Also, based on the analysis of mean execution time, the proposed polarity detection algorithm is confirmed to be 10 times faster than the RESKEW algorithm.


2014

Conference Proceedings

T. T. Joy and Dr. Govind D., “Analysis of Segmental Durations and Significance of Dynamic Duration Modification for Emotion Conversion”, International Conference on Speech and Signal processing (ICSSP 2014). Kollam, Kerala, 2014.[Abstract]


The objective of the present work is to demonstrate the need for dynamically incorporating segmental durations for emotion conversion. Emotion conversion is the task of converting speech in one emotion to another. Most of the existing techniques incorporate static variations in the prosodic parameters according to the target emotion to achieve emotion conversion. The present work analyzes the segmental durations of various phonemes in a large emotion speech corpus and demonstrates the dynamic variations in the duration of various phonetic segments across emotions. The CSTR emotional speech corpus, having two emotions (Angry and Happy) other than neutral and 400 utterances per emotion for one speaker, is used as the database for the experimental studies. The segmental durations of the phonemes are statistically obtained by classification and regression tree (CART) modeling of each emotion in the database.

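The CART modelling of segmental durations mentioned above can be illustrated with a regression tree over categorical descriptors of each segment. The sketch below uses scikit-learn's DecisionTreeRegressor as a stand-in for CART; the toy feature set (phone identity, position in word, emotion) and the data rows are hypothetical examples, not the features or data used in the study.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# each row: (phone, position_in_word, emotion) -> duration in seconds (toy data)
segments = [('a', 'initial', 'angry',   0.085),
            ('a', 'final',   'happy',   0.120),
            ('k', 'initial', 'neutral', 0.060),
            ('k', 'final',   'angry',   0.070)]

X_raw = [row[:3] for row in segments]
y = np.array([row[3] for row in segments])

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(X_raw)

tree = DecisionTreeRegressor(min_samples_leaf=1).fit(X, y)

# predicted duration for an unseen phone-in-context
print(tree.predict(enc.transform([('a', 'initial', 'happy')])))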

2014

Conference Proceedings

Dr. Govind D., Biju, A. Susan, and Smily, A., “Empirical Selection of Scaling Factors for Prosody Modification Applications”, International Conference on Speech and Signal processing (ICSSP 2014). Kollam, Kerala, 2014.[Abstract]


Prosody modification is the process of manipulating the pitch and duration of a given speech signal. The objective of the present work is to empirically determine the extent to which the prosody of the original speech can be modified without affecting intelligibility. The intelligibility of the prosody modified speech is estimated from the word error rates obtained by listening to the prosody modified speech. The recorded utterances of phonetically balanced nonsense text materials, generated using a random set of 200 sentences selected from the CMU-Arctic database, form the data set used for the present study. The subjective evaluations resulted in a range of pitch and duration scaling factors which can be used for improving the effectiveness of prosody modification without hampering the intelligibility of the original speech.


2013

Conference Proceedings

Dr. Govind D., Prasanna, S. R. M., and Ramesh, K., “Improved method for epoch extraction in high pass filtered speech”, IEEE INDICON 2013. IIT Bombay, Mumbai, 2013.[Abstract]


The objective of the present work is to improve the epoch estimation performance in high pass filtered (HPF) speech using the conventional zero frequency filtering (ZFF) approach. The strength of the impulse at zero frequency is significantly attenuated in HPF speech, and hence the ZFF approach shows significant degradation in epoch estimation performance. Since the linear prediction (LP) residual of speech is characterized by sharper impulse discontinuities at epoch locations compared to the speech waveform, the present work uses the LP residual of HPF speech for epoch estimation using the ZFF method. Gabor filtering of the LP residual is carried out to further increase the strength of the impulses at the epoch locations of the LP residual. The epoch locations are estimated by ZFF of the Gabor filtered LP residual. The performance of the proposed method is better than that of the existing Hilbert envelope based ZFF approach, with improved epoch identification accuracy.


2013

Conference Proceedings

K. Ramesh, Prasanna, S. R. M., and Dr. Govind D., “Detection of Glottal Opening Instants Using Hilbert Envelope”, INTERSPEECH 2013. Lyon, France, pp. 44-48, 2013.[Abstract]


The objective of this work is to develop an automatic method for estimating glottal opening instants (GOIs) using the Hilbert envelope (HE). The GOIs are the secondary major excitations after the glottal closure instants (GCIs) during the production of voiced speech. The HE is defined as the magnitude of the complex time function (CTF) of a given signal. The unipolar property of the HE is exploited for picking the second largest peak present in a given glottal cycle, which is hypothesized as the glottal opening instant (GOI). The electroglottogram (EGG) / speech signal is first passed through the zero frequency filtering (ZFF) method to extract GCIs. With the help of the detected GCIs, the secondary peaks present in the HE of the dEGG / residual are hypothesized as GOIs. The hypothesized GOIs are compared with the secondary peaks estimated from the dEGG / residual. The GOIs hypothesized by the proposed method show less variance compared to peak picking from the dEGG / residual.


2006

Conference Proceedings

Dr. Santhosh Kumar C., Dr. Govind D., C., N., and Narwaria, M., “Grapheme to Phone Conversion for Hindi”, Oriental COCOSDA. Penang, Malaysia, 2006.
