Model-Based Classification of Emotional Speech Using Non-Linear Dynamics Features
Subject Areas : electrical and computer engineering
A. Harimi, A. Ahmadyfard, A. Shahzadi, K. Yaghmaie
Abstract :
Recent developments in interactive and robotic systems have motivated research on recognizing human emotion from speech. This study classifies emotional speech signals using a two-stage classifier based on the arousal-valence emotion model. Samples are first classified by arousal level using conventional prosodic and spectral features. Valence-related emotions are then classified using the proposed non-linear dynamics features (NLDs). NLDs are extracted from the geometrical properties of the reconstructed phase space of the speech signal: four descriptor contours represent these properties, and the discrete cosine transform (DCT) compresses the information in each contour into a set of low-order coefficients. The significant DCT coefficients of the descriptor contours form the proposed NLDs. The classification accuracy of the proposed system was evaluated using 10-fold cross-validation on the Berlin database. Average recognition rates of 96.35% and 87.18% were achieved for female and male speakers, respectively. Taking the total number of male and female samples into account, the overall recognition rate of the proposed speech emotion recognition system is 92.34%.
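The NLD extraction chain described in the abstract (phase space reconstruction, descriptor contour, DCT compression) can be illustrated with a minimal sketch. The abstract does not define the four descriptor contours, the embedding dimension, the delay, or the number of retained coefficients, so everything specific here is an assumption made for illustration: the function names, the parameters `dim`, `tau`, `n_bins`, and `n_coeffs`, and in particular the single "radial" contour, which is a hypothetical stand-in and not one of the authors' descriptors.

```python
import numpy as np


def delay_embed(signal, dim=3, tau=1):
    """Time-delay embedding: row i is [s[i], s[i+tau], ..., s[i+(dim-1)*tau]]."""
    signal = np.asarray(signal, dtype=float)
    n_points = len(signal) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("signal too short for this dim and tau")
    return np.stack(
        [signal[k * tau : k * tau + n_points] for k in range(dim)], axis=1
    )


def radial_contour(points, n_bins=32):
    """Hypothetical descriptor (an assumption, not the paper's definition):
    mean distance from the centroid over consecutive trajectory segments."""
    if len(points) < n_bins:
        raise ValueError("need at least n_bins trajectory points")
    radii = np.linalg.norm(points - points.mean(axis=0), axis=1)
    return np.array([seg.mean() for seg in np.array_split(radii, n_bins)])


def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi * (n + 0.5) * k / N)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    idx = np.arange(n)
    basis = np.cos(np.pi * np.outer(idx, idx + 0.5) / n)
    return basis @ x


def nld_features(signal, dim=3, tau=8, n_bins=32, n_coeffs=8):
    """Embed the signal, build one contour, keep the low-order DCT coefficients."""
    contour = radial_contour(delay_embed(signal, dim, tau), n_bins)
    return dct2(contour)[:n_coeffs]


if __name__ == "__main__":
    # Synthetic stand-in for a voiced speech frame (illustrative only).
    t = np.arange(2000) / 16000.0
    frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 450 * t)
    print(nld_features(frame).shape)
```

In a full system along the lines of the abstract, a feature vector like this would be computed for each of the four contours and concatenated before the valence-stage classifier; the choice of `dim` and `tau` would normally be tuned to the data rather than fixed as above.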