Thesis Information

Title (Chinese):

 融合语音和表情的双模态情感识别技术研究

Author:

 刘娜 (Liu Na)

Student ID:

 20207040035

Confidentiality:

 Public

Thesis language:

 Chinese (chi)

Discipline code:

 081002

Discipline:

 Engineering - Information and Communication Engineering - Signal and Information Processing

Student type:

 Master's

Degree:

 Master of Engineering

Degree year:

 2023

Degree-granting institution:

 西安科技大学 (Xi'an University of Science and Technology)

School:

 College of Communication and Information Engineering

Major:

 Information and Communication Engineering

Research area:

 Signal and Information Processing

First supervisor:

 李国民 (Li Guomin)

First supervisor's institution:

 西安科技大学 (Xi'an University of Science and Technology)

Submission date:

 2023-06-15

Defense date:

 2023-06-06

Title (English):

 Research on Bimodal Emotion Recognition Technology Combining Speech and Expression

Keywords (Chinese):

 情感识别; 希尔伯特变换; 特征融合; 多模态特征选择

Keywords (English):

 Emotion recognition; Hilbert transform; Feature fusion; Multimodal feature selection

Abstract (Chinese):

Emotion recognition is the technology by which a computer infers a person's emotional state from the emotional information extracted from signals. It involves speech processing, image processing, pattern recognition, and several other disciplines, and is widely applied in human-computer interaction, healthcare, transportation, and many other fields, which makes research on emotion recognition highly significant. Existing emotion recognition methods suffer mainly from incomplete feature extraction and from the high dimensionality of fused multimodal features. Targeting these problems, this thesis studies an emotional feature extraction method for the speech modality and an emotional feature selection method for multimodal fusion.

To extract more comprehensive speech emotion features and improve recognition performance, this thesis combines the Hilbert-Huang transform with the short-time Fourier transform in a new speech emotion feature extraction method. The preprocessed speech signal is first decomposed by variational mode decomposition into a series of intrinsic mode components. A Hilbert transform is then applied, and the energy, energy entropy, and mean instantaneous frequency of each component, together with the peak value and peak frequency of the total marginal spectrum, are extracted to form the feature vector of one frame; the vectors of all frames are concatenated in frame order into a multidimensional feature matrix. Next, the residual components left by the modal decomposition are discarded, the dominant components are re-aggregated, and Fourier spectral features are extracted and likewise concatenated frame by frame into a second feature matrix. To capture how the features vary across frames while preserving the inherent characteristics of each component within every frame, the two feature matrices are visualized as feature spectrograms and fused to form the input features of the speech modality; the convolutional part of a pre-trained neural network is then retained as a feature extractor, and a support vector machine finally performs the emotion recognition. Experimental results show that, compared with MFCCs, spectrograms, and Mel spectrograms, the proposed features raise the speech emotion recognition rate by more than 4.7% on the eNTERFACE'05 dataset and by more than 6.6% on the RAVDESS dataset.
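
As one concrete reading of the frame-level Hilbert stage, the sketch below computes the listed quantities for a single frame, assuming the frame has already been split into modes (for example with a VMD implementation such as the vmdpy package). The function name, the entropy definition, and the marginal-spectrum approximation are illustrative assumptions, not the thesis code.

```python
import numpy as np
from scipy.signal import hilbert  # analytic signal for instantaneous amplitude/frequency

def frame_hilbert_features(modes, fs):
    """Hilbert-domain features of one speech frame.

    modes : (K, N) array, intrinsic mode components of the frame
    fs    : sampling rate in Hz
    Returns per-mode energy, an energy entropy, per-mode mean instantaneous
    frequency, and the peak value and peak frequency of an approximate
    total marginal spectrum, as one 1-D vector.
    """
    analytic = hilbert(modes, axis=1)                       # analytic signal of each mode
    amp = np.abs(analytic)                                  # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic), axis=1)
    inst_freq = np.diff(phase, axis=1) * fs / (2 * np.pi)   # instantaneous frequency (Hz)

    energy = np.sum(amp ** 2, axis=1)                       # energy of each mode
    p = energy / (energy.sum() + 1e-12)
    energy_entropy = -np.sum(p * np.log(p + 1e-12))         # entropy of the energy
                                                            # distribution across modes
                                                            # (one simple reading)
    mean_if = inst_freq.mean(axis=1)                        # mean instantaneous frequency

    # Approximate total marginal spectrum: accumulate squared amplitude
    # over frequency bins of the instantaneous frequency.
    bins = np.linspace(0, fs / 2, 257)
    marginal, _ = np.histogram(inst_freq.ravel(), bins=bins,
                               weights=(amp[:, 1:] ** 2).ravel())
    peak = marginal.max()
    peak_freq = bins[marginal.argmax()]

    return np.concatenate([energy, [energy_entropy], mean_if, [peak, peak_freq]])
```

The per-frame vectors returned here would then be stacked in frame order to form the multidimensional feature matrix described above.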

A single-modal emotional signal carries limited emotional information, which keeps recognition accuracy low, whereas effective complementarity of emotional information between modalities can markedly improve recognition. Multimodal fusion, however, inevitably produces high-dimensional emotional features, and the redundancy among them increases the computational burden of the model and degrades recognition performance. To address this problem, this thesis proposes a hybrid feature selection method. Filter-style selection strategies based on the distance correlation coefficient and the Fisher score evaluate each feature from two importance perspectives, relevance and discriminative power, and the feature sets produced by the two strategies are fused with weights to screen out a feature subset of lower dimensionality and better performance. Experimental results show that the hybrid strategy improves recognition to a certain extent while effectively reducing the feature dimensionality: on the eNTERFACE'05 dataset the bimodal recognition rate after selection is 1.3% higher than before and the data dimensionality is reduced by 96.7%; on the RAVDESS dataset the recognition rate improves by 1.0% and the dimensionality is reduced by 99.2%.
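
The two filter criteria are standard definitions; a minimal NumPy sketch of both scores follows. Treating the integer class label as a numeric variable for the distance correlation is a simplifying assumption of this sketch.

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score of each column of X (higher = more discriminative)."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2   # between-class scatter
        den += len(Xc) * Xc.var(axis=0)                # within-class scatter
    return num / (den + 1e-12)

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D variables, in [0, 1]."""
    a = np.abs(x[:, None] - x[None, :])                # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()  # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def dcor_scores(X, y):
    """Distance correlation of every feature with the class labels."""
    yf = y.astype(float)
    return np.array([distance_correlation(X[:, j], yf) for j in range(X.shape[1])])
```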

论文外文摘要:

Emotion recognition is a technology by which a computer infers a person's emotional state by extracting the emotional information contained in a signal. It spans speech processing, image processing, pattern recognition, and other subjects, and is widely used in human-computer interaction, medical care, transportation, and other domains, so research on emotion recognition is of great significance. Existing emotion recognition methods mainly suffer from incomplete feature extraction and the high dimensionality of fused multimodal features. Addressing these problems, this thesis studies an emotional feature extraction method for the speech modality and an emotional feature selection method for multimodal fusion.

To extract more comprehensive speech emotion features and enhance recognition performance, a speech emotion feature extraction method is proposed that combines the Hilbert-Huang transform and the short-time Fourier transform. After variational mode decomposition of the preprocessed speech signal yields a series of intrinsic mode components, the Hilbert transform is applied first, and the energy, energy entropy, and average instantaneous frequency of each component, along with the peak value and peak frequency of the total marginal spectrum, are extracted to form one frame's feature vector; the vectors of all frames are concatenated in frame order into a multidimensional feature matrix. Next, the residual components left by the modal decomposition are removed, the dominant components are re-aggregated, Fourier spectral features are extracted, and a second feature matrix is likewise assembled in frame order. To fully reflect how the features vary between frames while retaining the inherent characteristics of each component within every frame, the two feature matrices are visualized as feature spectrograms and fused as the input features of the speech modality. The convolutional part of a pre-trained network is then retained as the feature extractor, and emotion recognition is finally performed by a support vector machine. Experimental results indicate that, compared with MFCCs, spectrogram, and Mel-spectrogram features, the proposed features improve the speech emotion recognition rate by more than 4.7% on the eNTERFACE'05 dataset and by more than 6.6% on the RAVDESS dataset.
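
A minimal sketch of the transfer-learning stage follows, under stated assumptions: torchvision's pre-trained AlexNet stands in for the unnamed "pre-trained network", the fused feature spectrograms are assumed to exist as RGB image files, and the path and label names in the usage comment are hypothetical.

```python
import numpy as np
import torch
from torchvision import models, transforms
from sklearn.svm import SVC
from PIL import Image

# Convolutional part of a pre-trained CNN as a fixed feature extractor.
# AlexNet is an assumption for illustration; this record does not name the network.
cnn = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def cnn_features(image_path):
    """Flattened convolutional-stage features of one fused feature spectrogram."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = cnn.avgpool(cnn.features(img))   # classifier layers are not used
    return fmap.flatten().numpy()

# Usage with hypothetical path/label lists; the SVM does the final recognition:
#   X = np.stack([cnn_features(p) for p in spectrogram_paths])
#   clf = SVC(kernel="rbf").fit(X, emotion_labels)
```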

A single-modal emotional signal contains limited emotional information, so recognition accuracy stays low, whereas effective complementarity of emotional information between modalities can significantly improve recognition performance. Multimodal fusion, however, inevitably produces high-dimensional features, and the redundancy among them increases the computational burden of the model and reduces recognition performance. To address this problem, a hybrid feature selection method is put forward. Filter-based selection strategies using the Fisher score and the distance correlation coefficient evaluate feature performance from two importance measures, relevance and discriminative power, and a feature subset of lower dimensionality and better performance is selected by weighted fusion of the feature sets produced by the two strategies. Experimental results indicate that the hybrid method improves emotion recognition to a certain extent while effectively reducing feature dimensionality: on the eNTERFACE'05 dataset the bimodal recognition rate after feature selection improves by 1.3% and the data dimensionality is reduced by 96.7%; on the RAVDESS dataset the recognition rate improves by 1.0% and the dimensionality is reduced by 99.2%.
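
Under one common reading of this step, the normalized scores from the two criteria (fisher_score and dcor_scores from the earlier sketch) are fused with weights and the top-k features are kept; the equal weights and the subset size below are illustrative assumptions.

```python
import numpy as np

def minmax(s):
    """Scale a score vector to [0, 1] so the two criteria are comparable."""
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def hybrid_select(X, y, k, w_dcor=0.5, w_fisher=0.5):
    """Weighted fusion of the two filter scores; returns indices of the k best features."""
    fused = w_dcor * minmax(dcor_scores(X, y)) + w_fisher * minmax(fisher_score(X, y))
    return np.argsort(fused)[::-1][:k]

# Example: keep a low-dimensional subset of the fused bimodal features.
#   idx = hybrid_select(X_bimodal, y, k=200)
#   X_reduced = X_bimodal[:, idx]
```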

CLC number:

 TP391    

Open access date:

 2023-06-16    
