Thesis Title (Chinese): | 融合语音和表情的双模态情感识别技术研究 |
Name: | |
Student ID: | 20207040035 |
Confidentiality Level: | Public |
Thesis Language: | Chinese |
Discipline Code: | 081002 |
Discipline: | Engineering - Information and Communication Engineering - Signal and Information Processing |
Student Type: | Master's |
Degree: | Master of Engineering |
Degree Year: | 2023 |
Institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research Area: | Signal and Information Processing |
First Supervisor: | |
First Supervisor's Institution: | |
Submission Date: | 2023-06-15 |
Defense Date: | 2023-06-06 |
Thesis Title (English): | Research on Bimodal Emotion Recognition Technology Combining Speech and Expression |
Keywords (Chinese): | |
Keywords (English): | Emotion recognition; Hilbert transform; Feature fusion; Multimodal feature selection |
Abstract (Chinese): |
Emotion recognition is the technique by which a computer infers a person's emotional state from the emotional information carried in a signal. It involves speech processing, image processing, pattern recognition, and other disciplines, and is widely applied in human-computer interaction, healthcare, transportation, and other fields, so research on emotion recognition is of considerable significance. Existing methods suffer mainly from incomplete feature extraction and the high dimensionality of fused multimodal features. Addressing these problems, this thesis studies an emotional feature extraction method for the speech modality and an emotional feature selection method for multimodal fusion.

To extract more comprehensive speech emotion features and improve recognition performance, this thesis combines the Hilbert-Huang transform with the short-time Fourier transform and proposes a speech emotion feature extraction method. The preprocessed speech signal is decomposed by variational mode decomposition into a series of intrinsic mode components. A Hilbert transform is first applied, and the energy, energy entropy, and average instantaneous frequency of each component, together with the peak value and peak frequency of the total marginal spectrum, are extracted to form one feature vector per frame; the vectors of all frames are concatenated in frame order into a multidimensional feature matrix. Next, the residual component left by the decomposition is discarded, the dominant components are re-aggregated, and Fourier spectral features are extracted and likewise concatenated in frame order into a second feature matrix. To fully reflect the frame-to-frame variation of the features while preserving the intrinsic characteristics of each component within each frame, the two matrices are visualized as feature spectrograms and fused to serve jointly as the input features of the speech modality; the convolutional part of a pre-trained neural network is then retained as a feature extractor, and a support vector machine performs the final emotion recognition. Experimental results show that, compared with features such as MFCCs, spectrograms, and Mel spectrograms, the proposed features raise the speech emotion recognition rate by more than 4.7% on the eNTERFACE'05 dataset and by more than 6.6% on the RAVDESS dataset.

A single-modal emotional signal carries limited emotional information, which keeps recognition accuracy low, whereas effective complementarity of emotional information across modalities can significantly improve recognition. Multimodal fusion, however, inevitably yields high-dimensional emotional features, and the redundancy among features increases the model's computational burden and degrades recognition performance. To address this problem, this thesis proposes a hybrid feature selection method. Filter-based selection strategies using the distance correlation coefficient and the Fisher score evaluate feature performance from two importance perspectives, relevance and discriminative power; the feature sets produced by the two strategies are then fused with weights to screen out a feature subset of lower dimensionality and better performance. Experimental results show that the hybrid strategy improves recognition to a certain extent while effectively reducing the feature dimensionality: on the eNTERFACE'05 dataset, the bimodal recognition rate after feature selection rises by 1.3% and the data dimensionality falls by 96.7%; on the RAVDESS dataset, the recognition rate rises by 1.0% and the dimensionality falls by 99.2%. |
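The per-frame Hilbert feature step described in the abstract can be sketched in Python. This is a minimal illustration, not the thesis's implementation: it assumes the variational mode decomposition stage has already produced the mode components of one frame (e.g. via a package such as vmdpy), and the function name, histogram bin count, and feature ordering are choices made here purely for illustration.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_frame_features(components, fs):
    """Features for one frame from its mode components.

    components : (K, N) array, K mode components of an N-sample frame
                 (assumed output of a prior VMD step, not shown here).
    Returns per-component energy, the energy entropy across components,
    per-component mean instantaneous frequency, and the peak value and
    peak frequency of a crude marginal-spectrum estimate.
    """
    analytic = hilbert(components, axis=1)              # analytic signal per component
    amp = np.abs(analytic)                              # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic), axis=1)
    inst_freq = np.diff(phase, axis=1) * fs / (2 * np.pi)  # instantaneous frequency, Hz

    energy = np.sum(amp ** 2, axis=1)                   # energy of each component
    p = energy / np.sum(energy)                         # energy distribution
    energy_entropy = -np.sum(p * np.log(p + 1e-12))
    mean_if = np.mean(inst_freq, axis=1)

    # crude marginal spectrum: amplitude mass accumulated over frequency bins
    hist, edges = np.histogram(inst_freq.ravel(), bins=64,
                               range=(0, fs / 2), weights=amp[:, 1:].ravel())
    peak_bin = int(np.argmax(hist))
    peak_val = hist[peak_bin]
    peak_freq = 0.5 * (edges[peak_bin] + edges[peak_bin + 1])

    return np.concatenate([energy, [energy_entropy], mean_if, [peak_val, peak_freq]])
```

Stacking the returned vectors of successive frames in frame order yields the multidimensional feature matrix the abstract describes.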
Abstract (English): |
Emotion recognition is a technology by which a computer infers a person's emotional state from the emotional information contained in a signal. It covers speech processing, image processing, pattern recognition, and other subjects, and is broadly used in human-computer interaction, medical care, transportation, and other domains, so research on emotion recognition is of great significance. Existing emotion recognition mainly suffers from incomplete feature extraction and the high dimensionality of fused multimodal features. This thesis studies an emotional feature extraction method for the speech modality and an emotional feature selection method for multimodal fusion.

To extract more comprehensive speech emotion features and enhance recognition performance, a speech emotion feature extraction method is proposed that combines the Hilbert-Huang transform with the short-time Fourier transform. After variational mode decomposition of the preprocessed speech signal yields a series of intrinsic mode components, the Hilbert transform is applied first, and the energy, energy entropy, and average instantaneous frequency of each component, together with the peak value and peak frequency of the total marginal spectrum, are extracted to form one feature vector per frame; the vectors of all frames are concatenated in frame order into a multidimensional feature matrix. Next, the residual component left after decomposition is discarded, the dominant components are re-aggregated, and Fourier spectral features are extracted and likewise concatenated in frame order into a second feature matrix. To fully capture the frame-to-frame variation of the features while retaining the inherent characteristics of each component within each frame, the two feature matrices are visualized as feature spectrograms and fused to serve jointly as the input features of the speech modality. The convolutional part of a pre-trained network is then retained as the feature extractor, and emotion recognition is finally performed by a support vector machine. Experimental results indicate that, compared with MFCCs, spectrogram, and Mel spectrogram features, the proposed features improve the speech emotion recognition rate by more than 4.7% on the eNTERFACE'05 dataset and by more than 6.6% on the RAVDESS dataset.

A single-modal emotional signal contains limited emotional information, so recognition accuracy stays low, whereas the effective complementarity of emotional information across modalities can significantly improve performance. Multimodal fusion, however, inevitably produces high-dimensional features, and the redundancy among them increases the model's computational burden and degrades recognition performance. To address this problem, a hybrid feature selection method is put forward. Filter-based selection strategies using the Fisher score and the distance correlation coefficient evaluate feature performance from two importance perspectives, relevance and discriminative power, and a feature subset with lower dimensionality and better performance is obtained by weighted fusion of the feature sets produced by the two strategies. Experimental results indicate that, while improving emotion recognition to a certain extent, the hybrid method effectively reduces the feature dimensionality: on the eNTERFACE'05 dataset, the bimodal recognition rate after selection improves by 1.3% and the data dimensionality drops by 96.7%; on the RAVDESS dataset, the recognition rate improves by 1.0% and the dimensionality drops by 99.2%. |
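The hybrid feature selection idea, scoring each feature with both the Fisher score and the distance correlation coefficient and then fusing the two normalized score sets with a weight, can be sketched as follows. This is a reconstruction from the abstract, not the thesis's code: the min-max normalization, the single convex-combination weight `alpha`, and all function names are assumptions made for illustration.

```python
import numpy as np

def _dcov2(a, b):
    """Squared sample distance covariance from pairwise-distance matrices."""
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return max((A * B).mean(), 0.0)                     # clamp tiny negatives

def distance_correlation(x, y):
    """Distance correlation between a feature column and the labels."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    dxx, dyy = _dcov2(a, a), _dcov2(b, b)
    if dxx * dyy == 0:
        return 0.0
    return np.sqrt(_dcov2(a, b) / np.sqrt(dxx * dyy))

def fisher_score(x, y):
    """Between-class scatter over within-class scatter for one feature."""
    mu, num, den = x.mean(), 0.0, 0.0
    for c in np.unique(y):
        xc = x[y == c]
        num += len(xc) * (xc.mean() - mu) ** 2
        den += len(xc) * xc.var()
    return num / (den + 1e-12)

def hybrid_select(X, y, k, alpha=0.5):
    """Return indices of the k best features under the fused criterion."""
    fs = np.array([fisher_score(X[:, j], y) for j in range(X.shape[1])])
    dc = np.array([distance_correlation(X[:, j], y.astype(float))
                   for j in range(X.shape[1])])
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
    fused = alpha * norm(fs) + (1 - alpha) * norm(dc)   # weighted fusion of the two scores
    return np.argsort(fused)[::-1][:k]
```

On real bimodal features the selected index set would replace the full concatenated feature vector before classification, which is where the dimensionality reductions reported in the abstract come from.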
CLC Number: | TP391 |
Release Date: | 2023-06-16 |