View Thesis Information

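Thesis title (Chinese):	基于视听文多模态融合的情感识别
Name:	Zhang Yu (张屿)
Student ID:	22308223005
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	085400
Discipline name:	Engineering - Electronic Information
Student type:	Master's
Degree level:	Master of Engineering
Degree year:	2025
Degree-granting institution:	Xi'an University of Science and Technology
School/Department:	College of Artificial Intelligence and Computer Science
Major:	Computer Technology
Research direction:	Image Processing
First supervisor:	Yang Xiaoqiang (杨晓强)
First supervisor's affiliation:	Xi'an University of Science and Technology
Thesis submission date:	2025-06-18
Thesis defense date:	2025-05-30
Thesis title (foreign language):	Emotion recognition based on multimodal fusion of audio-visual texts
Keywords (Chinese):	情感识别 ; 时序对齐 ; 多模态融合 ; 跨模态融合
Keywords (foreign language):	Emotion Recognition ; Temporal alignment ; Multi-modal fusion ; Cross-modal fusion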
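Abstract (translated from Chinese):	︿With the rapid development of deep learning, emotion recognition methods have shifted from traditional hand-crafted and machine learning approaches to deep learning models. However, how to extract emotional features from different modalities and how to fuse them effectively remain key problems that need in-depth study. This thesis therefore takes deep learning as its core, studying emotion recognition for each single modality while exploring multimodal fusion methods. The main work is as follows: (1) For facial recognition in video, where traditional 2D CNNs capture only spatial features, ignore temporal dynamics, and suffer from vanishing gradients in deep networks, a 3D-ResNet facial emotion recognition model combined with the CBAM module is proposed. The model uses a three-dimensional convolutional neural network (3DCNN) to capture the spatio-temporal dynamics of video sequences, adopts a hierarchical, progressive structure of residual block groups, and embeds an improved CBAM attention mechanism in every residual block. The CBAM module optimizes the multilayer perceptron (MLP) of the channel attention by expanding the dimension first and then reducing it (expansion ratio 8), which effectively preserves the detail features of face images; in addition, the SeLU activation function replaces the traditional ReLU to alleviate the dying-neuron problem in deep networks. Experimental results show that the model reaches accuracies of 80.61% and 81.14% on the CH-SIMS v2.0 and eNTERFACE'05 datasets, respectively. (2) For speech emotion recognition, to address feature redundancy and the difficulty of distinguishing complex emotion categories, a multi-pathway speech emotion recognition model based on BLSTM_CBAM is proposed. Each speech feature is fed in as an independent channel, and local classifiers are combined with a global classifier to reduce information redundancy. In the local classifiers, BLSTM captures long-term dependencies in speech, and CBAM weights key features through channel and spatial attention, strengthening the understanding of complex speech emotions. Experiments show that the model reaches accuracies of 75.33% and 75.23% on CH-SIMS v2.0 and eNTERFACE'05, respectively. (3) For natural language processing, to address insufficient modeling of contextual dependencies and limited feature extraction, a BERT_BLSTM model based on the self-attention mechanism is proposed. The model combines BERT's global context understanding, BLSTM's strength in sequence modeling, and the feature-focusing ability of self-attention. Experiments show accuracies of 84.54% and 84.17% on CH-SIMS v2.0 and SimplifiedWeibo-4-Moods, respectively. (4) For multimodal fusion, to address feature alignment and temporal alignment between modalities and the choice of fusion method, a multimodal fusion method based on a cross-modal attention mechanism is proposed. High-level semantic representations are first obtained by training each modality independently; fully connected layers then perform feature alignment, a bidirectional LSTM performs temporal alignment, and finally the Transformer multi-head attention mechanism (VA-T-MultiHeadAttention) performs cross-modal fusion. Specifically, the text modality first passes through self-attention to extract a high-order semantic representation, and multi-head attention is used for cross-modal interaction between text and video and between text and speech. To verify the effectiveness of this fusion strategy, V-A-T-SelfAttention and VA-T-CrossAttention were designed for comparison experiments. On the IEMOCAP and CH-SIMS v2.0 datasets, all fusion models perform well, and the proposed VA-T-MultiHeadAttention performs best, with accuracies of 84.66% and 84.76% on the two datasets, respectively. Subsequent ablation experiments further confirm that, compared with single-modality and dual-modality emotion recognition, the proposed model has advantages in mining latent correlations in multimodal data and improving recognition accuracy.﹀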
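The improved channel attention in contribution (1) can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: the tensor shape, the random weights, the absence of biases, and the placement of SELU inside the MLP are all assumptions made for illustration. Only the expand-then-reduce MLP with expansion ratio 8 and the use of SELU in place of ReLU come from the abstract. The spatial-attention half of CBAM is omitted.

```python
import numpy as np

# Standard SELU constants (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    # scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise.
    negative = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_SCALE * np.where(x > 0, x, negative)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention_3d(feat, w_up, w_down):
    """CBAM-style channel attention on a (C, T, H, W) feature map.

    The shared MLP expands C -> 8C and then reduces 8C -> C, instead of
    the usual reduce-then-expand bottleneck.
    """
    avg = feat.mean(axis=(1, 2, 3))  # (C,) global average pooling
    mx = feat.max(axis=(1, 2, 3))    # (C,) global max pooling

    def mlp(v):
        return selu(v @ w_up) @ w_down

    weights = sigmoid(mlp(avg) + mlp(mx))  # (C,), each value in (0, 1)
    return feat * weights[:, None, None, None], weights


# Hypothetical toy sizes; the thesis does not state them in the abstract.
rng = np.random.default_rng(0)
C, T, H, W = 4, 3, 5, 5
EXPANSION = 8
feat = rng.standard_normal((C, T, H, W))
w_up = rng.standard_normal((C, C * EXPANSION)) * 0.1
w_down = rng.standard_normal((C * EXPANSION, C)) * 0.1

out, weights = channel_attention_3d(feat, w_up, w_down)
```

Here `out` has the same shape as `feat`, with each channel rescaled by its attention weight. Expanding before reducing gives the MLP a wider hidden layer than the input, which is one plausible reading of how the design "preserves detail features".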
Abstract (foreign language):	︿With the rapid advancement of deep learning, emotion recognition methods have shifted from traditional hand-crafted and machine learning approaches to deep learning models. However, extracting emotional features from different modalities and fusing them effectively remain critical challenges that require in-depth exploration. This study takes deep learning as its core, investigating single-modality emotion recognition while exploring multimodal fusion methods. The main contributions are as follows: (1) To address the limitations of traditional 2DCNNs (spatial-only feature capture, ignored temporal dynamics, and vanishing gradients in deep networks), a 3D-ResNet facial emotion recognition model integrated with the CBAM module is proposed. It uses a 3DCNN to extract spatio-temporal features from video sequences and employs hierarchical residual block groups, each block embedding an improved CBAM attention mechanism. The CBAM module optimizes the channel-attention MLP with an expand-then-reduce strategy (expansion ratio = 8) and replaces ReLU with SeLU to mitigate neuron death in deep networks. On CH-SIMS v2.0 and eNTERFACE'05, it achieves 80.61% and 81.14% accuracy, respectively. (2) A BLSTM_CBAM-based multi-pathway model is developed to tackle feature redundancy and the difficulty of classifying complex emotions in speech. By treating each speech feature as an independent channel and combining local and global classifiers, it reduces redundancy. BLSTM captures long-term dependencies, while CBAM enhances the understanding of complex emotions via channel and spatial attention. It achieves 75.33% (CH-SIMS v2.0) and 75.23% (eNTERFACE'05) accuracy, outperforming baselines. (3) To address insufficient modeling of contextual dependencies and limited feature extraction in natural language processing, a BERT_BLSTM model based on the self-attention mechanism is proposed. The model combines the global context understanding of BERT, the temporal modeling strength of BLSTM, and the feature-focusing ability of self-attention. Experiments show that the accuracy reaches 84.54% and 84.17% on CH-SIMS v2.0 and SimplifiedWeibo-4-Moods, respectively. (4) A cross-modal attention-based fusion method is introduced to solve the modality alignment and fusion challenges. It involves single-modality semantic extraction, feature alignment via fully connected layers, temporal alignment with a bidirectional LSTM, and cross-modal fusion using the Transformer's multi-head attention (VA-T-MultiHeadAttention). Comparative experiments against V-A-T-SelfAttention and VA-T-CrossAttention on IEMOCAP and CH-SIMS v2.0 show that VA-T-MultiHeadAttention achieves the highest accuracy (84.66% and 84.76%, respectively). Ablation studies confirm its advantage over single-modality and dual-modality models in exploiting multimodal correlations and improving accuracy.﹀
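The fusion steps in contribution (4) can be sketched in NumPy as below. This is a toy illustration under stated assumptions, not the thesis's VA-T-MultiHeadAttention model: all dimensions, sequence lengths, the class count, and the random weights are hypothetical; using text as the query and video or audio as key/value is an assumed reading of "cross-modal interaction between text and video, text and speech"; and the bidirectional LSTM temporal-alignment step is omitted for brevity, since attention itself accepts sequences of different lengths.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, context, params, num_heads):
    """Scaled dot-product multi-head attention.

    query: (Lq, d), context: (Lk, d). Returns (Lq, d) and the
    attention maps of shape (num_heads, Lq, Lk).
    """
    wq, wk, wv, wo = params
    d = query.shape[1]
    dh = d // num_heads

    def split(x):  # (L, d) -> (num_heads, L, dh)
        return x.reshape(x.shape[0], num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(query @ wq), split(context @ wk), split(context @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    attn = softmax(scores, axis=-1)
    heads = attn @ v  # (num_heads, Lq, dh)
    merged = heads.transpose(1, 0, 2).reshape(query.shape[0], d)
    return merged @ wo, attn


rng = np.random.default_rng(0)
D_MODEL, NUM_HEADS, NUM_CLASSES = 16, 4, 3  # hypothetical sizes


def init(shape):
    return rng.standard_normal(shape) * 0.1


def attn_params():
    return tuple(init((D_MODEL, D_MODEL)) for _ in range(4))


# Stand-ins for unimodal features from the pretrained single-modality models.
text = rng.standard_normal((6, 32))    # 6 tokens
video = rng.standard_normal((10, 24))  # 10 frames
audio = rng.standard_normal((20, 12))  # 20 acoustic frames

# Feature alignment: one linear (fully connected) projection per modality.
t = text @ init((32, D_MODEL))
v = video @ init((24, D_MODEL))
a = audio @ init((12, D_MODEL))

# Text self-attention, then text-to-video and text-to-audio cross-attention.
t_self, _ = multi_head_attention(t, t, attn_params(), NUM_HEADS)
t_v, attn_tv = multi_head_attention(t_self, v, attn_params(), NUM_HEADS)
t_a, attn_ta = multi_head_attention(t_self, a, attn_params(), NUM_HEADS)

# Fuse by concatenation and mean pooling over tokens, then classify.
fused = np.concatenate([t_self, t_v, t_a], axis=1).mean(axis=0)
probs = softmax(fused @ init((3 * D_MODEL, NUM_CLASSES)))
```

Because text supplies the queries, both cross-attention outputs have the text sequence length, so the three streams can be concatenated per token without any further resampling.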
Chinese Library Classification (CLC) number:	TP391.41
Open-access date:	2025-06-18
