Thesis Information

Chinese title:

Research on Human Action Recognition Based on Improved 3DCNN and LSTM

Author:

Wen Jianrong (汶建荣)

Student ID:

21207223068

Confidentiality level:

Public

Thesis language:

Chinese (chi)

Discipline code:

085400

Discipline:

Engineering - Electronic Information

Student type:

Master's

Degree:

Master of Engineering

Degree year:

2024

Degree-granting institution:

Xi'an University of Science and Technology

College:

College of Communication and Information Engineering

Program:

Electronics and Communication Engineering

Research area:

Computer vision

Primary supervisor:

Wang Xiaolu (王晓路)

Supervisor's institution:

Xi'an University of Science and Technology

Submission date:

2024-06-12

Defense date:

2024-06-06

English title:

Research on Human Action Recognition Based on Improved 3DCNN and LSTM

Chinese keywords:

Deep learning; Human action recognition; Action-time perception; 3DCNN; LSTM

English keywords:

Deep learning; Human action recognition; Action-time perception; 3DCNN; LSTM

Chinese abstract:

Human action recognition is an important and widely studied research direction in computer vision. The action recognition task operates on video, so unlike image classification it must capture temporal features in addition to spatial ones. Building on the 3D convolutional neural network and the LSTM network, this thesis therefore studies how to effectively extract spatiotemporal information and capture long-term action features.

To address the redundant information in video and the sparse distribution of the feature channels that carry action information, the 3D convolutional neural network is improved with a motion-time perception module, composed of a motion perception module and a temporal attention module. The motion perception module computes feature-level temporal differences to excite motion-sensitive channels and thereby extract motion features; the temporal attention module applies a temporal convolution along the time dimension to compute an attention weight matrix, then multiplies the feature map by this matrix for adaptive feature learning, yielding temporal features. Inserting the motion-time perception module into a 3D convolutional neural network gives the 3DCNN based on Action-Time Perception (ATMNet). Experiments on the public UCF101 and HMDB51 datasets show that ATMNet improves human action recognition accuracy over each of its base networks; the largest gain is over 3DResNeXt-101, where accuracy rises by 1.6% and 0.6% respectively, demonstrating that the proposed improvement to the 3D convolutional neural network is feasible and effective.

Because ATMNet alone cannot fully capture long-term action features, an LSTM network is introduced to model the dependencies between different sequences. Cascading ATMNet with the LSTM yields the ATMNet-LSTM network, which obtains richer action-feature information: ATMNet captures the short-term action features of individual clips, while the LSTM captures the dependencies between the features of those clips. To improve generalization, a parameter-weighted combination of center loss and cross-entropy loss is used as the model's loss function. Experiments on the public UCF101 and HMDB51 datasets show that ATMNet-LSTM improves human action recognition accuracy over ATMNet by 0.3% and 3.5% respectively, indicating that modeling the dependencies between sequence features further improves recognition accuracy.

English abstract:

Human action recognition is one of the important research directions in computer vision and has attracted wide attention. The action recognition task deals with video: compared with image classification, it must obtain not only spatial features but also temporal features. Based on the 3D convolutional neural network and the LSTM network, this paper therefore focuses on how to effectively extract spatiotemporal information and obtain long-term action features.

To solve the problems of redundant information in video and the sparse distribution of feature channels containing action information, the 3D convolutional neural network is improved by designing a motion-time perception module. The module consists of a motion perception module and a temporal attention module. The motion perception module calculates temporal differences at the feature level to excite motion-sensitive channels and obtain motion features. The temporal attention module uses temporal convolution to calculate an attention weight matrix along the time dimension and multiplies the feature map by this matrix for adaptive feature learning, so as to obtain temporal features. The 3DCNN based on Action-Time Perception (ATMNet) is constructed by adding the motion-time perception module to the 3D convolutional neural network. Experimental results show that on the public datasets UCF101 and HMDB51, ATMNet improves the accuracy of human action recognition compared with each corresponding base network. The best improvement is over the 3DResNeXt-101 network, where accuracy increases by 1.6% and 0.6%, respectively, indicating that the proposed improvement to the 3D convolutional neural network is feasible and effective.
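
To make the module's two operations concrete, below is a minimal PyTorch-style sketch of how feature-level temporal differencing and temporal attention might be combined. It is an illustration under stated assumptions, not the thesis's implementation: the class name MotionTimePerception, the SE-style bottleneck used for channel excitation, the kernel sizes, and the sigmoid activations are all hypothetical choices.

```python
import torch
import torch.nn as nn

class MotionTimePerception(nn.Module):
    """Hypothetical sketch of a motion-time perception block.

    Operates on a 3D CNN feature map x of shape (N, C, T, H, W).
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Motion perception: turn feature-level temporal differences
        # into per-channel excitation weights (SE-style bottleneck).
        self.squeeze = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep T, pool H and W
        self.excite = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Temporal attention: a 1D convolution along T produces one
        # attention weight per time step.
        self.t_conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        # --- motion perception ---
        s = self.squeeze(x)                          # (N, C, T, 1, 1)
        diff = s[:, :, 1:] - s[:, :, :-1]            # feature-level temporal difference
        diff = torch.cat([diff, torch.zeros_like(diff[:, :, -1:])], dim=2)  # pad to T
        x = x * self.excite(diff)                    # excite motion-sensitive channels
        # --- temporal attention ---
        s2 = x.mean(dim=(3, 4))                      # (N, C, T)
        attn = torch.sigmoid(self.t_conv(s2))        # (N, 1, T) attention weights
        return x * attn.unsqueeze(-1).unsqueeze(-1)  # adaptive per-time rescaling
```

Gating first and attending second mirrors the abstract's ordering (motion features, then temporal features); the abstract does not fix where in each residual block of the backbone the module sits, so that placement remains a design choice.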

To solve the problem that ATMNet cannot fully obtain long-term action features, the LSTM network is introduced to capture the dependencies between different sequences. Cascading ATMNet with the LSTM forms the ATMNet-LSTM network, which obtains more complete action-feature information: ATMNet captures the short-term action features of different clips, while the LSTM captures the dependencies between the features of each clip. To improve the generalization of the network, center loss and cross-entropy loss combined with an adjustable weighting parameter are used as the loss function of the model. Experimental results show that on the public datasets UCF101 and HMDB51, the accuracy of the ATMNet-LSTM network improves over ATMNet by 0.3% and 3.5%, respectively, showing that modeling the dependencies between sequence features further improves the accuracy of human action recognition.
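
A cascade of this shape, together with the joint loss, could be sketched as follows in PyTorch. The head consumes per-clip features already produced by the 3D CNN backbone; the class name ClipSequenceHead, the single-scalar weight lam, and the use of the last hidden state to summarize the video are hypothetical, since the abstract does not specify the exact parameter-adjustment scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipSequenceHead(nn.Module):
    """Hypothetical LSTM head over per-clip features from the 3D CNN.

    clip_feats: (N, K, D) -- K clips per video, D-dimensional features.
    """
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        # One learnable center per class, used by the center loss.
        self.centers = nn.Parameter(torch.zeros(num_classes, hidden_dim))

    def forward(self, clip_feats):
        out, _ = self.lstm(clip_feats)   # dependencies across clip features
        video_feat = out[:, -1]          # last hidden state summarizes the video
        return video_feat, self.classifier(video_feat)

def joint_loss(video_feat, logits, labels, centers, lam=0.01):
    """Cross-entropy plus center loss, balanced by the scalar lam."""
    ce = F.cross_entropy(logits, labels)
    # Center loss: pull each video's feature toward its class center.
    center = 0.5 * ((video_feat - centers[labels]) ** 2).sum(dim=1).mean()
    return ce + lam * center
```

In this sketch the class centers are ordinary parameters updated by backpropagation together with the rest of the head; at inference time only the classifier branch is needed.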

CLC number:

TP391.41

Open access date:

2024-06-12
