Thesis Information

Thesis Title (Chinese):

 基于深度学习的连续动作识别研究 (Research on Continuous Action Recognition Based on Deep Learning)

Name:

 袁宵 (Yuan Xiao)

Student ID:

 18207042034

Confidentiality Level:

 Public

Thesis Language:

 Chinese (chi)

Discipline Code:

 081002

Discipline Name:

 Engineering - Information and Communication Engineering - Signal and Information Processing

Student Type:

 Master's

Degree Level:

 Master of Engineering

Degree Year:

 2021

Institution:

 Xi'an University of Science and Technology (西安科技大学)

School:

 School of Communication and Information Engineering (通信与信息工程学院)

Major:

 Signal and Information Processing

Research Direction:

 Video Action Recognition

First Supervisor:

 吴冬梅 (Wu Dongmei)

First Supervisor's Institution:

 Xi'an University of Science and Technology

Submission Date:

 2021-06-18

Defense Date:

 2021-06-03

Thesis Title (English):

 Research on Continuous Action Recognition Based on Deep Learning

Keywords (Chinese):

 动作识别 (action recognition); 深度学习 (deep learning); 金字塔池化 (pyramid pooling); 注意力机制 (attention mechanism); 平滑窗口 (smoothing window)

Keywords (English):

 Action recognition; Deep learning; Pyramid pooling; Attention mechanism; Smoothing window

Abstract (Chinese):

Recognizing continuous human actions through intelligent video surveillance on construction sites is of great significance for safeguarding worker safety. A continuous action is composed of multiple actions and is therefore inherently complex, while existing deep learning networks suffer from high structural complexity and low accuracy, so they remain deficient for continuous human action recognition. This thesis therefore studies continuous action recognition: starting from single actions, it designs a G-ResNet network model with an attention mechanism, and then combines it with a sliding window to complete continuous action recognition.

To address the problem that existing models cannot adequately extract the spatio-temporal features of video, this thesis proposes a human action recognition model based on the G-ResNet network. The model first uses the residual network ResNet34 to extract deep spatial features, overcoming the degradation problem of deep networks; it then uses a GRU network to capture the temporal information between video frames and handle the long-term dependencies across the frame sequence; finally, a three-step training strategy is adopted to optimize the network model, improving the accuracy of action recognition.

To address the insufficiency of the feature information extracted by the G-ResNet network, this thesis proposes a human action recognition model based on the FSAG-ResNet network. Building on G-ResNet, the model first introduces a spatial pyramid pooling operation into the ResNet34 network, extracting features with multi-scale windows so that the extracted features are richer; it then fuses a temporal attention mechanism into the GRU network, assigning different weights to video frames according to their importance, which improves the GRU network's ability to capture key features and further raises the accuracy of action recognition.

To realize continuous action recognition on construction sites, this thesis proposes a method that combines a sliding window with the FSAG-ResNet network. First, video datasets of single actions and continuous actions in different construction-site scenes are built; then, following the idea of transfer learning, the FSAG-ResNet network is applied to the construction site, and the transferred network model is trained on all of the single-action clips together with clips segmented from part of the continuous-action videos; finally, a smoothing window is applied to the continuous-action videos to remove abrupt misclassifications, completing continuous action recognition.

Experimental results show that the FSAG-ResNet network model achieves an accuracy of 96.2% on UCF101 and 64.3% on HMDB51, a considerable improvement over other mainstream networks. When the sliding window is combined with the FSAG-ResNet model for continuous action recognition, every action in a continuous-action video can be detected in real time, with an average recognition rate of 88.79%, verifying the effectiveness of the proposed algorithm.

Abstract (English):

It is of great significance to recognize continuous human actions through intelligent video surveillance on construction sites in order to ensure the safety of workers. A continuous action is composed of multiple actions, each of uncertain duration, and thus has a certain complexity, while existing deep learning networks have high structural complexity and low accuracy, leaving defects for continuous human action recognition. Hence, this paper studies continuous action recognition: from the perspective of single actions, a G-ResNet network model with an attention mechanism is designed, and continuous action recognition is then completed with a sliding window.

Aiming at the problem that existing models cannot adequately extract the temporal and spatial features of video, this paper proposes a human action recognition model based on the G-ResNet network. Firstly, the model uses the ResNet34 network to extract deep spatial features, solving the degradation problem of deep networks. Secondly, a GRU network is used to obtain the temporal information between video frames and to handle the long-term dependencies between frame sequences. Finally, a three-step training strategy is used to optimize the network model and improve the accuracy of action recognition.
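A minimal PyTorch sketch of the ResNet34-plus-GRU pipeline described above may make the architecture concrete; the class name GResNet, the hidden size, and the frame-sampling shapes are illustrative assumptions rather than the thesis's exact configuration:

    # Illustrative sketch: ResNet34 spatial features fed into a GRU.
    import torch
    import torch.nn as nn
    from torchvision import models

    class GResNet(nn.Module):
        def __init__(self, num_classes=101, hidden_size=512):
            super().__init__()
            # Pretrained ResNet34 (torchvision >= 0.13 weights API);
            # drop the final fc layer, keep the 512-d pooled features.
            backbone = models.resnet34(weights="IMAGENET1K_V1")
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])
            # GRU models long-term dependencies across frame features.
            self.gru = nn.GRU(512, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, num_classes)

        def forward(self, clips):                     # clips: (B, T, 3, H, W)
            b, t = clips.shape[:2]
            feats = self.cnn(clips.flatten(0, 1))     # (B*T, 512, 1, 1)
            feats = feats.flatten(1).view(b, t, 512)  # (B, T, 512)
            out, _ = self.gru(feats)                  # (B, T, hidden_size)
            return self.fc(out[:, -1])                # logits from last step

One plausible reading of the three-step training strategy, not spelled out in the abstract, is to first train the classifier head with the backbone frozen, then fine-tune the GRU, and finally fine-tune the whole network end to end.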

To solve the problem of insufficient feature information extraction in the G-ResNet network, this paper proposes a human action recognition model based on the FSAG-ResNet network. The model builds on the G-ResNet network. Firstly, a spatial pyramid pooling operation is introduced into the ResNet34 network, and features are extracted with multi-scale windows to enrich the extracted features. Secondly, a temporal attention mechanism is integrated into the GRU network: different weights are assigned to video frames according to their importance, so the GRU network captures more key features and the accuracy of action recognition is further improved.
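The two FSAG-ResNet additions can be sketched independently; a hedged illustration follows, assuming CNN feature maps of shape (B, C, H, W) and GRU outputs of shape (B, T, hidden_size), with pyramid levels and layer names chosen for illustration:

    # Illustrative sketches of spatial pyramid pooling and temporal attention.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def spatial_pyramid_pool(x, levels=(1, 2, 4)):
        # x: (B, C, H, W) -> (B, C * sum(n*n)) fixed-length features.
        # Each level pools the map into an n x n grid, so features are
        # gathered at multiple scales regardless of the input size.
        pooled = [F.adaptive_max_pool2d(x, n).flatten(1) for n in levels]
        return torch.cat(pooled, dim=1)

    class TemporalAttention(nn.Module):
        # Weights each frame's GRU output by a learned importance score.
        def __init__(self, hidden_size):
            super().__init__()
            self.score = nn.Linear(hidden_size, 1)

        def forward(self, h):                         # h: (B, T, hidden_size)
            w = torch.softmax(self.score(h), dim=1)   # (B, T, 1) frame weights
            return (w * h).sum(dim=1)                 # (B, hidden_size) summary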

A method combining a sliding window with the FSAG-ResNet network is proposed to realize continuous action recognition on construction sites. First, video datasets of single actions and continuous actions in different construction-site scenes are established; then the FSAG-ResNet network is applied to the construction site using the idea of transfer learning, training the transferred model on all single-action clips and on clips segmented from part of the continuous-action videos. Finally, a smoothing window is applied to the continuous-action videos to remove abrupt misclassifications, completing continuous action recognition.
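A sketch of the sliding-window inference and smoothing step might look as follows, assuming a model that maps a clip of frames to a class index; the window length, stride, smoothing width, and the majority-vote rule are placeholder assumptions, since the abstract does not specify them:

    # Illustrative sliding-window inference with majority-vote smoothing.
    import numpy as np

    def sliding_window_predict(frames, model, win=16, stride=8):
        # Return one predicted class index per window of frames.
        preds = []
        for start in range(0, len(frames) - win + 1, stride):
            preds.append(model(frames[start:start + win]))
        return preds

    def smooth_labels(preds, k=5):
        # Majority vote in a k-wide neighborhood around each window,
        # suppressing isolated abrupt misclassifications.
        preds = np.asarray(preds)
        out = preds.copy()
        r = k // 2
        for i in range(len(preds)):
            seg = preds[max(0, i - r): i + r + 1]
            out[i] = np.bincount(seg).argmax()
        return out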

The experimental results show that the accuracy of the FSAG-ResNet network model reaches 96.2% on UCF101 and 64.3% on HMDB51, which is a great improvement compared with other mainstream networks. At the same time, when the sliding window is combined with the FSAG-ResNet network model for continuous action recognition, each action in a continuous-action video can be detected in real time, and the average recognition rate is 88.79%, which verifies the effectiveness of the algorithm in this paper.


CLC Number:

 TP391.4    

Open Access Date:

 2021-06-18    

