Thesis Information

Chinese title:

基于双流卷积神经网络的行为识别算法研究 (Research on Behavior Recognition Algorithm Based on Two-Stream Convolutional Neural Network)

Name:

王佳莹 (Wang Jiaying)

Student ID:

 18208052008    

Confidentiality level:

Public

Language:

Chinese (chi)

Discipline code:

 081203    

Discipline:

Engineering - Computer Science and Technology (degrees may be conferred in Engineering or Science) - Computer Application Technology

Student type:

Master's

Degree:

Master of Engineering

Degree year:

 2021    

Institution:

Xi'an University of Science and Technology

School:

College of Computer Science and Technology

Major:

Computer Application Technology

Research direction:

Graphics and Image Processing

First supervisor:

李占利 (Li Zhanli)

First supervisor's institution:

Xi'an University of Science and Technology

Submission date:

 2021-06-21    

Defense date:

 2021-06-03    

English title:

 Research on Behavior Recognition Algorithm Based on Two-stream Convolutional Neural Network    

Chinese keywords:

双流卷积神经网络 (two-stream convolutional neural network); scSE模块 (scSE module); 非局部操作 (non-local operation); 行为识别 (action recognition); 深度学习 (deep learning)

English keywords:

Two-stream convolutional neural network; spatial and channel Squeeze & Excitation (scSE) block; non-local operation; action recognition; deep learning

Chinese abstract:

In recent years, action recognition technology has matured steadily with the development of science and technology. A key difficulty of action recognition is capturing inter-frame motion information while also obtaining appearance information, and the two-stream convolutional neural network has received wide attention for its ability to capture spatio-temporal information. However, noise, illumination changes, and other factors contained in video, as well as actions that last a long time, all affect recognition accuracy. Taking the two-stream convolutional neural network as the foundation, this thesis studies these problems in depth and proposes the following strategies to improve the recognition rate:
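To ground the discussion, the sketch below shows the basic two-stream design (Simonyan and Zisserman, 2014) that the thesis starts from: one stream reads a single RGB frame for appearance, the other reads a stack of optical-flow fields for motion, and class scores are fused late. The ResNet-18 backbone, the 10-channel flow stack (5 frames x 2 flow directions), and all names are illustrative assumptions, not the thesis's exact setup.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamNet(nn.Module):
    """Minimal two-stream sketch: spatial stream over one RGB frame,
    temporal stream over stacked optical flow, late score fusion.
    Backbone and flow-stack depth are illustrative assumptions."""

    def __init__(self, num_classes: int = 101, flow_channels: int = 10):
        super().__init__()
        self.spatial = models.resnet18(num_classes=num_classes)
        self.temporal = models.resnet18(num_classes=num_classes)
        # Widen the temporal stream's first conv to accept a stack of
        # 5 flow frames x 2 directions = 10 input channels.
        self.temporal.conv1 = nn.Conv2d(flow_channels, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Late fusion: average the class scores of the two streams.
        return (self.spatial(rgb) + self.temporal(flow)) / 2

# Example shapes: one RGB frame and one 10-channel flow stack.
net = TwoStreamNet()
scores = net(torch.randn(1, 3, 224, 224), torch.randn(1, 10, 224, 224))
```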
(1) The scSE module is used to screen image features, and a two-stream network framework fused with scSE is proposed. The model can attend to the information between channels, assign larger weights to action features, and weaken the influence of background information. The features processed by scSE are visualized and the results analyzed; the experiments show that the scSE module attends to important features and thus raises the network's recognition rate.
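For concreteness, here is a minimal sketch of an scSE block after Roy et al. (2018): a channel branch squeezes spatial information to reweight channels, a spatial branch squeezes channels to reweight positions, and the two recalibrated maps are merged. The reduction ratio, layer names, and additive merge are illustrative choices, not necessarily the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    """Concurrent spatial and channel Squeeze & Excitation (scSE);
    a sketch after Roy et al. (2018), not the thesis's exact code."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel SE: squeeze space by global pooling, excite channels.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial SE: squeeze channels with a 1x1 conv, excite positions.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                 nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recalibrate along channels and along positions, then merge,
        # so informative action features are amplified and background
        # responses are suppressed.
        return x * self.cse(x) + x * self.sse(x)
```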
(2) On top of the scSE-fused two-stream framework, a "segment-fuse" strategy is proposed, and the scSE_BNInception two-stream network is constructed with BN-Inception as the backbone. While still screening features, this network handles the recognition of long videos better. The original video is first divided into K equal-length, non-overlapping temporal segments; RGB frames and optical-flow images are then sparsely sampled from each segment and fed into the scSE_BNInception two-stream network; finally the recognition results of the K segments are fused. Compared with the two-stream convolutional neural network, the temporal segment network, and other algorithms, scSE_BNInception improves action recognition accuracy while preserving running speed.
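The "segment-fuse" procedure follows the sparse-sampling idea of temporal segment networks; the hypothetical helper below sketches it for one stream, assuming `model` maps a batch of one frame to per-class scores (the function name and the K = 3 default are illustrative).

```python
import torch

def segment_fuse(model, frames: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Sketch of "segment-fuse": split a video of T frames into K
    equal, non-overlapping segments, sparsely sample one frame per
    segment, and fuse the K predictions by averaging."""
    seg_len = frames.shape[0] // k          # frames: (T, C, H, W)
    scores = []
    for i in range(k):
        # Sparse sampling: one random frame from the i-th segment.
        idx = i * seg_len + torch.randint(seg_len, (1,)).item()
        scores.append(model(frames[idx].unsqueeze(0)))
    # Segmental consensus: average the segment-level class scores.
    return torch.stack(scores).mean(dim=0)
```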
(3) ResNet101 is used to construct the two-stream model. On this basis, scSE convolutional layers are added to screen features and reduce noise interference, and non-local layers are added to attend to long-range dependencies and obtain global information, yielding the SC_NLResNet two-stream network. Comparisons with the scSE_BNInception two-stream network and other algorithms on the UCF101 and HMDB51 datasets show that SC_NLResNet effectively improves recognition accuracy on videos containing noise, illumination changes, and other factors.
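The non-local layer can be sketched in the embedded-Gaussian form of Wang et al. (2018): every position attends to every other position, so the response aggregates global information rather than a local neighborhood. The half-channel bottleneck and layer names are illustrative, and the exact placement of such layers inside ResNet101 in SC_NLResNet is not reproduced here.

```python
import torch
import torch.nn as nn

class NonLocalBlock2d(nn.Module):
    """Embedded-Gaussian non-local block for 2D feature maps;
    a sketch after Wang et al. (2018)."""

    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2                 # bottleneck width
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Pairwise similarity between all spatial positions.
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        # Residual connection keeps the original signal intact.
        return x + self.out(y)
```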
Taking the scSE-fused two-stream network as the basic framework, the thesis proposes the scSE_BNInception network, which improves the recognition rate on long videos while maintaining speed, and the SC_NLResNet network, which better handles the influence of noise and illumination changes. On the UCF101 and HMDB51 datasets the two networks reach recognition rates of 96.4% and 71.3%, and 96.9% and 76.2%, respectively.
 

English abstract:

In recent years, with the development of science and technology, action recognition has matured rapidly. A central difficulty is capturing appearance information from still frames while also capturing motion information between frames; the two-stream convolutional neural network has attracted wide attention because it can capture such spatio-temporal information. However, noise, illumination changes, and other factors in video, as well as long action durations, all reduce recognition accuracy. Building on the two-stream convolutional neural network and targeting these problems, this thesis proposes the following strategies to improve the recognition rate:
(1) The scSE module is used to filter image features, and a two-stream network framework fused with scSE is proposed. The model attends to inter-channel information, assigns greater weight to action features, and weakens the influence of background information. The features processed by scSE are visualized and analyzed; the experimental results show that the scSE module focuses on important features and thereby improves the recognition rate of the network.
(2) On the basis of the scSE-fused two-stream framework, a "segment-fuse" strategy is proposed, and the scSE_BNInception two-stream network is built on the BN-Inception backbone. Besides filtering features, the network better handles the recognition of long videos. First, the original video is divided into K equal-length, non-overlapping temporal segments; then RGB frames and optical-flow images are sparsely sampled from each segment and fed into the scSE_BNInception two-stream network; finally, the K segment-level results are fused. Compared with the two-stream convolutional neural network, the temporal segment network, and other algorithms, scSE_BNInception improves action recognition accuracy while maintaining running speed.
(3) ResNet101 is used to construct the two-stream model. On this basis, scSE convolutional layers are added to filter features and reduce noise interference, and non-local layers are added to model long-range dependencies and obtain global information, yielding the SC_NLResNet two-stream network. Comparisons with scSE_BNInception and other algorithms on the UCF101 and HMDB51 datasets show that SC_NLResNet effectively improves recognition accuracy on videos containing noise, illumination changes, and other factors.
With the scSE-fused two-stream network as the basic framework, the thesis proposes scSE_BNInception, which improves the recognition rate on long videos while maintaining speed, and SC_NLResNet, which better handles noise and illumination changes. Their recognition rates reach 96.4% and 71.3% (scSE_BNInception) and 96.9% and 76.2% (SC_NLResNet) on UCF101 and HMDB51, respectively.
 


CLC number:

 TP391.413    

Open access date:

 2021-06-22    
