Thesis Title (Chinese): |
基于卷积神经网络的人体姿态估计研究
|
Name: |
李飞彪
|
Student ID: |
19208208051
|
Confidentiality Level: |
Public
|
Thesis Language: |
chi
|
Discipline Code: |
085212
|
Discipline Name: |
Engineering - Engineering - Software Engineering
|
Student Type: |
Master's
|
Degree Level: |
Master of Engineering
|
Degree Year: |
2022
|
Degree-granting Institution: |
Xi'an University of Science and Technology
|
School/College: |
College of Computer Science and Technology
|
Major: |
Software Engineering
|
Research Direction: |
Computer vision and image detection
|
First Supervisor: |
罗晓霞
|
First Supervisor's Institution: |
Xi'an University of Science and Technology
|
Submission Date: |
2022-06-22
|
Defense Date: |
2022-06-07
|
Thesis Title (English): |
Research on Human Pose Estimation Based on Convolutional Neural Network
|
Keywords (Chinese): |
人体姿态估计 ; 注意力机制 ; 残差模块 ; 轻量化
|
Keywords (English): |
Human Pose Estimation ; Attention Mechanism ; Residual Module ; Lightweighting
|
Abstract (Chinese): |
︿
人体姿态估计是从图像和视频中定位人体的主要关节部位并建立人体表征,有助于图像理解和行为识别。基于卷积神经网络的人体姿态估计方法使关键点的估计精度有较大提高,但关节被遮挡、复杂的图像背景等仍是人体姿态估计面临的挑战,且参数量和计算量大是复杂网络有待优化的问题。本文进行了以下研究:
针对单人姿态估计堆叠沙漏网络对复杂背景图像中人体关键点估计准确率较低的问题,对残差模块进行优化,提出融合通道注意力机制和融合极化自注意力机制两种残差模块。实验结果表明,融合极化自注意力机制比融合通道注意力机制使网络的关键点估计准确率提升更高。针对网络参数量和计算量多的问题,用深度可分离卷积替换残差模块的普通卷积,最后提出融合极化自注意力机制的轻量级堆叠沙漏网络。在MPII数据集上的实验结果表明,改进后网络比原始堆叠沙漏网络的关键点估计准确率提升1.5%,参数量减少51.5%,计算量降低51.7%。
针对多尺度感知高分辨率多人人体姿态估计网络HigherHRNet参数量和计算量多导致训练过程消耗过多资源及不利于嵌入式设备部署的问题,提出轻量级的HigherHRNet。采用GhostNet中以简单线性变换生成特征图的GhostModule替换残差模块中的3×3普通卷积,降低网络的计算量与参数量。在COCO数据集上的实验结果表明,轻量化HigherHRNet估计的不同相似度的关键点准确率比原始网络降低0.2%~0.9%,但参数量和计算量均减少约50%,表明轻量化优化在有效减少网络参数量和计算量的同时会影响关键点估计准确率。为弥补准确率的下降,用scSE空间和通道混合注意力机制在后处理阶段转置卷积前激励有效空间和通道特征信息,增强高分辨率特征图中的有效特征信息。实验结果表明,基于注意力机制的轻量级HigherHRNet比原始网络的mAP提升了0.2%。
设计并实现学生姿态估计与识别系统。使用基于极化自注意力机制轻量级SHNet和基于注意力机制轻量级HigherHRNet分别估计单人图像和多人图像中学生姿态关键点信息,将其标记在图像中。收集八种学生姿态图像构建单人学生姿态数据集,选用分类网络MobileNetV2应用于学生姿态识别。
﹀
|
Abstract (English): |
︿
Human pose estimation locates the main joints of the human body in images and videos and builds a representation of the body, which aids image understanding and action recognition. Methods based on convolutional neural networks have greatly improved the accuracy of keypoint estimation, but occluded joints and complex image backgrounds remain challenges for human pose estimation, and the large parameter counts and computational costs of complex networks still need to be optimized. This thesis carries out the following studies:
To address the low accuracy of the stacked hourglass network for single-person pose estimation on images with complex backgrounds, the residual module is optimized: two residual modules are proposed, one fusing a channel attention mechanism and the other fusing a polarized self-attention mechanism. Experimental results show that fusing polarized self-attention raises the network's keypoint estimation accuracy more than fusing channel attention. To address the network's large parameter count and computational cost, the ordinary convolutions in the residual module are replaced with depthwise separable convolutions, yielding a lightweight stacked hourglass network fused with polarized self-attention. Experimental results on the MPII dataset show that, compared with the original stacked hourglass network, the improved network raises keypoint estimation accuracy by 1.5% while reducing parameters by 51.5% and computation by 51.7%.
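A back-of-the-envelope sketch of why the depthwise separable replacement described above shrinks the network: counting the weights of a standard 3×3 convolution versus its depthwise + pointwise factorization. The channel sizes below are illustrative assumptions, not values taken from the thesis, and biases are ignored.

```python
# Sketch: parameter counts for a standard 3x3 convolution vs. a
# depthwise separable factorization (depthwise 3x3 + pointwise 1x1).
# Channel sizes are illustrative, not taken from the thesis; biases ignored.

def standard_conv_params(c_in, c_out, k=3):
    # Each output channel owns a k x k filter over every input channel.
    return c_out * c_in * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1 x 1 convolution that mixes channels.
    return c_in * k * k + c_out * c_in

c_in, c_out = 128, 256
std = standard_conv_params(c_in, c_out)
sep = depthwise_separable_params(c_in, c_out)
print(std, sep, round(100 * (1 - sep / std), 1))  # prints 294912 33920 88.5
```

A single layer saves far more than 50%; the network-level reductions reported above are smaller because only the residual modules' convolutions are replaced.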
To address the problem that the large parameter count and computational cost of HigherHRNet, a scale-aware high-resolution multi-person pose estimation network, make training resource-intensive and hinder deployment on embedded devices, a lightweight HigherHRNet is proposed. The Ghost module from GhostNet, which generates feature maps by cheap linear transformations, replaces the ordinary 3×3 convolutions in the residual module, reducing the network's computation and parameter count. Experimental results on the COCO dataset show that the keypoint accuracy of the lightweight HigherHRNet at different similarity thresholds is 0.2%-0.9% lower than that of the original network, while parameters and computation are both reduced by about 50%, indicating that the lightweight optimization effectively shrinks the network at some cost in keypoint estimation accuracy. To compensate for this drop, the scSE mixed spatial and channel attention mechanism is applied before the transposed convolution in the post-processing stage to excite effective spatial and channel features, enhancing the useful information in the high-resolution feature maps. Experimental results show that the attention-based lightweight HigherHRNet improves mAP by 0.2% over the original network.
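The scSE recalibration described above can be sketched in a few lines of NumPy: a channel branch (global pooling, two small fully connected layers, sigmoid gates) and a spatial branch (a 1×1 convolution across channels, sigmoid gates). The weights here are random stand-ins, and merging the branches by element-wise maximum is just one common variant (addition is another), so this is an illustrative sketch rather than the thesis's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(x, w1, w2, w_sp):
    """Sketch of scSE recalibration on a feature map x of shape (C, H, W).

    cSE: global average pool -> two FC layers -> sigmoid -> scale channels.
    sSE: 1x1 convolution across channels -> sigmoid -> scale positions.
    Branches merged by element-wise maximum here; addition is also common.
    """
    # Channel squeeze-and-excitation (cSE)
    z = x.mean(axis=(1, 2))                            # (C,) global average pool
    gate_c = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))     # (C,) channel gates
    x_cse = x * gate_c[:, None, None]
    # Spatial squeeze-and-excitation (sSE): 1x1 conv == weighted channel sum
    gate_s = sigmoid(np.tensordot(w_sp, x, axes=([0], [0])))  # (H, W) gates
    x_sse = x * gate_s[None, :, :]
    return np.maximum(x_cse, x_sse)

rng = np.random.default_rng(0)
c, h, w, r = 8, 4, 4, 2                                # r: cSE reduction width
x = rng.standard_normal((c, h, w))
out = scse(x,
           rng.standard_normal((r, c)),                # w1: squeeze FC
           rng.standard_normal((c, r)),                # w2: excite FC
           rng.standard_normal(c))                     # w_sp: 1x1 conv weights
print(out.shape)  # (8, 4, 4)
```

Both gates lie in (0, 1), so the module can only suppress or pass features; placed before the transposed convolution, it reweights which spatial positions and channels feed the upsampled high-resolution maps.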
A student pose estimation and recognition system is designed and implemented. The polarized-self-attention-based lightweight SHNet and the attention-based lightweight HigherHRNet estimate student pose keypoints in single-person and multi-person images respectively, and the keypoints are marked on the images. Images of eight student poses are collected to build a single-person student pose dataset, and the classification network MobileNetV2 is applied to student pose recognition.
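The system's inference flow might be sketched as below. Everything here is a hypothetical stand-in: the person detector used for routing, the three model callables, and the assumption that the pose classifier consumes keypoints rather than cropped images are all illustrative choices, not details stated in the abstract.

```python
# Hypothetical sketch of the student pose system's inference flow:
# single-person images go through the lightweight SHNet branch,
# multi-person images through the lightweight HigherHRNet branch,
# and each pose is labeled with one of the eight student pose classes.
# All callables are stand-ins, not the thesis's actual components.

def estimate_and_recognize(image, detect_people, shnet_keypoints,
                           higherhrnet_keypoints, classify_pose):
    people = detect_people(image)
    if len(people) <= 1:
        poses = [shnet_keypoints(image)]        # single-person branch (SHNet)
    else:
        poses = higherhrnet_keypoints(image)    # multi-person branch (HigherHRNet)
    # MobileNetV2-style classifier assigns a pose label to each person
    return [(kp, classify_pose(kp)) for kp in poses]

# Demo with trivial stubs: two detected people route to the multi-person branch.
result = estimate_and_recognize(
    image="frame.jpg",
    detect_people=lambda img: ["a", "b"],
    shnet_keypoints=lambda img: [(0, 0)],
    higherhrnet_keypoints=lambda img: [[(0, 0)], [(1, 1)]],
    classify_pose=lambda kp: "standing",
)
print(len(result))  # 2
```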
﹀
|
References: |
︿
[1] 都文龙. 基于多尺度级联沙漏网络的人体姿势估计[D]. 杭州: 杭州电子科技大学, 2019.
[2] 赵俊男, 佘青山, 穆高原, 吴秋轩, 席旭刚. 基于MobileNetV3与ST-SRU的危险驾驶姿态识别[J]. 控制与决策, 2022, 37(05): 1320-1328.
[3] 周凯烨. 基于深度学习的健身动作识别系统[J]. 工业控制计算机, 2021, 34(06): 37-39.
[4] Tian S, Yang W, Le Grange J M, et al. Smart healthcare: making medical care more intelligent[J]. Global Health Journal, 2019, 3(3): 62-65.
[5] 杨弘. 基于改进深度神经网络的人体姿态估计方法研究[D]. 秦皇岛: 燕山大学, 2020.
[6] Toshev A, Szegedy C. Deeppose: Human pose estimation via deep neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2014: 1653-1660.
[7] Carreira J, Agrawal P, Fragkiadaki K, et al. Human pose estimation with iterative error feedback[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 4733-4742.
[8] Tompson J J, Jain A, LeCun Y, et al. Joint training of a convolutional network and a graphical model for human pose estimation[J]. Advances in Neural Information Processing Systems, 2014, 2: 1799-1807.
[9] Tompson J, Goroshin R, Jain A, et al. Efficient object localization using convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 648-656.
[10] Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2016: 483-499.
[11] Wei S E, Ramakrishna V, Kanade T, et al. Convolutional pose machines[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 4724-4732.
[12] Xiao B, Wu H, Wei Y. Simple baselines for human pose estimation and tracking[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 466-481.
[13] Sun K, Xiao B, Liu D, et al. Deep high-resolution representation learning for human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 5693-5703.
[14] Wang J, Long X, Gao Y, et al. Graph-pcnn: Two stage human pose estimation with graph pose refinement[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 492-508.
[15] Chen Y, Wang Z, Peng Y, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7103-7112.
[16] Pishchulin L, Insafutdinov E, Tang S, et al. Deepcut: Joint subset partition and labeling for multi person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 4929-4937.
[17] Insafutdinov E, Pishchulin L, Andres B, et al. Deepercut: A deeper, stronger, and faster multi-person pose estimation model[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2016: 34-50.
[18] Cao Z, Simon T, Wei S E, et al. Realtime multi-person 2d pose estimation using part affinity fields[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 7291-7299.
[19] Kreiss S, Bertoni L, Alahi A. Pifpaf: Composite fields for human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 11977-11986.
[20] Cheng B, Xiao B, Wang J, et al. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 5386-5395.
[21] Ke L, Chang M C, Qi H, et al. Multi-scale structure-aware network for human pose estimation[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 713-728.
[22] Peng X, Tang Z, Yang F, et al. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 2226-2234.
[23] Hou L, Cao J, Zhao Y, et al. Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation[C]//Proceedings of the 25th International Conference on Pattern Recognition (ICPR). Piscataway: IEEE, 2021: 9658-9665.
[24] Chu X, Yang W, Ouyang W, et al. Multi-context attention for human pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1831-1840.
[25] Liu W, Chen J, Li C, et al. A cascaded inception of inception network with attention modulated feature fusion for human pose estimation[C]//Thirty-Second AAAI Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press, 2018: 7170-7177.
[26] Su K, Yu D, Xu Z, et al. Multi-person pose estimation with enhanced channel-wise and spatial information[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 5674-5682.
[27] Zhang F, Zhu X, Ye M. Fast human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3517-3526.
[28] Sun B, Zhao M. Simple Light-Weight Network for Human Pose Estimation[C]//Pacific Rim International Conference on Artificial Intelligence. Cham: Springer, 2021: 279-292.
[29] Zhong F, Li M, Zhang K, et al. DSPNet: A low computational-cost network for human pose estimation[J]. Neurocomputing, 2021, 423: 327-335.
[30] Gao B, Ma K, Bi H, et al. A lightweight network based on pyramid residual module for human pose estimation[J]. Pattern Recognition and Image Analysis, 2019, 29(4): 668-675.
[31] Zhang W, Fang J, Wang X, et al. Efficientpose: Efficient human pose estimation with neural architecture search[J]. Computational Visual Media, 2021, 7(3): 335-347.
[32] Sun X, Xiao B, Wei F, et al. Integral human pose regression[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 529-545.
[33] Nibali A, He Z, Morgan S, et al. Numerical coordinate regression with convolutional neural networks[J]. arXiv preprint arXiv:1801.07372, 2018.
[34] Guo M H, Xu T X, Liu J J, et al. Attention mechanisms in computer vision: A survey[J]. Computational Visual Media, 2022, 8(3): 331-368.
[35] Woo S, Park J, Lee J Y, et al. Cbam: Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 3-19.
[36] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141.
[37] Gholami A, Kim S, Dong Z, et al. A survey of quantization methods for efficient neural network inference[J]. arXiv preprint arXiv:2103.13630, 2021.
[38] Xu S, Huang A, Chen L, et al. Convolutional neural network pruning: A survey[C]//Proceedings of the 39th Chinese Control Conference. Piscataway: IEEE, 2020: 7458-7463.
[39] Gou J, Yu B, Maybank S J, et al. Knowledge distillation: A survey[J]. International Journal of Computer Vision, 2021, 129(6): 1789-1819.
[40] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 2(25): 1097-1105.
[41] Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1492-1500.
[42] Han K, Wang Y, Tian Q, et al. Ghostnet: More features from cheap operations[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1580-1589.
[43] Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects in context[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[44] Papandreou G, Zhu T, Chen L C, et al. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 269-286.
[45] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[46] Liu H, Liu F, Fan X, et al. Polarized self-attention: Towards high-quality pixel-wise regression[J]. arXiv preprint arXiv:2107.00782, 2021.
[47] Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[48] Andriluka M, Pishchulin L, Gehler P, et al. 2d human pose estimation: New benchmark and state of the art analysis[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2014: 3686-3693.
[49] Insafutdinov E, Pishchulin L, Andres B, et al. Deepercut: A deeper, stronger, and faster multi-person pose estimation model[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2016: 34-50.
[50] Sun K, Lan C, Xing J, et al. Human pose estimation using global and local normalization[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5599-5607.
[51] Tang Z, Peng X, Geng S, et al. Quantized densely connected u-nets for efficient landmark localization[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 339-354.
[52] Ning G, Zhang Z, He Z. Knowledge-guided deep fractal neural networks for human pose estimation[J]. IEEE Transactions on Multimedia, 2017, 20(5): 1246-1259.
[53] Chu X, Yang W, Ouyang W, et al. Multi-context attention for human pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1831-1840.
[54] Yang W, Li S, Ouyang W, et al. Learning feature pyramids for human pose estimation[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 1281-1290.
[55] Ke L, Chang M C, Qi H, et al. Multi-scale structure-aware network for human pose estimation[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 713-728.
[56] Tang W, Yu P, Wu Y. Deeply learned compositional models for human pose estimation[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 190-206.
[57] Roy A G, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2018: 421-429.
[58] Howard A, Sandler M, Chu G, et al. Searching for mobilenetv3[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 1314-1324.
[59] Newell A, Huang Z, Deng J. Associative embedding: End-to-end learning for joint detection and grouping[J]. Advances in Neural Information Processing Systems, 2017, 30: 2278-2288.
[60] Sandler M, Howard A, Zhu M, et al. Mobilenetv2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4510-4520.
[61] Ma N, Zhang X, Zheng H T, et al. Shufflenet v2: Practical guidelines for efficient cnn architecture design[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 116-131.
[62] Zhang X, Zhou X, Lin M, et al. Shufflenet: An extremely efficient convolutional neural network for mobile devices[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6848-6856.
﹀
|
CLC Number: |
TP391.4
|
Open Access Date: |
2022-06-22
|