Thesis Information

Thesis title (Chinese):

 基于3D卷积神经网络的手语识别

Name:

 郭洋洋 (Guo Yangyang)

Student ID:

 20207223061

Confidentiality level:

 Public

Language:

 Chinese (chi)

Discipline code:

 085400

Discipline name:

 Engineering - Electronic Information

Student type:

 Master's

Degree level:

 Master of Engineering

Degree year:

 2023

Institution:

 Xi'an University of Science and Technology

School:

 College of Communication and Information Engineering

Major:

 Electronics and Communication Engineering

Research direction:

 Digital Image Processing

First supervisor:

 吴冬梅 (Wu Dongmei)

First supervisor's institution:

 Xi'an University of Science and Technology

Submission date:

 2023-06-15

Defense date:

 2023-06-02

Thesis title (English):

 Sign Language Recognition Based on 3D Convolutional Neural Network

Keywords (Chinese):

 手语识别; 深度学习; 视频分类; C3D; 注意力机制; 时空特征

Keywords (English):

 Sign language recognition; Deep learning; Video classification; C3D; Attention mechanism; Spatio-temporal features

Abstract (Chinese):

With advances in technology, intelligent self-service facilities have gradually matured in airports, railway stations, and other public places, but dedicated facilities for deaf people remain scarce, leaving them with serious communication barriers when they travel. Traditional sign language recognition suffers from several problems: it covers few categories, recognizing only letter gestures; it operates on single frames, ignoring the continuity of sign language; and it imposes strict background requirements, so it cannot cope with scenes with complex backgrounds. Advances in computing and deep learning now make it possible to integrate an everyday sign language translation system into mobile smart devices.

The classic C3D model contains 8 convolutional layers, 5 pooling layers, and 2 fully connected layers, and is mainly used to classify videos with large motions and concentrated key frames. Using C3D as the base model for sign language recognition has the advantage of extracting joint spatio-temporal features and strengthening the correspondence between spatial and temporal features; its drawback is that sign language videos have scattered key frames and subtle motions, so recognition accuracy is low. A series of improvements to the C3D model is therefore needed to raise its compatibility with sign language video and its recognition accuracy.

To address the shortcomings of applying C3D to sign language recognition, an improved model, C3D-cslr, is proposed, which improves the model's fit to sign language datasets and its recognition accuracy. First, to handle small motions and scattered key frames, the number of original feature maps is expanded at the network's input, strengthening the network's ability to capture global key frames of the input video. Second, to address weak spatial features, a spatial attention mechanism is added to the backbone feature extraction network, reinforcing its ability to extract spatial features. Finally, to address the large gap between positive and negative samples, the loss function borrows the idea of knowledge distillation, introducing a temperature constant T to optimize the softmax so that the network focuses more on hard samples and generalizes better. To obtain the best model, the loss (Loss) and accuracy (Acc) are used as the main evaluation metrics to assess the improved model and to fuse the improvements pairwise.
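The abstract does not spell out the form of the spatial attention mechanism. A common formulation (CBAM-style) pools the feature map along the channel axis with both average and max pooling, then passes the result through a learned transform and a sigmoid to get a per-position weight. Below is a minimal pure-Python sketch of that idea; the per-pixel weighted sum (with hypothetical weights `w_avg`, `w_max`) stands in for CBAM's learned 7×7 convolution and is not taken from the thesis.

```python
import math

def spatial_attention(fmap, w_avg=0.5, w_max=0.5, bias=0.0):
    # fmap: C x H x W feature map as nested lists.
    # Channel-wise average- and max-pooling give two H x W maps; a sigmoid
    # of their weighted sum (a stand-in for CBAM's learned 7x7 conv) yields
    # the attention map, which rescales every channel at each position.
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    att = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            vals = [fmap[c][i][j] for c in range(C)]
            avg, mx = sum(vals) / C, max(vals)
            att[i][j] = 1.0 / (1.0 + math.exp(-(w_avg * avg + w_max * mx + bias)))
    refined = [[[fmap[c][i][j] * att[i][j] for j in range(W)]
                for i in range(H)] for c in range(C)]
    return refined, att

# Example: a tiny 2-channel 2x2 feature map; att values lie in (0, 1)
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 2.0], [1.0, 0.0]]]
refined, att = spatial_attention(fmap)
```

Positions where the pooled responses are strong get weights near 1 and pass through almost unchanged, while weak positions are suppressed, which is how the mechanism highlights the hands against the background.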

Across the experiments, the improved C3D-cslr model's loss fell from 1.76 to 0.63 and its recognition accuracy rose from 56.67% to 88.6%. To bring the model closer to real-world use, the final part of the experiments performed simple continuous-sentence recognition on personally recorded sign language videos, with good results, verifying the feasibility and effectiveness of the improved algorithm and providing a reference for future practical application.

Abstract (English):

With the progress of science and technology, intelligent self-service facilities have gradually improved in airports, stations, and other public places, but dedicated facilities for deaf people are still not widespread, leading to serious communication barriers when they travel. Traditional sign language recognition technology has several problems: it recognizes only a small set of categories, limited to letter gestures; it works on single frames, ignoring the coherence of sign language; and it places strict requirements on the background, so it cannot adapt to complex scenes. With the development of computer technology and deep learning, it has become possible to integrate a daily sign language translation system into mobile smart devices.

The classic C3D model contains 8 convolutional layers, 5 pooling layers, and 2 fully connected layers, and mainly classifies videos with large motion amplitudes and concentrated key frames. The advantage of using C3D as the base model for sign language recognition is that it can extract joint spatio-temporal features and strengthen the correspondence between spatial and temporal features. The disadvantage is that key frames in sign language video are scattered and the movements are small, resulting in low recognition accuracy. A series of improvements to the C3D model is therefore needed to improve its compatibility with sign language video and its recognition accuracy.
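The layer stack described above can be made concrete by tracing tensor shapes through the published C3D configuration (Tran et al.). The channel counts and pooling strides below follow the standard C3D; the 16-frame 112×112 RGB input is C3D's usual setting, assumed here rather than stated in this record.

```python
# Shape trace of the classic C3D stack (8 conv, 5 pool, 2 FC layers).

def conv3d_shape(shape, out_ch):
    # All C3D convolutions use 3x3x3 kernels, stride 1, padding 1,
    # so only the channel count changes.
    _, t, h, w = shape
    return (out_ch, t, h, w)

def pool3d_shape(shape, kernel, pad=(0, 0, 0)):
    # Max pooling with stride equal to the kernel size, as in C3D.
    c, t, h, w = shape
    (kt, kh, kw), (pt, ph, pw) = kernel, pad
    return (c,
            (t + 2 * pt - kt) // kt + 1,
            (h + 2 * ph - kh) // kh + 1,
            (w + 2 * pw - kw) // kw + 1)

shape = (3, 16, 112, 112)                          # RGB x frames x H x W
shape = conv3d_shape(shape, 64)                    # conv1a
shape = pool3d_shape(shape, (1, 2, 2))             # pool1: no temporal pooling
shape = conv3d_shape(shape, 128)                   # conv2a
shape = pool3d_shape(shape, (2, 2, 2))             # pool2
shape = conv3d_shape(shape, 256)                   # conv3a
shape = conv3d_shape(shape, 256)                   # conv3b
shape = pool3d_shape(shape, (2, 2, 2))             # pool3
shape = conv3d_shape(shape, 512)                   # conv4a
shape = conv3d_shape(shape, 512)                   # conv4b
shape = pool3d_shape(shape, (2, 2, 2))             # pool4
shape = conv3d_shape(shape, 512)                   # conv5a
shape = conv3d_shape(shape, 512)                   # conv5b
shape = pool3d_shape(shape, (2, 2, 2), (0, 1, 1))  # pool5 (spatial padding;
                                                   # some impls use ceil mode)
c, t, h, w = shape                                 # (512, 1, 4, 4)
flat = c * t * h * w                               # 8192 -> fc6 -> fc7 -> softmax
```

Note that pool1 keeps the temporal dimension intact so early layers still see all 16 frames; the two 4096-dimensional fully connected layers (fc6, fc7) then feed a softmax classifier.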

To address the defects of applying the C3D model to sign language recognition, an improved model, C3D-cslr, is proposed, which improves the match between the model and the sign language dataset as well as recognition accuracy. First, in view of problems such as small motion amplitude and scattered key frames, the number of original feature maps is expanded at the input of the network, enhancing the network's ability to capture global key frames of the input video. Second, to address the lack of prominent spatial features, a spatial attention mechanism is added to the backbone feature extraction network to strengthen its spatial feature extraction. Finally, to address the large gap between positive and negative samples, the loss function draws on the idea of knowledge distillation, introducing a temperature constant T to optimize the softmax so that the network pays more attention to hard samples and generalizes better. To obtain the best model, the loss (Loss) and accuracy (Acc) were taken as the main evaluation metrics to evaluate the improved model and to fuse the improvements pairwise.
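The abstract does not give the exact formula for the temperature-modified softmax; the standard form from knowledge distillation (Hinton et al.) is p_i = exp(z_i / T) / Σ_j exp(z_j / T). A minimal sketch under that assumption:

```python
import math

def softmax_T(logits, T=1.0):
    # Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T).
    # Subtracting the max logit first is for numerical stability only;
    # the constant factor cancels in the ratio.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [3.0, 1.0, 0.2]
p1 = softmax_T(logits, T=1.0)   # sharp distribution, dominated by the top class
p4 = softmax_T(logits, T=4.0)   # softer: probability mass spreads out
```

Raising T flattens the output distribution, so confident (easy) predictions contribute less sharply and hard samples carry more weight in the loss; how exactly T enters the thesis's loss function is not specified in this record.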

Based on the experimental data, the loss of the improved C3D-cslr model decreased from 1.76 to 0.63, and the recognition accuracy increased from 56.67% to 88.6%. To move the model toward real-life use, the last part of the experiments performed simple continuous-sentence recognition on sign language videos recorded by the author, with good results, verifying the feasibility and effectiveness of the improved algorithm and providing a reference for future practical application.


CLC number:

 TP391.4

Release date:

 2023-06-16
