Thesis title (Chinese): | Sign Language Recognition Based on 3D Convolutional Neural Networks |
Name: | |
Student ID: | 20207223061 |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 085400 |
Discipline name: | Engineering - Electronic Information |
Student type: | Master's |
Degree level: | Master of Engineering |
Degree year: | 2023 |
Degree-granting institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research direction: | Digital Image Processing |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2023-06-15 |
Defense date: | 2023-06-02 |
Thesis title (English): | Sign Language Recognition Based on 3D Convolutional Neural Networks |
Keywords (Chinese): | |
Keywords (English): | Sign language recognition; Deep learning; Video classification; C3D; Attention mechanism; Spatio-temporal features |
Abstract (Chinese, translated): |
With advances in science and technology, intelligent self-service facilities in airports, stations, and other public places have steadily improved, yet dedicated facilities for the deaf community remain scarce, leaving them with serious communication barriers when traveling. Traditional sign language recognition technology suffers from three main problems: it covers few categories, recognizing only letter gestures; it works on single frames, ignoring the continuity of signing; and it places strict demands on the background, so it cannot cope with complex scenes. With the development of computer technology and deep learning, it has become possible to integrate everyday sign language translation systems into mobile smart devices.

The classic C3D model contains 8 convolutional layers, 5 pooling layers, and 2 fully connected layers, and is designed mainly for classifying videos with large motion amplitudes and concentrated key frames. Using C3D as the base model for sign language recognition has the advantage of extracting joint spatio-temporal features and strengthening their correspondence; the disadvantage is that key frames in sign language video are scattered and the movements are subtle, resulting in low recognition accuracy. A series of improvements to the C3D model is therefore needed to raise its compatibility with sign language video and its recognition accuracy.

To address the defects of applying C3D to sign language recognition, an improved model, C3D-cslr, is proposed, which improves the match between the model and the sign language dataset as well as recognition accuracy. First, to handle small motion amplitudes and scattered key frames, the number of original feature maps is expanded at the network input, enhancing the network's ability to capture global key frames of the input video. Second, to address weak spatial features, a spatial attention mechanism is added to the backbone feature extraction network, strengthening its spatial feature extraction. Finally, to address the large gap between positive and negative samples, the loss function borrows the idea of knowledge distillation: a temperature constant T is introduced to optimize the softmax, making the network pay more attention to hard samples and improving generalization. To obtain the best model, the loss (Loss) and accuracy (Acc) are used as the main evaluation metrics, and the improvements are evaluated and fused pairwise.

Across the experiments, the loss of the improved C3D-cslr model drops from 1.76 to 0.63, and recognition accuracy rises from 56.67% to 88.6%. To move the model toward real-life use, the final experiments perform simple continuous-sentence recognition on self-recorded sign language videos, with good results, verifying the feasibility and effectiveness of the improved algorithm and providing a reference for future applications. |
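The temperature-scaled softmax mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the general knowledge-distillation idea, not the thesis's implementation; the logit values are hypothetical:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax with a temperature constant T (knowledge-distillation style).

    T > 1 flattens the output distribution, so the loss computed from it
    places relatively more weight on hard (low-confidence) classes.
    """
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [4.0, 1.0, 0.2]      # hypothetical class scores
p_sharp = softmax_with_temperature(logits, T=1.0)
p_soft = softmax_with_temperature(logits, T=4.0)
# p_soft is flatter than p_sharp: its maximum probability is smaller
```

Raising T spreads probability mass away from the dominant class, which is what lets the training signal emphasize difficult samples.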
Abstract (English): |
With the progress of science and technology, intelligent self-service facilities have gradually improved in airports, stations, and other public places, but dedicated facilities for deaf and hard-of-hearing people are still not widespread, leading to serious communication barriers when they travel. Traditional sign language recognition technology recognizes only a small set of categories (typically letter gestures), works on single frames without considering the continuity of signing, and places strict demands on the background, so it cannot adapt to complex scenes. With the development of computer technology and deep learning, it has become possible to integrate everyday sign language translation systems into mobile smart devices.

The classic C3D model contains 8 convolutional layers, 5 pooling layers, and 2 fully connected layers, and mainly classifies videos with large motion amplitudes and concentrated key frames. Using C3D as the base model for sign language recognition has the advantage of extracting joint spatio-temporal features and strengthening their correspondence; the disadvantage is that key frames in sign language video are scattered and the movements are small, resulting in low recognition accuracy. A series of improvements to the C3D model is therefore needed to improve its compatibility with sign language video and its recognition accuracy.

To address the defects of applying C3D to sign language recognition, an improved model, C3D-cslr, is proposed, which improves the match between the model and the sign language dataset and raises recognition accuracy. First, to handle small motion amplitudes and scattered key frames, the number of original feature maps is expanded at the network input, enhancing the network's ability to capture global key frames of the input video. Second, to address weak spatial features, a spatial attention mechanism is added to the backbone feature extraction network, strengthening its spatial feature extraction. Finally, to address the large gap between positive and negative samples, the loss function borrows the idea of knowledge distillation: a temperature constant T is introduced to optimize the softmax, making the network pay more attention to hard samples and improving generalization. To obtain the best model, the loss (Loss) and accuracy (Acc) were taken as the main evaluation metrics; the improvements were evaluated and fused pairwise.

Based on the experimental data, the loss of the improved C3D-cslr model decreased from 1.76 to 0.63, and recognition accuracy increased from 56.67% to 88.6%. To move the model toward real-life use, the final experiments performed simple continuous-sentence recognition on self-recorded sign language videos, with good results, verifying the feasibility and effectiveness of the improved algorithm and providing a reference for future applications. |
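The spatial attention mechanism described in the abstract can be sketched as below. This is a simplified CBAM-style illustration under assumed shapes, not the thesis's module: a real module would learn a small convolution over the pooled maps, whereas here they are simply averaged before the sigmoid.

```python
import numpy as np

def spatial_attention(feature_map):
    """Simplified spatial attention over a (C, H, W) feature map.

    Pools across the channel axis (average and max) to get two (H, W)
    maps, combines them, squashes the result through a sigmoid to obtain
    per-pixel weights in (0, 1), and rescales the input so that salient
    spatial positions are emphasized.
    """
    avg_pool = feature_map.mean(axis=0)                          # (H, W)
    max_pool = feature_map.max(axis=0)                           # (H, W)
    attn = 1.0 / (1.0 + np.exp(-(avg_pool + max_pool) / 2.0))    # sigmoid
    return feature_map * attn[None, :, :]   # broadcast over channels

x = np.random.rand(64, 28, 28)   # hypothetical C3D backbone feature map
y = spatial_attention(x)         # same shape, spatially re-weighted
```

In the thesis's setting the module would sit inside the backbone, re-weighting each frame's feature map before the next 3D convolution.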
CLC number: | TP391.4 |
Open access date: | 2023-06-16 |