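Thesis title (Chinese): | 基于3D卷积神经网络的学生课堂行为识别 (Student Classroom Behavior Recognition Based on 3D Convolutional Neural Networks) |
Name: | |
Student ID: | 21207223092 |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 085400 |
Discipline name: | Engineering - Electronic Information |
Student type: | Master's |
Degree level: | Master of Engineering |
Degree year: | 2024 |
Degree-granting institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research area: | Computer vision |
First supervisor's name: | |
First supervisor's affiliation: | |
Thesis submission date: | 2024-06-13 |
Thesis defense date: | 2024-06-05 |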
Thesis title (foreign language): | Classroom Behavior Recognition of Students Based on 3D Convolutional Neural Networks |
Keywords (Chinese): | |
Keywords (foreign language): | Behavior recognition ; X3D network ; 3D heat map volume ; BERT layer ; Multimodal fusion |
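Abstract (translated from Chinese): |
Classroom teaching is the core of educational activity and is crucial to students' learning outcomes and development. In a traditional classroom, teachers assess teaching effectiveness mainly by observing student behavior, which makes it hard to keep track of every student's state in class. Deep learning-based student classroom behavior recognition overcomes this limitation and can capture students' learning states in class comprehensively and accurately. This thesis therefore improves a human behavior recognition model based on the Expand 3D Convolutional Network (X3D) and studies student classroom behavior recognition. The specific research content is as follows:
(1) Existing convolutional neural network-based human behavior recognition methods lose information when they convert skeleton data into pseudo-images or cuboids. To address this, the thesis proposes X3D-BERT, an improved human behavior recognition algorithm built on the X3D network model. First, the algorithm introduces 3D heatmap volumes to better preserve the spatial and joint-position information in the skeleton data. Second, a Bidirectional Encoder Representations from Transformers (BERT) layer is added at the end of the network to strengthen the extraction of temporal features, which improves recognition accuracy. Finally, the model is evaluated on public datasets and on a self-built student classroom behavior dataset. The improved X3D-BERT model reaches average accuracies of 93.7% and 97.3% under the cross-subject and cross-view protocols of the public NTU RGB+D 60 dataset, 88.7% and 89.4% under the cross-subject and cross-setup protocols of the public NTU RGB+D 120 dataset, and 93.3% on the self-built student classroom behavior dataset.
(2) Skeleton data alone lacks appearance information. To compensate, the thesis proposes X3DCAS-BERT, an early + late multimodal fusion algorithm built on the X3D-BERT model. First, fusing the RGB video modality enriches the appearance features the network extracts. In addition, CA attention is introduced to strengthen the network's ability to capture the fine details of small targets, which further improves recognition accuracy. The improved X3DCAS-BERT model reaches average accuracies of 96.7% and 99.9% under the cross-subject and cross-view protocols of NTU RGB+D 60, 94.6% and 95.4% under the cross-subject and cross-setup protocols of NTU RGB+D 120, and 96.5% on the self-built student classroom behavior dataset.
(3) Building on the X3DCAS-BERT model, the thesis combines the object detection algorithm Yolov7-tiny, the object tracking algorithm ByteTrack, and the pose estimation algorithm HRNet to recognize the classroom behavior of multiple students accurately, and thus to better reflect what actually happens in class. Experiments on videos containing the classroom behavior of multiple students show that the proposed algorithm recognizes multi-person behavior in classroom scenes well under different conditions. |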
Abstract (foreign language): |
Classroom teaching, as the core component of educational activities, plays a crucial role in students' learning effectiveness and development. In traditional classrooms, teachers assess teaching effectiveness mainly by observing students' behavior, which makes it difficult to track the learning state of each student. Deep learning-based student classroom behavior recognition overcomes this limitation, enabling comprehensive and accurate capture of students' learning states in the classroom. This thesis therefore improves a human behavior recognition model based on the Expand 3D Convolutional Network (X3D) and applies it to student classroom behavior recognition. The specific research content is as follows:
(1) Existing convolutional neural network-based human behavior recognition methods lose information when converting skeleton data into pseudo-images or cuboids. To address this problem, this thesis proposes X3D-BERT, an improved human behavior recognition algorithm based on the X3D network model. First, the algorithm introduces 3D heatmap volumes to better preserve spatial and joint position information in skeleton data. Second, a Bidirectional Encoder Representations from Transformers (BERT) layer is added at the end of the network to enhance the extraction of temporal features, thereby improving recognition accuracy. Finally, experiments are conducted on public datasets and on a self-built student classroom behavior dataset. The improved X3D-BERT model achieves average accuracies of 93.7% and 97.3% under the cross-subject and cross-view protocols of the public NTU RGB+D 60 dataset, and 88.7% and 89.4% under the cross-subject and cross-setup protocols of the public NTU RGB+D 120 dataset. On the self-built student classroom behavior dataset, the average accuracy reaches 93.3%.
(2) To address the lack of appearance information in skeleton-only data, this thesis proposes X3DCAS-BERT, an early + late multimodal fusion algorithm based on the X3D-BERT model. First, integrating the RGB video modality enriches the appearance features the network extracts. In addition, CA attention is introduced to enhance the network's ability to capture fine details of small targets, further improving recognition accuracy. The improved X3DCAS-BERT model achieves average accuracies of 96.7% and 99.9% under the cross-subject and cross-view protocols of the NTU RGB+D 60 dataset, and 94.6% and 95.4% under the cross-subject and cross-setup protocols of the NTU RGB+D 120 dataset. On the self-built student classroom behavior dataset, the average accuracy reaches 96.5%.
(3) Building on the X3DCAS-BERT model, this thesis integrates the object detection algorithm Yolov7-tiny, the object tracking algorithm ByteTrack, and the pose estimation algorithm HRNet to recognize multi-person student classroom behavior accurately, thereby better reflecting the real situation in the classroom. Experiments are conducted on videos containing multi-person student classroom behavior. The results show that the proposed algorithm recognizes multi-person behavior in classroom scenes well under different conditions. |
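The abstract does not say how the 3D heatmap volume is built. As a rough illustration only, the sketch below assumes one common construction: each joint in each frame is rendered as a 2D Gaussian heatmap, and the heatmaps are stacked over time into a K x T x H x W volume. The function names (`joint_heatmap`, `heatmap_volume`), the `sigma` parameter, and the toy coordinates are hypothetical and are not taken from the thesis.

```python
import math

def joint_heatmap(x, y, conf, height, width, sigma=1.0):
    """Return a height x width Gaussian heatmap centred on (x, y), scaled by conf."""
    return [
        [conf * math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]

def heatmap_volume(frames, height, width, sigma=1.0):
    """Stack per-joint heatmaps over time (assumed layout, not the thesis's).

    frames: list over T frames; each frame is a list of K (x, y, conf) joints.
    Returns a nested list indexed as volume[k][t][row][col] (K x T x H x W).
    """
    num_joints = len(frames[0])
    return [
        [joint_heatmap(*frame[k], height, width, sigma) for frame in frames]
        for k in range(num_joints)
    ]

# Toy input: 2 frames, 2 joints, on an 8 x 8 grid (made-up coordinates).
frames = [
    [(2, 3, 1.0), (5, 5, 0.8)],
    [(3, 3, 1.0), (5, 6, 0.9)],
]
volume = heatmap_volume(frames, height=8, width=8)
```

In this layout every joint keeps its own channel and its pixel position in every frame, which is the kind of spatial and joint-position information the abstract says is lost when skeletons are flattened into pseudo-images.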
Chinese Library Classification number: | TP391.4 |
Public release date: | 2024-06-13 |