Thesis Information

Chinese title:

 基于深度学习的视频文本描述研究及煤矿应用    

Name:

 马钰 (Ma Yu)

Student ID:

 18208052011    

Confidentiality level:

 Public

Language:

 Chinese (chi)

Discipline code:

 081203    

Discipline:

 Engineering - Computer Science and Technology (degrees may be conferred in engineering or science) - Computer Application Technology

Student type:

 Master's

Degree level:

 Master of Engineering

Degree year:

 2021    

Institution:

 Xi'an University of Science and Technology

School:

 College of Computer Science and Technology

Major:

 Computer Application Technology

Research direction:

 Computer graphics and image processing

First supervisor:

 付燕 (Fu Yan)

First supervisor's institution:

 Xi'an University of Science and Technology

Submission date:

 2021-06-21    

Defense date:

 2021-06-03    

English title:

 Research on video captioning based on deep learning and its application in coal mine    

Chinese keywords:

 视频文本描述 ; 深度学习 ; 注意力机制 ; 煤矿场景 ; BERT模型    

English keywords:

 Video captioning ; Deep learning ; Attention mechanism ; Coal mine scene ; BERT model    

Chinese abstract:

Video captioning is a challenging task that spans computer vision and natural language processing; its main goal is to convert visual content into accurate and concise textual descriptions. It has broad application prospects in many fields and has attracted growing attention in the coal mining industry in particular: applying video captioning underground reduces the difficulty and time of retrieving coal mine video, which is of great significance for research on intelligent underground surveillance video. Because of the large gap between the low-level visual features of video and its high-level semantics, this thesis combines video feature extraction with visual text detection to improve deep learning based video captioning. The main research contents are as follows:
(1) In previous encoder-decoder models, video features of any length are encoded into a fixed-length representation, so caption quality degrades as the input video features grow longer. Introducing an attention mechanism effectively improves a captioning model's performance on the encoder-decoder task by allowing the encoder to assign higher weights to the key parts of the video. This thesis therefore proposes a video captioning model based on an attention-augmented 3D residual network. First, in the encoding stage, an attention mechanism is introduced into the 3D residual module: one-dimensional channel attention and two-dimensional spatial attention enhance the video feature maps and reduce the influence of irrelevant targets and noise. Second, the GloVe model is used to vectorize the caption text, strengthening the correlations between words. Finally, in the decoding stage, the temporal modeling of a two-layer LSTM network is used to output a textual description of the video's high-level semantics. Experiments on the two public datasets MSVD and MSR-VTT show that the model describes the high-level semantic information of videos in natural language more accurately.
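Below is a minimal PyTorch sketch of the kind of attention-augmented 3D residual block described in (1): one-dimensional channel attention followed by two-dimensional spatial attention on the residual branch (CBAM-style). All class names, layer sizes, and the exact placement of the attention operations are illustrative assumptions, not the thesis implementation.

import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    # 1-D channel attention: pool over (T, H, W), then reweight each channel.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                        # x: (N, C, T, H, W)
        avg = self.mlp(x.mean(dim=(2, 3, 4)))    # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3, 4)))     # global max pooling
        w = torch.sigmoid(avg + mx).view(x.size(0), -1, 1, 1, 1)
        return x * w

class SpatialAttention3D(nn.Module):
    # 2-D spatial attention: reweight each spatial location per frame.
    def __init__(self, kernel_size=7):
        super().__init__()
        pad = (0, kernel_size // 2, kernel_size // 2)
        self.conv = nn.Conv3d(2, 1, (1, kernel_size, kernel_size), padding=pad)

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class AttentiveResBlock3D(nn.Module):
    # 3D residual block with channel then spatial attention on the residual.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels),
        )
        self.ca = ChannelAttention3D(channels)
        self.sa = SpatialAttention3D()

    def forward(self, x):
        out = self.sa(self.ca(self.body(x)))
        return torch.relu(out + x)               # residual connection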
(2) Most video captioning algorithms describe the details of targets in a video insufficiently and easily overlook the video's latent text features. To address this, a video captioning method based on visual text and residual connections is proposed. First, a BERT model is used to detect the visual text in the video; second, this visual text is fused with the output of the first-layer GRU network and fed into the second-layer GRU network; finally, to obtain a tighter mapping between the video and its textual description, a residual connection structure is built at each GRU layer. Experimental results show that the model can describe the detailed information in a video and improves the quality of the generated captions.
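As a rough illustration of (2), the sketch below builds a two-layer GRU decoder in PyTorch that fuses an assumed visual-text embedding with the first layer's output before the second layer, with a residual connection around each layer. The dimensions, the fusion operator (concatenation plus a linear projection), and every name here are assumptions rather than the thesis code.

import torch
import torch.nn as nn

class ResidualGRUDecoder(nn.Module):
    # Two GRU layers; the visual-text embedding is fused between them and a
    # residual connection is added around each layer.
    def __init__(self, feat_dim, text_dim, hidden, vocab_size):
        super().__init__()
        self.gru1 = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj1 = nn.Linear(feat_dim, hidden)   # match dims for residual 1
        self.fuse = nn.Linear(hidden + text_dim, hidden)
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, video_feats, visual_text):
        # video_feats: (N, T, feat_dim); visual_text: (N, T, text_dim)
        h1, _ = self.gru1(video_feats)
        h1 = h1 + self.proj1(video_feats)          # residual over layer 1
        fused = self.fuse(torch.cat([h1, visual_text], dim=-1))
        h2, _ = self.gru2(fused)
        h2 = h2 + fused                            # residual over layer 2
        return self.out(h2)                        # per-step vocabulary logits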
(3) The video captioning techniques proposed in this thesis are applied to underground coal mines. First, underground surveillance video is preprocessed to build a coal mine captioning dataset, which is then used to train the model. Second, because underground surveillance video usually carries the time and place of an event, the subtitles extracted from the video are merged into the description generated by the GRU language model to make it more specific, yielding textual descriptions of coal mine surveillance video. The experimental results show that the proposed model performs well on the coal mine captioning dataset.
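The subtitle-fusion step in (3) could be as simple as the following post-processing sketch, which splices the time and place recovered from a surveillance subtitle into the generated caption. The subtitle format is an assumed example, not the actual coal mine data format.

import re

def enrich_caption(caption: str, subtitle: str) -> str:
    # Parse "YYYY-MM-DD hh:mm:ss <place>" from the subtitle and prefix the
    # generated caption with it; return the caption unchanged on no match.
    m = re.match(r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<place>.+)",
                 subtitle)
    if not m:
        return caption
    return f"At {m['time']}, {m['place']}: {caption}"

print(enrich_caption("a miner is operating the conveyor belt",
                     "2021-03-15 08:42:10 main haulage roadway"))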
 

English abstract:

Video captioning is a challenging task that covers computer vision and natural language processing; its main goal is to convert visual content into accurate and concise textual descriptions. Video captioning has broad application prospects in many fields and has attracted growing attention in the coal mining industry in particular: applying video captioning technology to coal mines reduces the difficulty and time of retrieving coal mine video, which is of great significance for research on intelligent underground surveillance video. Because of the large gap between the low-level visual features of video and its high-level semantics, this thesis combines video feature extraction and visual text detection to improve deep learning based video captioning. The main research contents are as follows:
(1) In previous encoder-decoder learning, video features of any length are encoded into a fixed-length representation, so the quality of the generated captions degrades as the input video features grow longer. Introducing an attention mechanism improves the model's performance on the encoder-decoder task by letting the encoder assign higher weights to the key regions of the video. For this reason, this thesis proposes a video captioning model based on an attention-augmented 3D residual network. First, in the encoding stage, the attention mechanism is introduced into the 3D residual module, and the video feature maps are enhanced through one-dimensional channel attention and two-dimensional spatial attention to reduce the influence of irrelevant targets and noise. Second, the GloVe model is used to vectorize the caption text in order to strengthen the relevance between words. Finally, in the decoding stage, the temporal characteristics of a two-layer LSTM network are used to output a textual description expressing the high-level semantics of the video. Experiments on the two public datasets MSVD and MSR-VTT show that the model describes the high-level semantic information of videos in natural language more accurately.
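As a complementary sketch for (1), the snippet below shows one plausible way to initialize the word embedding of a two-layer LSTM decoder from pretrained GloVe vectors in PyTorch; the vocabulary, file path, and dimensions are hypothetical placeholders.

import numpy as np
import torch
import torch.nn as nn

def load_glove(path, vocab, dim=300):
    # Build an embedding matrix from a GloVe text file (hypothetical path);
    # words missing from the file keep a small random initialization.
    table = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vec = line.rstrip().split(" ")
            if word in vocab:
                table[vocab[word]] = np.asarray(vec, dtype="float32")
    return torch.from_numpy(table)

class LSTMCaptionDecoder(nn.Module):
    # Two-layer LSTM decoder whose embedding starts from GloVe vectors.
    def __init__(self, glove_weights, hidden=512):
        super().__init__()
        vocab_size, emb_dim = glove_weights.shape
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (N, L) word ids of the partial caption; state would carry
        # the video context, e.g. an encoder summary as the initial state.
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state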
(2) Most video captioning algorithms do not fully describe the details of the targets in a video and easily ignore its latent text features. To address this, a video captioning method based on visual text and residual connections is proposed. First, the BERT model is used to detect the visual text in the video; second, this visual text is fused with the output of the first-layer GRU network and fed into the second-layer GRU network; finally, to obtain a closer mapping between the video and its textual description, a residual connection structure is built at each GRU layer. The experimental results show that the model can describe the detailed information in the video and optimizes the quality of the generated captions.
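To make the BERT step in (2) concrete, the following sketch encodes text strings recovered from video frames with a pretrained Chinese BERT through the Hugging Face transformers library; the OCR step that would produce those strings is assumed and not shown.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def encode_visual_text(texts):
    # Return one BERT feature vector per detected visual-text string.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    # Use the [CLS] token state as a fixed-size embedding of each string.
    return out.last_hidden_state[:, 0, :]        # (N, 768)

feats = encode_visual_text(["主运输巷", "2021-03-15 08:42:10"])
print(feats.shape)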
(3) The video captioning algorithm proposed in this thesis is applied to coal mine scenes. First, the underground surveillance video is preprocessed to produce a coal mine captioning dataset, which is used to train the model. Second, underground surveillance video often carries the time and place of an event; to make the description more specific, the subtitles extracted from the video are merged into the text generated by the GRU language model, yielding a textual description of the coal mine surveillance video. Finally, the experimental results show that the proposed model achieves good results on the coal mine captioning dataset.
 

CLC number:

 TP391.413    

Open access date:

 2021-06-22    
