Thesis Information

Chinese title:

 基于编解码器的视频描述方法研究 (Research on Video Captioning Method Based on Encoder-Decoder)

Name:

 Wang Mimi (王咪咪)

Student ID:

 19208088022

Confidentiality level:

 Public

Language:

 Chinese (chi)

Discipline code:

 083500

Discipline:

 Engineering - Software Engineering

Student type:

 Master's

Degree:

 Master of Engineering

Degree year:

 2022

Institution:

 Xi'an University of Science and Technology

School:

 College of Computer Science and Technology

Major:

 Software Engineering

Research direction:

 Computer graphics and image processing

First supervisor:

 Fu Yan (付燕)

Supervisor's institution:

 Xi'an University of Science and Technology

Submission date:

 2022-06-20

Defense date:

 2022-06-06

English title:

 Research on Video Captioning Method based on Encoder-Decoder    

Chinese keywords:

 Video captioning ; Visual scene representation ; Syntactic analysis ; Sentence retrieval ; RS GPT-2 model

English keywords:

 Video captioning ; Visual scene representation ; Grammatical analysis ; Statement retrieval ; RS GPT-2 model

Chinese abstract:

     In recent years, with the rapid development of the Internet and multimedia technology, video data has grown dramatically. To help people understand and select the videos they need, research on video captioning has gradually attracted academic attention. Video captioning aims to achieve high-level semantic understanding and natural expression of visual content, and has broad application prospects in video retrieval, assisted vision, and surveillance description. In this field, encoder-decoder models are mainly used to encode and decode visual information so as to produce textual descriptions of video content. However, such methods extract visual information only at the feature level and rarely consider the semantic analysis of visual features within the generated sentences. In addition, the generated captions depend too heavily on the label information attached to the video data, which makes it difficult to produce semantically rich descriptions. To address these problems, this thesis improves the traditional encoder-decoder approach by combining video feature extraction, scene representation construction, and the introduction of an external corpus. The main research contents are as follows:

  (1) To address the problem that insufficient semantic analysis of videos in encoder-decoder-based captioning methods leads to captions with unclear syntactic structure, a video captioning method based on syntactic analysis of object features in the scene representation is proposed. First, in the encoding stage, the 2D and C3D features of the video, the object features extracted by the Faster R-CNN model, and the self-attention mechanism of the Transformer are combined to build a visual scene representation model that captures the dependencies among visual features. Second, a syntactic analysis model for visual object features is built to analyze the syntactic roles that the object features of the scene representation play in the caption. Finally, in the decoding stage, the results of the syntactic analysis are injected into an LSTM network to output the caption. The results show that this method can generate captions with a clear syntactic structure.
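
  A minimal sketch of this encode-decode pipeline, written in PyTorch; the feature dimensions, layer sizes, and mean-pooled conditioning are illustrative assumptions, not the configuration used in the thesis.

import torch
import torch.nn as nn

class SceneRepresentationEncoder(nn.Module):
    """Fuses 2D, C3D and object features with Transformer self-attention."""
    def __init__(self, d_2d=2048, d_3d=4096, d_obj=2048, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.proj_2d = nn.Linear(d_2d, d_model)     # per-frame appearance (2D) features
        self.proj_3d = nn.Linear(d_3d, d_model)     # clip-level motion (C3D) features
        self.proj_obj = nn.Linear(d_obj, d_model)   # Faster R-CNN object features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feat_2d, feat_3d, feat_obj):
        # feat_2d: (B, T, d_2d), feat_3d: (B, M, d_3d), feat_obj: (B, N, d_obj)
        tokens = torch.cat([self.proj_2d(feat_2d),
                            self.proj_3d(feat_3d),
                            self.proj_obj(feat_obj)], dim=1)
        return self.encoder(tokens)                 # (B, T+M+N, d_model) scene representation

class LSTMCaptionDecoder(nn.Module):
    """Produces caption word logits conditioned on the scene representation."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, scene, captions):
        # Condition every time step on the mean-pooled scene representation.
        ctx = scene.mean(dim=1, keepdim=True).expand(-1, captions.size(1), -1)
        x = torch.cat([self.embed(captions), ctx], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                          # (B, L, vocab_size)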

  (2) Current encoder-decoder-based video captioning methods rely heavily on a single video input source and rarely use external corpus information to guide caption generation, so the generated captions carry limited semantic information, which hinders accurate description of video content. To solve this problem, a video captioning method guided by a sentence retrieval generation network (ED-SRG) is proposed. First, an encoder-decoder model is used to extract the 2D, 3D, and object features of the video and to decode these features into simple draft captions. Second, a sentence-transformer network is used to retrieve sentences in an external corpus that are semantically similar to the draft captions, and a candidate sentence set is selected by measuring the similarity between sentences. Finally, an RS GPT-2 network model is constructed; it uses a designed random selector to randomly choose among the predicted words with high probability of occurrence in the corpus, guiding the generation of captions for the video data that conform to natural human language expression.
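
  A minimal sketch of the retrieval step, assuming the sentence-transformers library, a general-purpose checkpoint ("all-MiniLM-L6-v2"), and a toy corpus; the thesis' actual corpus and retrieval model are not reproduced here.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed general-purpose sentence encoder

# Toy external corpus standing in for the real one.
corpus = [
    "a man is riding a horse on the beach",
    "a woman is slicing vegetables in a kitchen",
    "children are playing football in a park",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Draft caption produced by the encoder-decoder in the first stage.
draft = "a person rides an animal"
draft_emb = model.encode(draft, convert_to_tensor=True)

# Cosine similarity between the draft and every corpus sentence;
# keep the top-k most similar sentences as the candidate set.
scores = util.cos_sim(draft_emb, corpus_emb)[0]
top_k = scores.topk(k=2)
candidates = [corpus[i] for i in top_k.indices.tolist()]
print(candidates)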

     The proposed methods are evaluated on the public datasets MSVD and MSR-VTT. The results show that, compared with the baseline encoder-decoder model, they improve the BLEU-4, CIDEr, ROUGE_L, and METEOR scores by 19.4%, 13.1%, 11.6%, and 13.5%, respectively.

English abstract:

     In recent years, with the rapid development of the Internet and multimedia technologies, video data have increased dramatically. To facilitate the understanding and selection of desired videos, studies on video captioning have gradually received academic attention. Video captioning aims to achieve high-level semantic awareness and natural expression of visual content, and has broad application prospects in video retrieval, aided vision, surveillance description, and other areas. In video captioning, encoder-decoder models are mainly used to encode and decode visual information, thus enabling textual description of video content. However, such methods extract visual information only at the feature level of the video data and rarely consider the semantic analysis of visual features within the generated sentences. In addition, the sentences generated by video captioning depend too heavily on the label information of the video data, so encoder-decoder models struggle to generate semantically rich captions. To address these issues, this thesis improves the traditional encoder-decoder approach by combining video feature extraction, scene representation construction, and the introduction of external corpora. The main research contents are as follows.

   (1) To address the issue that insufficient semantic analysis in encoder-decoder-based video captioning methods leads to captions with unclear syntactic structure, a novel video captioning method based on syntactic analysis of object features in the scene representation is proposed. First, in the encoding stage, the 2D and C3D features of the videos, the object features extracted by the Faster R-CNN model, and the self-attention mechanism of the Transformer are combined to construct a visual scene representation model that represents the dependencies among visual features. Then, a syntactic analysis model for visual object features is constructed to analyze the syntactic roles that the object features of the scene representation play in the caption. Finally, in the decoding stage, the results of the syntactic analysis are injected into an LSTM network to output the caption. The results show that the proposed method can generate captions with a clear syntactic structure.
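
    A hypothetical PyTorch sketch of the syntax-analysis idea: a small classifier that assigns each object slot of the scene representation a syntactic role before it is injected into the LSTM decoder. The three-role inventory and the classifier architecture are illustrative assumptions, not the model actually constructed in the thesis.

import torch
import torch.nn as nn

SYNTAX_ROLES = ["subject", "predicate-object", "modifier"]   # assumed role inventory

class ObjectSyntaxAnalyzer(nn.Module):
    """Predicts a syntactic role for each object token of the scene representation."""
    def __init__(self, d_model=512, n_roles=len(SYNTAX_ROLES)):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Linear(d_model // 2, n_roles),
        )

    def forward(self, object_tokens):
        # object_tokens: (B, N, d_model) object slots of the scene representation.
        logits = self.classifier(object_tokens)   # (B, N, n_roles) role scores
        roles = logits.argmax(dim=-1)             # predicted role index per object
        return logits, roles

# Role embeddings derived from `roles` could then be concatenated with the
# object features before they are passed to the LSTM decoder.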

    (2) At present, encoder-decoder-based video captioning methods rely heavily on a single video input source and rarely use external corpus information to guide caption generation; as a result, the generated captions carry limited semantic information, which is not conducive to accurate description of video content. To address this issue, a video captioning method guided by a sentence retrieval generation network (ED-SRG) is proposed. First, the method adopts an encoder-decoder model to extract the 2D, 3D, and object features of the videos and decodes these features into simple draft captions. Then, it uses a sentence-transformer network model to retrieve sentences in an external corpus that are semantically similar to the draft captions, and selects a candidate sentence set by measuring the similarities between sentences. Finally, a novel RS GPT-2 network model is constructed, which introduces a designed random selector to randomly choose among predicted words with a high probability of occurrence in the corpus, guiding the generation of captions for the video data that conform to natural human language expression.
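
     A minimal sketch using Hugging Face GPT-2, where standard top-k sampling stands in for the random selector over high-probability next words described above; the prompt, checkpoint, and sampling parameters are assumptions rather than the thesis' RS GPT-2 configuration.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Prompt assembled from the draft caption and a retrieved candidate sentence.
prompt = ("a person rides an animal. "
          "a man is riding a horse on the beach.")
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,                        # sample instead of greedy decoding
    top_k=10,                              # restrict sampling to the 10 most probable words
    pad_token_id=tokenizer.eos_token_id,   # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))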

     The proposed methods are evaluated on the public datasets MSVD and MSR-VTT. The results show that they improve the BLEU-4, CIDEr, ROUGE_L, and METEOR scores by 19.4%, 13.1%, 11.6%, and 13.5%, respectively, over the baseline encoder-decoder model.

CLC number:

 TP301.6    

Release date:

 2022-06-21    
