Title (Chinese): | Research on Video Captioning Method Based on High-Level Visual-Text Semantic Feature Association |
Name: | |
Student ID: | 20208223035 |
Confidentiality level: | Classified (to be opened after 1 year) |
Thesis language: | chi |
Discipline code: | 085400 |
Discipline name: | Engineering - Electronic Information |
Student type: | Master's |
Degree level: | Master of Engineering |
Degree year: | 2023 |
Degree-granting institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research direction: | Computer Graphics and Image Processing Technology |
First supervisor: | |
First supervisor's institution: | |
Second supervisor: | |
Thesis submission date: | 2023-06-15 |
Thesis defense date: | 2023-06-06 |
Title (English): | Research on Video Captioning Method Based on High-Level Visual-Text Semantic Feature Association |
Keywords (Chinese): | Video captioning; semantic topics; Enhance-TopK sampling; semantic regions; visual noise filtering strategy |
Keywords (English): | Video captioning; Semantic topics; Enhance-TopK sampling; Semantic regions; Visual noise filtering strategy |
Abstract (Chinese): |
Video captioning is an important cross-modal research task that converts video content into natural language sentences conforming to human grammatical conventions. Existing video captioning methods mainly rely on an encoder-decoder framework to map video content to textual information. However, owing to the diversity and multimodal nature of video content, encoder-decoder based methods that characterize only high-level visual features cannot fully express the semantic content of a video and struggle to reflect its topics, scenes, and object relationships, which hinders the generation of high-quality captions. To address these problems, this thesis conducts the following research on top of encoder-decoder based video captioning:

(1) To address the problem that existing methods have difficulty accurately capturing and expressing the topic information within a video, so that the generated descriptions are prone to semantic bias, this thesis proposes a video captioning method based on semantic topic-guided generation (VC-STG). The method first extracts spatio-temporal features of the video in the encoding stage; it then uses these features to retrieve the visual tags of similar videos, from which the semantic topic information of the video is extracted. In the decoding stage, an EGPT-2 deep network model based on Enhance-TopK sampling is constructed. By jointly decoding a base description of the video content together with its semantic topic, the model better captures and describes the topic information in the video, reduces the "semantic bias" introduced when mapping between video and text data, and makes the generated description more consistent with the topic of the video. Within this process, the designed Enhance-TopK sampling algorithm dynamically adjusts the probability distribution of the predicted words, alleviating the long-tail problem in the decoding stage and making the generated sentences more fluent and reasonable. Experimental results show that the proposed VC-STG method can use the topic information of a video to guide the model to generate descriptions that match the video's topic.

(2) To address the problem that existing methods rely too heavily on the limited visual features of a single video input source and therefore cannot fully and accurately understand the high-level semantic information of a video, this thesis proposes a video captioning method based on visual-text semantic association (VC-VTSA). In the encoding stage, the method fuses the 2D static features, 3D motion features, and object-level region features of the video to represent its global visual features. In the semantic association stage, a self-attention mechanism composes the already generated words into phrases with contextual semantic dependencies and associates them with the global visual features extracted in the encoding stage, thereby constructing a multimodal semantic region that contains both visual content and textual information. By exploiting the latent semantic association and complementarity between the visual and textual modalities within this region, the semantic content of the video is represented more accurately. In this process, a visual noise filtering strategy is designed to help the phrases in the semantic region associate accurately with the corresponding visual content. Finally, the constructed multimodal semantic region is fed into an LSTM decoder to predict the next word, until a complete description is generated. Experimental results show that the proposed method improves the accuracy of the generated descriptions by exploiting the association and complementarity between the different modalities within the semantic region.

The proposed methods are evaluated with extensive experiments on two public datasets, MSVD and MSR-VTT. The results show that introducing the concepts of video semantic topics and visual-text multimodal semantic regions on top of the encoder-decoder framework effectively improves the quality of the generated video captions. |
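To make the sampling step concrete, the following is a minimal sketch (assuming PyTorch) of a top-k decoding step whose retained probability distribution is dynamically re-weighted, in the spirit of the Enhance-TopK sampling described above; the power-based re-weighting and the `enhance_factor` parameter are illustrative assumptions, not the exact formulation used in the thesis.

```python
import torch
import torch.nn.functional as F

def enhance_topk_sample(logits, k=10, enhance_factor=1.2, temperature=1.0):
    """Sample the next word id from a re-weighted top-k distribution.

    logits: tensor of shape (vocab_size,) produced by the language model
    for the next-word prediction. The re-weighting below is only a sketch
    of the idea of dynamically adjusting the probability distribution.
    """
    # Temperature-scaled probabilities over the full vocabulary.
    probs = F.softmax(logits / temperature, dim=-1)

    # Keep only the k most likely words, truncating the long tail.
    topk_probs, topk_ids = torch.topk(probs, k)

    # Re-weight the retained distribution: a power < 1 flattens it so that
    # mid-ranked words get a larger share, a power > 1 sharpens it.
    adjusted = topk_probs.pow(1.0 / enhance_factor)
    adjusted = adjusted / adjusted.sum()

    # Draw one word id from the adjusted top-k distribution.
    choice = torch.multinomial(adjusted, num_samples=1)
    return topk_ids[choice].item()
```

In a plain top-k decoder, low-frequency but contextually appropriate words are rarely selected; flattening the retained distribution is one simple way to counter that long-tail effect.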
Abstract (English): |
Video captioning is an important cross-modal research task that enables video content to be translated into natural language sentences that conform to human grammar. Existing video captioning methods mainly use an encoder-decoder framework to map video content to text information. However, due to the diversity and multimodality of video content, encoder-decoder based video captioning methods cannot fully express the semantic content of a video by characterizing only its high-level visual features; it is difficult for them to reflect the topics, scenes, and object relationships in the video, which is not conducive to generating high-quality captions. To solve the above problems, this thesis carries out the following research on the basis of encoder-decoder video captioning methods:

(1) To solve the problem that existing video captioning methods find it difficult to accurately capture and represent the topic information inside a video, so that the resulting text descriptions are prone to semantic bias, this thesis presents a video captioning method based on semantic topic-guided generation (VC-STG). First, the method extracts the spatio-temporal features of the video during the encoding phase. Then, the visual labels of similar videos are retrieved using the resulting video features to extract the semantic topic information of the video. In the decoding phase, an EGPT-2 deep network model based on Enhance-TopK sampling is constructed. By jointly decoding the base description of the video content and its semantic topic, the model better captures and describes the topic information in the video, reduces the "semantic bias" caused by the mapping between video and text data, and makes the generated text description more consistent with the topic of the video. In this process, the designed Enhance-TopK sampling algorithm alleviates the long-tail problem in the decoding stage by dynamically adjusting the probability distribution of the predicted words, making the generated sentences smoother and more reasonable. The experimental results show that the proposed VC-STG method can use the video topic information to guide the model to generate text descriptions that match the video topic.

(2) To solve the problem that existing video captioning methods rely too much on the limited visual features of a single video input source and cannot fully and accurately understand the high-level semantic information of the video, this thesis proposes a video captioning method based on visual-text semantic association (VC-VTSA). In the encoding stage, the method fuses the 2D static features, 3D motion features, and object-level region features of the video to characterize its global visual features. Then, in the semantic association stage, a self-attention mechanism is used to combine the already generated words into phrases with contextual semantic dependencies and to associate them with the global visual features extracted in the encoding stage, thereby constructing a multimodal semantic region with both visual content and textual information. The semantic content of the video is characterized more accurately by exploiting the latent, complementary semantic associations between the visual and textual modalities within this region. In this process, a visual noise filtering strategy is designed to help the phrases in the semantic region associate accurately with the corresponding visual content. Finally, the constructed multimodal semantic region is fed into an LSTM decoder to predict the next word until the complete video description is generated. The experimental results show that the proposed method improves the accuracy of the generated text descriptions by exploiting the associative and complementary relationships between the different modalities within the semantic region.

The proposed methods have been evaluated through extensive experiments on two public datasets, MSVD and MSR-VTT. The experimental results show that, by introducing the concepts of video semantic topics and visual-text multimodal semantic regions on the basis of the encoder-decoder framework, the proposed methods effectively improve the quality of the generated video captions. |
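As a rough illustration of the semantic association stage summarized above, the sketch below (assuming PyTorch) composes already generated words into phrase features with self-attention, gates the fused visual features as a stand-in for the visual noise filtering strategy, associates the current phrase with the filtered visual features to form a multimodal semantic region, and feeds the result to an LSTM decoder step. The class name `SemanticRegionStep`, the layer sizes, and the sigmoid gate are illustrative assumptions rather than the exact design of VC-VTSA.

```python
import torch
import torch.nn as nn

class SemanticRegionStep(nn.Module):
    """One decoding step built around a visual-text multimodal semantic region."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.phrase_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.assoc_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.noise_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.decoder = nn.LSTMCell(dim * 2, dim)

    def forward(self, word_embs, visual_feats, h, c):
        # word_embs:    (B, T, dim) embeddings of the words generated so far
        # visual_feats: (B, N, dim) fused 2D static / 3D motion / object features
        # Self-attention turns the generated words into phrase-level features
        # that carry contextual semantic dependencies.
        phrases, _ = self.phrase_attn(word_embs, word_embs, word_embs)

        # Down-weight visual features that are likely to be noise before the
        # association step (stand-in for the visual noise filtering strategy).
        gate = self.noise_gate(visual_feats)          # (B, N, 1)
        filtered = visual_feats * gate

        # Associate the most recent phrase with the filtered visual features,
        # yielding the multimodal semantic region for this step.
        query = phrases[:, -1:, :]                    # (B, 1, dim)
        region, _ = self.assoc_attn(query, filtered, filtered)

        # Feed the semantic region and the phrase into the LSTM decoder step;
        # a linear layer over h (not shown) would score the next word.
        step_in = torch.cat([region.squeeze(1), query.squeeze(1)], dim=-1)
        h, c = self.decoder(step_in, (h, c))
        return h, c
```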
CLC number: | TP301.6 |
Open access date: | 2024-06-15 |