Thesis Information

Chinese title: 基于视觉-文本高层语义特征关联的视频描述方法研究 (Research on Video Captioning Method Based on High-Level Semantic Feature Association of Visual-Text)

Name: 魏新力

Student ID: 20208223035

Confidentiality level: Confidential (to be opened after 1 year)

Thesis language: Chinese (chi)

Discipline code: 085400

Discipline name: Engineering - Electronic Information

Student type: Master's

Degree level: Master of Engineering

Degree year: 2023

Degree-granting institution: 西安科技大学

School/Department: College of Computer Science and Technology

Major: Software Engineering

Research direction: Computer graphics and image processing technology

First supervisor: 付燕

First supervisor's institution: 西安科技大学

Second supervisor: 严超

Thesis submission date: 2023-06-15

Thesis defense date: 2023-06-06

English title: Research on Video Captioning Method Based on High-Level Semantic Feature Association of Visual-Text

Chinese keywords: Video captioning; semantic topics; Enhance-TopK sampling; semantic regions; visual noise filtering strategy

English keywords: Video captioning; Semantic topics; Enhance-TopK sampling; Semantic regions; Visual noise filtering strategy

Chinese abstract:

Video captioning is an important cross-modal research task that converts video content into natural language sentences conforming to human grammatical conventions. Existing video captioning methods mainly rely on an encoder-decoder framework to map video content to text. However, because video content is diverse and multimodal, encoder-decoder-based methods that characterize only high-level visual features cannot fully express the semantic content of a video; they struggle to reflect its themes, scenes, and object relationships, which hinders the generation of high-quality captions. To address these problems, this thesis builds on encoder-decoder-based video captioning and carries out the following research:

(1) To address the problem that existing video captioning methods have difficulty accurately capturing and expressing the topic information within a video, so that the generated descriptions are prone to semantic bias, this thesis proposes a video captioning method guided by semantic topics (VC-STG). The method first extracts spatio-temporal features of the video in the encoding stage; it then uses these features to retrieve the visual labels of similar videos, from which the semantic topic information of the video is extracted. In the decoding stage, an EGPT-2 deep network model based on Enhance-TopK sampling is constructed. By jointly decoding a base description of the video content and its semantic topics, the model captures and describes the topic information in the video more effectively, reduces the "semantic bias" introduced when mapping between video and text data, and makes the generated descriptions more consistent with the video's topic. The Enhance-TopK sampling algorithm designed in this process dynamically adjusts the probability distribution of the predicted words, alleviating the long-tail problem in the decoding stage and making the generated sentences more fluent and coherent. Experimental results show that VC-STG can exploit topic information to guide the model toward descriptions that match the video's theme.

(2) To address the problem that existing video captioning methods rely too heavily on the limited visual features of a single video input source and therefore cannot fully and accurately understand the high-level semantics of a video, this thesis proposes a video captioning method based on visual-text semantic association (VC-VTSA). In the encoding stage, the method fuses the video's 2D static features, 3D motion features, and object-level region features to represent its global visual features. In the semantic association stage, a self-attention mechanism groups the already generated words into phrases with contextual semantic dependencies and associates them with the global visual features extracted in the encoding stage, constructing a multimodal semantic region that contains both visual content and textual information. By exploiting the latent, complementary semantic associations between the visual and textual modalities within this region, the semantic content of the video is represented more accurately. A visual noise filtering strategy is designed in this process to help the phrases in the semantic region associate correctly with the corresponding visual content. Finally, the constructed multimodal semantic region is fed into an LSTM decoder to predict the next word, until a complete video description is generated. Experimental results show that the proposed method improves the accuracy of the generated descriptions by exploiting the complementary associations between the different modalities within the semantic region.

The proposed methods were evaluated through extensive experiments on two public datasets, MSVD and MSR-VTT. The results show that, by introducing the concepts of video semantic topics and visual-text multimodal semantic regions on top of the encoder-decoder framework, the proposed methods effectively improve the quality of the generated video descriptions.

English abstract:

Video captioning is an important cross-modal research task that enables video content to be translated into natural language sentences that conform to human grammar. Existing video captioning methods mainly use an encoder-decoder framework to map video content to text. However, due to the diversity and multimodality of video content, encoder-decoder-based methods cannot completely express the semantic content of a video by characterizing only high-level visual features; it is difficult for them to fully reflect the themes, scenes, and object relationships in the video, which hinders the generation of high-quality captions. To solve these problems, this thesis conducts the following research on the basis of encoder-decoder-based video captioning:

(1) To solve the problem that existing video captioning methods struggle to accurately capture and represent the topic information inside a video, so that the resulting descriptions are prone to semantic bias, this thesis presents a video captioning method based on semantic topic-guided generation (VC-STG). First, the method extracts the temporal and spatial features of the video during the encoding phase. Then, the visual labels of similar videos are retrieved using the resulting video features to extract the semantic topic information of the video. In the decoding phase, an EGPT-2 deep network model based on Enhance-TopK sampling is constructed. By jointly decoding a base description of the video content and its semantic topics, the model captures and describes the topic information in the video better, reduces the "semantic bias" caused by the mapping between video and text data, and makes the generated description more consistent with the video's topic content. In this process, the designed Enhance-TopK sampling algorithm alleviates the long-tail problem in the decoding stage by dynamically adjusting the probability distribution of the predicted words, making the generated sentences smoother and more coherent. Experimental results show that the proposed VC-STG method can use topic information to guide the model to generate a description that matches the video's theme.
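
The abstract describes Enhance-TopK sampling only at a high level: truncate the next-word distribution to the k most probable candidates, then dynamically adjust that truncated distribution before sampling. As a purely illustrative sketch of that general idea, not the thesis's actual algorithm, the following PyTorch snippet rescales the retained top-k probabilities with an assumed flattening parameter alpha before drawing the next token.

```python
# Minimal sketch of a Top-K sampling step whose truncated distribution is
# rescaled before sampling (an assumed stand-in for the Enhance-TopK idea).
import torch

def topk_rescaled_sample(logits: torch.Tensor, k: int = 10, alpha: float = 0.7) -> int:
    """Sample one token id from the top-k entries of `logits` ([vocab_size]).

    alpha < 1 flattens the truncated distribution, so lower-ranked (long-tail)
    candidates keep a somewhat larger chance of being selected.
    """
    topk_logits, topk_ids = torch.topk(logits, k)        # keep the k best candidates
    probs = torch.softmax(topk_logits * alpha, dim=-1)   # rescale, then renormalize
    choice = torch.multinomial(probs, num_samples=1)     # draw one of the k candidates
    return int(topk_ids[choice])

# Usage with a dummy next-word distribution over a 30k-word vocabulary:
next_token_id = topk_rescaled_sample(torch.randn(30000), k=10, alpha=0.7)
```

Plain top-k sampling corresponds to alpha = 1; smaller values shift probability mass toward lower-ranked candidates, which is one way to counteract the long-tail effect mentioned above.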

(2) To solve the problem that existing video captioning methods rely too heavily on the limited visual features of a single video input source and cannot fully and accurately understand the high-level semantic information of a video, this thesis proposes a video captioning method based on visual-text semantic association (VC-VTSA). In the encoding stage, the method fuses the 2D static features, 3D motion features, and object-level region features of the video to characterize its global visual features. Then, in the semantic association stage, a self-attention mechanism is used to combine the already generated words into phrases with contextual semantic dependencies and to associate them with the global visual features extracted in the encoding stage, constructing a multimodal semantic region that contains both visual content and textual information. The semantic content of the video is characterized more accurately by exploiting the latent, complementary semantic associations between the visual and textual modalities within this region. In this process, a visual noise filtering strategy is designed to help the phrases in the semantic region associate accurately with the corresponding visual content. Finally, the constructed multimodal semantic region is fed into an LSTM decoder to predict the next word, until the complete video description is generated. Experimental results show that the proposed method improves the accuracy of the generated descriptions by exploiting the complementary associations between the different modalities within the semantic region.
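
To make the data flow concrete, the sketch below shows one plausible way to assemble such a multimodal semantic region: fuse the three visual feature streams, summarize the already generated words into a phrase vector with self-attention, and gate the visual features by their relevance to that phrase as a stand-in for the visual noise filtering strategy. The module layout, dimensions, and gating design are assumptions for illustration, not the architecture used in the thesis.

```python
# Hypothetical sketch of building a visual-text "semantic region":
# fuse visual streams, form a phrase vector via self-attention, and gate
# visual positions by relevance to the phrase (assumed noise filtering).
import torch
import torch.nn as nn

class SemanticRegion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(3 * d_model, d_model)        # 2D + 3D + object-level fusion
        self.phrase_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, feat2d, feat3d, feat_obj, word_embs):
        # feat2d / feat3d / feat_obj: [B, T, d_model]; word_embs: [B, L, d_model]
        visual = self.fuse(torch.cat([feat2d, feat3d, feat_obj], dim=-1))   # [B, T, d]
        phrases, _ = self.phrase_attn(word_embs, word_embs, word_embs)      # contextual phrases
        phrase_vec = phrases.mean(dim=1, keepdim=True)                      # [B, 1, d] summary
        # Relevance gate: suppress visual positions weakly related to the phrase.
        g = self.gate(torch.cat([visual, phrase_vec.expand_as(visual)], dim=-1))
        # Concatenate filtered visual features and the phrase vector along time:
        return torch.cat([g * visual, phrase_vec], dim=1)   # region fed to an LSTM decoder

B, T, L, D = 2, 20, 5, 512
region = SemanticRegion()(torch.randn(B, T, D), torch.randn(B, T, D),
                          torch.randn(B, T, D), torch.randn(B, L, D))
print(region.shape)  # torch.Size([2, 21, 512])
```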

The proposed methods were evaluated with extensive experiments on two public datasets, MSVD and MSR-VTT. The experimental results show that, by introducing the concepts of video semantic topics and visual-text multimodal semantic regions on the basis of the encoder-decoder framework, the proposed methods effectively improve the quality of the generated video captions.

CLC number: TP301.6

Open date: 2024-06-15
