Thesis Information

Title (Chinese):

基于深度学习的遥感影像视觉问答研究 (Research on Remote Sensing Visual Question Answering Based on Deep Learning)

Name:

Guo Yan (郭艳)

Student ID:

20210226067

Confidentiality Level:

Public

Thesis Language:

Chinese

Discipline Code:

085215

Discipline:

Engineering - Engineering - Surveying and Mapping Engineering

Student Type:

Master's

Degree Level:

Master of Engineering

Degree Year:

2023

Institution:

Xi'an University of Science and Technology

School:

College of Surveying and Mapping Science and Technology (测绘科学与技术学院)

Major:

Surveying and Mapping Engineering

Research Direction:

Remote Sensing Visual Question Answering

First Supervisor:

Huang Yuancheng (黄远程)

First Supervisor's Institution:

Xi'an University of Science and Technology

Submission Date:

2023-06-16

Defense Date:

2023-06-06

Title (English):

Research on Remote Sensing Visual Question Answering Based on Deep Learning

Keywords (Chinese):

Visual Question Answering; Multi-scale; Attention Mechanism; Deep Learning; Neural Network; Remote Sensing Image

Keywords (English):

Visual Question Answering (VQA); Multi-scale; Attention Mechanism; Deep Learning; Neural Network; Remote Sensing Image

Abstract (Chinese):

In recent years, with deepening research in artificial intelligence, deep learning has achieved remarkable success in computer vision and natural language processing. The emergence of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) drove rapid progress in the field: through convolution operations, CNNs capture the spatial features of images and are a standard building block in computer vision, while RNNs, through recurrent memory, capture contextual relationships and are widely used in natural language processing. With the development of CNNs and RNNs, interactive systems that fuse multimodal data have become a new research focus. Among them, Visual Question Answering (VQA) is a multimodal interactive system that combines computer vision and natural language processing and aims to have a computer answer intelligently given an input image and a related question. Because a VQA system involves the information and processing methods of both the vision and language modalities, a VQA model must not only understand the spatial features of the image and the semantic relationships in the text, but also let the features of the two modalities interact, so that the answer can be predicted from the fused features. However, most existing VQA models are built for natural images; they do not meet the application requirements of remote sensing scenes and are difficult to transfer directly to VQA research on remote sensing imagery.

Visual question answering on remote sensing imagery is one of the emerging research directions in the remote sensing field. Research on remote sensing VQA advances the intelligent interpretation of remote sensing imagery and is a key technology for rapidly surveying and monitoring global resources from remote sensing images. To extract valuable information from remote sensing imagery, this thesis proposes an end-to-end remote sensing VQA application system and, within a deep learning framework, studies remote sensing visual question answering from the following aspects:

(1) Global-local remote sensing visual question answering. In remote sensing applications, understanding the global scene of an image and recognizing local targets are equally important. This thesis therefore designs a VQA model that understands remote sensing image features ranging from the global scene down to local targets: the Global and Local Visual Question Answering (GLVQA) model. A remote sensing VQA dataset containing both global and local question-answer pairs, the GLVQA dataset, is also constructed. The dataset collects remote sensing images of 30 different scene types, which greatly enriches the scene diversity of existing remote sensing VQA data; its question-answer design targets global scene understanding and local target recognition, and 1,500 GLVQA samples are built in total. The proposed model is evaluated on the GLVQA dataset. The results show a validation accuracy of 83.6%; the model answers both global scene-understanding questions and local target-recognition questions well, showing great application potential in remote sensing VQA tasks.

(2) Remote sensing visual question answering based on multi-scale attention fusion. Target sizes in remote sensing images vary greatly, and a single image carries a large amount of information at different scales, including cluttered background objects, small local targets, salient targets, and the global scene. Single-scale features cannot capture spatial information across these different sizes. This thesis therefore builds a new remote sensing VQA dataset that reflects the multi-scale nature of remote sensing imagery: the Multi-scale Remote Sensing Visual Question Answering (MRS-VQA) dataset. It integrates image features at different scales and, in addition to the global scene questions and local target questions of the previous work, adds questions about salient targets and reasoning questions between local targets. The dataset contains 3,400 samples with rich question-answer types and can serve different remote sensing application scenarios. For this dataset, this study introduces a multi-scale feature combination into the remote sensing VQA system, designs the Multi-scale Remote Sensing Visual Question Answering (MRS-VQA) model, and introduces an attention mechanism at the multimodal fusion stage. By weighting targets in important regions, the MRS-VQA model improves accuracy (96.82%) and detection quality, and visualizing the attention maps enhances the model's interpretability. The experiments show that the MRS-VQA model genuinely understands the semantic information of the text and maps it to the corresponding image regions, effectively improving the applicability of the remote sensing VQA system.

Abstract (English):

In recent years, with in-depth research in artificial intelligence, deep learning has achieved remarkable success in computer vision and natural language processing. The emergence of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) triggered rapid development in deep learning. CNNs capture the spatial characteristics of images through convolution operations and are a common building block in computer vision; RNNs capture contextual relationships through recurrent memory and are mostly used in natural language processing. With the development of CNNs and RNNs, interactive systems that integrate multimodal data have gradually become a new research hotspot. Among them, Visual Question Answering (VQA) is a multimodal interactive system combining computer vision and natural language processing that aims to make a computer answer intelligently given an input image and a related question. A VQA system involves the information and processing methods of both the vision and language modalities, so a VQA model must not only understand the spatial characteristics of the image and the semantic relationships within the text, but also make the features of the two modalities interact, in order to predict the answer from the fused features. However, most existing visual question answering models are built on natural images; they do not meet the application requirements of remote sensing scenes and are difficult to transfer directly to VQA research on remote sensing images.
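
To make the generic pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of a CNN-plus-RNN VQA baseline. It is not the thesis implementation: the ResNet-18 backbone, the GRU question encoder, the layer sizes, and the element-wise (Hadamard) fusion are all assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleVQA(nn.Module):
    """Hypothetical baseline: CNN image encoder + GRU question encoder + fused answer classifier."""
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=1024):
        super().__init__()
        # Visual branch: a CNN backbone (randomly initialised here) with its classification head removed.
        backbone = models.resnet18()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # -> (B, 512, 1, 1)
        self.img_fc = nn.Linear(512, hidden_dim)
        # Text branch: word embeddings + GRU over the question tokens.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Answer prediction is treated as classification over a fixed answer vocabulary.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_ids):
        v = torch.tanh(self.img_fc(self.cnn(image).flatten(1)))     # (B, H) global image feature
        _, h = self.gru(self.embed(question_ids))
        q = torch.tanh(h[-1])                                        # (B, H) question feature
        fused = v * q                                                # element-wise (Hadamard) fusion
        return self.classifier(fused)                                # (B, num_answers) answer logits

# Dummy usage: a batch of 2 images (224x224) and 12-token questions.
model = SimpleVQA(vocab_size=5000, num_answers=100)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(1, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 100])
```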

Visual question answering based on remote sensing images is one of the emerging research directions in the remote sensing field. Research on remote sensing VQA promotes the intelligent interpretation of remote sensing images and is a key technology for the rapid investigation and monitoring of global resources from remote sensing imagery. In order to obtain valuable information from remote sensing images, this thesis proposes an end-to-end remote sensing VQA application system. Based on a deep learning framework, this thesis studies remote sensing visual question answering from the following aspects:

(1) Research on global-local remote sensing visual question answering. In remote sensing applications, understanding the global scene of an image and recognizing local targets are equally important. Therefore, this thesis designs a visual question answering model that understands remote sensing image features from the global scene down to local targets. Based on this idea, a new remote sensing VQA model, the Global and Local Visual Question Answering (GLVQA) model, is created, and a remote sensing VQA dataset containing global and local question-answer pairs, the GLVQA dataset, is constructed. The dataset collects remote sensing images of 30 different scene types, which greatly improves the scene richness of existing remote sensing VQA data; its question-answer design focuses on global scene understanding and local target recognition, and 1,500 GLVQA samples are built in total. The proposed model is evaluated on the GLVQA dataset, and the results show a validation accuracy of 83.6%; the model answers global scene-understanding questions and local target-recognition questions well, showing great application potential in remote sensing VQA tasks.
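
The GLVQA architecture itself is not specified in this abstract, so the sketch below only illustrates, under stated assumptions, one way a global (scene-level) image feature and a local image feature could each be fused with the question vector before answer classification. The average/max pooling split, the gating form, and all dimensions are hypothetical choices, not the thesis design.

```python
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    """Hypothetical fusion head: scene-level and local image features, each fused with the question."""
    def __init__(self, feat_dim=512, q_dim=1024, num_answers=100):
        super().__init__()
        self.global_fc = nn.Linear(feat_dim, q_dim)   # projects the scene-level (global) feature
        self.local_fc = nn.Linear(feat_dim, q_dim)    # projects the local-response feature
        self.classifier = nn.Linear(q_dim, num_answers)

    def forward(self, feat_map, q):
        # feat_map: (B, C, H, W) CNN feature map; q: (B, q_dim) encoded question vector.
        g = feat_map.mean(dim=(2, 3))                 # average pooling: global scene context
        l = feat_map.amax(dim=(2, 3))                 # max pooling: strongest local responses
        fused = torch.tanh(self.global_fc(g)) * q + torch.tanh(self.local_fc(l)) * q
        return self.classifier(fused)                 # (B, num_answers) answer logits

# Dummy usage: a 7x7 feature map from a CNN backbone and an already-encoded question.
head = GlobalLocalFusion()
logits = head(torch.randn(2, 512, 7, 7), torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 100])
```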

(2) Research on remote sensing visual question answering based on multi-scale attention fusion. Target sizes in remote sensing images differ greatly, and a single image contains a large amount of information at different scales, including cluttered background objects, small local targets, salient targets, and the global scene. A single-scale feature cannot capture spatial information at these different sizes. Therefore, according to the multi-scale characteristics of remote sensing images, this thesis creates a new remote sensing VQA dataset, the Multi-scale Remote Sensing Visual Question Answering (MRS-VQA) dataset, which integrates image features at different scales. In addition to the global scene questions and local target questions of the previous work, questions about salient targets and reasoning questions between local targets are added. The dataset contains 3,400 samples with rich question-answer types and can serve different remote sensing application scenarios. For this dataset, this study introduces a multi-scale feature combination into the remote sensing visual question answering system, designs the multi-scale remote sensing visual question answering (MRS-VQA) model, and introduces an attention mechanism at the multimodal fusion stage of the model. By weighting targets in important regions, the accuracy of the MRS-VQA model is improved (96.82%) along with its detection quality, and visualizing the attention maps enhances the interpretability of the model. The experimental results show that the MRS-VQA model genuinely understands the semantic information of the text and maps it to the relevant regions in the image, effectively improving the applicability of the remote sensing visual question answering system.
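
As a rough illustration of the multi-scale, question-guided attention fusion described above (not the actual MRS-VQA design), the sketch below pools a CNN feature map to several assumed grid scales, lets the question assign a soft attention weight to every region, and returns the weights so they can be visualised as attention maps. The scales, dimensions, and the soft-attention formulation are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttentionFusion(nn.Module):
    """Hypothetical head: multi-scale region features weighted by question-guided soft attention."""
    def __init__(self, feat_dim=512, q_dim=1024, num_answers=100, scales=(1, 3, 7)):
        super().__init__()
        self.scales = scales
        self.v_proj = nn.Linear(feat_dim, q_dim)
        self.att = nn.Linear(q_dim, 1)
        self.classifier = nn.Linear(q_dim, num_answers)

    def forward(self, feat_map, q):
        # feat_map: (B, C, H, W) CNN feature map; q: (B, q_dim) encoded question vector.
        regions = []
        for s in self.scales:                                        # pool the map to s x s grids
            pooled = F.adaptive_avg_pool2d(feat_map, s)              # (B, C, s, s)
            regions.append(pooled.flatten(2).transpose(1, 2))        # (B, s*s, C) region set
        v = torch.tanh(self.v_proj(torch.cat(regions, dim=1)))      # (B, N, q_dim), N = sum of s*s
        alpha = torch.softmax(self.att(v * q.unsqueeze(1)), dim=1)  # (B, N, 1) question-guided weights
        v_att = (alpha * v).sum(dim=1)                               # (B, q_dim) attended visual feature
        return self.classifier(v_att * q), alpha                     # logits + weights for visualisation

# Dummy usage: `alpha` can be reshaped per scale and overlaid on the image as an attention map.
head = MultiScaleAttentionFusion()
logits, alpha = head(torch.randn(2, 512, 14, 14), torch.randn(2, 1024))
print(logits.shape, alpha.shape)  # torch.Size([2, 100]) torch.Size([2, 59, 1])
```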


CLC Number:

TP751

Open Access Date:

2023-06-16
