Thesis title (Chinese): |
Research on Scene Text Detection and Recognition Algorithms Based on the Attention Mechanism
|
Name: |
刘哲
|
Student ID: |
20208049002
|
Confidentiality level: |
Public
|
Thesis language: |
chi
|
Discipline code: |
0812
|
Discipline name: |
Engineering - Computer Science and Technology (degrees may be conferred in Engineering or Science)
|
Student type: |
Master's
|
Degree level: |
Master of Engineering
|
Degree year: |
2023
|
Degree-granting institution: |
Xi'an University of Science and Technology
|
School/Department: |
College of Computer Science and Technology
|
Major: |
Computer Science and Technology
|
Research direction: |
Image processing
|
First supervisor: |
厍向阳
|
First supervisor's institution: |
Xi'an University of Science and Technology
|
Thesis submission date: |
2023-06-15
|
Thesis defense date: |
2023-06-05
|
Thesis title (English): |
Research on Scene Text Detection and Recognition Algorithms Based on the Attention Mechanism
|
Keywords (Chinese): |
Scene text detection; Mask R-CNN; GRU; Swin Transformer; Attention mechanism; Feature fusion
|
Keywords (English): |
Scene text detection; Mask R-CNN; GRU; Swin Transformer; Attention mechanism; Feature fusion
|
Abstract (Chinese): |
︿
<p>Text information in images and videos is highly condensed and can accurately describe scene content, which is of great value for helping computers understand image and video content effectively. As the foundation of machine understanding of text in videos and images, scene text detection and recognition has become a research hotspot in this field and is widely applied in intelligent blind-assistance systems, scene understanding, autonomous driving, and other areas. To address the low detection accuracy for small-scale and large-scale text in text detection, this thesis proposes a scene text detection algorithm based on an improved feature pyramid; to address the low recognition accuracy for irregular text in text recognition, it proposes a scene text recognition algorithm based on a residual GRU and a spatial attention mechanism. The main research results are as follows:</p>
<p>1. To address the low detection accuracy for small-scale and large-scale text in current text detection, a scene text detection algorithm based on multi-scale attention feature fusion is proposed. The method takes Mask R-CNN as the baseline model and introduces Swin Transformer as the backbone network to extract low-level features. In the Feature Pyramid Network (FPN), multi-scale attention heat maps are fused with the low-level features through lateral connections so that each level of the detector focuses on targets of a specific scale, and the relationship between attention heat maps of adjacent levels is exploited to achieve vertical feature sharing within the FPN, avoiding inconsistent gradient computation across levels. Experiments show that the F-measure reaches 85.61%, 76.83%, and 78.56% on the ICDAR2015, CTW1500, and Total-Text curved-text datasets respectively, a good level compared with mainstream algorithms.</p>
<p>2. To address the low recognition accuracy for irregular text in scene text recognition, this thesis proposes a scene text recognition algorithm based on a residual GRU and a spatial attention mechanism. The algorithm improves on the encoder-decoder structure and mainly comprises an improved ResNet-31, a bidirectional GRU encoder and decoder, and a spatial attention module. Ablation and comparison experiments on six mainstream datasets show that the residual GRU encoder and the spatial attention module fully fuse spatial and sequence features and effectively improve the robustness of text recognition. On the irregular-text datasets ICDAR2015, SVTP, and CUTE80, recognition accuracy reaches 87.4%, 86.3%, and 94.8% respectively, demonstrating the effectiveness of the proposed improvements.</p>
﹀
|
Abstract (English): |
︿
<p>Text information in images and videos is highly condensed and accurately describes scene content, which makes it valuable for computers to understand image and video content effectively. As the foundation of machine understanding of text in videos and images, scene text detection and recognition has become a research hotspot in this field and is widely used in areas such as intelligent blind-assistance systems, scene understanding, and autonomous driving. This thesis proposes a scene text detection algorithm based on an improved feature pyramid to address the low accuracy of detecting small-scale and large-scale text, and a scene text recognition algorithm based on a residual GRU and a spatial attention mechanism to address the low accuracy of irregular text recognition. The main research results of this thesis are as follows:</p>
<p>1. A scene text detection algorithm based on multi-scale attention feature fusion is proposed to address the low detection accuracy of small-scale and large-scale text. The algorithm takes Mask R-CNN as the baseline model and introduces Swin Transformer as the backbone network to extract low-level features. In the Feature Pyramid Network (FPN), multi-scale attention heat maps are fused with the low-level features through lateral connections so that each level of the detector focuses on targets of a specific scale, and the relationship between attention heat maps of adjacent levels is used to achieve vertical feature sharing in the FPN structure, avoiding inconsistent gradient computation across levels. Experiments show that the F-measure reaches 85.61%, 76.83%, and 78.56% on the ICDAR2015, CTW1500, and Total-Text curved-text datasets respectively, a good level compared with mainstream algorithms.</p>
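<p>As a rough illustration of the fusion scheme described above, the following minimal PyTorch sketch gates each FPN level's lateral feature with a scale-specific attention heat map and shares features top-down between adjacent levels. All module names, channel sizes (taken from typical Swin-T stages), and the sigmoid-gating form are illustrative assumptions, not the thesis's actual implementation.</p>

```python
# Hypothetical sketch of multi-scale attention feature fusion in an FPN.
# Assumes PyTorch; names and shapes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFPNLevel(nn.Module):
    """One FPN level: a 1x1 lateral conv gated by a scale-specific heat map."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # 1-channel heat map predicting where targets of this level's scale lie
        self.heatmap = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c, top_down=None):
        attn = torch.sigmoid(self.heatmap(c))   # scale-specific attention map
        p = self.lateral(c) * attn              # gate the lateral feature
        if top_down is not None:                # vertical sharing from the level above
            p = p + F.interpolate(top_down, size=p.shape[-2:], mode="nearest")
        return self.smooth(p), attn

# Toy usage: backbone features C3..C5 with decreasing resolution
c3, c4, c5 = (torch.randn(1, ch, s, s) for ch, s in [(192, 64), (384, 32), (768, 16)])
l5, l4, l3 = AttentionFPNLevel(768), AttentionFPNLevel(384), AttentionFPNLevel(192)
p5, a5 = l5(c5)
p4, a4 = l4(c4, top_down=p5)
p3, a3 = l3(c3, top_down=p4)
print(p3.shape, p4.shape, p5.shape)  # (1, 256, 64, 64) (1, 256, 32, 32) (1, 256, 16, 16)
```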
<p>2. A scene text recognition algorithm based on a residual GRU and a spatial attention mechanism is proposed to address the low recognition accuracy of irregular text in scene text recognition. The algorithm improves on the encoder-decoder structure, comprising an improved ResNet-31, a bidirectional GRU encoder and decoder, and a spatial attention module. Ablation and comparison experiments on six mainstream datasets show that the residual GRU encoder and the spatial attention module fully fuse spatial and sequence features, effectively improving the robustness of text recognition. On the irregular-text datasets ICDAR2015, SVTP, and CUTE80, recognition accuracy reaches 87.4%, 86.3%, and 94.8% respectively, which shows the effectiveness of the proposed improvements.</p>
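<p>Likewise, the following is a minimal sketch of a residual bidirectional-GRU encoder and a simple additive spatial-attention step, again assuming PyTorch; the dimensions, normalization, and scoring function are illustrative assumptions rather than the thesis's exact design.</p>

```python
# Hypothetical sketch of a residual BiGRU encoder with spatial attention.
# Assumes PyTorch; names and shapes are illustrative only.
import torch
import torch.nn as nn

class ResidualBiGRU(nn.Module):
    """Bidirectional GRU whose output is added back to its input (residual)."""
    def __init__(self, dim=512):
        super().__init__()
        self.gru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch, seq_len, dim)
        out, _ = self.gru(x)               # forward+backward halves concat to dim
        return self.norm(x + out)          # residual connection

class SpatialAttention(nn.Module):
    """Attend over flattened 2-D features with a per-step query vector."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feat, query):        # feat: (B, H*W, dim), query: (B, dim)
        e = self.score(torch.tanh(feat + query.unsqueeze(1)))  # (B, H*W, 1)
        alpha = torch.softmax(e, dim=1)    # spatial attention weights
        return (alpha * feat).sum(dim=1)   # context vector: (B, dim)

# Toy usage on a flattened CNN feature map (e.g. an 8x32 grid of features)
feat = torch.randn(2, 8 * 32, 512)
seq = ResidualBiGRU()(feat)                # sequence features via residual GRU
ctx = SpatialAttention()(seq, torch.randn(2, 512))
print(seq.shape, ctx.shape)                # (2, 256, 512) (2, 512)
```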
﹀
|
CLC number: |
TP391.4
|
Open access date: |
2023-06-15
|