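View thesis information

Thesis title (Chinese):

     

Name:

 刘哲    

Student ID:

 20208049002    

Confidentiality level:

     

Thesis language:

 chi    

Discipline code:

 0812    

Discipline name:

  -     

Student type:

     

Degree level:

     

Degree year:

 2023    

Degree-granting institution:

 西    

School/Department:

 College of Computer Science and Technology    

Major:

 Computer Science and Technology    

Research direction:

     

First supervisor name:

 厍向阳    

First supervisor affiliation:

 Xi'an University of Science and Technology (西安科技大学)    

Thesis submission date:

 2023-06-15    

Thesis defense date:

 2023-06-05    

Thesis title (foreign language):

 Research on Scene Text Detection and Recognition Algorithms Based on Attention Mechanism    

Keywords (Chinese):

 场景文本检测 ; Mask R-CNN ; GRU ; Swin Transformer ; 注意力机制 ; 特征融合    

Keywords (foreign language):

 Scene text detection ; Mask R-CNN ; GRU ; Swin Transformer ; Attention mechanism ; Feature fusion    

Abstract (Chinese):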
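<p>[Chinese abstract text lost in extraction; only the fragments below are recoverable.]</p> <p>1. Mask R-CNN, Swin Transformer, Feature Pyramid Networks (FPN); F-measure on ICDAR2015, CTW1500, and Total-Text: 85.61%, 76.83%, 78.56%.</p> <p>2. GRU, ResNet31; ICDAR2015, SVTP, and CUTE80: 87.4%, 86.3%, 94.8%.</p>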
Abstract (foreign language):
<p>Text in images and videos is highly condensed and describes scene content accurately, so it is valuable for helping computers understand image and video content. As the foundation for machine understanding of text in videos and images, scene text detection and recognition have become a research hotspot and are widely used in areas such as intelligent assisted imaging systems, scene understanding, and unmanned driving. This thesis proposes a scene text detection algorithm based on an improved feature pyramid to address the low accuracy of detecting small-scale and large-scale text, and a scene text recognition algorithm based on residual GRU and a spatial attention mechanism to address the low accuracy of recognizing irregular text. The main research results are as follows:</p> <p>1. A scene text detection algorithm based on multi-scale attention feature fusion is proposed to address the low detection accuracy for small-scale and large-scale text. The algorithm takes Mask R-CNN as the baseline model and introduces Swin Transformer as the backbone network to extract low-level features. In the Feature Pyramid Networks (FPN), multi-scale attention heat maps are horizontally fused with low-level features so that the detector at each level focuses on targets of a particular scale. The relationship between the attention heat maps of adjacent layers is used to achieve vertical feature sharing in the FPN structure, avoiding inconsistent gradient calculations between layers. Experiments show that the F-measure on the ICDAR2015 dataset and the CTW1500 and Total-Text curved-text datasets reaches 85.61%, 76.83%, and 78.56%, respectively, which is competitive with mainstream algorithms.</p>
<p>2. A scene text recognition algorithm based on residual GRU and a spatial attention mechanism is proposed to improve the low recognition accuracy for irregular text. The algorithm improves the encoder-decoder structure and comprises an improved ResNet31, a bidirectional GRU encoder and decoder, and a spatial attention module. Ablation and comparative experiments on six popular datasets show that the residual GRU encoder and the spatial attention module fully integrate spatial and sequence features and effectively improve the robustness of text recognition. On the irregular text datasets ICDAR2015, SVTP, and CUTE80, recognition accuracy reaches 87.4%, 86.3%, and 94.8%, respectively, which demonstrates the effectiveness of the improved algorithm.</p>
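The F-measure quoted in the abstract is conventionally the harmonic mean of detection precision and recall. The following is a minimal Python sketch of that formula only. The function name `f_measure` and the example counts are hypothetical; this record does not report the underlying counts or the matching protocol used in the thesis.

```python
def f_measure(true_positives: int, num_detections: int, num_ground_truth: int) -> float:
    """Return the harmonic mean of precision and recall.

    true_positives:   detections matched to a ground-truth text region
    num_detections:   all regions the detector produced
    num_ground_truth: all annotated text regions
    """
    # With no detections or no annotations, precision or recall is undefined;
    # this sketch reports 0.0 in that case (an assumption, not a standard).
    if num_detections == 0 or num_ground_truth == 0:
        return 0.0
    precision = true_positives / num_detections
    recall = true_positives / num_ground_truth
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only, not taken from the thesis.
score = f_measure(true_positives=80, num_detections=100, num_ground_truth=100)
```

How a detection is matched to a ground-truth region (for example, by an overlap threshold) depends on each benchmark's evaluation protocol and is outside this sketch.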
CLC number:

 TP391.4    

Open-access date:

 2023-06-15    
