Thesis Information

Thesis Title (Chinese):

 Research on Text Summarization Generation Algorithm Based on Deep Learning

Name:

 赵洵    

Student ID:

 21208223071    

Confidentiality Level:

 Public

Thesis Language:

 chi    

Discipline Code:

 085400    

Discipline Name:

 Engineering - Electronic Information

Student Type:

 Master's

Degree Level:

 Master of Engineering

Degree Year:

 2024    

Degree-Granting Institution:

 西安科技大学    

School/Department:

 School of Computer Science and Technology

Major:

 Software Engineering

Research Direction:

 Natural Language Processing

First Supervisor:

 厍向阳    

First Supervisor's Institution:

 西安科技大学    

Thesis Submission Date:

 2024-06-14    

Thesis Defense Date:

 2024-05-30    

Thesis Title (English):

 Research on Text Summarization Generation Algorithm Based on Deep Learning

Thesis Keywords (Chinese):

 Text summarization; GPT-2 model; BERT pre-trained language model; TextRank; T5 PEGASUS pre-trained language model; Streamlit

Thesis Keywords (English):

 Text summarization; GPT-2 model; BERT pre-trained language model; TextRank; T5 PEGASUS pre-trained language model; Streamlit

Thesis Abstract (Chinese):

       With the rapid development of the Internet, the volume of information has grown explosively, and obtaining the required information from it quickly and effectively has become extremely important. Text summarization, one of the important research topics in natural language processing and artificial intelligence, uses computers to compress and distill large amounts of text into concise, coherent passages. However, current text summarization technology still suffers from the polysemy of word vectors, insufficient understanding of context, and the information loss and out-of-vocabulary (OOV) problems found in long-text summarization tasks. To solve these problems, this thesis proposes two text summarization algorithms based on deep learning. The main work is as follows:

       To address the polysemy of word vectors and insufficient understanding of context, a text summarization algorithm based on an improved GPT-2 model is proposed. The algorithm: ① obtains word vectors through the BERT pre-trained language model to capture richer semantic information; ② introduces a time-offset module before the self-attention mechanism to capture temporal information in the text and achieve a more accurate understanding of context; ③ decodes with a greedy strategy to further improve summary quality. Experimental results show that on the NLPCC2017 and LCSTS datasets the algorithm accurately captures the key information and semantic associations of the text. Compared with the baseline model, it improves the ROUGE-1, ROUGE-2, and ROUGE-L metrics by 6.6, 6.5, and 5.9 percentage points on NLPCC2017, and by 5.7, 4.8, and 5.5 percentage points on LCSTS.

       To address information loss and the OOV problem in long-text summarization tasks, a two-stage summarization algorithm for long texts is proposed, combining extractive and abstractive methods: ① the TextRank algorithm extracts key sentences from the original text and sorts them in their original order, yielding a preliminary summary with highly concentrated information; ② the T5 PEGASUS pre-trained language model is fine-tuned, taking the preliminary summary as input, and a copy mechanism is introduced to expand the range of vocabulary available when generating summaries. Experimental results show that on the NLPCC2017 long-text summarization dataset the algorithm effectively preserves the key information of the original text, improving the ROUGE-1, ROUGE-2, and ROUGE-L metrics over the baseline model by 1.5, 1.6, and 1.8 percentage points.

       Finally, to address the current lack of a text summarization system, the lightweight Streamlit library is used to deploy the proposed two-stage summarization model for long texts, and the functions of the resulting text summarization system are introduced and demonstrated.

Thesis Abstract (English):

        With the rapid development of the Internet, the amount of information is growing explosively, and obtaining the required information quickly and effectively has become extremely important. Text summarization technology, one of the important research topics in the fields of natural language processing and artificial intelligence, can utilize computers to compress and extract concise, coherent passages from large amounts of text. However, current text summarization technology still suffers from the polysemy of word vectors, insufficient understanding of context, and the information loss and out-of-vocabulary (OOV) problems that exist in long-text summarization tasks. To solve these problems, this thesis proposes two text summarization algorithms based on deep learning. The main work is as follows:

        To address the polysemy of word vectors and insufficient understanding of context in text summarization algorithms, a summarization algorithm based on an improved GPT-2 model is proposed. The algorithm: ① obtains word vectors through the BERT pre-trained language model to capture more semantic information; ② introduces a time-offset module before the self-attention mechanism to capture temporal information in the text and obtain a more accurate understanding of context; ③ uses a greedy decoding strategy to further improve the quality of the summary. Experimental results show that the algorithm accurately captures the key information and semantic associations of the text on the NLPCC2017 and LCSTS datasets. Compared with the baseline model, the algorithm improves the ROUGE-1, ROUGE-2, and ROUGE-L metrics by 6.6, 6.5, and 5.9 percentage points on the NLPCC2017 dataset, and by 5.7, 4.8, and 5.5 percentage points on the LCSTS dataset.
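The greedy decoding step in ③ can be sketched as follows. This is a minimal illustration only: the hand-written bigram table stands in for a real GPT-2 forward pass, and all names (`next_token_logits`, the toy vocabulary, the `<s>`/`</s>` markers) are assumptions for the example, not the thesis's implementation.

```python
def next_token_logits(prefix, vocab):
    """Toy stand-in for a language model: score each candidate next token
    from a tiny hand-written bigram table (illustrative assumption)."""
    bigrams = {("<s>", "text"): 2.0, ("text", "summary"): 3.0,
               ("summary", "</s>"): 4.0, ("text", "model"): 1.0}
    last = prefix[-1]
    return [bigrams.get((last, w), 0.0) for w in vocab]

def greedy_decode(vocab, max_len=10):
    """Greedy decoding: at each step emit the single highest-scoring
    next token, stopping at the end-of-sequence marker."""
    out = ["<s>"]
    for _ in range(max_len):
        logits = next_token_logits(out, vocab)
        best = max(range(len(vocab)), key=lambda i: logits[i])
        out.append(vocab[best])
        if vocab[best] == "</s>":
            break
    return out[1:]  # drop the start marker
```

Because each step keeps only the argmax token, greedy decoding is fast but cannot revise an early choice; that trade-off is why beam search is the usual alternative.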

        To address information loss and the OOV problem in long-text summarization tasks, a two-stage summarization algorithm for long texts is proposed. The algorithm combines extractive and abstractive methods: ① the TextRank algorithm extracts key sentences from the original text and sorts them in their original order, producing a preliminary summary with highly concentrated information; ② the T5 PEGASUS pre-trained language model is fine-tuned, taking the preliminary summary as input, and a copy mechanism is introduced to expand the range of vocabulary available when generating summaries. Experimental results show that the algorithm effectively preserves the key information of the original text on the NLPCC2017 long-text summarization dataset. Compared with the baseline model, the algorithm improves the ROUGE-1, ROUGE-2, and ROUGE-L metrics by 1.5, 1.6, and 1.8 percentage points, respectively.
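The extraction stage ① can be sketched as below: build a sentence-similarity graph and run PageRank-style power iteration, then return the top-scoring sentences restored to document order. This is a minimal sketch assuming whitespace tokenization and the classic word-overlap similarity; the thesis's actual tokenizer, similarity function, and damping settings may differ.

```python
import math

def similarity(a, b):
    """TextRank sentence similarity: word overlap normalized by log lengths."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    overlap = len(wa & wb)
    if overlap == 0 or len(wa) < 2 or len(wb) < 2:
        return 0.0
    return overlap / (math.log(len(wa)) + math.log(len(wb)))

def textrank_extract(sentences, k=2, d=0.85, iters=50):
    """Score sentences by power iteration over the similarity graph and
    return the top-k sentences in their original document order."""
    n = len(sentences)
    sim = [[0.0 if i == j else similarity(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    row_sum = [sum(row) for row in sim]  # out-weight of each sentence node
    scores = [1.0] * n
    for _ in range(iters):
        scores = [(1 - d) + d * sum(sim[j][i] / row_sum[j] * scores[j]
                                    for j in range(n) if sim[j][i] > 0)
                  for i in range(n)]
    top = sorted(sorted(range(n), key=lambda i: -scores[i])[:k])
    return [sentences[i] for i in top]
```

Re-sorting the selected indices (`sorted(top)`) is what keeps the extracted sentences in original order, as stage ① requires, so the preliminary summary reads coherently before being fed to the fine-tuned model.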

        Finally, to address the current lack of a deployed text summarization system, the lightweight Streamlit library was used to deploy the proposed two-stage summarization model for long texts, and the functions of the resulting text summarization system were introduced and demonstrated.

CLC Number:

 TP391    

Open Access Date:

 2024-06-14    
