Thesis title (Chinese): | 基于深度学习的命名实体识别方法研究 |
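Name: | |
Student ID: | 21208223062 |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 085400 |
Discipline name: | Engineering - Electronic Information |
Student type: | Master's |
Degree level: | Master of Engineering |
Degree year: | 2024 |
Degree-granting institution: | Xi'an University of Science and Technology |
School/Department: | |
Major: | |
Research direction: | Natural Language Processing |
First supervisor name: | |
First supervisor affiliation: | |
Thesis submission date: | 2024-06-14 |
Thesis defense date: | 2024-05-30 |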
Thesis title (foreign language): | Research on Named Entity Recognition Method Based on Deep Learning |
Chinese keywords: | |
Foreign-language keywords: | Named Entity Recognition ; Biaffine Model ; Iteratively Dilated Convolutional Network ; Span ; Diffusion Model |
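Abstract (translated from the Chinese): |
Named entity recognition (NER) is a core task in natural language processing and underpins many downstream applications. Its goal is to accurately identify named entities in large volumes of unstructured text and assign each a label, such as person name, place name, organization name, or domain-specific term. NER currently faces two challenges: 1) Nested entities: named entities are frequently nested inside one another in text. 2) Model limitations: traditional NER models tend to overlook the logical relationships between nested entities and lack a global understanding of the semantic associations among entities, which leads to label misclassification and difficulty locating entity boundaries. This thesis addresses these problems; the main contributions are as follows: (1) To handle nested entities occurring within flat entities, a span-based NER model, RIB-NER (RoBERTa-wwm-ext IDCNN BiLSTM NER), is proposed. First, RoBERTa-wwm-ext serves as the embedding layer and provides character-level embeddings that carry richer contextual semantic and lexical information. Second, an IDCNN uses parallel convolution kernels to add positional information between words, tightening the connections between them. A BiLSTM network is also incorporated to capture contextual information. Finally, a biaffine model scores the start and end tokens in the sentence, and these tokens are used to explore candidate spans. Evaluated on the MSRA and Weibo corpora, the model achieves F1 scores of 95.11% and 73.94%, respectively. (2) Nested entities can overlap in tokens and may even share the same head and tail tokens, which makes their boundaries hard to determine. To address this, a boundary-aware diffusion-based entity recognition model, DBA-NER (Diffusion Boundary Awareness NER), is proposed. The model uses a diffusion process to perceive named entity boundaries; by widening the range over which information between entities is perceived, it locates the position indices of entity span boundaries more accurately. First, a fixed forward diffusion process gradually adds Gaussian noise to the entity boundaries, producing a noisy span. Second, in the reverse inference process, BioBERT and ONLSTM act as the sentence encoder and provide sentence-level character embeddings to capture more contextual information. Then, a boundary-learnable denoising network built by fusing a dual attention mechanism progressively refines the entity boundaries to generate entity boundary information. Finally, an entity classifier classifies the recognized entities. The model is validated on the nested datasets ACE2004 and GENIA, where it achieves F1 scores of 87.93% and 80.91%, respectively. (3) A visualization system for named entity recognition is designed and implemented. The system implements the RIB-NER and DBA-NER algorithms proposed in this thesis and is intended to present their results visually. In the system, researchers and practitioners can interactively browse text data, directly observe how the algorithms recognize each type of named entity, and gauge the algorithms' performance and effectiveness. |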
Foreign-language abstract: |
Named entity recognition, a core task in natural language processing, provides a solid foundation for many downstream applications. Its main task is to accurately identify named entities in massive unstructured text and assign them corresponding labels, such as person names, place names, organization names, and domain-specific terms. However, named entity recognition currently faces two challenges: 1) Nested entities: named entities are often nested within each other in text. 2) Model limitations: traditional named entity recognition models tend to ignore the logical relationships between nested entities, and their lack of a global understanding of semantic associations among entities leads to label misclassification and difficulty in locating entity boundaries. This thesis investigates these issues. The main contributions are as follows: (1) To address nested entities occurring within flat entities, a span-based named entity recognition model, RIB-NER (RoBERTa-wwm-ext IDCNN BiLSTM NER), is proposed. First, RoBERTa-wwm-ext is used as the embedding layer to provide character-level embeddings that capture more contextual semantic and lexical information. Second, IDCNN is used to add positional information between words through parallel convolution kernels, strengthening the connections between words. In addition, a BiLSTM network is integrated into the model to capture contextual information. Finally, a biaffine model scores the start and end tokens in the sentence, and these tokens are used to explore candidate spans. The model was evaluated on two corpora, MSRA and Weibo, and obtained F1 scores of 95.11% and 73.94%, respectively. (2) Nested entities may overlap in tokens, or even share the same head and tail tokens, which makes their boundaries difficult to determine. To address this, a boundary-aware, diffusion-based entity recognition model, DBA-NER (Diffusion Boundary Awareness NER), is proposed. 
This model uses a diffusion process to perceive the boundaries of named entities; by expanding the range over which information between entities is perceived, it can more accurately locate the position indices of entity span boundaries. First, Gaussian noise is gradually added to the entity boundaries through a fixed forward diffusion process to obtain a noisy span. Second, in the reverse inference process, BioBERT and ONLSTM are used as the sentence encoder to provide sentence-level character embeddings that capture more contextual information. Then, a boundary-learnable denoising network is built by integrating a dual attention mechanism, and the entity boundaries are gradually refined to generate entity boundary information. Finally, the recognized entities are classified using an entity classifier. The model was experimentally validated on the nested datasets ACE2004 and GENIA, where it obtained F1 scores of 87.93% and 80.91%, respectively. (3) A named entity recognition visualization system was designed and implemented. The system implements the RIB-NER and DBA-NER algorithms proposed in this thesis and is intended to present their results visually. In the system, researchers and practitioners can interactively browse text data, directly observe how the algorithms recognize various named entities, and understand the algorithms' performance and effectiveness. |
Chinese Library Classification number: | TP391 |
Release date: | 2024-06-17 |
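
The abstract states that RIB-NER uses a biaffine model to score start and end tokens and then explores spans from those scores. The following is a minimal sketch of that idea, not the thesis's implementation. It assumes the commonly used biaffine form `h_start^T U h_end + w . [h_start; h_end] + b`, a single entity class, and hand-set toy weights; the names `biaffine_score`, `score_spans`, `U`, `w`, `b`, and `threshold` are hypothetical and do not come from the thesis.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def biaffine_score(h_start: Vector, h_end: Vector,
                   U: Matrix, w: Vector, b: float) -> float:
    """Score one candidate span as h_start^T U h_end + w . [h_start; h_end] + b."""
    # Bilinear term: interaction between the start and end representations.
    bilinear = sum(
        h_start[p] * U[p][q] * h_end[q]
        for p in range(len(h_start))
        for q in range(len(h_end))
    )
    # Linear term over the concatenated start and end representations.
    linear = sum(wi * xi for wi, xi in zip(w, h_start + h_end))
    return bilinear + linear + b


def score_spans(starts: List[Vector], ends: List[Vector],
                U: Matrix, w: Vector, b: float,
                threshold: float = 0.0) -> List[Tuple[int, int, float]]:
    """Return (start, end, score) for every span with start <= end above threshold."""
    spans = []
    n = len(starts)
    for i in range(n):
        for j in range(i, n):  # a span cannot end before it starts
            s = biaffine_score(starts[i], ends[j], U, w, b)
            if s > threshold:
                spans.append((i, j, s))
    return spans


# Toy 3-token sentence with 2-dimensional start/end representations (assumed values).
starts = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
ends = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
U = [[0.0, 2.0], [0.0, 0.0]]   # assumed bilinear weights
w = [0.0, 0.0, 0.0, 0.0]       # assumed linear weights
b = -1.0                       # assumed bias

print(score_spans(starts, ends, U, w, b))  # [(0, 1, 1.0)]
```

Because every (start, end) pair is scored independently, overlapping and nested spans can all be kept, which is why span-based scoring suits nested entities. In a trained model the start and end representations would come from the encoder stack, and there would typically be one score per entity class rather than the single score shown here.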