Thesis Information

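Thesis title (Chinese):

 基于领域语料库的中文自动分词系统的研究 (Research on an Automatic Chinese Word Segmentation System Based on a Domain Corpus)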

Name:

 Du Pu (杜璞)

Student ID:

 04245    

Confidentiality level:

 Public

Discipline code:

 081203    

Discipline name:

 Computer Application Technology

Student type:

 Master's

Degree year:

 2007    

School/Department:

 School of Computer Science and Technology

Major:

 Computer Application Technology

First supervisor:

 Zhang Xiaoyan (张小艳)

Thesis title (foreign language):

 The Research of Automatic Chinese Word Segmentation System Based on Domain Corpus

Keywords (Chinese):

 中文分词 最大匹配法 歧义字段 语料库 (Chinese word segmentation; maximum matching method; ambiguous strings; corpus)

Keywords (foreign language):

 Chinese segmentation; Maximum Matching Method; ambiguity

Abstract (translated from the Chinese):
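Automatic Chinese word segmentation is an essential step in Chinese information processing and has long been both a focus and a difficulty of research. Within Chinese information processing, word segmentation is widely used in information retrieval, machine translation, automatic question answering, text mining, and related fields. Chinese is harder for computers to process than Western languages, and the difficulty is concentrated in text segmentation. This thesis reviews the current state and difficulties of automatic Chinese word segmentation and the segmentation algorithms in common use. Based on an analysis and comparison of these algorithms, it adopts a dictionary-based forward maximum matching algorithm with character-by-character reduction (正向减字最大匹配). It builds a dictionary structure consisting of a first-character hash table with a three-level index, designed to fit the improved forward maximum matching method. For ambiguity handling, it adopts a disambiguation strategy combining statistics and rules, which identifies and resolves overlapping (交集型) and combinational (组合型) ambiguities in general corpora and combinational ambiguities in domain corpora.
The dictionary file is reorganized: an index table is built by computing first-character offsets, and the words sharing a first character are chained in order of decreasing length, which further narrows the matching range and reduces the number of matches. The forward maximum matching algorithm with character reduction is improved accordingly; the time complexity of matching is O(n), where n is the average number of words in the word list that begin with a given character. Experiments show that, compared with other algorithms, the method effectively increases segmentation speed.
The author proposes improvements to dictionary construction, the segmentation procedure, and the handling of ambiguous strings, which improve the completeness and accuracy of segmentation, and implements a complete automatic Chinese word segmentation system based on a computer-domain corpus in the VC++ 6.0 integrated development environment. Finally, the thesis compares existing Chinese word segmentation algorithms with the described algorithm in terms of efficiency and precision, and validates the method by testing it on targeted sample texts. The research and its results should have reference value and good application prospects for word segmentation and ambiguity processing in many areas of Chinese information processing.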
Abstract (foreign language):
Automatic Chinese word segmentation is a key issue in Chinese information processing and has long been a hot and difficult research topic. In Chinese information processing, word segmentation is widely used in the areas of information retrieval, machine translation, automatic question answering, text mining, etc. It is more difficult for computers to process Chinese than Western languages, especially in word segmentation. In this paper, the current state and difficulties of word segmentation are introduced, including widely used segmentation algorithms. Based on a comparison of these algorithms, a dictionary-based Maximum Matching Method is adopted, and the dictionary is structured as a three-level hash index table keyed on the first character of each word. For ambiguity, a resolution strategy that combines statistics with rules is adopted to identify and eliminate mixed and combined ambiguity in the common corpus and mixed ambiguity in the domain corpus. The dictionary structure, in which an index table is created by computing first-character offsets and words are linked in order of decreasing length, shortens the matching range and reduces the number of matches. The time complexity of the improved Maximum Matching Method is O(n), where n is the average number of words beginning with a given first character. Experiments show that the algorithm effectively improves segmentation speed.
A new method that improves the integrity and accuracy of word segmentation is put forward, covering the construction of the segmentation dictionary, the segmentation steps, and the handling of ambiguity. Then, based on an analysis and comparison of existing Chinese word segmentation algorithms and the one described in this paper in terms of segmentation efficiency and accuracy, targeted examples are tested to validate the method. The research and its outcome will provide a valuable reference and have good application prospects for word segmentation and ambiguity processing in many domains of Chinese information processing.
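Illustrative note (not part of the original record): the abstract describes forward maximum matching over a dictionary indexed by first character, with the words under each first character chained from longest to shortest. The Python sketch below shows that general idea only. It is not the thesis's three-level index or its VC++ 6.0 implementation: a plain dict stands in for the hash table, the tiny word list is made up, and the names `build_index` and `segment` are hypothetical.

```python
from collections import defaultdict


def build_index(words):
    """Group words by first character, longest first (assumed layout)."""
    index = defaultdict(list)
    for word in set(words):
        index[word[0]].append(word)
    for chain in index.values():
        # Longest words first, so the first hit is the maximum match.
        chain.sort(key=len, reverse=True)
    return dict(index)


def segment(text, index):
    """Forward maximum matching using the first-character index."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fall back to a single character
        # Only words sharing this first character are candidates.
        for word in index.get(text[pos], []):
            if text.startswith(word, pos):
                match = word
                break
        tokens.append(match)
        pos += len(match)
    return tokens


# Made-up sample dictionary, for illustration only.
index = build_index(["计算机", "计算", "算法", "分词", "中文", "系统"])
print(segment("中文分词系统", index))  # ['中文', '分词', '系统']
```

Because each lookup scans only the chain for one first character, the work per position is bounded by the number of words sharing that character, which is the quantity n in the O(n) figure quoted in the abstract.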
Chinese Library Classification number:

 TP391.1    

Release date:

 2011-09-06    
