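Thesis title (Chinese): | 基于领域语料库的中文自动分词系统的研究 |
Name: | |
Student ID: | 04245 |
Confidentiality level: | Public |
Discipline code: | 081203 |
Discipline name: | Computer Application Technology |
Student type: | Master's |
Degree year: | 2007 |
Department: | |
Major: | |
First supervisor: | |
Thesis title (foreign language): | The Research of Automatic Chinese Word Segmentation System Based on Domain Corpus |
Keywords (Chinese): | |
Keywords (foreign language): | |
Abstract (translated from Chinese): |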
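Automatic Chinese word segmentation is an important step in Chinese information processing and has long been both a research focus and a difficult problem. Within Chinese information processing, word segmentation is widely used in information retrieval, machine translation, automatic question answering, text mining, and related fields. Processing Chinese by computer is harder than processing Western languages, and the difficulty is concentrated in text segmentation. This thesis reviews the current state of automatic Chinese word segmentation, the difficulties it faces, and the segmentation algorithms in common use. Based on an analysis and comparison of these algorithms, it adopts a dictionary-based forward maximum matching algorithm with character reduction; it builds a dictionary structure with a three-level index over a first-character hash table, designed to fit the improved forward maximum matching method; and for ambiguity handling it adopts a disambiguation strategy combining statistics and rules, which identifies and resolves intersection-type and combination-type ambiguities in general-purpose corpora and combination-type ambiguities in domain corpora.
The dictionary file is reorganized: an index table is built by computing first-character offsets, and words are linked into chains in order of decreasing length, which further narrows the matching range and reduces the number of matching operations. The forward maximum matching algorithm with character reduction is also improved; the time complexity of the matching algorithm is O(n), where n is the average number of words in the word list that begin with a given character. Experiments show that, compared with other algorithms, segmentation speed is effectively improved.
The author proposes improvements to the construction of the segmentation dictionary, the segmentation steps, and the handling of ambiguous fields, improving the completeness and accuracy of segmentation, and implements a complete automatic Chinese word segmentation system based on a computer-domain corpus in the VC++ 6.0 integrated development environment. Finally, existing Chinese word segmentation algorithms are compared with the algorithm described in this thesis in terms of segmentation efficiency and precision, and the methods are verified by tests on targeted texts. The research and its results should have reference value and good application prospects for word segmentation and ambiguity handling in many areas of Chinese information processing.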
|
Abstract (foreign language): |
Automatic Chinese word segmentation is a key issue in Chinese information processing
and has long been a hot and difficult topic. In Chinese information processing, word
segmentation is widely used in the areas of information retrieval, machine translation,
automatic question answering, text mining, etc. It is more difficult for computers to
process Chinese than Western languages, mainly in the processing of word segmentation.
In this paper, the current state and difficulties of word segmentation are introduced,
along with widely used segmentation algorithms. Based on a comparison of these algorithms,
a dictionary-based Maximum Matching Method is adopted, and the dictionary is structured as a
three-level hash index table keyed on the first character of each word. For ambiguity, a
disambiguation strategy that combines statistics with rules is adopted to identify and
eliminate mixed (intersection-type) and combined ambiguity in the common corpus and mixed
ambiguity in the domain corpus.
The dictionary is reorganized: an index table is created by computing the offset of each
first character, and words are linked in order of decreasing length, which narrows the range
of matching and reduces the number of matches. The time complexity of the improved Maximum
Matching Method is O(n), where n is the average number of words sharing one first character.
Experiments show that the algorithm effectively improves the speed of segmentation.
A new method is put forward that improves the construction of the word segmentation
dictionary, the steps of word segmentation, and the processing of ambiguity, thereby
improving the completeness and accuracy of word segmentation. Then, existing Chinese word
segmentation algorithms are analyzed and compared with the one described in this paper in
terms of segmentation efficiency and accuracy, and targeted examples are tested to validate
the method used. The research and its outcome will be a valuable reference with good
application prospects for word segmentation and ambiguity processing in many domains of
Chinese information processing.
|
Chinese Library Classification (CLC) number: | TP391.1 |
Open-access date: | 2011-09-06 |
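
The abstract describes a dictionary indexed by first character, with the words under each character chained from longest to shortest, and a forward maximum matching pass over that index. The following is only a minimal sketch of that idea under stated assumptions, not the thesis implementation: it is written in Python rather than VC++ 6.0, it replaces the three-level file-offset index with an in-memory mapping, it omits the ambiguity-resolution stage entirely, and the names `build_index` and `segment` as well as the sample word list are hypothetical.

```python
from collections import defaultdict


def build_index(words):
    """Group words by first character; each chain is sorted longest first.

    Assumption: a plain in-memory dict stands in for the thesis's
    three-level hash index built from file offsets.
    """
    index = defaultdict(list)
    for word in set(words):
        if word:
            index[word[0]].append(word)
    for chain in index.values():
        chain.sort(key=len, reverse=True)
    return dict(index)


def segment(text, index):
    """Forward maximum matching over the first-character index."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character if nothing matches
        for word in index.get(text[i], ()):
            # The chain is ordered longest first, so the first hit
            # is the longest dictionary word starting at position i.
            if text.startswith(word, i):
                match = word
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical sample dictionary, for illustration only.
sample_words = ["计算机", "计算", "中文", "自动", "分词", "系统"]
sample_index = build_index(sample_words)
print(segment("中文自动分词系统", sample_index))
```

Because only the chain for the current first character is scanned, the work per position is proportional to the number of dictionary words sharing that character, which is the quantity n in the O(n) bound quoted in the abstract.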