Thesis Information

Chinese title: Chinese Text Classification Based on Word Embedding and k-Nearest Neighbors and Its Application

Name: 马春洁

Student ID: 18201009007

Confidentiality level: Public

Thesis language: Chinese (chi)

Discipline code: 070104

Discipline name: Science - Mathematics - Applied Mathematics

Student type: Master's

Degree: Master of Science

Degree year: 2021

Degree-granting institution: Xi'an University of Science and Technology

School: College of Science

Major: Applied Mathematics

Research direction: Computational Intelligence

First supervisor: 丁正生

First supervisor's institution: Xi'an University of Science and Technology

Submission date: 2021-06-18

Defense date: 2021-06-03

English title: Improved Word Embedding and k-Nearest Neighbor Algorithm for Chinese Text Classification

Chinese keywords: Chinese text classification; improved continuous bag-of-words model; improved k-nearest neighbor algorithm; tourist classification

English keywords: Chinese text classification; improved continuous bag-of-words model; improved k-nearest neighbor algorithm; tourist classification

Chinese abstract:

Driven by the development of Internet technology and the progress of mobile social networks in China, the volume of Chinese text information is growing rapidly, and this information carries enormous potential value; processing Chinese-language text (Chinese language processing, CLP) quickly and accurately is therefore of significant research value. On this basis, this thesis improves two steps of text classification, word embedding and classifier construction, raising the accuracy and computational efficiency of Chinese information processing, and finally applies the two proposed methods to classify tourists.

First, this thesis improves the word-embedding method. Unlike methods designed for alphabetic languages, it draws on the sound, shape, and meaning of Chinese characters, extends the continuous bag-of-words (CBOW) model, and proposes a dual-channel model that combines internal features of Chinese (pronunciation and glyph) with external features (context-dependent semantics). Experimental comparison with related algorithms shows that the dual-channel CBOW model produces more effective word embeddings for Chinese text. Second, this thesis improves the classifier. To cut the redundant computation of the kNN algorithm, a clustering algorithm partitions the samples into clusters, and a two-objective function selects the clusters closer to the query point, screening the candidate samples and raising the computation speed. Experiments against related algorithms show that the resulting twice-screened kNN algorithm (TS-kNN) speeds up classification without hurting accuracy, yielding text classification results faster and more accurately. Finally, to demonstrate practicality, tourist review texts are collected with a web crawler, embedded with the dual-channel CBOW model, and classified with TS-kNN. The main characteristics of each resulting tourist type are described, and suggestions that could improve tourist satisfaction are offered to tourist attractions.
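
To make the dual-channel idea concrete, the sketch below (an illustrative assumption, not the thesis's exact model) fuses, for each word, a hypothetical context-based (external) vector with a hypothetical pronunciation/glyph-based (internal) vector by a simple weighted average; all vectors, words, and the fusion rule are placeholders.

```python
import numpy as np

# Hypothetical external channel: context-based (CBOW-style) word vectors.
external = {
    "旅游": np.array([0.2, 0.1, 0.4, 0.3]),
    "景点": np.array([0.1, 0.3, 0.2, 0.4]),
}

# Hypothetical internal channel: vectors built from character-level features
# such as pinyin (pronunciation) and glyph components (placeholder values).
internal = {
    "旅游": np.array([0.3, 0.2, 0.1, 0.4]),
    "景点": np.array([0.4, 0.1, 0.3, 0.2]),
}

def dual_channel_vector(word, alpha=0.5):
    """Fuse the two channels with a weighted average (one possible fusion rule)."""
    return alpha * external[word] + (1.0 - alpha) * internal[word]

print(dual_channel_vector("旅游"))  # fused embedding for one word
```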

The dual-channel CBOW model for word embedding and the TS-kNN algorithm for classification proposed in this thesis substantially improve the accuracy and computational efficiency of Chinese text processing, and the case study confirms that the algorithms also have strong practical value.

English abstract:

Driven by the development of Internet technology and the progress of mobile social networks in China, the amount of Chinese text information is growing rapidly, and this information contains great potential value; processing Chinese-language information (CLP) with high speed and accuracy therefore has important research value. Based on this, this thesis improves the word-embedding and classifier-construction steps of text classification, raising the accuracy and computational efficiency of Chinese information processing, and finally classifies tourists with the two proposed methods.

Firstly, this thesis improves the word-embedding method. Unlike methods designed for alphabetic languages, it exploits the three-dimensional structure of Chinese characters (sound, shape, and meaning), extends the continuous bag-of-words (CBOW) model, and proposes a dual-channel model that combines internal features of Chinese (pronunciation and glyph) with external features (context-dependent semantics). Experimental comparison with related algorithms shows that the dual-channel CBOW model is more effective for embedding Chinese text. Secondly, this thesis improves the classifier. To reduce the redundant computation of the kNN algorithm, a clustering algorithm divides the training samples into clusters, and a two-objective function selects the clusters closer to the query point, so that candidate samples are screened and computation is accelerated. Experiments against related algorithms show that the resulting two-stage-screening kNN algorithm (TS-kNN) improves classification speed without affecting accuracy, producing text classification results faster and more accurately. Finally, to demonstrate the practicality of the algorithms, tourist review texts are collected with a web crawler, embedded with the dual-channel CBOW model, and classified with TS-kNN. Based on the classification results, the main characteristics of each tourist type are described, and suggestions that could improve tourist satisfaction are offered to tourist attractions.
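
As a minimal sketch of the two-stage screening behind TS-kNN (under simplifying assumptions: k-means pre-clustering and a plain nearest-center filter stand in for the thesis's two-objective cluster selection; data, labels, and parameters are hypothetical), the following example keeps only the clusters nearest to the query before running an ordinary kNN vote:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))      # toy document vectors (e.g. averaged word embeddings)
y = rng.integers(0, 3, size=200)   # toy class labels

# Offline step: pre-cluster the training vectors.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

def ts_knn_predict(query, clusters_kept=3, k=5):
    # Stage 1: keep only samples from the clusters whose centers lie closest to the query.
    center_dist = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    kept = np.argsort(center_dist)[:clusters_kept]
    mask = np.isin(kmeans.labels_, kept)
    X_cand, y_cand = X[mask], y[mask]
    # Stage 2: ordinary kNN majority vote on the reduced candidate set.
    nearest = np.argsort(np.linalg.norm(X_cand - query, axis=1))[:k]
    return Counter(y_cand[nearest]).most_common(1)[0][0]

print(ts_knn_predict(rng.normal(size=8)))
```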

The improved dual-channel model based on the continuous bag-of-words for word embedding and the TS-kNN algorithm for classification greatly improve the accuracy and computational efficiency of Chinese text processing, and the case study shows that the algorithms also have strong practical value.


CLC number: TP391.1

Open access date: 2021-06-21
