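Thesis title (Chinese): | 基于CBOW-Svote的金融基金评论情感分析研究 |
Name: | |
Student ID: | 20208088027 |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 083500 |
Discipline: | Engineering - Software Engineering |
Student type: | Master's |
Degree: | Master of Engineering |
Degree year: | 2023 |
Institution: | 西安科技大学 (Xi'an University of Science and Technology) |
Department: | |
Major: | |
Research area: | Natural language processing |
Primary supervisor name: | |
Primary supervisor affiliation: | |
Submission date: | 2023-12-14 |
Defense date: | 2023-12-04 |
Thesis title (foreign language): | Research on sentiment analysis of financial fund reviews based on CBOW-Svote |
Keywords (Chinese): | |
Keywords (foreign language): | Sentiment analysis ; Support vector machine ; Information entropy ; Joint voting |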
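Abstract (translated from Chinese): |
In recent years, China's economy and information technology have developed rapidly. As finance and the Internet continue to integrate, the range of wealth-management options keeps growing, and funds have attracted public attention thanks to advantages such as flexible subscription and redemption and high returns. Meanwhile, research on text sentiment analysis keeps deepening, and studies of social-media text and review feedback are increasingly abundant, yet sentiment analysis for the financial fund domain remains relatively scarce. The rapid spread of Internet wealth-management applications has stimulated fund investors' enthusiasm for online commenting and public-opinion participation, and fund comment sections, with their built-in social attributes, have recently flourished as never before. Analyzing these fund review texts is therefore necessary. This thesis takes the comment section of Tiantian Fund (天天基金) as its research object for sentiment analysis and introduces improvements along the way. The research can be summarized as follows:
(1) To address the lack of datasets in fund-domain research and the low adoption rate of automatic annotation algorithms, this thesis studies the characteristics of fund comment texts, builds a crawler for the fund comment section using pseudo-class selectors based on the page structure, and constructs a sentiment analysis dataset for the financial fund domain. It builds an integrated sentiment analysis system for fund comment sections that covers crawling, automatic sentiment-polarity annotation, text preprocessing, sentiment analysis, and suggested replies to comments.
A new-word acquisition method for the fund-review domain, the TF-IDF algorithm, and the PMI algorithm are used to improve word segmentation and expand the basic sentiment lexicon. A graded table of degree adverbs, built around the characteristics of fund comment text, optimizes the auxiliary lexicon. The two are merged with a general-purpose sentiment lexicon to form a fund-review sentiment lexicon, which is then combined with weak annotation information to label fund review texts automatically. Compared with labeling that uses only the basic lexicon, the data adoption rate increases by about 8%, indicating that the expanded lexicon captures the sentiment polarity of fund review texts better.
(2) To address the difficulty of determining weights in ensemble soft voting and the sensitivity of SVM to parameter tuning, this thesis proposes an information-entropy weighting method based on information entropy theory and constructs CBOW-Svote, a sentiment analysis model based on a joint voting mechanism. The CBOW model extracts text features, and an ensemble strong classifier built mainly on stacked support vector machines performs sentiment analysis on fund review texts. Introducing the information entropy principle into sentiment analysis enriches the options for choosing weights and yields some improvement in accuracy over other weighting schemes.
The CBOW-Svote model is then combined with the TF-IDF algorithm to compute word frequencies over the sentences whose sentiment values fall in the bottom 5%, and the words expressing negative sentiment are selected from them. Regular expressions are used to scan the comment text word by word and capture users with poor investment sentiment, so that the model can identify the investment sentiment of fund investors in the comment section. The work can be applied to efficiently detect irrational investment sentiment among fund investors and reduce irrational investment behavior. The model is validated on both the self-built fund review dataset and a public dataset. In comparative experiments against six other models, it shows good overall performance, with average accuracy improvements of 3.15% and 2.82%, which also suggests that the sentiment analysis model proposed here on the basis of information entropy theory has some transferability. |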
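The abstract mentions PMI-based expansion of the sentiment lexicon but does not give the formula. The sketch below illustrates one common variant (SO-PMI: a candidate word's PMI with positive seed words minus its PMI with negative seed words), using document-level co-occurrence counts. The seed words, the toy corpus, the function names `build_pmi` and `so_pmi`, and the choice to score unseen pairs as 0 are all assumptions for illustration, not the thesis's actual procedure.

```python
import math
from collections import Counter
from itertools import combinations


def build_pmi(docs):
    """Return pmi(a, b) computed from document-level co-occurrence counts."""
    n_docs = len(docs)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing each word pair
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    def pmi(a, b):
        joint = pair_df[tuple(sorted((a, b)))]
        if joint == 0:
            return 0.0  # assumption: an unseen pair contributes no evidence
        # log2( P(a, b) / (P(a) * P(b)) ) with probabilities as doc frequencies
        return math.log2(joint * n_docs / (word_df[a] * word_df[b]))

    return pmi


def so_pmi(word, pos_seeds, neg_seeds, pmi):
    """Semantic orientation: > 0 leans positive, < 0 leans negative."""
    return (sum(pmi(word, s) for s in pos_seeds)
            - sum(pmi(word, s) for s in neg_seeds))


# Hypothetical, already-tokenized comments (not data from the thesis).
toy_docs = [
    ["fund", "rally", "good"],
    ["rally", "good", "hold"],
    ["fund", "crash", "bad"],
    ["crash", "bad", "sell"],
]
pmi = build_pmi(toy_docs)
for candidate in ("rally", "crash", "fund"):
    print(candidate, so_pmi(candidate, ["good"], ["bad"], pmi))
# rally 1.0
# crash -1.0
# fund 0.0
```

Candidates whose score exceeds a chosen threshold in either direction could be added to the positive or negative lexicon; the threshold itself would need tuning on real data.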
Abstract (foreign language): |
In recent years, China's economy and information technology have developed rapidly. With the continuing integration of finance and the Internet, the choice of wealth-management methods has become increasingly diverse, and funds have entered the public eye with advantages such as flexible subscription and redemption and high returns. At present, research on text sentiment analysis is deepening, and research on social texts and comment feedback is becoming increasingly rich. However, research on sentiment analysis in the field of financial funds is relatively scarce. The vigorous development and widespread popularity of Internet wealth-management applications continue to stimulate fund investors' enthusiasm for online commenting and public-opinion participation. Owing to its inherent social attributes, the fund comment section has recently shown unprecedented activity, so it is necessary to analyze and process these fund comment texts. This thesis performs sentiment analysis on the Tiantian Fund comment section and makes improvements and innovations during the analysis. The research content can be summarized as follows: (1) In response to the lack of datasets and the low adoption rate of automatic annotation algorithms in fund-domain research, this thesis studies the characteristics of comment text in fund comment sections. Pseudo-class selectors are used to build a crawler for the fund comment section based on the webpage structure, and a sentiment analysis dataset for the financial fund domain is established. An integrated sentiment analysis system for fund comment sections is built, covering crawling, automatic annotation of sentiment polarity, text preprocessing, sentiment analysis, and suggested replies to comments. 
A new-word acquisition method for the fund-review domain, the TF-IDF algorithm, and the PMI algorithm are used to improve word segmentation and expand the basic sentiment dictionary. A graded table of degree adverbs, reflecting the characteristics of fund comment text, is constructed to optimize the auxiliary dictionary. The two are integrated with a general sentiment dictionary to form a sentiment dictionary for the fund-review domain, which is further combined with weak annotation information to annotate fund review text automatically. Compared with labeling that uses only the basic dictionary, the data adoption rate increases by about 8%, indicating that the expanded sentiment dictionary captures the sentiment polarity of fund comment texts better. (2) In response to the difficulty of determining weights in ensemble-learning soft voting and the sensitivity of SVM to parameter tuning, this thesis proposes an information entropy weighting method based on information entropy theory and constructs a CBOW-Svote sentiment analysis model based on a joint voting mechanism. The CBOW model extracts text features, and an ensemble-learning strong classifier built mainly on stacked support vector machines performs sentiment analysis on fund comment texts. Introducing the information entropy principle into sentiment analysis enriches the available weight selection methods and achieves some improvement in accuracy over other weighting methods. Then, the constructed CBOW-Svote model is combined with the TF-IDF algorithm to compute word frequencies over the sentences whose sentiment values fall in the bottom 5%, the words expressing negative sentiment are selected, and regular expressions are used to traverse the comment text in the fund comment section word by word to capture users with poor investment sentiment. 
The constructed sentiment analysis model is used to identify the investment sentiment of fund investors in the comment section. The research can be applied to efficiently identify irrational investment sentiment among fund investors and reduce the occurrence of irrational investment behavior. The model was validated and analyzed on both the self-built fund review dataset and a public dataset. Comparative experiments with six other models showed that the established model has good overall performance, achieving average accuracy improvements of 3.15% and 2.82%. This also indicates that the sentiment analysis model proposed in this thesis on the basis of information entropy theory has some transferability. |
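The abstract describes weighting the soft-voting ensemble by information entropy but does not state the weighting formula. As a minimal sketch under an assumed scheme, the code below gives each base classifier a weight proportional to one minus its normalized mean prediction entropy on a validation set, so that classifiers with more decisive probability outputs count for more. The function names, the example probabilities, and this particular weighting rule are illustrative assumptions, not the CBOW-Svote definition.

```python
import math


def mean_entropy(probas):
    """Average Shannon entropy (bits) over rows of class probabilities."""
    total = 0.0
    for row in probas:
        total += -sum(p * math.log2(p) for p in row if p > 0)
    return total / len(probas)


def entropy_weights(val_probas_per_model, n_classes=2):
    """Assumed rule: weight_i is proportional to 1 - H_i / log2(n_classes)."""
    max_h = math.log2(n_classes)
    raw = [1.0 - mean_entropy(p) / max_h for p in val_probas_per_model]
    total = sum(raw)
    if total == 0:
        # Every model is maximally uncertain: fall back to equal weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def soft_vote(probas_per_model, weights):
    """Weighted average of per-model class probabilities for one sample."""
    n_classes = len(probas_per_model[0])
    combined = [
        sum(w * p[k] for w, p in zip(weights, probas_per_model))
        for k in range(n_classes)
    ]
    label = max(range(n_classes), key=combined.__getitem__)
    return label, combined


# Hypothetical validation-set outputs of two base classifiers
# (columns: P(negative), P(positive)).
val_model_a = [[0.9, 0.1], [0.2, 0.8]]
val_model_b = [[0.6, 0.4], [0.5, 0.5]]
weights = entropy_weights([val_model_a, val_model_b])

# Combine the two models' predictions for one new comment.
label, combined = soft_vote([[0.3, 0.7], [0.55, 0.45]], weights)
print(weights, label, combined)
```

With real models, the per-model probabilities would come from the base classifiers' probability outputs; weighting by validation accuracy or by cross-entropy against true labels are alternative choices the thesis may or may not have compared against.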
Chinese Library Classification (CLC) number: | TP391 |
Release date: | 2023-12-15 |