Thesis Information

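Thesis title (translated from Chinese):	Microblog Public Opinion Analysis Based on Web Text Information Extraction
Author:	Xiong Zutao (熊祖涛)
Student ID:	G09051
Confidentiality level:	Public
Discipline code:	085211
Discipline:	Computer Technology
Student type:	Master of Engineering
Degree year:	2013
School:	School of Computer Science and Technology
Major:	Computer Technology
Primary advisor:	Gong Shangfu (龚尚福)
Thesis title (English):	Analysis of Micro-blog Public Opinion based on Text Information Extraction from Webpage
Keywords (translated from Chinese):	Microblog ; Information extraction ; Text clustering ; Public opinion analysis
Keywords (English):	Micro-blog ; Information Extraction ; Text Clustering ; Analysis of Public Opinion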
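Abstract (translated from Chinese):	According to statistics released by the China Internet Network Information Center (CNNIC), China had 309 million microblog users as of December 2012. Microblogging's fission-style propagation, diverse access terminals, low barrier to entry, and high interactivity have made it a major source of online public opinion. A public opinion index released in July 2011 by the Network Public Opinion (Word-of-Mouth) Research Institute of the Communication University of China shows that microblogs have become China's second-largest source of public opinion, after news media reports, and play an increasingly important role in shaping it. How to obtain microblog public opinion information promptly, understand its current state, and predict its trend, so as to guide it, promote the benefits, and limit the harms, has become an important new research topic. Against this background, the thesis studies methods for processing microblog data and analyzing public opinion using Web information extraction. First, in view of the characteristics of microblog text, microblog pages are collected with a Heritrix-based topical Web crawler and stored as mirrored pages. Then, exploiting the nesting of HTML tags, a DOM tree suited to access is built for each collected page. Because microblog text is free in form and non-standard in language, a normalization method is proposed that combines manual annotation with processing against online corpora to handle punctuation, emoticons, stop words, and out-of-vocabulary words. In the Chinese word segmentation and part-of-speech tagging stage, NLPIR Chinese word segmentation is compared with segmentation using the R package Rwordseg. Because microblog texts are short and clustering them is prone to data sparsity, the thesis proposes representing microblog text with the LDA model, compares the strengths and weaknesses of partition-based and hierarchical clustering methods, and proposes an algorithm that combines k-means clustering with hierarchical clustering. In the public opinion analysis stage, subjective and objective texts are classified with a method based on the 2-POS model, and sentiment words are labeled using CRFs together with the regularities of sentiment words themselves and contextual information. Finally, the orientation of microblog topics and comment opinions is analyzed with the help of a sentiment lexicon. The techniques and methods used in the thesis are studied experimentally, with data comparison and quantitative analysis, taking Sina Weibo, a representative Chinese microblog service, as an example. Preliminary experimental results indicate that the techniques and methods adopted, namely R-based word segmentation, the LDA model, short-text clustering combining k-means with hierarchical clustering, the 2-POS model, and CRFs, have certain advantages over other traditional methods for processing microblog data and can extract, count, and analyze microblog public opinion data well.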
Abstract (English):	According to statistics released by the China Internet Network Information Center (CNNIC), the number of micro-blog users in China reached 309 million by the end of December 2012. The advantages of micro-blogs, including fission-like transmission, diversified communication terminals, a low threshold, and high interactivity, make them an important birthplace of online public opinion. The public opinion index released in July 2011 by the Network Public Opinion (Word of Mouth) Institute of the Communication University of China showed that micro-blogs have become China's second largest source of public opinion after news media reports and are playing an increasingly important role in guiding public opinion. How to obtain micro-blog public opinion information promptly, understand the current situation, and predict its trend, so as to make good use of the benefits and eliminate the defects, has become an important new topic in public opinion research. Against this background, the thesis studies methods of processing micro-blog data and analyzing public opinion with Web information extraction technology. Firstly, according to the characteristics of micro-blog text, the Heritrix topic Web crawler is used to collect micro-blog pages and store them as mirror pages. Next, an easily accessible DOM tree structure is built for the collected pages by exploiting the nesting of HTML tags. Since micro-blog text is free in form and nonstandard in language, a normalization method is put forward that combines manual tagging with network corpus processing to deal with the punctuation, emoticons, stop words, out-of-vocabulary words, etc. contained in the text. In the stage of Chinese word segmentation and part-of-speech tagging, a comparison is made between the Rwordseg segmentation tool in the R language and the NLPIR Chinese word segmentation system.
As clustering short micro-blog texts easily causes data sparseness, the LDA model is used to represent micro-blog text. The thesis compares the advantages and disadvantages of partition-based and hierarchical clustering methods and puts forward a new method that combines k-means clustering with hierarchical clustering. In the public opinion analysis stage, subjective and objective texts are classified based on the 2-POS model, and emotional words are tagged with the CRFs method, combining the regularities of emotion words with context information. Finally, orientation analysis is performed on micro-blog topics and comment viewpoints by means of an emotional dictionary. The technical means and methods used in the thesis are evaluated through experimental studies, data comparison, and quantitative analysis on Sina Weibo, a representative domestic micro-blog. Preliminary experimental results show that the techniques and methods used, such as R language word segmentation, the LDA model, short text clustering combining k-means with hierarchical clustering, the 2-POS model, and CRFs, hold certain advantages over other traditional methods in processing micro-blog data and can better realize the extraction, statistics, and analysis of micro-blog public opinion data.
CLC number:	TP393.09
Release date:	2013-06-17
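
Illustrative note on the clustering step: the abstract names a combination of k-means and hierarchical clustering but does not say how the two are combined. One common arrangement, given here only as a sketch and not as the thesis's algorithm, is to over-segment the data with k-means and then merge the resulting centroids agglomeratively. The function names, the parameters `k_init` and `k_final`, and the toy two-dimensional vectors are all assumptions made for this example; in the thesis the inputs would be LDA representations of microblog posts.

```python
from math import dist  # available in Python 3.8+


def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centroids


def merge_centroids(centroids, sizes, target):
    """Agglomerative step: merge the two closest centroid groups until `target` remain."""
    groups = [[j] for j in range(len(centroids))]  # k-means cluster ids per group
    centers, weights = list(centroids), list(sizes)
    while len(groups) > target:
        pairs = ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups)))
        a, b = min(pairs, key=lambda ij: dist(centers[ij[0]], centers[ij[1]]))
        w = weights[a] + weights[b]
        # Size-weighted mean of the two group centres.
        centers[a] = tuple((weights[a] * x + weights[b] * y) / w
                           for x, y in zip(centers[a], centers[b]))
        weights[a] = w
        groups[a] += groups[b]
        del groups[b], centers[b], weights[b]
    return groups


def hybrid_cluster(points, k_init, k_final):
    """Over-segment with k-means (k_init), then merge down to k_final clusters."""
    labels, centroids = kmeans(points, k_init)
    # max(1, ...) avoids a zero weight if k-means left a cluster empty.
    sizes = [max(1, labels.count(j)) for j in range(k_init)]
    groups = merge_centroids(centroids, sizes, k_final)
    remap = {j: g for g, members in enumerate(groups) for j in members}
    return [remap[lab] for lab in labels]


# Toy "topic vectors": two well-separated groups of three posts each.
posts = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15), (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
print(hybrid_cluster(posts, k_init=4, k_final=2))
```

On this toy input the first three vectors end up sharing one label and the last three another. The appeal of such a two-stage design is that k-means keeps the expensive pairwise comparison limited to a handful of centroids, while the hierarchical merge reduces sensitivity to the initial choice of k.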
