Thesis Information

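Thesis title (Chinese):	基于Spark不平衡数据分类算法的研究
Name:	Wang Peng (王朋)
Student ID:	201508389
Student type:	Master of Engineering
Degree year:	2018
School:	School of Computer Science and Technology
Major:	Software Engineering
Research area:	Machine learning and big data
First advisor:	Li Junmin
Thesis title (foreign language):	Research on the classification algorithm of unbalance data based on Spark
Keywords (Chinese, translated):	Imbalanced data ; Spark ; Clustering ; Equal proportion ; Sampling
Keywords (foreign language):	Unbalanced data ; Spark ; Cluster ; Equal Proportion ; Sampling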
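Abstract (Chinese, translated):	︿With the growth of the Internet, data is increasing explosively, and people need to mine valuable information from it; classification is the most basic method for doing so. Imbalanced data classification refers to the setting in which the class sizes in the data differ widely, so that the classifier recognizes minority-class samples poorly. As data volume grows, the number of minority-class samples also grows. In a single-machine environment, traditional classification and clustering algorithms often must iterate many times until the error is small enough before stopping, and they sometimes cannot cope with classifying large volumes of imbalanced data. To address the classifier's poor recognition of minority-class samples in large-scale imbalanced data, this thesis studies the following. It first proposes a Spark-based within-class sampling classification method. After analyzing the shortcomings of random undersampling, the method clusters the majority-class samples to capture their overall characteristics. Among the resulting clusters, the largest is clustered a second time, and the smallest is discarded if it accounts for a very small proportion of the majority class. Samples are then drawn from each cluster in proportion to its size and combined with the minority-class samples to form a balanced data set. Finally, the support vector machine algorithm in Spark MLlib is used for classification. Experiments show that the Spark-based within-class sampling method recognizes minority-class samples better than random undersampling. When the imbalance ratio increases, however, the classification effect of the Spark-based within-class sampling method is not pronounced. The thesis therefore further proposes a Spark-based between-class and within-class sampling classification method. This method first clusters the data set and discards clusters whose ratio of majority-class to minority-class samples falls below the threshold of 1. Next, based on the majority-to-minority ratio in each cluster, it computes proportionally the number of majority-class samples to draw from each cluster. The majority-class samples in each cluster are then clustered into sub-clusters, and majority-class samples are drawn from each sub-cluster in proportion to its size. Finally, these are combined with the minority-class samples to form a balanced data set, which is classified with the decision tree algorithm in Spark MLlib. Experiments show that the Spark-based between-class and within-class sampling classification method recognizes minority-class samples better than the Spark-based within-class sampling classification method.﹀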
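
The within-class sampling step described in the abstract could be sketched roughly as follows. This is a minimal single-machine illustration in plain Python, not the thesis's Spark implementation. It assumes the majority-class samples have already been assigned cluster labels (for example by k-means), and it omits the second clustering of the largest cluster. The function names, the `min_share` cutoff, and the largest-remainder rounding are hypothetical choices made for this sketch.

```python
import random
from collections import defaultdict


def proportional_quotas(sizes, total):
    """Split `total` draws across clusters in proportion to their sizes.

    Uses integer largest-remainder rounding (an assumption of this sketch)
    so that the quotas always add up to exactly `total`.
    """
    n = sum(sizes.values())
    parts = {c: divmod(total * s, n) for c, s in sizes.items()}
    quotas = {c: q for c, (q, _) in parts.items()}
    leftover = total - sum(quotas.values())
    # Give the remaining draws to the clusters with the largest remainders.
    by_remainder = sorted(parts, key=lambda c: parts[c][1], reverse=True)
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas


def within_class_undersample(majority, labels, n_minority, min_share=0.01, seed=0):
    """Undersample the majority class cluster by cluster.

    `labels[i]` is the (precomputed) cluster of `majority[i]`.
    `min_share` is a hypothetical cutoff for "a very small proportion".
    """
    clusters = defaultdict(list)
    for sample, label in zip(majority, labels):
        clusters[label].append(sample)

    # Discard the smallest cluster only if its share of the majority class is tiny.
    smallest = min(clusters, key=lambda c: len(clusters[c]))
    if len(clusters) > 1 and len(clusters[smallest]) / len(majority) < min_share:
        del clusters[smallest]

    quotas = proportional_quotas({c: len(m) for c, m in clusters.items()}, n_minority)
    rng = random.Random(seed)
    picked = []
    for c, members in clusters.items():
        # Never ask for more samples than the cluster holds.
        picked.extend(rng.sample(members, min(quotas[c], len(members))))
    return picked
```

The returned majority-class subset would then be combined with the minority-class samples to form the balanced training set passed to the classifier.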
Abstract (foreign language):	︿ With the development of the Internet, data is growing explosively. People need to extract valuable information from it, and classification is one of the most basic methods for doing so. Unbalanced data classification refers to data in which the class sizes differ widely, so that the classifier recognizes minority-class samples poorly. As the amount of data increases, the number of minority-class samples also increases. In a single-machine environment, traditional classification and clustering algorithms often need to iterate many times until the error is small enough, and they sometimes cannot handle the classification of large volumes of unbalanced data. To address the poor recognition of minority-class samples in large-scale unbalanced data, this thesis studies the following. A within-class sampling classification method based on Spark is proposed. First, the defects of random undersampling are analyzed, and the majority-class samples are clustered to capture their overall characteristics. Among the resulting clusters, the largest is clustered a second time. The smallest is discarded if it accounts for a very small proportion of the majority-class samples. Then samples are drawn from each cluster in proportion to its size, and a balanced data set is formed together with the minority-class samples. Finally, the support vector machine algorithm in Spark MLlib is used to classify them. Experiments show that the within-class sampling classification method based on Spark recognizes minority-class samples better than random undersampling. When the imbalance ratio increases, however, the classification effect of within-class sampling based on Spark is not pronounced.
Therefore, a between-class and within-class sampling classification method based on Spark is further proposed. It first clusters the data set and discards the clusters whose ratio of majority-class to minority-class samples is below the threshold of 1. Second, the number of majority-class samples to draw from each cluster is calculated proportionally from the ratio of majority-class to minority-class samples in that cluster. Then the majority-class samples in each cluster are clustered into sub-clusters, and majority-class samples are drawn from each sub-cluster in proportion to its size. Finally, a balanced data set is formed together with the minority-class samples, and the decision tree algorithm in Spark MLlib is used to classify them. Experiments show that the between-class and within-class sampling classification method recognizes minority-class samples better than the within-class sampling classification method based on Spark. ﹀
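
The between-class step of the second method, which decides how many majority-class samples each cluster contributes, might be sketched as below. The abstract does not give the exact formula, so this sketch rests on several assumptions: each cluster's quota is proportional to its majority-to-minority ratio, the quotas sum to a caller-supplied `total`, clusters with no minority samples are treated as having one (to avoid division by zero), and rounding uses the largest-remainder rule. The function name and the `"maj"`/`"min"` class labels are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def between_class_quotas(cluster_ids, classes, total, threshold=1):
    """Return {cluster: number of majority-class samples to draw}.

    `cluster_ids[i]` is the cluster of sample i and `classes[i]` is
    "maj" or "min". Clusters whose majority-to-minority ratio is below
    `threshold` are discarded, following the abstract.
    """
    maj = Counter(c for c, y in zip(cluster_ids, classes) if y == "maj")
    mino = Counter(c for c, y in zip(cluster_ids, classes) if y == "min")

    # Assumed guard: a cluster with no minority samples counts as having one.
    ratios = {c: Fraction(maj[c], max(mino[c], 1)) for c in maj}
    kept = {c: r for c, r in ratios.items() if r >= threshold}

    # Assumed rule: quotas proportional to the ratio, rounded by largest remainder.
    weight_sum = sum(kept.values())
    exact = {c: total * r / weight_sum for c, r in kept.items()}
    quotas = {c: int(e) for c, e in exact.items()}  # floor, values are non-negative
    leftover = total - sum(quotas.values())
    by_fraction = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in by_fraction[:leftover]:
        quotas[c] += 1

    # A cluster cannot contribute more majority samples than it contains.
    return {c: min(q, maj[c]) for c, q in quotas.items()}
```

Within each retained cluster, the quota would then be spread over the majority-class sub-clusters in proportion to their sizes, as the abstract describes for the within-class stage.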
Chinese Library Classification (CLC) number:	TP301.6
Open-access date:	2018-06-19
