Thesis Information

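Thesis title (Chinese):	蒙特卡洛方法的k-sigma算法在异常值检测中的应用
Author:	Chen Xingyan (陈星言)
Student ID:	20201221057
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	025200
Discipline:	Economics - Applied Statistics
Student type:	Master's
Degree level:	Master of Economics
Degree year:	2023
Degree-granting institution:	Xi'an University of Science and Technology
School:	College of Science
Major:	Applied Statistics
Research direction:	Information statistics technology
First supervisor:	Ding Zhengsheng (丁正生)
First supervisor's institution:	Xi'an University of Science and Technology
Thesis submission date:	2023-06-13
Thesis defense date:	2023-06-01
Thesis title (English):	The application of k-sigma algorithm of Monte Carlo method in outlier detection
Keywords (Chinese):	异常值检测 (outlier detection) ; 蒙特卡洛方法 (Monte Carlo method) ; k-sigma算法 (k-sigma algorithm) ; ARIMA算法 (ARIMA algorithm) ; 碳足迹 (carbon footprint)
Keywords (English):	Outlier detection ; Monte Carlo method ; k-sigma algorithm ; ARIMA algorithm ; carbon footprint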
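Abstract (translated from Chinese):	As industrial technology becomes more intelligent, implementing the dual-carbon strategy requires quantifying the carbon footprint, so that monitoring it can balance economic growth against carbon neutrality. Energy management databases tend to accumulate massive time-series data streams, and obtaining trustworthy carbon footprint data from them makes outlier detection and replacement especially important. The traditional k-sigma and ARIMA algorithms are theoretically mature and widely used, but industrial time-series data carry multi-dimensional indicators, and the traditional algorithms cannot take the correlations among those indicators into account. To make outlier detection more effective and practical, this thesis labels outliers in a carbon footprint data set with a k-sigma algorithm based on the Monte Carlo method. A normal distribution is built from the mean and variance of the original data points, and a large number of draws from it form a Monte Carlo sample set. The Mahalanobis distance from each Monte Carlo sample to the mean is computed and compared with the Mahalanobis distance from each sample under test in the original data set to the mean; the parameter k is adjusted adaptively according to the number of points in the Monte Carlo sample set, and outliers are identified on that basis. In the experiments, the algorithm is implemented in Python and compared with the traditional 3-sigma method and with machine learning algorithms, and the results confirm its effectiveness. The Monte Carlo k-sigma algorithm can flag runs of consecutive outliers; in the carbon footprint data set, the data both before and after such a run are available for reference. The thesis therefore extends ARIMA to a bidirectional-propagation ARIMA algorithm and uses it to replace the flagged outliers. The experiments quantify the error between the true values and the replacement values predicted by each of the two ARIMA algorithms. On the experimental data set, the bidirectional-propagation ARIMA algorithm yields smaller errors than the traditional ARIMA algorithm, demonstrating the feasibility and advantage of the improved outlier prediction algorithm.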
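The detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact rule by which k is adapted, so the sketch assumes one plausible reading: k is taken as the empirical (1 - alpha) quantile of the Mahalanobis distances of the Monte Carlo samples. The function names, the parameters `n_samples` and `alpha`, the restriction to two-dimensional data, and the synthetic demo data are all illustrative assumptions, not the thesis's implementation.

```python
import math
import random


def mean_and_cov(points):
    """Sample mean and unbiased 2x2 covariance of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(p, mean, cov):
    """Mahalanobis distance of p from mean, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


def draw_normal(mean, cov, rng):
    """One draw from N(mean, cov) via the Cholesky factor [[a, 0], [b, c]]."""
    sxx, sxy, syy = cov
    a = math.sqrt(sxx)
    b = sxy / a
    c = math.sqrt(syy - b * b)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + a * z1, mean[1] + b * z1 + c * z2)


def mc_k_sigma_outliers(points, n_samples=20000, alpha=0.01, seed=0):
    """Flag points whose Mahalanobis distance exceeds a Monte Carlo threshold k.

    ASSUMPTION: k is the empirical (1 - alpha) quantile of the distances of
    the Monte Carlo samples; the thesis abstract does not give the exact rule.
    """
    rng = random.Random(seed)
    mean, cov = mean_and_cov(points)
    mc_dist = sorted(
        mahalanobis(draw_normal(mean, cov, rng), mean, cov)
        for _ in range(n_samples)
    )
    idx = min(n_samples - 1, int((1.0 - alpha) * n_samples))
    k = mc_dist[idx]
    flags = [mahalanobis(p, mean, cov) > k for p in points]
    return k, flags


# Synthetic demo: 200 positively correlated points plus one injected point
# that breaks the correlation pattern.
demo_rng = random.Random(1)
data = []
for _ in range(200):
    x = demo_rng.gauss(0.0, 1.0)
    data.append((x, 0.8 * x + 0.6 * demo_rng.gauss(0.0, 1.0)))
data.append((3.0, -3.0))

k, flags = mc_k_sigma_outliers(data)
print(f"adaptive k = {k:.2f}, injected point flagged: {flags[-1]}")
```

The injected point lies roughly three standard deviations from the mean in each coordinate taken separately, yet its Mahalanobis distance is much larger because it contradicts the positive correlation. This is the kind of case the abstract argues a purely univariate k-sigma rule cannot handle well.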
Abstract (English):	In the context of intelligent industrial technology, implementing the dual-carbon strategy requires quantifying the carbon footprint and, by monitoring it, balancing economic growth with carbon neutrality. Energy management databases often produce massive time-series data streams. To obtain more realistic and reliable carbon footprint data during analysis, outlier detection and replacement become particularly critical. The traditional k-sigma and ARIMA algorithms are both well established and widely used. However, time-series data in industrial production have multi-dimensional indicators, and the traditional algorithms cannot account for the correlations in the data. To improve the validity and practicability of outlier detection, this paper uses a Monte Carlo k-sigma algorithm to label the outliers of the carbon footprint data set. A normal distribution is generated from the mean and variance of the original data points, and a large number of samples are drawn from it to form a Monte Carlo sample set. The Mahalanobis distance between each point in the Monte Carlo sample set and the mean is calculated and compared with the Mahalanobis distance between each sample to be tested in the original data set and the mean. The value of the parameter k is adjusted adaptively according to the number of sample points in the Monte Carlo sample set, and the outliers are determined accordingly. In the experimental stage, the algorithm is implemented in Python and compared with the traditional 3-sigma method and machine learning algorithms, and the results verify its effectiveness.
For the consecutive outliers identified by the Monte Carlo k-sigma algorithm in the preceding part, the data before and after these outliers in the carbon footprint data set can both be used for reference. Therefore, this paper extends the ARIMA algorithm to a bidirectional-propagation ARIMA algorithm and uses it to replace the outliers labeled by the Monte Carlo k-sigma algorithm. In the experimental stage, the errors between the true values and the replacement values predicted by the two ARIMA algorithms are quantified. The prediction results on the experimental data set show that the error of the bidirectional-propagation ARIMA algorithm is smaller than that of the traditional ARIMA algorithm, which demonstrates the feasibility and superiority of the improved outlier prediction algorithm.
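The idea of replacing a run of consecutive outliers using data on both sides can be sketched as follows. This is not the thesis's algorithm: to stay dependency-free, a least-squares AR(1) model stands in for a full ARIMA fit, and because the abstract does not say how the forward and backward predictions are combined, the sketch assumes a simple distance-weighted average. All function names here are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    phi = cov / var
    return mean_next - phi * mean_prev, phi


def forecast(series, steps):
    """Iterate the fitted AR(1) recursion `steps` times past the end of series."""
    c, phi = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


def bidirectional_fill(before, after, gap):
    """Estimate `gap` missing values lying between `before` and `after`.

    ASSUMPTION: the forward and backward forecasts are blended with weights
    proportional to closeness to the side each forecast starts from.
    """
    fwd = forecast(before, gap)              # predicts left to right
    bwd = forecast(after[::-1], gap)[::-1]   # predicts right to left, then realigned
    filled = []
    for i in range(gap):
        w_bwd = (i + 1) / (gap + 1)          # grows toward the right edge of the gap
        filled.append((1.0 - w_bwd) * fwd[i] + w_bwd * bwd[i])
    return filled


# Demo on a noiseless linear trend with three flagged values removed.
before = [float(t) for t in range(10)]       # 0 .. 9
after = [float(t) for t in range(13, 23)]    # 13 .. 22
print(bidirectional_fill(before, after, 3))
```

A forward-only forecast accumulates error as it moves away from the last trusted observation. Blending in a forecast made from the data after the gap anchors the right-hand end as well, which is the intuition behind the smaller errors the abstract reports for the bidirectional variant.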
Chinese Library Classification (CLC) number:	C81
Open access date:	2023-06-14
