查看论文信息

免费浏览

查看论文信息

论文中文题名：	基于数据流的相关挖掘方法研究
姓名：	周芬芬
学号：	201208376
学生类型：	工程硕士
学位年度：	2015
院系：	计算机科学与技术学院
专业：	计算机技术
第一导师姓名：	杨君锐
论文外文题名：	Related mining methods research based on data streams
论文中文关键词：	数据挖掘 ; 数据流 ; 最大频繁模式 ; 密度网格 ; 聚类
论文外文关键词：	Data mining ; data stream ; maximal frequent patterns ; Density grid ; Clustering
论文中文摘要：	︿科技的进步使得数据的产生和搜集更加容易，如何从大量数据中获取有用的信息，以指导人类的某些生产过程和行为是数据分析领域的重要研究课题。作为数据分析的一种有效手段，数据挖掘技术能够发现隐含在大量数据中的有趣知识，而数据流的出现给相关处理技术提出了更高的要求。数据流具有流动性、无限性和高速性等一系列不同于传统数据的新特点，这要求相关的挖掘算法必须是高速增量的，并且能够利用有限的内存在一定误差范围内得到有效的挖掘结果。本文针对数据流分析处理中的几个基本问题，在广泛查阅国内外文献的基础上进行了研究和分析。主要内容集中在以下两点。首先，针对数据流环境，提出了一种有效的最大频繁模式挖掘算法DSM-Miner。算法中使用事务滑动窗口来指定每次处理的事务个数，并以衰减方式对新旧事务加以区别和对待，同时，在改进经典FP-Tree结构的基础上提出了一种滑动窗口最大频繁模式树SWM-Tree，并通过动态更新SWM-Tree对模式进行增量存储与维护。在最大频繁模式挖掘过程中采用了以对应节点为根的枚举树作为搜索空间，加之适当的剪枝和位项组运算方式，以及“深度优先”搜索策略。通过多个不同数据集下的测试，证明了DSM-Miner算法在时间复杂度、空间复杂度以及可扩展性这三个方面的有效性。其次，提出了一种基于密度网格的数据流聚类算法DSCBDG。DSCBDG借鉴了经典算法CluStream的两层架构，将聚类过程划分成在线和离线两个子过程。在线部分采用密度网格实时维护数据流概要信息，利用动态划分网格技术对网格单元进行定期更新，并利用改进的金字塔时间模型，在快照对应的时刻实时维护数据流概要信息。离线部分首先取出相应快照中存储的网格表，并采用可变的密度阈值对网格进行判断分析，通过合并相邻密集网格得到初始的微型簇。最后以微型簇为顶点，簇间距离为加权边构建连通图，生成最小生成树。通过保留最短距离的边获得最终的聚类结果。﹀
论文外文摘要：	︿ The advances in technology make it easier to generate and collect data, how to get useful information from large amounts of data to guide some production process and behavior of human beings is becoming a significant research in data analysis. As an effective means of data analysis, data mining can discover interesting knowledge from large amounts of data, while the emergence of data stream has put forward higher requirements in the related processing technology. Different from the traditional data, data stream has a series of new features such as liquidity, unlimited and high speed, which decides that the mining algorithm must be high speed increment, and also can use the limited memory to get mining results effectively within a certain error range. This paper mainly did some research and analysis based on a wide range of literature review to several basic problems of data stream analysis and processing, and the main contents focus on the following points. Firstly, based on data stream environment, an effective algorithm DSM-Miner for mining maximal frequent patterns was proposed. It uses Transactions Sliding Window to specify the number of transactions in each treatment process, and distinguishes and treats the old and new transactions by the way of decaying, meanwhile it proposes a sliding window maximal frequent pattern tree SWM-Tree based on the improved classical FP-Tree structure, incrementally maintains and stores patterns by updating SWM-Tree dynamically. In the process of mining maximal frequent patterns, the algorithm uses the corresponding node of SWM-Tree as the root of an enumeration tree and uses this enumeration tree as a search space. In addition, the algorithm also adopts appropriate pruning operations, calculation method of bit items group and "depth-first" search strategies. By testing several different data sets, the validity of time complexity, space complexity and scalability for DSM-Miner algorithm was be proved. Secondly, an effective algorithm DSCBDG for clustering over data stream was proposed by using the method of Density Grid. The DSCBDG references to the Two-Tier structure of the CluStream which is a classical algorithm, and divide the clustering process into two sub-processes: online and offline. The online part uses Density Grid to maintain the summary information of data stream in real time, and updates the grid regularly by mesh dynamically. Then uses the model of modified pyramid-time to maintains grid information in real time on the moment of snapshot. The offline part firstly takes out the grid table in the corresponding snapshots, then uses the variable density threshold to judges and analyzes the grid, getting the initial tiny clusters by merging adjacent grids. Finally, uses the clusters as vertexes, and distance between clusters as edge-weighted to build a connected graph, then generates minimum spanning tree. The final clustering result is obtained by retaining the edge of the shortest distance. ﹀
中图分类号：	TP311.13
开放日期：	2015-06-18

附件下载