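Thesis title (translated from Chinese): | Research on the Application of a Hadoop-Based Cloud Platform to Massive Web Data Analysis |
Name: | |
Student ID: | G09028 |
Confidentiality level: | Public |
Discipline code: | 085208 |
Discipline name: | Electronics and Communication Engineering |
Student type: | Master of Engineering |
Degree year: | 2014 |
Department: | |
Major: | |
First supervisor: | |
Second supervisor: | |
Thesis title (foreign language): | Research and Application of the Massive Web Data Analysis Based on Hadoop |
Keywords (Chinese): | |
Keywords (foreign language): | Cloud Platform; Massive Data; Hadoop; Data Analysis; Web |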
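Thesis abstract (translated from Chinese): |
With social progress and the development of Internet technology, the volume of network data keeps growing, and the Web has become the world's largest data warehouse. Enterprises and individuals alike face the problem of managing massive Web data effectively. Traditional data processing methods suffer from high cost, low reliability, and the difficulty of writing parallel programs, among other drawbacks. The open-source Hadoop parallel processing framework can manage massive Web data effectively, reliably, and intelligently.
To improve on the time and space efficiency of traditional single-node analysis and mining of massive Web data, this thesis surveys the state of research and the development trends of the open-source Hadoop cloud computing platform in China and abroad. Building on the Hadoop Distributed File System (HDFS) and the Map/Reduce programming model, it studies performance indicators for massive Web logs and the conversion of a Web mining algorithm to Map/Reduce, designs the architecture of a massive Web data analysis system, sets up a Hadoop development platform, and implements a distributed massive Web data analysis system. The system integrates data and applications, connects to Eclipse through the Hadoop application programming interface (API), and uses Maven to manage and build the Hadoop project so that operations can be shared between tasks.
A system test platform consisting of a four-node Hadoop cluster was set up on virtual machines. The tests analyzed shell-script processing on this system and on a traditional system, gathered statistics on the collection of Web log data on the Hadoop platform and on its key performance indicators (KPIs), and exercised a parallel program for the item-based collaborative filtering algorithm. The results show that the system effectively improves the time and space efficiency of analyzing and mining massive Web data.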
|
Thesis abstract (foreign language): |
With the rapid development of society and Internet technology, the amount of network data keeps increasing, and the Web has become the world's largest data warehouse. Both enterprises and individuals face the problem of how to manage massive Web data effectively. Traditional data processing methods have many disadvantages, for example high cost, low reliability, and the difficulty of writing parallel processing programs. The open-source Hadoop parallel processing framework can manage massive Web data effectively, reliably, and intelligently.
In order to improve on the time and space efficiency of a traditional single node in massive Web data analysis and mining, this paper analyzes the research status and development trends of Hadoop cloud computing platform technology at home and abroad. Based on the open-source Hadoop Distributed File System (HDFS) and the Map/Reduce programming model, it researches the performance indicators of massive Web logs and the Map/Reduce conversion of a Web mining algorithm, designs the architecture of a massive Web data analysis system, builds a Hadoop development platform, and implements a distributed massive Web data analysis system. The system integrates data and applications, connects to Eclipse via the application programming interface (API) of Hadoop, and uses Maven to manage and build the Hadoop project in order to share operations between tasks.
This paper builds a four-node Hadoop cluster on virtual machines as the system test platform. It analyzes shell-script processing in this system and in a traditional system, analyzes the collection of Web log data and its key performance indicators (KPIs) on the Hadoop platform, and completes the testing of a parallel program for the item-based collaborative filtering algorithm. The test results show that the system effectively improves the time and space efficiency of massive Web data analysis and mining.
|
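The abstract mentions computing Web-log KPIs with the Map/Reduce programming model but does not give the job itself. The following is a minimal in-memory sketch of that idea, not the thesis's implementation: it simulates the map, shuffle, and reduce phases in plain Python without calling Hadoop. The log layout (Common Log Format style), the choice of page views per path as the KPI, and every function name are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed log layout (Common Log Format style); the record does not specify one.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3})')


def map_page_view(line):
    """Map step: emit (path, 1) for every well-formed request with a 2xx status."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # skip malformed lines instead of failing the whole job
    _ip, path, status = match.groups()
    if not status.startswith("2"):
        return []
    return [(path, 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_sum(key, values):
    """Reduce step: total the counts for one key."""
    return key, sum(values)


def run_job(lines, mapper, reducer):
    """Run map -> shuffle -> reduce over an iterable of log lines."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


def page_views(lines):
    """Hypothetical KPI: number of successful requests per path."""
    return run_job(lines, map_page_view, reduce_sum)
```

On a real cluster the same mapper and reducer logic would run in parallel over HDFS blocks; this sketch only shows the data flow between the phases.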
Chinese Library Classification number: | TP393 |
Open-access date: | 2014-06-17 |
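The abstract names item-based collaborative filtering as the mining algorithm that was parallelized but does not describe the formulation used. The sketch below assumes the common co-occurrence variant, in which an item's score for a user is the sum, over the items the user has rated, of the co-occurrence count multiplied by the rating. It runs in memory in plain Python; the function names and the scoring rule are illustrative assumptions, not the thesis's program.

```python
from collections import defaultdict
from itertools import product


def group_by_user(ratings):
    """Phase 1: collect each user's {item: rating} from (user, item, rating) rows."""
    by_user = defaultdict(dict)
    for user, item, score in ratings:
        by_user[user][item] = score
    return by_user


def cooccurrence(by_user):
    """Phase 2: count how many users rated each ordered pair of items."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in by_user.values():
        for a, b in product(items, repeat=2):
            counts[a][b] += 1
    return counts


def recommend(ratings, user, top_n=3):
    """Phase 3: score unseen items as sum(co-occurrence count * user's rating)."""
    by_user = group_by_user(ratings)
    counts = cooccurrence(by_user)
    seen = by_user.get(user, {})
    scores = defaultdict(float)
    for item, rating in seen.items():
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count * rating
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Each phase groups records by a key (user, then item pair, then target item), which is what makes this formulation straightforward to express as a chain of Map/Reduce jobs.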