Thesis Information

Chinese Title:

 图计算中遍历类应用的局部性挖掘与优化研究

Name:

 Feng Ru (冯茹)

Student ID:

 19208049014

Confidentiality Level:

 Public

Thesis Language:

 Chinese (chi)

Discipline Code:

 081202

Discipline:

 Engineering - Computer Science and Technology (Engineering or Science degree) - Computer Software and Theory

Student Type:

 Master's

Degree Level:

 Master of Engineering

Degree Year:

 2022

Institution:

 Xi'an University of Science and Technology

School:

 College of Computer Science and Technology

Major:

 Computer Science and Technology

Research Area:

 Graph Computing

First Supervisor:

 Jiang Lin (蒋林)

First Supervisor's Institution:

 Xi'an University of Science and Technology

Submission Date:

 2022-06-28

Defense Date:

 2022-06-07

English Title:

 Research on Locality Mining and Optimization of Traversal Class Applications in Graph Computing

Chinese Keywords:

 Graph Computing; Traversal Applications; Locality; Depth-Branch Reordering; Doubly Compressed Sparse Row

English Keywords:

 Graph Processing; Traversal Algorithm; Locality; Depth-Branch-Reorder; Doubly Compressed Sparse Row

Chinese Abstract:

 In the era of big data, graph data in the physical world is growing explosively, and applications such as social network analysis, data mining, biological networks, and recommendation systems depend heavily on graph computing systems. However, the unstructured and irregular nature of graph data makes memory accesses highly random and poorly localized when graph computing faces practical problems such as traversal and search. Moreover, graph computing's simple operations and strong real-time requirements yield a high memory-access-to-computation ratio, which not only demands high bandwidth but also wastes much of it. To address this problem and improve the performance of graph computing systems, this thesis proposes a graph data reordering algorithm and a novel Depth-Branch-Reorder Doubly Compressed Sparse Row data format, based on a comprehensive <graph application, graph data> performance-parameter evaluation model, to mitigate the challenges posed by poor locality.

 First, to reveal the differences among graph computing frameworks, a comprehensive <graph application, graph data> performance-parameter evaluation model is constructed, forming a sufficient criterion for framework analysis. Workloads and representative datasets are selected for an in-depth study of the Ligra, Gemini, and GraphBIG frameworks. Focusing on the factors that affect graph computing system performance, including graph data structure information, the concrete implementation of graph applications, and system performance metrics, perf is used to measure and characterize performance on various real-world graphs, from which a locality-based performance evaluation model for graph computing frameworks is built. Experiments show that Gemini performs best in execution time and computation volume, Ligra performs best in data movement, and GraphBIG performs worst, with the highest cache miss rate and execution time.

 Second, to address the highly random, poorly localized data accesses in graph application implementations, a Depth-Branch-Reorder (DBR) algorithm based on locality features is proposed. Guided by the evaluation model, hierarchical-community and deep-community locality mining is performed on traversal applications to uncover the data's latent locality. On this basis, the DBR algorithm is proposed, and its performance is validated against the mining results. Experiments show that the reordering is fastest, and performs best, with the Amazon dataset as input; DBR's time overhead is 36.6% lower than Gorder's and 10.6% lower than SeqDFS's. DBR also markedly improves cache hit rates over both: with the Wiki dataset as input to the BFS application, DBR's L1 cache miss rate is 18.6% better than Gorder's and 13.5% better than SeqDFS's.

 Third, to meet the performance requirements of traversal applications by selecting a suitable data organization format, a new graph data format, DBR_DCSR, is designed. The relationship between traversal applications and different formats, including COO, CSC, CSR, DCSC, and CSCI, is studied. Based on the memory footprint and performance results of these formats, DBR_DCSR is designed and its performance is compared and validated. Experiments show that DCSC effectively improves memory storage efficiency; CSR effectively reduces execution time, data movement, and energy consumption; and DBR_DCSR lowers memory usage by 8.2%~11.9% relative to DCSC, the format with the lowest footprint.

 Finally, to verify the efficiency of the locality optimization scheme, the GraphBIG framework is selected for testing. A test environment is built on a high-performance server, and an overall verification scheme is designed. Performance is compared with typical graph computing frameworks on data movement, computation volume, execution time, and cache miss rate. Experiments show that traversal applications achieve the best locality with DBR_DCSR on GraphBIG. With the CA dataset as input to BFS, GraphBIG_DBR_DCSR reduces execution time by 17.6% versus the DBG algorithm and 8.7% versus the LBR algorithm, and by 56.4% and 11.1% versus the Ligra and Gemini baselines, respectively. It reduces data movement by 7.4% and 28.6%, and cache miss rate by 19.7% and 15.6%, versus the Ligra and Gemini baselines.

English Abstract:

 In the context of big data, the scale of graph data in the physical world has exploded, and applications such as social network analysis, data mining, biological networks, and recommendation systems depend heavily on graph computing systems. However, the unstructured and irregular nature of graph data makes memory accesses highly random and poorly localized when graph computing faces practical problems such as traversal and search. Moreover, the high memory-access-to-computation ratio of graph computing, with its simple operations and strong real-time requirements, not only demands high bandwidth but also wastes much of it. To solve this problem and improve the performance of graph computing systems, this thesis proposes a graph data reordering algorithm and a novel Depth-Branch-Reorder Doubly Compressed Sparse Row data format, based on a comprehensive trade-off performance-parameter evaluation model, to alleviate the challenges posed by poor locality.
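The random-access pattern described above can be made concrete in a few lines. Below is a minimal, illustrative BFS over a graph stored in CSR form; this sketch is not taken from the thesis, and the array names `indptr`/`indices` are just the conventional CSR names. Each visited vertex pulls neighbor IDs that may point anywhere in the vertex range, so the reads and writes on the distance array are effectively random:

```python
from collections import deque

def bfs(indptr, indices, src):
    """BFS over a graph in CSR form (indptr/indices arrays).

    Neighbor IDs in `indices` can point anywhere in the vertex
    range, so the per-neighbor accesses to `dist` below are
    scattered reads/writes -- the poor-locality pattern that
    traversal applications exhibit on large graphs.
    """
    n = len(indptr) - 1
    dist = [-1] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in indices[indptr[u]:indptr[u + 1]]:  # scattered accesses
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Toy directed graph: 0->1, 0->2, 1->3
indptr = [0, 2, 3, 3, 3]
indices = [1, 2, 3]
print(bfs(indptr, indices, 0))  # [0, 1, 1, 2]
```

On power-law graphs with millions of vertices, these scattered accesses are what drive the cache miss rates measured in the thesis.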

 First, to reveal the differences among graph computing frameworks, a comprehensive trade-off performance-parameter evaluation model is constructed, forming a sufficient criterion for framework analysis. Workloads and representative datasets are selected for an in-depth study of the Ligra, Gemini, and GraphBIG frameworks. Focusing on the factors that affect graph computing system performance, including graph data structure information, the concrete implementation of graph applications, and system performance metrics, perf is used to measure and characterize performance on various real-world graphs, from which a locality-based performance evaluation model for graph computing frameworks is built. The experimental results show that Gemini performs best in execution time and computation volume, Ligra performs best in data movement, and GraphBIG performs worst, especially in cache miss rate and execution time.

 Second, to address the highly random, poorly localized data accesses in graph application implementations, a deep branch reordering (DBR) algorithm based on locality features is proposed. Guided by the constructed evaluation model, hierarchical-community and deep-community locality mining is carried out on traversal applications to uncover the data's latent locality. On this basis, the DBR algorithm is proposed, and its performance is compared and analyzed against the mining results. The experimental results show that the reordering is fastest, and performs best, with the Amazon dataset as input; the time overhead of DBR is 36.6% lower than that of Gorder and 10.6% lower than that of SeqDFS. DBR also significantly improves the cache hit rate over both: with the Wiki dataset as input to the BFS application, DBR's L1 cache miss rate is 18.6% better than Gorder's and 13.5% better than SeqDFS's.
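The abstract does not reproduce DBR's steps, so as a hedged illustration, here is a plain DFS-order relabeling in the spirit of the SeqDFS baseline it compares against: vertices discovered consecutively along a branch receive adjacent new IDs, which is the basic mechanism by which depth-first reordering improves locality. DBR's additional depth/branch weighting is not modeled here.

```python
def dfs_reorder(indptr, indices):
    """Relabel vertices of a CSR graph in DFS discovery order.

    Simplified sketch of DFS-based reordering (in the spirit of
    SeqDFS); the thesis's DBR algorithm additionally considers
    branch depth, which this sketch does not model. Vertices
    visited consecutively along a deep branch get adjacent new
    IDs, so their property data lands in nearby cache lines.
    """
    n = len(indptr) - 1
    new_id = [-1] * n
    nxt = 0
    for root in range(n):           # cover disconnected components
        if new_id[root] != -1:
            continue
        stack = [root]
        while stack:
            u = stack.pop()
            if new_id[u] != -1:
                continue
            new_id[u] = nxt
            nxt += 1
            # Reversed push so lower-numbered neighbors are explored first.
            for v in reversed(indices[indptr[u]:indptr[u + 1]]):
                if new_id[v] == -1:
                    stack.append(v)
    return new_id

# Toy directed graph: 0->1, 0->2, 1->3
indptr = [0, 2, 3, 3, 3]
indices = [1, 2, 3]
print(dfs_reorder(indptr, indices))  # [0, 1, 3, 2]
```

Note how DFS follows the deep branch 0→1→3 before visiting vertex 2, so vertex 3 is relabeled 2 and vertex 2 is relabeled 3: the deep branch becomes contiguous in the new ID space.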

 Third, to meet the performance requirements of traversal applications by selecting an appropriate data organization format, a new graph data organization format, DBR_DCSR, is designed. The relationship between traversal applications and different formats, including COO, CSC, CSR, DCSC, and CSCI, is studied. Based on the memory usage and performance test results of these formats, DBR_DCSR is designed, and its performance is compared, verified, and analyzed. The experimental results show that DCSC effectively improves memory storage efficiency; CSR effectively reduces program execution time, data movement, and energy consumption; and DBR_DCSR reduces memory usage by 8.2%~11.9% compared with DCSC, the format with the lowest footprint.
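As background for the formats compared above, the following sketch converts CSR to a doubly compressed form by dropping empty rows, which is the core idea behind DCSR/DCSC: keep an explicit list of non-empty row IDs plus a shortened offset array. This is an illustrative simplification, not the thesis's DBR_DCSR (which additionally applies the DBR reordering first):

```python
def to_dcsr(indptr, indices):
    """Compress a CSR structure into a DCSR-style structure.

    DCSR stores only the non-empty rows: `rows` lists their IDs
    and `new_indptr` holds offsets over just those rows, while the
    column indices are unchanged. This saves memory when most
    vertices have no out-edges, as is common in power-law graphs.
    """
    rows, new_indptr = [], [0]
    for u in range(len(indptr) - 1):
        deg = indptr[u + 1] - indptr[u]
        if deg > 0:
            rows.append(u)
            new_indptr.append(new_indptr[-1] + deg)
    return rows, new_indptr, indices

# 5 vertices, but only vertices 0 and 3 have out-edges:
indptr = [0, 2, 2, 2, 3, 3]
indices = [1, 4, 2]
rows, dptr, dind = to_dcsr(indptr, indices)
print(rows, dptr)  # [0, 3] [0, 2, 3]
```

Here the six-entry `indptr` shrinks to a three-entry offset array plus a two-entry row list; on graphs where empty rows dominate, this is where the memory savings reported for DCSC/DCSR-style formats come from.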

 Finally, to verify the efficiency of the locality optimization scheme, the GraphBIG framework is selected for testing and verification. A test environment is built on a high-performance server, and an overall verification scheme is designed. Performance is compared with typical graph computing frameworks on data movement, computation volume, execution time, and cache miss rate. The experimental results show that traversal applications achieve the best locality with the DBR_DCSR scheme on GraphBIG. With the CA dataset as input to the BFS application, GraphBIG_DBR_DCSR reduces execution time by 17.6% compared with the DBG algorithm and by 8.7% compared with the LBR algorithm, and by 56.4% and 11.1% compared with the Ligra and Gemini baselines, respectively. In data movement, GraphBIG_DBR_DCSR is 7.4% and 28.6% lower than the Ligra and Gemini baselines; in cache miss rate, it is 19.7% and 15.6% lower.


CLC Number:

 TP399

Open Access Date:

 2022-06-29
