Chinese title: | Research and Implementation of Convolution Optimization and Computational-Graph Partitioning with Parallel Scheduling Based on TVM |
Name: | |
Student ID: | 19208207034 |
Confidentiality: | Public |
Thesis language: | chi |
Discipline code: | 085211 |
Discipline: | Engineering - Engineering - Computer Technology |
Student type: | Master's |
Degree: | Master of Engineering |
Degree year: | 2022 |
Institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research area: | Machine learning |
First supervisor: | |
First supervisor's institution: | |
Submission date: | 2022-07-04 |
Defense date: | 2022-06-07 |
English title: | Research and Implementation of Convolution Optimization and Parallel Scheduling Based on TVM |
Chinese keywords: | |
English keywords: | Tensor virtual machine; Memory-efficient convolution; Convolutional neural network; Behavior analysis; Computational graph partition; Parallelization |
Chinese abstract: |
With the development of artificial intelligence technology, the emergence of algorithms such as Convolutional Neural Networks (CNNs) and of diverse hardware platforms has increased the difficulty of deploying and developing these algorithms across platforms. The Tensor Virtual Machine (TVM), a general-purpose neural network compiler, can optimize many types of neural networks and generate highly optimized low-level code for a hardware platform, and has become one of the main optimization and deployment platforms in artificial intelligence. However, performance bottlenecks caused by load and communication overhead make it difficult for devices to fully utilize hardware resources. Based on a behavior analysis of CNN algorithms, this thesis therefore proposes a TVM-based method for convolution optimization and parallel scheduling over partitioned computational graphs.

To mine the branch information and convolution features in CNN algorithms, the behavior patterns of CNN algorithms on TVM are analyzed in depth. First, branch information is extracted from the computational graph by post-order traversal, obtaining the start, end, and internal nodes of each branch, and a feature computational graph is constructed from this information. Second, the memory-access and computation features of convolutions are extracted from the TVM intermediate representation. Finally, the computational graph is partitioned manually according to the extracted branch information. Experiments show that partitioning by branch features achieves an average speedup of 18% over the conventional TVM method, and convolution optimization based on convolution features achieves an average speedup of 20%, demonstrating that the feature information is mined effectively.

To address the long memory-access times of the Memory Efficient Convolution (MEC) algorithm on conventional devices, caused by non-contiguous data addresses, an optimization method suited to MEC's memory-access behavior is proposed. The method has two parts: intermediate-matrix construction and matrix multiplication. First, the intermediate-matrix construction is optimized by changing the data-reading order so that reads match the algorithm's access pattern. Second, for the matrix multiplication, the convolution-kernel matrix is rearranged into a memory layout better suited to matrix multiplication, and the computation between the intermediate matrix and the kernel matrix is redesigned using the compute functions provided by TVM. Finally, the platform's built-in parallel library is used to accelerate the computation. Experiments show average speedups over the original MEC algorithm of 50% on single convolutional layers and 57% on multi-layer neural networks.

To address the facts that computational-graph partitioning in TVM relies on expert experience and uses a single partitioning strategy, a subgraph partitioning method based on branch features is proposed. First, the computational graph is traversed forward using its branch-feature information to find branch start and end nodes, which are split off and stored in an array. Second, subgraphs are constructed from the nodes in the array, and the input/output dependencies between subgraphs are collected and stored. Finally, the dependency information is used to configure each subgraph's inputs and outputs and to select and configure parameters and device information. Experiments with 48 and 96 CPU cores show CNN inference speedups of 20% and 15%, respectively, over TVM's standard execution mechanism, effectively realizing the partitioning of the computational graph.

To solve the problem that a single subgraph cannot execute in parallel in TVM, a branch-parallel method is designed and implemented. First, a directed acyclic graph and a post-order dominator tree are designed to record each node's key, order, and dependencies. Second, this information is used to search for branches, whose nodes are packed into functions and marked as parallel nodes. Third, after the parallel graph is annotated, the computational graph is processed at parallel runtime, covering the inter-branch thread-pool design, data exchange, and execution. Experiments show that, compared with TVM's conventional serial execution, the branch-parallel method improves inference speed by 10% on CPU and 20% on GPU, and by 5% on average over a greedy algorithm, making effective use of hardware resources. |
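The post-order branch extraction described in the abstract can be sketched on a toy graph. This is a minimal illustration only: the node names, edge-list representation, and fan-in/fan-out criteria are assumptions for the sketch, not the thesis's actual TVM pass.

```python
from collections import defaultdict

def extract_branches(edges, root):
    """Post-order walk of a toy computational graph, marking where
    branches start (fan-out > 1) and end (fan-in > 1)."""
    children = defaultdict(list)
    parents = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        parents[dst].append(src)

    order, starts, ends = [], [], []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for child in children[node]:
            visit(child)
        order.append(node)          # post-order position
        if len(children[node]) > 1:
            starts.append(node)     # branch start: output fans out
        if len(parents[node]) > 1:
            ends.append(node)       # branch end: inputs merge

    visit(root)
    return order, starts, ends

# Diamond-shaped block, like an Inception-style branch:
#   conv0 -> {conv1a, conv1b} -> concat
edges = [("conv0", "conv1a"), ("conv0", "conv1b"),
         ("conv1a", "concat"), ("conv1b", "concat")]
order, starts, ends = extract_branches(edges, "conv0")
print(starts, ends)  # ['conv0'] ['concat']
```

The branch start/end pairs found this way delimit the spans that are later cut out as subgraphs.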
English abstract: |
With the development of artificial intelligence technology, the emergence of algorithms such as Convolutional Neural Networks (CNNs) and of diverse hardware platforms has made it harder to deploy and develop these algorithms across platforms. The Tensor Virtual Machine (TVM), a general-purpose neural network compiler, can optimize many types of neural networks and generate highly optimized low-level code for a target hardware platform, and has become one of the main optimization and deployment platforms in artificial intelligence. However, performance bottlenecks caused by load and communication overhead make it difficult for devices to fully utilize hardware resources. Based on a behavior analysis of CNN algorithms, this thesis therefore proposes a TVM-based method for convolution optimization and parallel scheduling over partitioned computational graphs.

To mine the branch information and convolution features of CNN algorithms, their behavior patterns on TVM are analyzed in depth. First, branch information is extracted from the computational graph by post-order traversal, yielding the start, end, and internal nodes of each branch, from which a feature computational graph is constructed. Second, the memory-access and computation features of convolutions are extracted with the help of the TVM intermediate representation. Finally, the computational graph is partitioned manually according to the extracted branch information. Experimental results show that partitioning by branch features achieves an average speedup of 18% over the conventional TVM method, and that convolution optimization based on convolution features achieves an average speedup of 20%, demonstrating that the feature information is mined effectively.

To address the long memory-access times that the Memory Efficient Convolution (MEC) algorithm suffers on conventional devices because of non-contiguous data addresses, an optimization method tailored to MEC's memory-access behavior is proposed. The method has two parts: intermediate-matrix construction and matrix multiplication. First, the intermediate-matrix construction is optimized by changing the data-reading order so that reads match the algorithm's access pattern. Second, for the matrix multiplication, the convolution-kernel matrix is rearranged into a memory layout better suited to matrix multiplication, and the computation between the intermediate matrix and the kernel matrix is redesigned using the compute functions provided by the TVM platform. Finally, the platform's parallel library is used to accelerate the computation. Experimental results show average speedups over the original MEC algorithm of 50% on single convolutional layers and 57% on multi-layer neural networks.

To address the facts that computational-graph partitioning in TVM depends on expert experience and uses a single, inflexible strategy, a subgraph partitioning method based on branch features is proposed. First, the computational graph is traversed forward using its branch-feature information to find branch start and end nodes, which are split off and stored in an array. Second, subgraphs are constructed from the nodes in that array, and the input/output dependencies between subgraphs are collected and stored. Finally, the dependency information is used to configure the inputs and outputs of each subgraph and to select and configure parameters and device information. Experiments with 48 and 96 CPU cores show that CNN inference runs 20% and 15% faster, respectively, than under TVM's standard execution mechanism, effectively realizing the partitioning of the computational graph.

To solve the problem that a single subgraph cannot execute in parallel in TVM, a branch-parallel method is designed and implemented. First, a directed acyclic graph and a post-order dominator tree are designed to record each node's key, order, and dependencies. Second, this information is used to search for branches, whose nodes are packed into functions and labeled as parallel nodes. Third, once the parallel graph is annotated, the computational graph is processed at parallel runtime, covering the design of the inter-branch thread pool, data exchange, and execution. Experimental results show that, compared with TVM's conventional serial execution, the branch-parallel method improves inference speed by 10% on CPU and 20% on GPU, and by 5% on average over a greedy algorithm, making effective use of hardware resources. |
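The MEC lowering whose memory-access behavior the abstract optimizes can be illustrated with a minimal single-channel, stride-1, no-padding sketch. The function name and the plain-Python lists here are assumptions for illustration; the thesis's version operates on TVM compute functions and its parallel library rather than Python loops.

```python
def mec_conv2d(inp, kernel):
    """Minimal sketch of Memory Efficient Convolution (MEC):
    single channel, stride 1, no padding. Illustrative only."""
    H, W = len(inp), len(inp[0])
    k = len(kernel)
    oh, ow = H - k + 1, W - k + 1

    # Lowered intermediate matrix: one row per k-wide vertical strip of
    # the input, read in row-major (contiguous) order -- the access
    # pattern that the data-reading reorder aims to preserve.
    L = [[inp[r][w + c] for r in range(H) for c in range(k)]
         for w in range(ow)]

    # Each output row is a small matrix-vector product over an
    # overlapping slice of L; MEC never materializes the full im2col
    # matrix, which is where the memory saving comes from.
    kvec = [kernel[r][c] for r in range(k) for c in range(k)]
    out = [[sum(L[w][h * k + i] * kvec[i] for i in range(k * k))
            for w in range(ow)]
           for h in range(oh)]
    return out

# 5x5 input, 3x3 all-ones kernel: each output entry is a 3x3 patch sum.
inp = [[5 * i + j for j in range(5)] for i in range(5)]
out = mec_conv2d(inp, [[1] * 3 for _ in range(3)])
print(out[0][0])  # 54: sum of the top-left 3x3 patch
```

The per-row products over overlapping slices of `L` are the small GEMMs that the thesis maps onto TVM's encapsulated compute functions and parallelizes.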
References: |
CLC number: | TP391 |
Release date: | 2022-07-11 |