Chinese title: | Research and Implementation of Convolution Optimization and Computational-Graph Partitioning with Parallel Scheduling Based on TVM |
Name: | |
Student ID: | 19208207034 |
Confidentiality: | Public |
Thesis language: | chi |
Discipline code: | 085211 |
Discipline: | Engineering - Engineering - Computer Technology |
Student type: | Master's |
Degree: | Master of Engineering |
Degree year: | 2022 |
Institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research area: | Machine learning |
First supervisor: | |
First supervisor's institution: | |
Submission date: | 2022-07-04 |
Defense date: | 2022-06-07 |
English title: | Research and Implementation of Convolution Optimization and Parallel Scheduling Based on TVM |
Chinese keywords: | |
English keywords: | Tensor virtual machine; Memory-efficient convolution; Convolutional neural network; Behavior analysis; Computational graph partition; Parallelization |
Chinese abstract: |
With the development of artificial intelligence technology, the emergence of algorithms such as Convolutional Neural Networks (CNNs) and of diverse hardware platforms has increased the difficulty of deploying and developing these algorithms across platforms. The Tensor Virtual Machine (TVM), a general-purpose neural network compiler, can optimize many types of neural networks and generate highly optimized low-level code for a hardware platform, and has become one of the main optimization and deployment platforms in artificial intelligence. However, performance bottlenecks caused by load and communication overhead make it difficult for devices to fully utilize hardware resources. Based on a behavior analysis of CNN algorithms, this thesis therefore proposes a TVM-based method for convolution optimization and parallel scheduling over partitioned computational graphs.

To mine the branch information and convolution features in CNN algorithms, the behavior patterns of CNN algorithms on TVM are analyzed in depth. First, branch information is extracted from the computational graph by post-order traversal, obtaining the start, end, and internal nodes of each branch, and a feature computational graph is constructed from this information. Second, the memory-access and computation features of convolutions are extracted from the TVM intermediate representation. Finally, the computational graph is partitioned manually according to the extracted branch information. Experiments show that partitioning by branch features achieves an average speedup of 18% over the conventional TVM method, and convolution optimization based on convolution features achieves an average speedup of 20%, demonstrating that the feature information is mined effectively.

To address the long memory-access times of the Memory Efficient Convolution (MEC) algorithm on conventional devices, caused by non-contiguous data addresses, an optimization method suited to MEC's memory-access behavior is proposed. The method has two parts: intermediate-matrix construction and matrix multiplication. First, the intermediate-matrix construction is optimized by changing the data-reading order so that reads match the algorithm's access pattern. Second, for the matrix multiplication, the convolution-kernel matrix is rearranged into a memory layout better suited to matrix multiplication, and the computation between the intermediate matrix and the kernel matrix is redesigned using the compute functions provided by TVM. Finally, the platform's built-in parallel library is used to accelerate the computation. Experiments show average speedups over the original MEC algorithm of 50% on single convolutional layers and 57% on multi-layer neural networks.

To address the facts that computational-graph partitioning in TVM relies on expert experience and uses a single partitioning strategy, a subgraph partitioning method based on branch features is proposed. First, the computational graph is traversed forward using its branch-feature information to find branch start and end nodes, which are split off and stored in an array. Second, subgraphs are constructed from the nodes in the array, and the input/output dependencies between subgraphs are collected and stored. Finally, the dependency information is used to configure each subgraph's inputs and outputs and to select and configure parameters and device information. Experiments with 48 and 96 CPU cores show CNN inference speedups of 20% and 15%, respectively, over TVM's standard execution mechanism, effectively realizing the partitioning of the computational graph.

To solve the problem that a single subgraph cannot execute in parallel in TVM, a branch-parallel method is designed and implemented. First, a directed acyclic graph and a post-order dominator tree are designed to record each node's key, order, and dependencies. Second, this information is used to search for branches, whose nodes are packed into functions and marked as parallel nodes. Third, after the parallel graph is annotated, the computational graph is processed at parallel runtime, covering the inter-branch thread-pool design, data exchange, and execution. Experiments show that, compared with TVM's conventional serial execution, the branch-parallel method improves inference speed by 10% on CPU and 20% on GPU, and by 5% on average over a greedy algorithm, making effective use of hardware resources. |
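The post-order branch extraction described in the abstract can be sketched on a toy graph. This is a minimal illustration only: the node names, edge-list representation, and fan-in/fan-out criteria are assumptions for the sketch, not the thesis's actual TVM pass.

```python
from collections import defaultdict

def extract_branches(edges, root):
    """Post-order walk of a toy computational graph, marking where
    branches start (fan-out > 1) and end (fan-in > 1)."""
    children = defaultdict(list)
    parents = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
        parents[dst].append(src)

    order, starts, ends = [], [], []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for child in children[node]:
            visit(child)
        order.append(node)          # post-order position
        if len(children[node]) > 1:
            starts.append(node)     # branch start: output fans out
        if len(parents[node]) > 1:
            ends.append(node)       # branch end: inputs merge

    visit(root)
    return order, starts, ends

# Diamond-shaped block, like an Inception-style branch:
#   conv0 -> {conv1a, conv1b} -> concat
edges = [("conv0", "conv1a"), ("conv0", "conv1b"),
         ("conv1a", "concat"), ("conv1b", "concat")]
order, starts, ends = extract_branches(edges, "conv0")
print(starts, ends)  # ['conv0'] ['concat']
```

The branch start/end pairs found this way delimit the spans that are later cut out as subgraphs.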
English abstract: |
With the development of artificial intelligence technology, the emergence of algorithms such as Convolutional Neural Networks (CNNs) and of diverse hardware platforms has made it harder to deploy and develop these algorithms across platforms. The Tensor Virtual Machine (TVM), a general-purpose neural network compiler, can optimize many types of neural networks and generate highly optimized low-level code for a target hardware platform, and has become one of the main optimization and deployment platforms in artificial intelligence. However, performance bottlenecks caused by load and communication overhead make it difficult for devices to fully utilize hardware resources. Based on a behavior analysis of CNN algorithms, this thesis therefore proposes a TVM-based method for convolution optimization and parallel scheduling over partitioned computational graphs.

To mine the branch information and convolution features of CNN algorithms, their behavior patterns on TVM are analyzed in depth. First, branch information is extracted from the computational graph by post-order traversal, yielding the start, end, and internal nodes of each branch, from which a feature computational graph is constructed. Second, the memory-access and computation features of convolutions are extracted with the help of the TVM intermediate representation. Finally, the computational graph is partitioned manually according to the extracted branch information. Experimental results show that partitioning by branch features achieves an average speedup of 18% over the conventional TVM method, and that convolution optimization based on convolution features achieves an average speedup of 20%, demonstrating that the feature information is mined effectively.

To address the long memory-access times that the Memory Efficient Convolution (MEC) algorithm suffers on conventional devices because of non-contiguous data addresses, an optimization method tailored to MEC's memory-access behavior is proposed. The method has two parts: intermediate-matrix construction and matrix multiplication. First, the intermediate-matrix construction is optimized by changing the data-reading order so that reads match the algorithm's access pattern. Second, for the matrix multiplication, the convolution-kernel matrix is rearranged into a memory layout better suited to matrix multiplication, and the computation between the intermediate matrix and the kernel matrix is redesigned using the compute functions provided by the TVM platform. Finally, the platform's parallel library is used to accelerate the computation. Experimental results show average speedups over the original MEC algorithm of 50% on single convolutional layers and 57% on multi-layer neural networks.

To address the facts that computational-graph partitioning in TVM depends on expert experience and uses a single, inflexible strategy, a subgraph partitioning method based on branch features is proposed. First, the computational graph is traversed forward using its branch-feature information to find branch start and end nodes, which are split off and stored in an array. Second, subgraphs are constructed from the nodes in that array, and the input/output dependencies between subgraphs are collected and stored. Finally, the dependency information is used to configure the inputs and outputs of each subgraph and to select and configure parameters and device information. Experiments with 48 and 96 CPU cores show that CNN inference runs 20% and 15% faster, respectively, than under TVM's standard execution mechanism, effectively realizing the partitioning of the computational graph.

To solve the problem that a single subgraph cannot execute in parallel in TVM, a branch-parallel method is designed and implemented. First, a directed acyclic graph and a post-order dominator tree are designed to record each node's key, order, and dependencies. Second, this information is used to search for branches, whose nodes are packed into functions and labeled as parallel nodes. Third, once the parallel graph is annotated, the computational graph is processed at parallel runtime, covering the design of the inter-branch thread pool, data exchange, and execution. Experimental results show that, compared with TVM's conventional serial execution, the branch-parallel method improves inference speed by 10% on CPU and 20% on GPU, and by 5% on average over a greedy algorithm, making effective use of hardware resources. |
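The MEC lowering whose memory-access behavior the abstract optimizes can be illustrated with a minimal single-channel, stride-1, no-padding sketch. The function name and the plain-Python lists here are assumptions for illustration; the thesis's version operates on TVM compute functions and its parallel library rather than Python loops.

```python
def mec_conv2d(inp, kernel):
    """Minimal sketch of Memory Efficient Convolution (MEC):
    single channel, stride 1, no padding. Illustrative only."""
    H, W = len(inp), len(inp[0])
    k = len(kernel)
    oh, ow = H - k + 1, W - k + 1

    # Lowered intermediate matrix: one row per k-wide vertical strip of
    # the input, read in row-major (contiguous) order -- the access
    # pattern that the data-reading reorder aims to preserve.
    L = [[inp[r][w + c] for r in range(H) for c in range(k)]
         for w in range(ow)]

    # Each output row is a small matrix-vector product over an
    # overlapping slice of L; MEC never materializes the full im2col
    # matrix, which is where the memory saving comes from.
    kvec = [kernel[r][c] for r in range(k) for c in range(k)]
    out = [[sum(L[w][h * k + i] * kvec[i] for i in range(k * k))
            for w in range(ow)]
           for h in range(oh)]
    return out

# 5x5 input, 3x3 all-ones kernel: each output entry is a 3x3 patch sum.
inp = [[5 * i + j for j in range(5)] for i in range(5)]
out = mec_conv2d(inp, [[1] * 3 for _ in range(3)])
print(out[0][0])  # 54: sum of the top-left 3x3 patch
```

The per-row products over overlapping slices of `L` are the small GEMMs that the thesis maps onto TVM's encapsulated compute functions and parallelizes.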
References: |
CLC number: | TP391 |
Release date: | 2022-07-11 |