Chinese title: | 面向CPU-GPU平台的深度学习模型自适应划分方法研究 (Research on an Adaptive Partitioning Method for Deep Learning Models on CPU-GPU Platforms) |
Name: | |
Student ID: | 20208049011 |
Confidentiality level: | Classified (open after 1 year) |
Thesis language: | Chinese (chi) |
Discipline code: | 081203 |
Discipline: | Engineering - Computer Science and Technology (may confer engineering or science degrees) - Computer Application Technology |
Student type: | Master's candidate |
Degree: | Master of Engineering |
Degree year: | 2023 |
Degree-granting institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research area: | Machine learning |
Primary supervisor: | |
Primary supervisor's institution: | |
Submission date: | 2023-06-19 |
Defense date: | 2023-06-06 |
English title: | Research on an Adaptive Partitioning Method for Deep Learning Models on CPU-GPU Platforms |
Chinese keywords: | |
English keywords: | Characteristic Analysis; Convolutional Neural Network; Recurrent Neural Network; Model Partition; Performance Predictor; Tensor Virtual Machine |
Chinese abstract: |
With the rapid development of artificial intelligence technology, the emergence of deep learning algorithms such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), together with dedicated hardware acceleration platforms, has made it harder to develop and deploy algorithms across platforms. Because hardware devices differ in characteristics such as computing power and communication bandwidth, excessive computational load or communication overhead causes performance bottlenecks and leaves hardware resources underutilized. Based on characteristic analysis of typical CNN and RNN models, this thesis therefore proposes an adaptive partitioning method for deep learning models on CPU-GPU platforms to alleviate the underutilization of computing resources.

To mine the feature information of deep learning models, a method for extracting operator parameter information and dependency information is designed on the basis of the computation-graph data structure of TVM (Tensor Virtual Machine), providing a data foundation for model optimization research. A dependency relation matrix is also designed for extracting and computing dependency features. Experimental results show that, compared with TVM's operator fusion strategy, manual optimization guided by characteristic analysis significantly improves the inference performance of CNN and RNN models, by 11.9% and 16.8% on average, respectively.

To address the poor flexibility of traditional performance prediction methods for deep learning models, a performance prediction method based on multi-parameter fusion is proposed. Operators are profiled automatically on an Intel Xeon Gold 6248R multi-core CPU and an Nvidia Quadro P2200 GPU to collect their performance parameters. Using polynomial fitting and inter-operator dependency information, an inference performance prediction method is designed that predicts the inference performance of both operators and whole models. Experimental results show that the proposed method generalizes better than the latest related work: its predictions of model inference performance are more stable, and it supports performance prediction for models deployed cooperatively across multiple devices, with prediction error within 10%.

To address the low hardware utilization and high latency of spatially correlated deep learning models during inference on heterogeneous platforms, an adaptive partitioning method for spatially correlated models is proposed. The model is partitioned adaptively through characteristic analysis and key operator selection, which increases the flexibility of the scheduling strategy. Based on the partitioning results, a critical-path greedy correction algorithm is designed to adjust the devices assigned to model inference, and its performance is verified and analyzed comparatively. Experimental results show that the method is most effective when the batch size is small and the model is deep: inference performance improves by 12.51% on average over TVM's operator fusion and by 6.9% on average over the EOP optimization method.

To address the underutilization of devices when temporally correlated deep learning models run inference on heterogeneous platforms, an adaptive partitioning method for temporally correlated models is proposed. After characteristic analysis of the model, the performance prediction method is used to predict the inference time of a single time slice, and the model is partitioned by time slice into subgraphs of comparable performance. A multi-task subgraph parallel scheduling algorithm based on a pipeline mechanism is designed to raise the parallelism of model inference. Experimental results show that, compared with TVM's operator fusion, the method improves inference performance by 14% to 19% for models with different structures and by up to 38% for large batches of inference tasks. Compared with the latest related work, inference performance improves by 8.35%, 12.35%, and 10% on average in terms of task scale, processed data scale, and number of model layers. |
English abstract: |
With the rapid development of artificial intelligence technology, various deep learning algorithms such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), along with dedicated hardware acceleration platforms, have emerged, making it more difficult to develop and deploy algorithms on different platforms. Hardware devices differ in characteristics such as computational power and communication bandwidth; when the computational load and communication overhead of a task are too large, performance bottlenecks occur and hardware resources are difficult to utilize fully. Therefore, based on the feature analysis results of typical CNN and RNN models, this paper proposes an adaptive partitioning method for deep learning models on CPU-GPU platforms to alleviate the underutilization of computational resources.

Firstly, to explore the operator features of deep learning models, a feature extraction method for operator parameter information and dependency information is designed based on the computational-graph data structure of the Tensor Virtual Machine (TVM), providing a data basis for the study of operator characteristics. At the same time, a dependency matrix is designed to store inter-operator dependency information for the extraction and calculation of dependency features. The experimental results show that the manual optimization method based on feature analysis significantly improves the inference performance of CNN and RNN models, by an average of 11.9% and 16.8% respectively, compared with the operator fusion strategy of TVM.

Secondly, to address the poor flexibility of traditional performance prediction methods for deep learning models, a performance prediction method based on multi-parameter fusion is proposed. Operators are tested automatically on an Intel Xeon Gold 6248R multi-core CPU and an Nvidia Quadro P2200 GPU to collect their performance parameters. Based on the polynomial fitting method and the dependency information among operators, an inference performance prediction method is designed to predict the inference performance of both operators and models. The experimental results show that the proposed method generalizes better than previous works: its predictions of model inference performance are more stable, and it supports performance prediction for deep learning models deployed collaboratively across multiple devices, with a prediction error of no more than 10%.

Thirdly, an adaptive partitioning method for spatially correlated deep learning models is proposed to address the low hardware resource utilization and high latency of such models during inference on heterogeneous platforms. The model is partitioned adaptively through feature analysis and key operator selection, which improves the flexibility of the scheduling strategy. Based on the partitioning results, a critical-path greedy correction algorithm is designed to tune the devices used for model inference, and its performance is compared, verified, and analyzed. The experimental results show that the method is most effective when the model batch size is small and the model is deep: inference performance improves by 12.51% on average over the operator fusion strategy of TVM and by 6.9% on average over the EOP optimization method.

Finally, to address the underutilization of device resources when temporally correlated deep learning models run inference on heterogeneous platforms, an adaptive partitioning method for such models is proposed. The model is characterized, a performance prediction method is used to predict the inference time of a single time slice, and the model is partitioned by time slice into subgraphs of comparable performance. A multi-task subgraph parallel scheduling algorithm based on a pipeline mechanism is designed to improve the parallelism of model inference. The experimental results show that, compared with the operator fusion strategy of TVM, the method improves inference performance by 14% to 19% for models with different structures and by up to 38% for large-batch task inference. Compared with previous works, inference performance improves by 8.35%, 12.35%, and 10% on average in terms of task size, processed data size, and number of model layers. |
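The dependency relation matrix mentioned in the abstract can be pictured as an adjacency structure over the operators of a computation graph. The following is a minimal illustrative sketch, not the thesis code: the `graph` dict and the toy operator names are invented stand-ins for the operator and dependency information the thesis extracts from TVM's computation graph.

```python
import numpy as np

def dependency_matrix(graph):
    """Build an N x N matrix D where D[i, j] = 1 iff operator j feeds operator i."""
    ops = sorted(graph)                       # stable operator ordering
    index = {op: k for k, op in enumerate(ops)}
    d = np.zeros((len(ops), len(ops)), dtype=np.int8)
    for op, inputs in graph.items():
        for src in inputs:
            d[index[op], index[src]] = 1      # op depends on src
    return ops, d

# Toy graph (invented): conv1 -> relu -> (pool, conv2) -> add
g = {"conv1": [], "relu": ["conv1"], "pool": ["relu"],
     "conv2": ["relu"], "add": ["pool", "conv2"]}
names, D = dependency_matrix(g)
print(names)
print(D)
```

Row sums of such a matrix give each operator's fan-in and column sums its fan-out, which is the kind of dependency feature the extraction step can compute directly.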
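The multi-parameter performance predictor rests on polynomial fitting of measured operator latencies against operator parameters. Below is a hedged sketch of that idea on synthetic profiling data; the choice of parameters (input size, channels), the polynomial degree, and the synthetic cost model are assumptions for illustration, not the thesis's actual feature set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated profiling samples: columns = (input_size, channels); target = latency (ms).
params = rng.integers(8, 256, size=(200, 2)).astype(float)
latency = 0.01 * params[:, 0] * params[:, 1] / 100 + rng.normal(0, 0.05, 200)

def poly_features(x, degree=2):
    """Expand each parameter vector into polynomial terms up to `degree`."""
    cols = [np.ones(len(x))]
    for d in range(1, degree + 1):
        cols.append(x[:, 0] ** d)
        cols.append(x[:, 1] ** d)
    cols.append(x[:, 0] * x[:, 1])            # cross term couples the parameters
    return np.column_stack(cols)

X = poly_features(params)
coef, *_ = np.linalg.lstsq(X, latency, rcond=None)   # least-squares polynomial fit

def predict(p):
    """Predict latency for one unseen parameter configuration."""
    return poly_features(np.atleast_2d(np.asarray(p, dtype=float))) @ coef

print(predict([128.0, 64.0]))
```

Per-operator fits of this kind can then be summed along the dependency chain to estimate whole-model inference time, which is presumably where the inter-operator dependency information enters.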
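The abstract gives only the name of the critical-path greedy correction algorithm. One plausible reading is a greedy pass that starts from an initial CPU/GPU assignment and repeatedly moves an operator to whichever device shortens the end-to-end (critical-path) latency, charging a penalty for cross-device transfers. The sketch below implements that reading; the graph, per-device costs, and the `xfer` penalty are all invented.

```python
def path_length(order, graph, assign, cost, xfer=0.2):
    """End-to-end latency of a DAG given a device assignment (toy model)."""
    finish = {}
    for op in order:                                  # topological order
        start = 0.0
        for src in graph[op]:
            t = finish[src] + (xfer if assign[src] != assign[op] else 0.0)
            start = max(start, t)
        finish[op] = start + cost[op][assign[op]]
    return max(finish.values())

graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
order = ["a", "b", "c", "d"]
cost = {"a": {"cpu": 1.0, "gpu": 0.4}, "b": {"cpu": 0.5, "gpu": 0.6},
        "c": {"cpu": 2.0, "gpu": 0.5}, "d": {"cpu": 0.8, "gpu": 0.3}}
assign = {op: "cpu" for op in graph}

improved = True
while improved:                                       # greedy correction loop
    improved = False
    base = path_length(order, graph, assign, cost)
    for op in order:
        for dev in ("cpu", "gpu"):
            if dev == assign[op]:
                continue
            trial = dict(assign, **{op: dev})
            if path_length(order, graph, trial, cost) < base:
                assign, improved = trial, True
                base = path_length(order, graph, assign, cost)

print(assign, path_length(order, graph, assign, cost))
```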
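For the temporally correlated case, the time-slice subgraphs feed a pipeline: consecutive subgraphs of comparable predicted latency run as stages, so a batch of tasks keeps the CPU and GPU busy on different tasks at once. The following toy queue-based pipeline shows only the scheduling pattern; `make_stage` and its sleep-based latencies are placeholders for executing the real subgraphs on their assigned devices.

```python
from concurrent.futures import ThreadPoolExecutor
import queue, time

def make_stage(name, latency):
    def stage(task):
        time.sleep(latency)                  # stand-in for running one subgraph
        return f"{task}->{name}"
    return stage

stages = [make_stage("cpu_sub", 0.01), make_stage("gpu_sub", 0.01)]

def pipeline(tasks, stages):
    qs = [queue.Queue() for _ in range(len(stages) + 1)]
    for t in tasks:
        qs[0].put(t)
    qs[0].put(None)                          # end-of-stream marker

    def worker(i):
        while True:
            item = qs[i].get()
            if item is None:                 # propagate shutdown downstream
                qs[i + 1].put(None)
                return
            qs[i + 1].put(stages[i](item))

    with ThreadPoolExecutor(len(stages)) as ex:
        for i in range(len(stages)):
            ex.submit(worker, i)             # one thread per pipeline stage
    out = []
    while True:
        item = qs[-1].get()
        if item is None:
            return out
        out.append(item)

print(pipeline([f"t{i}" for i in range(4)], stages))
```

With stages of comparable latency the pipeline stays balanced, which is why the partitioning step aims for subgraphs of roughly equal predicted inference time.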
CLC number: | TP183 |
Open access date: | 2024-06-20 |