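View thesis information

Thesis title (Chinese):

 面向CPU-GPU平台的深度学习模型自适应划分方法研究

Name:

 尚绍法

Student ID:

 20208049011

Confidentiality level:

 Confidential (open after 1 year)

Thesis language:

 chi

Discipline code:

 081203

Discipline name:

 Engineering - Computer Science and Technology (degrees in Engineering or Science may be awarded) - Computer Application Technology

Student type:

 Master's

Degree level:

 Master of Engineering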

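Degree year:

 2023

Degree-granting institution:

 Xi'an University of Science and Technology

School/Department:

 College of Computer Science and Technology

Major:

 Computer Science and Technology

Research direction:

 Machine Learning

First supervisor name:

 蒋林

First supervisor affiliation:

 Xi'an University of Science and Technology

Thesis submission date:

 2023-06-19

Thesis defense date:

 2023-06-06

Thesis title (English):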

 Research on Adaptive Division Method of Deep Learning Model for CPU-GPU Platform    

Keywords (Chinese):

 特征分析 ; 卷积神经网络 ; 循环神经网络 ; 模型划分 ; 性能预测 ; 张量虚拟机    

Keywords (English):

 Characteristic Analysis ; Convolutional Neural Network ; Recurrent Neural Network ; Model Partition ; Performance Prediction ; Tensor Virtual Machine

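Abstract (Chinese version, translated):

With the rapid development of artificial intelligence, the emergence of deep learning algorithms such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), together with dedicated hardware acceleration platforms, has made it harder to develop and deploy algorithms across platforms. Because hardware devices differ in computing capability and communication bandwidth, performance bottlenecks arise when a task's computational load or communication overhead is too large, and hardware resources are then hard to utilize fully. Based on a feature analysis of typical CNN and RNN models, this thesis therefore proposes an adaptive partitioning method for deep learning models on CPU-GPU platforms to alleviate the underutilization of computing resources.

To mine the feature information of deep learning models, a feature extraction method for operator parameter information and dependency information is designed on top of the computational graph data structure that TVM (Tensor Virtual Machine) uses for deep learning models, providing a data basis for model optimization research. A dependency matrix is also designed for extracting and computing dependency features. Experimental results show that, compared with TVM's operator fusion strategy, manual optimization based on feature analysis significantly improves the inference performance of CNN and RNN models, by 11.9% and 16.8% on average, respectively.

To address the poor flexibility of traditional performance prediction methods for deep learning models, a performance prediction method based on multi-parameter fusion is proposed. Operators are tested automatically on an Intel Xeon Gold 6248R multi-core CPU and an Nvidia Quadro P2200 GPU to collect their performance parameters. An inference performance prediction method is then designed from polynomial fitting and inter-operator dependency information, enabling inference performance prediction for both operators and models. Experimental results show that the proposed method generalizes better than the latest related work. Its predictions of model inference performance are more stable, and it supports predicting the inference performance of models deployed cooperatively across multiple devices, with a prediction error of no more than 10%.

To address the low hardware utilization and high latency of spatially correlated deep learning models performing inference on heterogeneous platforms, an adaptive partitioning method for such models is proposed. The model is partitioned adaptively through feature analysis and key operator selection, which makes the scheduling strategy more flexible. Based on the partitioning result, a critical path-greedy correction algorithm is designed to adjust the devices on which the model runs inference, and its performance is verified and analyzed by comparison. Experimental results show that the optimization is most effective when the batch size is small and the model is deep. Inference performance improves by 12.51% on average over TVM operator fusion and by 6.9% on average over the EOP optimization method.

To address insufficient device utilization when temporally correlated deep learning models perform inference on heterogeneous platforms, an adaptive partitioning method for such models is proposed. After the model's characteristics are analyzed, the performance prediction method is used to predict the inference time of a single time slice, and the model is partitioned by time slice into subgraphs of comparable performance. A multi-task subgraph parallel scheduling algorithm based on a pipeline mechanism is designed to increase the parallelism of model inference. Experimental results show that, compared with TVM operator fusion, the method improves inference performance by 14% to 19% for models of different structures and by up to 38% for large-batch tasks. Compared with the latest related work along three dimensions (task scale, data scale, and number of model layers), inference performance improves by 8.35%, 12.35%, and 10% on average, respectively.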

Abstract (English):

With the rapid development of artificial intelligence technology, deep learning algorithms such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), along with dedicated hardware acceleration platforms, have emerged, making it more difficult to develop and deploy algorithms on different platforms. Hardware devices differ in characteristics such as computational power and communication bandwidth, so when the computational load or communication overhead of a task is too large, performance bottlenecks occur and hardware resources are difficult to utilize fully. Therefore, based on the feature analysis results of typical CNN and RNN models, this thesis proposes an adaptive partitioning method for deep learning models on CPU-GPU platforms to alleviate the underutilization of computational resources.

Firstly, to explore the operator features of deep learning models, a feature extraction method for operator parameter information and dependency information is designed based on the computational graph data structure of the Tensor Virtual Machine (TVM), which provides a data basis for research on operator characteristics. In addition, a dependency matrix is designed to store inter-operator dependency information for the extraction and calculation of dependency features. Experimental results show that, compared with the operator fusion strategy of TVM, manual optimization based on feature analysis significantly improves the inference performance of CNN and RNN models, by an average of 11.9% and 16.8%, respectively.
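As an illustration of the dependency matrix idea, the following is a minimal sketch and not the thesis's implementation. It assumes the computational graph has already been flattened into a list of operator names and producer-to-consumer edges; the operator names, the `build_dependency_matrix` and `dependency_features` helpers, and the choice of in-degree and out-degree as dependency features are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_dependency_matrix(ops: List[str],
                            edges: List[Tuple[str, str]]) -> List[List[int]]:
    """Return matrix m where m[i][j] == 1 if operator j consumes the output of i."""
    index = {name: i for i, name in enumerate(ops)}
    n = len(ops)
    matrix = [[0] * n for _ in range(n)]
    for producer, consumer in edges:
        matrix[index[producer]][index[consumer]] = 1
    return matrix


def dependency_features(ops: List[str],
                        matrix: List[List[int]]) -> Dict[str, Dict[str, int]]:
    """Derive simple per-operator features (assumed: in-degree and out-degree)."""
    n = len(ops)
    features = {}
    for i, name in enumerate(ops):
        out_degree = sum(matrix[i])                        # row sum: consumers
        in_degree = sum(matrix[r][i] for r in range(n))    # column sum: producers
        features[name] = {"in_degree": in_degree, "out_degree": out_degree}
    return features


# Hypothetical five-operator graph with one branch and one merge.
ops = ["conv1", "relu1", "conv2a", "conv2b", "concat"]
edges = [("conv1", "relu1"), ("relu1", "conv2a"), ("relu1", "conv2b"),
         ("conv2a", "concat"), ("conv2b", "concat")]

dep = build_dependency_matrix(ops, edges)
feats = dependency_features(ops, dep)
print(feats["relu1"])   # branch point: one producer, two consumers
print(feats["concat"])  # merge point: two producers, no consumers
```

Operators with an out-degree above one mark branch points and those with an in-degree above one mark merge points, which is the kind of structural information a partitioning step could use.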

Secondly, to address the poor flexibility of traditional performance prediction methods for deep learning models, a performance prediction method based on multi-parameter fusion is proposed. Operators are tested automatically on an Intel Xeon Gold 6248R multi-core CPU and an Nvidia Quadro P2200 GPU to collect their performance parameters. Based on polynomial fitting and the dependency information among operators, an inference performance prediction method is designed that predicts the inference performance of both operators and models. Experimental results show that the proposed method generalizes better than the latest related work. Its inference performance predictions are more stable, and it supports predicting the inference performance of deep learning models deployed collaboratively across multiple devices, with a prediction error of no more than 10%.
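The polynomial-fitting step can be sketched as follows. This is a generic least-squares fit in pure Python, not the thesis's predictor: the single input feature (a workload size), the quadratic degree, and the sample values are assumptions, and the "measurements" are synthetic points generated from a known quadratic so that the fit can be checked.

```python
from typing import List


def polyfit(xs: List[float], ys: List[float], degree: int) -> List[float]:
    """Least-squares polynomial fit; returns coefficients, lowest degree first."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def predict(coeffs: List[float], x: float) -> float:
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic operator samples: workload size vs. latency in ms (made-up values
# drawn from latency = 0.5 + 2.0 * x + 0.1 * x^2).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.5 + 2.0 * x + 0.1 * x * x for x in xs]

coeffs = polyfit(xs, ys, degree=2)
print([round(c, 6) for c in coeffs])
print(round(predict(coeffs, 6.0), 6))  # extrapolated latency for an unseen size
```

A model-level estimate would then combine per-operator predictions according to the dependency information, for example by summing along a path; that combination rule is not specified in the abstract and is left out here.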

Thirdly, an adaptive partitioning method for spatially correlated deep learning models is proposed to address the low hardware resource utilization and high latency of such models when they perform inference on heterogeneous platforms. The model is partitioned adaptively through feature analysis and key operator selection, which improves the flexibility of the scheduling strategy. Based on the partitioning results, a critical path-greedy correction algorithm is designed to adjust the devices on which the model runs inference, and its performance is verified and analyzed by comparison. Experimental results show that the optimization effect is significant when the model batch size is small and the model is deep. Compared with the operator fusion strategy of TVM, inference performance improves by 12.51% on average. Compared with the EOP optimization method, inference performance improves by 6.9% on average.
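The abstract does not spell out the critical path-greedy correction algorithm, so the following is only one generic reading of the name and not the thesis's method. It assumes two devices (`"cpu"` and `"gpu"`), hypothetical per-operator latencies, a fixed transfer penalty whenever an edge crosses devices, and it ignores contention between operators placed on the same device.

```python
from typing import Dict, List, Tuple

Costs = Dict[str, Dict[str, float]]


def critical_path(order: List[str], preds: Dict[str, List[str]], costs: Costs,
                  placement: Dict[str, str], transfer: float) -> Tuple[float, List[str]]:
    """Longest path through the graph; `order` must be a topological order."""
    finish: Dict[str, float] = {}
    back: Dict[str, object] = {}
    for op in order:
        start, best_pred = 0.0, None
        for p in preds.get(op, []):
            # Pay the transfer penalty only when producer and consumer differ in device.
            t = finish[p] + (transfer if placement[p] != placement[op] else 0.0)
            if t > start:
                start, best_pred = t, p
        finish[op] = start + costs[op][placement[op]]
        back[op] = best_pred
    node = max(order, key=lambda o: finish[o])
    length, path = finish[node], []
    while node is not None:
        path.append(node)
        node = back[node]
    return length, path[::-1]


def greedy_correct(order: List[str], preds: Dict[str, List[str]],
                   costs: Costs, transfer: float) -> Tuple[Dict[str, str], float]:
    """Start from each operator's fastest device, then flip critical-path operators."""
    placement = {op: min(costs[op], key=costs[op].get) for op in order}
    length, path = critical_path(order, preds, costs, placement, transfer)
    improved = True
    while improved:
        improved = False
        for op in path:
            other = "gpu" if placement[op] == "cpu" else "cpu"
            trial = dict(placement)
            trial[op] = other
            trial_len, trial_path = critical_path(order, preds, costs, trial, transfer)
            if trial_len < length - 1e-12:  # accept the first strict improvement
                placement, length, path = trial, trial_len, trial_path
                improved = True
                break
    return placement, length


# Hypothetical three-operator chain with made-up latencies (ms).
order = ["a", "b", "c"]
preds = {"b": ["a"], "c": ["b"]}
costs = {"a": {"cpu": 1.0, "gpu": 3.0},
         "b": {"cpu": 4.0, "gpu": 3.0},
         "c": {"cpu": 1.0, "gpu": 3.0}}

placement, length = greedy_correct(order, preds, costs, transfer=2.0)
print(placement, length)
```

In this toy chain the per-operator optimum puts `b` on the GPU, which costs two transfers; the correction step moves it back to the CPU because the end-to-end path becomes shorter.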

Finally, to address the underutilization of device resources when temporally correlated deep learning models perform inference on heterogeneous platforms, an adaptive partitioning method for such models is proposed. After the model is characterized, a performance prediction method is used to predict the inference time of a single time slice, and the model is partitioned by time slice into subgraphs with comparable performance. A multi-task subgraph parallel scheduling algorithm is designed based on a pipeline mechanism to improve the parallelism of model inference. Experimental results show that, compared with the operator fusion strategy of TVM, the proposed method improves inference performance by 14% to 19% for models with different structures and by up to 38% for large-batch tasks. Compared with the latest related work in terms of task size, processed data size, and number of model layers, inference performance improves by 8.35%, 12.35%, and 10% on average, respectively.
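The two ideas in this paragraph, splitting time slices into subgraphs of comparable cost and running several tasks through those subgraphs as a pipeline, can be sketched as follows. This is an assumed formulation rather than the thesis's algorithm: the per-slice times are made-up predictions, the split uses a standard dynamic program that minimizes the largest stage, and each stage is assumed to run on its own device.

```python
from typing import List


def balanced_partition(times: List[float], k: int) -> List[List[float]]:
    """Split `times` into k contiguous groups minimizing the largest group sum."""
    n = len(times)
    prefix = [0.0]
    for t in times:
        prefix.append(prefix[-1] + t)
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for i in range(g, n + 1):
            for j in range(g - 1, i):
                cand = max(best[g - 1][j], prefix[i] - prefix[j])
                if cand < best[g][i]:
                    best[g][i], cut[g][i] = cand, j
    groups, i = [], n
    for g in range(k, 0, -1):  # walk the recorded cut points backwards
        j = cut[g][i]
        groups.append(times[j:i])
        i = j
    return groups[::-1]


def pipeline_makespan(stage_times: List[float], num_tasks: int) -> float:
    """Finish time of the last task when tasks flow through the stages in order."""
    finish = [0.0] * len(stage_times)  # finish time of the previous task per stage
    for _ in range(num_tasks):
        prev = 0.0  # finish time of the current task at the previous stage
        for s, cost in enumerate(stage_times):
            prev = max(prev, finish[s]) + cost
            finish[s] = prev
    return finish[-1]


# Hypothetical predicted inference times (ms) for six time slices.
slice_times = [2.0, 2.0, 1.0, 3.0, 2.0, 2.0]

groups = balanced_partition(slice_times, k=2)
stage_times = [sum(g) for g in groups]
makespan = pipeline_makespan(stage_times, num_tasks=4)
sequential = 4 * sum(slice_times)
print(groups, stage_times, makespan, sequential)
```

With four tasks the pipelined schedule finishes in 33 ms against 48 ms when the tasks run one after another, because the second stage starts on one task while the first stage is already working on the next.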

Chinese Library Classification (CLC) number:

 TP183

Open-access date:

 2024-06-20
