Thesis Information

Chinese title:

 Research and Design of Reconfigurable Array Structures for Convolutional Neural Networks

Name:

 朱育琳    

Student ID:

 19206204098    

Confidentiality level:

 Public

Thesis language:

 Chinese (chi)

Discipline code:

 085210    

Discipline name:

 Engineering - Engineering - Control Engineering

Student type:

 Master's student

Degree level:

 Master of Engineering

Degree year:

 2022    

Degree-granting institution:

 Xi'an University of Science and Technology

School/Department:

 College of Electrical and Control Engineering

Major:

 Control Engineering

Research direction:

 Intelligent chip research

First supervisor:

 蒋林    

First supervisor's institution:

 Xi'an University of Science and Technology

Submission date:

 2022-06-29    

Defense date:

 2022-06-07    

English title:

 Research and Design of Reconfigurable Array Structures for Convolutional Neural Networks

Chinese keywords:

 Convolutional Neural Network; Model Compression; Reconfigurable Array; Data Reuse; Computer Architecture

English keywords:

 Convolutional Neural Network; Model Compression; Reconfigurable Array; Data Reuse; Computer Architecture

Chinese abstract:

The complexity of Convolutional Neural Network (CNN) models typically grows with the complexity of the task, which poses ever more severe challenges to the computing power of traditional processors. Reconfigurable array structures combine the high flexibility of traditional processors with the efficiency of application-specific integrated circuits, offering a new approach to Artificial Intelligence (AI) chip design. However, implementing CNN algorithms on reconfigurable array structures still faces problems such as high computational complexity and large storage requirements; this thesis conducts in-depth optimization research on these problems from the following aspects.

First, to relieve the pressure that network models place on the storage space of the reconfigurable array structure, the thesis proposes a network model compression method that fuses pruning and quantization. Structured model pruning removes neurons that are insensitive to the computational results, reducing the number of model parameters, and stochastic quantization rounding of the compressed floating-point parameters effectively reduces hardware resource consumption. LeNet5 and AlexNet are selected for validation; the experimental results show an accuracy loss of about 2% and a parameter reduction of about 56.3%. Compared with pruning-only compression, the compression rate improves by up to 19.9% while recognition accuracy stays essentially unchanged or improves.

Second, to break through the limited application scope of the current reconfigurable array structure, CNN-related instructions such as MAC, MAX and AVE are added to the reconfigurable processor element (PE), and the corresponding hardware structures are designed in the PE's execution unit according to the new instructions. Experimental results show that the CNN-oriented reconfigurable PE can accurately complete convolution, pooling and activation-function operations, reducing the clock cycle count by 58.8% compared with general-purpose instructions and the hardware resource usage by 35.9% compared with similar structures.

Then, to address the large amount of repeated data access in convolution operations on the reconfigurable array, a data reuse optimization strategy based on loop tiling and loop unrolling is proposed. To make full use of the advantages of the reconfigurable structure, the convolution loops are tiled, and loop unrolling of the convolution kernels and input feature maps is designed on top of the tiling. Tests on convolutions of various sizes show that data accesses are reduced by up to 83.6%, and compared with a sliding-window data reuse method, the number of multiply-accumulate operations in convolution is reduced by up to 16.25%.

Finally, to verify the effectiveness of the CNN-oriented reconfigurable structure, a reconfigurable implementation scheme for the AlexNet network is proposed, and hardware testing and performance analysis are completed on a Xilinx ZC706 development board. The results show that, using the proposed reconfiguration scheme on the CNN-oriented reconfigurable array structure, the PE utilization of a single cluster reaches up to 100%, and multithreaded execution of convolutions of all sizes achieves a speedup of up to 2.45 over single-threaded execution.

In summary, the CNN-oriented reconfigurable structure optimizations effectively improve the execution efficiency of CNN algorithms, with a maximum operating frequency of 147 MHz. Compared with reference [49], the overall processing speed for the AlexNet network improves by about 60.6%. Compared with references [52] and [53], a more complex network structure is processed at a similar hardware resource cost. Compared with reference [54], hardware resource consumption is reduced by 45.8% when processing the same convolutional neural network.

English abstract:

The complexity of Convolutional Neural Network (CNN) models typically increases with the complexity of the task, which presents a more significant challenge to traditional processor computing power. Reconfigurable array structures combine the high flexibility of traditional processors with the efficiency of application-specific integrated circuits, providing new ideas for designing Artificial Intelligence (AI) chips. However, the implementation of CNN algorithms on reconfigurable array structures still faces problems in terms of computational complexity and large storage requirements, and the paper conducts in-depth optimization studies in the following aspects to address these problems.

Firstly, the paper proposes a pruning and quantization fusion approach for the compression of network models to alleviate the pressure of network models on the storage space of reconfigurable array structures. A structured model pruning technique is used to reduce the number of model parameters by pruning neurons that are insensitive to the computational results, and stochastic quantization rounding of the compressed floating-point parameters can effectively reduce the consumption of hardware resources. LeNet5 and AlexNet are selected for validation, and the experimental results show that the accuracy loss is about 2% and the parameter reduction is about 56.3%. Compared with pruning-only model compression, the compression rate is increased by up to 19.9% with essentially the same or improved network recognition accuracy.
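The pruning-plus-quantization idea above can be sketched in NumPy. The magnitude-based neuron selection in `structured_prune` and the `scale` factor in `stochastic_round` are illustrative assumptions; the abstract does not specify the thesis's actual sensitivity criterion or quantization scale.

```python
import numpy as np

def structured_prune(weights, ratio=0.5):
    """Zero out whole rows (neurons) with the smallest L1 norm.

    A hypothetical magnitude criterion standing in for the thesis's
    'insensitive to the computational results' measure."""
    norms = np.abs(weights).sum(axis=1)
    k = int(len(norms) * ratio)
    idx = np.argsort(norms)[:k]          # least-important neurons
    pruned = weights.copy()
    pruned[idx, :] = 0.0
    return pruned

def stochastic_round(x, scale=128):
    """Stochastic rounding of scaled floats to integers: round up with
    probability equal to the fractional part, so the rounding error is
    zero in expectation."""
    y = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(y)
    frac = y - floor
    return (floor + (np.random.rand(*y.shape) < frac)).astype(np.int32)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4)).astype(np.float32)
wp = structured_prune(w, ratio=0.5)   # half the neurons removed
wq = stochastic_round(wp)             # unbiased float -> int quantization
```

In a real pipeline the pruned model would be fine-tuned before quantization; this sketch only shows the two compression steps back to back.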

Secondly, in order to break through the limited application scope of the current reconfigurable array structure, CNN-related instructions such as MAC, MAX and AVE are added to the reconfigurable processor element (PE), and the corresponding hardware structure design is completed in the execution unit of the PE according to the new instructions. The experimental results show that the CNN-oriented reconfigurable PE can accurately complete convolution, pooling and activation-function operations, reducing the number of clock cycles by 58.8% compared with generic instructions and the hardware resource usage by 35.9% compared with similar structures.
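The role of the new instructions can be illustrated with a minimal software model of a PE execution unit. The `execute` function and its operand layout are hypothetical, since the abstract does not describe the PE's actual ISA encoding or register file.

```python
def execute(op, acc, a, b=0):
    """One step of a hypothetical PE execution unit with the
    CNN-specific instructions named in the thesis."""
    if op == "MAC":        # multiply-accumulate, the core of convolution
        return acc + a * b
    if op == "MAX":        # running maximum, used for max pooling
        return max(acc, a)
    if op == "AVE":        # running sum; caller divides by window size
        return acc + a
    raise ValueError(f"unknown instruction: {op}")

# A 1x3 convolution window computed as three MAC instructions:
acc = 0
for x, w in zip([1, 2, 3], [4, 5, 6]):
    acc = execute("MAC", acc, x, w)
# acc is now 1*4 + 2*5 + 3*6 = 32
```

Replacing such a three-instruction multiply/add/compare sequence with a single fused instruction is what yields the cycle-count reduction the abstract reports.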

Then, a data reuse optimization strategy based on loop tiling and loop unrolling is proposed for the problem of the large amount of data repeatedly accessed by convolution operations on reconfigurable arrays. In order to maximize the advantages of the reconfigurable structure, a loop tiling optimization is designed for the convolution operation, and loop unrolling of the convolution kernels and input feature maps is designed on top of the tiling. Results from tests with various sizes of convolution show that data accesses can be reduced by up to 83.6%, and compared with a sliding-window-based data reuse method, the number of multiply-accumulate operations in convolution is reduced by up to 16.25%.
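The loop-tiling idea can be sketched as follows; the tile size and buffering scheme here are illustrative assumptions, not the thesis's actual parameters. Each tile's input patch is fetched once and reused for every output pixel inside the tile, instead of re-fetching overlapping rows per output pixel, which is the source of the reduced data accesses.

```python
import numpy as np

def conv2d_tiled(x, k, tile=4):
    """Direct 2D convolution with the output loops split into tiles.

    A simplified sketch of loop tiling: one buffered load ('patch')
    covers the whole tile's receptive field."""
    H, W = x.shape
    kh, kw = k.shape
    oh, ow = H - kh + 1, W - kw + 1
    out = np.zeros((oh, ow))
    for ti in range(0, oh, tile):
        for tj in range(0, ow, tile):
            # fetch the tile's input region once, reuse it below
            patch = x[ti:min(ti + tile, oh) + kh - 1,
                      tj:min(tj + tile, ow) + kw - 1]
            for i in range(patch.shape[0] - kh + 1):
                for j in range(patch.shape[1] - kw + 1):
                    out[ti + i, tj + j] = (patch[i:i + kh, j:j + kw] * k).sum()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3))
y = conv2d_tiled(x, k, tile=2)   # same result as an untiled convolution
```

Loop unrolling would then flatten the innermost kernel loops into straight-line multiply-accumulates mapped onto the PE array; that step is omitted here for brevity.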

Finally, to verify the effectiveness of the CNN-oriented reconfigurable structure optimization, a reconfigurable implementation of the AlexNet network is proposed, and hardware testing and performance analysis are completed on the Xilinx ZC706 development board. The results show that, adopting the reconfiguration scheme of this paper on the CNN-oriented reconfigurable structure, a single-cluster PE utilization of up to 100% can be achieved, and the speedup of multithreaded convolution operations of all sizes reaches up to 2.45 compared with single-threaded execution.

In summary, the CNN-oriented reconfigurable structure optimization can effectively improve the execution efficiency of CNN algorithms, and its maximum operating frequency can reach 147 MHz. Compared with reference [49], the overall improvement in processing speed for the AlexNet network is approximately 60.6%. Compared with references [52] and [53], the structure of the processed network is more complex at a similar consumption of hardware resources. The hardware resource consumption for processing the same convolutional neural network is reduced by 45.8% compared with reference [54].

References:

[1] Yang C, Wang Y, Wang X, et al. A Stride-Based Convolution Decomposition Method to Stretch CNN Acceleration Algorithms for Efficient and Flexible Hardware Implementation[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2020, 67(9):3007-3020.

[2]Yu Y , Wu C , Zhao T , et al. OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems.2020,28(1):35-47.

[3]Mahale G , Udupa P , Chandrasekharan K K , et al. WinDConv: A Fused Datapath CNN Accelerator for Power-efficient Edge Devices[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(11):4278-4289.

[4]Yan F , He Y , Ruwase O , et al. Efficient Deep Neural Network Serving: Fast and Furious[J].IEEE Transactions on Network and Service Management.2018, 15(1):112-126.

[5]Shafiee M J , Mishra A , Wong A . Deep Learning with Darwin: Evolutionary Synthesis of Deep Neural Networks[J]. Neural processing letters, 2018, 48(1):603-613.

[6]Zhao S , Blaabjerg F , Wang H . An Overview of Artificial Intelligence Applications for Power Electronics[J]. IEEE Transactions on Power Electronics, 2021,36(4):4633-4658.

[7] Gong Shijun, Li Jiajun, Lu Wenyan, et al. ShuntFlowPlus: An Efficient and Scalable Dataflow Accelerator Architecture for Stream Applications[J]. ACM Journal on Emerging Technologies in Computing Systems, 2021, 17(4):1-24.

[8]Liu L , Zhu J , Li Z , et al. A Survey of Coarse-Grained Reconfigurable Architecture and Design[J]. ACM Computing Surveys.2019,52(6):1-39.

[9]Lino C , Shengbing Z , Huimin D, et al. A Reconfigurable Neural Network Processor With Tile-Grained Multicore Pipeline for Object Detection on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems.2021,29(11):1967-1980.

[10] Ahmed J A, Mohamed E, Ahmed H, et al. Power Efficient Design of High-Performance Convolutional Neural Networks Hardware Accelerator on FPGA: A Case Study With GoogLeNet[J]. IEEE Access, 2021:151897-151911.

[11] Bernardo P P , Gerum C , Frischknecht A , et al. UltraTrail: A Configurable Ultralow-Power TC-ResNet AI Accelerator for Efficient Keyword Spotting[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(11):4240-4251.

[12] Meloni P , Capotondi A , Deriu G , et al. NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs[J]. ACM Transactions on Reconfigurable Technology & Systems, 2018,11(3):1-24.

[13] Bae I , Harris B , Min H , et al. Auto-Tuning CNNs for Coarse-Grained Reconfigurable Array-based Accelerators[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,2018,37(11):2301-2310.

[14]Kyriakos A , Papatheofanous E A , Charalampos B , et al. Design and Performance Comparison of CNN Accelerators Based on the Intel Movidius Myriad2 SoC and FPGA Embedded Prototype[C]// 2019 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO)., 2019:142-147.

[15] Yuan T, Liu W, Han J, et al. High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68(1):250-263.

[16] Zhang W, He L, Chen P , et al. Resolution-Aware Knowledge Distillation for Efficient Inference[J]. IEEE Transactions on Image Processing.2021,30:6985-6996.

[17] Gao H, Wang Z, Cai L, et al. ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 43(8):2570-2581.

[18] Peiyang Liu, Xi Wang, Lin Wang, Wei Ye, Xiangyu Xi, and Shikun Zhang.Distilling Knowledge from BERT into Simple Fully Connected Neural Networks for Efficient Vertical Retrieval[C]//Proceedings of the 30th ACM International Conference on Information & Knowledge Management.2021:3965-3975

[19] Zhang J , Zhang Y , Yan Y , et al. MobileNet-SSD with adaptive expansion of receptive field[C]//2020 IEEE 3rd International Conference of Safe Production and Informatization, 2020:177-181.

[20] Wang Z , Li C , Wang X , et al. Towards Efficient Convolutional Neural Networks Through Low-Error Filter Saliency Estimation[J].Springer, Cham.2019:255-267.

[21] Deng L, Li G, Han S, et al. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey[J]. Proceedings of the IEEE, 2020, 108(4):485-532.

[22] Jinyang G, Weichen Z, Wanli O, et al. Model Compression Using Progressive Channel Pruning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021,31(3):1114-1124.

[23] Choi J, Kong B Y, Park I C. Retrain-Less Weight Quantization for Multiplier-Less Convolutional Neural Networks[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2020, 67(3):972-982.

[24] Han Ming, Wang Ye, Dong Jian, et al. Double-Shift: A Low-Power DNN Weights Storage and Access Framework based on Approximate Decomposition and Quantization[J]. ACM Transactions on Design Automation of Electronic Systems, 2022, 27(2):1-16.

[25] Fan X,Liu Z,Lian J, et al. Lighter and Better: Low-Rank Decomposed Self-Attention Networks for Next-Item Recommendation[C]// SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2021:1733-1737.

[26] Liu H , He Y , Yu F R , et al. Flexi-Compression: A Flexible Model Compression Method for Autonomous Driving[C]//Proceedings of the 11th ACM Symposium on Design and Analysis of Intelligent Vehicular Networks and Applications. 2021:19-26.

[27] Huang S, Ankit A, Antunes R, et al. Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators[C]//The 26th Asia and South Pacific Design Automation Conference (ASPDAC '21), 2021:372-377.

[28] Prabhakar R, Zhang Y, Koeplinger D, et al. Plasticine: A Reconfigurable Architecture For Parallel Patterns[C]// ACM/IEEE International Symposium on Computer Architecture. IEEE Computer Society, 2017:389-402.

[29] Nowatzki T, Gangadhar V, Ardalani N, et al. Stream-Dataflow Acceleration[J]. ACM SIGARCH Computer Architecture News, 2017, 45(2):416-429.

[30] Ahmadi M , Vakili S , Langlois J . CARLA: A Convolution Accelerator With a Reconfigurable and Low-Energy Architecture[J].IEEE Transactions on Circuits and Systems I: Regular Papers, 2021,68(8):3184-3196.

[31] Moons B , Uytterhoeven R , Dehaene W , et al. 14.5 Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network processor in 28nm FDSOI[C]// IEEE International Solid-State Circuits Conference,2017:246-247.

[32] Yin S,Ouyang P, Tang S,et al.A 1.06-to-5.09 TOPS/W reconfigurable hybrid neural network processor for deep learning applications[C]//2017 Symposium on VLSI Circuits,2017:26-27.

[33]Yang C , Wang Y , Wang X , et al. WRA: A 2.2-to-6.3 TOPS Highly Unified Dynamically Reconfigurable Accelerator Using a Novel Winograd Decomposition Algorithm for Convolutional Neural Networks[J]. IEEE Transactions on Circuits and Systems I: Regular Papers,2019,66(9):3480-3493.

[34] Tu F , Wu W , Wang Y , et al. Evolver: A Deep Learning Processor With On-Device Quantization–Voltage–Frequency Tuning[J]. IEEE Journal of Solid-State Circuits, 2021, 56(2):658-673.

[35]Yun Z, Jiang L, Wang S, et al. Design of reconfigurable array processor for multimedia application[J]. Multimedia Tools and Applications, 2018, 77(3): 3639-3657.

[36]Langhammer M, Pasca B. Efficient FPGA Modular Multiplication Implementation[C]// FPGA '21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2021:217-223.

[37]Chen Y H , Krishna T , Emer J S , et al. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1):127-138.

[38]Alantali F , Halawani Y , Mohammad B , et al. SLID: Exploiting Spatial Locality in Input Data as a Computational Reuse Method for Efficient CNN[J]. IEEE Access, 2021, 9:57179-57187.

[39] Jiang-Yun Li, Yi-Kai Z, Zhuo-Er X, et al. A survey of model compression for deep neural networks[J]. Chinese Journal of Engineering, 2019, 14(2):23-27.

[40] Lu Anni, Peng Xiaochen, Luo Yandong, et al. A Runtime Reconfigurable Design of Compute-in-Memory-Based Hardware Accelerator for Deep Learning Inference[J]. ACM Transactions on Design Automation of Electronic Systems, 2021, 26(6):1-18.

[41]Yang T , He Z , Kou T , et al. BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization[J]. ACM Transactions on Reconfigurable Technology and Systems.2021,14(4):1-28.

[42]Guo L , Zhou D , Zhou J , et al. Sparseness Ratio Allocation and Neuron Re-pruning for Neural Networks Compression[C]// 2018 IEEE International Symposium on Circuits and Systems (ISCAS). 2018:1-5.

[43]Chen T H , Huang C H , Chu Y S , et al. Towards Efficient Neural Network on Edge Devices via Statistical Weight Pruning[C]// 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE). 2020:192-193.

[44]Shin D , Lee J , Lee J , et al. DNPU: An Energy-Efficient Deep-Learning Processor with Heterogeneous Multi-Core Architecture[J]. IEEE Micro, 2018, 38(5):85-93.

[45]Xiang C,Jing-dong L,Yong Z.Hardware Resource and Computational Density Efficient CNN Accelerator Design Based on FPGA[C]//2021 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA). 2022:24-26.

[46]Rybalkin V, Ney J, Tekleyohannes M, et al. When Massive GPU Parallelism Ain't Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network[J].ACM Transactions on Reconfigurable Technology and Systems. 2022,15(1):1-35.

[47]Y. Shen, M. Ferdman and P. Milder. Overcoming resource underutilization in spatial CNN accelerators[C]// 2016 26th International Conference on Field Programmable Logic and Applications (FPL).2016:1-4.

[48]Korol G, and Moraes F. G.A FPGA Parameterizable Multi-Layer Architecture for CNNs[C]// 2019 32nd Symposium on Integrated Circuits and Systems Design (SBCCI), 2019: 1-6.

[49]Shan Rui,Jiang Lin,Deng Junyong, et al.Parallel design of convolutional neural networks for remote sensing images object recognition based on data-driven array processor[J].The Journal of China Universities of Posts and Telecommunications,2020,27(06):87-100.

[50]Y. Cao, X. Wei, T. Qiao and H. Chen, FPGA-based accelerator for convolution operations[C]//2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), 2019:1-5.

[51]Irmak H, Ziener D, Alachiotis N.Increasing Flexibility of FPGA-based CNN Accelerators with Dynamic Partial Reconfiguration[C]//2021 31st International Conference on Field-Programmable Logic and Applications, 2021:306-311.

[52]Zhen X, He B, Research on FPGA High-Performance Implementation Method of CNN[C]//2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP), 2021:1177-1181.

[53]Cho M,Kim Y, Implementation of Data-optimized FPGA-based Accelerator for Convolutional Neural Network[C]//2020 International Conference on Electronics, Information, and Communication, 2020:1-2.

[54]Gilan A A , Emad M , Alizadeh B . FPGA-Based Implementation of a Real-Time Object Recognition System Using Convolutional Neural Network[J]. IEEE Transactions on Circuits and Systems II: Express Briefs. 2020,67(4):755-759.

CLC number:

 TN492    

Open access date:

 2022-06-29    
