Chinese title: | Research and Design of Reconfigurable Array Structures for Convolutional Neural Networks |
Name: | |
Student ID: | 19206204098 |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 085210 |
Discipline: | Engineering - Engineering - Control Engineering |
Student type: | Master's |
Degree level: | Master of Engineering |
Degree year: | 2022 |
Degree-granting institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research direction: | Intelligent chip research |
First supervisor's name: | |
First supervisor's institution: | |
Submission date: | 2022-06-29 |
Defense date: | 2022-06-07 |
English title: | Research and Design of Reconfigurable Array Structures for Convolutional Neural Networks |
Chinese keywords: | |
English keywords: | Convolutional Neural Network; Model Compression; Reconfigurable Array; Data Reuse; Computer Architecture |
Chinese abstract (translated): |
The complexity of convolutional neural network (CNN) models typically grows with task complexity, posing an increasingly severe challenge to the computing power of traditional processors. Reconfigurable array structures combine the high flexibility of traditional processors with the efficiency of application-specific integrated circuits, offering a new approach to artificial intelligence (AI) chip design. However, implementing CNN algorithms on reconfigurable array structures still faces problems of computational complexity and large storage demands; this thesis addresses them through in-depth optimization studies on the following aspects.

First, to relieve the pressure that network models place on the storage space of the reconfigurable array structure, the thesis proposes a model compression method that fuses pruning with quantization. Structured pruning removes neurons that are insensitive to the computation results, reducing the parameter count, and stochastic rounding quantization of the compressed floating-point parameters effectively reduces hardware resource consumption. Validation on LeNet5 and AlexNet shows an accuracy loss of about 2% and a parameter reduction of about 56.3%. Compared with pruning-only compression, the compression ratio improves by up to 19.9% while recognition accuracy is essentially unchanged or improved.

Second, to overcome the limited application scope of the current reconfigurable array structure, CNN-related instructions such as MAC, MAX, and AVE are added to the reconfigurable processing element (PE), and the corresponding hardware structures are designed in the PE's execution unit according to the new instructions. Experiments show that the CNN-oriented reconfigurable PE correctly performs convolution, pooling, and activation operations, reducing clock cycles by 58.8% compared with general-purpose instructions and hardware resource usage by 35.9% compared with similar structures.

Then, to address the large amount of repeated memory access in convolution on the reconfigurable array, a data-reuse optimization strategy based on loop tiling and loop unrolling is proposed. To exploit the advantages of the reconfigurable structure fully, the convolution loops are tiled, and loop unrolling over the convolution kernels and input feature maps is designed on top of the tiling. Tests on convolutions of various sizes show that data accesses are reduced by up to 83.6%, and compared with a sliding-window data-reuse method, the number of multiply-accumulate operations in convolution is reduced by up to 16.25%.

Finally, to verify the effectiveness of the CNN-oriented reconfigurable structure, a reconfigurable implementation scheme for AlexNet is proposed, and hardware testing and performance analysis are completed on a Xilinx ZC706 development board. The results show that, under the proposed reconfiguration scheme, the PE utilization of a single cluster reaches up to 100%, and multi-threaded execution of convolutions of all sizes achieves a speedup of up to 2.45 over single-threaded execution.

In summary, the CNN-oriented reconfigurable structure optimizations effectively improve the running efficiency of CNN algorithms, with a maximum operating frequency of 147 MHz. Compared with reference [49], the overall processing speed for AlexNet is improved by about 60.6%. Compared with references [52] and [53], more complex network structures are processed with similar hardware resource consumption. Compared with reference [54], hardware resource consumption for the same CNN is reduced by 45.8%. |
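The pruning-and-quantization fusion described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration under assumed details, not the thesis's actual implementation: `prune_rows`, the L1-norm saliency criterion, and the 8-bit fixed-point grid are all illustrative choices.

```python
import numpy as np

def prune_rows(weight, keep_ratio=0.5):
    """Structured pruning: drop whole rows (neurons) whose L1 norm is
    smallest, i.e. those least sensitive to the computation result."""
    norms = np.abs(weight).sum(axis=1)
    k = max(1, int(len(norms) * keep_ratio))
    keep = np.sort(np.argsort(norms)[-k:])   # indices of the k strongest neurons
    return weight[keep], keep

def stochastic_round_quantize(weight, bits=8):
    """Stochastic rounding onto a fixed-point grid: round up with probability
    equal to the fractional distance, so the expected rounding error is zero."""
    scale = (2 ** (bits - 1) - 1) / np.max(np.abs(weight))
    scaled = weight * scale
    floor = np.floor(scaled)
    round_up = np.random.rand(*scaled.shape) < (scaled - floor)
    return (floor + round_up) / scale        # dequantized for easy comparison

w = np.random.default_rng(0).standard_normal((8, 4))
pruned, kept = prune_rows(w, keep_ratio=0.5)
q = stochastic_round_quantize(pruned, bits=8)
print(pruned.shape, q.shape)                 # (4, 4) (4, 4): half the neurons remain
```

With `keep_ratio=0.5` the parameter count halves; the 56.3% reduction reported in the abstract reflects the thesis's own per-network settings, which are not reproduced here.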
English abstract: |
The complexity of convolutional neural network (CNN) models typically increases with the complexity of the task, which presents an ever more significant challenge to the computing power of traditional processors. Reconfigurable array structures combine the high flexibility of traditional processors with the efficiency of application-specific integrated circuits, providing new ideas for the design of artificial intelligence (AI) chips. However, implementing CNN algorithms on reconfigurable array structures still faces problems of computational complexity and large storage requirements, and this thesis conducts in-depth optimization studies on the following aspects to address them.

Firstly, the thesis proposes a compression method that fuses pruning with quantization to alleviate the pressure that network models place on the storage space of the reconfigurable array structure. A structured pruning technique removes neurons that are insensitive to the computational results, reducing the number of model parameters, and stochastic rounding quantization of the compressed floating-point parameters effectively reduces hardware resource consumption. LeNet5 and AlexNet are selected for validation; the experimental results show an accuracy loss of about 2% and a parameter reduction of about 56.3%. Compared with pruning-only compression, the compression ratio is improved by up to 19.9% with essentially unchanged or improved recognition accuracy.

Secondly, to break through the limited application scope of the current reconfigurable array structure, CNN-related instructions such as MAC, MAX, and AVE are added to the reconfigurable processing element (PE), and the corresponding hardware structures are designed in the PE's execution unit according to the new instructions. The experimental results show that the CNN-oriented reconfigurable PE accurately completes convolution, pooling, and activation operations, reducing the number of clock cycles by 58.8% compared with generic instructions and hardware resource usage by 35.9% compared with similar structures.

Then, a data-reuse optimization strategy based on loop tiling and loop unrolling is proposed to address the large amount of repeated data access in convolution on the reconfigurable array. To maximize the advantages of the reconfigurable structure, the convolution loops are tiled, and loop unrolling over the convolution kernels and input feature maps is designed on top of the tiling. Tests on convolutions of various sizes show that data accesses are reduced by up to 83.6%, and compared with a sliding-window data-reuse method, the number of multiply-accumulate operations is reduced by up to 16.25%.

Finally, to verify the effectiveness of the CNN-oriented reconfigurable structure, a reconfigurable implementation scheme for the AlexNet network is proposed, and hardware testing and performance analysis are completed on the Xilinx ZC706 development board. The results show that, under the reconfiguration scheme of this thesis, the PE utilization of a single cluster reaches up to 100%, and multi-threaded execution of convolutions of all sizes achieves a speedup of up to 2.45 over single-threaded execution.

In summary, the CNN-oriented reconfigurable structure optimizations effectively improve the running efficiency of CNN algorithms, with a maximum operating frequency of 147 MHz. Compared with reference [49], the overall processing speed for the AlexNet network is improved by approximately 60.6%. Compared with references [52] and [53], more complex network structures are processed with similar hardware resource consumption. Compared with reference [54], hardware resource consumption for processing the same convolutional neural network is reduced by 45.8%. |
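The loop-tiling claim can be made concrete with a back-of-the-envelope access-count model. This is an illustrative sketch, not the thesis's measurement: the tile size `T`, the single-channel stride-1 setting, and counting edge tiles at full size are all simplifying assumptions.

```python
def conv_naive_accesses(H, W, K):
    """No reuse: every output pixel re-reads its full KxK input window
    from external memory."""
    return (H - K + 1) * (W - K + 1) * K * K

def conv_tiled_accesses(H, W, K, T):
    """Loop tiling: each TxT output tile loads its (T+K-1)^2 input patch
    into a local buffer once; the unrolled KxK windows then read the
    buffer, so overlapping accesses never go back to external memory.
    Edge tiles are counted at full size, so this is a mild overestimate."""
    oh, ow = H - K + 1, W - K + 1
    tiles = -(-oh // T) * (-(-ow // T))      # ceil(oh/T) * ceil(ow/T)
    return tiles * (T + K - 1) ** 2

H, W, K, T = 32, 32, 3, 8                    # 32x32 input, 3x3 kernel, 8x8 tiles
naive = conv_naive_accesses(H, W, K)
tiled = conv_tiled_accesses(H, W, K, T)
print(naive, tiled, f"{1 - tiled / naive:.1%}")   # 8100 1600 80.2%
```

Even this crude model shows an ~80% reduction in external accesses for a small 3x3 case; the 83.6% figure in the abstract comes from the thesis's own tiling and unrolling parameters on the reconfigurable array.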
References: |
|
CLC number: | TN492 |
Open-access date: | 2022-06-29 |