Thesis Information

Chinese title:

 Research on Data Reuse and Storage Optimization of Reconfigurable Structure for Transformer Network (面向Transformer网络的可重构结构数据复用和存储优化研究)

Name:

 Zhang Dingyue (张丁月)

Student ID:

 20206035030    

Confidentiality level:

 Confidential (open after 1 year)

Language:

 Chinese (chi)

Discipline code:

 080903    

Discipline name:

 Engineering - Electronic Science and Technology (may confer Engineering or Science degrees) - Microelectronics and Solid-State Electronics

Student type:

 Master's candidate

Degree level:

 Master of Engineering

Degree year:

 2023    

Degree-granting institution:

 Xi'an University of Science and Technology

School:

 College of Electrical and Control Engineering

Major:

 Electronic Science and Technology

Research direction:

 Integrated circuit design

First supervisor:

 Jiang Lin (蒋林)

First supervisor's institution:

 Xi'an University of Science and Technology

Thesis submission date:

 2023-07-06    

Thesis defense date:

 2023-06-01    

English title:

 Research on Data Reuse and Storage Optimization of Reconfigurable Structure for Transformer Network    

Chinese keywords:

 Neural network; Reconfigurable structure; Data reuse; Storage structure; Model compression

English keywords:

 Neural network; Reconfigurable structure; Data reuse; Storage structure; Model compression

Chinese abstract:

Reconfigurable chips address the computational efficiency, computational parallelism, and application flexibility problems common to many applications, and will find wide use in artificial intelligence. The Transformer is a neural network that processes data in parallel and is built on the self-attention mechanism. Because of mismatched hardware architectures, existing general-purpose neural network accelerators suffer low hardware resource utilization and low scheduling efficiency, and cannot effectively accelerate Transformer computation. Targeting the Transformer network's heavy computation and large storage demand, this thesis studies data reuse and storage-structure optimization for a reconfigurable architecture oriented to Transformer networks.

First, the computation and parameter counts of Transformer network models are profiled and analyzed to select the most suitable model compression method. Combining a rewriting-based network pruning method with a knowledge distillation method based on scaled dot-product attention, a fused pruning-and-distillation compression scheme for Transformer models is proposed. For the object detection task, backbone network models are trained on the ImageNet-1K dataset to verify the effectiveness of the proposed fused compression scheme. Experimental results show that, across different width and depth configurations, the optimal sub-networks of the backbone models reduce parameter counts by 52.8% on average, reduce floating-point operations (FLOPs) by 62.8% on average, and improve accuracy by 7.2% on average.

Second, the data transfer mechanism of the Transformer encoder-decoder module is studied, along with image feature extraction, multi-head self-attention (MSA) computation, and the data reuse process of the Transformer network on a reconfigurable array architecture. A data reuse scheme for the Transformer network on the reconfigurable array is proposed, introducing a reversed-order loop weight-sharing method that shares weights between the higher and lower layers of the MSA module. The optimal Swin-T sub-network is selected for simulation verification and performance analysis. Experimental results show that, compared with the compressed Swin-T, the Swin-T with data reuse reduces parameters by 45.7% and FLOPs by 52.6%, with a 1% drop in accuracy.

Then, the reconfigurable-array storage is optimized and designed using on-chip and off-chip memory access techniques. A multi-level distributed storage structure is proposed to balance the large-capacity and high-speed demands of compute memory access, combining scratchpad memory (SPM) and cache into a multi-level distributed hierarchy. The SPM exchanges data with the processing element (PE) array over a bus, and with main memory via direct memory access (DMA). In addition, the cache module is partitioned into multiple buffer regions, which benefits data caching for the Transformer network. Experimental results show that, compared with related work, this storage structure raises peak bandwidth by 75% and reduces hit latency by about 75%.

Finally, to verify the effectiveness of the storage-structure optimization of the reconfigurable array for the Transformer network, a field-programmable gate array (FPGA) prototype system of the reconfigurable array is designed and tested. Based on the experimental test and verification platform, a test scheme for the multi-level distributed storage structure oriented to the Transformer network is proposed. A data reuse scheme for the optimal Swin-T sub-network on the reconfigurable array is designed, and the parallel computation of the MSA module and the feed-forward network (FFN) module is implemented and tested. The optimized structure is simulated, verified, and analyzed on a Xilinx ZC706 development board. Experimental results show that the reconfigurable array processor correctly executes MSA computation and data reuse, runs the data flow in parallel across layers, and keeps resource utilization below 65%. Using the multi-level distributed storage structure raises overall PE utilization by 10%.

English abstract:

Reconfigurable chips solve the problems of computational efficiency, parallelism, and flexibility that are common to many applications, and will be widely used in artificial intelligence. The Transformer network is a neural network that processes data in parallel and is based on the self-attention mechanism. Existing general-purpose neural network accelerators cannot effectively accelerate Transformer computation because of mismatched hardware architectures, low hardware resource utilization, and low scheduling efficiency. Aiming at the Transformer network's complex computation and large storage demand, this thesis studies the data reuse and storage-structure optimization of a reconfigurable architecture oriented to Transformer networks.

Firstly, the computation and parameter counts of the Transformer network model are profiled and analyzed, and the most suitable model compression method is selected. Using a rewriting-based network pruning method and a knowledge distillation method based on scaled dot-product attention, a fused compression scheme combining Transformer model pruning and knowledge distillation is proposed. For the object detection task, the backbone network models are trained on the ImageNet-1K dataset to verify the effectiveness of the proposed fused compression scheme. The experimental results show that the optimal sub-networks of the backbone models reduce parameters by 52.8% on average, reduce floating-point operations (FLOPs) by 62.8% on average, and improve accuracy by 7.2% on average.
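The distillation half of the fused scheme can be sketched as a temperature-scaled soft-target loss blended with the ordinary hard-label loss. The following is a minimal NumPy illustration, not the thesis's implementation; the temperature `T` and blend weight `alpha` are assumed example values.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a soft-target (teacher-matching) term with hard-label cross-entropy."""
    p_t = softmax(teacher_logits, T)              # teacher soft targets
    log_p_s = np.log(softmax(student_logits, T))  # student log-probs at the same T
    kd = -(p_t * log_p_s).sum(axis=-1).mean() * (T ** 2)  # soft-target term
    log_p = np.log(softmax(student_logits))       # hard-label cross-entropy
    ce = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * kd + (1 - alpha) * ce

# toy batch: 2 samples, 3 classes (all values hypothetical)
teacher = np.array([[5.0, 1.0, 0.5], [0.2, 4.0, 1.0]])
student = np.array([[2.0, 1.0, 0.0], [0.0, 2.5, 0.5]])
loss = distillation_loss(student, teacher, labels=np.array([0, 1]))
```

The `T ** 2` factor is the standard rescaling that keeps the soft-target gradient magnitude comparable across temperatures.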

Secondly, the data transfer mechanism of the Transformer encoder-decoder module is studied, together with image feature extraction, multi-head self-attention (MSA) computation, and the data reuse process of the Transformer network on the reconfigurable array architecture. A data reuse scheme for the Transformer network on the reconfigurable array is proposed, and a reversed-order loop weight-sharing method is introduced to share weights between the upper and lower layers of the MSA module. The optimal Swin-T sub-network is selected for simulation verification and performance analysis. The experimental results show that, compared with the compressed Swin-T, the Swin-T with data reuse reduces parameters by 45.7% and FLOPs by 52.6%, with a 1% drop in accuracy.
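Reversed-order weight sharing can be illustrated by a stack in which only the lower half of the layers owns attention weights, while the upper half walks that weight list backwards, roughly halving the parameter count. This is a minimal single-head sketch assuming a mirror-style sharing pattern; the thesis's exact indexing scheme and multi-head details are not reproduced here.

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token matrix x."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n_layers, d = 6, 8
# only the first half of the stack owns (Wq, Wk, Wv) triples
owned = [tuple(rng.normal(size=(d, d)) for _ in range(3))
         for _ in range(n_layers // 2)]

def layer_weights(i):
    """Lower layers use their own weights; upper layers reuse them in reverse order."""
    if i < n_layers // 2:
        return owned[i]
    return owned[n_layers - 1 - i]  # layer 5 -> owned[0], layer 4 -> owned[1], ...

x = rng.normal(size=(4, d))  # 4 tokens of width d
for i in range(n_layers):
    x = attention(x, *layer_weights(i))
```

With 6 layers only 3 weight triples are stored, which is the source of the parameter reduction; accuracy cost depends on how well mirrored layers tolerate shared weights.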

Then, the reconfigurable-array storage is optimized and designed using on-chip and off-chip memory access techniques. A multi-level distributed storage structure is proposed to balance the large-capacity and high-speed demands of compute memory access: scratchpad memory (SPM) and cache are combined to form a multi-level distributed hierarchy. The SPM exchanges data with the processing element (PE) array over a bus, and with main memory via direct memory access (DMA). Moreover, the cache module is divided into multiple buffer regions, which benefits data caching for the Transformer network. The experimental results show that, compared with related work, the storage structure increases peak bandwidth by 75% and reduces hit latency by about 75%.
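The latency benefit of keeping hot data near the PE array can be illustrated with a toy direct-mapped cache model in front of a slow main memory; an SPM would simply be a software-managed region with a fixed single-cycle access. All sizes and latencies below are illustrative assumptions, not the thesis's measured parameters.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: word addresses, tag-per-line, no write handling."""

    def __init__(self, n_lines, line_words, hit_cycles=1, miss_cycles=100):
        self.n_lines, self.line_words = n_lines, line_words
        self.hit_cycles, self.miss_cycles = hit_cycles, miss_cycles
        self.tags = [None] * n_lines  # one tag per cache line
        self.hits = self.misses = 0

    def access(self, addr):
        """Return the cycle cost of reading word `addr`."""
        line = (addr // self.line_words) % self.n_lines
        tag = addr // (self.line_words * self.n_lines)
        if self.tags[line] == tag:
            self.hits += 1
            return self.hit_cycles
        self.tags[line] = tag  # fill the line from main memory
        self.misses += 1
        return self.miss_cycles

cache = DirectMappedCache(n_lines=16, line_words=8)
# sequential sweep: one miss per line fill, then hits within each line
cycles = sum(cache.access(a) for a in range(256))
hit_rate = cache.hits / (cache.hits + cache.misses)
```

For this sequential pattern the model gives one miss per 8-word line (32 fills, 224 hits), showing why average latency collapses once reused data stays on-chip.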

Finally, in order to verify the effectiveness of the storage-structure optimization of the reconfigurable array for the Transformer network, a field-programmable gate array (FPGA) prototype system of the reconfigurable array is designed and tested. Based on the experimental test and verification platform, a test scheme for the multi-level distributed storage structure oriented to the Transformer network is proposed. A data reuse scheme for the optimal Swin-T sub-network on the reconfigurable array is designed, and the parallel computation of the MSA module and the feed-forward network (FFN) module is implemented and tested. The optimized structure is simulated, verified, and analyzed on a Xilinx ZC706 development board. The experimental results show that the reconfigurable array processor correctly performs MSA computation and data reuse, executes the data flow in parallel across layers, and keeps resource utilization below 65%. Using the multi-level distributed storage structure raises overall PE utilization by 10%.

CLC number:

 TN492    

Open access date:

 2024-07-06    

