Thesis Information

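Title:	SIREA Chip Test Benchmark and Test Platform Development for Navigation and Positioning
Author:	Zhu Jiayang (朱家扬)
Student ID:	21207223073
Confidentiality level:	Confidential (to be opened after 4 years)
Language:	chi (Chinese)
Discipline code:	085400
Discipline:	Engineering - Electronic Information
Student type:	Master's student
Degree:	Master of Engineering
Degree year:	2024
University:	Xi'an University of Science and Technology
College:	College of Communication and Information Engineering
Major:	Electronic Information
Research area:	Application-specific integrated circuit (ASIC) design
Supervisor:	Jiang Lin (蒋林)
Supervisor's institution:	Xi'an University of Science and Technology
Submission date:	2024-06-24
Defense date:	2024-06-06
English title:	SIREA Chip Test Benchmark and Platform Development for Navigation and Positioning
Keywords:	Self-reconfiguration ; Navigation and positioning ; Artificial intelligence chip ; Test method ; Test benchmark ; Test platform
English keywords:	Self-reconfigurable ; Navigation and Positioning ; Artificial Intelligence Chip ; Test Method ; Test Benchmark ; Test Platform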
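Abstract:	In recent years, LiDAR-based Simultaneous Localization and Mapping (SLAM) algorithms have played an important role in solving high-precision, real-time localization, navigation, and mapping problems. As algorithm structures are continually improved and iterated, SLAM systems have become increasingly complex, and general-purpose processors can no longer adequately meet the computational demands of such applications. Meanwhile, chip computing architectures have begun to evolve toward customization, giving rise to a series of artificial intelligence (AI) chips for autonomous driving. Standard benchmark programs and test metrics are essential in the design and development of AI processors. This thesis proposes a benchmark suite for navigation and positioning scenarios. It is intended to evaluate current navigation and positioning hardware objectively, judge whether a processor design is sound, compare the strengths and weaknesses of different processor designs, guide system optimization at both the software and hardware levels, and help researchers design more efficient autonomous-driving AI chips.
First, for the test object of this thesis, the self-reconfigurable, self-evolving AI chip (SIREA [1]) developed under a major national project, core metrics including multi-task switching time, operating frequency, full-array power consumption, and chip computing power are selected as the test focus, and functional fitting and performance-data conversion of the prototype system are carried out based on the design targets of the chip sample. The test methods for these metrics are studied and their test flows are defined. Built-in self-test (BIST) is studied: a 16-stage linear feedback shift register (LFSR) generates 65535 test vectors for built-in circuit self-test. A test method for the Advanced Extensible Interface (AXI) bus is also studied; experimental results show that configuration information can be correctly delivered over the AXI bus and that computation results can be written back over it.
Second, the Profile performance analysis tool is used to obtain the modular characteristics and operator-level features of the benchmark algorithms, from which the navigation and positioning benchmark suite, consisting of MacroBenchmark and MicroBenchmark, is proposed. The former contains four SLAM algorithm modules: point cloud matching, map construction, loop closure detection, and graph optimization. The latter contains four typical matrix operations: matrix transpose, matrix multiplication, matrix decomposition, and matrix inversion. A method for converting floating-point data to integers is also studied; balancing resource usage against computational error, the operand width is quantized to 16 bits with 8 fractional bits in fixed point. A quantization-error compensation method is studied as well; a three-level residual compensation strategy keeps the cumulative trajectory error on the order of 10^-5, which meets the accuracy requirements of navigation scenarios.
Next, an overall prototype test-platform scheme built around "RK3588+VU440" as its core components is proposed, and a data communication module based on the Peripheral Component Interconnect Express (PCIe) interface is designed. A multi-channel Double Data Rate (DDR) memory controller with a fixed-priority arbitration strategy is designed so that data are read and written in order; the arbitration circuit runs at 247.7 MHz, uses 181 look-up tables (LUTs) and 156 flip-flops (FFs), and has a minimum delay of 4.037 ns. The whole platform is functionally tested with an accumulation program that traverses all processing elements (PEs), and the measured results agree with the on-chip PE integration scale.
Finally, prototype tests of the SIREA chip are carried out with the navigation and positioning benchmark suite. A self-reconfigurable PE-array mapping scheme is designed for the suite, multi-task switching is introduced to extend it to more complex algorithms, and algorithm execution is verified with key performance metrics such as mean relative error and number of memory accesses. Hardware experiments on a VU440 development board show that the mapping scheme achieves high-speed parallel computation while introducing only about 0.1% computational error. The proposed benchmark suite is well suited to the SIREA chip and effectively addresses algorithm compatibility and computational accuracy. The measured multi-task switching time is 35 cycles, full-array power consumption is 8.06 W, operating frequency is 133.5 MHz, and chip computing power is 36.85 GOPS; all metrics meet the expected results for the prototype system.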
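The BIST step in the abstract generates 65535 vectors with a 16-stage LFSR. 65535 = 2^16 - 1 is exactly the period of a maximal-length 16-bit LFSR, which steps through every non-zero 16-bit state once before repeating. The Python sketch below illustrates that property. The feedback polynomial x^16 + x^14 + x^13 + x^11 + 1, the Fibonacci (external-XOR) form, the seed 0xACE1, and the function name `lfsr16_sequence` are assumptions chosen for illustration; the record does not say which polynomial, LFSR form, or seed the SIREA BIST circuit actually uses.

```python
def lfsr16_sequence(seed: int = 0xACE1) -> list[int]:
    """Return one full period of test vectors from a 16-bit maximal-length LFSR.

    Assumed (hypothetical) feedback polynomial: x^16 + x^14 + x^13 + x^11 + 1,
    implemented in Fibonacci form with a right shift.
    """
    if not 0 < seed <= 0xFFFF:
        # The all-zero state is a lock-up state for an XOR-feedback LFSR.
        raise ValueError("seed must be a non-zero 16-bit value")
    vectors = []
    state = seed
    while True:
        vectors.append(state)
        # Feedback bit: XOR of taps 16, 14, 13, 11, i.e. bit positions 0, 2, 3, 5.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        # Shift right and feed the new bit into the MSB (bit 15).
        state = (state >> 1) | (bit << 15)
        if state == seed:  # back at the seed: one full period has been collected
            break
    return vectors


vectors = lfsr16_sequence()
print(len(vectors))       # 65535
print(len(set(vectors)))  # 65535: each non-zero 16-bit pattern appears exactly once
print(0 in vectors)       # False: the all-zero state is never visited
```

In hardware the same recurrence is just a 16-bit shift register plus three 2-input XOR gates, which is why LFSRs are a common pattern generator for built-in self-test.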
English abstract:	In recent years, LiDAR-based Simultaneous Localization and Mapping (SLAM) algorithms have played an important role in high-precision, real-time localization, navigation, and map construction. As algorithm structures are continually improved and iterated, SLAM systems have become more and more complex, and general-purpose processors can no longer meet the computing needs of such applications. At the same time, chip architectures have begun to evolve toward customization, and a series of autonomous-driving Artificial Intelligence (AI) chips have emerged. Standard benchmark programs and test metrics are essential in the design and development of AI processors. This thesis proposes a benchmark suite that can be used to evaluate current navigation and positioning hardware, judge the soundness of a processor design, compare the strengths and weaknesses of different processor designs, guide system optimization at the hardware and software levels, and help researchers design more efficient autonomous-driving AI chips.
Firstly, for the test object of this thesis, the self-reconfigurable and self-evolving AI chip (SIREA) developed under a major national project, core metrics such as multi-task switching time, operating frequency, full-array power consumption, and chip computing power are selected as the test focus, and functional fitting and performance-data conversion of the prototype system are carried out based on the design targets of the chip sample. The test methods for these metrics are studied and their test flows are defined. Built-In Self-Test (BIST) technology is studied, and a 16-stage Linear Feedback Shift Register (LFSR) generates 65535 test vectors for built-in circuit self-test. A test method for the Advanced Extensible Interface (AXI) bus is also studied; the experimental results show that configuration information can be correctly sent over the AXI bus and that calculation results can be written back over it.
Secondly, the modular and operator-level characteristics of the benchmark algorithms are obtained with the Profile performance analysis tool, and the navigation and positioning benchmark suite, comprising MacroBenchmark and MicroBenchmark, is proposed. The former includes four SLAM algorithm modules: point cloud matching, map construction, loop closure detection, and graph optimization; the latter includes four typical matrix operations: matrix transpose, matrix multiplication, matrix decomposition, and matrix inversion. A method for converting floating-point data to integers is studied; by balancing resource usage against calculation error, the operand width is quantized to 16 bits with 8 fractional bits. A quantization-error compensation method is also studied, and the three-level residual compensation strategy keeps the cumulative trajectory error on the order of 10^-5, which meets the accuracy requirement of navigation scenarios.
Then, an overall prototype test platform with "RK3588+VU440" as its core components is proposed, and a data communication module based on the Peripheral Component Interconnect Express (PCIe) interface is designed. A multi-channel Double Data Rate (DDR) memory controller with a fixed-priority arbitration strategy is designed so that data are read and written in order. The arbitration circuit runs at 247.7 MHz, uses 181 Look-Up Tables (LUTs) and 156 Flip-Flops (FFs), and has a minimum delay of 4.037 ns. The overall function of the test platform is verified with an accumulation program that traverses all Processing Elements (PEs), and the measured results are consistent with the on-chip PE integration scale.
Finally, the prototype test of the SIREA chip is carried out with the navigation and positioning benchmark suite. A self-reconfigurable PE-array mapping scheme is designed, and multi-task switching is used to extend the benchmarks to more complex algorithms. Algorithm execution is verified with key performance metrics such as mean relative error and number of memory accesses. Hardware experiments on the VU440 board show that the mapping scheme supports high-speed parallel computation while introducing only about 0.1% error. The proposed benchmark suite is well suited to the SIREA chip and addresses the problems of algorithm compatibility and calculation accuracy. The measured multi-task switching time is 35 cycles, the full-array power consumption is 8.06 W, the operating frequency is 133.5 MHz, and the chip computing power is 36.85 GOPS; all metrics are in line with the expected results for the prototype system.
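The abstract fixes operands at 16 bits with 8 fractional bits, which corresponds to a Q8.8 fixed-point format with a resolution of 2^-8 = 0.00390625. The Python sketch below shows one plausible way to perform that float-to-integer conversion and a fixed-point multiply. Signed two's-complement storage, round-to-nearest conversion, saturation on overflow, the product rounding in `fixed_mul`, and all helper names are assumptions for illustration, not the thesis's documented method; the three-level residual compensation is not modeled here.

```python
FRAC_BITS = 8                        # fractional bits, as stated in the abstract
WORD_BITS = 16                       # operand width, as stated in the abstract
SCALE = 1 << FRAC_BITS               # 256: one LSB equals 2**-8
Q_MIN = -(1 << (WORD_BITS - 1))      # -32768 (assumes signed two's complement)
Q_MAX = (1 << (WORD_BITS - 1)) - 1   # 32767


def saturate(q: int) -> int:
    """Clamp to the signed 16-bit range instead of letting the value wrap."""
    return max(Q_MIN, min(Q_MAX, q))


def to_fixed(x: float) -> int:
    """Quantize a float to Q8.8 (round to nearest, ties to even, then saturate)."""
    return saturate(round(x * SCALE))


def to_float(q: int) -> float:
    """Convert a Q8.8 integer back to a float."""
    return q / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 operands.

    The raw product carries 16 fractional bits; adding half an LSB and
    arithmetic-shifting right by 8 rounds it back to Q8.8 (ties round up).
    """
    return saturate((a * b + (1 << (FRAC_BITS - 1))) >> FRAC_BITS)


a = to_fixed(1.2345)   # 316, which represents 1.234375
b = to_fixed(-0.75)    # -192, which represents -0.75 exactly
p = fixed_mul(a, b)    # -237, which represents -0.92578125 (exact product: -0.925875)
print(a, b, p, to_float(p))  # 316 -192 -237 -0.92578125
```

With round-to-nearest, any in-range value is converted with an error of at most half an LSB (2^-9, about 0.00195), and the representable range is [-128, 127.99609375]; saturating rather than wrapping keeps an overflow from flipping the sign of a result.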
References:	[1] Herdt V, Große D, Pieper P, et al. RISC-V based virtual prototype: an extensible and configurable platform for the system-level[J]. Journal of Systems Architecture, 2020, 109(13): 10-17.
[2] Wang Y, Chang X, Zhu H, et al. Towards secure runtime customizable trusted execution environment on FPGA-SoC[J]. IEEE Transactions on Computers, 2024, 28(60): 1-12.
[3] Zhang W. Current status and development trends of hardware benchmarking for deep neural networks[J]. Information and Communications Technology and Policy, 2019, 45(12): 74-79. (in Chinese)
[4] Xu Q, An H, Wu Z, Jin X. Hardware design and performance analysis of mainstream convolutional neural networks[J]. Computer Systems & Applications, 2020, 29(02): 49-57. (in Chinese)
[5] Cai Y, Ou Y, Qin T. Improving SLAM techniques with integrated multi-sensor fusion for 3D reconstruction[J]. Sensors, 2024, 24(7): 20-33.
[6] Li L, Schulze L, Kalavadia K. Promising SLAM methods for automated guided vehicles and autonomous mobile robots[J]. Procedia Computer Science, 2024, 232: 2867-2874.
[7] Lu Y, Liu L, Zhu J, et al. Architecture, challenges and applications of dynamic reconfigurable computing[J]. Journal of Semiconductors, 2020, 41(2): 4-13.
[8] Wei S, Li Z, Zhu J, et al. Reconfigurable computing: a software-definable computing engine[J]. Scientia Sinica Informationis, 2020, 50(09): 1407-1426. (in Chinese)
[9] Fan H, Liu S, Ferianc M, et al. A real-time object detection accelerator with compressed SSDLite on FPGA[C]//2018 International Conference on Field-Programmable Technology (FPT). IEEE, 2018, 21(03): 14-21.
[10] Liang M, Chen M, Wang Z, et al. A CGRA based neural network inference engine for deep reinforcement learning[C]//2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS). IEEE, 2018, 43(16): 540-543.
[11] Chen H, Dai Y, Xue R, et al. Towards efficient microarchitecture design of simultaneous localization and mapping in augmented reality era[C]//2018 IEEE 36th International Conference on Computer Design (ICCD). IEEE, 2018, 22(30): 397-404.
[12] Guo C, Zhou Y, Leng J, et al. Balancing efficiency and flexibility for DNN acceleration via temporal GPU-systolic array integration[C]//2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 2020, 19(17): 1-6.
[13] Zhao Y L, Hong Y T, Huang H P. Comprehensive performance evaluation between Visual SLAM and LiDAR SLAM for mobile robots: Theories and Experiments[J]. Applied Sciences, 2024, 14(9): 39-45.
[14] Liu M, Xu G, Tang T, et al. A survey of LiDAR SLAM algorithms[J]. Computer Engineering and Applications, 2024, 60(01): 1-14. (in Chinese)
[15] Grisetti G, Stachniss C, Burgard W. Improved techniques for grid mapping with Rao-Blackwellized particle filters[J]. IEEE Transactions on Robotics, 2007, 23(1): 34-46.
[16] Kohlbrecher S, Von Stryk O, Meyer J, et al. A flexible and scalable SLAM system with full 3D motion estimation[C]//2011 IEEE International Symposium on Safety, Security, and Rescue Robotics. IEEE, 2011, 35(11): 155-160.
[17] Li X, Li G, Suo S, et al. Advances in laser SLAM technology[J]. Journal of Navigation and Positioning, 2023, 11(04): 8-17. (in Chinese)
[18] Zhang J, Singh S. Low-drift and real-time lidar odometry and mapping[J]. Autonomous Robots, 2017, 41(13): 401-416.
[19] Xue G, Wei J, Li R, et al. LeGO-LOAM-SC: An improved simultaneous localization and mapping method fusing LeGO-LOAM and scan context for underground coalmine[J]. Sensors, 2022, 22(2): 5-20.
[20] Zaman N A B, Abdul-Rahman S, Mutalib S, et al. Applying graph-based SLAM algorithm in a simulated environment[C]//IOP Conference Series: Materials Science and Engineering. IOP Publishing, 2020, 769(1): 12-35.
[21] Hess W, Kohler D, Rapp H, et al. Real-time loop closure in 2D LIDAR SLAM[C]//2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, 49(11): 1271-1278.
[22] Li Z X, Cui G H, Li C L, et al. Comparative study of SLAM algorithms for mobile robots in complex environment[C]//2021 6th International Conference on Control, Robotics and Cybernetics (CRC). IEEE, 2021, 291(32): 74-79.
[23] Li B, Qi P, Liu B, et al. Trustworthy AI: From principles to practices[J]. ACM Computing Surveys, 2023, 55(9): 1-46.
[24] Tao J H, Du Z D, Guo Q, et al. BENCHIP: Benchmarking intelligence processors[J]. Journal of Computer Science and Technology, 2018, 33(11): 1-23.
[25] Lei L, Ma C. Research on computing power test of car-level chips based on neural network algorithm[C]//International Conference on Mechatronic Engineering and Artificial Intelligence (MEAI 2023). SPIE, 2024, 130(71): 627-633.
[26] Ihde N, Marten P, Eleliemy A, et al. A survey of big data, high performance computing, and machine learning benchmarks[C]//Performance Evaluation and Benchmarking: 13th TPC Technology Conference, TPCTC 2021, Copenhagen, Denmark, August 20, 2021, Revised Selected Papers 13. Springer International Publishing, 2022, 17(52): 98-118.
[27] Elordi U, Unzueta L, et al. Benchmarking deep neural network inference performance on serverless environments with MLPerf[J]. IEEE Software, 2020, 38(1): 81-87.
[28] Hodak M, Ellison D, Dholakia A. Benchmarking AI inference: where we are in 2020[C]//Technology Conference on Performance Evaluation and Benchmarking. Cham: Springer International Publishing, 2020, 22(33): 93-102.
[29] Reddi V J, Cheng C, et al. MLPerf inference benchmark[C]//2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, 124(11): 446-459.
[30] Zhai J. AIPerf: a benchmark program for large-scale artificial intelligence computing power[J]. Big Data Research, 2021, 7(03): 153-155. (in Chinese)
[31] Ren Z, Liu Y, Shi T, et al. AIPerf: Automated machine learning as an AI-HPC benchmark[J]. Big Data Mining and Analytics, 2021, 4(3): 208-220.
[32] Thiyagalingam J, Shankar M, Fox G, et al. Scientific machine learning benchmarks[J]. Nature Reviews Physics, 2022, 4(6): 413-420.
[33] Coleman C, Kang D, Narayanan D, et al. Analysis of DAWNBench, a time-to-accuracy machine learning performance benchmark[J]. ACM SIGOPS Operating Systems Review, 2019, 53(1): 14-25.
[34] Zhang W, Wei W, Xu L, et al. AI Matrix: a deep learning benchmark for Alibaba data centers[J]. Computer Science, 2019, 46(7): 158-160.
[35] Zhang W M, Zhang L, Zhang Z, et al. IBD: The metrics and evaluation method for DNN processor benchmark while doing Inference task[J]. Journal of Intelligent & Fuzzy Systems, 2021, 40(5): 9949-9961.
[36] Lee J, Lee H, Lee S, et al. A new ISA for high-speed and area-efficient ALPG[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 22(14): 18-21.
[37] Hughes G D, Peek S E, Shah A, et al. Increasing chip-to-substrate spacing using in capped SnPb pillars as flip chip interconnects for physical isolation in superconducting applications[J]. IEEE Transactions on Applied Superconductivity, 2024, 37(22): 13-24.
[38] Leon V, Bezaitis C, Lentaris G, et al. FPGA & VPU Co-Processing in space Applications: development and testing with DSP/AI Benchmarks[C]//2021 28th IEEE International Conference on Electronics, Circuits, and Systems (ICECS). IEEE, 2021, 45(10): 1-5.
[39] Kojima T, Ando N, et al. Real chip evaluation of a low power CGRA with optimized application mapping[C]//Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. 2018, 19(21): 1-6.
[40] Nawarathna L, Udugampola N, Yasawardhana Y, et al. Low-cost automatic test equipment for digital ICs using DE0-Nano-Altera Cyclone IV FPGA[C]//2021 3rd International Conference on Electrical, Control and Instrumentation Engineering (ICECIE). IEEE, 2021, 33(14): 1-4.
[41] Patel A, Gosain V, Mohal R S, et al. FPGA based low-cost portable tester with on-board supplies[C]//2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA). IEEE, 2020, 19(26): 301-306.
[42] Shan R, Jiang L, Wu H, et al. Dynamical self-reconfigurable mechanism for data-driven cell array[J]. Journal of Shanghai Jiaotong University (Science), 2021, 26(4): 511-521.
[43] Yang K, Jiang L, Shan R, et al. RMSRM: real-time monitoring-based self-reconfiguration mechanism in reconfigurable PE array[J]. The Journal of Supercomputing, 2024, 80(5): 7071-7101.
[44] Anders J, Bardin J C, Bashir I, et al. CMOS integrated circuits for the quantum information sciences[J]. IEEE Transactions on Quantum Engineering, 2023, 129(4): 1-30.
[45] Monica K M. Design and study of system on chip design for signal processing applications in terms of energy and area[J]. Materials Today: Proceedings, 2023, 80(12): 3252-3262.
[46] Lei L, Ma C. Research on computing power test of car-level chips based on neural network algorithm[C]//International Conference on Mechatronic Engineering and Artificial Intelligence (MEAI 2023). SPIE, 2024, 130(71): 627-633.
[47] Rabehi A, Garlan B, Achtsnicht S, et al. Magnetic detection structure for lab-on-chip applications based on the frequency mixing technique[J]. Sensors, 2018, 18(6): 17-47.
[48] Tapio V, Hemadeh I, Mourad A, et al. Survey on reconfigurable intelligent surfaces below 10 GHz[J]. EURASIP Journal on Wireless Communications and Networking, 2021(1): 17-55.
[49] Liu M, Wang Y, Li S. A field programmable gate array placement methodology for netlist-level circuits with GPU acceleration[J]. Electronics, 2023, 13(1): 13-27.
[50] Wei F, Cui X. An In-Array Build-In Self-Test Scheme for Embedded SRAM Array[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 20(4): 18-31.
[51] Ma B, Cui L, Li M, et al. Research on a tightly coupled LiDAR/IMU SLAM algorithm in open-pit coal mine environments[J]. Coal Science and Technology, 2024, 52(03): 236-244. (in Chinese)
[52] Chang L, Zhang S. Design of a reconfigurable processor for mixed-quantization CNNs[J]. Journal of Northwestern Polytechnical University, 2022, 40(02): 344-351. (in Chinese)
[53] Wei S, Lin X, Tu F, et al. Reconfigurability, why it matters in AI tasks processing: a survey of reconfigurable AI chips[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2022, 136(19): 410-425.
[54] Yang J, Zheng H, Louri A. Venus: A versatile deep neural network accelerator architecture design for multiple applications[C]//Proceedings of Design Automation Conference (DAC). 2023, 47(10): 156-162.
[55] Chen Y, Liu S, Lombardi F, et al. A technique for approximate communication in network-on-chips for image classification[J]. IEEE Transactions on Emerging Topics in Computing, 2022, 11(1): 30-42.
[56] Moon S, Lee K J, Mun H G, et al. An 8.9–71.3 TOPS/W deep learning accelerator for arbitrarily quantized neural networks[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2022, 69(10): 4148-4152.
[57] Yin S, Ouyang P, Tang S, et al. A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications[C]//2017 Symposium on VLSI Circuits. IEEE, 2017, 45(62): 26-27.
[58] Li W, Hu A, Wang G, et al. Low-complexity precision-scalable multiply-accumulate unit architectures for deep neural network accelerators[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2022, 70(4): 1610-1614.
[59] Xia Z, Chen J, Huang Q, et al. Neural synaptic plasticity-inspired computing: A high computing efficient deep convolutional neural network accelerator[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2020, 68(2): 728-740.
[60] Wu D, Guo M. Design of a ZYNQ-based general-purpose Gaussian test platform for encoding and decoding[J]. Radio Engineering, 2023, 53(02): 465-470. (in Chinese)
[61] Maragkoudaki E, Toms W, Pavlidis V F. Energy-efficient encoding for high-speed serial interfaces[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022, 30(10): 1484-1496.
[62] Nag S N. Technical analysis of PCIe to PCIe 6: a next-generation interface evolution[J]. World Journal of Engineering and Technology, 2023, 11(3): 504-525.
[63] Liu J, Huang X, Luo Y, et al. Dynamic priority arbitration strategy for hardware interconnect systems of spiking neural networks[J]. Acta Electronica Sinica, 2018, 46(08): 1898-1905. (in Chinese)
[64] Mahalat M H, Mandal S, Mondal A, et al. An efficient implementation of arbiter PUF on FPGA for IoT application[C]//2019 32nd IEEE International System-on-Chip Conference (SOCC). IEEE, 2019, 19(8): 324-329.
[65] Kulkarni S K, Vani R M, Hunagund P V. Implementation of arbiter physical unclonable function on the Xilinx system on chip FPGA[C]//2022 9th International Conference on Computing for Sustainable Global Development (INDIACom). IEEE, 2022, 48(23): 648-653.
[66] Gogula S, Damodaran V. Design of a VLSI router for the faster data transmission using buffer[C]//2023 2nd International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN). IEEE, 2023, 15(33): 1-5.
[67] Yu Z, Bouganis C S. A parameterisable FPGA-tailored architecture for YOLOv3-tiny[C]//Applied Reconfigurable Computing. Architectures, Tools, and Applications: 16th International Symposium, ARC 2020, Toledo, Spain, April 1–3, 2020, Proceedings 16. Springer International Publishing, 2020, 35(152): 330-344.
[68] Zhang H, Jiang J, Fu Y, et al. Yolov3-tiny object detection SoC based on FPGA platform[C]//2021 6th International Conference on Integrated Circuits and Microsystems (ICICM). IEEE, 2021, 42(11): 291-294.
[69] Chang L, Zhang S, Du H, et al. A reconfigurable neural network processor with tile-grained multicore pipeline for object detection on FPGA[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2021, 29(11): 1967-1980.
[70] Xuan L, Un K F, Lam C S, et al. An FPGA-based energy-efficient reconfigurable depth wise separable convolution accelerator for image recognition[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2022, 69(10): 4003-4007.
CLC classification number:	TN492
Open access date:	2028-06-26
