Chinese title: | Research on Path Planning of Mobile Robot Based on Deep Reinforcement Learning |
Name: | |
Student ID: | 22207223072 |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 085400 |
Discipline: | Engineering - Electronic Information |
Student type: | Master's |
Degree level: | Master of Engineering |
Degree year: | 2025 |
Degree-granting institution: | Xi'an University of Science and Technology |
Department: | |
Major: | |
Research direction: | Mobile robot path planning |
First supervisor: | |
First supervisor's affiliation: | |
Submission date: | 2025-06-16 |
Defense date: | 2025-06-05 |
English title: | Research on Path Planning of Mobile Robot Based on Deep Reinforcement Learning |
Chinese keywords: | |
English keywords: | Path planning; Mobile robots; Deep reinforcement learning; Exploration strategy |
Chinese abstract: |
Path planning is one of the key technologies for mobile robots. Most traditional path planning algorithms, such as A* and Dijkstra, can only plan paths effectively in unknown environments when a prior environment map is available. Path planning algorithms based on deep reinforcement learning have therefore attracted considerable attention, because they do not rely on prior maps and possess autonomous learning and decision-making capabilities. However, conventional deep reinforcement learning path planning algorithms suffer from slow training, low sample utilization, and limited generalization ability. This thesis studies deep reinforcement learning algorithms in order to improve their practical performance in mobile robot path planning.

To alleviate the sparse-reward problem and reduce collisions of the mobile robot, a reward function with a goal-oriented feedback term and a distance-decay penalty term is designed. The goal-oriented feedback term provides feedback for every step the mobile robot takes, alleviating the sparse-reward problem. The distance-decay penalty term keeps a safe distance between the mobile robot and obstacles, reducing collisions. The exploration strategy of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is improved and the TD3PN algorithm is proposed, which introduces temporally correlated pink noise into the policy network output as the action perturbation, making the exploration process more stable and smooth. Simulation training results show that, compared with the TD3 algorithm and the TD3OU algorithm (TD3 with OU noise), TD3PN reduces the total number of training steps by about 6.25% and 9.96% and the total training time by about 13.80% and 10.20%, respectively. Reward-function validation experiments confirm the effectiveness of the reward function designed in this thesis. Generalization tests and real-world experiments show that TD3PN generalizes better to unknown environments overall and adapts better to real scenarios.

To further address the slow training speed and low sample utilization of TD3PN, an n-TD3PN-LA3P algorithm is proposed, which combines n-step returns with Loss Adjusted Approximate Actor Prioritized Experience Replay (LA3P). The algorithm generalizes the single-step TD target in TD3PN to an n-step TD target, so that long-term reward signals can be captured more accurately. Meanwhile, LA3P replaces uniform sampling so that the n-step experiences in the replay buffer are used efficiently, improving sample utilization. The value of n is determined experimentally to be 5, yielding the improved algorithm 5-TD3PN-LA3P. Simulation training results show that, compared with the TD3PN, TD3PN-LA3P, 5-TD3PN, TD3-PER, SAC, and DDPG algorithms, 5-TD3PN-LA3P reduces the total number of training steps by about 16.55%, 4.64%, 2.79%, 32.68%, 28.56%, and 42.27% and the total training time by about 19.96%, 7.12%, 2.41%, 37.57%, 34.85%, and 47.36%, respectively, indicating that the improved algorithm trains faster. A preliminary analysis also indicates that 5-TD3PN-LA3P achieves higher sample utilization. Generalization tests and real-world experiments show that 5-TD3PN-LA3P generalizes better to unknown environments and performs better in real scenarios.

The above research demonstrates the effectiveness of the reward function with a goal-oriented feedback term and a distance-decay penalty term designed in this thesis. The improved exploration strategy has advantages over independent Gaussian noise and OU noise as the policy noise and improves the practical performance of the algorithm. The effective combination of the n-step method and the LA3P method further improves the training speed and sample utilization of the TD3PN algorithm. |
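The reward design summarised above (a goal-oriented feedback term plus a distance-decay penalty term, alongside terminal outcomes for reaching the goal or colliding) can be read as a simple shaping scheme. The Python sketch below only illustrates that structure under stated assumptions: the exponential decay form, the safety radius d_safe, and every coefficient are hypothetical placeholders, not the reward actually used in the thesis.

```python
import numpy as np

def shaped_reward(dist_to_goal, prev_dist_to_goal, dist_to_obstacle,
                  reached_goal, collided,
                  k_goal=10.0, k_obs=5.0, d_safe=0.5,
                  r_goal=100.0, r_collision=-100.0):
    """Illustrative reward with a goal-oriented feedback term and a
    distance-decay obstacle penalty; all coefficients are placeholders."""
    if reached_goal:          # terminal reward for reaching the goal
        return r_goal
    if collided:              # terminal penalty for a collision
        return r_collision
    # Goal-oriented feedback: a dense signal at every step, proportional to
    # the progress made toward the goal (positive when the robot gets closer).
    r_progress = k_goal * (prev_dist_to_goal - dist_to_goal)
    # Distance-decay penalty: active only inside the assumed safety radius
    # d_safe around the nearest obstacle, pushing the robot to keep clearance.
    r_obstacle = -k_obs * np.exp(-dist_to_obstacle / d_safe) \
        if dist_to_obstacle < d_safe else 0.0
    return r_progress + r_obstacle

# Example: the robot moved 0.2 m closer to the goal but is 0.3 m from an obstacle.
print(shaped_reward(dist_to_goal=1.8, prev_dist_to_goal=2.0, dist_to_obstacle=0.3,
                    reached_goal=False, collided=False))
```

In this form the dense progress term addresses reward sparsity at every step, while the penalty term only activates near obstacles, which matches the two roles described in the abstract.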
English abstract: |
Path planning is one of the key technologies for mobile robots. Most traditional path planning algorithms, such as A* and Dijkstra, require a prior environment map to plan paths effectively when facing unknown environments. Path planning algorithms based on deep reinforcement learning have attracted much attention because they do not rely on prior maps and possess autonomous learning and decision-making capabilities. However, conventional deep reinforcement learning path planning algorithms suffer from slow training, low sample utilization, and limited generalization ability. This thesis investigates deep reinforcement learning algorithms with the aim of enhancing their practical performance in mobile robot path planning tasks.

(1) To alleviate the sparse-reward problem and reduce collisions of the mobile robot, a reward function with a goal-oriented feedback term and a distance-decay penalty term is designed. The goal-oriented feedback term provides feedback for every step of the mobile robot, alleviating the sparse-reward problem. The distance-decay penalty term keeps a safe distance between the mobile robot and obstacles, reducing collisions. The exploration strategy of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is then improved, and the TD3PN algorithm is proposed. TD3PN introduces temporally correlated pink noise as the action perturbation added to the policy network output, making the exploration process more stable and smooth. Simulation training results show that, compared with the TD3 algorithm and the TD3OU algorithm (TD3 with OU noise), TD3PN reduces the total number of training steps by about 6.25% and 9.96% and the total training time by about 13.80% and 10.20%, respectively. Reward-function validation experiments demonstrate the effectiveness of the reward function designed in this thesis. Generalization tests and real-world experiments show that TD3PN generalizes better to unknown environments overall and adapts better to real-world scenarios.

(2) To further address the slow training speed and low sample utilization of the TD3PN algorithm, an n-TD3PN-LA3P algorithm is proposed, which combines n-step returns with Loss Adjusted Approximate Actor Prioritized Experience Replay (LA3P). The algorithm generalizes the single-step TD target in TD3PN to an n-step TD target, which captures long-term reward signals more accurately. Meanwhile, LA3P replaces uniform sampling to exploit the n-step experiences in the replay buffer more efficiently, thereby improving sample utilization. The value of n is set to 5 through experiments, yielding the improved algorithm 5-TD3PN-LA3P. Simulation training results show that, compared with the TD3PN, TD3PN-LA3P, 5-TD3PN, TD3-PER, SAC, and DDPG algorithms, 5-TD3PN-LA3P reduces the total number of training steps by about 16.55%, 4.64%, 2.79%, 32.68%, 28.56%, and 42.27% and the total training time by about 19.96%, 7.12%, 2.41%, 37.57%, 34.85%, and 47.36%, respectively, indicating that the improved algorithm trains faster. A preliminary analysis further indicates that 5-TD3PN-LA3P achieves higher sample utilization.

Generalization tests and real-world experiments show that the improved 5-TD3PN-LA3P algorithm generalizes better to unknown environments and performs better in real-world scenarios. The above research demonstrates the effectiveness of the reward function with a goal-oriented feedback term and a distance-decay penalty term designed in this thesis. The improved exploration strategy is more advantageous than independent Gaussian noise or OU noise as the policy noise and improves the practical performance of the algorithm. The effective combination of the n-step method and the LA3P method further improves the training speed and sample utilization of the TD3PN algorithm. |
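Contribution (1) replaces the independent Gaussian exploration noise of TD3 with temporally correlated pink (1/f) noise added to the policy output. As a minimal sketch of what such a perturbation looks like, the snippet below generates a per-episode pink-noise sequence by spectral shaping of white Gaussian noise; the generator, the unit-variance normalisation, and the commented action-clipping step are illustrative assumptions rather than the thesis implementation.

```python
import numpy as np

def pink_noise_episode(n_steps, action_dim, rng):
    """Sample an (n_steps, action_dim) pink (1/f) noise sequence by shaping
    white Gaussian noise in the frequency domain. Unlike independent Gaussian
    noise, consecutive samples are correlated, so the perturbations evolve
    smoothly over the course of an episode."""
    freqs = np.fft.rfftfreq(n_steps)
    freqs[0] = freqs[1]                              # avoid dividing by zero at DC
    spectrum = rng.standard_normal((action_dim, freqs.size)) \
             + 1j * rng.standard_normal((action_dim, freqs.size))
    spectrum /= np.sqrt(freqs)                       # 1/f power spectral density
    noise = np.fft.irfft(spectrum, n=n_steps, axis=-1)
    noise /= noise.std(axis=-1, keepdims=True)       # unit variance per dimension
    return noise.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps = pink_noise_episode(500, 2, rng)            # one 500-step episode, 2-D action
    # The exploration action would then be something like
    # np.clip(policy(state) + sigma * eps[t], -max_action, max_action).
    print(eps.shape)
```

Because consecutive noise samples are correlated, the perturbed actions drift smoothly instead of jittering from step to step, which is the intuition behind the "more stable and smooth" exploration claimed in the abstract.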
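Contribution (2) extends the single-step TD target of TD3PN to an n-step target so that longer-horizon reward information enters the critic update; with n = 1 it reduces to the usual r + γ·Q_target form. The sketch below shows the standard n-step target computation for one transition, using n = 5 as selected experimentally in the thesis; the function name, the discount factor, and the bootstrap value are assumptions for illustration, not the thesis code.

```python
def n_step_td_target(rewards, bootstrap_value, done, gamma=0.99):
    """Standard n-step TD target: the discounted sum of the next n rewards
    plus a discounted bootstrap from the target critic, unless the episode
    terminated within the n-step window. `rewards` holds r_{t+1}..r_{t+n};
    `bootstrap_value` approximates Q(s_{t+n}, pi(s_{t+n}))."""
    target = 0.0
    for k, r in enumerate(rewards):               # sum_{k=0}^{n-1} gamma^k * r_{t+k+1}
        target += (gamma ** k) * r
    if not done:                                  # bootstrap only from a non-terminal state
        target += (gamma ** len(rewards)) * bootstrap_value
    return target

# Example with n = 5 and an assumed target-critic estimate of 4.0:
print(n_step_td_target([0.2, -0.1, 0.0, 0.3, 0.1], bootstrap_value=4.0, done=False))
```

The LA3P sampling that replaces uniform replay in the thesis then decides which of these stored n-step transitions are drawn for each update; its priority scheme is not reproduced here.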
CLC number: | TP242.6 |
Open access date: | 2025-06-16 |