Thesis Information

Title (Chinese):

 基于强化学习的智能体博弈决策研究    

Name:

 Liang Yuan

Student ID:

 19201221008    

Confidentiality level:

 Confidential (open after 1 year)

Language:

 Chinese (chi)

Discipline code:

 025200    

Discipline name:

 Economics - Applied Statistics

Student type:

 Master's

Degree:

 Master of Economics

Degree year:

 2022    

Training institution:

 Xi'an University of Science and Technology

Department:

 College of Science

Major:

 Applied Statistics

Research direction:

 Intelligent gaming

Primary supervisor:

 Su Jun

Supervisor's institution:

 Xi'an University of Science and Technology

Submission date:

 2022-06-22    

Defense date:

 2022-06-09    

Title (English):

 Agent Game Decision Based on Reinforcement Learning    

Keywords (Chinese):

 强化学习 ; 空战博弈 ; 机动决策 ; DQN算法 ; WoLF-PHC算法    

Keywords (English):

 Reinforcement learning; air combat games; maneuver decision; DQN algorithm; WoLF-PHC algorithm

Abstract (Chinese):

Reinforcement learning is one of the core technologies of artificial intelligence. A large body of practical research has shown that using reinforcement learning to solve intelligent game decision-making problems offers strong environmental adaptability, simple computation, and good agreement with real situations, and that agents can be trained to complete complex game tasks. Applying reinforcement learning in intelligent game settings therefore has important theoretical and practical value. Taking an unmanned combat aerial vehicle (UCAV) as the autonomous decision-making agent, close-range air combat as the research background, and reinforcement learning algorithms as the technical means, this thesis studies maneuver decision-making for a UCAV against a target moving in a fixed direction within a simplified three-dimensional space. Building on this, it proposes an improved WoLF-PHC-DQN algorithm and investigates decision-making when reinforcement learning is applied to an intelligent air combat game environment. The main work is as follows:

Expressions for the control parameters are obtained from the UCAV dynamics and kinematics models. Combining the state characteristics of three maneuver levels, level flight, horizontal turning, and diving/climbing, the control model for each level is derived; the characteristics of each maneuver are analyzed to determine the control quantities, a usable air combat maneuver action set is designed, and the game environment is refined.

A guiding reward function is designed that steers the agent during training, improving convergence speed and learning efficiency. The reward function is divided into a guide reward and a termination reward. Starting from angle and distance, the decisive factors in the air combat environment, a relative-angle function and a relative-distance function are designed as the guide reward; a reasonable termination reward value is set based on the geometric-series transformation of the reinforcement learning return together with the maximum number of steps of a realistic air combat game. Simulation experiments on different game scenarios with the DQN algorithm show that deep reinforcement learning is a feasible and effective way to handle air combat game problems, and that the guide-reward-based DQN clearly accelerates training and increases the success rate of the game.

An innovative reinforcement learning algorithm that combines deep learning with game-theoretic ideas is proposed to adapt to dynamic air combat game environments. Reasonable assumptions are made about the game information and the order of decisions, and a stochastic game model of the air combat environment is constructed. A WoLF-PHC-DQN algorithm is proposed for dynamic game problems: the neural network of the DQN algorithm approximates the Q-values and policy values, the WoLF mechanism is introduced to adapt to the dynamic environment, and PHC rules are used to update the mixed strategy. Experiments show that, against a non-rational agent with fixed decision outputs, WoLF-PHC-DQN converges faster and has a higher task success rate than the conventional DQN; against a rational agent with complex maneuver decision outputs, it can also flexibly choose advantageous decisions and ultimately win, showing good adaptability to dynamic intelligent game environments.

Abstract (English):

Reinforcement learning is one of the core technologies of artificial intelligence. A large number of practical studies have shown that using reinforcement learning techniques to solve intelligent game decision problems offers strong environmental adaptability, simple computation, and conformity to real situations, and that agents can be trained to complete complex game tasks. Applying reinforcement learning to intelligent games therefore has great theoretical significance and practical value. Taking the air combat game as the application background, an unmanned combat aerial vehicle (UCAV) as the intelligent carrier for autonomous decision-making, and reinforcement learning algorithms as the technical means, this dissertation studies the maneuver decision problem of a UCAV facing a moving target with directional constraints in a simple three-dimensional space. On this basis, it proposes an improved WoLF-PHC-DQN algorithm to explore the application of reinforcement learning techniques to decision problems in an intelligent air combat game environment. The main contributions are as follows:

Expressions for the control parameters are obtained from the UCAV dynamics and kinematics model. Combining the state characteristics of three maneuver levels, namely level flight, horizontal turn, and dive/climb, the control model at each level is derived; the characteristics of each maneuver are analyzed to obtain the control quantities, the available air combat maneuver action set is determined, and the air combat game training environment is refined.
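
The control models themselves are derived in the dissertation from the full UCAV dynamics and kinematics; purely as a hypothetical illustration of what a discrete maneuver action set over a point-mass model can look like, the sketch below uses the common three-degree-of-freedom equations with tangential overload, normal overload, and bank angle as control quantities (the action labels and numeric control values are assumptions, not the thesis's actual design):

```python
import numpy as np

# Hypothetical discrete maneuver set: each action maps to control quantities
# (tangential overload n_x, normal overload n_z, bank angle mu).
ACTIONS = {
    "level_flight": (0.0, 1.0, 0.0),
    "left_turn":    (0.0, 3.0, -np.radians(60)),
    "right_turn":   (0.0, 3.0,  np.radians(60)),
    "climb":        (1.0, 3.0, 0.0),
    "dive":         (-1.0, 0.5, 0.0),
}

def step(state, action, dt=0.1, g=9.81):
    """Advance a point-mass UCAV model by one time step.

    state = (x, y, z, v, gamma, psi): position, speed,
    flight-path angle, and heading angle.
    """
    x, y, z, v, gamma, psi = state
    nx, nz, mu = ACTIONS[action]
    # Standard three-degree-of-freedom point-mass equations of motion.
    v_dot = g * (nx - np.sin(gamma))
    gamma_dot = g / v * (nz * np.cos(mu) - np.cos(gamma))
    psi_dot = g * nz * np.sin(mu) / (v * np.cos(gamma))
    x_dot = v * np.cos(gamma) * np.cos(psi)
    y_dot = v * np.cos(gamma) * np.sin(psi)
    z_dot = v * np.sin(gamma)
    return (x + x_dot * dt, y + y_dot * dt, z + z_dot * dt,
            v + v_dot * dt, gamma + gamma_dot * dt, psi + psi_dot * dt)
```

With a discretization of this kind, the learning agent only has to pick one labeled maneuver at each decision step.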

A guiding reward function is designed for the air combat game; it steers the agent during training and thereby improves convergence speed and learning efficiency. The reward function is divided into a guide reward and a termination reward. A relative-angle function and a relative-distance function are designed as the guide reward, starting from angle and distance as the decisive factors in the air combat environment; the termination reward value is set according to the geometric-series form of the reinforcement learning return, combined with the maximum number of steps of a realistic air combat game. Simulation experiments on different game scenarios with the DQN algorithm show that deep reinforcement learning is feasible and effective for air combat game problems, and that the guide-reward-based DQN clearly speeds up training and raises the success rate of the game.
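
The abstract names the ingredients (relative angle, relative distance, and a termination reward bounded via the geometric series of the discounted return) without giving the exact functional forms; a minimal sketch of the idea, with all constants and shapes chosen for illustration only, might look like this:

```python
import numpy as np

def guide_reward(rel_angle, rel_distance, d_opt=500.0, d_scale=2000.0):
    """Hypothetical guide reward built from the two decisive factors:
    relative angle (radians, 0 = nose on the target) and
    relative distance (metres)."""
    r_angle = 1.0 - rel_angle / np.pi                      # largest when nose-on
    r_dist = np.exp(-abs(rel_distance - d_opt) / d_scale)  # largest near a preferred range
    return 0.5 * r_angle + 0.5 * r_dist

def termination_reward(max_steps=500, gamma=0.99, r_max=1.0):
    """Choose the terminal reward at the geometric-series bound on the
    discounted sum of guide rewards over at most max_steps steps, so that
    winning always outweighs the accumulated shaping reward."""
    return r_max * (1.0 - gamma ** max_steps) / (1.0 - gamma)
```

A losing terminal state would receive the negated value in the same spirit.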

A new reinforcement learning algorithm that combines game-theoretic ideas with deep learning is proposed for the dynamic air combat game environment. Reasonable assumptions are made about the game information and the decision order, and a stochastic game model of the air combat environment is constructed on this basis. A WoLF-PHC-DQN algorithm is then proposed for dynamic game problems: the DQN neural network approximates the Q-values and policy values, the WoLF mechanism is incorporated to adapt to the dynamic environment, and PHC rules are used to update the mixed strategy. Experiments show that, against a non-rational agent with fixed decision outputs, WoLF-PHC-DQN converges faster and achieves a higher mission success rate than the conventional DQN; against a rational agent with complex maneuver decision outputs, it can also flexibly choose the superior decision and ultimately win, indicating good adaptability to dynamic intelligent game environments.
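
The abstract specifies only the roles of the components: the DQN network supplies the value estimates, the PHC rule updates the mixed strategy, and the WoLF mechanism switches between a cautious and an aggressive learning rate. A minimal per-state sketch of that policy update, assuming the Q-values come from the trained DQN (the function name, array layout, and delta values are illustrative assumptions), could look like this:

```python
import numpy as np

def wolf_phc_update(policy, avg_policy, q_values, visit_count,
                    delta_win=0.01, delta_lose=0.04):
    """One WoLF-PHC mixed-strategy update for a single state.

    policy, avg_policy : current and running-average mixed strategies over actions
    q_values           : Q(s, .) estimates, e.g. the output of the DQN network
    visit_count        : how many times this state has been updated
    """
    n_actions = len(policy)

    # Running average of past policies.
    avg_policy = avg_policy + (policy - avg_policy) / visit_count

    # WoLF ("Win or Learn Fast"): small step when winning, large step when losing.
    winning = policy @ q_values > avg_policy @ q_values
    delta = delta_win if winning else delta_lose

    # PHC: shift probability mass from non-greedy actions to the greedy one.
    greedy = int(np.argmax(q_values))
    new_policy = policy.copy()
    moved = 0.0
    for a in range(n_actions):
        if a != greedy:
            dec = min(new_policy[a], delta / (n_actions - 1))
            new_policy[a] -= dec
            moved += dec
    new_policy[greedy] += moved
    return new_policy, avg_policy
```

The "win or learn fast" behaviour comes entirely from delta_lose being larger than delta_win: the policy changes slowly while it outperforms its historical average and quickly when it falls behind.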


CLC number:

 TP181    

Open access date:

 2023-06-22    
