A New Method for Skill Acquisition in Reinforcement Learning Using Graph Clustering
Subject area: Electrical and Computer Engineering
Marzieh Davoodabadi Farahani 1, Nasser Mozayani 2 *
1 - Iran University of Science and Technology
2 - Iran University of Science and Technology
Keywords: hierarchical reinforcement learning, option, temporal abstraction, skill, skill evaluation, graph clustering
Abstract:
Reinforcement learning is a type of machine learning in which the agent uses its interactions with the environment to learn about the environment and to improve its behavior. One of the main problems of standard reinforcement learning algorithms such as Q-learning is that they cannot solve large-scale problems in a reasonable time. Automatic skill acquisition can help decompose the problem into smaller sub-problems and solve it hierarchically. Despite the promising results of using skills in hierarchical reinforcement learning, other studies have shown that, depending on the task at hand, the effect of skills on learning performance can be entirely positive or negative, and skills that are not selected properly can increase the complexity of problem-solving. Hence, one weakness of previous methods for automatic skill acquisition is the lack of an evaluation of each acquired skill. In this paper, we propose new methods based on graph clustering for extracting subgoals and acquiring skills. We also present new criteria for evaluating skills, with whose help skills that are inappropriate for solving the problem are removed. Experiments in several test environments show that these methods increase the learning speed considerably.
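The pipeline summarized in the abstract can be pictured with the following minimal sketch. It is only an illustration, not the method proposed in the paper: it builds a transition graph from the agent's trajectories, clusters it with an off-the-shelf community-detection algorithm (greedy modularity from networkx, standing in for whichever graph-clustering method is actually used), treats states on the border between clusters as subgoal candidates, turns each candidate into an option, and filters options with a simple visit-count rule in place of the paper's evaluation criteria. All function names, parameters, and the pruning threshold are assumptions made for this example.

```python
# Illustrative sketch (not the paper's exact algorithm): subgoal extraction by
# clustering the agent's state-transition graph and turning subgoals into options.
from collections import defaultdict
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def build_transition_graph(trajectories):
    """Build an undirected graph whose nodes are visited states and whose
    edges connect states observed in consecutive time steps."""
    g = nx.Graph()
    for episode in trajectories:                 # episode: list of states
        for s, s_next in zip(episode, episode[1:]):
            if s != s_next:                      # skip self-loops
                g.add_edge(s, s_next)
    return g


def extract_subgoals(graph):
    """Cluster the graph and return border states: nodes with at least one
    neighbor in a different cluster.  Such states connect regions of the
    state space and are natural subgoal candidates."""
    communities = list(greedy_modularity_communities(graph))
    label = {node: i for i, c in enumerate(communities) for node in c}
    subgoals = set()
    for u, v in graph.edges():
        if label[u] != label[v]:
            subgoals.update((u, v))
    return subgoals, label


def make_options(subgoals, label):
    """For each subgoal, define an option that can be initiated in the states
    of the subgoal's own cluster and terminates when the subgoal is reached.
    The option policy itself would then be learned, e.g. by Q-learning on an
    intrinsic reward for reaching the subgoal."""
    options = []
    for sg in subgoals:
        initiation_set = {s for s, c in label.items() if c == label[sg]}
        options.append({"subgoal": sg, "initiation": initiation_set})
    return options


def prune_options(options, trajectories, min_visit_ratio=0.05):
    """Placeholder evaluation step: drop options whose subgoal is visited in
    too few episodes.  The paper proposes its own evaluation criteria; this
    threshold rule only illustrates the idea of removing unhelpful skills."""
    visits = defaultdict(int)
    for episode in trajectories:
        for s in set(episode):
            visits[s] += 1
    n = len(trajectories)
    return [o for o in options if visits[o["subgoal"]] / n >= min_visit_ratio]
```

In this reading, clustering groups strongly connected regions of the state space, border states act as "doorways" between regions, and the pruning step plays the role of the skill-evaluation stage that removes options which would otherwise slow learning down.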