Improving Q-Learning Using Simultaneous Updating and an Adaptive Policy Based on Opposite Action

Subject areas: Electrical and Computer Engineering

Maryam Pouyan ¹ *, Shahram Golzari ², Amin Mousavi ³, Ahmad Hatam ⁴

1 - University of Hormozgan
2 - University of Hormozgan
3 - University of Hormozgan
4 - University of Hormozgan

Date received: 1396/04/22  Date accepted: 1396/04/22  Date published: 1395/06/31

Keywords: adaptive policy, convergence speed, opposite action, simultaneous updating, Q-learning

Abstract:

Q-learning is one of the best-known and most widely used model-free reinforcement learning methods. Its advantages include not depending on prior knowledge and a guarantee of reaching the optimal solution. One limitation of the method is that its convergence speed drops as the dimension increases, so increasing the convergence speed remains a challenge. Using the notion of opposite action in Q-learning improves the convergence speed, because two Q-values are updated simultaneously at each learning step. This paper proposes a combined method that uses an adaptive policy alongside the notion of opposite action to increase the convergence speed. The methods are simulated on the Grid world problem. The proposed methods show improvement in the average success rate percentage, the average percentage of optimal states, the average number of steps the agent takes to reach the goal, and the average reward received.

English abstract:

Q-learning is one of the most popular and frequently used model-free reinforcement learning methods. Among its advantages are its independence from prior knowledge and a proof of its convergence to the optimal policy. One of the main limitations of this method is its low convergence speed, especially when the dimension is high, so accelerating its convergence is a challenge. The convergence of Q-learning can be accelerated with the notion of opposite action, since two Q-values are updated simultaneously at each learning step. In this paper, an adaptive policy and the notion of opposite action are combined in an integrated approach to speed up the learning process. The methods are simulated for the grid world problem. The results show improvements in success rate, the percentage of optimal states, the number of steps to the goal, and average reward.
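To make the idea of updating two Q-values per learning step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical 4x4 deterministic grid world, illustrative rewards and hyperparameters, a hand-written `OPPOSITE` map (up/down, left/right), and a simulator that can be queried for the outcome of the opposite action without moving the agent. The adaptive policy mentioned in the abstract is not specified here, so a fixed epsilon-greedy policy is swapped in.

```python
import random

# Hypothetical 4x4 grid world; layout, rewards and hyperparameters are
# illustrative assumptions, not values taken from the paper.
SIZE = 4
GOAL = (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}


def step(state, action):
    """Deterministic transition: move one cell, staying in place at walls."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    next_state = (row, col)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Standard one-step Q-learning update for one (state, action) pair."""
    best_next = 0.0 if next_state == GOAL else max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(r, c): {a: 0.0 for a in ACTIONS}
         for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(200):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:
                action = max(q[state], key=q[state].get)
            next_state, reward = step(state, action)
            q_update(q, state, action, reward, next_state, alpha, gamma)
            # Second update in the same learning step: the opposite action.
            # Assumption: the simulator can be queried without moving the agent.
            opp = OPPOSITE[action]
            opp_next, opp_reward = step(state, opp)
            q_update(q, state, opp, opp_reward, opp_next, alpha, gamma)
            state = next_state
            if state == GOAL:
                break
    return q


def greedy_path_length(q, start=(0, 0), limit=50):
    """Follow the greedy policy from start; returns limit if the goal is not reached."""
    state, steps = start, 0
    while state != GOAL and steps < limit:
        state, _ = step(state, max(q[state], key=q[state].get))
        steps += 1
    return steps
```

The shortest possible path from (0, 0) to (3, 3) on this grid is 6 steps, so `greedy_path_length(train())` gives a quick check of how close the learned policy comes. Removing the second `q_update` call gives plain Q-learning under the same assumptions, which allows a side-by-side comparison.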

