Human Action Recognition from Still Images Using Human Pose in a Multi-Stream Network
Subject area: Electrical and Computer Engineering
1 - Islamic Azad University, Qazvin Branch
2 - Amirkabir University of Technology
Keywords: human action recognition, pose prediction, three-stream network, deep neural network
Abstract:
Today, human action recognition from still images has become an active topic in computer vision and pattern recognition. The focus of this work is identifying a human's action or behavior from a single image. Unlike traditional methods that use videos or sequences of images for action recognition, a still image carries no temporal information, so still-image-based action recognition is more challenging than video-based recognition. Given the importance of motion information for recognizing actions, the Im2flow method is used to estimate motion information from the static image. The structure proposed in this paper, referred to as the three-stream network, combines three deep neural networks: the first is trained on the raw color image, the second on the optical flow predicted from the image, and the third on the pose of the person in the image. In other words, in addition to the spatial information and the predicted temporal information, human pose information is also exploited, since pose is a highly informative cue for action recognition. Fusing the three networks increases the accuracy of human action recognition: the proposed method achieves 91.80% accuracy on the Willow 7-actions dataset, 91.02% on Pascal VOC 2012, and 96.97% on the Stanford 10 dataset. Comparison with previous methods shows that the proposed approach obtains the highest accuracy on all three datasets relative to recent work.
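The three-stream combination described above can be illustrated with a minimal late-fusion sketch in NumPy. The per-stream logits below are hypothetical toy values, and the weighted score averaging is one common fusion choice, not necessarily the exact fusion scheme the paper uses:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_three_streams(rgb_logits, flow_logits, pose_logits, weights=(1.0, 1.0, 1.0)):
    """Late fusion: weighted average of the class probabilities produced
    by the RGB, predicted-optical-flow, and pose streams."""
    probs = (weights[0] * softmax(rgb_logits)
             + weights[1] * softmax(flow_logits)
             + weights[2] * softmax(pose_logits)) / sum(weights)
    return probs.argmax(axis=-1), probs

# Toy example: 2 images, 3 action classes (hypothetical logits).
rgb  = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]])
flow = np.array([[1.2, 0.8, 0.3], [0.1, 0.3, 2.0]])
pose = np.array([[1.8, 0.2, 0.4], [0.5, 0.2, 1.1]])
pred, probs = fuse_three_streams(rgb, flow, pose)
# pred -> array([0, 2]): all three streams agree on each image here.
```

Averaging probabilities rather than concatenating features lets each stream be trained independently, which matches the paper's description of three separately trained networks whose outputs are combined.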