Human Activity Recognition using Switching Structure Model
Subject Areas: electrical and computer engineering
Mohammad Mahdi Arzani 1, M. Fathy 2 *, Ahmad Akbari 3
1 - University of Science and Technology
2 -
3 -
Keywords: Probabilistic graphical models, human activity recognition, distributed structured prediction, skeleton
Abstract:
To communicate with people, interactive systems often need to understand human activities in advance. However, recognizing activities in advance is very challenging: people perform the same activity in different ways, and while some activities are simple, others are complex and composed of several smaller atomic sub-activities. In this paper, we use skeletons captured by low-cost RGB-D depth sensors as high-level descriptions of the human body. We propose a method that recognizes both simple and complex human activities by formulating recognition as a structured prediction task using probabilistic graphical models (PGMs). We test our method on three popular datasets that cover both simple and complex activities: CAD-60, UT-Kinect, and Florence 3D. Because our method is sensitive to the clustering method used to determine the middle states, we also evaluate different clustering methods.
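As a rough illustration of how clustering could assign middle states to skeleton frames, the sketch below runs plain k-means on synthetic data. It is only a sketch under stated assumptions: the abstract does not say which clustering methods were compared, so k-means is a stand-in, and the 45-dimensional frame layout (15 joints x 3 coordinates), the function name `kmeans_middle_states`, and the evenly spaced initialisation are all hypothetical choices made for this example.

```python
import numpy as np


def kmeans_middle_states(frames, k, n_iter=50):
    """Cluster skeleton frames with Lloyd's k-means.

    Returns (centroids, labels); each label is a candidate "middle state".
    Assumption: k-means stands in for whichever clustering method is used.
    """
    # Initialise centroids from frames spaced evenly through the sequence
    # (a hypothetical choice that keeps the example deterministic).
    idx = np.linspace(0, len(frames) - 1, k).astype(int)
    centroids = frames[idx].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = frames[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute labels so they match the returned centroids.
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)


# Synthetic sequence: three pose prototypes held for 20 frames each,
# with small per-frame noise. This is made-up data, not a real dataset.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 45))  # 15 joints x 3 coordinates (assumed)
frames = np.repeat(prototypes, 20, axis=0) + 0.05 * rng.normal(size=(60, 45))

centroids, labels = kmeans_middle_states(frames, k=3)
print(labels)
```

The resulting label sequence is the kind of per-frame discrete state that a graphical model could then consume; swapping in another clustering method only requires replacing the function body.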