Similarity-Based Processing of Motion Capture Data

Jan Sedmidubsky, Masaryk University, Brno, Czech Republic, xsedmid@fi.muni.cz
Pavel Zezula, Masaryk University, Brno, Czech Republic, zezula@fi.muni.cz

ABSTRACT
Motion capture technologies digitize human movements by tracking 3D positions of specific skeleton joints in time. Such spatiotemporal data have an enormous application potential in many fields, ranging from computer animation, through security and sports, to medicine, but their computerized processing is a difficult problem. The recorded data can be imprecise and voluminous, and the same movement action can be performed by various subjects in a number of alternatives that can vary in speed, timing or position in space. This requires employing completely different data-processing paradigms than those used in traditional domains such as attributes, text or images. The objective of this tutorial is to explain fundamental principles and technologies designed for similarity comparison, searching, subsequence matching, classification and action detection in motion capture data. Specifically, we emphasize the importance of similarity, needed to express the degree of accordance between pairs of motion sequences, and also discuss machine-learning approaches able to automatically acquire content-descriptive movement features. We explain how the concept of similarity, together with the learned features, can be employed for searching for similar occurrences of actions of interest within a long motion sequence. Assuming a user-provided categorization of example motions, we discuss techniques able to recognize types of specific movement actions and to detect such kinds of actions within continuous motion sequences. Selected operations will be demonstrated by online web applications.

CCS CONCEPTS
• Information systems → Similarity measures; Clustering and classification; Multimedia and multimodal retrieval;
• Computing methodologies → Supervised learning by classification;

KEYWORDS
motion capture data; similarity searching; subsequence matching; annotation; action detection; stream-based processing

ACM Reference Format:
Jan Sedmidubsky and Pavel Zezula. 2018. Similarity-Based Processing of Motion Capture Data. In 2018 ACM Multimedia Conference (MM '18), October 22–26, 2018, Seoul, Republic of Korea. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3240508.3241468

1 INTRODUCTION
Motion capture data are multiple time series of 3D positions of human-skeleton joints recorded in a frame-by-frame manner.
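To make this data model concrete, the following Python sketch represents a motion as a NumPy array holding one 3D position per joint per frame (a minimal illustration with made-up constants; FPS, NUM_JOINTS and the random placeholder data are our assumptions, not tied to any particular capturing device or dataset):

import numpy as np

# A motion is a frame-by-frame sequence of skeleton poses:
# an array of shape (frames, joints, 3), one 3D position per joint per frame.
FPS = 120          # capturing rate assumed for this illustration
NUM_JOINTS = 31    # e.g., a typical optical-capture skeleton

# Placeholder for one second of real captured data.
motion = np.random.rand(FPS, NUM_JOINTS, 3)

pose_at_t0 = motion[0]             # (31, 3): all joint positions in the first frame
root_trajectory = motion[:, 0, :]  # (120, 3): a single joint tracked over time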
The interest in capturing these data is continuously growing and new application scenarios emerge in a variety of fields. For example, the data could be employed in the military to virtually simulate combat and conflict-resolving situations; in law enforcement to identify suspicious subjects or events; in smart homes to detect anomalous behavior or body positions of elderly people; in sports to quantify the improvement of an athlete's performance or to predict possible injuries; or in medicine to evaluate progress in rehabilitation or to discover movement disorders as indicators for choosing suitable treatments.

A great application potential, together with a growing availability of capturing devices, indicates a considerable increase of motion data volume in the near future. A capturing device – recording 3D positions of tens of skeleton joints simultaneously at the rate of 120 frames per second – can easily produce gigabytes of continuous data within a single day (e.g., 30 joints × 3 coordinates × 4 bytes × 120 frames per second amounts to roughly 3.7 GB per day). Even though storing such a quantity of data is an issue in itself, intelligent management of the data is a much more challenging problem. Content-based processing techniques, such as searching, organizing and analyzing, are crucial to fully exploit the data potential and make the expensively recorded data more accessible, valuable and reusable.

In this tutorial, we focus on search-based techniques that can efficiently localize relevant motions within a large data collection, or relevant subsequences within a very long motion sequence. We also discuss classification techniques able to determine the specific kind of a movement action with respect to a user-provided categorization of example motions. Such categorization can additionally be employed for semantic segmentation of long sequences, e.g., to detect user-specified motion events in real time or to annotate the long sequence.

2 SIMILARITY CONCEPT
Intelligent processing of motion capture data requires following patterns used in real-life evolution and communication between species. There, recognition, learning and judgment presuppose an ability to categorize stimuli and classify situations by similarity, which is subjective and context-dependent. The common approach to processing complex, typically unstructured, digital data is to extract content-preserving structured features and use them for associative access. The most successful generic approach follows the metric space model [40], which has already been applied in numerous data-processing domains [39]. However, contrary to more traditional data types such as text documents or images, similarity in the motion-data domain has to additionally handle the dynamics of the time dimension [38].

To effectively model the spatial and temporal evolution of different motions, robust and sufficiently discriminative features need to be extracted [30]. To achieve invariance to the subject's position, orientation and skeleton size, the input data are often normalized [22]. The normalized data are then processed to extract features on the level of frames or segments. The frame-based features [6] describe single-frame characteristics, for example, normalized pairwise joint distances [35, 41], co-occurrences of joints [42] or other relational features [20]. The segment-based features describe a multiple-frame sequence by covariance matrices [33], Fisher vectors [8], or learned representations extracted using convolutional neural networks [23], autoencoders [34] or support vector machines [11]. The learned representations [23, 29] generally achieve a higher descriptive power than hand-crafted features [20, 32].

The frame-based features form a multi-dimensional time series whose length corresponds to the motion length. The time series of two different motions are compared by time-warping distance measures, such as Dynamic Time Warping in [2]. On the other hand, the segment-based features have a fixed length and can be compared efficiently, for example, high-dimensional vectors by the Euclidean distance in [23] or bit strings by the Hamming distance in [34].
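To illustrate both steps, the following sketch extracts pairwise joint distances as frame-based features and compares two such feature time series with a textbook Dynamic Time Warping recursion (a simplified illustration under our own assumptions; real systems use more elaborate features and constrained or approximate warping):

import numpy as np

def frame_features(motion):
    # Frame-based features: all pairwise joint distances within each frame.
    # motion: (frames, joints, 3) -> (frames, joints * (joints - 1) / 2).
    diffs = motion[:, :, None, :] - motion[:, None, :, :]  # (F, J, J, 3)
    dists = np.linalg.norm(diffs, axis=-1)                 # (F, J, J)
    iu = np.triu_indices(motion.shape[1], k=1)
    return dists[:, iu[0], iu[1]]                          # upper triangle per frame

def dtw_distance(a, b):
    # Plain DTW between two feature time series of possibly different lengths.
    m, n = len(a), len(b)
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])     # per-frame distance
            acc[i, j] = cost + min(acc[i - 1, j],          # skip a frame of a
                                   acc[i, j - 1],          # skip a frame of b
                                   acc[i - 1, j - 1])      # match both frames
    return acc[m, n]

# Motions of different lengths become comparable despite timing differences.
x = frame_features(np.random.rand(90, 31, 3))
y = frame_features(np.random.rand(120, 31, 3))
print(dtw_distance(x, y))

In contrast, two fixed-length segment features are compared by a single vector-distance computation, which avoids the quadratic cost of the warping recursion.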
3 SEARCHING, CLASSIFICATION AND SEMANTIC SEGMENTATION
We consider that motion data appear in the form of either a single long motion, or a collection of short motions. While the short motions represent semantically indivisible actions (e.g., a Rittberger jump taking 0.7 seconds), the long one relates to a more complex topic (e.g., a figure-skating performance taking 3 minutes) and can contain many short actions. The long sequence can be processed either as a whole, or in a stream-based manner if the whole sequence is not known in advance or does not fit into main memory.

3.1 Similarity Searching
To search a collection of short motions, a k-nearest-neighbor (kNN) query can be evaluated to obtain the k motions that are the most relevant to a user-provided query motion, based on the similarity of their features. Since the collection can be large, multi-dimensional or metric-based index structures [40] can be employed to speed up similarity search.

If the collection contains a long motion, subsequence search is applied to discover the long-motion parts that are similar to the query from both the content and length points of view. One way is to search for the long-motion frames whose features are similar to the features of selected query-motion frames. The retrieved sets of similar frames are then ranked in temporal order to identify query-relevant subsequences [26]. Since searching in frame-based features need not be sufficiently effective, the segment-based features are extracted from overlapping [25] or disjoint [28] segments, which are detected in an unsupervised way [13, 37] from both the long motion and the query. To identify query-relevant segments, sequential search can be used, such as the A-LTK method in [9] or the string-matching-based algorithm in [5]. To improve scalability when searching in a large number of segments, an index structure is employed, for example, the trie-based structure in [12] or the PPP-Codes index in [25].
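The core kNN operation can be sketched as a linear scan over fixed-length segment features (a minimal illustration with random placeholder data; a production system would replace the scan with a multi-dimensional or metric index [40]):

import numpy as np

def knn_query(query_feature, collection, k=5):
    # Return indices and distances of the k motions nearest to the query.
    # query_feature: (d,) vector; collection: (n, d) matrix, one row per motion.
    dists = np.linalg.norm(collection - query_feature, axis=1)  # Euclidean
    order = np.argsort(dists)[:k]
    return order, dists[order]

# 10,000 motions described by 256-dimensional segment features.
features = np.random.rand(10_000, 256)
query = np.random.rand(256)
ids, dists = knn_query(query, features, k=5)
print(ids, dists)

The same evaluation applies to subsequence search when the collection rows are the features of segments extracted from a single long motion.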
3.2 Action Recognition
Action recognition, also referred to as action classification, is the problem of inferring the kind of a movement action, based on a pre-classified collection of short motions. The class of a query motion can be recognized by a kNN classifier that searches the input collection to retrieve the k most query-relevant motions, whose class labels are then ranked [24, 27]. Such kNN classifiers have gradually been superseded by the increasing success of neural networks that recognize the query class directly. Specifically, deep convolutional networks are trained on 2D-motion-image features, which the networks then classify [14]. Many approaches employ the architecture of recurrent neural networks to better model the contextual dependency in the temporal domain [17]. This architecture can be enriched by Long Short-Term Memory (LSTM) cells to better learn long-term temporal dependencies [15, 18, 30, 42]. To further handle the noise and occlusion in skeleton sequences, gating mechanisms are integrated to learn the reliability of the sequential data and accordingly adjust their effect on updating the long-term context information stored in LSTM cells [17]. To benefit from different architectures at the same time, a combination of convolutional and LSTM networks has been proposed [21]. The recurrent networks are also enriched by attention-based mechanisms to additionally detect the most discriminative moments within an action [1, 18, 30].

3.3 Semantic Segmentation
Most recognition approaches classify only the short motions that correspond to a single action. Only a few of them [3, 7, 33, 35, 36, 41] can detect and recognize actions within a long unsegmented motion. Such semantic segmentation is more difficult as the beginnings and endings of actions are unknown and have to be determined. Similarly as in subsequence search, the long motion can be partitioned to extract segment-based features. The segment features are then used to search for the nearest matches within the features of the predefined class actions. If the computed similarity is high, the nearest-match class is taken as the segment label [7, 20]. The disadvantage is that many overlapping segments have to be processed and, even when labeled, they need not precisely mark the beginnings and endings of actions. Moreover, each segment has to be known before its processing begins, implying that labels are discovered with a slight delay. To avoid such disadvantages, a per-class probability can be estimated for each frame by exploiting learned class representations. To enhance the quality, the contextual information of the so-far-scanned frames is continuously encoded, for example, in recurrent frame-based features [41], hidden states of autoencoders [4], deep belief networks [35] or LSTM-based neural networks [10, 31]. This enables detecting actions even before they finish, which makes the frame-based semantic segmentation suitable for early action detection [16, 19] or future action prediction [4, 10]. On the other hand, learning-based approaches require costly training and cannot dynamically react when the specification of target actions changes.
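The per-frame estimation idea can be sketched with a small recurrent classifier (a PyTorch illustration under our own simplifying assumptions, not the architecture of any cited paper; feature_dim=465 matches the pairwise-distance features of a 31-joint skeleton from the earlier sketch): an LSTM consumes frame features one by one and emits per-class probabilities after every frame, so an action can be reported as soon as its probability exceeds a threshold.

import torch
import torch.nn as nn

class FramewiseDetector(nn.Module):
    # LSTM that outputs per-class probabilities for every incoming frame.
    def __init__(self, feature_dim, num_classes, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames, state=None):
        # frames: (batch, time, feature_dim); state carries the stream context
        # across calls, so the whole sequence never has to fit into memory.
        out, state = self.lstm(frames, state)
        return self.head(out).softmax(dim=-1), state  # (batch, time, classes)

# Untrained network, used here only to show the streaming data flow.
detector = FramewiseDetector(feature_dim=465, num_classes=10)
state = None
for frame in torch.rand(300, 465):            # a stream of 300 frame features
    probs, state = detector(frame.view(1, 1, -1), state)
    conf, label = probs[0, -1].max(dim=-1)
    if conf > 0.9:                             # report an action before it ends
        print(f"detected class {label.item()} (confidence {conf.item():.2f})")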
ACKNOWLEDGMENTS
This research was supported by ERDF "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).

REFERENCES
[1] Fabien Baradel, Christian Wolf, and Julien Mille. 2017. Human Action Recognition: Pose-based Attention Draws Focus to Hands. In ICCV Workshop on Hands in Action.
[2] Mathieu Barnachon, Saïda Bouakaz, Boubakeur Boufama, and Erwan Guillou. 2014. Ongoing human action recognition with motion capture. Pattern Recognition 47, 1 (2014), 238–247.
[3] Said Yacine Boulahia, Eric Anquetil, Franck Multon, and Richard Kulpa. 2018. CuDi3D: Curvilinear displacement based approach for online 3D action detection. Computer Vision and Image Understanding (2018). https://doi.org/10.1016/j.cviu.2018.07.003
[4] Judith Butepage, Michael J. Black, Danica Kragic, and Hedvig Kjellstrom. 2017. Deep Representation Learning for Human Motion Prediction and Classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6158–6166.
[5] Zhigang Deng, Qin Gu, and Qing Li. 2009. Perceptually Consistent Example-based Human Motion Retrieval. In Symposium on Interactive 3D Graphics and Games (I3D '09). ACM, 191–198.
[6] Yong Du, Wei Wang, and Liang Wang. 2015. Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1110–1118.
[7] Petr Elias, Jan Sedmidubsky, and Pavel Zezula. 2017. A Real-Time Annotation of Motion Data Streams. In 19th International Symposium on Multimedia. IEEE Computer Society, 154–161.
[8] Georgios Evangelidis, Gurkirt Singh, and Radu Horaud. 2014. Skeletal Quads: Human Action Recognition Using Joint Quadruples. In 22nd International Conference on Pattern Recognition (ICPR). 4513–4518.
[9] Y. Fang, K. Sugano, K. Oku, H. H. Huang, and K. Kawagoe. 2015. Searching human actions based on a multi-dimensional time series similarity calculation method. In 14th International Conference on Computer and Information Science (ICIS). 235–240. https://doi.org/10.1109/ICIS.2015.7166599
[10] Ashesh Jain, Amir Roshan Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5308–5317.
[11] Harshad Kadu and C.-C. Jay Kuo. 2014. Automatic Human Mocap Data Classification. IEEE Transactions on Multimedia 16, 8 (2014), 2191–2202.
[12] Mubbasir Kapadia, I-kao Chiang, Tiju Thomas, Norman I. Badler, and Joseph T. Kider Jr. 2013. Efficient Motion Retrieval in Large Motion Databases. In ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D). ACM, New York, NY, USA, 19–28.
[13] Björn Krüger, Anna Vögele, Tobias Willig, Angela Yao, Reinhard Klein, and Andreas Weber. 2017. Efficient Unsupervised Temporal Segmentation of Motion Data. IEEE Transactions on Multimedia 19, 4 (2017), 797–812.
[14] Sohaib Laraba, Mohammed Brahimi, Joelle Tilmanne, and Thierry Dutoit. 2017. 3D skeleton-based action recognition by representing motion capture sequences as 2D-RGB images. Computer Animation and Virtual Worlds 28, 3–4 (2017).
[15] Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, and Jian Yang. 2018. Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition. In 32nd AAAI Conference on Artificial Intelligence (AAAI). AAAI Press.
[16] Sheng Li, Kang Li, and Yun Fu. 2018. Early Recognition of 3D Human Actions. ACM Trans. Multimedia Comput. Commun. Appl. 14, 1s, Article 20 (2018), 21 pages.
[17] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. 2016. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In European Conference on Computer Vision (ECCV). Springer International Publishing, Cham, 816–833.
[18] Jun Liu, Gang Wang, Ling-Yu Duan, Ping Hu, and Alex C. Kot. 2018. Skeleton Based Human Action Recognition with Global Context-Aware Attention LSTM Networks. IEEE Transactions on Image Processing 27, 4 (2018), 1586–1599.
[19] Shugao Ma, Leonid Sigal, and Stan Sclaroff. 2016. Learning Activity Progression in LSTMs for Activity Detection and Early Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1942–1950.
[20] Meinard Müller, Andreas Baak, and Hans-Peter Seidel. 2009. Efficient and Robust Annotation of Motion Capture Data. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA 2009). ACM Press, 17–26.
[21] Juan C. Nunez, Raul Cabido, Juan J. Pantrigo, Antonio S. Montemayor, and Jose F. Velez. 2018. Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition 76 (2018), 80–94.
[22] Ronald Poppe, Sophie Van Der Zee, Dirk K. J. Heylen, and Paul J. Taylor. 2014. AMAB: Automated measurement and analysis of body motion. Behavior Research Methods 46, 3 (2014), 625–633.
[23] Jan Sedmidubsky, Petr Elias, and Pavel Zezula. 2017. Effective and Efficient Similarity Searching in Motion Capture Data. Multimedia Tools and Applications (2017), 1–22.
[24] Jan Sedmidubsky, Petr Elias, and Pavel Zezula. 2017. Enhancing Effectiveness of Descriptors for Searching and Recognition in Motion Capture Data. In 19th International Symposium on Multimedia. IEEE Computer Society, 240–243.
[25] Jan Sedmidubsky, Petr Elias, and Pavel Zezula. 2018. Searching for variable-speed motions in long sequences of motion capture data. Information Systems (2018). https://doi.org/10.1016/j.is.2018.04.002
[26] Jan Sedmidubsky, Jakub Valcik, and Pavel Zezula. 2013. A Key-Pose Similarity Algorithm for Motion Data Retrieval. In Advanced Concepts for Intelligent Vision Systems (ACIVS). Springer, 669–681.
[27] Jan Sedmidubsky and Pavel Zezula. 2018. Probabilistic Classification of Skeleton Sequences. In 29th International Conference on Database and Expert Systems Applications (DEXA). Springer, 1–15.
[28] Jan Sedmidubsky, Pavel Zezula, and Jan Svec. 2017. Fast Subsequence Matching in Motion Capture Data. In 21st European Conference on Advances in Databases and Information Systems (ADBIS). Springer, 1–14.
[29] Roshan Singh, Jagwinder Kaur Dhillon, Alok Kumar Singh Kushwaha, and Rajeev Srivastava. 2018. Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition. Multimedia Tools and Applications (2018). https://doi.org/10.1007/s11042-018-6425-3
[30] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. 2016. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. CoRR abs/1611.06067 (2016). http://arxiv.org/abs/1611.06067
[31] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. 2018. Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection. IEEE Transactions on Image Processing 27, 7 (2018), 3459–3471.
[32] Bin Sun, Dehui Kong, Shaofan Wang, Lichun Wang, Yuping Wang, and Baocai Yin. 2018. Effective human action recognition using global and local offsets of skeleton joints. Multimedia Tools and Applications (2018). https://doi.org/10.1007/s11042-018-6370-1
[33] Chang Tang, Wanqing Li, Pichao Wang, and Lizhe Wang. 2018. Online human action recognition based on incremental learning of weighted covariance descriptors. Information Sciences 467 (2018), 219–237. https://doi.org/10.1016/j.ins.2018.08.003
[34] Yingying Wang and Michael Neff. 2015. Deep signatures for indexing and retrieval in large motion databases. In 8th ACM SIGGRAPH Conference on Motion in Games. ACM, 37–45.
[35] Di Wu and Ling Shao. 2014. Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 724–731.
[36] Yan Xu, Zhengyang Shen, Xin Zhang, Yifan Gao, Shujian Deng, Yipei Wang, Yubo Fan, and Eric I-Chao Chang. 2017. Learning multi-level features for sensor-based human action recognition. Pervasive and Mobile Computing 40 (2017), 324–338.
[37] Xiaomin Yu, Weibin Liu, and Weiwei Xing. 2017. Behavioral segmentation for human motion capture data based on graph cut method. Journal of Visual Languages & Computing 43 (2017), 50–59.
[38] Pavel Zezula. 2015. Similarity Searching for the Big Data. Mobile Networks and Applications 20, 4 (2015), 487–496. https://doi.org/10.1007/s11036-014-0547-2
[39] Pavel Zezula. 2016. Similarity Searching for Database Applications. In Advances in Databases and Information Systems (ADBIS). Springer International Publishing, Cham, 3–10.
[40] Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. 2006. Similarity Search: The Metric Space Approach. Advances in Database Systems, Vol. 32. Springer-Verlag. 220 pages.
[41] Xin Zhao, Xue Li, Chaoyi Pang, Quan Z. Sheng, Sen Wang, and Mao Ye. 2014. Structured Streaming Skeleton – A New Feature for Online Human Gesture Recognition. ACM Trans. Multimedia Comput. Commun. Appl. 11, 1s (2014), 22:1–22:18.
[42] Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng, Yanghao Li, Li Shen, and Xiaohui Xie. 2016. Co-occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks. In 30th AAAI Conference on Artificial Intelligence (AAAI '16). AAAI Press, 3697–3703.