Joint International Workshops on Structural and Syntactic Pattern Recognition and Statistical Techniques in Pattern Recognition 2016, preprint

Walker-Independent Features for Gait Recognition from Motion Capture Data

Michal Balazia (0000-0001-7153-9984) and Petr Sojka (0000-0002-5768-4007)
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
xbalazia@mail.muni.cz and sojka@fi.muni.cz

Abstract. MoCap-based human identification, as a pattern recognition discipline, can be optimized using a machine learning approach. Yet in some applications, such as video surveillance, new identities can appear on the fly and labeled data for all encountered people may not always be available. This work introduces the concept of learning walker-independent gait features directly from raw joint coordinates by a modification of Fisher's Linear Discriminant Analysis with the Maximum Margin Criterion. Our new approach shows not only that these features can discriminate people other than those they were learned on, but also that the number of learning identities can be much smaller than the number of walkers encountered in real operation.

1 Introduction

Recent rapid improvements in motion capture (MoCap) sensor accuracy have brought affordable technology that can identify walking people. MoCap technology provides video clips of walking individuals containing structural motion data. The format keeps the overall structure of the human body and holds estimated 3D positions of major anatomical landmarks as the person moves. MoCap data can be collected online by a system of multiple cameras (Vicon) or by a depth camera (Microsoft Kinect). To visualize motion capture data (see Figure 1), a simplified stick figure representing the human skeleton (a graph of joints connected by bones) can be recovered from the body point spatial coordinates. Shape-based solutions utilize human silhouettes that are often corrupted and overlapping, whereas structure-based solutions estimate body point positions with much higher accuracy using 3D technology.

Figure 1. Motion capture data. The skeleton is represented by a stick figure of 17 joints (left). Seven selected video frames (right) of a walk sequence contain the 3D spatial coordinates of each joint in time. The red and blue lines track the trajectories of hands and feet.

People spotted in our tracking space do not walk all the time; on the contrary, they perform various activities. Recognizing people by gait requires processing the video segments in which they are actually walking: clean gait cycles first need to be filtered out from the video sequences of general motion. Some methods focus on detecting gait cycles directly [1,2]; alternatively, a general action recognition method [3,4,5,6,7] can be used, which needs only an example of a gait cycle to search general motion sequences.
Given a query motion clip in which a person performs an action classified as a gait cycle, the system can proceed to the identification phase. The gait sample of each recorded walker, in raw MoCap form, is pre-processed to retain the discriminative gait information. A collection of extracted gait features, such as feet distance or elbow angle, builds a gait pattern descriptor. The descriptor, also referred to as a gait pattern, serves as the walker's signature. Associated with the walker's identity, the descriptors are stored in a central database. To identify someone by gait means to classify the identity of their gait pattern, which is unknown at the moment. The classifier composes a query to search this database for a set of similar gait patterns, retrieving a collection of candidate identities and reporting the most likely one as the classified identity. Here, the similarity of two gait patterns is expressed as a single number computed by a similarity/distance function of their descriptors.

Recognition rate is clearly the most important qualitative measure; however, as video surveillance applications process large amounts of data, we are also interested in how the system performs with limited time and space resources. Computational complexity, as a quantitative measure, is the amount of computational resources used to operate the gait recognition system. This measure involves the computational time and space of frequent operations, such as descriptor extraction, identity classification, and database maintenance. Even a method with a 100% recognition rate is unacceptable if processing a short motion clip takes too long or if a small database occupies too much memory.

Recognizing a person by walk involves capturing and normalizing their walk sample, extracting gait features to compose a template, and finally querying a central database for a set of similar templates to report the most likely identity. This work focuses on extracting robust and discriminative gait features from raw MoCap data. Many geometric gait features have been introduced over the past few years. They are typically combinations of static body parameters (bone lengths, person's height) [12] with dynamic gait features such as step length, walk speed, joint angles and inter-joint distances [4,1,12,15], along with various statistics (mean, standard deviation or maximum) of their signals [3]. Clearly, these features are schematic and human-interpretable, which is convenient for visualization and intuitive understanding, but unnecessary for automatic gait recognition. Instead, this application prefers learning features that maximally separate the identity classes and are not limited by such dispensable factors.

Methods for 2D gait recognition extensively use machine learning models for extracting gait features, such as principal component analysis and multi-scale shape analysis [8], genetic algorithms and kernel principal component analysis [17], radial basis function neural networks [20], or convolutional neural networks [7]. All of these, and many other models, can reasonably be utilized in 3D gait recognition as well. In the video surveillance environment, data need to be acquired without the walker's consent, and new identities can appear on the fly.
Here, and also in other applications where labels for all encountered people may not always be available, we value features that have a high power in distinguishing all people, not exclusively those they were learned on. We call these walker-independent features. The main idea is to statistically learn in which aspects of walk people generally differ and to extract those aspects as gait features. The features are learned in a supervised manner, as described in the following section.

2 Learning Gait Features

In statistical pattern recognition, reducing space dimensionality is a common technique to overcome class estimation problems. Classes are discriminated by projecting high-dimensional input data onto low-dimensional sub-spaces by linear transformations with the goal of maximizing class separability. We are interested in finding an optimal feature space where a gait template is close to those of the same walker and far from those of different walkers.

Let the model of a human body have $J$ joints and let all samples be linearly normalized to their average length $T$. Labeled learning data in the measurement space $\mathcal{G}_L$ are in the form $\{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ where

$$\mathbf{g}_n = \left[ [\gamma_1(1) \cdots \gamma_J(1)]^\top \cdots [\gamma_1(T) \cdots \gamma_J(T)]^\top \right]^\top \qquad (1)$$

is a gait sample (one gait cycle) in which $\gamma_j(t) \in \mathbb{R}^3$ are the 3D spatial coordinates of a joint $j \in \{1, \ldots, J\}$ at time $t \in \{1, \ldots, T\}$, normalized with respect to the person's position and walk direction. Note that $\mathcal{G}_L$ has dimensionality $D = 3JT$. Each learning sample falls strictly into one of the learning identity classes $\{\mathcal{I}_c\}_{c=1}^{C_L}$ determined by its label $\ell_n$. A class $\mathcal{I}_c \subseteq \mathcal{G}_L$ has $N_c$ samples. The classes are complete and mutually exclusive. We say that learning samples $(\mathbf{g}_n, \ell_n)$ and $(\mathbf{g}_{n'}, \ell_{n'})$ share a common walker if and only if they belong to the same class, i.e., $(\mathbf{g}_n, \ell_n), (\mathbf{g}_{n'}, \ell_{n'}) \in \mathcal{I}_c \Leftrightarrow \ell_n = \ell_{n'}$.

We measure the class separability of a given feature space by a representation of the Maximum Margin Criterion (MMC) [11,13] used by Vapnik's Support Vector Machines (SVM) [19]

$$\mathcal{J} = \frac{1}{2} \sum_{c,c'=1}^{C_L} \left( (\mu_c - \mu_{c'})^\top (\mu_c - \mu_{c'}) - \operatorname{tr}\left(\Sigma_c + \Sigma_{c'}\right) \right) \qquad (2)$$

which is actually a summation of $\frac{1}{2} C_L (C_L - 1)$ between-class margins. The margin is defined as the Euclidean distance of class means minus both individual variances (traces of the scatter matrices $\Sigma_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \big(\mathbf{g}_n^{(c)} - \mu_c\big)\big(\mathbf{g}_n^{(c)} - \mu_c\big)^\top$ and similarly for $\Sigma_{c'}$). For the whole labeled data, we denote the between-class, within-class and total scatter matrices

$$\Sigma_B = \sum_{c=1}^{C_L} (\mu_c - \mu)(\mu_c - \mu)^\top \qquad \Sigma_W = \sum_{c=1}^{C_L} \frac{1}{N_c} \sum_{n=1}^{N_c} \big(\mathbf{g}_n^{(c)} - \mu_c\big)\big(\mathbf{g}_n^{(c)} - \mu_c\big)^\top$$

$$\Sigma_T = \sum_{c=1}^{C_L} \frac{1}{N_c} \sum_{n=1}^{N_c} \big(\mathbf{g}_n^{(c)} - \mu\big)\big(\mathbf{g}_n^{(c)} - \mu\big)^\top = \Sigma_B + \Sigma_W \qquad (3)$$

where $\mathbf{g}_n^{(c)}$ denotes the $n$-th sample in class $\mathcal{I}_c$ and $\mu_c$ and $\mu$ are the sample means of class $\mathcal{I}_c$ and of the whole data set, respectively, that is, $\mu_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{g}_n^{(c)}$ and $\mu = \frac{1}{N_L} \sum_{n=1}^{N_L} \mathbf{g}_n$. Now we obtain

$$\begin{aligned}
\mathcal{J} &= \frac{1}{2} \sum_{c,c'=1}^{C_L} (\mu_c - \mu_{c'})^\top (\mu_c - \mu_{c'}) - \frac{1}{2} \sum_{c,c'=1}^{C_L} \operatorname{tr}\left(\Sigma_c + \Sigma_{c'}\right) \\
&= \frac{1}{2} \sum_{c,c'=1}^{C_L} (\mu_c - \mu + \mu - \mu_{c'})^\top (\mu_c - \mu + \mu - \mu_{c'}) - \sum_{c=1}^{C_L} \operatorname{tr}\left(\Sigma_c\right) \\
&= \operatorname{tr}\left( \sum_{c=1}^{C_L} (\mu_c - \mu)(\mu_c - \mu)^\top \right) - \operatorname{tr}\left( \sum_{c=1}^{C_L} \Sigma_c \right) \\
&= \operatorname{tr}\left(\Sigma_B\right) - \operatorname{tr}\left(\Sigma_W\right) = \operatorname{tr}\left(\Sigma_B - \Sigma_W\right).
\end{aligned} \qquad (4)$$

Since $\operatorname{tr}(\Sigma_B)$ measures the overall variance of the class mean vectors, a large value implies that the class mean vectors scatter in a large space. On the other hand, a small $\operatorname{tr}(\Sigma_W)$ implies that the classes have a small spread. Thus, a large $\mathcal{J}$ indicates that samples are close to each other if they share a common walker but far from each other if they are performed by different walkers.
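For illustration, the separability criterion of Equation (4) can be evaluated on labeled samples without ever forming the $D \times D$ scatter matrices, because only their traces are needed. The following minimal NumPy sketch assumes flattened gait samples stored as rows of an array; the function and variable names are our own and merely illustrate the computation, they are not part of the method's reference implementation:

```python
import numpy as np

def mmc_separability(samples, labels):
    """Class separability J = tr(Sigma_B) - tr(Sigma_W), cf. Eq. (4).

    samples: (N, D) array with one flattened gait sample g_n per row.
    labels:  (N,) array of walker identities (class labels).
    """
    mu = samples.mean(axis=0)
    tr_b = 0.0  # tr(Sigma_B): scatter of class means around the overall mean
    tr_w = 0.0  # tr(Sigma_W): average scatter of samples around their class mean
    for c in np.unique(labels):
        cls = samples[labels == c]
        mu_c = cls.mean(axis=0)
        tr_b += np.sum((mu_c - mu) ** 2)
        tr_w += np.sum((cls - mu_c) ** 2) / len(cls)
    return tr_b - tr_w
```

A larger returned value indicates a space in which samples of the same walker lie closer together and samples of different walkers lie further apart.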
Extracting features, that is, transforming the input data in the measurement space into a feature space of higher $\mathcal{J}$, can be used to link new observations of walkers more successfully.

Feature extraction is given by a linear transformation (feature) matrix $\Phi \in \mathbb{R}^{D \times \widehat{D}}$ from a $D$-dimensional measurement space $\mathcal{G} = \{\mathbf{g}_n\}_{n=1}^{N}$ of not necessarily labeled gait samples to a $\widehat{D}$-dimensional feature space $\widehat{\mathcal{G}} = \{\widehat{\mathbf{g}}_n\}_{n=1}^{N}$ of gait templates, where $\widehat{D} < D$ and each gait sample $\mathbf{g}_n$ is transformed into a gait template $\widehat{\mathbf{g}}_n = \Phi^\top \mathbf{g}_n$. The objective is to learn a transform $\Phi$ that maximizes MMC in the feature space

$$\mathcal{J}(\Phi) = \operatorname{tr}\left( \Phi^\top (\Sigma_B - \Sigma_W) \Phi \right). \qquad (5)$$

Once the transformation is found, all measured samples are transformed into templates (in the feature space) along with the class means and covariances. The templates are compared by the Mahalanobis distance function

$$\widehat{\delta}\left(\widehat{\mathbf{g}}_n, \widehat{\mathbf{g}}_{n'}\right) = \sqrt{\left(\widehat{\mathbf{g}}_n - \widehat{\mathbf{g}}_{n'}\right)^\top \widehat{\Sigma}_T^{-1} \left(\widehat{\mathbf{g}}_n - \widehat{\mathbf{g}}_{n'}\right)}. \qquad (6)$$

We show that a solution to the optimization problem in Equation (5) can be obtained by eigendecomposition of the matrix $\Sigma_B - \Sigma_W$. An important property of the objective $\mathcal{J}(\Phi)$ is that it is invariant w.r.t. rescalings $\Phi \to \alpha\Phi$. Hence, we can always choose $\Phi = \mathbf{f}_1 \| \cdots \| \mathbf{f}_{\widehat{D}}$ such that each column has unit norm, $\mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} = 1$. For this reason we can reduce the problem of maximizing $\mathcal{J}(\Phi)$ to the constrained optimization problem

$$\max \sum_{\widehat{d}=1}^{\widehat{D}} \mathbf{f}_{\widehat{d}}^\top (\Sigma_B - \Sigma_W)\, \mathbf{f}_{\widehat{d}} \quad \text{subject to} \quad \mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} - 1 = 0 \quad \forall\, \widehat{d} = 1, \ldots, \widehat{D}. \qquad (7)$$

To solve the above optimization problem, let us consider the Lagrangian

$$\mathcal{L}\left(\mathbf{f}_{\widehat{d}}, \lambda_{\widehat{d}}\right) = \sum_{\widehat{d}=1}^{\widehat{D}} \mathbf{f}_{\widehat{d}}^\top (\Sigma_B - \Sigma_W)\, \mathbf{f}_{\widehat{d}} - \lambda_{\widehat{d}} \left( \mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} - 1 \right) \qquad (8)$$

with multipliers $\lambda_{\widehat{d}}$. To find the maximum, we differentiate with respect to $\mathbf{f}_{\widehat{d}}$ and equate to zero,

$$\frac{\partial \mathcal{L}\left(\mathbf{f}_{\widehat{d}}, \lambda_{\widehat{d}}\right)}{\partial \mathbf{f}_{\widehat{d}}} = \left( (\Sigma_B - \Sigma_W) - \lambda_{\widehat{d}} \mathbf{I} \right) \mathbf{f}_{\widehat{d}} = \mathbf{0}, \qquad (9)$$

which leads to

$$(\Sigma_B - \Sigma_W)\, \mathbf{f}_{\widehat{d}} = \lambda_{\widehat{d}}\, \mathbf{f}_{\widehat{d}} \qquad (10)$$

where $\lambda_{\widehat{d}}$ are the eigenvalues of $\Sigma_B - \Sigma_W$ and $\mathbf{f}_{\widehat{d}}$ are the corresponding eigenvectors. Putting it all together,

$$(\Sigma_B - \Sigma_W)\, \Phi = \Phi \Lambda \qquad (11)$$

where $\Lambda = \operatorname{diag}\left(\lambda_1, \ldots, \lambda_{\widehat{D}}\right)$ is the eigenvalue matrix. Therefore,

$$\mathcal{J}(\Phi) = \operatorname{tr}\left( \Phi^\top (\Sigma_B - \Sigma_W) \Phi \right) = \operatorname{tr}\left( \Phi^\top \Phi \Lambda \right) = \operatorname{tr}(\Lambda) \qquad (12)$$

is maximized when $\Lambda$ contains the $\widehat{D}$ largest eigenvalues and $\Phi$ contains the corresponding leading eigenvectors.

In the following we discuss how to calculate the eigenvectors of $\Sigma_B - \Sigma_W$ and how to determine an optimal dimensionality $\widehat{D}$ of the feature space. Rewrite $\Sigma_B - \Sigma_W = 2\Sigma_B - \Sigma_T$. Note that the null space of $\Sigma_T$ is a subspace of that of $\Sigma_B$, since the null space of $\Sigma_T$ is the common null space of $\Sigma_B$ and $\Sigma_W$. Thus, we can simultaneously diagonalize $\Sigma_B$ and $\Sigma_T$ to some $\Delta$ and $\mathbf{I}$,

$$\Psi^\top \Sigma_B \Psi = \Delta \qquad \Psi^\top \Sigma_T \Psi = \mathbf{I} \qquad (13)$$

with the $D \times \operatorname{rank}(\Sigma_T)$ eigenvector matrix

$$\Psi = \Omega \Theta^{-1/2} \Xi \qquad (14)$$

where $\Omega$ and $\Theta$ are the eigenvector and eigenvalue matrices of $\Sigma_T$, respectively, and $\Xi$ is the eigenvector matrix of $\Theta^{-1/2} \Omega^\top \Sigma_B \Omega \Theta^{-1/2}$. To calculate $\Psi$, we use a fast two-step algorithm based on Singular Value Decomposition (SVD). SVD expresses a real $r \times s$ matrix $\mathbf{A}$ as a product $\mathbf{A} = \mathbf{U} \mathbf{S} \mathbf{V}^\top$, where $\mathbf{S}$ is a diagonal matrix with decreasing non-negative entries, $\mathbf{U}$ and $\mathbf{V}$ are $r \times \min\{r,s\}$ and $s \times \min\{r,s\}$ eigenvector matrices of $\mathbf{A}\mathbf{A}^\top$ and $\mathbf{A}^\top\mathbf{A}$, respectively, and the non-vanishing entries of $\mathbf{S}$ are the square roots of the non-zero eigenvalues of both $\mathbf{A}\mathbf{A}^\top$ and $\mathbf{A}^\top\mathbf{A}$. Note that $\Sigma_T$ and $\Sigma_B$ can be expressed in the forms

$$\Sigma_T = \mathbf{X}\mathbf{X}^\top \quad \text{where} \quad \mathbf{X} = \tfrac{1}{\sqrt{N_L}} \left[ (\mathbf{g}_1 - \mu) \cdots (\mathbf{g}_{N_L} - \mu) \right]$$

$$\text{and} \quad \Sigma_B = \Upsilon\Upsilon^\top \quad \text{where} \quad \Upsilon = \left[ (\mu_1 - \mu) \cdots (\mu_{C_L} - \mu) \right], \qquad (15)$$

respectively.
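The eigendecomposition route of Equations (10)–(12) and the template comparison of Equation (6) can be sketched directly. This is an illustrative sketch under our own naming, assuming the scatter matrices of Equation (3) have already been computed; the SVD-based Algorithm 1 below reaches the same transform more efficiently when $D = 3JT$ is large:

```python
import numpy as np

def learn_transform_eig(sigma_b, sigma_w, d_hat):
    """Phi maximizing tr(Phi^T (Sigma_B - Sigma_W) Phi), cf. Eqs. (10)-(12).

    Naive route: eigendecompose the symmetric matrix Sigma_B - Sigma_W and
    keep the d_hat leading eigenvectors as the columns of Phi.
    """
    eigvals, eigvecs = np.linalg.eigh(sigma_b - sigma_w)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                     # descending order
    return eigvecs[:, order[:d_hat]]

def template_distance(phi, sigma_t, g_a, g_b):
    """Mahalanobis distance of Eq. (6) between two raw samples after projection."""
    sigma_t_hat = phi.T @ sigma_t @ phi     # total scatter in the feature space
    diff = phi.T @ (g_a - g_b)              # difference of the two templates
    return float(np.sqrt(diff @ np.linalg.solve(sigma_t_hat, diff)))
```

In practice one would project every sample with the learned Φ once and reuse the inverse of the projected total scatter for all subsequent comparisons.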
From these factorizations, we can obtain the eigenvectors $\Omega$ and the corresponding eigenvalues $\Theta$ of $\Sigma_T$ through the SVD of $\mathbf{X}$, and analogously $\Xi$ of $\Theta^{-1/2} \Omega^\top \Sigma_B \Omega \Theta^{-1/2}$ through the SVD of $\Theta^{-1/2} \Omega^\top \Upsilon$. The columns of $\Psi$ are clearly the eigenvectors of $2\Sigma_B - \Sigma_T$ with the corresponding eigenvalues $2\Delta - \mathbf{I}$. Therefore, to constitute the transform $\Phi$ by maximizing the MMC, we should choose the eigenvectors in $\Psi$ that correspond to the eigenvalues of at least $\frac{1}{2}$ in $\Delta$. Note that $\Delta$ contains at most $\operatorname{rank}(\Sigma_B) = C_L - 1$ positive eigenvalues, which gives an upper bound on the feature space dimensionality $\widehat{D}$. Algorithm 1 [5], provided below, is an efficient way of learning the transform $\Phi$ for MMC on given labeled learning data $\mathcal{G}_L$.

Algorithm 1 LearnTransformationMatrixMMC($\mathcal{G}_L$)
1: split $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ into classes $\{\mathcal{I}_c\}_{c=1}^{C_L}$ of $N_c = |\mathcal{I}_c|$ samples
2: compute the overall mean $\mu = \frac{1}{N_L} \sum_{n=1}^{N_L} \mathbf{g}_n$ and the individual class means $\mu_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{g}_n^{(c)}$
3: compute $\Sigma_B = \sum_{c=1}^{C_L} (\mu_c - \mu)(\mu_c - \mu)^\top$
4: compute $\mathbf{X} = \frac{1}{\sqrt{N_L}} \left[ (\mathbf{g}_1 - \mu) \cdots (\mathbf{g}_{N_L} - \mu) \right]$
5: compute $\Upsilon = \left[ (\mu_1 - \mu) \cdots (\mu_{C_L} - \mu) \right]$
6: compute the eigenvectors $\Omega$ and corresponding eigenvalues $\Theta$ of $\Sigma_T$ through the SVD of $\mathbf{X}$
7: compute the eigenvectors $\Xi$ of $\Theta^{-1/2} \Omega^\top \Sigma_B \Omega \Theta^{-1/2}$ through the SVD of $\Theta^{-1/2} \Omega^\top \Upsilon$
8: compute the eigenvectors $\Psi = \Omega \Theta^{-1/2} \Xi$
9: compute the eigenvalues $\Delta = \Psi^\top \Sigma_B \Psi$
10: return the transform $\Phi$ as the eigenvectors in $\Psi$ that correspond to the eigenvalues of at least $1/2$ in $\Delta$

3 Experiments and Results

3.1 Database

For evaluation purposes we have extracted a large number of samples from the general CMU MoCap database [9], a well-known and recognized database of structural human motion data. It contains numerous motion sequences, including a considerable number of gait sequences. Motions are recorded with an optical marker-based Vicon system. People wear a black jumpsuit and have 41 markers taped on. The tracking space of 30 m², surrounded by 12 cameras with a sampling rate of 120 Hz placed at a height of 2 to 4 meters above the ground, creates a video surveillance environment. Motion videos are triangulated to obtain highly accurate 3D data in the form of relative body point coordinates (with respect to the root joint) in each video frame, stored in the standard ASF/AMC data format. Each registered participant is assigned their respective skeleton, described in an ASF file. Motions in the AMC files store bone rotational data, which are interpreted as instructions on how the associated skeleton deforms over time.

These MoCap data, however, contain skeleton parameters pre-calibrated by the CMU staff. Skeletons are unique to each walker, and even a trivial skeleton check could result in 100% recognition. In order to use the collected data in a fair manner, a prototypical skeleton is constructed and used to represent the bodies of all subjects, concealing the unique skeleton parameters of individual walkers. Assuming that all walking subjects are physically identical disables the skeleton check as a potentially unfair classifier. Moreover, this is a skeleton-robust solution, as all bone rotational data are linked with a fixed skeleton. To obtain realistic parameters, the prototypical skeleton is calculated as the mean of all skeletons in the provided ASF files. We calculate 3D joint coordinates using the bone rotational data and the prototypical skeleton. One cannot directly use the raw values of joint coordinates, as they refer to absolute positions in the tracking space, and not all potential methods are invariant to the person's position or walk direction.
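As a rough illustration of this step, joint coordinates can be recovered from per-frame bone rotations and a fixed skeleton by composing rotations from the root towards the limbs. The sketch below is schematic only: the joint layout, function names and rotation conventions are our own simplifications, and real ASF/AMC processing additionally handles per-bone axes and degrees of freedom.

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Rotation from Euler angles in radians (applied in X, Y, Z order)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rot_z @ rot_y @ rot_x

def joint_positions(skeleton, rotations):
    """Forward kinematics for one frame.

    skeleton:  {joint: (parent or None, bone offset as a length-3 sequence)},
               listed with every parent before its children (the prototypical
               skeleton is the same for all subjects).
    rotations: {joint: (rx, ry, rz)} bone rotations of the current frame.
    Returns    {joint: 3D position} with the root placed at the origin, i.e.
               coordinates relative to the root joint.
    """
    positions, global_rot = {}, {}
    for joint, (parent, offset) in skeleton.items():
        local = rotation_matrix(*rotations.get(joint, (0.0, 0.0, 0.0)))
        if parent is None:                       # root joint
            global_rot[joint] = local
            positions[joint] = np.zeros(3)
        else:
            global_rot[joint] = global_rot[parent] @ local
            positions[joint] = positions[parent] + global_rot[parent] @ np.asarray(offset, float)
    return positions
```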
To ensure invariance to the person's position and walk direction, the center of the coordinate system is moved to the position of the root joint, $\gamma_{\mathrm{root}}(t) = [0, 0, 0]^\top$ for each time $t$, and the axes are adjusted to the walker's perspective: the X axis points from right (negative) to left (positive), the Y axis from down (negative) to up (positive), and the Z axis from back (negative) to front (positive). In the AMC file structure notation this is achieved by zeroing the root translation and rotation (root 0 0 0 0 0 0) in all frames of all motion sequences.

Since the general motion database contains all motion types, we extracted a number of sub-motions that represent gait cycles. First, an exemplary gait cycle was identified, and clean gait cycles were then filtered out using the DTW distance over bone rotations. The similarity threshold was set high enough so that even the least similar sub-motion still semantically represents a gait cycle. Finally, subjects that contributed fewer than 10 samples were excluded. The final database has 54 walking subjects that performed 3,843 samples in total, which makes an average of about 71 samples per subject.

3.2 Evaluation Setups and Metrics

Learning data $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ of $C_L$ identities and evaluation data $\mathcal{G}_E = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_E}$ of $C_E$ identity classes have to be disjoint at all times. Evaluation is performed exclusively on the evaluation part, taking no observations of the learning part into account. In the following we introduce two setups of data separation: homogeneous and heterogeneous. The homogeneous setup learns the transformation matrix on one third of the samples of $C_L$ identities and is evaluated on templates derived from the other two thirds of the samples of the same $C_E = C_L$ identities. The heterogeneous setup learns the transform on all samples of $C_L$ identities and is evaluated on all templates derived from other $C_E$ identities. For clarification we refer to Figure 2. Note that, unlike in the homogeneous setup, in the heterogeneous setup no walker identity is ever used for both learning and evaluation at the same time.

Figure 2. Abstraction of data separation for the homogeneous setup of $C_L = C_E = 3$ learning-and-evaluation classes (left) and for the heterogeneous setup of $C_L = 2$ learning classes and $C_E = 4$ evaluation classes (right). The black square represents a database and the ellipses are identity classes.

The homogeneous setup is parametrized by a single number $C_L = C_E$ of learning-and-evaluation identity classes, whereas the heterogeneous setup has the form $(C_L, C_E)$, specifying how many learning and how many evaluation identity classes are randomly selected from the database. Evaluation of each setup is repeated 3 times, selecting new random $C_L$ and $C_E$ identity classes each time and reporting the average result. Please note that in the heterogeneous setup the learning identities are disjoint from the evaluation identities, that is, there is no single identity used for both learning and evaluation.

Correct Classification Rate (CCR) is a standard qualitative measure; however, if a method has a low CCR, we cannot directly say whether the system is failing because of bad features or a bad classifier. Providing an evaluation in terms of the class separability of the feature space gives an estimate of the recognition potential of the extracted features and does not reflect an eventual combination with an unsuitable classifier.
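As one way to picture the data separation just described, the heterogeneous setup can be obtained by drawing two disjoint sets of identities at random; in the homogeneous setup the same identities appear on both sides and only their samples are split 1/3 vs. 2/3. The following is a minimal sketch under our own naming, not the paper's evaluation code:

```python
import numpy as np

def heterogeneous_split(labels, c_l, c_e, seed=0):
    """Draw disjoint learning and evaluation identity sets of sizes (C_L, C_E).

    labels: (N,) array with the walker identity of every sample in the database.
    Returns boolean masks (learn_mask, eval_mask) over the samples; no identity
    is ever used for both learning and evaluation.
    """
    rng = np.random.default_rng(seed)
    identities = rng.permutation(np.unique(labels))
    learn_ids = identities[:c_l]
    eval_ids = identities[c_l:c_l + c_e]
    return np.isin(labels, learn_ids), np.isin(labels, eval_ids)
```

Repeating the draw several times and averaging, as in the experiments below, reduces the dependence on which identities happen to be selected.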
The quality of feature extraction algorithms is reflected in the Davies-Bouldin Index (DBI)

$$\mathrm{DBI} = \frac{1}{C_E} \sum_{c=1}^{C_E} \max_{1 \le c' \le C_E,\; c' \ne c} \frac{\sigma_c + \sigma_{c'}}{\widehat{\delta}\left(\widehat{\mu}_c, \widehat{\mu}_{c'}\right)} \qquad (16)$$

where $\sigma_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \widehat{\delta}\left(\widehat{\mathbf{g}}_n, \widehat{\mu}_c\right)$ is the average distance of all templates in identity class $\mathcal{I}_c$ to its centroid, and similarly for $\sigma_{c'}$. Templates of low intra-class distances and high inter-class distances have a low DBI. DBI is measured on the full evaluation part, whereas CCR is estimated with 10-fold cross-validation, taking one fold with labels withheld as the testing set and the other nine as the gallery. Test templates are classified by the winner-takes-all strategy, in which a test template $\widehat{\mathbf{g}}_{\mathrm{test}}$ gets assigned the label $\arg\min_i \widehat{\delta}\big(\widehat{\mathbf{g}}_{\mathrm{test}}, \widehat{\mathbf{g}}_i^{\mathrm{gallery}}\big)$ of the gallery's closest identity class.

Based on Section 3.1, our database has 54 identity classes in total. We performed the series of experiments A, B, C, D below. Experiments A and B compare the homogeneous and heterogeneous setups, whereas C and D examine how the performance of the system in the heterogeneous setup improves with an increasing number of learning identities. The results are illustrated in Figure 3 and in Figure 4 in the next section.

A homogeneous setup with $C_L = C_E \in \{2, \ldots, 27\}$;
B heterogeneous setup with $C_L = C_E \in \{2, \ldots, 27\}$;
C heterogeneous setup with $C_L \in \{2, \ldots, 27\}$ and $C_E = 27$;
D heterogeneous setup with $C_L \in \{2, \ldots, 52\}$ and $C_E = 54 - C_L$.

3.3 Results

Experiments A and B compare the homogeneous and heterogeneous setups by measuring the drop in performance on an identical number of learning and evaluation identities ($C_L = C_E$). The top plot in Figure 3 shows the measured values of the DBI and CCR metrics in both alternatives, which not only appear comparable, but in some configurations the heterogeneous setup even has a higher CCR. The bottom plot expresses the heterogeneous setup as a percentage of the homogeneous setup in each of the particular metrics. Here we see that with a rising number of identities the heterogeneous setup approaches 100% of the fully homogeneous alternative.

Figure 3. DBI (left vertical axis) and CCR (right vertical axis) for experiment A of the homogeneous setup and experiment B of the heterogeneous setup (top) with $(C_L, C_E)$ configurations (horizontal axes), and their percentages (bottom).

Experiments C and D investigate the impact of the number of learning identities in the heterogeneous setup. As Figure 4 shows, the performance grows quickly on the first configurations with very few learning identities, which we can interpret as an analogy to the Pareto (80–20) principle. Specifically, the results of experiment C show that 8 learning identities achieve almost the same performance (66.78 DBI and 0.902 CCR) as learning on 27 identities (68.32 DBI and 0.947 CCR).
The outcome of experiment D indicates a similar growth in performance, and we see that as few as 14 identities can be enough to learn a transformation matrix that distinguishes 40 completely different people (0.904 CCR).

Figure 4. DBI (left vertical axes) and CCR (right vertical axes) for experiments C (top) and D (bottom) on the heterogeneous setup with $(C_L, C_E)$ configurations (horizontal axes).

The proposed method and seven other state-of-the-art methods [2,4,6,10,12,14,16] have been subjected to extensive simulations on the homogeneous setup in our recent research paper [5]. A variety of class-separability coefficients and classification metrics allows insights from different statistical perspectives. The results indicate that the proposed method is a leading concept for rank-based classifier systems: lowest Davies-Bouldin Index, highest Dunn Index, highest (and exclusively positive) Silhouette Coefficient, second highest Fisher's Discriminant Ratio and, combined with a rank-based classifier, the best Cumulative Match Characteristic, False Accept Rate and False Reject Rate trade-off, Receiver Operating Characteristic (ROC) and recall-precision trade-off scores, along with Correct Classification Rate, Equal Error Rate, Area Under ROC Curve and Mean Average Precision. We interpret the high scores as a sign of robustness. Apart from performance merits, the MMC method is also efficient: low-dimensional templates ($\widehat{D} \le C_L - 1 = C_E - 1 = 53$) and the Mahalanobis distance ensure fast distance computations and thus contribute to high scalability.

4 Conclusions

Despite the many advanced optimization techniques used in statistical pattern recognition, a common practice of state-of-the-art MoCap-based human identification is still to design geometric gait features by hand. As the first contribution of this paper, the proposed method does not involve any ad-hoc features; on the contrary, they are computed from a much larger space beyond the limits of human interpretability. The features are learned directly from raw joint coordinates by a modification of Fisher's LDA with MMC so that the identities are maximally separated. We believe that MMC is a suitable criterion for optimizing gait features; however, our future work will continue with research on further potential optimality criteria and machine learning approaches. Furthermore, we are in the process of developing an evaluation framework with implementation details and source code of all related methods, a data extraction drive from the general CMU MoCap database, and the evaluation mechanism, to support reproducible research. The second contribution lies in showing the possibility of building a representation on one problem and using it on another (related) problem.
Simulations on the CMU MoCap database show that our approach is able to build robust feature spaces without pre-registering and labeling all potential walkers. In fact, the features can be learned on different people (experiments A and B) and on just a fraction of them (experiments C and D). We have observed that with an increasing volume of identities the heterogeneous evaluation setup is on par with the homogeneous setup; that is, it does not matter which identities we learn the features on. One does not have to rely on the availability of all walkers for learning. This is particularly important for a system aiding video surveillance applications, where encountered walkers never supply labeled data. Multiple occurrences of individual walkers can now be linked together even without knowing their actual identities.

Acknowledgments

The authors thank the anonymous reviewers for their detailed commentary and suggestions. The data used in this project were created with funding from NSF EIA-0196217 and were obtained from http://mocap.cs.cmu.edu. Our extracted database is available at https://gait.fi.muni.cz/ to support reproducibility of results.

References

1. Ahmed, F., Paul, P.P., Gavrilova, M.L.: DTW-Based Kernel and Rank-Level Fusion for 3D Gait Recognition Using Kinect. The Visual Computer 31(6), 915–924 (2015), https://doi.org/10.1007/s00371-015-1092-0
2. Ahmed, M., Al-Jawad, N., Sabir, A.: Gait Recognition Based on Kinect Sensor. Proc. SPIE 9139, 91390B–91390B–10 (2014), http://dx.doi.org/10.1117/12.2052588
3. Ali, S., Wu, Z., Li, X., Saeed, N., Wang, D., Zhou, M.: Transactions on Computational Science XXVI: Special Issue on Cyberworlds and Cybersecurity, chap. Applying Geometric Function on Sensors 3D Gait Data for Human Identification, pp. 125–141. Springer, Berlin, Heidelberg (2016), https://doi.org/10.1007/978-3-662-49247-5_8
4. Andersson, V.O., Araujo, R.M.: Person Identification Using Anthropometric and Gait Data from Kinect Sensor. In: Proc. of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15). pp. 425–431. AAAI Press (2015), http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9680
5. Balazia, M., Sojka, P.: Learning Robust Features for Gait Recognition by Maximum Margin Criterion. In: Proc. of 23rd International Conference on Pattern Recognition, ICPR 2016. pp. 901–906. IEEE (Dec 2016)
6. Ball, A., Rye, D., Ramos, F., Velonaki, M.: Unsupervised Clustering of People from 'Skeleton' Data. In: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction. pp. 225–226. HRI '12, ACM, New York, NY, USA (2012), http://doi.acm.org/10.1145/2157689.2157767
7. Castro, F.M., Marín-Jiménez, M.J., Guil, N., de la Blanca, N.P.: Automatic Learning of Gait Signatures for People Identification. CoRR abs/1603.01006 (2016), https://arxiv.org/abs/1603.01006
8. Choudhury, S.D., Tjahjadi, T.: Robust View-Invariant Multiscale Gait Recognition. Pattern Recognition 48(3), 798–811 (2015), http://www.sciencedirect.com/science/article/pii/S0031320314003835
9. CMU Graphics Lab: Carnegie-Mellon Motion Capture (MoCap) Database (2003), http://mocap.cs.cmu.edu
10. Dikovski, B., Madjarov, G., Gjorgjevikj, D.: Evaluation of Different Feature Sets for Gait Recognition Using Skeletal Data from Kinect. In: 37th Intl. Convention on Information and Communication Technology, Electronics and Microelectronics. pp. 1304–1308 (May 2014), http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6859769
11. Kocsor, A., Kovács, K., Szepesvári, C.: Margin Maximizing Discriminant Analysis, pp.
227–238. Springer Berlin Heidelberg, Berlin, Heidelberg (2004), https://doi.org/10.1007/978-3-540-30115-8_23
12. Kwolek, B., Krzeszowski, T., Michalczuk, A., Josinski, H.: 3D Gait Recognition Using Spatio-Temporal Motion Descriptors. In: Proc. of Intelligent Information and Database Systems: 6th Asian Conference, ACIIDS 2014, Bangkok, Thailand, Part II. LNCS, vol. 8398, pp. 595–604. Springer (Apr 2014), https://doi.org/10.1007/978-3-319-05458-2_61
13. Li, H., Jiang, T., Zhang, K.: Efficient and Robust Feature Extraction by Maximum Margin Criterion. IEEE Transactions on Neural Networks 17(1), 157–165 (Jan 2006), https://doi.org/10.1109/TNN.2005.860852
14. Preis, J., Kessel, M., Werner, M., Linnhoff-Popien, C.: Gait Recognition with Kinect. In: 1st International Workshop on Kinect in Pervasive Computing, New Castle, UK, June 18–22. pp. 1–4 (2012), https://www.researchgate.net/publication/239862819_Gait_Recognition_with_Kinect
15. Ramu Reddy, V., Chakravarty, K., Aniruddha, S.: Person Identification in Natural Static Postures Using Kinect. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) Computer Vision – ECCV 2014 Workshops: Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part II. LNCS, vol. 8926, pp. 793–808. Springer (2015), https://doi.org/10.1007/978-3-319-16181-5_60
16. Sinha, A., Chakravarty, K., Bhowmick, B.: Person Identification Using Skeleton Information from Kinect. In: ACHI 2013: Proc. of the Sixth Intl. Conf. on Advances in CHI. pp. 101–108 (2013), https://www.thinkmind.org/index.php?view=article&articleid=achi_2013_4_50_20187
17. Tafazzoli, F., Bebis, G., Louis, S.J., Hussain, M.: Genetic Feature Selection for Gait Recognition. J. of Electron. Imaging 24(1), 013036 (Feb 2015), https://doi.org/10.1117/1.JEI.24.1.013036
18. Valcik, J., Sedmidubsky, J., Zezula, P.: Assessing Similarity Models for Human-Motion Retrieval Applications. Computer Animation and Virtual Worlds 27(5), 484–500 (2016), http://dx.doi.org/10.1002/cav.1674
19. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA (1995)
20. Zeng, W., Wang, C.: View-Invariant Gait Recognition via Deterministic Learning. In: International Joint Conference on Neural Networks (IJCNN). pp. 3465–3472 (Jul 2014), https://doi.org/10.1109/IJCNN.2014.6889507