23rd IEEE/IAPR International Conference on Pattern Recognition 2016, preprint

Learning Robust Features for Gait Recognition by Maximum Margin Criterion

Michal Balazia (ORCID 0000-0001-7153-9984) and Petr Sojka (ORCID 0000-0002-5768-4007)
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
xbalazia@mail.muni.cz and sojka@fi.muni.cz

Abstract—In the field of gait recognition from motion capture data, designing human-interpretable gait features is a common practice of many fellow researchers. To refrain from ad-hoc schemes and to find maximally discriminative features, we may need to explore beyond the limits of human interpretability. This paper contributes to the state of the art with a machine learning approach for extracting robust gait features directly from raw joint coordinates. The features are learned by a modification of Linear Discriminant Analysis with Maximum Margin Criterion so that the identities are maximally separated and, in combination with an appropriate classifier, used for gait recognition. Experiments on the CMU MoCap database show that this method outperforms eight other relevant methods in terms of the distribution of biometric templates in the respective feature spaces, expressed in four class separability coefficients. Additional experiments indicate that this method is a leading concept for rank-based classifier systems.

I. Introduction

From the surveillance perspective, gait pattern biometrics is appealing because it can be performed at a distance, without body-invasive equipment and without the subject's cooperation. This allows data acquisition without a subject's consent. As the data are collected with a high participation rate and surveilled subjects are not expected to claim their identities, the trait is preferably employed for identification rather than for authentication.

Motion capture technology provides video clips of walking individuals containing structural motion data. The format keeps an overall structure of the human body and holds estimated 3D positions of major anatomical landmarks as the person moves. These so-called motion capture data (MoCap) can be collected online by a system of multiple cameras (Vicon) or by a depth camera (Microsoft Kinect). To visualize motion capture data (see Figure 1), a simplified stick figure representing the human skeleton (a graph of joints connected by bones) can be recovered from the values of body point spatial coordinates. With the recent rapid improvement in MoCap sensor accuracy, we believe in an affordable MoCap technology that can be installed in the streets and identify people from MoCap data.

The primary goal of this work is to propose a method for learning robust gait features from raw MoCap data. A collection of extracted features builds a gait template that serves as the walker's signature. Templates are stored in a central database. Recognition of a person involves capturing their walk sample, extracting gait features to compose a template, and finally querying this database for a set of similar templates to report the most likely identity. Similarity of two templates is expressed in a single number computed by a similarity/distance function.

Figure 1. Motion capture data. The skeleton is represented by a stick figure of 31 joints (only 17 are drawn here). Seven selected video frames of a walk sequence contain 3D coordinates of each joint in time. The red and blue lines track the trajectories of hands and feet.
Many geometric gait features have been introduced over the past few years [1]: Ahmed et al. [2] extract horizontal and vertical distances of selected joint pairs. Andersson et al. [3] calculate the mean and standard deviation of the signals of lower joint (hips, knees and ankles) angles. Ball et al. [4] select the mean, standard deviation and maximum of the signals of lower joint (hips, knees and ankles) angles. Dikovski et al. [5] construct 7 different feature sets from a broad spectrum of geometric features, such as static body parameters, joint angles and inter-joint distances aggregated within a gait cycle, along with various statistics. Kwolek et al. [6] extract bone rotations, inter-joint distances, and the person's height. Preis et al. [7] define thirteen pose attributes, eleven of them static body parameters and the other two dynamic parameters: step length and walk speed. Sinha et al. [8] combine a number of gait features: areas of the upper and lower body, inter-joint distances, and all features introduced by Ball et al. [4] and Preis et al. [7].

Clearly, joint angles and step length are schematic and human-interpretable, which is convenient for visualizations and for an intuitive understanding of the models, but unnecessary for automatic gait recognition. This application prefers learning features that maximally separate the identities and are not limited by such dispensable factors. Section II gives a scheme for learning the features by Maximum Margin Criterion. We have implemented and evaluated all methods on a testing database described in Section III-A. Their performance is expressed in several evaluation metrics defined in Section III-B. Results are presented and discussed in Section III-C.

II. Learning Gait Features

In statistical pattern recognition, reducing space dimensionality is a common technique to overcome class estimation problems. Classes are discriminated by projecting high-dimensional input data onto low-dimensional subspaces by linear transformations with the goal of maximizing class separability. We are interested in finding an optimal feature space where a gait template is close to those of the same walker and far from those of different walkers.

Let the model of a human body have $J$ joints and let all samples be linearly normalized to their average length $T$. Labeled learning data in a sample (measurement) space have the form $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ where
$$\mathbf{g}_n = \left[ [\gamma_1(1) \cdots \gamma_J(1)]^\top \cdots [\gamma_1(T) \cdots \gamma_J(T)]^\top \right]^\top \quad (1)$$
is a gait sample (one gait cycle) in which $\gamma_j(t) \in \mathbb{R}^3$ are the 3D spatial coordinates of a joint $j \in \{1, \ldots, J\}$ at time $t \in \{1, \ldots, T\}$, normalized with respect to the person's position and walk direction. Note that the samples have dimensionality $D = 3JT$. Each learning sample falls strictly into one of the learning identity classes $\{\mathcal{I}_c\}_{c=1}^{C}$. A class $\mathcal{I}_c$ has $N_c$ samples. Here $\mathcal{I}_c \cap \mathcal{I}_{c'} = \emptyset$ for $c \neq c'$ and $\mathcal{G}_L = \bigcup_{c=1}^{C} \mathcal{I}_c$. The label $\ell_n$ is the ground-truth identity class of the walker. We say that samples $\mathbf{g}_n$ and $\mathbf{g}_{n'}$ share a common walker if they are in the same class, i.e., $\mathbf{g}_n, \mathbf{g}_{n'} \in \mathcal{I}_c \Leftrightarrow \ell_n = \ell_{n'}$.

We measure the class separability of a given feature space by a representation of the Maximum Margin Criterion (MMC) used by Vapnik's Support Vector Machines (SVM) [9]
$$\mathcal{J} = \frac{1}{2} \sum_{c,c'=1}^{C} \left( (\mu_c - \mu_{c'})^\top (\mu_c - \mu_{c'}) - \mathrm{tr}\left(\Sigma_c + \Sigma_{c'}\right) \right) \quad (2)$$
which is actually a summation of $\frac{1}{2}C(C-1)$ between-class margins.
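To make the notation concrete, the following minimal numpy sketch (an illustration, not the paper's code) flattens a (T, J, 3) array of position- and direction-normalized joint trajectories into the $D = 3JT$ sample vector of Equation (1) and evaluates the criterion $\mathcal{J}$ of Equation (2) from the class means and the traces of the class scatter matrices; the array layout and helper names are assumptions made for this example.

```python
import numpy as np
from itertools import combinations

def gait_sample(joints):
    """Flatten a (T, J, 3) array of normalized joint coordinates into the
    D = 3*J*T gait sample vector of Eq. (1): frame blocks [gamma_1(t) ... gamma_J(t)]."""
    T, J, _ = joints.shape
    return joints.reshape(T * J * 3)

def mmc_criterion(samples, labels):
    """Sum the 1/2*C*(C-1) between-class margins of Eq. (2): squared distance of
    class means minus the two class variances (traces of class scatter matrices).
    samples: (N, D) array, labels: length-N integer array."""
    classes = np.unique(labels)
    means = {c: samples[labels == c].mean(axis=0) for c in classes}
    # tr(Sigma_c) = mean squared distance of class members to the class mean
    variances = {c: ((samples[labels == c] - means[c]) ** 2).sum(axis=1).mean()
                 for c in classes}
    margin = lambda c1, c2: ((means[c1] - means[c2]) @ (means[c1] - means[c2])
                             - (variances[c1] + variances[c2]))
    return sum(margin(c1, c2) for c1, c2 in combinations(classes, 2))
```

A higher value of `mmc_criterion` computed on transformed templates indicates a feature space in which the identities are better separated.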
The margin is defined as the Euclidean distance of the class means minus both individual variances (traces of the scatter matrices $\Sigma_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \left(\mathbf{g}_n^{(c)} - \mu_c\right)\left(\mathbf{g}_n^{(c)} - \mu_c\right)^\top$ and similarly for $\Sigma_{c'}$). For the whole labeled data, we denote the between-class, within-class and total scatter matrices
$$\Sigma_B = \sum_{c=1}^{C_L} (\mu_c - \mu)(\mu_c - \mu)^\top$$
$$\Sigma_W = \sum_{c=1}^{C_L} \frac{1}{N_c}\sum_{n=1}^{N_c} \left(\mathbf{g}_n^{(c)} - \mu_c\right)\left(\mathbf{g}_n^{(c)} - \mu_c\right)^\top$$
$$\Sigma_T = \sum_{c=1}^{C_L} \frac{1}{N_c}\sum_{n=1}^{N_c} \left(\mathbf{g}_n^{(c)} - \mu\right)\left(\mathbf{g}_n^{(c)} - \mu\right)^\top = \Sigma_B + \Sigma_W \quad (3)$$
where $\mathbf{g}_n^{(c)}$ denotes the $n$-th sample in class $\mathcal{I}_c$, and $\mu_c$ and $\mu$ are the sample means of class $\mathcal{I}_c$ and of the whole data set, respectively, that is, $\mu_c = \frac{1}{N_c}\sum_{n=1}^{N_c} \mathbf{g}_n^{(c)}$ and $\mu = \frac{1}{N_L}\sum_{n=1}^{N_L} \mathbf{g}_n$. Now we obtain
$$\begin{aligned}
\mathcal{J} &= \frac{1}{2}\sum_{c,c'=1}^{C} (\mu_c - \mu_{c'})^\top (\mu_c - \mu_{c'}) - \frac{1}{2}\sum_{c,c'=1}^{C} \mathrm{tr}\left(\Sigma_c + \Sigma_{c'}\right) \\
&= \frac{1}{2}\sum_{c,c'=1}^{C} (\mu_c - \mu + \mu - \mu_{c'})^\top (\mu_c - \mu + \mu - \mu_{c'}) - \sum_{c=1}^{C}\mathrm{tr}\left(\Sigma_c\right) \\
&= \mathrm{tr}\left(\sum_{c=1}^{C} (\mu_c - \mu)(\mu_c - \mu)^\top\right) - \mathrm{tr}\left(\sum_{c=1}^{C}\Sigma_c\right) \\
&= \mathrm{tr}\left(\Sigma_B\right) - \mathrm{tr}\left(\Sigma_W\right) = \mathrm{tr}\left(\Sigma_B - \Sigma_W\right). \quad (4)
\end{aligned}$$

Since $\mathrm{tr}(\Sigma_B)$ measures the overall variance of the class mean vectors, a large value implies that the class mean vectors scatter in a large space. On the other hand, a small $\mathrm{tr}(\Sigma_W)$ implies that the classes have a small spread. Thus, a large $\mathcal{J}$ indicates that samples are close to each other if they share a common walker but far from each other if they are performed by different walkers. Extracting features, that is, transforming the input data in the measurement space into a feature space of higher $\mathcal{J}$, can be used to link new observations of walkers more successfully.

Feature extraction is given by a linear transformation (feature) matrix $\Phi \in \mathbb{R}^{D \times \widehat{D}}$ from a $D$-dimensional measurement space $\mathcal{G} = \{\mathbf{g}_n\}_{n=1}^{N}$ of not necessarily labeled gait samples into a $\widehat{D}$-dimensional feature space $\widehat{\mathcal{G}} = \{\widehat{\mathbf{g}}_n\}_{n=1}^{N}$ of gait templates, where $\widehat{D} < D$. Gait samples $\mathbf{g}_n$ are transformed into gait templates $\widehat{\mathbf{g}}_n = \Phi^\top \mathbf{g}_n$. The objective is to learn a transform $\Phi$ that maximizes the accumulated margin of the feature space
$$\mathcal{J}(\Phi) = \mathrm{tr}\left(\Phi^\top (\Sigma_B - \Sigma_W)\Phi\right). \quad (5)$$
Once the transformation is found, all measured samples are transformed into templates (in the feature space) along with the class means and covariances. The templates are compared by the Mahalanobis distance function
$$\widehat{\delta}\left(\widehat{\mathbf{g}}_n, \widehat{\mathbf{g}}_{n'}\right) = \sqrt{\left(\widehat{\mathbf{g}}_n - \widehat{\mathbf{g}}_{n'}\right)^\top \widehat{\Sigma}_T^{-1} \left(\widehat{\mathbf{g}}_n - \widehat{\mathbf{g}}_{n'}\right)}. \quad (6)$$

Now we show that the solution to the optimization problem in Equation (5) can be obtained by eigendecomposition of the matrix $\Sigma_B - \Sigma_W$. An important property of the objective $\mathcal{J}(\Phi)$ is that it is invariant w.r.t. rescalings $\Phi \rightarrow \alpha\Phi$. Since it is a scalar itself, we can always choose $\Phi = \left[\mathbf{f}_1 \| \cdots \| \mathbf{f}_{\widehat{D}}\right]$ such that $\mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} = 1$ and reduce the problem of maximizing $\mathcal{J}(\Phi)$ to the constrained optimization problem
$$\max \sum_{\widehat{d}=1}^{\widehat{D}} \mathbf{f}_{\widehat{d}}^\top (\Sigma_B - \Sigma_W)\mathbf{f}_{\widehat{d}} \quad \text{subject to} \quad \mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} - 1 = 0 \quad \forall\, \widehat{d} = 1, \ldots, \widehat{D}. \quad (7)$$
To solve the above optimization problem, consider the Lagrangian
$$\mathcal{L}\left(\mathbf{f}_{\widehat{d}}, \lambda_{\widehat{d}}\right) = \sum_{\widehat{d}=1}^{\widehat{D}} \mathbf{f}_{\widehat{d}}^\top (\Sigma_B - \Sigma_W)\mathbf{f}_{\widehat{d}} - \lambda_{\widehat{d}}\left(\mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} - 1\right) \quad (8)$$
with multipliers $\lambda_{\widehat{d}}$. To find the maximum, we differentiate with respect to $\mathbf{f}_{\widehat{d}}$ and set the derivative to zero
$$\frac{\partial \mathcal{L}\left(\mathbf{f}_{\widehat{d}}, \lambda_{\widehat{d}}\right)}{\partial \mathbf{f}_{\widehat{d}}} = \left((\Sigma_B - \Sigma_W) - \lambda_{\widehat{d}} I\right)\mathbf{f}_{\widehat{d}} = 0 \quad (9)$$
which leads to
$$(\Sigma_B - \Sigma_W)\,\mathbf{f}_{\widehat{d}} = \lambda_{\widehat{d}}\,\mathbf{f}_{\widehat{d}} \quad (10)$$
where $\lambda_{\widehat{d}}$ are the eigenvalues of $\Sigma_B - \Sigma_W$ and $\mathbf{f}_{\widehat{d}}$ are the corresponding eigenvectors. Therefore,
$$\mathcal{J}(\Phi) = \mathrm{tr}\left(\Phi^\top (\Sigma_B - \Sigma_W)\Phi\right) = \mathrm{tr}\left(\Phi^\top \Lambda \Phi\right) = \mathrm{tr}(\Lambda) \quad (11)$$
is maximized when $\Lambda = \mathrm{diag}\left(\lambda_1, \ldots, \lambda_{\widehat{D}}\right)$ contains the $\widehat{D}$ largest eigenvalues and $\Phi$ contains the corresponding leading eigenvectors.
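This derivation translates directly into a few lines of numpy. The sketch below (an illustration in the paper's notation, not the reference implementation) forms $\Sigma_B$ and $\Sigma_W$ from labeled samples, keeps the leading eigenvectors of $\Sigma_B - \Sigma_W$ as the columns of $\Phi$, and compares templates with the Mahalanobis distance of Equation (6). For clarity it operates on the full $D \times D$ scatter matrices; the two-step SVD procedure described next avoids that cost.

```python
import numpy as np

def learn_mmc_transform(samples, labels, n_features):
    """Columns of the returned Phi are the n_features leading eigenvectors of
    Sigma_B - Sigma_W, maximizing J(Phi) of Eq. (5).
    samples: (N, D) array, labels: length-N integer array."""
    mu = samples.mean(axis=0)
    D = samples.shape[1]
    sigma_b = np.zeros((D, D))
    sigma_w = np.zeros((D, D))
    for c in np.unique(labels):
        X_c = samples[labels == c]
        mu_c = X_c.mean(axis=0)
        sigma_b += np.outer(mu_c - mu, mu_c - mu)
        sigma_w += (X_c - mu_c).T @ (X_c - mu_c) / len(X_c)
    # Sigma_B - Sigma_W is symmetric, so eigh applies; eigenvalues come out ascending
    eigvals, eigvecs = np.linalg.eigh(sigma_b - sigma_w)
    return eigvecs[:, np.argsort(eigvals)[::-1][:n_features]]   # Phi: (D, n_features)

def template_distance(t1, t2, sigma_t_inv):
    """Mahalanobis distance of Eq. (6) between two templates in the feature space;
    sigma_t_inv is the (pseudo-)inverse of the total scatter of the templates."""
    d = t1 - t2
    return float(np.sqrt(d @ sigma_t_inv @ d))

# usage sketch: Phi = learn_mmc_transform(G_L, labels, n_features)
#               templates = G_L @ Phi
```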
In the following, we discuss how to calculate the eigenvectors of $\Sigma_B - \Sigma_W$ and how to determine an optimal dimensionality $\widehat{D}$ of the feature space. First, we rewrite $\Sigma_B - \Sigma_W = 2\Sigma_B - \Sigma_T$. Note that the null space of $\Sigma_T$ is a subspace of that of $\Sigma_B$, since the null space of $\Sigma_T$ is the common null space of $\Sigma_B$ and $\Sigma_W$. Thus, we can simultaneously diagonalize $\Sigma_B$ and $\Sigma_T$ to some $\Delta$ and $I$
$$\Psi^\top \Sigma_B \Psi = \Delta \qquad \Psi^\top \Sigma_T \Psi = I \quad (12)$$
with the $D \times \mathrm{rank}(\Sigma_T)$ eigenvector matrix
$$\Psi = \Omega\,\Theta^{-\frac{1}{2}}\,\Xi \quad (13)$$
where $\Omega$ and $\Theta$ are the eigenvector and eigenvalue matrices of $\Sigma_T$, respectively, and $\Xi$ is the eigenvector matrix of $\Theta^{-1/2}\Omega^\top \Sigma_B \Omega\,\Theta^{-1/2}$.

To calculate $\Psi$, we use a fast two-step algorithm [10] based on Singular Value Decomposition (SVD). SVD expresses a real $r \times s$ matrix $A$ as a product $A = U D V^\top$, where $D$ is a diagonal matrix with decreasing non-negative entries, and $U$ and $V$ are $r \times \min\{r,s\}$ and $s \times \min\{r,s\}$ eigenvector matrices of $A A^\top$ and $A^\top A$, respectively; the non-vanishing entries of $D$ are the square roots of the corresponding non-zero eigenvalues of both $A A^\top$ and $A^\top A$. Note that $\Sigma_T$ and $\Sigma_B$ can be expressed in the forms
$$\Sigma_T = X X^\top \quad \text{where } X = \tfrac{1}{\sqrt{N_L}}\left[(\mathbf{g}_1 - \mu) \cdots (\mathbf{g}_{N_L} - \mu)\right]$$
$$\Sigma_B = \Upsilon\Upsilon^\top \quad \text{where } \Upsilon = \left[(\mu_1 - \mu) \cdots (\mu_{C_L} - \mu)\right], \quad (14)$$
respectively. Hence, we can obtain the eigenvectors $\Omega$ and the corresponding eigenvalues $\Theta$ of $\Sigma_T$ through the SVD of $X$, and analogically $\Xi$ of $\Theta^{-1/2}\Omega^\top \Sigma_B \Omega\,\Theta^{-1/2}$ through the SVD of $\Theta^{-1/2}\Omega^\top \Upsilon$. The columns of $\Psi$ are clearly the eigenvectors of $2\Sigma_B - \Sigma_T$ with the corresponding eigenvalues $2\Delta - I$. Therefore, to constitute the transform $\Phi$ by maximizing the MMC, we should choose the eigenvectors in $\Psi$ that correspond to eigenvalues of at least $\frac{1}{2}$ in $\Delta$. Note that $\Delta$ contains at most $\mathrm{rank}(\Sigma_B) = C - 1$ positive eigenvalues.

We found inspiration in Fisher's Linear Discriminant Analysis (LDA) [11], which uses the Fisher criterion
$$\mathcal{J}\left(\Phi_{\mathrm{LDA}}\right) = \mathrm{tr}\left(\frac{\Phi_{\mathrm{LDA}}^\top \Sigma_B \Phi_{\mathrm{LDA}}}{\Phi_{\mathrm{LDA}}^\top \Sigma_W \Phi_{\mathrm{LDA}}}\right). \quad (15)$$
However, since the rank of $\Sigma_W$ is at most $N_L - C$, it is a singular (non-invertible) matrix if $N_L$ is less than $D + C$ or, analogously, it might be unstable if $N_L \ll D$. Small sample size is a substantial difficulty, as it is necessary to calculate $\Sigma_W^{-1}$. To alleviate this, the measured data can first be projected to a lower-dimensional space using Principal Component Analysis (PCA), resulting in a two-stage PCA+LDA feature extraction technique [12]
$$\mathcal{J}\left(\Phi_{\mathrm{PCA}}\right) = \mathrm{tr}\left(\Phi_{\mathrm{PCA}}^\top \Sigma_T \Phi_{\mathrm{PCA}}\right) \qquad \mathcal{J}\left(\Phi_{\mathrm{LDA}}\right) = \mathrm{tr}\left(\frac{\Phi_{\mathrm{LDA}}^\top \Phi_{\mathrm{PCA}}^\top \Sigma_B \Phi_{\mathrm{PCA}} \Phi_{\mathrm{LDA}}}{\Phi_{\mathrm{LDA}}^\top \Phi_{\mathrm{PCA}}^\top \Sigma_W \Phi_{\mathrm{PCA}} \Phi_{\mathrm{LDA}}}\right) \quad (16)$$
and the final transform is $\Phi = \Phi_{\mathrm{PCA}}\Phi_{\mathrm{LDA}}$. Given that there are $D'$ principal components, then regardless of the original dimensionality $D$ there are at least $D' + 1$ independent data points. Thus, if the $D' \times D'$ matrix $\Phi_{\mathrm{PCA}}^\top \Sigma_W \Phi_{\mathrm{PCA}}$ is estimated from $N_L - C$ independent observations and provided that $C \le D' \le N_L - C$, we can always invert $\Phi_{\mathrm{PCA}}^\top \Sigma_W \Phi_{\mathrm{PCA}}$ and in this way obtain the LDA estimate. Note that this method is sub-optimal for multi-class problems [13], as PCA keeps at most $N_L - C$ principal components whereas at least $N_L - 1$ of them are necessary in order not to lose information. PCA+LDA in this form has been used for silhouette-based (2D) gait recognition by Su et al. [14] and is included in our experiments with MoCap (3D).
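For comparison, Equation (16) can also be sketched compactly in numpy. The snippet below is only an illustration of the two-stage PCA+LDA pipeline under the assumption that the reduced within-class scatter is invertible; Algorithm 2 below gives the authors' procedure.

```python
import numpy as np

def learn_pca_lda_transform(samples, labels, pca_dim):
    """Two-stage PCA+LDA of Eq. (16): project onto pca_dim principal components
    of Sigma_T, then solve the Fisher criterion in the reduced space."""
    mu = samples.mean(axis=0)
    classes = np.unique(labels)
    D = samples.shape[1]
    sigma_b = np.zeros((D, D))
    sigma_w = np.zeros((D, D))
    for c in classes:
        X_c = samples[labels == c]
        mu_c = X_c.mean(axis=0)
        sigma_b += np.outer(mu_c - mu, mu_c - mu)
        sigma_w += (X_c - mu_c).T @ (X_c - mu_c) / len(X_c)
    # PCA stage: leading eigenvectors of Sigma_T = Sigma_B + Sigma_W
    t_vals, t_vecs = np.linalg.eigh(sigma_b + sigma_w)
    phi_pca = t_vecs[:, np.argsort(t_vals)[::-1][:pca_dim]]
    # LDA stage: eigenvectors of (Phi_PCA^T Sigma_W Phi_PCA)^-1 (Phi_PCA^T Sigma_B Phi_PCA)
    sw = phi_pca.T @ sigma_w @ phi_pca
    sb = phi_pca.T @ sigma_b @ phi_pca
    l_vals, l_vecs = np.linalg.eig(np.linalg.solve(sw, sb))
    order = np.argsort(l_vals.real)[::-1][:len(classes) - 1]
    return phi_pca @ l_vecs[:, order].real          # Phi = Phi_PCA Phi_LDA
```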
On given labeled learning data $\mathcal{G}_L$, Algorithm 1 and Algorithm 2 below are efficient ways of learning the transforms $\Phi$ for MMC and PCA+LDA, respectively.

Algorithm 1 LearnTransformationMatrixMMC($\mathcal{G}_L$)
1: split $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ into classes $\{\mathcal{I}_c\}_{c=1}^{C_L}$ of $N_c = |\mathcal{I}_c|$ samples
2: compute the overall mean $\mu = \frac{1}{N_L}\sum_{n=1}^{N_L}\mathbf{g}_n$ and the individual class means $\mu_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{g}_n^{(c)}$
3: compute $\Sigma_B = \sum_{c=1}^{C_L}(\mu_c - \mu)(\mu_c - \mu)^\top$
4: compute $X = \frac{1}{\sqrt{N_L}}\left[(\mathbf{g}_1 - \mu) \cdots (\mathbf{g}_{N_L} - \mu)\right]$
5: compute $\Upsilon = \left[(\mu_1 - \mu) \cdots (\mu_{C_L} - \mu)\right]$
6: compute the eigenvectors $\Omega$ and corresponding eigenvalues $\Theta$ of $\Sigma_T$ through the SVD of $X$
7: compute the eigenvectors $\Xi$ of $\Theta^{-1/2}\Omega^\top\Sigma_B\Omega\,\Theta^{-1/2}$ through the SVD of $\Theta^{-1/2}\Omega^\top\Upsilon$
8: compute the eigenvectors $\Psi = \Omega\,\Theta^{-1/2}\,\Xi$
9: compute the eigenvalues $\Delta = \Psi^\top\Sigma_B\Psi$
10: return the transform $\Phi$ as the eigenvectors in $\Psi$ that correspond to eigenvalues of at least $1/2$ in $\Delta$

Algorithm 2 LearnTransformationMatrixPCALDA($\mathcal{G}_L$)
1: split $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ into classes $\{\mathcal{I}_c\}_{c=1}^{C_L}$ of $N_c = |\mathcal{I}_c|$ samples
2: compute the overall mean $\mu = \frac{1}{N_L}\sum_{n=1}^{N_L}\mathbf{g}_n$ and the individual class means $\mu_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\mathbf{g}_n^{(c)}$
3: compute $\Sigma_B = \sum_{c=1}^{C_L}(\mu_c - \mu)(\mu_c - \mu)^\top$
4: compute $\Sigma_W = \sum_{c=1}^{C_L}\frac{1}{N_c}\sum_{n=1}^{N_c}\left(\mathbf{g}_n^{(c)} - \mu_c\right)\left(\mathbf{g}_n^{(c)} - \mu_c\right)^\top$
5: compute the eigenvectors $\Phi_{\mathrm{PCA}}$ of $\Sigma_T = \Sigma_B + \Sigma_W$ that correspond to the $D'$ largest eigenvalues (we set $D' = C_L$)
6: compute the eigenvectors $\Phi_{\mathrm{LDA}}$ of $\left(\Phi_{\mathrm{PCA}}^\top\Sigma_W\Phi_{\mathrm{PCA}}\right)^{-1}\left(\Phi_{\mathrm{PCA}}^\top\Sigma_B\Phi_{\mathrm{PCA}}\right)$
7: return the transform $\Phi = \Phi_{\mathrm{PCA}}\Phi_{\mathrm{LDA}}$

III. Experimental Evaluation

A. Database

For evaluation purposes, we have extracted a large number of samples from the general MoCap database from CMU [15], a well-known and recognized database of structural human motion data. It contains numerous motion sequences, including a considerable number of gait sequences. Motions are recorded with an optical marker-based Vicon system. People wear a black jumpsuit and have 41 markers taped on. The tracking space of 30 m², surrounded by 12 cameras with a sampling rate of 120 Hz placed at a height of 2 to 4 meters above ground, creates a video surveillance environment. Motion videos are triangulated to obtain highly accurate 3D data in the form of relative body point coordinates (with respect to the root joint) in each video frame, stored in the standard ASF/AMC data format. Each registered participant is assigned their respective skeleton, described in an ASF file. Motions in the AMC files store bone rotational data, which is interpreted as instructions about how the associated skeleton deforms over time.

To use the collected data in a fair manner, a prototypical skeleton is constructed and used to represent the bodies of all subjects, shrouding the unique skeleton parameters of individual walkers. Assuming that all walking subjects are physically identical disables a trivial skeleton check as a potentially unfair 100% classifier. Moreover, this is a skeleton-robust solution, as all bone rotational data are linked with a fixed skeleton. To obtain realistic parameters, it is calculated as the mean of all skeletons in the provided ASF files. 3D joint coordinates are calculated using the bone rotational data and the prototypical skeleton.

One cannot directly use the raw values of joint coordinates, as they refer to absolute positions in the tracking space, and not all potential methods are invariant to the person's position or walk direction. To ensure such invariance, the center of the coordinate system is moved to the position of the root joint, $\gamma_{\mathrm{root}}(t) = [0, 0, 0]^\top$ for each time $t$, and the axes are adjusted to the walker's perspective: the X axis is from right (negative) to left (positive), the Y axis is from down (negative) to up (positive), and the Z axis is from back (negative) to front (positive). In the AMC file structure notation, this is achieved by zeroing the root translation and rotation (root 0 0 0 0 0 0) in all frames of all motion sequences.
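For MoCap streams that are not stored as ASF/AMC bone rotations, an equivalent normalization can be applied directly to 3D joint positions. The helper below is hypothetical (the joint indices and the hip-based heading estimate are assumptions, not part of the CMU toolchain): it centers every frame at the root joint and rotates about the vertical axis so that the walker faces the positive Z direction, matching the axis convention described above.

```python
import numpy as np

def normalize_walk(joints, root=0, left_hip=1, right_hip=6):
    """Normalize a (T, J, 3) array of absolute joint positions: translate each
    frame so the root joint sits at the origin (gamma_root(t) = [0, 0, 0]) and
    rotate about the vertical Y axis so the walker faces +Z.
    Joint indices are hypothetical and depend on the skeleton layout."""
    centered = joints - joints[:, root:root + 1, :]
    out = np.empty_like(centered)
    for t, frame in enumerate(centered):
        # facing direction: horizontal vector perpendicular to the hip line
        hips = frame[left_hip] - frame[right_hip]
        heading = np.array([-hips[2], 0.0, hips[0]])
        heading /= np.linalg.norm(heading) + 1e-12
        angle = np.arctan2(heading[0], heading[2])   # angle of heading from +Z
        c, s = np.cos(-angle), np.sin(-angle)        # rotate heading onto +Z
        rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        out[t] = frame @ rot_y.T
    return out
```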
Since the general motion database contains all motion types, we extracted a number of sub-motions that represent gait cycles. First, an exemplary gait cycle was identified, and clean gait cycles were then filtered using the DTW distance over bone rotations. The similarity threshold was set high enough so that even the least similar sub-motion still semantically represents a gait cycle. Finally, subjects that contributed fewer than 10 samples were excluded. The final database has 54 walking subjects who performed 3,843 samples in total, which makes an average of about 71 samples per subject.

B. Performance Metrics

All results are estimated with nested cross-validation (see Figure 2), which involves two partial cross-validation loops. In the outer 3-fold cross-validation loop, the $N_L$ labeled templates in one fold are used for learning the features and thus training the model. This model is frozen and ready to be evaluated for class separability coefficients on the remaining two folds of $N_E$ labeled templates. Both the learning and the evaluation sets contain templates of all $C$ identities. Evaluation of the classification metrics advances to the inner 10-fold cross-validation loop, taking one fold, with labels removed, as the testing set and the other nine labeled folds as the gallery. Test templates are classified by the winner-takes-all strategy, in which a test template $\widehat{\mathbf{g}}_{\mathrm{test}}$ gets assigned the label $\arg\min_i \widehat{\delta}\left(\widehat{\mathbf{g}}_{\mathrm{test}}, \widehat{\mathbf{g}}_i^{\mathrm{gallery}}\right)$ of the gallery's closest identity class.

Figure 2. Nested cross-validation.

Correct Classification Rate (CCR) is often perceived as the ultimate qualitative measure; however, if a method has a low CCR, we cannot directly say whether the system is failing because of bad features or a bad classifier. It is more explanatory to provide an evaluation in terms of the class separability of the feature space. The class separability measures give an estimate of the recognition potential of the extracted features and do not reflect an eventual combination with an unsuitable classifier:

∙ Davies-Bouldin Index (DBI)
$$\mathrm{DBI} = \frac{1}{C}\sum_{c=1}^{C}\ \max_{1 \le c' \le C,\ c' \ne c}\ \frac{\sigma_c + \sigma_{c'}}{\widehat{\delta}\left(\widehat{\mu}_c, \widehat{\mu}_{c'}\right)} \quad (17)$$
where $\sigma_c = \frac{1}{N_c}\sum_{n=1}^{N_c}\widehat{\delta}\left(\widehat{\mathbf{g}}_n, \widehat{\mu}_c\right)$ is the average distance of all elements in identity class $\mathcal{I}_c$ to its centroid, and analogically for $\sigma_{c'}$. Templates of low intra-class distances and of high inter-class distances have a low DBI.

∙ Dunn Index: DI
DI = min 1≤c