Joint International Workshops on Structural and Syntactic Pattern Recognition and Statistical Techniques in Pattern Recognition 2016, preprint

Walker-Independent Features for Gait Recognition from Motion Capture Data

Michal Balazia (0000-0001-7153-9984) and Petr Sojka (0000-0002-5768-4007)
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
xbalazia@mail.muni.cz and sojka@fi.muni.cz

Abstract. MoCap-based human identification, as a pattern recognition discipline, can be optimized using a machine learning approach. Yet in some applications, such as video surveillance, new identities can appear on the fly and labeled data for all encountered people may not always be available. This work introduces the concept of learning walker-independent gait features directly from raw joint coordinates by a modification of Fisher's Linear Discriminant Analysis with the Maximum Margin Criterion. Our new approach shows not only that these features can discriminate people other than those they were learned on, but also that the number of learning identities can be much smaller than the number of walkers encountered in real operation.

1 Introduction

Recent rapid improvements in motion capture (MoCap) sensor accuracy have brought affordable technology that can identify walking people. MoCap technology provides video clips of walking individuals containing structural motion data. The format keeps the overall structure of the human body and holds estimated 3D positions of major anatomical landmarks as the person moves. MoCap data can be collected online by a system of multiple cameras (Vicon) or by a depth camera (Microsoft Kinect). To visualize motion capture data (see Figure 1), a simplified stick figure representing the human skeleton (a graph of joints connected by bones) can be recovered from the body point spatial coordinates. Shape-based solutions utilize human silhouettes that are often corrupted and overlapping, whereas structure-based solutions estimate body point positions with much higher accuracy using 3D technology.

Figure 1. Motion capture data. The skeleton is represented by a stick figure of 17 joints (left). Seven selected video frames (right) of a walk sequence contain the 3D spatial coordinates of each joint in time. The red and blue lines track the trajectories of hands and feet.

People spotted in our tracking space do not walk all the time; on the contrary, they perform various activities. Recognizing people by gait requires processing the video segments in which they are actually walking: clean gait cycles first need to be filtered out from the video sequences of general motion. Some methods focus on detecting gait cycles directly [1,2]; alternatively, a general action recognition method [3,4,5,6,7] can be used, which needs only an example of a gait cycle to search general motion sequences.
Given a query motion clip in which a person performs an action classified as a gait cycle, the system can proceed to the identification phase. The gait sample of each recorded walker, in raw MoCap form, is pre-processed to retain the discriminative gait information. A collection of extracted gait features, such as feet distance or elbow angle, builds a gait pattern descriptor. The descriptor, also referred to as a gait pattern, serves as the walker's signature. Associated with the walker's identity, the descriptors are stored in a central database. To identify someone by gait means to classify the identity of their gait pattern, which is unknown at the moment. The classifier composes a query to search this database for a set of similar gait patterns, retrieving a collection of candidate identities and reporting the most likely one as the classified identity. Here, the similarity of two gait patterns is expressed as a single number computed by a similarity/distance function of their descriptors.

Recognition rate is clearly the most important qualitative measure; however, as video surveillance applications process large amounts of data, we are also interested in how the system performs with limited time and space resources. Computational complexity, as a quantitative measure, is the amount of computational resources used to operate the gait recognition system. This measure involves the computational time and space of frequent operations, such as descriptor extraction, identity classification, and database maintenance. Even a method with a 100% recognition rate is unacceptable if processing a short motion clip takes too long or if a small database occupies too much memory.

Recognizing a person by walk involves capturing and normalizing their walk sample, extracting gait features to compose a template, and finally querying a central database for a set of similar templates to report the most likely identity. This work focuses on extracting robust and discriminative gait features from raw MoCap data. Many geometric gait features have been introduced over the past few years. They are typically combinations of static body parameters (bone lengths, person's height) [12] with dynamic gait features such as step length, walk speed, joint angles and inter-joint distances [4,1,12,15], along with various statistics (mean, standard deviation or maximum) of their signals [3]. Clearly, these features are schematic and human-interpretable, which is convenient for visualization and intuitive understanding, but unnecessary for automatic gait recognition. Instead, this application prefers learning features that maximally separate the identity classes and are not limited by such dispensable factors.

Methods for 2D gait recognition extensively use machine learning models for extracting gait features, such as principal component analysis and multi-scale shape analysis [8], genetic algorithms and kernel principal component analysis [17], radial basis function neural networks [20], or convolutional neural networks [7]. All of these, and many other models, can reasonably be utilized in 3D gait recognition as well. In the video surveillance environment, data need to be acquired without the walker's consent, and new identities can appear on the fly.
Here, and also in other applications where labels for all encountered people may not always be available, we value features that have a high power in distinguishing all people, not exclusively those they were learned on. We call these walker-independent features. The main idea is to statistically learn in which aspects of walk people generally differ and to extract those aspects as gait features. The features are learned in a supervised manner, as described in the following section.

2 Learning Gait Features

In statistical pattern recognition, reducing space dimensionality is a common technique to overcome class estimation problems. Classes are discriminated by projecting high-dimensional input data onto low-dimensional sub-spaces by linear transformations with the goal of maximizing class separability. We are interested in finding an optimal feature space where a gait template is close to those of the same walker and far from those of different walkers.

Let the model of a human body have $J$ joints and let all samples be linearly normalized to their average length $T$. Labeled learning data in the measurement space $\mathcal{G}_L$ are in the form $\{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ where

$$\mathbf{g}_n = \left[ [\gamma_1(1) \cdots \gamma_J(1)]^\top \cdots [\gamma_1(T) \cdots \gamma_J(T)]^\top \right]^\top \qquad (1)$$

is a gait sample (one gait cycle) in which $\gamma_j(t) \in \mathbb{R}^3$ are the 3D spatial coordinates of a joint $j \in \{1, \ldots, J\}$ at time $t \in \{1, \ldots, T\}$, normalized with respect to the person's position and walk direction. Note that $\mathcal{G}_L$ has dimensionality $D = 3JT$. Each learning sample falls strictly into one of the learning identity classes $\{\mathcal{I}_c\}_{c=1}^{C_L}$ determined by its label $\ell_n$. A class $\mathcal{I}_c \subseteq \mathcal{G}_L$ has $N_c$ samples. The classes are complete and mutually exclusive. We say that learning samples $(\mathbf{g}_n, \ell_n)$ and $(\mathbf{g}_{n'}, \ell_{n'})$ share a common walker if and only if they belong to the same class, i.e., $(\mathbf{g}_n, \ell_n), (\mathbf{g}_{n'}, \ell_{n'}) \in \mathcal{I}_c \Leftrightarrow \ell_n = \ell_{n'}$.

We measure the class separability of a given feature space by a representation of the Maximum Margin Criterion (MMC) [11,13] used by Vapnik's Support Vector Machines (SVM) [19]

$$\mathcal{J} = \frac{1}{2} \sum_{c,c'=1}^{C_L} \left( (\mu_c - \mu_{c'})^\top (\mu_c - \mu_{c'}) - \operatorname{tr}\left(\Sigma_c + \Sigma_{c'}\right) \right) \qquad (2)$$

which is actually a summation of $\frac{1}{2} C_L (C_L - 1)$ between-class margins. The margin is defined as the Euclidean distance of class means minus both individual variances (traces of the scatter matrices $\Sigma_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \big(\mathbf{g}_n^{(c)} - \mu_c\big)\big(\mathbf{g}_n^{(c)} - \mu_c\big)^\top$ and similarly for $\Sigma_{c'}$). For the whole labeled data, we denote the between-class, within-class and total scatter matrices

$$\Sigma_B = \sum_{c=1}^{C_L} (\mu_c - \mu)(\mu_c - \mu)^\top \qquad \Sigma_W = \sum_{c=1}^{C_L} \frac{1}{N_c} \sum_{n=1}^{N_c} \big(\mathbf{g}_n^{(c)} - \mu_c\big)\big(\mathbf{g}_n^{(c)} - \mu_c\big)^\top$$

$$\Sigma_T = \sum_{c=1}^{C_L} \frac{1}{N_c} \sum_{n=1}^{N_c} \big(\mathbf{g}_n^{(c)} - \mu\big)\big(\mathbf{g}_n^{(c)} - \mu\big)^\top = \Sigma_B + \Sigma_W \qquad (3)$$

where $\mathbf{g}_n^{(c)}$ denotes the $n$-th sample in class $\mathcal{I}_c$ and $\mu_c$ and $\mu$ are the sample means of class $\mathcal{I}_c$ and of the whole data set, respectively, that is, $\mu_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{g}_n^{(c)}$ and $\mu = \frac{1}{N_L} \sum_{n=1}^{N_L} \mathbf{g}_n$. Now we obtain

$$\begin{aligned}
\mathcal{J} &= \frac{1}{2} \sum_{c,c'=1}^{C_L} (\mu_c - \mu_{c'})^\top (\mu_c - \mu_{c'}) - \frac{1}{2} \sum_{c,c'=1}^{C_L} \operatorname{tr}\left(\Sigma_c + \Sigma_{c'}\right) \\
&= \frac{1}{2} \sum_{c,c'=1}^{C_L} (\mu_c - \mu + \mu - \mu_{c'})^\top (\mu_c - \mu + \mu - \mu_{c'}) - \sum_{c=1}^{C_L} \operatorname{tr}\left(\Sigma_c\right) \\
&= \operatorname{tr}\left( \sum_{c=1}^{C_L} (\mu_c - \mu)(\mu_c - \mu)^\top \right) - \operatorname{tr}\left( \sum_{c=1}^{C_L} \Sigma_c \right) \\
&= \operatorname{tr}\left(\Sigma_B\right) - \operatorname{tr}\left(\Sigma_W\right) = \operatorname{tr}\left(\Sigma_B - \Sigma_W\right).
\end{aligned} \qquad (4)$$

Since $\operatorname{tr}(\Sigma_B)$ measures the overall variance of the class mean vectors, a large value implies that the class mean vectors scatter in a large space. On the other hand, a small $\operatorname{tr}(\Sigma_W)$ implies that the classes have a small spread. Thus, a large $\mathcal{J}$ indicates that samples are close to each other if they share a common walker but far from each other if they are performed by different walkers.
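For illustration, the separability criterion of Equation (4) can be evaluated on labeled samples without ever forming the $D \times D$ scatter matrices, because only their traces are needed. The following minimal NumPy sketch assumes flattened gait samples stored as rows of an array; the function and variable names are our own and merely illustrate the computation, they are not part of the method's reference implementation:

```python
import numpy as np

def mmc_separability(samples, labels):
    """Class separability J = tr(Sigma_B) - tr(Sigma_W), cf. Eq. (4).

    samples: (N, D) array with one flattened gait sample g_n per row.
    labels:  (N,) array of walker identities (class labels).
    """
    mu = samples.mean(axis=0)
    tr_b = 0.0  # tr(Sigma_B): scatter of class means around the overall mean
    tr_w = 0.0  # tr(Sigma_W): average scatter of samples around their class mean
    for c in np.unique(labels):
        cls = samples[labels == c]
        mu_c = cls.mean(axis=0)
        tr_b += np.sum((mu_c - mu) ** 2)
        tr_w += np.sum((cls - mu_c) ** 2) / len(cls)
    return tr_b - tr_w
```

A larger returned value indicates a space in which samples of the same walker lie closer together and samples of different walkers lie further apart.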
Extracting features, that is, transforming the input data in the measurement space into a feature space of higher $\mathcal{J}$, can be used to link new observations of walkers more successfully.

Feature extraction is given by a linear transformation (feature) matrix $\Phi \in \mathbb{R}^{D \times \widehat{D}}$ from a $D$-dimensional measurement space $\mathcal{G} = \{\mathbf{g}_n\}_{n=1}^{N}$ of not necessarily labeled gait samples to a $\widehat{D}$-dimensional feature space $\widehat{\mathcal{G}} = \{\widehat{\mathbf{g}}_n\}_{n=1}^{N}$ of gait templates, where $\widehat{D} < D$ and each gait sample $\mathbf{g}_n$ is transformed into a gait template $\widehat{\mathbf{g}}_n = \Phi^\top \mathbf{g}_n$. The objective is to learn a transform $\Phi$ that maximizes MMC in the feature space

$$\mathcal{J}(\Phi) = \operatorname{tr}\left( \Phi^\top (\Sigma_B - \Sigma_W) \Phi \right). \qquad (5)$$

Once the transformation is found, all measured samples are transformed into templates (in the feature space) along with the class means and covariances. The templates are compared by the Mahalanobis distance function

$$\widehat{\delta}\left(\widehat{\mathbf{g}}_n, \widehat{\mathbf{g}}_{n'}\right) = \sqrt{\left(\widehat{\mathbf{g}}_n - \widehat{\mathbf{g}}_{n'}\right)^\top \widehat{\Sigma}_T^{-1} \left(\widehat{\mathbf{g}}_n - \widehat{\mathbf{g}}_{n'}\right)}. \qquad (6)$$

We show that a solution to the optimization problem in Equation (5) can be obtained by eigendecomposition of the matrix $\Sigma_B - \Sigma_W$. An important property of the objective $\mathcal{J}(\Phi)$ is that it is invariant w.r.t. rescalings $\Phi \to \alpha\Phi$. Hence, we can always choose $\Phi = \mathbf{f}_1 \| \cdots \| \mathbf{f}_{\widehat{D}}$ such that each column has unit norm, $\mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} = 1$. For this reason we can reduce the problem of maximizing $\mathcal{J}(\Phi)$ to the constrained optimization problem

$$\max \sum_{\widehat{d}=1}^{\widehat{D}} \mathbf{f}_{\widehat{d}}^\top (\Sigma_B - \Sigma_W)\, \mathbf{f}_{\widehat{d}} \quad \text{subject to} \quad \mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} - 1 = 0 \quad \forall\, \widehat{d} = 1, \ldots, \widehat{D}. \qquad (7)$$

To solve the above optimization problem, let us consider the Lagrangian

$$\mathcal{L}\left(\mathbf{f}_{\widehat{d}}, \lambda_{\widehat{d}}\right) = \sum_{\widehat{d}=1}^{\widehat{D}} \mathbf{f}_{\widehat{d}}^\top (\Sigma_B - \Sigma_W)\, \mathbf{f}_{\widehat{d}} - \lambda_{\widehat{d}} \left( \mathbf{f}_{\widehat{d}}^\top \mathbf{f}_{\widehat{d}} - 1 \right) \qquad (8)$$

with multipliers $\lambda_{\widehat{d}}$. To find the maximum, we differentiate with respect to $\mathbf{f}_{\widehat{d}}$ and equate to zero,

$$\frac{\partial \mathcal{L}\left(\mathbf{f}_{\widehat{d}}, \lambda_{\widehat{d}}\right)}{\partial \mathbf{f}_{\widehat{d}}} = \left( (\Sigma_B - \Sigma_W) - \lambda_{\widehat{d}} \mathbf{I} \right) \mathbf{f}_{\widehat{d}} = \mathbf{0}, \qquad (9)$$

which leads to

$$(\Sigma_B - \Sigma_W)\, \mathbf{f}_{\widehat{d}} = \lambda_{\widehat{d}}\, \mathbf{f}_{\widehat{d}} \qquad (10)$$

where $\lambda_{\widehat{d}}$ are the eigenvalues of $\Sigma_B - \Sigma_W$ and $\mathbf{f}_{\widehat{d}}$ are the corresponding eigenvectors. Putting it all together,

$$(\Sigma_B - \Sigma_W)\, \Phi = \Phi \Lambda \qquad (11)$$

where $\Lambda = \operatorname{diag}\left(\lambda_1, \ldots, \lambda_{\widehat{D}}\right)$ is the eigenvalue matrix. Therefore,

$$\mathcal{J}(\Phi) = \operatorname{tr}\left( \Phi^\top (\Sigma_B - \Sigma_W) \Phi \right) = \operatorname{tr}\left( \Phi^\top \Phi \Lambda \right) = \operatorname{tr}(\Lambda) \qquad (12)$$

is maximized when $\Lambda$ contains the $\widehat{D}$ largest eigenvalues and $\Phi$ contains the corresponding leading eigenvectors.

In the following we discuss how to calculate the eigenvectors of $\Sigma_B - \Sigma_W$ and how to determine an optimal dimensionality $\widehat{D}$ of the feature space. Rewrite $\Sigma_B - \Sigma_W = 2\Sigma_B - \Sigma_T$. Note that the null space of $\Sigma_T$ is a subspace of that of $\Sigma_B$, since the null space of $\Sigma_T$ is the common null space of $\Sigma_B$ and $\Sigma_W$. Thus, we can simultaneously diagonalize $\Sigma_B$ and $\Sigma_T$ to some $\Delta$ and $\mathbf{I}$,

$$\Psi^\top \Sigma_B \Psi = \Delta \qquad \Psi^\top \Sigma_T \Psi = \mathbf{I} \qquad (13)$$

with the $D \times \operatorname{rank}(\Sigma_T)$ eigenvector matrix

$$\Psi = \Omega \Theta^{-1/2} \Xi \qquad (14)$$

where $\Omega$ and $\Theta$ are the eigenvector and eigenvalue matrices of $\Sigma_T$, respectively, and $\Xi$ is the eigenvector matrix of $\Theta^{-1/2} \Omega^\top \Sigma_B \Omega \Theta^{-1/2}$. To calculate $\Psi$, we use a fast two-step algorithm based on Singular Value Decomposition (SVD). SVD expresses a real $r \times s$ matrix $\mathbf{A}$ as a product $\mathbf{A} = \mathbf{U} \mathbf{S} \mathbf{V}^\top$, where $\mathbf{S}$ is a diagonal matrix with decreasing non-negative entries, $\mathbf{U}$ and $\mathbf{V}$ are $r \times \min\{r,s\}$ and $s \times \min\{r,s\}$ eigenvector matrices of $\mathbf{A}\mathbf{A}^\top$ and $\mathbf{A}^\top\mathbf{A}$, respectively, and the non-vanishing entries of $\mathbf{S}$ are the square roots of the non-zero eigenvalues of both $\mathbf{A}\mathbf{A}^\top$ and $\mathbf{A}^\top\mathbf{A}$. Note that $\Sigma_T$ and $\Sigma_B$ can be expressed in the forms

$$\Sigma_T = \mathbf{X}\mathbf{X}^\top \quad \text{where} \quad \mathbf{X} = \tfrac{1}{\sqrt{N_L}} \left[ (\mathbf{g}_1 - \mu) \cdots (\mathbf{g}_{N_L} - \mu) \right]$$

$$\text{and} \quad \Sigma_B = \Upsilon\Upsilon^\top \quad \text{where} \quad \Upsilon = \left[ (\mu_1 - \mu) \cdots (\mu_{C_L} - \mu) \right], \qquad (15)$$

respectively.
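The eigendecomposition route of Equations (10)–(12) and the template comparison of Equation (6) can be sketched directly. This is an illustrative sketch under our own naming, assuming the scatter matrices of Equation (3) have already been computed; the SVD-based Algorithm 1 below reaches the same transform more efficiently when $D = 3JT$ is large:

```python
import numpy as np

def learn_transform_eig(sigma_b, sigma_w, d_hat):
    """Phi maximizing tr(Phi^T (Sigma_B - Sigma_W) Phi), cf. Eqs. (10)-(12).

    Naive route: eigendecompose the symmetric matrix Sigma_B - Sigma_W and
    keep the d_hat leading eigenvectors as the columns of Phi.
    """
    eigvals, eigvecs = np.linalg.eigh(sigma_b - sigma_w)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                     # descending order
    return eigvecs[:, order[:d_hat]]

def template_distance(phi, sigma_t, g_a, g_b):
    """Mahalanobis distance of Eq. (6) between two raw samples after projection."""
    sigma_t_hat = phi.T @ sigma_t @ phi     # total scatter in the feature space
    diff = phi.T @ (g_a - g_b)              # difference of the two templates
    return float(np.sqrt(diff @ np.linalg.solve(sigma_t_hat, diff)))
```

In practice one would project every sample with the learned Φ once and reuse the inverse of the projected total scatter for all subsequent comparisons.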
From these factorizations, we can obtain the eigenvectors $\Omega$ and the corresponding eigenvalues $\Theta$ of $\Sigma_T$ through the SVD of $\mathbf{X}$, and analogously $\Xi$ of $\Theta^{-1/2} \Omega^\top \Sigma_B \Omega \Theta^{-1/2}$ through the SVD of $\Theta^{-1/2} \Omega^\top \Upsilon$. The columns of $\Psi$ are clearly the eigenvectors of $2\Sigma_B - \Sigma_T$ with the corresponding eigenvalues $2\Delta - \mathbf{I}$. Therefore, to constitute the transform $\Phi$ by maximizing the MMC, we should choose the eigenvectors in $\Psi$ that correspond to the eigenvalues of at least $\frac{1}{2}$ in $\Delta$. Note that $\Delta$ contains at most $\operatorname{rank}(\Sigma_B) = C_L - 1$ positive eigenvalues, which gives an upper bound on the feature space dimensionality $\widehat{D}$. Algorithm 1 [5], provided below, is an efficient way of learning the transform $\Phi$ for MMC on given labeled learning data $\mathcal{G}_L$.

Algorithm 1 LearnTransformationMatrixMMC($\mathcal{G}_L$)
1: split $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ into classes $\{\mathcal{I}_c\}_{c=1}^{C_L}$ of $N_c = |\mathcal{I}_c|$ samples
2: compute the overall mean $\mu = \frac{1}{N_L} \sum_{n=1}^{N_L} \mathbf{g}_n$ and the individual class means $\mu_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \mathbf{g}_n^{(c)}$
3: compute $\Sigma_B = \sum_{c=1}^{C_L} (\mu_c - \mu)(\mu_c - \mu)^\top$
4: compute $\mathbf{X} = \frac{1}{\sqrt{N_L}} \left[ (\mathbf{g}_1 - \mu) \cdots (\mathbf{g}_{N_L} - \mu) \right]$
5: compute $\Upsilon = \left[ (\mu_1 - \mu) \cdots (\mu_{C_L} - \mu) \right]$
6: compute the eigenvectors $\Omega$ and corresponding eigenvalues $\Theta$ of $\Sigma_T$ through the SVD of $\mathbf{X}$
7: compute the eigenvectors $\Xi$ of $\Theta^{-1/2} \Omega^\top \Sigma_B \Omega \Theta^{-1/2}$ through the SVD of $\Theta^{-1/2} \Omega^\top \Upsilon$
8: compute the eigenvectors $\Psi = \Omega \Theta^{-1/2} \Xi$
9: compute the eigenvalues $\Delta = \Psi^\top \Sigma_B \Psi$
10: return the transform $\Phi$ as the eigenvectors in $\Psi$ that correspond to the eigenvalues of at least $1/2$ in $\Delta$

3 Experiments and Results

3.1 Database

For evaluation purposes we have extracted a large number of samples from the general CMU MoCap database [9], a well-known and recognized database of structural human motion data. It contains numerous motion sequences, including a considerable number of gait sequences. Motions are recorded with an optical marker-based Vicon system. People wear a black jumpsuit and have 41 markers taped on. The tracking space of 30 m², surrounded by 12 cameras with a sampling rate of 120 Hz placed at a height of 2 to 4 meters above the ground, creates a video surveillance environment. Motion videos are triangulated to obtain highly accurate 3D data in the form of relative body point coordinates (with respect to the root joint) in each video frame, stored in the standard ASF/AMC data format. Each registered participant is assigned their respective skeleton, described in an ASF file. Motions in the AMC files store bone rotational data, which are interpreted as instructions on how the associated skeleton deforms over time.

These MoCap data, however, contain skeleton parameters pre-calibrated by the CMU staff. Skeletons are unique to each walker, and even a trivial skeleton check could result in 100% recognition. In order to use the collected data in a fair manner, a prototypical skeleton is constructed and used to represent the bodies of all subjects, concealing the unique skeleton parameters of individual walkers. Assuming that all walking subjects are physically identical disables the skeleton check as a potentially unfair classifier. Moreover, this is a skeleton-robust solution, as all bone rotational data are linked with a fixed skeleton. To obtain realistic parameters, the prototypical skeleton is calculated as the mean of all skeletons in the provided ASF files. We calculate 3D joint coordinates using the bone rotational data and the prototypical skeleton. One cannot directly use the raw values of joint coordinates, as they refer to absolute positions in the tracking space, and not all potential methods are invariant to the person's position or walk direction.
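As a rough illustration of this step, joint coordinates can be recovered from per-frame bone rotations and a fixed skeleton by composing rotations from the root towards the limbs. The sketch below is schematic only: the joint layout, function names and rotation conventions are our own simplifications, and real ASF/AMC processing additionally handles per-bone axes and degrees of freedom.

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Rotation from Euler angles in radians (applied in X, Y, Z order)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rot_z @ rot_y @ rot_x

def joint_positions(skeleton, rotations):
    """Forward kinematics for one frame.

    skeleton:  {joint: (parent or None, bone offset as a length-3 sequence)},
               listed with every parent before its children (the prototypical
               skeleton is the same for all subjects).
    rotations: {joint: (rx, ry, rz)} bone rotations of the current frame.
    Returns    {joint: 3D position} with the root placed at the origin, i.e.
               coordinates relative to the root joint.
    """
    positions, global_rot = {}, {}
    for joint, (parent, offset) in skeleton.items():
        local = rotation_matrix(*rotations.get(joint, (0.0, 0.0, 0.0)))
        if parent is None:                       # root joint
            global_rot[joint] = local
            positions[joint] = np.zeros(3)
        else:
            global_rot[joint] = global_rot[parent] @ local
            positions[joint] = positions[parent] + global_rot[parent] @ np.asarray(offset, float)
    return positions
```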
To ensure invariance to the person's position and walk direction, the center of the coordinate system is moved to the position of the root joint, $\gamma_{\mathrm{root}}(t) = [0, 0, 0]^\top$ for each time $t$, and the axes are adjusted to the walker's perspective: the X axis points from right (negative) to left (positive), the Y axis from down (negative) to up (positive), and the Z axis from back (negative) to front (positive). In the AMC file structure notation this is achieved by zeroing the root translation and rotation (root 0 0 0 0 0 0) in all frames of all motion sequences.

Since the general motion database contains all motion types, we extracted a number of sub-motions that represent gait cycles. First, an exemplary gait cycle was identified, and clean gait cycles were then filtered out using the DTW distance over bone rotations. The similarity threshold was set high enough so that even the least similar sub-motion still semantically represents a gait cycle. Finally, subjects that contributed fewer than 10 samples were excluded. The final database has 54 walking subjects that performed 3,843 samples in total, which makes an average of about 71 samples per subject.

3.2 Evaluation Setups and Metrics

Learning data $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ of $C_L$ identities and evaluation data $\mathcal{G}_E = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_E}$ of $C_E$ identity classes have to be disjoint at all times. Evaluation is performed exclusively on the evaluation part, taking no observations of the learning part into account. In the following we introduce two setups of data separation: homogeneous and heterogeneous. The homogeneous setup learns the transformation matrix on one third of the samples of $C_L$ identities and is evaluated on templates derived from the other two thirds of the samples of the same $C_E = C_L$ identities. The heterogeneous setup learns the transform on all samples of $C_L$ identities and is evaluated on all templates derived from other $C_E$ identities. For clarification we refer to Figure 2. Note that, unlike in the homogeneous setup, in the heterogeneous setup no walker identity is ever used for both learning and evaluation at the same time.

Figure 2. Abstraction of data separation for the homogeneous setup of $C_L = C_E = 3$ learning-and-evaluation classes (left) and for the heterogeneous setup of $C_L = 2$ learning classes and $C_E = 4$ evaluation classes (right). The black square represents a database and the ellipses are identity classes.

The homogeneous setup is parametrized by a single number $C_L = C_E$ of learning-and-evaluation identity classes, whereas the heterogeneous setup has the form $(C_L, C_E)$, specifying how many learning and how many evaluation identity classes are randomly selected from the database. Evaluation of each setup is repeated 3 times, selecting new random $C_L$ and $C_E$ identity classes each time and reporting the average result. Please note that in the heterogeneous setup the learning identities are disjoint from the evaluation identities, that is, there is no single identity used for both learning and evaluation.

Correct Classification Rate (CCR) is a standard qualitative measure; however, if a method has a low CCR, we cannot directly say whether the system is failing because of bad features or a bad classifier. Providing an evaluation in terms of the class separability of the feature space gives an estimate of the recognition potential of the extracted features and does not reflect an eventual combination with an unsuitable classifier.
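As one way to picture the data separation just described, the heterogeneous setup can be obtained by drawing two disjoint sets of identities at random; in the homogeneous setup the same identities appear on both sides and only their samples are split 1/3 vs. 2/3. The following is a minimal sketch under our own naming, not the paper's evaluation code:

```python
import numpy as np

def heterogeneous_split(labels, c_l, c_e, seed=0):
    """Draw disjoint learning and evaluation identity sets of sizes (C_L, C_E).

    labels: (N,) array with the walker identity of every sample in the database.
    Returns boolean masks (learn_mask, eval_mask) over the samples; no identity
    is ever used for both learning and evaluation.
    """
    rng = np.random.default_rng(seed)
    identities = rng.permutation(np.unique(labels))
    learn_ids = identities[:c_l]
    eval_ids = identities[c_l:c_l + c_e]
    return np.isin(labels, learn_ids), np.isin(labels, eval_ids)
```

Repeating the draw several times and averaging, as in the experiments below, reduces the dependence on which identities happen to be selected.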
The quality of feature extraction algorithms is reflected in the Davies-Bouldin Index (DBI)

$$\mathrm{DBI} = \frac{1}{C_E} \sum_{c=1}^{C_E} \max_{1 \le c' \le C_E,\; c' \ne c} \frac{\sigma_c + \sigma_{c'}}{\widehat{\delta}\left(\widehat{\mu}_c, \widehat{\mu}_{c'}\right)} \qquad (16)$$

where $\sigma_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \widehat{\delta}\left(\widehat{\mathbf{g}}_n, \widehat{\mu}_c\right)$ is the average distance of all templates in identity class $\mathcal{I}_c$ to its centroid, and similarly for $\sigma_{c'}$. Templates of low intra-class distances and high inter-class distances have a low DBI. DBI is measured on the full evaluation part, whereas CCR is estimated with 10-fold cross-validation, taking one fold with labels withheld as the testing set and the other nine as the gallery. Test templates are classified by the winner-takes-all strategy, in which a test template $\widehat{\mathbf{g}}_{\mathrm{test}}$ gets assigned the label $\arg\min_i \widehat{\delta}\big(\widehat{\mathbf{g}}_{\mathrm{test}}, \widehat{\mathbf{g}}_i^{\mathrm{gallery}}\big)$ of the gallery's closest identity class.

Based on Section 3.1, our database has 54 identity classes in total. We performed the series of experiments A, B, C, D below. Experiments A and B compare the homogeneous and heterogeneous setups, whereas C and D examine how the performance of the system in the heterogeneous setup improves with an increasing number of learning identities. The results are illustrated in Figure 3 and in Figure 4 in the next section.

A homogeneous setup with $C_L = C_E \in \{2, \ldots, 27\}$;
B heterogeneous setup with $C_L = C_E \in \{2, \ldots, 27\}$;
C heterogeneous setup with $C_L \in \{2, \ldots, 27\}$ and $C_E = 27$;
D heterogeneous setup with $C_L \in \{2, \ldots, 52\}$ and $C_E = 54 - C_L$.

3.3 Results

Experiments A and B compare the homogeneous and heterogeneous setups by measuring the drop in performance on an identical number of learning and evaluation identities ($C_L = C_E$). The top plot in Figure 3 shows the measured values of the DBI and CCR metrics in both alternatives, which not only appear comparable, but in some configurations the heterogeneous setup even has a higher CCR. The bottom plot expresses the heterogeneous setup as a percentage of the homogeneous setup in each of the particular metrics. Here we see that with a rising number of identities the heterogeneous setup approaches 100% of the fully homogeneous alternative.

Figure 3. DBI (left vertical axis) and CCR (right vertical axis) for experiment A of the homogeneous setup and experiment B of the heterogeneous setup (top) with $(C_L, C_E)$ configurations (horizontal axes), and their percentages (bottom).

Experiments C and D investigate the impact of the number of learning identities in the heterogeneous setup. As Figure 4 shows, the performance grows quickly on the first configurations with very few learning identities, which we can interpret as an analogy to the Pareto (80–20) principle. Specifically, the results of experiment C show that 8 learning identities achieve almost the same performance (66.78 DBI and 0.902 CCR) as learning on 27 identities (68.32 DBI and 0.947 CCR).
The outcome of experiment D indicates a similar growth in performance, and we see that as few as 14 identities can be enough to learn a transformation matrix that distinguishes 40 completely different people (0.904 CCR).

Figure 4. DBI (left vertical axes) and CCR (right vertical axes) for experiments C (top) and D (bottom) on the heterogeneous setup with $(C_L, C_E)$ configurations (horizontal axes).

The proposed method and seven other state-of-the-art methods [2,4,6,10,12,14,16] have been subjected to extensive simulations on the homogeneous setup in our recent research paper [5]. A variety of class-separability coefficients and classification metrics allows insights from different statistical perspectives. The results indicate that the proposed method is a leading concept for rank-based classifier systems: lowest Davies-Bouldin Index, highest Dunn Index, highest (and exclusively positive) Silhouette Coefficient, second highest Fisher's Discriminant Ratio and, combined with a rank-based classifier, the best Cumulative Match Characteristic, False Accept Rate and False Reject Rate trade-off, Receiver Operating Characteristic (ROC) and recall-precision trade-off scores, along with Correct Classification Rate, Equal Error Rate, Area Under ROC Curve and Mean Average Precision. We interpret the high scores as a sign of robustness. Apart from performance merits, the MMC method is also efficient: low-dimensional templates ($\widehat{D} \le C_L - 1 = C_E - 1 = 53$) and the Mahalanobis distance ensure fast distance computations and thus contribute to high scalability.

4 Conclusions

Despite the many advanced optimization techniques used in statistical pattern recognition, a common practice of state-of-the-art MoCap-based human identification is still to design geometric gait features by hand. As the first contribution of this paper, the proposed method does not involve any ad-hoc features; on the contrary, they are computed from a much larger space beyond the limits of human interpretability. The features are learned directly from raw joint coordinates by a modification of Fisher's LDA with MMC so that the identities are maximally separated. We believe that MMC is a suitable criterion for optimizing gait features; however, our future work will continue with research on further potential optimality criteria and machine learning approaches. Furthermore, we are in the process of developing an evaluation framework with implementation details and source code of all related methods, a data extraction drive from the general CMU MoCap database, and the evaluation mechanism, to support reproducible research. The second contribution lies in showing the possibility of building a representation on one problem and using it on another (related) problem.
Simulations on the CMU MoCap database show that our approach is able to build robust feature spaces without pre-registering and labeling all potential walkers. In fact, the features can be learned on different people (experiments A and B) and on just a fraction of them (experiments C and D). We have observed that with an increasing volume of identities the heterogeneous evaluation setup is on par with the homogeneous setup; that is, it does not matter which identities we learn the features on. One does not have to rely on the availability of all walkers for learning. This is particularly important for a system aiding video surveillance applications, where encountered walkers never supply labeled data. Multiple occurrences of individual walkers can now be linked together even without knowing their actual identities.

Acknowledgments

The authors thank the anonymous reviewers for their detailed commentary and suggestions. The data used in this project were created with funding from NSF EIA-0196217 and were obtained from http://mocap.cs.cmu.edu. Our extracted database is available at https://gait.fi.muni.cz/ to support reproducibility of results.

References

1. Ahmed, F., Paul, P.P., Gavrilova, M.L.: DTW-Based Kernel and Rank-Level Fusion for 3D Gait Recognition Using Kinect. The Visual Computer 31(6), 915–924 (2015), https://doi.org/10.1007/s00371-015-1092-0
2. Ahmed, M., Al-Jawad, N., Sabir, A.: Gait Recognition Based on Kinect Sensor. Proc. SPIE 9139, 91390B–91390B–10 (2014), http://dx.doi.org/10.1117/12.2052588
3. Ali, S., Wu, Z., Li, X., Saeed, N., Wang, D., Zhou, M.: Transactions on Computational Science XXVI: Special Issue on Cyberworlds and Cybersecurity, chap. Applying Geometric Function on Sensors 3D Gait Data for Human Identification, pp. 125–141. Springer, Berlin, Heidelberg (2016), https://doi.org/10.1007/978-3-662-49247-5_8
4. Andersson, V.O., Araujo, R.M.: Person Identification Using Anthropometric and Gait Data from Kinect Sensor. In: Proc. of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15). pp. 425–431. AAAI Press (2015), http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9680
5. Balazia, M., Sojka, P.: Learning Robust Features for Gait Recognition by Maximum Margin Criterion. In: Proc. of 23rd International Conference on Pattern Recognition, ICPR 2016. pp. 901–906. IEEE (Dec 2016)
6. Ball, A., Rye, D., Ramos, F., Velonaki, M.: Unsupervised Clustering of People from 'Skeleton' Data. In: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction. pp. 225–226. HRI '12, ACM, New York, NY, USA (2012), http://doi.acm.org/10.1145/2157689.2157767
7. Castro, F.M., Marín-Jiménez, M.J., Guil, N., de la Blanca, N.P.: Automatic Learning of Gait Signatures for People Identification. CoRR abs/1603.01006 (2016), https://arxiv.org/abs/1603.01006
8. Choudhury, S.D., Tjahjadi, T.: Robust View-Invariant Multiscale Gait Recognition. Pattern Recognition 48(3), 798–811 (2015), http://www.sciencedirect.com/science/article/pii/S0031320314003835
9. CMU Graphics Lab: Carnegie-Mellon Motion Capture (MoCap) Database (2003), http://mocap.cs.cmu.edu
10. Dikovski, B., Madjarov, G., Gjorgjevikj, D.: Evaluation of Different Feature Sets for Gait Recognition Using Skeletal Data from Kinect. In: 37th Intl. Convention on Information and Communication Technology, Electronics and Microelectronics. pp. 1304–1308 (May 2014), http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6859769
11. Kocsor, A., Kovács, K., Szepesvári, C.: Margin Maximizing Discriminant Analysis, pp.
227–238. Springer Berlin Heidelberg, Berlin, Heidelberg (2004), https://doi.org/10.1007/978-3-540-30115-8_23
12. Kwolek, B., Krzeszowski, T., Michalczuk, A., Josinski, H.: 3D Gait Recognition Using Spatio-Temporal Motion Descriptors. In: Proc. of Intelligent Information and Database Systems: 6th Asian Conference, ACIIDS 2014, Bangkok, Thailand, Part II. LNCS, vol. 8398, pp. 595–604. Springer (Apr 2014), https://doi.org/10.1007/978-3-319-05458-2_61
13. Li, H., Jiang, T., Zhang, K.: Efficient and Robust Feature Extraction by Maximum Margin Criterion. IEEE Transactions on Neural Networks 17(1), 157–165 (Jan 2006), https://doi.org/10.1109/TNN.2005.860852
14. Preis, J., Kessel, M., Werner, M., Linnhoff-Popien, C.: Gait Recognition with Kinect. In: 1st International Workshop on Kinect in Pervasive Computing, New Castle, UK, June 18–22. pp. 1–4 (2012), https://www.researchgate.net/publication/239862819_Gait_Recognition_with_Kinect
15. Ramu Reddy, V., Chakravarty, K., Aniruddha, S.: Person Identification in Natural Static Postures Using Kinect. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) Computer Vision – ECCV 2014 Workshops: Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part II. LNCS, vol. 8926, pp. 793–808. Springer (2015), https://doi.org/10.1007/978-3-319-16181-5_60
16. Sinha, A., Chakravarty, K., Bhowmick, B.: Person Identification Using Skeleton Information from Kinect. In: ACHI 2013: Proc. of the Sixth Intl. Conf. on Advances in CHI. pp. 101–108 (2013), https://www.thinkmind.org/index.php?view=article&articleid=achi_2013_4_50_20187
17. Tafazzoli, F., Bebis, G., Louis, S.J., Hussain, M.: Genetic Feature Selection for Gait Recognition. J. of Electron. Imaging 24(1), 013036 (Feb 2015), https://doi.org/10.1117/1.JEI.24.1.013036
18. Valcik, J., Sedmidubsky, J., Zezula, P.: Assessing Similarity Models for Human-Motion Retrieval Applications. Computer Animation and Virtual Worlds 27(5), 484–500 (2016), http://dx.doi.org/10.1002/cav.1674
19. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA (1995)
20. Zeng, W., Wang, C.: View-Invariant Gait Recognition via Deterministic Learning. In: International Joint Conference on Neural Networks (IJCNN). pp. 3465–3472 (Jul 2014), https://doi.org/10.1109/IJCNN.2014.6889507