3rd IEEE/IAPR International Joint Conference on Biometrics 2017, preprint

You Are How You Walk: Uncooperative MoCap Gait Identification for Video Surveillance with Incomplete and Noisy Data

Michal Balazia <xbalazia@mail.muni.cz> and Petr Sojka <sojka@fi.muni.cz>
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic

Abstract

This work offers a design of a video surveillance system based on a soft biometric – gait identification from MoCap data. The main focus is on two substantial issues of the video surveillance scenario: (1) the walkers do not cooperate in providing learning data to establish their identities, and (2) the data are often noisy or incomplete. We show that only a few examples of human gait cycles are required to learn a projection of raw MoCap data onto a low-dimensional subspace where the identities are well separable. Latent features learned by the Maximum Margin Criterion (MMC) method discriminate better than any collection of geometric features. The MMC method is also highly robust to noisy data and works properly even with only a fraction of joints tracked. The overall workflow of the design is directly applicable for day-to-day operation based on the available MoCap technology and algorithms for gait analysis. In the concept we introduce, a walker's identity is represented by a cluster of gait data collected at their incidents within the surveillance system: They are how they walk.

1. Introduction

Public safety issues are constantly evolving, and security monitoring agencies are facing more challenges than ever before. Offering a range of security systems indispensable to investigators, the surveillance industry appears to be at the beginning of a massive expansion. Video surveillance technology records video footage for the potential future identification of suspicious individuals and activities. Many public places, such as banks and airports, already have surveillance cameras installed, but these require intelligent approaches to human identification. A useful early-warning system would analyze the collected video footage and release an alert before an adverse event takes place. Triggered by the detection of an abnormal behavior, the system would instantly identify all participants in the scene, rapidly investigate their previous activities, and launch the tracking of the suspects.

A typical video surveillance environment has some crucial characteristics that need to be taken into account when designing an identification system. Data are captured by a system of video cameras covering a large tracking space. People walk in various directions, at various speeds, and are often in crowds. They wear a variety of clothes and shoes and often carry large objects. Since the observed people do not cooperate, a learning model is only available in an externally annotated database. A central database stores hundreds of subject identities that are encountered repeatedly, each contributing multiple biometric samples. Identification has to be performed in real time, that is, in a few seconds. Tracking and identification have to be automatic, as operator interventions are slow and costly.

Gait (walk) pattern has several attractive properties as a soft biometric trait. From the surveillance perspective, gait pattern biometrics is appealing for the possibility of being performed at a distance and without body-invasive equipment or subject cooperation. This enables sample acquisition even without a subject's consent.
The goal of this work is to design a method for identifying individuals in video footage from their gait patterns. Uncooperative gait identification has been addressed by Martín-Félez and Xiang [29] by casting gait identification as a bipartite ranking problem. Their model learns a ranking function in a higher-dimensional space where true matches and wrong matches become more separable than in the original space. The learned function scores a pair of gait templates higher if they belong to the same person than if they belong to different people. Chen and Xu [11] extend the bipartite ranking approach by integrating sparse-coding and multi-view hypergraph-based re-ranking, a framework they call the sparse coding multi-view hypergraph learning re-ranking method.

2. Identification Workflow

In accordance with the outlined video surveillance environment, our human identification system has the following 4-phase workflow (see Figure 1).

[Figure 1: Video surveillance workflow. A person is (I) captured on an RGB-D camera in the form of MoCap data, (II) gait is then detected to form a gait sample, from which (III) a gait template is extracted and (IV) the person is identified.]

Phase I – Acquiring Motion Capture Data

Motion capture (MoCap) technology acquires video clips of people and derives structural motion data. The format maintains an overall structure of the human body and holds the estimated 3D positions of the main anatomical landmarks as the person moves. MoCap data can be collected by RGB-D sensors such as Microsoft Kinect and Asus Xtion, or by marker-based optical systems such as Vicon. For a schematic visualization, a simplified stick figure representing the human skeleton (a graph of joints connected by bones) can be automatically recovered from the values of body point spatial coordinates. With recent rapid improvements in sensor technology and pose estimation techniques [15, 18], an accurate and affordable MoCap system [17] is at our disposal to aid gait identification for applications in video surveillance.

Phase II – Detecting Gait Cycles

People spotted in our tracking space do not walk all the time; on the contrary, they perform various activities. Identifying people from gait requires processing the video segments where they are actually walking, so gait cycles first need to be filtered out of the sequences of general motion. There are methods [5, 33] for detecting gait cycles directly, as well as action recognition methods [12, 19, 21, 25, 34] that need a demonstrative example of a gait cycle to query general motion sequences.
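As a concrete illustration of this step, the sketch below detects cycles from the periodicity of the inter-ankle distance. It assumes MoCap samples stored as a frames × joints × 3 array with known ankle-joint indices and a hypothetical frame rate; it is a simple stand-in for illustration, not one of the detectors of [5, 33].

```python
import numpy as np
from scipy.signal import find_peaks

def detect_gait_cycles(poses, l_ankle, r_ankle, fps=30):
    """Split a walking sequence into gait cycles.

    poses            -- (frames, joints, 3) array of 3D joint positions
    l_ankle, r_ankle -- indices of the left and right ankle joints
    fps              -- frame rate of the MoCap stream
    Returns a list of (start, end) frame ranges, one per full cycle.
    """
    # The inter-ankle distance peaks at every step, i.e., twice per
    # gait cycle (once for each stride).
    dist = np.linalg.norm(poses[:, l_ankle] - poses[:, r_ankle], axis=1)
    # Assume a step takes at least ~0.3 s; enforce that as the minimal
    # spacing between detected peaks.
    steps, _ = find_peaks(dist, distance=int(0.3 * fps))
    # Two consecutive steps form one full gait cycle.
    return [(steps[i], steps[i + 2]) for i in range(len(steps) - 2)]
```

Each returned frame range would then be cut out of the sequence and passed on to Phase III as one gait sample.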
Phase III – Extracting Gait Features

Once a query motion clip has been cut into clean gait cycles, the identification mechanism proceeds with transforming the sample of raw MoCap data into a representation that contains discriminative gait information. A collection of extracted gait features builds a gait template, which serves as the walker's signature. But since walkers in a video surveillance environment cannot be relied upon to cooperate, we are left with the problem of obtaining highly discriminative gait features for the walkers without a labeled learning dataset containing their very own samples.

Many research groups investigate the discriminatory power of geometric gait features designed by hand and without any statistical learning. They typically combine static body parameters (bone lengths, person's height) with dynamic gait features, such as step length, walk speed, joint angles and inter-joint distances, along with various statistics (mean, standard deviation or maximum) of their signals. These are, in particular, the horizontal and vertical distances of selected joint pairs by Ahmed et al. [2], lower limb triangles by Ali et al. [3], lower body angles, step length, cycle time and velocity by Andersson and Araujo [4], lower limb (hips, knees and ankles) angles by Ball et al. [10], axis rotations of the major bones by Kwolek et al. [24], and eleven static body parameters together with step length and walk speed by Preis et al. [30]. Dikovski et al. [14] construct seven different feature sets from a broad spectrum of geometric features, such as static body parameters, joint angles and inter-joint distances aggregated within a gait cycle, along with various statistics. Sinha et al. [32] combine areas of the upper and lower body and inter-joint distances with all of the features introduced by Ball et al. [10] and Preis et al. [30]. These features are convenient for visualizations and for intuitive understanding, but their schematic nature and human interpretability are unnecessary for automatic identification.

Instead, we [6, 7] prefer to learn the features statistically on an auxiliary labeled database, with the goal of maximally separating the identity classes in the feature space, and to use these features to identify all potential walkers. Our linear models are learned in a supervised manner through (1) a modification of Fisher's Linear Discriminant Analysis [16] with the Maximum Margin Criterion (MMC) and (2) a combination of Principal Component Analysis and Linear Discriminant Analysis (PCA+LDA) to project the high-dimensional input data onto low-dimensional sub-spaces. The similarity of templates is expressed by the Mahalanobis distance function. Both these and the non-learning approaches create an unsupervised environment suitable for searching for similar templates and for clustering templates into potential identities, which is the main focus of the application outlined in the following phase.
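To make the learning step concrete, below is a minimal sketch of feature learning by Maximum Margin Criterion on vectorized gait samples. It follows the standard MMC formulation (maximize the trace of S_b − S_w); the published method [6] operates on tensorial representations and differs in details, so treat this as a simplification under stated assumptions rather than the published algorithm.

```python
import numpy as np

def learn_mmc(X, y, dim):
    """Learn an MMC projection from an auxiliary labeled database.

    X   -- (N, D) matrix, each row a vectorized gait sample
    y   -- (N,) identity labels of the learning subjects
    dim -- dimension of the target feature space
    Returns a (D, dim) projection matrix W; templates are X @ W.
    """
    D = X.shape[1]
    mu = X.mean(axis=0)
    S_b = np.zeros((D, D))  # between-class scatter
    S_w = np.zeros((D, D))  # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)
    # Maximum Margin Criterion: maximize tr(W^T (S_b - S_w) W), solved
    # by the top eigenvectors of S_b - S_w. Unlike Fisher's LDA, no
    # inversion of S_w is needed, so a singular S_w is not a problem.
    vals, vecs = np.linalg.eigh(S_b - S_w)
    return vecs[:, np.argsort(vals)[::-1][:dim]]
```

Once W is learned from external identities, a template of any walker is obtained as X @ W without any labeled samples of that walker, which is exactly what the uncooperative setting requires.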
Phase IV – Identifying Walkers

Identification is most commonly formulated as a classification problem: a walker's identity is established (classified) by picking one from the pool of registered identities. This model is suitable for applications where participants reveal their identities at registration (closed-set identification). During video surveillance, on the other hand, new identities can appear on the fly (open-set identification), and labeled data for all the people encountered may not always be available. Nobody is claiming their identity, since people are recorded without their consent. Needless to say, this task requires a different understanding of what a person's identity is. You are how you walk. Your identity is your gait pattern itself. Instead of classifying walker identities as names or numbers, which are not available in any case, a forensic investigator rather asks for information about their appearances captured by the surveillance system – their location trace (see Figure 2), which includes a timestamp and geolocation for each appearance. In the suggested application, walkers are therefore clustered rather than classified.

Identification is carried out as a query-by-example: a similarity search query ranks recorded gait templates on the basis of their similarity to the query template, which represents their likelihood of belonging to the same person. A decision mechanism operates on the basis of clustering or thresholding to determine which of the templates are finally accepted to establish the location trace. Although one can allow more templates to be accepted, doing so will enlarge the location trace at the risk of falsely accepting some templates of another person.

[Figure 2: Illustration of a person's location trace. Each point represents a gait template and contains additional information about the place and time of the corresponding incident within the surveillance system. Given a query template (green dot), the retrieved cluster of templates (red dots) is expected to contain other templates of the same person. Map obtained from https://www.openstreetmap.org.]
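A minimal sketch of the thresholding variant of this decision mechanism follows. The function and its parameters (gallery, records, threshold) are illustrative assumptions, not part of the published system, and the plain Euclidean metric stands in for the Mahalanobis metric of Phase III (which amounts to whitening the templates first).

```python
import numpy as np

def location_trace(query, gallery, records, threshold):
    """Retrieve a walker's location trace by query-by-example.

    query     -- (d,) gait template of the spotted walker
    gallery   -- (M, d) gait templates recorded by the system
    records   -- list of M (timestamp, geolocation) tuples
    threshold -- acceptance distance; raising it enlarges the trace
                 at the risk of falsely accepting other people
    Returns the accepted records, ranked by similarity to the query.
    """
    dists = np.linalg.norm(gallery - query, axis=1)
    ranking = np.argsort(dists)  # most similar templates first
    return [records[i] for i in ranking if dists[i] <= threshold]
```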
3. Evaluation

We have based our evaluation on the gait recognition framework [8]. The evaluation focuses on Phase III of the workflow in Section 2. We investigate the impact of externalizing the learning identities as a resolution of walker uncooperativeness by evaluating the discriminativeness of feature spaces learned from varying amounts of separate learning data. One also desires a system whose model does not collapse even if data are mapped wrongly due to inaccuracies or failures of the data acquisition technology, for which we evaluate robustness to incomplete and noisy data. The final stage is the evaluation of the clusterability of the individual feature spaces for potential data pre-processing.

We implemented and evaluated all competitive MoCap gait identification methods [2, 3, 4, 6, 10, 14, 24, 30, 32]. The state of the art includes additional methods [1, 20, 22, 23, 31], which we have implemented but not evaluated due to the high demands they place on computational resources.

For the purpose of evaluation, we selected the MoCap database of the CMU Graphics Lab [13], which is available under the Creative Commons license. Normalization and extraction of gait cycles from this database is described in [8] and is available for download at [9]. The gait data hold $C = 64$ walking subjects that performed $N = 5\,923$ samples in total, an average of about 93 samples per subject. Data in the measurement (sample) space have the form $\mathcal{G} = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N}$, where $\mathbf{g}_n$ is a tensorial representation of a gait sample (a single gait cycle) that contains the 3D spatial coordinates (positions) of joints in all video frames, normalized with respect to the person's position and direction of walking. Each of the $N$ learning samples falls strictly into one of the $C$ identity classes representing a single walker labeled $\ell_n$. A class $\mathcal{I}_c \subseteq \mathcal{G}$ has $N_c$ samples. Classes $\mathcal{I} = \{\mathcal{I}_c\}_{c=1}^{C}$ are complete and mutually exclusive. We say that samples $(\mathbf{g}_n, \ell_n)$ and $(\mathbf{g}_{n'}, \ell_{n'})$ share a common walker if and only if they belong to the same class: $(\mathbf{g}_n, \ell_n), (\mathbf{g}_{n'}, \ell_{n'}) \in \mathcal{I}_c \Leftrightarrow \ell_n = \ell_{n'}$.

We evaluate the implemented methods with the cross-identity evaluation setup, in which the collection of learning data $\mathcal{G}_L = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_L}$ of $C_L$ identities and the collection of evaluation data $\mathcal{G}_E = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_E}$ of $C_E$ identities are disjoint. An evaluation configuration is parametrized by $(C_L, C_E)$, specifying how many learning and how many evaluation identity classes are selected from the database. Models are learned on the learning part and evaluated on the evaluation part transformed into the individual feature spaces $\mathcal{G}_E = \{(\mathbf{g}_n, \ell_n)\}_{n=1}^{N_E} \to \widehat{\mathcal{G}}_E = \{(\widehat{\mathbf{g}}_n, \ell_n)\}_{n=1}^{N_E}$ of $C_E$ identities, as determined by the corresponding methods. Evaluation results are presented in terms of the following metrics:

∙ Davies-Bouldin Index: $\mathrm{DBI} = \frac{1}{C_E} \sum_{c=1}^{C_E} \max_{1 \leq c' \leq C_E,\, c' \neq c} \frac{\sigma_c + \sigma_{c'}}{\widehat{\delta}(\widehat{\boldsymbol{\mu}}_c, \widehat{\boldsymbol{\mu}}_{c'})}$, where $\sigma_c = \frac{1}{N_c} \sum_{n=1}^{N_c} \widehat{\delta}(\widehat{\mathbf{g}}_n, \widehat{\boldsymbol{\mu}}_c)$ is the average distance of all elements in identity class $\mathcal{I}_c$ to its centroid $\widehat{\boldsymbol{\mu}}_c$, and analogously for $\sigma_{c'}$. Templates of low intra-class distances and of high inter-class distances have a low DBI.

∙ Silhouette Coefficient: $\mathrm{SC} = \frac{1}{N_E} \sum_{n=1}^{N_E} \frac{b(\widehat{\mathbf{g}}_n) - a(\widehat{\mathbf{g}}_n)}{\max\{a(\widehat{\mathbf{g}}_n), b(\widehat{\mathbf{g}}_n)\}}$, where $a(\widehat{\mathbf{g}}_n) = \frac{1}{N_c} \sum_{n'=1}^{N_c} \widehat{\delta}(\widehat{\mathbf{g}}_n, \widehat{\mathbf{g}}_{n'})$ is the average distance from $\widehat{\mathbf{g}}_n$ to the other samples within the same identity class and $b(\widehat{\mathbf{g}}_n) = \min_{1 \leq c' \leq C_E,\, c' \neq c} \frac{1}{N_{c'}} \sum_{n'=1}^{N_{c'}} \widehat{\delta}(\widehat{\mathbf{g}}_n, \widehat{\mathbf{g}}_{n'})$ is the average distance of $\widehat{\mathbf{g}}_n$ to the samples in the closest other class. It is clear that $-1 \leq \mathrm{SC} \leq 1$, and an SC close to one means that classes are appropriately separated.

∙ area under the Receiver Operating Characteristic curve (ROC)

∙ area under the Precision-Recall curve (PR)

The system can potentially employ various clustering mechanisms that need a high degree of accuracy. Consider a clustering algorithm that returns the clusters $\mathcal{C} = \{\mathcal{C}_{c'}\}_{c'=1}^{C}$ approximating the real identity classes $\mathcal{I} = \{\mathcal{I}_c\}_{c=1}^{C}$. The following metrics evaluate a clustering algorithm together with a particular feature extraction method, which gives an insight into the clusterability of the corresponding feature space:

∙ Purity: $\mathrm{P} = \frac{1}{N_E} \sum_{\mathcal{C}_{c'} \in \mathcal{C}} \max_{\mathcal{I}_c \in \mathcal{I}} |\mathcal{C}_{c'} \cap \mathcal{I}_c|$

∙ Rand Index: $\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}$

∙ F-measure: $\mathrm{F} = \frac{2pr}{p + r}$, where $p = \frac{TP}{TP + FP}$ and $r = \frac{TP}{TP + FN}$

∙ Jaccard Index: $\mathrm{JI} = \frac{TP}{TP + FP + FN}$

∙ Fowlkes-Mallows Index: $\mathrm{FMI} = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$

where TP is the number of true positives, TN of true negatives, FP of false positives, and FN of false negatives. A pair of templates of the same identity falling in the same cluster is a true positive, of different identities in different clusters a true negative, of different identities in the same cluster a false positive, and of the same identity in different clusters a false negative.
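The clustering metrics follow directly from these definitions; a minimal sketch iterating over all template pairs is shown below (quadratic in the number of templates, which is adequate at the dataset sizes considered here). DBI and SC can be computed analogously from their formulas, or taken from a library such as scikit-learn.

```python
import numpy as np
from itertools import combinations

def clustering_metrics(ids, clusters):
    """Purity and the four pair-counting indexes of a clustering.

    ids      -- (N,) ground-truth identity label of each template
    clusters -- (N,) cluster assigned to each template
    """
    ids, clusters = np.asarray(ids), np.asarray(clusters)
    TP = TN = FP = FN = 0
    for i, j in combinations(range(len(ids)), 2):
        same_id, same_cl = ids[i] == ids[j], clusters[i] == clusters[j]
        TP += same_id and same_cl          # same identity, same cluster
        TN += not same_id and not same_cl  # different in both
        FP += not same_id and same_cl
        FN += same_id and not same_cl
    p, r = TP / (TP + FP), TP / (TP + FN)
    # Purity: each cluster votes for its best-matching identity class.
    purity = sum(max(np.sum((clusters == c) & (ids == t))
                     for t in np.unique(ids))
                 for c in np.unique(clusters)) / len(ids)
    return {'P': purity,
            'RI': (TP + TN) / (TP + TN + FP + FN),
            'F': 2 * p * r / (p + r),
            'JI': TP / (TP + FP + FN),
            'FMI': np.sqrt(p * r)}
```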
Discriminativeness

Designing a surveillance system with uncooperative subjects, one should also consider the availability of an auxiliary learning database that is large enough to be used for learning the model. In the following series of experiments we measure the four evaluation metrics on a sequence of configurations with an increasing number of learning identities, the rest of the database serving as the evaluation part. Given that the benchmark dataset contains 64 identities in total, these configurations range from (2, 62) to (32, 32). Technically, it is possible to continue up until (62, 2), but configurations with more learning than evaluation identities are of little interest. Each configuration $(C_L, 64 - C_L)$ is constructed from the previous configuration $(C_L - 1, 64 - C_L + 1)$ by picking one identity in the evaluation part at random and moving it into the learning part.

Observing Figure 3, the discriminativeness of the approaches based on statistical learning (MMC and PCA+LDA) grows quickly on the first configurations with very few learning identities, which one can interpret as an analogy to the Pareto (80–20) principle. The results indicate that even 10 identities can be enough for learning the MMC transform to identify 54 other people more accurately than the other methods tested. Roughly speaking, the MMC method achieves the top results in all of the metrics at a configuration of about (10, 54) and keeps or increases its discriminativeness further on. This experiment provides a lower-bound estimate for the volume of learning data given the volume of the population under surveillance: with a learning database smaller than this lower bound, taking the non-learned geometric features of Dikovski et al. [14] or Kwolek et al. [24] is recommended; otherwise, the MMC method best identifies walkers within a population of more than triple the learning identities.

[Figure 3: Simulations with 31 different $(C_L, C_E)$ configurations (horizontal axes) on the four evaluation metrics (vertical axes): (a) DBI, (b) SC, (c) ROC, (d) PR.]

[Figure 4: Four evaluated metrics of the MMC method with incomplete data. In each column a subset of joints is systematically excluded from the input: the first 31 columns (root – head) exclude a single joint and the last 14 columns (pelvis – all but legs) exclude multiple joints. The structure of the human body is the following: head, pelvis = {root, lhipjoint, rhipjoint}, left leg = {lfemur, ltibia, lfoot, ltoes}, left arm = {lhumerus, lradius, lwrist, lhand, lfingers, lthumb}, right leg = {rfemur, rtibia, rfoot, rtoes}, right arm = {rhumerus, rradius, rwrist, rhand, rfingers, rthumb}, torso = {lowerback, upperback, thorax, lowerneck, upperneck, lclavicle, rclavicle}. Configuration (9, 55).]

[Figure 5: ROC and PR of all implemented methods with noisy data: (a) an x% noise simulated by multiplying each measured value by a random number in the interval $(1 - x/100, 1 + x/100)$; (b) an x% noise simulated by substituting each measured value with a random value with probability x%. Configuration (9, 55).]

Robustness

Video surveillance environments are sensitive to the accuracy of data acquisition. We conducted two series of experiments in which we assumed a (9, 55) configuration with fixed learning and evaluation identities on which we simulated measurement errors by (1) excluding some data and (2) adding random noise. Incomplete data are simulated by excluding various subsets of joints with all their tracked positions, and the score percentages of incomplete data (with subscript new) relative to complete data (with subscript old) were calculated for DBI as $100 \cdot \mathrm{DBI}_{\mathrm{old}} / \mathrm{DBI}_{\mathrm{new}}$, for SC as $100 \cdot (\mathrm{SC}_{\mathrm{new}} + 1) / (\mathrm{SC}_{\mathrm{old}} + 1)$, for ROC as $100 \cdot \mathrm{ROC}_{\mathrm{new}} / \mathrm{ROC}_{\mathrm{old}}$, and for PR as $100 \cdot \mathrm{PR}_{\mathrm{new}} / \mathrm{PR}_{\mathrm{old}}$. A noise of x% was simulated in two ways: (2a) by multiplying each measured value by a random number from the interval $(1 - x/100, 1 + x/100)$ and (2b) by substituting each measured value, with probability x%, with a random value between 0 and 1, in which case 100% noise represents completely random data. All methods were evaluated on noisy data, but only the MMC method was evaluated on incomplete data, as the other methods do not give instructions for dealing with incomplete data.

Figures 4 and 5 illustrate how the methods cope with incomplete and noisy data, respectively. As for data incompleteness, one can observe that particular joints are more discriminatory than others, as their exclusion causes a more significant drop in scores. This information can potentially be used for fine-tuning the hand-picked geometric features. Furthermore, input data pruning has a positive impact on the duration of model learning. Regarding the noisy data, note that the first type of noise is less shrouding than the second type, as the measured values change at most by a factor of two and preserve their means and covariances. The results make it clear that the scores of the geometric-feature methods drop more quickly than those of the MMC and PCA+LDA methods.
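For reproducibility, here is a sketch of the two noise models as we read them from the description above; the array G is assumed to hold the normalized joint coordinates of one sample.

```python
import numpy as np

def multiplicative_noise(G, x, rng=None):
    """Noise model (2a): scale every measured value by a random
    factor drawn uniformly from (1 - x/100, 1 + x/100)."""
    rng = rng or np.random.default_rng()
    return G * rng.uniform(1 - x / 100, 1 + x / 100, size=G.shape)

def substitution_noise(G, x, rng=None):
    """Noise model (2b): replace every measured value, independently
    with probability x/100, by a random value in (0, 1); at x = 100
    the output is completely random data."""
    rng = rng or np.random.default_rng()
    mask = rng.random(G.shape) < x / 100
    return np.where(mask, rng.random(G.shape), G)
```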
Clusterability

Various pre-clustering techniques can be applied to the feature space in order to improve the accuracy of location trace retrieval. A perfectly accurate location trace would be the cluster of gait templates of, and only of, the query walker, although this is nearly impossible to achieve on large datasets. The last experiment again takes the (9, 55) configuration and measures the purity and the other four indexes of the clustering obtained by K-Means into $K = C_E = 55$ clusters. The results in Table 1 show that the geometric features of Andersson and Araujo [4], Dikovski et al. [14] and Kwolek et al. [24], and the latent features learned by MMC and PCA+LDA, define the most suitable sub-spaces for K-Means clustering.

Table 1: Methods evaluated on the five clusterability metrics. Configuration (9, 55).

method     P       RI      F       JI      FMI
Ahmed      0.3402  0.9486  0.1388  0.0746  0.1418
Ali        0.2677  0.9472  0.0986  0.0519  0.1013
Andersson  0.3574  0.9526  0.2600  0.1494  0.2620
Ball       0.3409  0.9491  0.1581  0.0859  0.1611
Dikovski   0.4446  0.9542  0.2583  0.1483  0.2619
Kwolek     0.4571  0.9496  0.2319  0.1311  0.2329
Preis      0.1778  0.9464  0.0462  0.0237  0.0482
Sinha      0.3069  0.9473  0.1143  0.0606  0.1169
MMC        0.4491  0.9538  0.2147  0.1203  0.2202
PCA+LDA    0.4370  0.9538  0.2240  0.1261  0.2291
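Scores of this kind can be obtained along the following lines; the sketch assumes the clustering_metrics helper from the earlier sketch and uses scikit-learn's K-Means, whereas the exact clustering setup (initialization, number of restarts) in our experiments may differ.

```python
from sklearn.cluster import KMeans

def clusterability(templates, ids, n_clusters=55, seed=0):
    """Cluster a method's feature space with K-Means (K = C_E) and
    score the result with the five clusterability metrics.

    templates -- (N_E, d) evaluation templates in the feature space
    ids       -- (N_E,) their ground-truth identities
    """
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(templates)
    return clustering_metrics(ids, pred)  # from the earlier sketch
```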
4. Conclusion

We present a plausible high-level workflow of a MoCap gait identification system in a video surveillance environment with uncooperative walkers. The main focus is on the evaluation of whether Phase III of the workflow (extracting gait features) can meet the requirements of Phase IV of the workflow (a target application for identifying walkers). Eight methods for extracting geometric gait features and two methods for statistically learning the features have been implemented and evaluated on the CMU MoCap database of 64 subjects and 5,923 gait samples. The feature space of each method is evaluated for (1) discriminativeness, (2) robustness to incomplete and noisy data, and (3) clusterability. With the evaluation set of 55 identities, the MMC method learned on 9 identities achieves the top results in discriminativeness and improves with an increasing number of learning identities; together with the PCA+LDA method, it is highly robust to noisy and incomplete data; and it results in a rather pure clustering. The MMC method appears to learn the highest-quality features, reaching the state of the art in MoCap gait identification.

Our suggested concept of person identification is based on a completely different perspective from previous approaches: according to the phrase You are how you walk, a walker's identity is represented as their location trace, that is, a cluster of gait templates from their incidents within the surveillance system. This concept allows for data-driven models that (1) can be learned from different people as well as from different scenes and datasets, making them more generally applicable with limited data per person in a learning set, even if only a single gait sample is available for each person, and (2) do not assume that the learning and evaluation sets have the same covariate conditions, as long as the auxiliary learning set is rich enough in them to identify as many people as possible. This makes it particularly suitable for applications of uncooperative recognition, such as walker re-identification [26, 35] or next location prediction [27, 28].

Acknowledgments

Data used in this project was created with funding from NSF EIA-0196217 and was obtained from http://mocap.cs.cmu.edu [13]. Our extracted database and evaluation framework are available online at https://gait.fi.muni.cz to support reproducibility of results.

References

[1] F. Ahmed, P. P. Paul, and M. L. Gavrilova. DTW-Based Kernel and Rank-Level Fusion for 3D Gait Recognition Using Kinect. The Visual Computer, 31(6):915–924, 2015.
[2] M. Ahmed, N. Al-Jawad, and A. Sabir. Gait Recognition Based on Kinect Sensor. Proc. SPIE, 9139:91390B–91390B-10, 2014.
[3] S. Ali, Z. Wu, X. Li, N. Saeed, D. Wang, and M. Zhou. Transactions on Computational Science XXVI: Special Issue on Cyberworlds and Cybersecurity, chapter Applying Geometric Function on Sensors 3D Gait Data for Human Identification, pages 125–141. Springer, Berlin, Heidelberg, 2016.
[4] V. O. Andersson and R. M. Araujo.
Person Identification Using Anthropometric and Gait Data from Kinect Sensor. In Proc. of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), pages 425–431. AAAI Press, 2015.
[5] E. Auvinet, F. Multon, C.-E. Aubin, J. Meunier, and M. Raison. Detection of gait cycles in treadmill walking using a Kinect. Gait & Posture, 41(2):722–725, 2015.
[6] M. Balazia and P. Sojka. Learning Robust Features for Gait Recognition by Maximum Margin Criterion. In Proc. of 23rd International Conference on Pattern Recognition, ICPR 2016, pages 901–906. IEEE, 2016.
[7] M. Balazia and P. Sojka. Walker-independent features for gait recognition from motion capture data. In A. Robles-Kelly, M. Loog, B. Biggio, F. Escolano, and R. Wilson, editors, Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshop, S+SSPR 2016, Mérida, Mexico, November 29–December 2, 2016, Proceedings, pages 310–321, Cham, 2016. Springer International Publishing.
[8] M. Balazia and P. Sojka. An evaluation framework and database for MoCap-based gait recognition methods. In B. Kerautret, M. Colom, and P. Monasse, editors, Reproducible Research in Pattern Recognition: First International Workshop, RRPR 2016, Cancún, Mexico, December 4, 2016, Revised Selected Papers, pages 33–47, Cham, 2017. Springer International Publishing.
[9] M. Balazia and P. Sojka. Gait recognition from motion capture data, 2017. https://gait.fi.muni.cz.
[10] A. Ball, D. Rye, F. Ramos, and M. Velonaki. Unsupervised clustering of people from 'skeleton' data. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI '12, pages 225–226, New York, NY, USA, 2012. ACM.
[11] X. Chen and J. Xu. Uncooperative gait recognition: Re-ranking based on sparse coding and multi-view hypergraph learning. Pattern Recognition, 53:116–129, 2016.
[12] W. Choensawat, W. Choi, and K. Hachimura. Similarity retrieval of motion capture data based on derivative features. Advanced Computational Intelligence and Intelligent Informatics, 16(1):13–23, 2012.
[13] CMU Graphics Lab. Carnegie-Mellon Motion Capture (MoCap) Database, 2003. http://mocap.cs.cmu.edu.
[14] B. Dikovski, G. Madjarov, and D. Gjorgjevikj. Evaluation of Different Feature Sets for Gait Recognition Using Skeletal Data from Kinect. In 37th Intl. Convention on Information and Communication Technology, Electronics and Microelectronics, pages 1304–1308, May 2014.
[15] M. Ding and G. Fan. Articulated and generalized Gaussian kernel correlation for human pose estimation. IEEE Transactions on Image Processing, 25(2):776–789, Feb. 2016.
[16] R. A. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179–188, 1936.
[17] S. Han, M. Achar, S. Lee, and F. Peña-Mora. Empirical assessment of a RGB-D sensor on motion capture and action recognition for construction worker monitoring. Visualization in Engineering, 1(1):6, 2013.
[18] A. Haque, B. Peng, Z. Luo, A. Alahi, S. Yeung, and F. Li. Viewpoint invariant 3D human pose estimation with recurrent error feedback. CoRR, abs/1603.07076, 2016.
[19] M. C. Hu, C. W. Chen, W. H. Cheng, C. H. Chang, J. H. Lai, and J. L. Wu. Real-Time Human Movement Retrieval and Assessment With Kinect Sensor. IEEE Transactions on Cybernetics, 45(4):742–753, Apr. 2015.
[20] S. Jiang, Y. Wang, Y. Zhang, and J. Sun. Real Time Gait Recognition System Based on Kinect Skeleton Feature. In C. Jawahar and S. Shan, editors, Computer Vision – ACCV 2014 Workshops, volume 9008 of LNCS, pages 46–57.
Springer, 2015.
[21] I. Kapsouras and N. Nikolaidis. Action recognition in motion capture data using a bag of postures approach. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 2649–2654, Aug. 2014.
[22] T. Krzeszowski, A. Switonski, B. Kwolek, H. Josinski, and K. Wojciechowski. DTW-Based Gait Recognition from Recovered 3-D Joint Angles and Inter-ankle Distance. In L. J. Chmielewski, R. Kozera, B.-S. Shin, and K. Wojciechowski, editors, Proc. of Computer Vision and Graphics: International Conference, ICCVG 2014, Warsaw, Poland, volume 8671 of LNCS, pages 356–363. Springer, Sept. 2014.
[23] M. S. N. Kumar and R. V. Babu. Human gait recognition using depth camera: A covariance based approach. In Proc. of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP '12, pages 20:1–20:6, New York, NY, USA, 2012. ACM.
[24] B. Kwolek, T. Krzeszowski, A. Michalczuk, and H. Josinski. 3D Gait Recognition Using Spatio-Temporal Motion Descriptors. In Proc. of Intelligent Information and Database Systems: 6th Asian Conference, ACIIDS 2014, Bangkok, Thailand, Part II, volume 8398 of LNCS, pages 595–604. Springer, Apr. 2014.
[25] D. Leightley, B. Li, J. S. McPhee, M. H. Yap, and J. Darby. Exemplar-Based Human Action Recognition with Template Matching from a Stream of Motion Capture, pages 12–20. Springer International Publishing, Cham, 2014.
[26] X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K.-M. Lam, and Y. Zhong. Person re-identification by unsupervised video matching. Pattern Recognition, 65(C):197–210, May 2017.
[27] U. Mahbub and R. Chellappa. PATH: person authentication using trace histories. CoRR, abs/1610.07935, 2016.
[28] U. Mahbub, S. Sarkar, V. M. Patel, and R. Chellappa. Active user authentication for smartphones: A challenge data set and benchmark results. In 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8, Sept. 2016.
[29] R. Martín-Félez and T. Xiang. Uncooperative gait recognition by learning to rank. Pattern Recognition, 47(12):3793–3806, 2014.
[30] J. Preis, M. Kessel, M. Werner, and C. Linnhoff-Popien. Gait Recognition with Kinect. In 1st International Workshop on Kinect in Pervasive Computing, New Castle, UK, June 18–22, pages 1–4, 2012.
[31] J. Sedmidubsky, J. Valcik, M. Balazia, and P. Zezula. Gait Recognition Based on Normalized Walk Cycles. In Advances in Visual Computing, volume 7432 of LNCS, pages 11–20. Springer, 2012.
[32] A. Sinha, K. Chakravarty, and B. Bhowmick. Person Identification Using Skeleton Information from Kinect. In ACHI 2013: Proc. of the Sixth Intl. Conf. on Advances in CHI, pages 101–108, 2013.
[33] J. Valcik, J. Sedmidubsky, M. Balazia, and P. Zezula. Identifying walk cycles for human recognition. In M. Chau, A. G. Wang, W. T. Yue, and H. Chen, editors, Intelligence and Security Informatics: Pacific Asia Workshop, PAISI 2012, Kuala Lumpur, Malaysia, May 29, 2012, Proceedings, pages 127–135, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[34] S. Vantigodi and V. B. Radhakrishnan. Action recognition from motion capture data using meta-cognitive RBF network classifier. In Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2014 IEEE Ninth International Conference on, pages 1–6, Apr. 2014.
[35] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.