CARRARA, Fabio, Petr ELIÁŠ, Jan SEDMIDUBSKÝ and Pavel ZEZULA. LSTM-Based Real-Time Action Detection and Prediction in Human Motion Streams. Multimedia Tools and Applications. Springer US, 2019, vol. 78, No 19, p. 27309-27331. ISSN 1380-7501. Available from: https://dx.doi.org/10.1007/s11042-019-07827-3.
Basic information
Original name LSTM-Based Real-Time Action Detection and Prediction in Human Motion Streams
Authors CARRARA, Fabio (380 Italy), Petr ELIÁŠ (203 Czech Republic, belonging to the institution), Jan SEDMIDUBSKÝ (203 Czech Republic, guarantor, belonging to the institution) and Pavel ZEZULA (203 Czech Republic, belonging to the institution).
Edition Multimedia Tools and Applications, Springer US, 2019, ISSN 1380-7501.
Other information
Original language English
Type of outcome Article in a journal
Field of Study 10200 1.2 Computer and information sciences
Country of publisher Netherlands
Confidentiality degree is not subject to a state or trade secret
Impact factor 2.313
RIV identification code RIV/00216224:14330/19:00109721
Organization unit Faculty of Informatics
DOI http://dx.doi.org/10.1007/s11042-019-07827-3
UT WoS 000485298000024
Keywords in English motion capture data; stream annotation; action detection and recognition; action prediction; LSTM; recurrent neural network
Tags DISA
Tags International impact, Reviewed
Changed by doc. RNDr. Jan Sedmidubský, Ph.D., učo 60474. Changed: 21/1/2020 08:21.
Abstract
Motion capture data digitally represent human movements as sequences of 3D skeleton configurations. Such spatio-temporal data, often recorded as continuous streams, need to be processed efficiently to detect actions of high interest, for example, in human-computer interaction to understand hand gestures in real time. Alternatively, automatically annotated parts of a continuous stream can be persistently stored to become searchable, and thus reusable for future retrieval or pattern mining. In this paper, we focus on multi-label detection of user-specified actions in unsegmented sequences as well as continuous streams. In particular, we build on current advances in recurrent neural networks and adopt a unidirectional LSTM model to effectively encode the skeleton frames within the hidden network states. During the training phase, the model learns which subsequences of encoded frames belong to the specified action classes. The learned class representations are then employed in the annotation phase to infer the probability that an incoming skeleton frame belongs to a given action class. The computed probabilities are finally compared against a learned threshold to automatically determine the beginnings and endings of actions. To further enhance the annotation accuracy, we utilize a bidirectional LSTM model to estimate class probabilities by considering not only the past frames but also the future ones. We extensively evaluate both models on three use cases: real-time stream annotation, offline annotation of long sequences, and early action detection and prediction. The experiments demonstrate that our models outperform the state of the art in effectiveness and are at least one order of magnitude more efficient, being able to annotate 10k frames per second.
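Illustrative sketch (not part of the record and not the authors' implementation): the following PyTorch fragment outlines, under assumed dimensions and an example fixed threshold, the frame-level scheme described in the abstract, in which a unidirectional LSTM encodes incoming skeleton frames, a per-frame classifier yields per-class probabilities, and a threshold comparison marks action beginnings and endings. All names, layer sizes, and the threshold value are assumptions for illustration; the published model learns its threshold from data.

# Minimal sketch of frame-level multi-label action detection with a
# unidirectional LSTM, as described in the abstract. Dimensions, class
# count, and the threshold are illustrative assumptions.
import torch
import torch.nn as nn

class StreamActionDetector(nn.Module):
    def __init__(self, joint_dim=93, hidden_dim=512, num_classes=65):
        super().__init__()
        # Unidirectional LSTM encodes each skeleton frame into a hidden state.
        self.lstm = nn.LSTM(joint_dim, hidden_dim, batch_first=True)
        # Per-frame classifier producing one probability per action class.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames, state=None):
        # frames: (batch, time, joint_dim); state carries the stream context
        # across calls so annotation can run frame by frame in real time.
        encoded, state = self.lstm(frames, state)
        probs = torch.sigmoid(self.classifier(encoded))  # (batch, time, num_classes)
        return probs, state

# Streaming usage: feed one frame at a time and compare class probabilities
# against a threshold to decide action beginnings and endings.
detector = StreamActionDetector()
state = None
threshold = 0.5  # illustrative; the paper learns this threshold
active = torch.zeros(65, dtype=torch.bool)
for frame in torch.randn(100, 1, 1, 93):  # stand-in for a motion-capture stream
    probs, state = detector(frame, state)
    now_active = probs[0, -1] > threshold
    started = now_active & ~active   # classes whose action just began
    ended = ~now_active & active     # classes whose action just ended
    active = now_active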
Links
EF16_019/0000822, research and development project. Name: Centrum excelence pro kyberkriminalitu, kyberbezpečnost a ochranu kritických informačních infrastruktur (Centre of Excellence for Cybercrime, Cybersecurity and Protection of Critical Information Infrastructures)