Other formats:
BibTeX
LaTeX
RIS
@inproceedings{2273418,
  author    = {Messina, Nicola and Sedmidubský, Jan and Falchi, Fabrizio and Rebok, Tomáš},
  title     = {Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language},
  booktitle = {46th International Conference on Research and Development in Information Retrieval (SIGIR)},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  year      = {2023},
  pages     = {2420--2425},
  isbn      = {978-1-4503-9408-6},
  doi       = {10.1145/3539618.3592069},
  url       = {https://doi.org/10.1145/3539618.3592069},
  keywords  = {human motion data;skeleton sequences;CLIP;BERT;deep language models;ViViT;motion retrieval;cross-modal retrieval},
  note      = {Best Short Paper Award Honorable Mention},
  language  = {eng},
}
TY  - CONF
ID  - 2273418
AU  - Messina, Nicola
AU  - Sedmidubský, Jan
AU  - Falchi, Fabrizio
AU  - Rebok, Tomáš
PY  - 2023
TI  - Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
T2  - 46th International Conference on Research and Development in Information Retrieval (SIGIR)
PB  - Association for Computing Machinery
CY  - New York, NY, USA
SP  - 2420
EP  - 2425
SN  - 9781450394086
DO  - 10.1145/3539618.3592069
N1  - Best Short Paper Award Honorable Mention
KW  - human motion data
KW  - skeleton sequences
KW  - CLIP
KW  - BERT
KW  - deep language models
KW  - ViViT
KW  - motion retrieval
KW  - cross-modal retrieval
UR  - https://doi.org/10.1145/3539618.3592069
N2  - Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
ER  - 
MESSINA, Nicola, Jan SEDMIDUBSKÝ, Fabrizio FALCHI and Tomáš REBOK. Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language. Online. In \textit{46th International Conference on Research and Development in Information Retrieval (SIGIR)}. New York, NY, USA: Association for Computing Machinery, 2023, p.~2420-2425. ISBN~978-1-4503-9408-6. Available from: https://dx.doi.org/10.1145/3539618.3592069.
|