2025
Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval
MESSINA, Nicola; Jan SEDMIDUBSKÝ; Fabrizio FALCHI and Tomáš REBOK
Basic information
Original name
Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval
Authors
MESSINA, Nicola; Jan SEDMIDUBSKÝ; Fabrizio FALCHI and Tomáš REBOK
Edition
ACM Transactions on Multimedia Computing, Communications, and Applications, New York, NY, USA, ACM, 2025, 1551-6857
Other information
Language
English
Type of outcome
Article in a periodical
Field of study
10200 1.2 Computer and information sciences
Country of publisher
United States of America
Confidentiality degree
is not subject to a state or trade secret
References
Impact factor
Impact factor: 6.000 in 2024
Marked to be transferred to RIV
Yes
RIV identification code
RIV/00216224:14330/25:00144684
Organization unit
Faculty of Informatics
UT WoS
EID Scopus
Keywords in English
3D human motion; cross-modal retrieval; multi-modal understanding; text-motion retrieval
Tags
International impact, Reviewed
Changed: 1 April 2026 11:00, RNDr. Pavel Šmerk, Ph.D.
Abstract
In the original language
Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously, together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely used KIT Motion-Language and HumanML3D datasets, along with some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available here: https://github.com/mesnico/MOTpp.
Links
| VK01010147, research and development project |