Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

J 2025

MESSINA, Nicola; Jan SEDMIDUBSKÝ; Fabrizio FALCHI and Tomáš REBOK

Basic information

Original title

Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Authors

MESSINA, Nicola; Jan SEDMIDUBSKÝ; Fabrizio FALCHI and Tomáš REBOK

Edition

ACM Transactions on Multimedia Computing, Communications, and Applications, New York, NY, USA, ACM, 2025, 1551-6857

Other information

Language

English

Type of outcome

Article in a journal

Field of study

10200 1.2 Computer and information sciences

Country of publisher

United States

Confidentiality

Not subject to a state or trade secret

References

URL

Impact factor

Impact factor: 6.000 in 2024

Marked for transfer to RIV

Yes

RIV code

RIV/00216224:14330/25:00144684

Organizational unit

Faculty of Informatics

Keywords in English

3D human motion; cross-modal retrieval; multi-modal understanding; text-motion retrieval

Tags

DISA, rivok

Flags

International impact, Reviewed

Changed: 1 April 2026, 11:00, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice-versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning – where we train on multiple text-motion datasets simultaneously – together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets, including also some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available here: https://github.com/mesnico/MOTpp.

Links

VK01010147, research and development project

Name: Automatizovaná forenzní laboratoř digitálních dat pro odhalování komplexní trestné činnosti (Automated forensic laboratory of digital data for detecting complex criminal activity)

Investor: Ministry of the Interior of the Czech Republic, Automatizovaná forenzní laboratoř digitálních dat pro odhalování komplexní trestné činnosti

Cite

MESSINA, Nicola; Jan SEDMIDUBSKÝ; Fabrizio FALCHI and Tomáš REBOK. Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications. New York, NY, USA: ACM, 2025, vol. 21, no. 10, pp. 1-24. ISSN 1551-6857. Available from: https://doi.org/10.1145/3744565.

@article{2501762,
   author = {Messina, Nicola and Sedmidubský, Jan and Falchi, Fabrizio and Rebok, Tomáš},
   address = {New York, NY, USA},
   publisher = {ACM},
   number = {10},
   pages = {1--24},
   doi = {10.1145/3744565},
   keywords = {3D human motion; cross-modal retrieval; multi-modal understanding; text-motion retrieval},
   language = {eng},
   issn = {1551-6857},
   journal = {ACM Transactions on Multimedia Computing, Communications, and Applications},
   title = {Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval},
   url = {https://doi.org/10.1145/3744565},
   volume = {21},
   year = {2025}
}

TY  - JOUR
ID  - 2501762
AU  - Messina, Nicola
AU  - Sedmidubský, Jan
AU  - Falchi, Fabrizio
AU  - Rebok, Tomáš
PY  - 2025
TI  - Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval
JF  - ACM Transactions on Multimedia Computing, Communications, and Applications
VL  - 21
IS  - 10
SP  - 1
EP  - 24
PB  - ACM
SN  - 1551-6857
KW  - 3D human motion
KW  - cross-modal retrieval
KW  - multi-modal understanding
KW  - text-motion retrieval
UR  - https://doi.org/10.1145/3744565
N2  - Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice-versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning – where we train on multiple text-motion datasets simultaneously – together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets, including also some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available here: https://github.com/mesnico/MOTpp.
ER  -

MESSINA, Nicola; Jan SEDMIDUBSKÝ; Fabrizio FALCHI and Tomáš REBOK. Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval. \textit{ACM Transactions on Multimedia Computing, Communications, and Applications}. New York, NY, USA: ACM, 2025, vol.~21, no.~10, pp.~1-24. ISSN~1551-6857. Available from: https://doi.org/10.1145/3744565.
