J 2024

Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

VOLLMAR, Melanie; Santosh TIRUNAGARI; Deborah HARRUS; David ARMSTRONG; Romana GÁBOROVÁ et al.

Basic information

Original title

Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

Authors

VOLLMAR, Melanie; Santosh TIRUNAGARI; Deborah HARRUS; David ARMSTRONG; Romana GÁBOROVÁ (703 Slovakia, belonging to the institution); Deepti GUPTA; Marcelo Querino Lima AFONSO; Genevieve EVANS and Sameer VELANKAR

Edition

Scientific Data, BERLIN, NATURE PORTFOLIO, 2024, 2052-4463

Other information

Language

English

Type of outcome

Article in a journal

Field of study

10700 1.7 Other natural sciences

Country of publisher

Germany

Confidentiality

is not subject to a state or trade secret

References

Impact factor

Impact factor: 5.800 in 2023

RIV code

RIV/00216224:14740/24:00138864

Organization unit

Central European Institute of Technology (Středoevropský technologický institut)

UT WoS

001325129100022

EID Scopus

2-s2.0-85205275590

Keywords in English

MECHANISM; COMPLEX; ONTOLOGY; DOMAIN

Tags

Changed: 25 March 2025, 12:14, Mgr. Eva Dubská

Abstract

In the original

We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.

Links

90255, large research infrastructure
Name: ELIXIR CZ III