2024
Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
VOLLMAR, Melanie; Santosh TIRUNAGARI; Deborah HARRUS; David ARMSTRONG; Romana GÁBOROVÁ et al.
Basic information
Original title
Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature
Authors
VOLLMAR, Melanie; Santosh TIRUNAGARI; Deborah HARRUS; David ARMSTRONG; Romana GÁBOROVÁ (703 Slovakia, belonging to the institution); Deepti GUPTA; Marcelo Querino Lima AFONSO; Genevieve EVANS and Sameer VELANKAR
Edition
Scientific Data, BERLIN, NATURE PORTFOLIO, 2024, 2052-4463
Other information
Language
English
Type of outcome
Article in a professional journal
Field of study
10700 1.7 Other natural sciences
Country of publisher
Germany
Confidentiality
is not subject to a state or trade secret
References
Impact factor
Impact factor: 5.800 in 2023
RIV identification code
RIV/00216224:14740/24:00138864
Organization unit
Central European Institute of Technology
UT WoS
001325129100022
EID Scopus
2-s2.0-85205275590
Keywords in English
MECHANISM; COMPLEX; ONTOLOGY; DOMAIN
Tags
Changed: 25. 3. 2025 12:14, Mgr. Eva Dubská
Abstract
In the original
We present a novel system that leverages curators in the loop to develop a dataset and model for detecting structure features and functional annotations at residue-level from standard publication text. Our approach involves the integration of data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, combined with annotation guidelines from UniProt, and LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we utilized to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model with commendable performance metrics of 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. Our proposed system showcases a successful synergy of machine learning techniques and human expertise in curating a dataset for residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
Links
90255, large research infrastructures