Genomic benchmarks: a collection of datasets for genomic sequence classification

J 2023

GREŠOVÁ, Katarína, Vlastimil MARTINEK, David ČECHÁK, Petr ŠIMEČEK and Panagiotis ALEXIOU

Basic information

Original title

Genomic benchmarks: a collection of datasets for genomic sequence classification

Authors

GREŠOVÁ, Katarína (703 Slovakia, belonging to the institution), Vlastimil MARTINEK (203 Czech Republic, belonging to the institution), David ČECHÁK (203 Czech Republic, belonging to the institution), Petr ŠIMEČEK (203 Czech Republic, guarantor, belonging to the institution) and Panagiotis ALEXIOU (300 Greece, belonging to the institution)

Edition

BMC Genomic Data, 2730-6844, BMC, 2023, 2730-6844

Other information

Language

English

Type of outcome

Article in a periodical

Field of study

10610 Biophysics

Country of publisher

United Kingdom of Great Britain and Northern Ireland

Confidentiality

is not subject to a state or trade secret

Links

URL

Impact factor

Impact factor: 1.900 in 2022

RIV code

RIV/00216224:14740/23:00131330

Organization unit

Central European Institute of Technology

DOI

http://dx.doi.org/10.1186/s12863-023-01123-8

UT WoS

000981254200001

Keywords in English

Genomics; Dataset; Benchmark; Deep learning; Convolutional neural network

Tags

CF BIOIT, rivok

Flags

International impact, Reviewed

Changed: 8 April 2024 10:34, Mgr. Eva Dubská

Abstract

In the original language

Background: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we have similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks comparable to the protein folding competition.

Results: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.

Conclusions: Deep learning techniques have revolutionized many biological fields, but largely thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences, with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
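
Editorial illustration (not part of the original abstract): the abstract does not describe how sequences are fed to the baseline network, so the sketch below only shows one common preprocessing step for sequence classification, one-hot encoding of DNA with fixed-length padding. The function names, the A/C/G/T column order, and the all-zero handling of unknown bases are assumptions made for this example and are not taken from the genomic-benchmarks package.

```python
# Hypothetical helper functions for illustration only; they are NOT part of
# the genomic-benchmarks package and do not reproduce its preprocessing.
ALPHABET = "ACGT"  # assumed column order for the one-hot rows


def one_hot_encode(sequence):
    """Return one 4-element row per base; unknown bases (e.g. N) become all zeros."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def pad_or_trim(rows, length):
    """Force a fixed length by trimming or by appending all-zero rows."""
    fixed = rows[:length]
    fixed += [[0, 0, 0, 0] for _ in range(length - len(fixed))]
    return fixed


# A 5-base sequence padded to 8 positions, as a fixed-size network input would need.
encoded = pad_or_trim(one_hot_encode("ACGTN"), 8)
```

A fixed-length matrix of this shape is what a one-dimensional convolutional layer would typically consume; whether the published baseline uses one-hot rows or another representation should be checked in the repository itself.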

Related projects

GF23-04260L, research and development project

Name: Biological code of knots – identification of knotted patterns in biomolecules via AI approach

Investor: Czech Science Foundation, Biological code of knots – identification of knotted patterns in biomolecules via AI approach, Partner agency (Poland)

LM2018140, research and development project

Name: e-Infrastruktura CZ (Acronym: e-INFRA CZ)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, e-Infrastruktura CZ

4431, internal MU code

Name: Deep Learning for Genomic and Transcriptomic Pattern Identification

Investor: EMBO (European Molecular Biology Organization), Deep Learning for Genomic and Transcriptomic Pattern Identification

867414, internal MU code

Name: Using Deep Learning to understand RNA Binding Protein binding characteristics (Acronym: DEEPLEARNRBP)

Investor: European Union, Using Deep Learning to understand RNA Binding Protein binding characteristics, MSCA Marie Skłodowska-Curie Actions (Excellent Science)

896172, internal MU code

Name: Deciphering the Language of DNA to Identify Regulatory Elements and Classify Transcripts Into Functional Classes (Acronym: LanguageOfDNA)

Investor: European Union, Deciphering the Language of DNA to Identify Regulatory Elements and Classify Transcripts Into Functional Classes, MSCA Marie Skłodowska-Curie Actions (Excellent Science)

90267, large research infrastructure

Name: NCMG III
