Genomic benchmarks: a collection of datasets for genomic
sequence classification

GREŠOVÁ, Katarína, Vlastimil MARTINEK, David ČECHÁK, Petr ŠIMEČEK and Panagiotis ALEXIOU. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data. BMC, 2023, vol. 24, No. 1, p. 1-9. ISSN 2730-6844. Available from: https://dx.doi.org/10.1186/s12863-023-01123-8.

TY  - JOUR
ID  - 2300026
AU  - Grešová, Katarína
AU  - Martinek, Vlastimil
AU  - Čechák, David
AU  - Šimeček, Petr
AU  - Alexiou, Panagiotis
PY  - 2023
TI  - Genomic benchmarks: a collection of datasets for genomic sequence classification
JF  - BMC Genomic Data
VL  - 24
IS  - 1
SP  - 1
EP  - 9
PB  - BMC
SN  - 2730-6844
KW  - Genomics
KW  - Dataset
KW  - Benchmark
KW  - Deep learning
KW  - Convolutional neural network
UR  - https://link.springer.com/article/10.1186/s12863-023-01123-8
N2  - Background: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks similar to the protein folding competition. Results: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks. Conclusions: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
ER  -

Basic information
Original name	Genomic benchmarks: a collection of datasets for genomic sequence classification
Authors	GREŠOVÁ, Katarína (703 Slovakia, belonging to the institution), Vlastimil MARTINEK (203 Czech Republic, belonging to the institution), David ČECHÁK (203 Czech Republic, belonging to the institution), Petr ŠIMEČEK (203 Czech Republic, guarantor, belonging to the institution) and Panagiotis ALEXIOU (300 Greece, belonging to the institution).
Edition	BMC Genomic Data, BMC, 2023, 2730-6844.

Other information
Original language	English
Type of outcome	Article in a journal
Field of Study	10610 Biophysics
Country of publisher	United Kingdom of Great Britain and Northern Ireland
Confidentiality degree	is not subject to a state or trade secret
Impact factor	1.900 in 2022
RIV identification code	RIV/00216224:14740/23:00131330
Organization unit	Central European Institute of Technology
Doi	http://dx.doi.org/10.1186/s12863-023-01123-8
UT WoS	000981254200001
Keywords in English	Genomics; Dataset; Benchmark; Deep learning; Convolutional neural network
Tags	CF BIOIT, rivok
Tags	International impact, Reviewed
Changed by	Mgr. Eva Dubská, učo 77638. Changed: 8/4/2024 10:34.

Abstract

Background: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks similar to the protein folding competition.

Results: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.

Conclusions: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.

Links
GF23-04260L, research and development project	Name: Biological code of knots: identification of knot patterns in biomolecules using AI methods
GF23-04260L, research and development project	Investor: Czech Science Foundation, Partner Agency
LM2018140, research and development project	Name: e-Infrastruktura CZ (Acronym: e-INFRA CZ)
LM2018140, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
4431, MU internal code	Name: Deep Learning for Genomic and Transcriptomic Pattern Identification
4431, MU internal code	Investor: EMBO (European Molecular Biology Organization)
867414, MU internal code	Name: Using Deep Learning to understand RNA Binding Protein binding characteristics (Acronym: DEEPLEARNRBP)
867414, MU internal code	Investor: European Union, MSCA Marie Skłodowska-Curie Actions (Excellent Science)
896172, MU internal code	Name: Deciphering the Language of DNA to Identify Regulatory Elements and Classify Transcripts Into Functional Classes (Acronym: LanguageOfDNA)
896172, MU internal code	Investor: European Union, MSCA Marie Skłodowska-Curie Actions (Excellent Science)
90267, large research infrastructures	Name: NCMG III
