SPIŠAKOVÁ, Viktória, Lukáš HEJTMÁNEK and Jakub HYNŠT. Nextflow in Bioinformatics: Executors Performance Comparison Using Genomics Data. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE. NETHERLANDS: ELSEVIER, 2023, vol. 142, May 2023, p. 328-339. ISSN 0167-739X. Available from: https://dx.doi.org/10.1016/j.future.2023.01.009.
Basic information
Original name Nextflow in Bioinformatics: Executors Performance Comparison Using Genomics Data
Authors SPIŠAKOVÁ, Viktória (703 Slovakia, belonging to the institution), Lukáš HEJTMÁNEK (203 Czech Republic, belonging to the institution) and Jakub HYNŠT (203 Czech Republic, belonging to the institution).
Edition FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, NETHERLANDS, ELSEVIER, 2023, 0167-739X.
Other information
Original language English
Type of outcome Article in a journal
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Netherlands
Confidentiality degree is not subject to a state or trade secret
Impact factor 7.500 in 2022
RIV identification code RIV/00216224:14610/23:00130181
Organization unit Institute of Computer Science
Doi http://dx.doi.org/10.1016/j.future.2023.01.009
UT WoS 000926828200001
Keywords in English Kubernetes;HPC;Cloud;Performance comparison;Genomics;Nextflow;Big data
Tags rivok
Tags International impact, Reviewed
Changed by Mgr. Alena Mokrá, učo 362754. Changed: 20/3/2024 15:39.
Abstract
Processing big data is a computationally demanding task that has usually been handled by HPC batch systems. These complex systems pose a challenge to scientists because they are cumbersome to use and their environment keeps changing. Scientists often lack a deeper understanding of informatics, and experiment reproducibility is increasingly becoming a hard requirement for research validity. Containers, a new computational paradigm, are meant to bundle all dependencies and persist state, which helps reproducibility. They have gained a lot of popularity in the informatics community, but the HPC community remains skeptical and doubts that container platforms are appropriate for demanding tasks or that such infrastructure can reach significant performance. In this paper, we observe the performance of various infrastructure types (HPC, Kubernetes, local) on a Sarek Nextflow bioinformatics workflow with real-life genomics data of various sizes. We analyze the obtained workload trace and discuss the pros and cons of the infrastructures used. We also show that some approaches perform better in terms of available resources, while others are more suitable for diversified workflows. Based on the results, we provide recommendations for life science groups that plan to analyze data at large scale.
Links
EF16_026/0008448, research and development project
Name: Analýza českých genomů pro teranostiku
LM2018140, research and development project
Name: e-Infrastruktura CZ (Acronym: e-INFRA CZ)
Investor: Ministry of Education, Youth and Sports of the CR