Network Flows for Data Distribution and Computation

MAKATUN, Dzmitry, Jerome LAURET, Hana RUDOVÁ and Michal ŠUMBERA. Network Flows for Data Distribution and Computation. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI). USA: IEEE, 2016, p. 1-8. ISBN 978-1-5090-4240-1. Available from: https://dx.doi.org/10.1109/SSCI.2016.7850083.


TY  - CONF
ID  - 1377655
AU  - Makatun, Dzmitry
AU  - Lauret, Jerome
AU  - Rudová, Hana
AU  - Šumbera, Michal
PY  - 2016
TI  - Network Flows for Data Distribution and Computation
PB  - IEEE
CY  - USA
SN  - 9781509042401
KW  - Production
KW  - Processor scheduling
KW  - Distributed databases
KW  - Optimization
KW  - Computational modeling
KW  - Planning
KW  - Bandwidth
N2  - An important class of modern big data applications is distributed data production in High Energy and Nuclear Physics (HENP). Such data-intensive computations rely heavily on geographically distributed resources featuring hundreds of thousands of CPUs and petabytes of storage. Unfortunately, classical job scheduling approaches either do not address all aspects of the case or do not scale appropriately. Previously we developed a new job scheduling approach dedicated to distributed data production, where load balancing across sites is provided by forwarding data in a peer-to-peer manner, guided by a centrally created and periodically updated plan that aims to achieve global optimality. Because many HENP experiments utilize distributed storage, in this work we provide an important generalization of our approach to consider multiple sources of input data. The underlying network flow model is also extended to enable optimization on various additional criteria on top of flow maximization, making it versatile for a wide scope of potential use cases. In this study such additional optimization was used for more efficient reasoning with multiple data sources: balancing their usage and planning the initial data distribution. These two considerations make it possible to reduce the influence of network bottlenecks at the early and late stages of data production. The simulations carried out in this work allow us to test our approach on a more general case of networks and servers, not limited to the specifics of HENP infrastructure. In all of the simulations our planner showed a significant improvement in both average throughput and makespan over the typically used pull scheduling approach.
ER  -

Basic information
Original name	Network Flows for Data Distribution and Computation
Authors	MAKATUN, Dzmitry (112 Belarus), Jerome LAURET (840 United States of America), Hana RUDOVÁ (203 Czech Republic, guarantor, belonging to the institution) and Michal ŠUMBERA (203 Czech Republic).
Edition	USA, 2016 IEEE Symposium Series on Computational Intelligence (SSCI), p. 1-8, 8 pp. 2016.
Publisher	IEEE

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	United States of America
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/16:00094080
Organization unit	Faculty of Informatics
ISBN	978-1-5090-4240-1
Doi	http://dx.doi.org/10.1109/SSCI.2016.7850083
UT WoS	000400488301124
Keywords in English	Production; Processor scheduling; Distributed databases; Optimization; Computational modeling; Planning; Bandwidth
Tags	International impact, Reviewed
Changed by	doc. Mgr. Hana Rudová, Ph.D., učo 3840. Changed: 4/9/2018 11:24.

Abstract

An important class of modern big data applications is distributed data production in High Energy and Nuclear Physics (HENP). Such data-intensive computations rely heavily on geographically distributed resources featuring hundreds of thousands of CPUs and petabytes of storage. Unfortunately, classical job scheduling approaches either do not address all aspects of the case or do not scale appropriately. Previously we developed a new job scheduling approach dedicated to distributed data production, where load balancing across sites is provided by forwarding data in a peer-to-peer manner, guided by a centrally created and periodically updated plan that aims to achieve global optimality. Because many HENP experiments utilize distributed storage, in this work we provide an important generalization of our approach to consider multiple sources of input data. The underlying network flow model is also extended to enable optimization on various additional criteria on top of flow maximization, making it versatile for a wide scope of potential use cases. In this study such additional optimization was used for more efficient reasoning with multiple data sources: balancing their usage and planning the initial data distribution. These two considerations make it possible to reduce the influence of network bottlenecks at the early and late stages of data production. The simulations carried out in this work allow us to test our approach on a more general case of networks and servers, not limited to the specifics of HENP infrastructure. In all of the simulations our planner showed a significant improvement in both average throughput and makespan over the typically used pull scheduling approach.
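
The abstract mentions flow maximization over a network with multiple input data sources. The sketch below is not the authors' planner; it only illustrates the textbook reduction that such a formulation commonly relies on: several sources are attached to one artificial super-source, compute sites drain into one super-sink, and a standard maximum-flow algorithm (here Edmonds-Karp) is run on the result. All node names (`src`, `S1`, `S2`, `C1`, `C2`, `sink`) and all capacities are hypothetical values chosen for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a graph given as {u: {v: capacity}}."""
    # Build the residual graph, adding zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left: the flow is maximal

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy network: two storage sites (S1, S2) feed two compute
# sites (C1, C2). Capacities are illustrative units of data per time step.
toy_network = {
    "src": {"S1": 12, "S2": 6},   # super-source -> storage read rates
    "S1": {"C1": 10, "C2": 5},    # network links
    "S2": {"C2": 10},
    "C1": {"sink": 8},            # compute sites -> super-sink (CPU rates)
    "C2": {"sink": 7},
}

throughput = max_flow(toy_network, "src", "sink")
print(throughput)
```

In this toy instance the compute capacity is the limiting cut, so adding more storage bandwidth would not raise the result. The additional optimization criteria described in the abstract (balancing source usage, planning the initial distribution) are not modeled here.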
