D 2016

Network Flows for Data Distribution and Computation

MAKATUN, Dzmitry; Jerome LAURET; Hana RUDOVÁ and Michal ŠUMBERA

Basic information

Original title

Network Flows for Data Distribution and Computation

Authors

MAKATUN, Dzmitry; Jerome LAURET; Hana RUDOVÁ and Michal ŠUMBERA

Edition

USA, 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1-8, 8 pp. 2016

Publisher

IEEE

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

United States

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

Marked to be transferred to RIV

Yes

RIV identification code

RIV/00216224:14330/16:00094080

Organization unit

Faculty of Informatics

ISBN

978-1-5090-4240-1

EID Scopus

Keywords in English

Production; Processor scheduling; Distributed databases; Optimization; Computational modeling; Planning; Bandwidth

Tags

International impact, Reviewed
Changed: 4 September 2018 11:24, doc. Mgr. Hana Rudová, Ph.D.

Abstract

In the original

An important class of modern big data applications is distributed data production in High Energy and Nuclear Physics (HENP). Such data-intensive computations rely heavily on geographically distributed resources featuring hundreds of thousands of CPUs and petabytes of storage. Unfortunately, classical job scheduling approaches either do not address all aspects of the case or do not scale appropriately. Previously we developed a new job scheduling approach dedicated to distributed data production, in which load balancing across sites is provided by forwarding data in a peer-to-peer manner, guided by a centrally created and periodically updated plan that aims to achieve global optimality. Because many HENP experiments utilize distributed storage, in this work we provide an important generalization of our approach that considers multiple sources of input data. The underlying network flow model is also extended to enable optimization of various additional criteria on top of flow maximization, making it versatile for a wide scope of potential use cases. In this study, such additional optimization was used for more efficient reasoning with multiple data sources: balancing their usage and planning the initial data distribution. These two considerations reduce the influence of network bottlenecks at the early and late stages of data production. The simulations carried out in this work allow us to test our approach on a more general case of networks and servers, not limited to the specifics of HENP infrastructure. In all of the simulations, our planner showed a significant improvement in both average throughput and makespan over the typically used pull scheduling approach.
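To illustrate the idea of planning data transfers as a network flow with multiple input sources, the following is a minimal sketch and not the authors' model or planner. It assumes a hypothetical topology in which two storage sites feed three computing sites, link capacities stand for bandwidth, and edges into the sink stand for processing rates, all in arbitrary units. The node names, the capacities, and the helper `max_flow` are assumptions made for this example. Multiple sources are handled with the standard super-source construction, and the maximum flow is computed with the Edmonds-Karp algorithm. The additional optimization criteria mentioned in the abstract are not modeled here.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    # Residual capacities, with zero-capacity reverse edges added explicitly.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
            residual[v][u] += 0

    total = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left: the flow is maximal

        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical topology (assumption): "S" is a super-source joining two
# storage sites; edges into "T" model the processing rate of each site.
capacity = {
    "S": {"storage_A": 100, "storage_B": 100},
    "storage_A": {"site_1": 4, "site_2": 3},
    "storage_B": {"site_2": 2, "site_3": 5},
    "site_1": {"T": 3, "site_2": 2},  # site_1 can forward data to site_2
    "site_2": {"T": 6},
    "site_3": {"T": 4},
}

throughput = max_flow(capacity, "S", "T")
print(throughput)
```

In this toy network the processing rates are the limiting factor: `site_1` processes part of what it receives and forwards the remainder to `site_2`, which mirrors the peer-to-peer forwarding described in the abstract.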