THE FUTURE OF HPC - CONSIDERING SOME MYTHS
Erwin Laure, 29.4.2024

EXASCALE COMPUTING IS HERE (US, CN?)
Frontier @ OLCF (US): HPE/Cray
• AMD EPYC CPUs, AMD MI250 GPUs
• 8.7 M CPU cores & GPU compute units
• 52 GFlop/s/W
• 1102 PFlop/s HPL, rank 1 in Top500 11/2022

AND SOON IN EUROPE
• MareNostrum 5 @ BSC (Barcelona, Spain): one more pre-exascale system, 2023
• Leonardo
• "Jupiter" @ Jülich will be the first European exascale system (500 M€), 2024

[Top500 "Projected Performance Development" plot: Sum, #1, and #500 performance, 1990-2025, 100 MFlop/s to 10 EFlop/s]
• Globally, the rate of performance increase is diminishing
• Some jumps in Top1 are still expected, but what then?

WHAT'S NEXT - THE ZETTAFLOPS SUPERCOMPUTER?
from: https://www.nextbigfuture.com/2023
Intel's "Path to Zettascale", starting from today's tech foundation (a back-of-envelope power estimate follows below):
• Process: 5x
• Architecture: 16x
• Power & thermals: 2x
• Data movement: 3x

WHAT'S NEXT - AI WILL TAKE OVER?
[NVIDIA GTC 2024 (keynote March 18, San Jose, CA and virtual): "Don't Miss This Transformative Moment in AI - watch NVIDIA CEO Jensen Huang's GTC keynote to catch all the announcements on AI advances that are shaping our future."]

WHAT'S NEXT - QUANTUM?
from: https://www.itwm.fraunhofer.de/en/departments/hpc/quantum-com
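Before turning to AI and quantum, the zettaflops question above invites a quick reality check. A back-of-envelope sketch in Python (the inputs are the Frontier figures from the slides above; the extrapolation deliberately assumes no efficiency gains):

```python
# Power draw of a hypothetical zettascale machine at today's efficiency.
# Inputs are the Frontier numbers quoted above; purely illustrative.
frontier_hpl = 1.102e18   # Flop/s (HPL)
efficiency = 52e9         # Flop/s per watt (Frontier-class)

frontier_mw = frontier_hpl / efficiency / 1e6
print(f"Frontier: ~{frontier_mw:.0f} MW")                       # ~21 MW

zetta_gw = 1e21 / efficiency / 1e9
print(f"1 ZFlop/s at the same efficiency: ~{zetta_gw:.0f} GW")  # ~19 GW
```

A ~19 GW machine is clearly out of the question, which is why Intel's roadmap has to multiply improvements across process, architecture, power, and data movement; note that even the advertised factors give 5 x 16 x 2 x 3 = 480, i.e. only about half of the full 1000x step from exa- to zettascale.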
"MYTHS" IN HPC
S. Matsuoka, J. Domke, M. Wahib, A. Drozd, and T. Hoefler
https://doi.org/10.48550/arXiv.2301.02432

Myths and Legends in High-Performance Computing
Satoshi Matsuoka, Jens Domke, Mohamed Wahib, Aleksandr Drozd, Torsten Hoefler

Abstract: In this humorous and thought-provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We collected those myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within (and beyond) our community. We believe they represent the Zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore's law. While some laws end, new directions open up, such as algorithmic scaling or novel architecture research. However, these myths are rarely based on scientific facts but often on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates, such as the question whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.

Keywords: Quantum; zettascale; deep learning; clouds; HPC myths

Introduction: Any human society has their myths and legends - this also applies to the high-performance computing (HPC) community. HPC drives the largest and most powerful computers and latest computing and acceleration technologies forward. One may think that it's scientific reasoning all the way down in such an advanced field. Yet, we find many persistent myths revolving around trends of the moment [...]

Myth 1: Quantum Computing Will Take Over HPC! Numerous articles are hyping the quantum computing revolution affecting nearly all aspects of life, ranging from quantum artificial intelligence to even quantum gaming. The whole IT industry is following the quantum trend and conceives quickly growing expectations. The actual development of quantum technologies, algorithms, and use cases is on a very different time scale. Most practitioners would not expect quantum computers to outperform classical [...]

THE 12 "MYTHS" IN HPC
Myth 1: Quantum Computing Will Take Over HPC!
Myth 2: Everything Will Be Deep Learning!
Myth 3: Extreme Specialization as Seen in Smartphones Will Push Supercomputers Beyond Moore's Law!
Myth 4: Everything Will Run on Some Accelerator!
Myth 5: Reconfigurable Hardware Will Give You 100X Speedup!
Myth 6: We Will Soon Run at Zettascale!
Myth 7: Next-Generation Systems Need More Memory per Core!
Myth 8: Everything Will Be Disaggregated!
Myth 9: Applications Continue to Improve, Even on Stagnating Hardware!
Myth 10: Fortran Is Dead, Long Live the DSL!
Myth 11: HPC Will Pivot to Low or Mixed Precision!
Myth 12: All HPC Will Be Subsumed by the Clouds!

MYTH 9: APPLICATIONS CONTINUE TO IMPROVE, EVEN ON STAGNATING HARDWARE
STAGNATING HARDWARE? AN INCONVENIENT TRUTH

ALGORITHMIC MOORE'S LAW
• Should we dramatically increase investments in software?
• Will the "Algorithmic Moore's Law" end soon as well?
• Are we willing to refactor/rewrite legacy codebases?
[Figure 3: Examples of "Algorithmic Moore's Law" for different areas in HPC, 1980-2020; fusion energy and combustion simulation data by Keyes (2022), climate simulation data by Schulthess (2016)]
https://doi.org/10.48550/arXiv.2301.02432

MYTH 4: EVERYTHING WILL RUN ON SOME ACCELERATOR!
LARGE UNEXPLORED TERRITORY - WILL IT BE TAKEN UP?
[Figure 1: Classification of compute kernels - compute-bound (cf. Top500/HPL), memory-bandwidth-bound (cf. HPCG), and memory-latency-bound (cf. Graph500) - and the supercomputing architectures that serve them:]
• Classic vector machines (e.g., Earth Simulator), 1990s
• COTS-CPU-based clusters, late 1990s to late 2000s (ASCI machines, Tsubame1/T2K, Jaguar, K): standard memory technologies (DDR DRAM), massively parallel
• GPU-based "heterogeneous" machines, with high compute, bandwidth, and latency performance on the GPU: Tsubame2/3, ABCI, Summit, Piz Daint, Frontier, Aurora, ...
• Fujitsu A64FX (Fugaku), Intel Sapphire Rapids: incorporating high-bandwidth vectors and a good software ecosystem
• Unexplored but promising: GPU/matrix CPUs, CPU/PIM for bandwidth-bound kernels, strong-scaling CGRAs for compute- and latency-bound kernels (open questions: programmability, performance, industry adoption, ...)
https://doi.org/10.48550/arXiv.2301.02432

MYTH 2: EVERYTHING WILL BE DEEP LEARNING
Original slide courtesy Rick Stevens, ANL, 2023
AI for Science with HPC - foundations and applications:
• augmented simulations: parameter search, surrogate models, optimization
• controlling complex systems and simulations / digital twins: manufacturing, reactors, mobility
• foundational models for science: distillation, integration, protocols, synthesis, Q+A
• materials design & optimization: industrial structures, devices, proteins
• developing scientific codes: debugging, security, optimization, coding, translation
• experiments & simulations: planning, discovery & instruments, evaluation, HPC control, an "AGI scientist" with foundational models
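Of the application patterns above, surrogate models are the easiest to make concrete. The following is a minimal, purely illustrative sketch (a toy 1-D "simulator" and a polynomial fit in plain NumPy; real workflows replace the polynomial with a neural network over a high-dimensional parameter space):

```python
import numpy as np

# Toy stand-in for an expensive simulation (in reality: minutes to hours per call).
def simulate(x):
    return np.sin(3.0 * x) * np.exp(-0.5 * x)

rng = np.random.default_rng(0)

# 1) Run the expensive code at a modest number of sampled parameter points.
x_train = rng.uniform(0.0, 4.0, size=40)
y_train = simulate(x_train) + 0.02 * rng.normal(size=x_train.size)  # observation noise

# 2) Fit a cheap surrogate to the simulation results.
surrogate = np.poly1d(np.polyfit(x_train, y_train, deg=9))

# 3) Explore the parameter space with the surrogate instead of the simulator.
x_sweep = np.linspace(0.0, 4.0, 10_000)
x_best = x_sweep[np.argmin(surrogate(x_sweep))]

# 4) Spend one real simulator call to verify the candidate.
print(f"Surrogate suggests a minimum near x = {x_best:.2f}, f = {simulate(x_best):.3f}")
```

Steps 3 and 4 are the "parameter search" and "optimization" boxes on the slide in miniature: thousands of cheap surrogate evaluations propose candidates, and only the candidates go back to the expensive first-principles code.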
AI TRAINING IS NOW THE FOREFRONT OF HIGH-END HPC (AND THUS THE FREE RIDE ON HPC IS OVER)
Slide from FugakuGPT / Rio Yokota; data collected and shared by Nikoli Dryden (LLNL); slide courtesy Satoshi Matsuoka
[Plot: language-model size (1 billion to 100 trillion parameters) vs. year, against Top500 #1 systems (Sunway TaihuLight 93 PF to ORNL Frontier 1102 PF; Top500 #1 performance grows roughly 2x every 2 years). Representative models and training resources: Transformer (213 M; 8 P100s / 3.5 days), GPT (110 M; 8 P100s / 1 month), BERT (340 M; 64 TPUv3s / 4 days), Megatron-LM (8.3 B; 512 V100s / 2 days per epoch), Turing-NLG (17 B), GPT-3 (175 B), Jurassic-1 (178 B), PanGu-alpha (200 B; 2048 Ascend 910s), Gopher (280 B; 4096 TPUv3s), Megatron-Turing NLG (530 B), PaLM (540 B; 6144 TPUv4s), GLaM (1.2 T; 1024 TPUv4s), Chinchilla (70 B), LaMDA (137 B; 1024 TPUv3s / 57.7 days). GPT-4: requiring Top500 Top-5 capabilities. Model size grew ~1,000,000x in 5 years!]

The rapid evolution of large language models (LLMs) leading up to GPT-4 can be attributed to scaling, which in turn has been supported by "free ride" or "low-hanging fruit" advancements in supercomputing technologies, such as weak scaling, low-precision arithmetic on GPUs, matrix-multiplication engines, high-bandwidth memory (HBM), and high-bandwidth interconnects. Coincidentally, the GPT-3.5/4.0 revolution occurred when the computational resources utilized became equivalent to those of top-tier supercomputers. The development of successor models, e.g. GPT-5, will slow down as the era of the "free ride" ends, making progress proportional to the evolution of supercomputers. Moving forward, it is important to focus on research into large-scale supercomputer AI systems, along with how to incorporate domain-specific knowledge into the foundational models.
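The "Top500 Top-5 capabilities" claim can be sanity-checked with the standard transformer training-cost estimate FLOPs ≈ 6·N·D (Kaplan et al., 2020). A rough sketch; the GPT-3 figures are the published ones, while the sustained rate is an assumption:

```python
# Rough training-cost estimate for a GPT-3-class model: FLOPs ~ 6 * N * D.
n_params = 175e9   # GPT-3 parameters (from the slide)
n_tokens = 300e9   # GPT-3 training tokens (Brown et al., 2020)

total_flops = 6 * n_params * n_tokens   # ~3.2e23 Flop

sustained = 0.5e18  # assumed sustained mixed-precision rate of an exascale-class system
days = total_flops / sustained / 86_400
print(f"~{total_flops:.1e} Flop -> ~{days:.1f} days of exclusive machine time")
```

A single GPT-3-scale run thus occupies a machine near the top of the Top500 for days, and each further model generation multiplies N and D again: precisely the point at which the "free ride" ends.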
A FEW EXAMPLES OF AI IN "CLASSICAL" (E)SCIENCE
From collaborations of MPCDF with various Max Planck Institutes

PREDICTION OF 3D PROTEIN STRUCTURES
ENABLING AI SYSTEMS IN COMPUTATIONAL BIOLOGY FOR A BROAD USER BASE
AlphaFold2 [1,2]
• Deep-learning system to predict the 3D structure of proteins based on their linear sequence of amino acids
• Adapted and optimized by MPCDF early on for use on supercomputers with GPU acceleration
• High demand and extreme I/O requirements, mitigated by using dedicated NVMe-based storage systems
• Very large and broad user base, encompassing theoretical, interdisciplinary, and experimental groups
[Figure: input amino-acid sequence (CASP target T1037, 404 residues) with computational predictions vs. experimental results: T1037 / 6vr4, 90.7 GDT (RNA polymerase domain); T1049 / 6y4f, 93.3 GDT (adhesin tip)]
[1] Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.
[2] https://github.com/deepmind/alphafold

RECOGNITION OF CRYSTAL STRUCTURES
A COLLABORATION OF MPI FÜR EISENFORSCHUNG AND MPCDF
Automatic analysis of atom probe tomography data
• A convolutional neural network has been developed which can reconstruct 3D crystal structures from atom probe tomography data
• The method dramatically speeds up the analysis of micrographs
• The method has been extended to reliably detect chemical short-range order (CSRO) in crystalline structures
[Figure: Al-Al interatomic vector tomograms fed through convolution (C) and pooling (P) layers]
Y. Li, T. Colnaghi, A. Marek, et al., npj Comput. Mater. 7, 8 (2021)

SISSO++
A COLLABORATION OF THE FRITZ HABER INSTITUTE, MPCDF, AND THE EU COE NOMAD
SISSO, a deterministic symbolic-regression method, extracts mathematical expressions directly from data in two steps:
1. create a (huge) pool of analytical expressions through iterative combinations of the input features
2. select optimal candidates for the desired properties through (regression) analysis of these expressions and their linear combinations
   • feature screening: SIS (sure independence screening) + L0 regularization
   • Y = a0 + a1*x1 + a2*x2 + ... (for x1, x2, ... in the generated features)
• SISSO++: open-source software (Purcell et al., JOSS 7(71), 3960, 2022)
• cross-platform, with GPU acceleration through a unified performance-portability layer (Kokkos) over CUDA, HIP, DPC++, OpenACC, and OpenMP
• scientific application highlight: AI-guided identification of >50 strongly thermally insulating materials for thermoelectric elements (devices able to convert otherwise wasted heat into useful electrical voltage)
Purcell et al., npj Comput. Mater. 9, 112 (2023)
Y. Yao, S. Eibl, M. Rampp, L. Ghiringhelli, T. Purcell, M. Scheffler (in preparation)
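To make the two SISSO steps concrete, here is a deliberately miniature Python analogue: a hand-built pool of candidate expressions, correlation-based SIS, and brute-force subset selection standing in for L0 regularization. This is a toy for intuition only; real applications use SISSO++:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Toy data: four primary features, target generated by a hidden expression.
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] * X[:, 1] - 0.5 * np.abs(X[:, 2]) + 0.1 * rng.normal(size=200)

# Step 1: build a (here: tiny) pool of candidate analytical expressions.
pool = {f"x{i}": X[:, i] for i in range(4)}
pool.update({f"x{i}*x{j}": X[:, i] * X[:, j] for i, j in combinations(range(4), 2)})
pool.update({f"|x{i}|": np.abs(X[:, i]) for i in range(4)})

# Step 2a: SIS -- keep the expressions most correlated with the target.
def sis_score(name):
    return abs(np.corrcoef(pool[name], y)[0, 1])

top = sorted(pool, key=sis_score, reverse=True)[:5]

# Step 2b: L0-style selection -- try all small subsets, keep the best linear fit.
def fit_residual(combo):
    A = np.column_stack([pool[n] for n in combo] + [np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[1].sum()

best = min((c for r in (1, 2) for c in combinations(top, r)), key=fit_residual)
print("Selected descriptors:", best)  # expected: ('x0*x1', '|x2|') or similar
```

The real method iterates step 1 to build pools of billions of expressions and performs the screening and sparse regression at scale, which is where the Kokkos-based GPU acceleration of SISSO++ comes in.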
GANS FOR CHEMICAL STRUCTURE GENERATION
A COLLABORATION OF MPI FHI AND MPCDF
Generate relevant chemical structures
• Obtaining chemical structures for interesting configurations is hard, since the most stable (measured) ones are "boring"
• Design and train a physics-informed generative model which can create physically correct but very interesting structures
• The generated structures will then be used for calculations of material properties
[Figure: the most stable configuration is irrelevant for chemistry; GAN trial moves steer guesses from irrelevant towards relevant regions of configurational space]
P. König et al., presentation at the SKM 2023

SYNTHSEG
A COLLABORATION OF MPI CBS AND MPCDF
Synthetic image generation for segmentation networks
• Instead of training on expensive (and hard-to-obtain) real MRI scans, a massive and diverse synthetic dataset is generated
• The synthetic images are obtained via a generative model that takes real existing label maps as input
• The generative model is tuned to produce images that resemble real MRI scans
• The final segmentation model (the well-proven 3D U-Net) is trained on this generated dataset
[Figure, training pipeline: input labels -> deformation -> GMM sampling -> bias field -> downsampling -> training inputs]

3D MAPPING OF CLOUD COMPLEXES IN THE MILKY WAY
A COLLABORATION OF MPI FOR ASTRONOMY AND MPCDF
Automatic density reconstruction from distance and optical-IR extinction measurements
• A new algorithm (based on Bayesian statistics) to infer a 3D density distribution from distance and extinction measurements has been optimized by MPCDF to handle better-resolved inference grids
• A catalog with the 3D density distributions of 16 molecular cloud complexes of the Milky Way could be generated
T. E. Dharmawardena et al., "The 3D structure of Galactic molecular cloud complexes out to 2.5 kpc", MNRAS (2022)

SUMMARY
• AI methods are being explored in many scientific domains
  • already in production in some
  • the "black-box" approach is sometimes viewed critically
  • ongoing effort on validation, trustworthiness, error estimation, etc.
• Potential to speed up many tedious tasks
  • pruning search spaces, creating new study objects via generative models, steering simulations, etc.
• But will they replace first-principles simulations?
  • and if so, should the physical model be changed?
• Doubtless, we will see many more (and surprising) adoptions of AI methods in (e)Science

MYTH 1: QUANTUM COMPUTING WILL TAKE OVER HPC
Scientific analysis (not hype) of the utility of quantum computing:
• For practical "quantum supremacy", an exponential speedup over the classical algorithm is necessary
  • Many algorithms achieve only a quadratic speedup and will thus lose to classical machines in practice
  • e.g., Shor's algorithm: exponential => good
  • e.g., Grover's algorithm: quadratic => no good
• Among "pure" quantum algorithms, none exist that exhibit quadratic speedup and can be executed practically on current NISQ machines with ~100 qubits
  • Shor's algorithm may break RSA-2048 in the far future, but will require 20-200 million NISQ qubits (https://arxiv.org/pdf/1905.09749.pdf)
• Hybrid algorithms, e.g. variational algorithms such as VQE, might be useful in the much nearer future
• We require a platform to conduct scientific analysis of QC, with as many qubits as possible, using real state-of-the-art machines and simulators!
slide courtesy Satoshi Matsuoka
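Matsuoka's "quadratic => no good" verdict, like the Hoefler/Häner/Troyer analysis quoted on the next slide, comes down to simple arithmetic: quantum logical operations are many orders of magnitude slower than classical ones, so a quadratic speedup only pays off at enormous problem sizes. A minimal sketch (both rates are assumptions, the quantum one chosen generously):

```python
# Break-even problem size for a Grover-style quadratic speedup.
# Classical: N ops at rate r_c.  Quantum: sqrt(N) ops at rate r_q, with r_q << r_c.
r_c = 1e15  # classical op/s on one accelerated node (assumed)
r_q = 1e4   # quantum logical op/s after error correction (assumed, generous)

# Break-even: N / r_c = sqrt(N) / r_q  =>  sqrt(N) = r_c / r_q
n_even = (r_c / r_q) ** 2
runtime_days = n_even / r_c / 86_400
print(f"Break-even at N ~ {n_even:.0e}")
print(f"...where either machine already runs for ~{runtime_days:.0f} days")
```

Anything that makes the quantum oracle more expensive, or lets the classical side parallelize across nodes, pushes the crossover even further out; hence the focus on super-quadratic, ideally exponential, speedups.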
Torsten Hoefler, Thomas Häner, Matthias Troyer: "Disentangling Hype from Practicality: On Realistically Achieving Quantum Advantage", Communications of the ACM, May 2023, Vol. 66 No. 5, Pages 82-87, https://doi.org/10.1145/3571725

Abstract: Quantum computers offer a new paradigm of computing with the potential to vastly outperform any imaginable classical computer. This has caused a gold rush towards new quantum algorithms and hardware. In light of the growing expectations and hype surrounding quantum computing, we ask which are the promising applications to realize quantum advantage. We argue that small-data problems and quantum algorithms with super-quadratic speedups are essential to make quantum computers useful in practice. With these guidelines one can separate promising applications for quantum computing from those where classical solutions should be pursued. While most of the proposed quantum algorithms and applications do not achieve the necessary speedups to be considered practical, we already see a huge potential in materials science and chemistry. We expect further applications to be developed based on our guidelines.

Practical and impractical applications (excerpt): "We can now use the above considerations to discuss several classes of applications where our fundamental bounds draw a line for quantum practicality. The most likely problems to allow for a practical quantum advantage are those with exponential quantum speedup. This includes the simulation of quantum systems for problems in chemistry, materials science, and quantum physics, as well as cryptanalysis using Shor's algorithm [13]. The solution of linear systems of equations for highly structured problems [10] also has an exponential speedup, but the I/O limitations discussed above undo this advantage if knowledge of the full solution is required (as opposed to an aggregate measure obtained by sampling the solution). Equally importantly, we identify dead ends in the maze of applications: quadratic quantum speedups, such as many current machine learning training approaches, drug design and protein folding with Grover's algorithm, speeding up Monte Carlo simulations through quantum walks, as well as more traditional scientific computing simulations including the solution of linear systems of equations, such as fluid dynamics in the turbulent regime, will not achieve quantum advantage with current quantum algorithms in the foreseeable future. Moreover, the identified I/O limits constrain the performance of quantum computing for big-data problems, unstructured linear systems, and database search based on Grover's algorithm such that a speedup is unlikely in those cases. These considerations help with separating hype from practicality in the quantum computing race and can guide algorithmic developments. Specifically, our analysis shows that the community needs to focus on super-quadratic speedups, ideally exponential speedups, and has to carefully consider I/O bottlenecks when deriving algorithms to exploit quantum computation best. Therefore, the most promising candidates for quantum practicality are small-data problems with exponential speedup, above all simulation problems in chemistry and materials science."

LIKELY/NEEDED QUANTUM DEVELOPMENTS
• More research into algorithms
• QC is good for big compute on little data, bad on big data
• QC will likely serve as an "accelerator" for certain problems within a classical workflow
  • the most common strategy adopted worldwide today, including by EuroHPC
• Commercial viability of QC?

CONCLUSIONS
• We see a lot of hype and myths in HPC
  • some might become reality, some not
• There is a lot more than (today's) hardware/FLOPS-focused exascale computing
  • we need scientific approaches, not hype
• Realize that the current hardware market is driven by AI, not HPC
  • be pragmatic and adapt