Bi7740: Scientific computing
Introduction to parallel computing

Vlad Popovici
popovici@iba.muni.cz
Institute of Biostatistics and Analyses, Masaryk University, Brno

Supplemental bibliography:
- Kepner: Parallel Matlab. SIAM, 2009
- MathWorks: Parallel Computing Toolbox. User's Guide
- McCallum: Parallel R. O'Reilly, 2012

"I think there is a world market for maybe five computers." (Thomas Watson, chairman of IBM, 1943)
"There is no reason for any individual to have a computer in their home." (Ken Olson, founder of Digital Equipment Corporation, 1977)
"640K of memory ought to be enough for anybody." (Bill Gates, chairman of Microsoft, 1981)

Outline
1 A historical perspective
2 Why parallel computing?
3 Principles of parallel computing
  - Introduction
  - Programming models
  - Implementations

A historical perspective

- ~2500 BC: Babylon - the first abacus
- ~100 BC: the Antikythera device, believed to be the first mechanical computer
- first half of the 19th century: Charles Babbage's difference engine (to tabulate polynomials) and analytical engine (design only)
- 1941: the Z3 computer by Konrad Zuse: the first programmable, fully automatic computing machine
(source: Wikipedia)

~1840: Charles Babbage produces the difference engine, a mechanical computer. (figure; source: Wikipedia)

1941: the Z3, an electro-mechanical computer: ~2000 relays, 22-bit words, operating at 5-10 Hz. (figure; source: Wikipedia)

1946: ENIAC (Electronic Numerical Integrator And Computer), used initially by the US Army to compute artillery firing tables. Used vacuum tubes as switches.

1976: Cray-1 - the first successful supercomputer.

...fast forward: Tianhe-2 (top supercomputer as of Nov. 2013): 33.86 PFlop/s, 3,120,000 cores, 1,024,000 GB of memory, Intel Xeon CPUs.

Moore's law
Gordon E. Moore (co-founder of Intel): "Cramming More Components onto Integrated Circuits", Electronics Magazine, 1965. (figure; source: Wikipedia)
Software and hardware

Software crises:
- '60s-'70s: assembly language too difficult to use for large, complex problems → Fortran, C: abstraction and portability for uniprocessors
- '80s-'90s: difficulties in maintaining complex systems → object-oriented programming (C++, Java)
- ~2000s: sequential performance lags behind Moore's law, while programmers have become oblivious to the hardware (better compilers, higher-level languages, virtual machines)

Why parallel computing?

parallel computing: using multiple execution units concurrently to solve a problem

examples:
- multi-core processors: several processors (cores) on one chip
- shared-memory processors (SMP): several processors interconnected through a shared memory
- cluster computers: several computers interconnected through a high-speed network

Issues with the traditional model: power density (Ross: Why CPU Frequency Stalled, IEEE Spectrum Magazine, 2008). (figures)

Issues cont'd: gains from implicit parallelism are tapped out.
Example: instruction-level parallelism - each machine instruction is decomposed into 4 stages (fetch, decode, execute, write-back) that the processor overlaps across consecutive instructions.

Issues cont'd. Other issues:
- increase in production costs (decrease in "chip yield")
- increase in the amount of data to be processed

Solution: explicit parallelism
- multi-core
- multi-processor
- multi-machine

Principles of parallel computing

Introduction

Principles:
- identifying parallelism
- granularity: more, smaller tasks or fewer, larger ones?
- locality: where data and instructions reside
- load balance; aim: no lost CPU cycles
- synchronization overhead

Identifying parallelism

Amdahl's law:

    S_n = \frac{T_1}{T_n} \le \frac{1}{\alpha + (1-\alpha)/n} \le \frac{1}{\alpha}

where α is the fraction of the program that is strictly sequential, T_i is the execution time on i processors and S_i is the speed-up obtained by using i processors instead of 1. For example, with α = 0.1 the speed-up can never exceed 10, no matter how many processors are used (see the sketch below).
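Not from the original slides: a minimal C sketch evaluating the Amdahl bound for a few processor counts (the sequential fraction α = 0.1 is a hypothetical value, chosen only for illustration):

    #include <stdio.h>

    /* Amdahl's law: upper bound on the speed-up with n processors,
       S_n <= 1 / (alpha + (1 - alpha)/n), alpha = sequential fraction */
    static double amdahl_bound(double alpha, int n)
    {
        return 1.0 / (alpha + (1.0 - alpha) / n);
    }

    int main(void)
    {
        const double alpha = 0.1;   /* hypothetical: 10% strictly sequential */
        for (int n = 1; n <= 1024; n *= 4)
            printf("n = %4d  ->  S_n <= %.2f\n", n, amdahl_bound(alpha, n));
        /* the bound approaches, but never exceeds, 1/alpha = 10 */
        return 0;
    }

The bound climbs quickly at first (about 3.1 for n = 4) and then flattens out towards 1/α = 10, which is why reducing the sequential fraction matters more than adding processors.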
Identifying parallelism

implicit parallelism:
- hardware level: superscalar processors, multi-core, cluster computing
- compiler level: parallelizing compilers

explicit parallelism:
- programming language level
- library level

Processing architectures

Flynn's taxonomy (the "old way"): Single/Multiple Instruction × Single/Multiple Data. (figure; source: Wikipedia)
Examples: SISD: mainframes; SIMD: GPUs; MISD: fault-tolerant systems; MIMD: most computers nowadays.

Locality: a box in a box in a box... (figure)

Computing topologies. (figure; source: Grama - Introduction to Parallel Computing)

Shared memory: multicore or multi-CPU machines.
Distributed memory: clusters with single-CPU nodes.

Hybrid systems
- a limited number of CPUs have access to a pooled memory
- using more CPUs implies communication over the network, through message-passing

Hybrid systems with multicore CPUs
- extension of the hybrid model
- communication becomes increasingly complex
- many levels in the memory hierarchy: cache(s), local main memory, other nodes' memory, etc.
- accelerators (e.g. GPUs) can be added
- requires a new programming model and different communication protocols

Load balancing
- aim: distribute the load (work) evenly over all available resources...
- ...and thus minimize the time any resource sits idle
- causes of imbalanced load: insufficient parallelism; unequal task sizes (poor design?)

Programming models

Types of parallelism
- data parallelism: each processor performs the same task on different data (h/w: SIMD, MIMD)
- task parallelism: each processor performs a different task on the same data (h/w: MISD, MIMD)
- usually, both types of parallelism are present (see the sketch below)
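Not from the original slides: a minimal OpenMP sketch in C contrasting the two flavors. The parallel for applies the same operation to different chunks of the data (data parallelism), while the two sections apply different operations concurrently (task parallelism); the array contents are arbitrary illustrative values.

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], sum = 0.0, prod = 1.0;

        /* Data parallelism: every thread applies the same operation
           to a different chunk of the array. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* Task parallelism: different threads perform different
           operations on the same data. */
        #pragma omp parallel sections
        {
            #pragma omp section
            { for (int j = 0; j < N; j++) sum += a[j]; }
            #pragma omp section
            { for (int j = 1; j <= 10; j++) prod *= a[j]; }
        }
        printf("sum = %g, prod = %g\n", sum, prod);
        return 0;
    }

Compiled with e.g. gcc -fopenmp; without OpenMP the pragmas are ignored and the same source still runs correctly as a serial program, one of the attractions of the directive-based approach.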
Example: re-annotation of a microarray chip (an embarrassingly parallel problem)

Problem: map (BLAST) each probe from a microarray against the latest version of the human genome (RefSeq).

Naive implementation on 2 CPUs - a single program that branches on the CPU it runs on:

    program:
        ...
        if CPU == 'CPU1' then
            idx = 1, ..., Np/2
        else if CPU == 'CPU2' then
            idx = Np/2 + 1, ..., Np
        endif
        BLAST(Probes[idx])
        ...

or, equivalently, two single-purpose programs:

    program 1:                      program 2:
        ...                             ...
        idx = 1, ..., Np/2              idx = Np/2 + 1, ..., Np
        BLAST(Probes[idx])              BLAST(Probes[idx])
        ...                             ...

Better ways of distributing the data exist for this problem! E.g., distribute the RefSeq database as well...

Problem decomposition
- split the computations into concurrent tasks
- build the task-dependency graph
- there is no one-size-fits-all technique
- some methods: recursive decomposition, data decomposition, exploratory decomposition and speculative decomposition

Recursive decomposition: example

Problem: find the minimum of a vector.

    proc serial_min(A, n)
        min = A[1]
        for i = 2 to n do
            if A[i] < min then min = A[i]
        end for
        return min
    end serial_min

    proc rec_min(A, i, j)
        if i == j then
            min = A[i]
        else
            mid = (i + j) / 2
            lmin = rec_min(A, i, mid)
            rmin = rec_min(A, mid + 1, j)
            if lmin < rmin then min = lmin else min = rmin
        end if
        return min
    end rec_min

(The two recursive calls are independent of each other and can run in parallel.)

Data decomposition: example

Matrix multiplication, A · B = C. Write it in 2 × 2 block form,

    \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
    \cdot
    \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
    =
    \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}

and distribute the four tasks (see the C sketch at the end of this subsection):
- Task 1: C11 = A11 B11 + A12 B21
- Task 2: C12 = A11 B12 + A12 B22
- Task 3: C21 = A21 B11 + A22 B21
- Task 4: C22 = A21 B12 + A22 B22

Other decompositions
- exploratory decomposition: decompose the search space for the solution and search each subspace; then choose among the solutions found
- speculative decomposition: launch alternative computation branches in parallel while waiting for the input that decides which branch is actually needed
- hybrid decompositions

Mapping techniques
- problem decomposition → tasks; the tasks must then be allocated (mapped) to processors/processes
- objective: minimize the execution time
- overheads: time spent on everything except actually solving the problem: inter-process interaction (synchronization and control) and time spent idle (poor load balancing)
- therefore: reduce inter-process dependencies and communication (e.g. maximize data locality), improve load balancing, reduce blocking operations
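Not from the original slides: a minimal C sketch of the four-task matrix decomposition above. The block size, the data values and the use of OpenMP sections as the task-to-thread mapping are all assumptions made for illustration.

    #define B 2            /* block size (hypothetical) */
    #define N (2 * B)      /* full matrix size */

    #include <stdio.h>

    /* c[bi][bj] += a[bi][bk] * b[bk][bj], for one block triple */
    static void block_mul_add(const double a[N][N], const double b[N][N],
                              double c[N][N], int bi, int bk, int bj)
    {
        for (int i = 0; i < B; i++)
            for (int j = 0; j < B; j++)
                for (int k = 0; k < B; k++)
                    c[bi*B + i][bj*B + j] +=
                        a[bi*B + i][bk*B + k] * b[bk*B + k][bj*B + j];
    }

    int main(void)
    {
        double a[N][N], b[N][N], c[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = (i == j); }

        #pragma omp parallel sections
        {
            #pragma omp section   /* Task 1: C11 = A11 B11 + A12 B21 */
            { block_mul_add(a, b, c, 0, 0, 0); block_mul_add(a, b, c, 0, 1, 0); }
            #pragma omp section   /* Task 2: C12 = A11 B12 + A12 B22 */
            { block_mul_add(a, b, c, 0, 0, 1); block_mul_add(a, b, c, 0, 1, 1); }
            #pragma omp section   /* Task 3: C21 = A21 B11 + A22 B21 */
            { block_mul_add(a, b, c, 1, 0, 0); block_mul_add(a, b, c, 1, 1, 0); }
            #pragma omp section   /* Task 4: C22 = A21 B12 + A22 B22 */
            { block_mul_add(a, b, c, 1, 0, 1); block_mul_add(a, b, c, 1, 1, 1); }
        }

        /* b is the identity here, so c should equal a */
        printf("c[1][2] = %g (expect %g)\n", c[1][2], a[1][2]);
        return 0;
    }

Each task writes a different block of C, so the four sections need no synchronization among themselves; with more processors the same idea extends to finer block grids.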
Implementations

Implementations on multithread/multicore machines

POSIX threads (pthreads):
- OS-level parallelism; threads are lightweight processes
- the same program runs on single- or multi-core machines; the OS is responsible for mapping the threads to cores
- requires low-level programming and a dedicated library

OpenMP:
- built on top of threads, for SIMD-like parallelism
- implemented through compiler directives
- easier to use than pthreads
- performance depends on the compiler's 'intelligence'

OpenMP: what does it look like? (a parallel sum reduction)

    double a[N];
    double sum = 0.0;
    int i, n, tid;

    #pragma omp parallel shared(a) private(i, tid)
    {
        tid = omp_get_thread_num();

        /* Only one of the threads does this */
        #pragma omp single
        {
            n = omp_get_num_threads();
            printf("Number of threads = %d\n", n);
        }

        /* Initialize a */
        #pragma omp for
        for (i = 0; i < N; i++) {
            a[i] = 1.0;
        }

        /* Parallel for loop computing the sum of a[i] */
        #pragma omp for reduction(+:sum)
        for (i = 0; i < N; i++) {
            sum = sum + a[i];
        }
    } /* End of parallel region */

Implementations on distributed-memory systems

MPI: Message Passing Interface
- the de facto standard for distributed-memory programming (clusters)
- data must be manually decomposed
- uses dedicated libraries; based on sending and receiving messages, for both data and synchronization

PVM: Parallel Virtual Machine
- an earlier library for cluster programming, also based on the message-passing principle
- supplanted by MPI

MPI: what does it look like?

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int numprocs, myid;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        /* print out my rank and this run's PE size */
        printf("Hello from %d of %d\n", myid, numprocs);

        MPI_Finalize();
        return 0;
    }
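The hello-world above exchanges no data. Not from the original slides: a minimal sketch of an actual message exchange, with rank 0 sending an array of doubles to rank 1 (the payload and tag are arbitrary illustrative choices):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myid;
        double data[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        if (myid == 0) {
            /* send 4 doubles to rank 1, message tag 99 */
            MPI_Send(data, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (myid == 1) {
            /* receive 4 doubles from rank 0, matching the tag */
            MPI_Recv(data, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received: %g %g %g %g\n",
                   data[0], data[1], data[2], data[3]);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2. Ranks, tags and buffer sizes must match between a send and its receive; this manual bookkeeping is the price of the model's generality.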
Implementations in R

- parallelism came as an afterthought; target: massive-data applications
- tries to bring to R some of the libraries existing in other languages
- snow: for traditional clusters; supports PVM, MPI, ...; portable (UNIX, Windows)
- multicore: targets multi-core/multi-CPU machines; simple; does not run on Windows; does not handle parallel RNGs
- parallel: snow + multicore, in newer R (>= 2.14); strange interactions with the OS
- R+Hadoop: based on a Hadoop cluster
- RHIPE: based on Hadoop; targets map-reduce operations
- Segue: apply-like calculations on Hadoop clusters, using Amazon's Elastic MapReduce

Implementations in Matlab

Parallel Computing Toolbox:
- can use multicore machines, GPUs and clusters; still evolving
- parallel for-loops, special array types, parallelized numerical routines
- tries to provide a uniform interface and isolation from the underlying implementation
- runs several workers (computational engines) on a multicore machine, for single-program multiple-data problems
- the same code can be run on a cluster or a grid computing service (requires the Distributed Computing Server!)