FACULTY OF INFORMATICS Masaryk University PA039: Supercomputer Architecture and Intensive Computing Compiling and Code Optimization Luděk Matýska Spring 2024 Luděk Matýska • Compilings • Spring 2024 1/33 FACULTY OF INFORMATICS Masaryk University Repetition - RISC processors ■ Limited number of instructions, same size ■ Simple address modes, Load/Store, sufficient number of registers ■ Delayed branches, branch prediction, out-of-order execution ■ Superscalar (e.g. 2xFPU, 2xALU, special address instructions) ■ Superpipeline ■ Caches Luděk Matýska • Compilings • Spring 2024 2/33 FACULTY OF INFORMATICS Masaryk University Optimizing Compiler ■ Translation to the Intermediate language ■ Optimization ■ intra-procedural analysis ■ cycle optimization ■ global optimization (inter-process optimization) ■ Code generation ■ use of all superscalar units Luděk Matýska • Compilings • Spring 2024 3/33 FACULTY OF INFORMATICS Masaryk University Intermediate Language ■ Quadruple (generally fl-tuple) ■ Instruction: operator, two operands, result ■ Example ■ Operation op writen as: X := Y op Z ■ Memory: accessible through temporary variables tn ■ Branches: condition calculated separately ■ Branches: jumps to absolute addresses Luděk Matýska • Compilings • Spring 2024 4/33 FACULTY OF INFORMATICS Masaryk University Basic translation while ( ] < n ) { } A: B: tl t2 t3 jmp t4 t5 t6 t7 k t8 t9 k = k + j*2 m = j*2 j++ = 3 = n = tl < CB) t3 t2 jmp (C) TRUE k j t5*2 t4+t6 t7 j t8*2 m tie til 3 = t9 = j = tl0+l = til jmp (A) TRUE C: : Luděk Matýska • Compilings • Spring 2024 FACULTY OF INFORMATICS Masaryk University Basic blocks ■ Program is represented as a flow graph ■ BLock - a code segment without branches/jumps ■ One entry and one exit point ■ Block as a DAG (Directed Acyclic Graph) ■ Optimization within blocks ■ Removal of repeated (sub)expressions ■ Removal of redundant variables Luděk Matýska • Compilings • Spring 2024 6/33 FACULTY OF INFORMATICS Masaryk University Directed Acyclic Graph B:: t4 t5 t6 t7 k t8 t9 m tie tn 3 k 3 t5*2 t4+t6 t7 D t8*2 t9 D tl0+l til jmp (A) TRUE "2" J,t5,t8,t10 \ A / K,t4 W7 V f * 1 / \ t6,t9,M\ ^ ^ / \ \ >x \ X **■ \ X **• \ X "v \ X \ X ««■» \ X >. \ x \ X % *x *»» t7,K (+) (±) t11'J Luděk Matýska • Compilings • Spring 2024 7/33 FACULTY OF INFORMATICS Masaryk University Modified translation t4 ■ — k B: : t4 := k t5 ■ — • t5 • t6 ■ — t5*2 t6 := t5*2 t7 • — t4+t6 m := t6 k • — t7 t7 := t6+t4 t8 ■ — • 3 k := t7 t9 : ■ — t8*2 til := t5+l m : — t9 • = til tie : — • 1 jmp (A) TRUE til : — tl0+l • — til jmp (A) TRUE Luděk Matýska • Compilings • Spring 2024 8/33 FACULTY OF INFORMATICS Masaryk University Additional concepts ■ Variables ■ Definition and place of use ■ Cycles ■ Target code generation ■ Includes the so-called peephole optimization Luděk Matýska • Compilings • Spring 2024 9/33 FACULTY OF INFORMATICS Masaryk University Optimized code tl := j t2 := n t3 := tl < jmp (B) t3 jmp (C) TRUE t4 := k t5 := J t6 := t5*2 t7 := t4+t6 k := t7 t8 = j t9 = t8*2 m = t9 tie = j tn = tl0+l j = til jmp (A) TRUE A: : tl := j t2 := n t4 := k t9 := m tl2 := tl+tl t3 := tl >= t2 jmp (Bl) t3 B: : t4 := t4+tl2 t9 := tl2 tl := tl+1 tl2 := tl2+2 t3 := tl < t2 jmp (B) t3 Bl:: k := t4 m := t9 C: : Luděk Matýska • Compilings • Spring 2024 10/33 FACULTY OF INFORMATICS Masaryk University Classical optimizations ■ Copy propagation ■ Examples: X = Y Z = 1. + X ^ ■ Constants processing ■ constants propagation ■ constant folding ■ Dead-code elimination ■ inaccessible code ■ saving cache capacity for instructio Luděk Matýska • Compilings • Spring 2024 FACULTY OF INFORMATICS Masaryk University Classical optimizations II ■ Strength reduction ■ Example: K**2 K*K ■ Variable renaming ■ Example ■ Common subexpressions elimination (important especially for evaluation of array indices) x = y*z; q = r+x+x; x = a+b x0 = y*z; q = r+x0+x0; x = a+b Luděk Matýska • Compilings • Spring 2024 12/33 FACULTY OF INFORMATICS Masaryk University Classical optimizations III ■ Move of invariant code from cycles ■ Simplification of induction variables (expressions with them) ■ A(I) is usually computed as: address = base_address(A) + (I-l)*sizeof _datatype(A) which can be in a linear cycle easily simplified to outside cycle: address = base_address(A) - sizeof .datatype (A) within cycle: address = address + sizeof _datatype(A) ■ Register allocation Luděk Matýska • Compilings • Spring 2024 13/33 FACULTY OF INFORMATICS Masaryk University Garbage elimination ■ Procedures, macros ■ Inlining ■ Conditional expressions ■ Comples expressions reorganization ■ Excessive/redundant tests (if vs case) ■ Conditional expressions within cycles ■ Cycle (induction variable) independent ■ Cycle (induction variable) dependent ■ Iteration independent ■ Dependence between iterations Luděk Matýska • Compilings • Spring 2024 14/33 FACULTY OF INFORMATICS Masaryk University Conditional expressions - example DO 1=1,K IF (N .EQ 0) THEN A(I)=A(TO+B(TO*C ELSE A(I>0 ENDIF IF (N .EQ 0) THEN DO 1=1,K A(I)=ACI>BCI) CONTINUE ELSE DO 1=1,K A(I)=0 CONTINUE ENDIF Luděk Matýska • Compilings • Spring 2024 FACULTY OF INFORMATICS Masaryk University Garbage elimination II ■ Reduction ■ min (or max): for(i=0;i z) ? a[i] : ZJ ■ how to deal with a recursive dependency: for(i=0;i z0) ? a[i] : z0; zl=(a[i+l] > zl) ? a[i+l] • zl; } z=(z0 < zl) ? zl : z0; Luděk Matýska • Compilings • Spring 2024 16/33 FACULTY OF INFORMATICS Masaryk University Reduction - Associative transformations ■ Numerical imprecision: 4 valid decimal digits (X + Y) + Z = (.00005 + .00005) + 1.0000 .00010 + + 1.0000 = 1.0001 ale X + (Y + Z) = .00005 + (.00005 + 1.0000) = .00005 + 1.0000 = 1.0000 ■ Reduction DO 1=1,N SUM=SUM+A(I)*B(I) Reduction with recursive dependency - can we use the same trick as with min reduction? Luděk Matýska • Compilings • Spring 2024 17/33 FACULTY OF INFORMATICS Masaryk University Garbage elimination III ■ Branches (jumps) ■ Type conversion REALM A(1000) REAL*8 B(1000) DO 1=1,1000 ACl>A(I)*B(I) ■ Manual optimization ■ Common subexpressions ■ Code move ■ Array processing (intelligent compiler, C and pointers) Luděk Matýska • Compilings • Spring 2024 18/33 FACULTY OF INFORMATICS Masaryk University Cycle optimization ■ Goals: ■ Overhead reduction ■ Better access to memory (efficient use of caches) ■ Parallelism increase Luděk Matýska • Compilings • Spring 2024 19/33 FACULTY OF INFORMATICS Masaryk University RAW, WAR and WAW dependencies ■ Named according how variables are used in the code (two occurencies) ■ Read after Read (RAR) ■ "Benign" (in fact no) dependency ■ Read after Write (RAW) ■ "True" dependency ■ Most problematic, order cannot be changed ■ Write after Read (WAR) ■ "Antidependency" ■ Can be dealt with by renaming ■ Write after Write (WAW) ■ "Output" dependency ■ Order cannot change unless checked for other dependencies Luděk Matýska • Compilings • Spring 2024 20/33 FACULTY OF INFORMATICS Masaryk University Data dependencies I Flow Dependencies (backward dependencies) ■ Example: A(2:N) = A(1:N-1)+B(2:N) DO 1=2 N D0 I=2'N'2 AriVAri lW-n A(I)=A(I-1)+BCI) A(I)-A(I 1>B(I) A(I+1)=A(I-1)+B(I)+B(I+1) Anti-Dependencies ■ Variable renaming as a default solution ■ Example: DO 1=1,N DO 1=1 N a'w= w * e ' DO 1=1 N acq = bcid * e bJj^ afi+2. * c BCI) = ACI+2) * C DO 1=1,N A(I)'= A'CI) Luděk Matýska • Compilings • Spring 2024 21/33 FACULTY OF INFORMATICS I Masaryk University Data dependencies II ■ Output Dependencies ■ Example: A(I) = C(I) * 2 A(I+2) = D(I) + E ■ Several values of a variable are computed during the cycle execution, but only the "last" is to be written ■ Not always easy to recognize, which value is "the last" Luděk Matýska • Compilings • Spring 2024 22/33 FACULTY OF INFORMATICS Masaryk University Loop unrolling I ■ Cycle body copies several times within the cycle DO 1=1,N A(I)=A(I)+B(I)*C DO 1=1,N,4 A(I)=A(I)+B(I)*C A(i+i>A(i+i>B(i+i;)*c A(I+2)=A(I+2)+B(I+2)*C A(I+3)=A(I+3)+B(I+3)*C Luděk Matýska • Compilings • Spring 2024 23/33 FACULTY OF INFORMATICS Masaryk University Loop unrolling II ■ Major purpose ■ Overhead reduction ■ Reduction of number of iterations (=number of branches) ■ Parallelism increase (also within a single superscalar processor) ■ Software pipelining ■ Pre-a postconditioning Loops ■ Actual number of iterations adaptation Luděk Matýska • Compilings • Spring 2024 24/33 FACULTY OF INFORMATICS Masaryk University Loop unrolling III ■ Unsuitable Loops ■ Small number of iterations —> full loop unrolling ■ "Fat" (=too large)) cycles: already include sufficient number of opportunities for parallelization ■ Loops with procedure calls: see also the procedure inlining ■ Loops with conditional expressions: more important for older processors ■ "Recursive" loops: with internal dependencies (cross iterations) (a[i]=a[i]+a[i-l]*b) Luděk Matýska • Compilings • Spring 2024 25/33 FACULTY OF INFORMATICS Masaryk University Loop unrolling problems ■ Unrolling with a bad number of iterations ■ Register clogging ■ Misses of instruction cache (too Long cycle) ■ Hardware problems ■ esp. on multiprocessors with shared memory (cache coherency, bus overload,...) ■ Special cases: external Loops unrolling, Loops combination Luděk Matýska • Compilings • Spring 2024 26/33 FACULTY OF INFORMATICS Masaryk University Loops combination Repeated use of data (cache efficiency) Large Loop body Compiler can do the combination if there is no code between Loops a[0]=b[0]+l for(i=0;i1,N DO D=1,N =>► DO 1=1, N A(I,D)=B(I,D)+C(I)^D A(I,D)=B(I,D)+C(D)^D Luděk Matýska • Compilings • Spring 2024 29/33 FACULTY OF INFORMATICS Masaryk University Memory access optimization II ■ Combination into blocks ■ Example: DO 1=1,N DO D=1,N A(3,I)=A(3,I)+B(I,3) DO 1=1,N,2 DO D=1,N,2 A(3,I)=A(3,I)+B(I,3) AQ3+1,I)=A(D+1,I)+B(I,3+1) A(D,I+1)=A(D,I+1)+B(I+1,D) A(D+1,I+1)=A(D+1,I+1)+B(I+1,3+1) Luděk Matýska • Compilings • Spring 2024 30/33 FACULTY OF INFORMATICS Masaryk University Cache optimization - Blocking ■ A general technique to split loops working with arrays to loops that work on a block of arrays ■ Example: Matrix transposition - c is transpose of a do 1=1,m do >1,n c[j+i*n] = a[i+j*m] Easy, but for any large n and m the cache overflow occurs do 1=1,m,b do :=i,n,b do 11=1,b do d:=i,b c[j+jj+Ci+ii)#n] = a[i+ii+(i+jj)^m] b is the block size, derived from the data cache size Luděk Matýska • Compilings • Spring 2024 31/33 FACULTY OF INFORMATICS I Masaryk University Memory access optimization III ■ Indirect addressing ■ Example: t>[i]=a[i+k]"*c, value of k unknown in the compile time a[k[i]] += b[i]*c ■ Use of pointers ■ Insufficient memory capacity ■ "Manual" processing ■ Virtual memory Luděk Matýska • Compilings • Spring 2024 32/33 FACULTY OF INFORMATICS Masaryk University More reading ■ http://www.inf.ed.ac.uk/teaching/courses/copt/ Luděk Matýska • Compilings • Spring 2024 33/33