FACULTY
OF INFORMATICS
Masaryk University
PA039: Supercomputer Architecture and Intensive Computing
Compiling and Code Optimization
Luděk Matýska
Spring 2024
Luděk Matýska • Compilings • Spring 2024
1/33
FACULTY
OF INFORMATICS
Masaryk University
Repetition - RISC processors
■ Limited number of instructions, same size
■ Simple address modes, Load/Store, sufficient number of registers
■ Delayed branches, branch prediction, out-of-order execution
■ Superscalar (e.g. 2xFPU, 2xALU, special address instructions)
■ Superpipeline
■ Caches
Luděk Matýska • Compilings • Spring 2024
2/33
FACULTY
OF INFORMATICS
Masaryk University
Optimizing Compiler
■ Translation to the Intermediate language
■ Optimization
■ intra-procedural analysis
■ cycle optimization
■ global optimization (inter-process optimization)
■ Code generation
■ use of all superscalar units
Luděk Matýska • Compilings • Spring 2024
3/33
FACULTY
OF INFORMATICS
Masaryk University
Intermediate Language
■ Quadruple (generally fl-tuple)
■ Instruction: operator, two operands, result
■ Example
■ Operation op writen as: X := Y op Z
■ Memory: accessible through temporary variables tn
■ Branches: condition calculated separately
■ Branches: jumps to absolute addresses
Luděk Matýska • Compilings • Spring 2024
4/33
FACULTY
OF INFORMATICS
Masaryk University
Basic translation
while ( ] < n ) {
}
A:
B:
tl t2 t3 jmp
t4 t5 t6 t7 k
t8 t9
k = k + j*2 m = j*2
j++
= 3 = n
= tl < CB) t3
t2
jmp (C) TRUE
k j
t5*2
t4+t6
t7
j
t8*2
m
tie til
3
= t9 = j
= tl0+l = til
jmp (A) TRUE
C: :
Luděk Matýska • Compilings • Spring 2024
FACULTY
OF INFORMATICS
Masaryk University
Basic blocks
■ Program is represented as a flow graph
■ BLock - a code segment without branches/jumps
■ One entry and one exit point
■ Block as a DAG (Directed Acyclic Graph)
■ Optimization within blocks
■ Removal of repeated (sub)expressions
■ Removal of redundant variables
Luděk Matýska • Compilings • Spring 2024
6/33
FACULTY
OF INFORMATICS
Masaryk University
Directed Acyclic Graph
B::
t4 t5 t6 t7 k
t8 t9 m
tie tn
3
k 3
t5*2
t4+t6
t7
D
t8*2 t9
D
tl0+l til
jmp (A) TRUE
"2" J,t5,t8,t10 \ A /	K,t4	
W7 V f * 1             / \		
t6,t9,M\  ^ ^     / \ \ >x \                                         X **■ \                           X **• \                               X "v \ X \                        X ««■»		
\                           X >. \ x \ X %         *x *»» t7,K (+)	(±) t11'J	
Luděk Matýska • Compilings • Spring 2024
7/33
FACULTY
OF INFORMATICS
Masaryk University
Modified translation
t4	■ —	k	B: : t4		:= k
t5	■ —	•	t5		•
t6	■ —	t5*2	t6		:= t5*2
t7	• —	t4+t6	m		:= t6
k	• —	t7	t7		:= t6+t4
t8	■ —	• 3	k		:= t7
t9 :	■ —	t8*2	til		:= t5+l
m :	—	t9	•		= til
tie :	—	• 1	jmp	(A) TRUE	
til :	—	tl0+l			
•	—	til			
jmp (A) TRUE
Luděk Matýska • Compilings • Spring 2024
8/33
FACULTY
OF INFORMATICS
Masaryk University
Additional concepts
■ Variables
■ Definition and place of use
■ Cycles
■ Target code generation
■ Includes the so-called peephole optimization
Luděk Matýska • Compilings • Spring 2024
9/33
FACULTY
OF INFORMATICS
Masaryk University
Optimized code
tl		:= j
t2		:= n
t3		:= tl <
jmp	(B) t3	
jmp	(C) TRUE	
t4		:= k
t5		:= J
t6		:= t5*2
t7		:= t4+t6
k		:= t7
t8		= j
t9		= t8*2
m		= t9
tie		= j
tn		= tl0+l
j		= til
jmp	(A) TRUE	
A: :      tl := j
t2 := n
t4 := k
t9 := m
tl2 := tl+tl
t3 := tl >= t2
jmp (Bl) t3
B: :      t4 := t4+tl2
t9 := tl2
tl := tl+1
tl2 := tl2+2
t3 := tl < t2
jmp (B) t3
Bl::    k := t4
m := t9
C: :
Luděk Matýska • Compilings • Spring 2024
10/33
FACULTY
OF INFORMATICS
Masaryk University
Classical optimizations
■ Copy propagation
■ Examples:
X = Y
Z = 1. + X ^
■ Constants processing
■ constants propagation
■ constant folding
■ Dead-code elimination
■ inaccessible code
■ saving cache capacity for instructio
Luděk Matýska • Compilings • Spring 2024
FACULTY
OF INFORMATICS
Masaryk University
Classical optimizations II
■ Strength reduction
■ Example: K**2 K*K
■ Variable renaming
■ Example
■ Common subexpressions elimination
(important especially for evaluation of array indices)
x = y*z; q = r+x+x; x = a+b
x0 = y*z;
q   = r+x0+x0;
x   = a+b
Luděk Matýska • Compilings • Spring 2024
12/33
FACULTY
OF INFORMATICS
Masaryk University
Classical optimizations III
■ Move of invariant code from cycles
■ Simplification of induction variables (expressions with them)
■ A(I) is usually computed as:
address = base_address(A) +
(I-l)*sizeof _datatype(A)
which can be in a linear cycle easily simplified to
outside cycle:
address = base_address(A) -
sizeof .datatype (A)
within cycle:
address = address + sizeof _datatype(A)
■ Register allocation
Luděk Matýska • Compilings • Spring 2024
13/33
FACULTY
OF INFORMATICS
Masaryk University
Garbage elimination
■ Procedures, macros
■ Inlining
■ Conditional expressions
■ Comples expressions reorganization
■ Excessive/redundant tests (if vs case)
■ Conditional expressions within cycles
■ Cycle (induction variable) independent
■ Cycle (induction variable) dependent
■ Iteration independent
■ Dependence between iterations
Luděk Matýska • Compilings • Spring 2024
14/33
FACULTY
OF INFORMATICS
Masaryk University
Conditional expressions - example
DO 1=1,K IF (N .EQ 0) THEN
A(I)=A(TO+B(TO*C ELSE
A(I>0 ENDIF
IF (N .EQ 0) THEN
DO 1=1,K
A(I)=ACI>BCI) CONTINUE
ELSE
DO 1=1,K A(I)=0
CONTINUE ENDIF
Luděk Matýska • Compilings • Spring 2024
FACULTY
OF INFORMATICS
Masaryk University
Garbage elimination II
■ Reduction
■ min (or max):
for(i=0;i<n;
z=Ca[i] > z) ?   a[i]  : ZJ
■ how to deal with a recursive dependency:
for(i=0;i<n-l;i+=2) {
z0=(a[i] > z0) ?   a[i] : z0; zl=(a[i+l] > zl) ?   a[i+l]  • zl;
}
z=(z0 < zl) ?   zl : z0;
Luděk Matýska • Compilings • Spring 2024
16/33
FACULTY
OF INFORMATICS
Masaryk University
Reduction - Associative transformations
■ Numerical imprecision: 4 valid decimal digits
(X + Y) + Z = (.00005 + .00005) + 1.0000
.00010 + + 1.0000 = 1.0001
ale
X + (Y + Z) = .00005 + (.00005 + 1.0000) =
.00005 + 1.0000 = 1.0000
■ Reduction
DO 1=1,N
SUM=SUM+A(I)*B(I) Reduction with recursive dependency - can we use the same trick as with min reduction?
Luděk Matýska • Compilings • Spring 2024
17/33
FACULTY
OF INFORMATICS
Masaryk University
Garbage elimination III
■ Branches (jumps)
■ Type conversion
REALM A(1000) REAL*8 B(1000) DO 1=1,1000
ACl>A(I)*B(I)
■ Manual optimization
■ Common subexpressions
■ Code move
■ Array processing (intelligent compiler, C and pointers)
Luděk Matýska • Compilings • Spring 2024
18/33
FACULTY
OF INFORMATICS
Masaryk University
Cycle optimization
■ Goals:
■ Overhead reduction
■ Better access to memory (efficient use of caches)
■ Parallelism increase
Luděk Matýska • Compilings • Spring 2024
19/33
FACULTY
OF INFORMATICS
Masaryk University
RAW, WAR and WAW dependencies
■ Named according how variables are used in the code (two occurencies)
■ Read after Read (RAR)
■ "Benign" (in fact no) dependency
■ Read after Write (RAW)
■ "True" dependency
■ Most problematic, order cannot be changed
■ Write after Read (WAR)
■ "Antidependency"
■ Can be dealt with by renaming
■ Write after Write (WAW)
■ "Output" dependency
■ Order cannot change unless checked for other dependencies
Luděk Matýska • Compilings • Spring 2024
20/33
FACULTY
OF INFORMATICS
Masaryk University
Data dependencies I
Flow Dependencies (backward dependencies)
■ Example: A(2:N) = A(1:N-1)+B(2:N)
DO 1=2 N D0 I=2'N'2
AriVAri lW-n A(I)=A(I-1)+BCI) A(I)-A(I 1>B(I) A(I+1)=A(I-1)+B(I)+B(I+1)
Anti-Dependencies
■ Variable renaming as a default solution
■ Example:
DO 1=1,N
DO 1=1 N a'w= w * e
' DO 1=1 N
acq = bcid * e bJj^ afi+2. * c
BCI) = ACI+2) * C DO 1=1,N
A(I)'= A'CI)
Luděk Matýska • Compilings • Spring 2024
21/33
FACULTY
OF INFORMATICS
I   Masaryk University
Data dependencies II
■ Output Dependencies
■ Example:
A(I) = C(I) * 2 A(I+2) = D(I) + E
■ Several values of a variable are computed during the cycle execution, but only the "last" is to be written
■ Not always easy to recognize, which value is "the last"
Luděk Matýska • Compilings • Spring 2024
22/33
FACULTY
OF INFORMATICS
Masaryk University
Loop unrolling I
■ Cycle body copies several times within the cycle DO 1=1,N
A(I)=A(I)+B(I)*C
DO 1=1,N,4
A(I)=A(I)+B(I)*C
A(i+i>A(i+i>B(i+i;)*c
A(I+2)=A(I+2)+B(I+2)*C A(I+3)=A(I+3)+B(I+3)*C
Luděk Matýska • Compilings • Spring 2024
23/33
FACULTY
OF INFORMATICS
Masaryk University
Loop unrolling II
■ Major purpose
■ Overhead reduction
■ Reduction of number of iterations (=number of branches)
■ Parallelism increase (also within a single superscalar processor)
■ Software pipelining
■ Pre-a postconditioning Loops
■ Actual number of iterations adaptation
Luděk Matýska • Compilings • Spring 2024
24/33
FACULTY
OF INFORMATICS
Masaryk University
Loop unrolling III
■ Unsuitable Loops
■ Small number of iterations —> full loop unrolling
■ "Fat" (=too large)) cycles: already include sufficient number of opportunities for parallelization
■ Loops with procedure calls: see also the procedure inlining
■ Loops with conditional expressions: more important for older processors
■ "Recursive" loops: with internal dependencies (cross iterations) (a[i]=a[i]+a[i-l]*b)
Luděk Matýska • Compilings • Spring 2024
25/33
FACULTY
OF INFORMATICS
Masaryk University
Loop unrolling problems
■ Unrolling with a bad number of iterations
■ Register clogging
■ Misses of instruction cache (too Long cycle)
■ Hardware problems
■ esp. on multiprocessors with shared memory (cache coherency, bus overload,...)
■ Special cases: external Loops unrolling, Loops combination
Luděk Matýska • Compilings • Spring 2024
26/33
FACULTY
OF INFORMATICS
Masaryk University
Loops combination
Repeated use of data (cache efficiency) Large Loop body
Compiler can do the combination if there is no code between Loops
a[0]=b[0]+l
for(i=0;i<n;c[0]=a[0]/2
a[i]=b[i]+l for(i=l;i<n;i++) {
for(i=0;i<n;i++) _^ a[i]=b[i]+l
c[i]=a[i]/2 =^ c[i]=a[i]/2
for(i=0;i<n;i++) d[i-l]=l/c[i] d[i]=l/c[i+l] }
d[n]=l/c[n+l]
Luděk Matýska • Compilings • Spring 2024
27/33
FACULTY
OF INFORMATICS
Masaryk University
External loops unrolling
■ Example:
DO 1=1,N DO 3=1,N
A(I)=A(I)+B(I,D)*C(D)
■ A(l) is a constant in the internal loop, C(J) is properly walked through
■ B(IJ) Fortran has inverse order!
DO 1=1,N,4
DO 3=1,N
A(I+0)=ACI+0)+B(I+0,D)*C(D)
A(I+1)=A(I+1)+B(I+1, jyCQJ^ A(I+2)=A(I+2)+B(I+21 ;j)*C(D)
ACI+3)=ACI+3)+BCI+3, :i)*C(D)
Luděk Matýska • Compilings • Spring 2024
28/33
FACULTY
OF INFORMATICS
I   Masaryk University
Memory access optimization
■ Optimal: smallest step (cache optimization)
■ Work with arrays - C vs. Fortran
■ C: stored by rows, the right index is fastest to change
■ Fortran: stored by columns, the left index is fastest to change
■ Index inversion
■ Example:
DO 1=1, N DO >1,N
DO D=1,N =>► DO 1=1, N
A(I,D)=B(I,D)+C(I)^D A(I,D)=B(I,D)+C(D)^D
Luděk Matýska • Compilings • Spring 2024
29/33
FACULTY
OF INFORMATICS
Masaryk University
Memory access optimization II
■ Combination into blocks ■ Example:
DO 1=1,N
DO D=1,N
A(3,I)=A(3,I)+B(I,3)
DO 1=1,N,2 DO D=1,N,2
A(3,I)=A(3,I)+B(I,3)
AQ3+1,I)=A(D+1,I)+B(I,3+1) A(D,I+1)=A(D,I+1)+B(I+1,D) A(D+1,I+1)=A(D+1,I+1)+B(I+1,3+1)
Luděk Matýska • Compilings • Spring 2024
30/33
FACULTY
OF INFORMATICS
Masaryk University
Cache optimization - Blocking
■ A general technique to split loops working with arrays to loops that work on a block of arrays
■ Example: Matrix transposition - c is transpose of a
do 1=1,m do >1,n
c[j+i*n] = a[i+j*m] Easy, but for any large n and m the cache overflow occurs
do 1=1,m,b
do :=i,n,b
do 11=1,b
do d:=i,b
c[j+jj+Ci+ii)#n] = a[i+ii+(i+jj)^m] b is the block size, derived from the data cache size
Luděk Matýska • Compilings • Spring 2024
31/33
FACULTY
OF INFORMATICS
I   Masaryk University
Memory access optimization III
■ Indirect addressing
■ Example:
t>[i]=a[i+k]"*c, value of k unknown in the compile time a[k[i]] += b[i]*c
■ Use of pointers
■ Insufficient memory capacity
■ "Manual" processing
■ Virtual memory
Luděk Matýska • Compilings • Spring 2024
32/33
FACULTY
OF INFORMATICS
Masaryk University
More reading
■ http://www.inf.ed.ac.uk/teaching/courses/copt/
Luděk Matýska • Compilings • Spring 2024
33/33