Code Optimizations
Jiří Filipovič
Fall 2020


Vector Reduction

Let v be a vector of size n. We want to compute x = Σ_{i=1..n} v_i.

C code (not very reasonable for floats):

    int x = 0;
    for (int i = 0; i < n; i++)
        x += v[i];

There is a flow dependency across iterations:
• we cannot compute it completely in parallel
• addition is (at least in theory :-)) associative
• so we do not need to add the numbers in sequential order


Parallel Algorithm

The sequential algorithm performs seven steps:

    ((((((v1 + v2) + v3) + v4) + v5) + v6) + v7) + v8

Addition is associative... so let's reorder the brackets:

    ((v1 + v2) + (v3 + v4)) + ((v5 + v6) + (v7 + v8))

We can work in parallel now:
• four additions in the first step
• two additions in the second step
• one addition in the third step

In summary, we perform n - 1 additions in log_2 n parallel steps!

We have found a parallel algorithm:
• the same number of additions as the serial algorithm
• logarithmic time (if we have enough cores)

We add results of previous additions:
• flow dependency across threads
• we need a global barrier


Naive Approach

The simplest scheme of the algorithm:
• for each even i, i < n, perform v[i] += v[i+1]
• repeat with n /= 2 until n == 1

The performance is not ideal:
• 2n numbers loaded from global memory
• n numbers stored to global memory
• log_2 n kernel invocations

We have three memory accesses per one arithmetic operation and considerable kernel invocation overhead.
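For illustration, the naive scheme could be sketched as follows; the kernel and helper names, the compaction of partial sums to the front of the output buffer, and the ping-pong buffers are illustrative choices, not code from the lecture:

    // Naive reduction sketch: each kernel call adds pairs of elements and
    // writes the n/2 partial sums to the front of the output buffer.
    __global__ void reduce_naive(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n / 2)
            out[i] = in[2 * i] + in[2 * i + 1];
    }

    // Host side: log_2 n kernel invocations, halving the problem size each time.
    // Assumes n is a power of two; d_a and d_b are device buffers of size n.
    void reduce_on_gpu(int *d_a, int *d_b, int n) {
        const int threads = 256;
        while (n > 1) {
            int blocks = (n / 2 + threads - 1) / threads;
            reduce_naive<<<blocks, threads>>>(d_a, d_b, n);
            int *tmp = d_a; d_a = d_b; d_b = tmp;   // swap input and output
            n /= 2;
        }
        // the result is in the buffer that is currently the input (d_a[0])
    }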
Exploiting Data Locality

We can add more than pairs during a single kernel call:
• each block b_x loads m numbers into shared memory
• it reduces them in shared memory in log_2 m steps
• it stores only one number: Σ v_i over i = m·b_x, ..., m·(b_x + 1) - 1

This reduces both memory transfers and the number of kernel invocations:
• number of loads: n + n/m + n/m^2 + ... + m = (n - 1)·m/(m - 1)
• approximately n + n/m numbers read, n/m numbers written
• log_m n kernel invocations


Implementation 1

    __global__ void reduce1(int *v) {
        extern __shared__ int sv[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sv[tid] = v[i];
        __syncthreads();

        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0)
                sv[tid] += sv[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            v[blockIdx.x] = sv[0];
    }

Performance

Beware of the modulo operation. There is a high degree of divergence:
• during the first iteration, only half of the threads are working
• during the second iteration, only a quarter of the threads are working
• etc.

Performance on GTX 280: 3.77 GB/s (0.94 GElem/s).


Implementation 2

We will modify the indexing:

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x)
            sv[index] += sv[index + s];
        __syncthreads();
    }

Performance: 8.33 GB/s (2.08 GElem/s).

The code is free of modulo and divergence, but it generates shared memory bank conflicts.


Implementation 3

So we can try yet another indexing...

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sv[tid] += sv[tid + s];
        __syncthreads();
    }

No divergence and no bank conflicts. Performance: 16.34 GB/s (4.08 GElem/s).

However, half of the threads do not compute anything...


Implementation 4

We can add numbers already while loading them from global memory:

    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sv[tid] = v[i] + v[i+blockDim.x];

Performance: 27.16 GB/s (6.79 GElem/s).

There is no longer a problem with data access, but the performance is still low, so we will focus on instructions.
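For clarity, the whole kernel after these steps might look as sketched below; the name reduce4 and the separate output array out are illustrative (the slides reuse v for the output), and a power-of-two block size with n a multiple of 2·blockDim.x is assumed:

    // Consolidated sketch of implementations 1-4: addition during load,
    // shared-memory reduction with sequential addressing, one partial sum
    // per block written to out[blockIdx.x].
    __global__ void reduce4(const int *v, int *out) {
        extern __shared__ int sv[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

        sv[tid] = v[i] + v[i + blockDim.x];   // add already during the load
        __syncthreads();

        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)                      // sequential addressing
                sv[tid] += sv[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = sv[0];
    }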
Implementation 5

The number of active threads decreases during the computation in shared memory:
• in the last six iterations, only the last warp is active
• the warp is synchronized implicitly on GPUs with c.c. < 7.0, so we do not need __syncthreads()
• we need a volatile variable in this case
• the condition if (tid < s) does not spare any computation

So we can unroll the last warp...

    int mySum = sv[tid];   // the thread's partial sum
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sv[tid] = mySum = mySum + sv[tid + s];
        __syncthreads();
    }

    if (tid < 32) {
        volatile int *s = sv;
        s[tid] = mySum = mySum + s[tid + 32];
        s[tid] = mySum = mySum + s[tid + 16];
        s[tid] = mySum = mySum + s[tid + 8];
        s[tid] = mySum = mySum + s[tid + 4];
        s[tid] = mySum = mySum + s[tid + 2];
        s[tid] = mySum = mySum + s[tid + 1];
    }

We save time in all warps (the last warp runs simpler code, the others exit the for loop earlier). Performance: 37.68 GB/s (9.42 GElem/s).

For c.c. 3.0 or greater, we can use warp shuffle instead:

    if (tid < 32) {
        mySum += sv[tid + 32];
        for (int offset = warpSize/2; offset > 0; offset /= 2)
            mySum += __shfl_down_sync(0xffffffff, mySum, offset);
    }

This is safe for all GPUs (it does not rely on implicit warp synchronization).


Implementation 6

Can we unroll the for loop? If we know the number of iterations, we can unroll it:
• the number of iterations depends on the block size

Can we implement it generically?
• the algorithm uses blocks whose size is a power of two
• the block size has an upper bound
• if we know the block size at compile time, we can use a template

    template <unsigned int blockSize>
    __global__ void reduce6(int *v)

Conditions using blockSize are evaluated at compile time:

    if (blockSize >= 512) {
        if (tid < 256) sv[tid] += sv[tid + 256];
        __syncthreads();
    }
    if (blockSize >= 256) {
        if (tid < 128) sv[tid] += sv[tid + 128];
        __syncthreads();
    }
    if (blockSize >= 128) {
        if (tid < 64) sv[tid] += sv[tid + 64];
        __syncthreads();
    }

Performance: 50.64 GB/s (12.66 GElem/s).
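Since blockSize is a template parameter, the host code has to pick an instantiation at run time; a minimal sketch of such a dispatch (the helper name and the set of supported sizes are illustrative) could be:

    // Hypothetical host-side dispatch over the supported (power-of-two)
    // block sizes; smem is the dynamic shared memory needed per block.
    void run_reduce6(int *d_v, int blocks, int threads) {
        size_t smem = threads * sizeof(int);
        switch (threads) {
            case 512: reduce6<512><<<blocks, 512, smem>>>(d_v); break;
            case 256: reduce6<256><<<blocks, 256, smem>>>(d_v); break;
            case 128: reduce6<128><<<blocks, 128, smem>>>(d_v); break;
            case  64: reduce6< 64><<<blocks,  64, smem>>>(d_v); break;
        }
    }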
Implementation 7

Can we implement a faster algorithm? Let's reconsider the complexity:
• log n parallel steps
• n - 1 additions
• time complexity for p threads running in parallel (using p processors): O(n/p + log n)

Cost of the parallel computation:
• defined as the number of processors multiplied by the time complexity
• if we assign one thread to one data element, we get p = n
• and the cost is O(n · log n)
• which is not efficient

Decreasing the cost:
• we use O(n / log n) threads
• each thread performs O(log n) sequential steps
• after that, it performs O(log n) parallel steps
• the time complexity stays the same
• the cost is O(n)

What does it mean in practice?
• we reduce the overhead of the computation (e.g. integer arithmetics)
• it is an advantage if we have many more threads than needed to saturate the GPU

We modify the loading into shared memory:

    unsigned int gridSize = blockSize*2*gridDim.x;
    sv[tid] = 0;
    while (i < n) {
        sv[tid] += v[i] + v[i+blockSize];
        i += gridSize;
    }
    __syncthreads();

Performance: 77.21 GB/s (19.3 GElem/s).

You can find these implementations in the CUDA SDK.


Cross-kernel Optimizations

The compiler optimizes each kernel separately, so it may miss some optimization opportunities:
• kernel fusion: gluing code from several kernels into one kernel
• kernel fission: splitting a kernel into several smaller kernels


Kernel fusion

Performance impact of kernel fusion:
• reduces kernel execution overhead
• may add more parallelism
• allows more scalar code optimizations: common subexpression elimination, loop fusion, condition fusion
• reduces global memory transfers if the kernels are flow-dependent or input-dependent

Correctness:
• there must be no flow dependency between thread blocks
• shared memory and register locality has to be maintained

A small fusion example is sketched below, after the kernel fission overview.


Kernel fission

Kernel fission reduces resource consumption:
• increases occupancy
• may allow using a different algorithm (e.g. if part of the algorithm uses a different amount of parallelism or a different amount of resources)
• more complicated and divergent code may be separated (e.g. handling array boundaries)

Correctness:
• much easier, we just need to transfer data between the new kernels
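To illustrate kernel fusion, consider two hypothetical element-wise kernels and a fused version; the kernels and their names are invented for illustration and are not part of the lecture material. The fused kernel keeps the intermediate value in a register, saving one global write and one global read per element:

    // Two separate kernels: the intermediate result y goes through global memory.
    __global__ void scale(const float *x, float *y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i];
    }
    __global__ void add(const float *y, const float *z, float *w, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) w[i] = y[i] + z[i];
    }

    // Fused kernel: the intermediate value stays in a register.
    __global__ void scale_add(const float *x, const float *z, float *w,
                              float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float y = a * x[i];   // formerly written to and read from global memory
            w[i] = y + z[i];
        }
    }

Fusion is legal here because each output element depends only on inputs with the same index, so no flow dependency between thread blocks is introduced.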
Problem Choice

Before we start with code acceleration, we should consider carefully whether it is meaningful. The accelerated code should be:
• critical for application performance (profile... and profile on real data)
• large enough (acceleration is usually not ideal for relatively simple but latency-critical applications)
• parallelizable (problematic e.g. in simulations of a small system evolving for a long time)
• performing a sufficient number of flops per memory transfer (consider the slow PCI-E)

Do we optimize running time or power consumption?
• accelerators are usually faster, but also have higher power consumption
• how to deal with hybrid systems (e.g. CPU, GPU and Xeon Phi)?
• this influences the decision what to buy as well as what to use (and which resources to leave in power-saving mode)


Algorithm Design

Parallelization:
• we need to parallelize the computational problem
• we should be aware of the target architecture already at this stage (consider e.g. graph algorithms)

It is difficult to accelerate code on a GPU:
• if threads within a warp access rather random addresses in memory
• if threads within a warp diverge (by the nature of the algorithm)
• if parallelism is insufficient in certain parts of the computation


How to Write Bug-Free Code

Test whether API and kernel calls are successful:
• otherwise, errors can appear much later... (see the sketch below)

Memory allocation on the GPU occurs seldom (and allocated memory is not cleared):
• if you modify your code in a way that your kernel does not write any result, you get the result of its previous run
• clear output arrays for debugging purposes

Be aware of out-of-bounds shared memory accesses:
• the kernel usually runs successfully, but one block interferes with another
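A common way to check API and kernel calls is to wrap every CUDA runtime call and query for launch errors after each kernel; the macro below is a usual sketch (the macro name is illustrative):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Illustrative error-checking macro: report the file and line of a failed call.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // Usage: wrap API calls, and check kernel launches explicitly.
    //   CUDA_CHECK(cudaMalloc(&d_v, n * sizeof(int)));
    //   reduce1<<<blocks, threads, smem>>>(d_v);
    //   CUDA_CHECK(cudaGetLastError());       // launch errors
    //   CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors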
Optimization

Start with the most important optimizations and continue with the less important ones (so that the effect of the less important optimizations is not hidden). In general, this order should work well:
• reduction/overlapping of PCI-E transfers
• global memory access (bandwidth, latency)
• access to other types of memory
• divergence
• parallelism configuration (block size, amount of serial work per thread)
• instruction optimization

It is a good idea to write your code in a configurable way:
• block size, number of serial iterations per thread, loop unrolling factor, used algorithm...
• use macros or templates to ease configuration of the optimizations


Interpretation of Algorithm Performance

Some optimizations may be hidden:
• e.g. optimizing instructions cannot help when the code is bound by inefficient global memory access
• this can be reduced by applying the more important optimizations earlier
• use the profiler

The optimization space is not continuous:
• due to the restricted amount of GPU resources
• e.g. improving the efficiency of scalar code by using one more register may decrease performance by restricting GPU occupancy

Performance is data-dependent:
• data size: partition camping, underutilized GPU
• data content: sparse data with varying structure


What Is the Real Speedup over CPU?

Comparison of theoretical peaks is the basic metric:
• however, the speedup can be lower
  • insufficient parallelism
  • inappropriate data structures, random access
  • PCI-E bottleneck (especially for multi-GPU algorithms)
• however, the speedup can also be higher
  • frequent usage of SFUs
  • complicated vectorization on the CPU
  • insufficient scaling on SMP (cache interference, NUMA)
  • different scaling of CPU and GPU with growing problem size


Searching for Bottlenecks

The amount of arithmetic operations and memory transfers tells us what is expected to limit the algorithm:
• sometimes the bottleneck is not clear (overhead instructions, irregular memory access)
• code profiling: suitable for identifying issues with instruction throughput or a bad memory access pattern, more difficult for identifying the source of latency problems
• code modifications: more precise, but more laborious and not usable in all cases


Profiling

How close is the code to the hardware limits?
• the profiler shows the overall utilization of particular GPU subsystems, such as cache, global memory, FP instructions etc.

Issue identification:
• the profiler detects some issues, such as shared memory bank conflicts or code divergence

We can inspect the code in detail:
• time spent on particular instructions / C for CUDA lines of code
• we need to compile the code with the flag -lineinfo


Code Modifications

Global memory performance:
• we comment out the computation
• but we need to somehow use the loaded data (to prevent the compiler from removing the loads)
• we can check with the profiler that the same amount of data is transferred

Instruction performance:
• we comment out the data movement
• but the resulting data needs to be stored (to prevent the compiler from removing the computation)
• but we do not want to store the data...
• so we can move the code storing the data into a condition which is evaluated as false at run time, but cannot be evaluated at compile time (as sketched below)
• be aware of the execution overhead in the case of fast kernels
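One possible form of such a condition is a kernel argument that the host always sets to zero; the compiler cannot prove the store is dead, so the computation is kept. A hedged sketch (the kernel, parameter names, and the dummy arithmetic are illustrative):

    // Instruction-performance measurement sketch: the global memory loads are
    // commented out and the store is guarded by a condition that is false at
    // run time (the host always passes flag = 0), but opaque to the compiler.
    __global__ void compute_only(int *out, int n, int flag) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int x = i;                        // stands in for the commented-out load
        for (int k = 0; k < 100; k++)     // the arithmetic we want to time
            x = x * 3 + 1;

        if (flag)                         // false at run time, unknown at compile time
            out[i] = x;                   // keeps the computation from being removed
    }

If the compiler still manages to move the computation under the guarded branch, a more robust variant is to make the condition depend on the computed value itself (e.g. comparing x against a value it never takes).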
Code Modifications

Be aware of occupancy changes:
• code modifications can release some resources
• we can restrict occupancy back by allocating a dummy shared memory array

Interpretation of the measured times:
• the original kernel execution time is close to the sum of the computation-only and memory-only kernel times: latency is an issue
• the computation-only or the memory-only kernel time dominates and is close to the original kernel time: the performance is bound by computation or by memory, respectively
• the computation-only and memory-only kernel times are both similar to the original kernel time: we need to optimize both


Code Modifications

Approximation of an optimization effect:
• when we already know about some performance issue
• when we want to know the effect of an optimization before we actually implement it
• we can modify the code without preserving the original functionality, but preserving the amount of work and removing the performance issue
• this cannot be done in all cases
• it may show us whether we really address the performance issue
• see the matrix transposition example