Code Optimizations
Jiří Filipovič, fall 2015
Reduction, Cross-kernel Optimizations, General Advices, Searching for Bottlenecks

Vector Reduction

Let v be a vector of size n. We want to compute $x = \sum_{i=1}^{n} v_i$.

C code (not very reasonable for floats):

  int x = 0;
  for (int i = 0; i < n; i++)
    x += v[i];

There is a flow dependency across iterations.
• we cannot be completely parallel
• addition is (at least in theory :-)) associative
• so we do not need to add the numbers in sequential order

Parallel Algorithm

The sequential algorithm performs seven steps:
  ((((((v_1 + v_2) + v_3) + v_4) + v_5) + v_6) + v_7) + v_8

Addition is associative... so let's reorder the brackets:
  ((v_1 + v_2) + (v_3 + v_4)) + ((v_5 + v_6) + (v_7 + v_8))

We can work in parallel now:
• four additions in the first step
• two additions in the second step
• one addition in the third step
In summary, we perform n - 1 additions in log2 n parallel steps!

We have found the parallel algorithm
• the same number of additions as the serial algorithm
• in logarithmic time (if we have enough cores)
We add results of previous additions
• flow dependency across threads
• we need a global barrier

Naive Approach

The simplest scheme of the algorithm:
• for every even i, i < n, perform v[i] += v[i+1]
• repeat with n /= 2 while n > 1

The performance is not ideal:
• 2n numbers loaded from global memory
• n numbers stored to global memory
• log2 n kernel invocations
We have three memory accesses per arithmetic operation and considerable kernel invocation overhead.
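The slides do not show code for this naive scheme, so here is a minimal sketch of one possible in-place implementation; the kernel name, the launch configuration and the stride-based indexing are assumptions, and n is assumed to be a power of two.

  // One reduction step of the naive scheme: element i (a multiple of 2*s)
  // accumulates v[i + s]. After log2(n) launches the result is in v[0].
  __global__ void reduce_naive_step(int *v, int n, int s)
  {
      unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int i = 2 * s * idx;        // each thread handles one pair
      if (i + s < n)
          v[i] += v[i + s];
  }

  // host side: log2(n) kernel invocations, about 2n loads and n stores in total
  void reduce_naive(int *d_v, int n)
  {
      for (int s = 1; s < n; s *= 2) {
          int pairs = n / (2 * s);
          int block = 256;
          int grid = (pairs + block - 1) / block;
          reduce_naive_step<<<grid, block>>>(d_v, n, s);
      }
  }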
Exploiting Data Locality

We can add more than pairs during a single kernel call:
• each block b_x loads m numbers into shared memory
• it reduces the input in shared memory in log2 m steps
• it stores only one number containing $\sum_{i=m \cdot b_x + 1}^{m \cdot (b_x+1)} v_i$

This reduces both the memory transfers and the number of kernel invocations:
• number of loads: $n + \frac{n}{m} + \frac{n}{m^2} + \dots + m = (n-1)\frac{m}{m-1}$
• approximately n + n/m numbers read, n/m numbers written
• log_m n kernel invocations

Implementation 1

  __global__ void reduce1(int *v){
    extern __shared__ int sv[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sv[tid] = v[i];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
      if (tid % (2*s) == 0)
        sv[tid] += sv[tid + s];
      __syncthreads();
    }
    if (tid == 0)
      v[blockIdx.x] = sv[0];
  }

Performance

Costly modulo operation. Possibly a high degree of divergence:
• during the first iteration, only half of the threads are working
• during the second iteration, only a quarter of the threads are working
• etc.
Performance on GTX 280: 3.77 GB/s (0.94 GElem/s).

Implementation 2

We will modify the indexing:

  for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x)
      sv[index] += sv[index + s];
    __syncthreads();
  }

Performance: 8.33 GB/s (2.08 GElem/s).
The code is free of modulo and divergence, but it generates shared memory bank conflicts.

Implementation 3

So we can try yet another indexing...

  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
      sv[tid] += sv[tid + s];
    __syncthreads();
  }

No divergence and no conflicts. Performance: 16.34 GB/s (4.08 GElem/s).
Half of the threads do not compute, though...

Implementation 4

We can add numbers already while loading them from global memory:

  unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
  sv[tid] = v[i] + v[i+blockDim.x];

Performance: 27.16 GB/s (6.79 GElem/s).
There is no problem with data access anymore, but the performance is still low, so we will focus on instructions.
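For clarity, here is a hedged sketch of how the whole kernel might look after the changes from implementations 3 and 4 are combined; the kernel name is an assumption, and it presumes the grid contains half as many blocks as for reduce1, since each block now consumes 2*blockDim.x input elements.

  // Sequential addressing (no divergence, no bank conflicts) combined with
  // the first addition performed already during the load from global memory.
  // Each block reduces 2*blockDim.x elements to one partial sum.
  __global__ void reduce4(int *v)
  {
      extern __shared__ int sv[];
      unsigned int tid = threadIdx.x;
      unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

      sv[tid] = v[i] + v[i + blockDim.x];   // first add during load
      __syncthreads();

      for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (tid < s)
              sv[tid] += sv[tid + s];       // sequential addressing
          __syncthreads();
      }

      if (tid == 0)
          v[blockIdx.x] = sv[0];            // one partial result per block
  }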
Implementation 5

The number of active threads decreases during the computation in shared memory.
• in the last six iterations, only the last warp is active
• the warp is synchronized implicitly, so we do not need __syncthreads() (but this is not a safe optimization)
• we need a volatile variable in this case
• the condition if (tid < s) does not spare any computation
So we can unroll the last warp...

  float mySum = sv[tid];
  for (unsigned int s = blockDim.x/2; s > 32; s >>= 1){
    if (tid < s)
      sv[tid] = mySum = mySum + sv[tid + s];
    __syncthreads();
  }
  if (tid < 32){
    volatile float *s = sv;
    s[tid] = mySum = mySum + s[tid + 32];
    s[tid] = mySum = mySum + s[tid + 16];
    s[tid] = mySum = mySum + s[tid + 8];
    s[tid] = mySum = mySum + s[tid + 4];
    s[tid] = mySum = mySum + s[tid + 2];
    s[tid] = mySum = mySum + s[tid + 1];
  }

We save time in all warps (the last warp is simpler, the others exit earlier from the for loop).
Performance: 37.68 GB/s (9.42 GElem/s).

For c.c. 3.0 or greater, we can use warp shuffle:

  if (tid < 32){
    mySum += sv[tid + 32];
    for (int offset = warpSize/2; offset > 0; offset /= 2)
      mySum += __shfl_down(mySum, offset);
  }

Implementation 6

Can we unroll the for loop? If we know the number of iterations, we can unroll it
• the number of iterations depends on the block size
Can we implement it generically?
• the algorithm uses blocks whose size is a power of two
• the block size is upper-bounded
• if we know the block size at compile time, we can use a template

  template <unsigned int blockSize>
  __global__ void reduce6(int *v)

Conditions using blockSize are evaluated at compile time:

  if (blockSize >= 512){
    if (tid < 256) sv[tid] += sv[tid + 256];
    __syncthreads();
  }
  if (blockSize >= 256){
    if (tid < 128) sv[tid] += sv[tid + 128];
    __syncthreads();
  }
  if (blockSize >= 128){
    if (tid < 64) sv[tid] += sv[tid + 64];
    __syncthreads();
  }

Performance: 50.64 GB/s (12.66 GElem/s).
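The slides do not show how the templated kernel is launched; the following host-side sketch (the function name and the set of supported block sizes are assumptions, following the approach of the CUDA SDK reduction example mentioned later) dispatches on the block size known at run time, so that each case instantiates the kernel with a compile-time constant.

  // Each branch instantiates reduce6 with a different compile-time block size,
  // so the blockSize conditions inside the kernel are resolved by the compiler.
  void launch_reduce6(int *d_v, unsigned int blocks, unsigned int threads,
                      unsigned int smemBytes)
  {
      switch (threads) {
      case 512: reduce6<512><<<blocks, 512, smemBytes>>>(d_v); break;
      case 256: reduce6<256><<<blocks, 256, smemBytes>>>(d_v); break;
      case 128: reduce6<128><<<blocks, 128, smemBytes>>>(d_v); break;
      case  64: reduce6< 64><<<blocks,  64, smemBytes>>>(d_v); break;
      }
  }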
Implementation 7

Can we implement a faster algorithm? Let's reconsider the complexity:
• log n parallel steps
• n - 1 additions
• time complexity for p threads running in parallel (using p processors): O(n/p + log n)

Cost of the parallel computation
• defined as the number of processors multiplied by the time complexity
• if we assign one thread to one data element, we get p = n
• and the cost is O(n · log n)
• which is not efficient

Decreasing the cost
• we use O(n / log n) threads
• each thread performs O(log n) sequential steps
• after that, it performs O(log n) parallel steps
• the time complexity is the same
• the cost is O(n)

What does it mean in practice?
• we reduce the overhead of the computation (e.g. integer arithmetic)
• it is an advantage when we have many more threads than are needed to saturate the GPU
• moreover, the thread execution overhead is reduced

We modify the loading into shared memory:

  unsigned int gridSize = blockSize*2*gridDim.x;
  sv[tid] = 0;
  while (i < n){
    sv[tid] += v[i] + v[i+blockSize];
    i += gridSize;
  }
  __syncthreads();

Performance: 77.21 GB/s (19.3 GElem/s).
You can find these implementations in the CUDA SDK.

Cross-kernel Optimizations

The compiler optimizes each kernel separately, so it may miss some optimization opportunities.
• kernel fusion: gluing code from several kernels into one kernel
• kernel fission: splitting a kernel into several smaller kernels

Kernel fusion

Performance impact of kernel fusion
• reduces kernel execution overhead
• may add more parallelism
• allows more scalar code optimizations: common subexpression elimination, loop fusion, condition fusion
• reduces global memory transfers if the kernels are flow-dependent or input-dependent

Correctness
• no flow dependency between thread blocks
• shared memory and register locality has to be maintained
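A hedged toy example of kernel fusion (the kernels are illustrative and not taken from the slides): two flow-dependent element-wise kernels are fused into one, so the intermediate value stays in a register instead of making a round trip through global memory, and one kernel launch is saved.

  // Unfused: y is written to global memory by the first kernel and read back
  // by the second one.
  __global__ void scale(float *y, const float *x, float a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i];
  }
  __global__ void add(float *z, const float *y, const float *b, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) z[i] = y[i] + b[i];
  }

  // Fused: the intermediate value never leaves the registers.
  __global__ void scale_add(float *z, const float *x, const float *b,
                            float a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) z[i] = a * x[i] + b[i];
  }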
Kernel fission

Kernel fission reduces resource consumption
• increases occupancy
• may allow using a different algorithm (e.g. if part of the algorithm uses a different amount of parallelism or a different amount of resources)

Correctness
• much easier, we just need to transfer data between the new kernels

Problem Choice

Before we start with code acceleration, we should consider carefully whether it is meaningful. The accelerated code should be
• critical for application performance (profile... and profile on real data)
• large enough (GPUs are usually not ideal for relatively simple but latency-critical applications)
• parallelizable (problematic e.g. in simulations of a small system evolving for a long time)
• performing a sufficient number of flops per memory transfer (consider the slow PCI-E)

Do we optimize running time or power consumption?
• accelerators are usually faster, but also have higher power consumption
• how do we deal with hybrid systems (e.g. CPU, GPU and Xeon Phi)?
• this influences the decision what to buy as well as what to use (and which resources to leave in power-saving mode)

Algorithm Design

Parallelization
• we need to parallelize the computational problem
• we should be aware of the target architecture even at this stage (consider e.g. graph algorithms)

It is difficult to accelerate codes on a GPU
• if threads within a warp access rather random addresses in memory
• if threads within a warp diverge (by the nature of the algorithm)
• if the parallelism is insufficient in certain parts of the computation

How to Write Bug-Free Code

Test whether API and kernel calls are successful
• otherwise, errors can appear later...

Memory allocation on the GPU is quite deterministic
• if your kernel does not write any result, you get a result from a previous run
• clear output arrays for debugging purposes

Be aware of out-of-bounds shared memory accesses
• the kernel usually runs successfully, but one block interferes with another
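As a hedged sketch of the first advice, a checking wrapper along these lines is common practice (the macro name is an assumption):

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  // Check the result of a CUDA runtime call; for kernels, check the launch via
  // cudaGetLastError() and optionally synchronize to catch execution errors.
  #define CUDA_CHECK(call)                                                \
      do {                                                                \
          cudaError_t err = (call);                                       \
          if (err != cudaSuccess) {                                       \
              fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                      cudaGetErrorString(err), __FILE__, __LINE__);       \
              exit(EXIT_FAILURE);                                         \
          }                                                               \
      } while (0)

  // usage:
  //   CUDA_CHECK(cudaMemcpy(d_v, h_v, bytes, cudaMemcpyHostToDevice));
  //   someKernel<<<grid, block>>>(d_v);
  //   CUDA_CHECK(cudaGetLastError());
  //   CUDA_CHECK(cudaDeviceSynchronize());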
Optimization

Start with the most important optimizations and continue with less important ones (so that the effect of the less important optimizations is not hidden). In general, this order should work well:
• PCI-E transfer reduction/overlap
• global memory access (bandwidth, latency)
• access to other types of memory
• parallelism configuration (block size, amount of serial work per thread)
• divergence
• instruction optimization

It is a good idea to write your code so that it is configurable
• block size, serial iterations per thread, loop unrolling factor, used algorithm...
• use macros or templates to allow optimization during compilation

Interpretation of Algorithm Performance

Some optimizations may be hidden
• e.g. optimizing instructions cannot help when the code is bound by bad global memory access
• this can be reduced by applying the more important optimizations earlier
• use the profiler

The optimization space is not continuous
• due to the restricted amount of GPU resources
• e.g. improving the efficiency of scalar code by using one more register may decrease the performance by restricting GPU occupancy

Performance is data-dependent
• data size: partition camping, underutilized GPU
• data content: sparse data with varying structure

What is the real speedup over CPU?

Comparison of theoretical peaks is the basic metric
• however, the speedup can be lower
  • insufficient parallelism
  • inappropriate data structures, random access
  • PCI-E bottleneck (especially for multi-GPU algorithms)
• however, the speedup can also be higher
  • frequent usage of SFUs
  • complicated vectorization on the CPU
  • insufficient scaling on SMP (cache interference, NUMA)
  • different scaling of CPU and GPU with growing problem size

Searching for Bottlenecks

The amount of arithmetic operations and memory transfers tells us what is expected to limit the algorithm
• sometimes the bottleneck is not clear (overhead instructions, irregular memory access)
• code profiling: suitable for identifying issues with instruction throughput or a bad memory access pattern, but it is more difficult to identify the source of latency problems
• code modifications: more precise, but more laborious and not usable in all cases

Profiling

How close is the code to the hardware limits?
• the profiler shows the overall utilization of particular GPU subsystems, such as the caches, global memory, FP instructions etc.

Issue identification
• the profiler detects some issues, such as shared memory bank conflicts or code divergence

We can inspect the code in detail
• time spent on particular instructions / C for CUDA lines of code
• we need to compile the code with the flag -lineinfo

Code Modifications

Global memory performance
• we comment out the computation
• but we need to somehow use the loaded data (to prevent the compiler from removing the loads)
• we can check with the profiler that the same amount of data is transferred

Instruction performance
• we comment out the data movement
• but the resulting data needs to be stored (to prevent the compiler from removing the computation)
• but we do not want to store the data...
• we can move the data store into a condition which is evaluated as false at run time (but not at compile time)
• be aware of execution overhead in the case of fast kernels
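A hedged sketch of that false-condition trick for measuring instruction performance (the kernel and variable names are assumptions): the host always passes flag = 0, so nothing is stored at run time, but the compiler cannot prove that the flag is false and therefore cannot remove the computation.

  // Instruction-only variant of a kernel: the input load is replaced by a
  // value derived from the thread index and the store is guarded by 'flag',
  // which the host always passes as 0 (false at run time, unknown at compile
  // time), so the arithmetic being measured is not optimized away.
  __global__ void kernel_compute_only(float *out, int n, int flag)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          float result = (float)i;
          for (int k = 0; k < 64; k++)     // the computation being measured
              result = result * 0.5f + k;
          if (flag)                        // never true during the measurement
              out[i] = result;
      }
  }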
Code Modifications

Be aware of occupancy changes
• code modifications can release some resources
• we can restrict occupancy by allocating a dummy shared memory array (a sketch follows at the end of this section)

Interpretation of the measured times
• the original kernel execution time is close to the sum of the compute-only and memory-only kernel times: latency is an issue
• the compute-only or the memory-only kernel time dominates and is close to the original kernel time: the performance is bound by computation or by memory
• the compute-only and memory-only kernel times are both similar to the original kernel time: we need to optimize both

Approximation of the optimization effect
• when we already know about some performance issue
• when we want to know the effect of an optimization before we actually implement it
• we can modify the code without preserving the original functionality, but preserving the amount of work and removing the performance issue
• this cannot be done in all cases
• it may show us whether we really address the performance issue
• see the matrix transposition example
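To illustrate the occupancy-restriction trick mentioned above, a hedged sketch (the kernel body is only a placeholder and the 8 KB size is an arbitrary assumption): the modified kernel declares dynamically allocated shared memory, and the size requested at launch time limits how many blocks can reside on a multiprocessor.

  // The dynamic shared memory requested at launch is not used by the kernel;
  // it only lowers the number of blocks that fit on a multiprocessor, so the
  // occupancy of the modified kernel can be matched to the original one.
  __global__ void kernel_modified(float *out, int n)
  {
      extern __shared__ char dummy[];
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = (float)i;   // placeholder for the measured work
  }

  // launch with e.g. 8 KB of dummy shared memory per block:
  //   kernel_modified<<<grid, block, 8192>>>(d_out, n);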