Code Optimizations
Jiří Filipovič, fall 2015
Reduction, Cross-kernel Optimizations, General Advices, Searching for Bottlenecks

Vector Reduction

Let v be a vector of size n. We want to compute $x = \sum_{i=1}^{n} v_i$.

C code (not very reasonable for floats):

  int x = 0;
  for (int i = 0; i < n; i++)
    x += v[i];

There is a flow dependency across iterations.
• we cannot be completely parallel
• addition is (at least in theory :-)) associative
• so we do not need to add the numbers in sequential order

Parallel Algorithm

The sequential algorithm performs seven steps:
  ((((((v_1 + v_2) + v_3) + v_4) + v_5) + v_6) + v_7) + v_8

Addition is associative... so let's reorder the brackets:
  ((v_1 + v_2) + (v_3 + v_4)) + ((v_5 + v_6) + (v_7 + v_8))

We can work in parallel now:
• four additions in the first step
• two additions in the second step
• one addition in the third step
In summary, we perform n - 1 additions in log2 n parallel steps!

We have found the parallel algorithm
• the same number of additions as the serial algorithm
• in logarithmic time (if we have enough cores)
We add results of previous additions
• flow dependency across threads
• we need a global barrier

Naive Approach

The simplest scheme of the algorithm:
• for every even i, i < n, perform v[i] += v[i+1]
• repeat with n /= 2 while n > 1

The performance is not ideal:
• 2n numbers loaded from global memory
• n numbers stored to global memory
• log2 n kernel invocations
We have three memory accesses per arithmetic operation and considerable kernel invocation overhead.
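The slides do not show code for this naive scheme, so here is a minimal sketch of one possible in-place implementation; the kernel name, the launch configuration and the stride-based indexing are assumptions, and n is assumed to be a power of two.

  // One reduction step of the naive scheme: element i (a multiple of 2*s)
  // accumulates v[i + s]. After log2(n) launches the result is in v[0].
  __global__ void reduce_naive_step(int *v, int n, int s)
  {
      unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int i = 2 * s * idx;        // each thread handles one pair
      if (i + s < n)
          v[i] += v[i + s];
  }

  // host side: log2(n) kernel invocations, about 2n loads and n stores in total
  void reduce_naive(int *d_v, int n)
  {
      for (int s = 1; s < n; s *= 2) {
          int pairs = n / (2 * s);
          int block = 256;
          int grid = (pairs + block - 1) / block;
          reduce_naive_step<<<grid, block>>>(d_v, n, s);
      }
  }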
Exploiting Data Locality

We can add more than pairs during a single kernel call:
• each block b_x loads m numbers into shared memory
• it reduces the input in shared memory in log2 m steps
• it stores only one number containing $\sum_{i=m \cdot b_x + 1}^{m \cdot (b_x+1)} v_i$

This reduces both the memory transfers and the number of kernel invocations:
• number of loads: $n + \frac{n}{m} + \frac{n}{m^2} + \dots + m = (n-1)\frac{m}{m-1}$
• approximately n + n/m numbers read, n/m numbers written
• log_m n kernel invocations

Implementation 1

  __global__ void reduce1(int *v){
    extern __shared__ int sv[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sv[tid] = v[i];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
      if (tid % (2*s) == 0)
        sv[tid] += sv[tid + s];
      __syncthreads();
    }
    if (tid == 0)
      v[blockIdx.x] = sv[0];
  }

Performance

Costly modulo operation. Possibly a high degree of divergence:
• during the first iteration, only half of the threads are working
• during the second iteration, only a quarter of the threads are working
• etc.
Performance on GTX 280: 3.77 GB/s (0.94 GElem/s).

Implementation 2

We will modify the indexing:

  for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x)
      sv[index] += sv[index + s];
    __syncthreads();
  }

Performance: 8.33 GB/s (2.08 GElem/s).
The code is free of modulo and divergence, but it generates shared memory bank conflicts.

Implementation 3

So we can try yet another indexing...

  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
      sv[tid] += sv[tid + s];
    __syncthreads();
  }

No divergence and no conflicts. Performance: 16.34 GB/s (4.08 GElem/s).
Half of the threads do not compute, though...

Implementation 4

We can add numbers already while loading them from global memory:

  unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
  sv[tid] = v[i] + v[i+blockDim.x];

Performance: 27.16 GB/s (6.79 GElem/s).
There is no problem with data access anymore, but the performance is still low, so we will focus on instructions.
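For clarity, here is a hedged sketch of how the whole kernel might look after the changes from implementations 3 and 4 are combined; the kernel name is an assumption, and it presumes the grid contains half as many blocks as for reduce1, since each block now consumes 2*blockDim.x input elements.

  // Sequential addressing (no divergence, no bank conflicts) combined with
  // the first addition performed already during the load from global memory.
  // Each block reduces 2*blockDim.x elements to one partial sum.
  __global__ void reduce4(int *v)
  {
      extern __shared__ int sv[];
      unsigned int tid = threadIdx.x;
      unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

      sv[tid] = v[i] + v[i + blockDim.x];   // first add during load
      __syncthreads();

      for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (tid < s)
              sv[tid] += sv[tid + s];       // sequential addressing
          __syncthreads();
      }

      if (tid == 0)
          v[blockIdx.x] = sv[0];            // one partial result per block
  }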
Implementation 5

The number of active threads decreases during the computation in shared memory.
• in the last six iterations, only the last warp is active
• the warp is synchronized implicitly, so we do not need __syncthreads() (but this is not a safe optimization)
• we need a volatile variable in this case
• the condition if (tid < s) does not spare any computation
So we can unroll the last warp...

  float mySum = sv[tid];
  for (unsigned int s = blockDim.x/2; s > 32; s >>= 1){
    if (tid < s)
      sv[tid] = mySum = mySum + sv[tid + s];
    __syncthreads();
  }
  if (tid < 32){
    volatile float *s = sv;
    s[tid] = mySum = mySum + s[tid + 32];
    s[tid] = mySum = mySum + s[tid + 16];
    s[tid] = mySum = mySum + s[tid + 8];
    s[tid] = mySum = mySum + s[tid + 4];
    s[tid] = mySum = mySum + s[tid + 2];
    s[tid] = mySum = mySum + s[tid + 1];
  }

We save time in all warps (the last warp is simpler, the others exit earlier from the for loop).
Performance: 37.68 GB/s (9.42 GElem/s).

For c.c. 3.0 or greater, we can use warp shuffle:

  if (tid < 32){
    mySum += sv[tid + 32];
    for (int offset = warpSize/2; offset > 0; offset /= 2)
      mySum += __shfl_down(mySum, offset);
  }

Implementation 6

Can we unroll the for loop? If we know the number of iterations, we can unroll it
• the number of iterations depends on the block size
Can we implement it generically?
• the algorithm uses blocks whose size is a power of two
• the block size is upper-bounded
• if we know the block size at compile time, we can use a template

  template <unsigned int blockSize>
  __global__ void reduce6(int *v)

Conditions using blockSize are evaluated at compile time:

  if (blockSize >= 512){
    if (tid < 256) sv[tid] += sv[tid + 256];
    __syncthreads();
  }
  if (blockSize >= 256){
    if (tid < 128) sv[tid] += sv[tid + 128];
    __syncthreads();
  }
  if (blockSize >= 128){
    if (tid < 64) sv[tid] += sv[tid + 64];
    __syncthreads();
  }

Performance: 50.64 GB/s (12.66 GElem/s).
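The slides do not show how the templated kernel is launched; the following host-side sketch (the function name and the set of supported block sizes are assumptions, following the approach of the CUDA SDK reduction example mentioned later) dispatches on the block size known at run time, so that each case instantiates the kernel with a compile-time constant.

  // Each branch instantiates reduce6 with a different compile-time block size,
  // so the blockSize conditions inside the kernel are resolved by the compiler.
  void launch_reduce6(int *d_v, unsigned int blocks, unsigned int threads,
                      unsigned int smemBytes)
  {
      switch (threads) {
      case 512: reduce6<512><<<blocks, 512, smemBytes>>>(d_v); break;
      case 256: reduce6<256><<<blocks, 256, smemBytes>>>(d_v); break;
      case 128: reduce6<128><<<blocks, 128, smemBytes>>>(d_v); break;
      case  64: reduce6< 64><<<blocks,  64, smemBytes>>>(d_v); break;
      }
  }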
Implementation 7

Can we implement a faster algorithm? Let's reconsider the complexity:
• log n parallel steps
• n - 1 additions
• time complexity for p threads running in parallel (using p processors): O(n/p + log n)

Cost of the parallel computation
• defined as the number of processors multiplied by the time complexity
• if we assign one thread to one data element, we get p = n
• and the cost is O(n · log n)
• which is not efficient

Decreasing the cost
• we use O(n / log n) threads
• each thread performs O(log n) sequential steps
• after that, it performs O(log n) parallel steps
• the time complexity is the same
• the cost is O(n)

What does it mean in practice?
• we reduce the overhead of the computation (e.g. integer arithmetic)
• it is an advantage when we have many more threads than are needed to saturate the GPU
• moreover, the thread execution overhead is reduced

We modify the loading into shared memory:

  unsigned int gridSize = blockSize*2*gridDim.x;
  sv[tid] = 0;
  while (i < n){
    sv[tid] += v[i] + v[i+blockSize];
    i += gridSize;
  }
  __syncthreads();

Performance: 77.21 GB/s (19.3 GElem/s).
You can find these implementations in the CUDA SDK.

Cross-kernel Optimizations

The compiler optimizes each kernel separately, so it may miss some optimization opportunities.
• kernel fusion: gluing code from several kernels into one kernel
• kernel fission: splitting a kernel into several smaller kernels

Kernel fusion

Performance impact of kernel fusion
• reduces kernel execution overhead
• may add more parallelism
• allows more scalar code optimizations: common subexpression elimination, loop fusion, condition fusion
• reduces global memory transfers if the kernels are flow-dependent or input-dependent

Correctness
• no flow dependency between thread blocks
• shared memory and register locality has to be maintained
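A hedged toy example of kernel fusion (the kernels are illustrative and not taken from the slides): two flow-dependent element-wise kernels are fused into one, so the intermediate value stays in a register instead of making a round trip through global memory, and one kernel launch is saved.

  // Unfused: y is written to global memory by the first kernel and read back
  // by the second one.
  __global__ void scale(float *y, const float *x, float a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i];
  }
  __global__ void add(float *z, const float *y, const float *b, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) z[i] = y[i] + b[i];
  }

  // Fused: the intermediate value never leaves the registers.
  __global__ void scale_add(float *z, const float *x, const float *b,
                            float a, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) z[i] = a * x[i] + b[i];
  }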
Kernel fission

Kernel fission reduces resource consumption
• increases occupancy
• may allow using a different algorithm (e.g. if part of the algorithm uses a different amount of parallelism or a different amount of resources)

Correctness
• much easier, we just need to transfer data between the new kernels

Problem Choice

Before we start with code acceleration, we should consider carefully whether it is meaningful. The accelerated code should be
• critical for application performance (profile... and profile on real data)
• large enough (GPUs are usually not ideal for relatively simple but latency-critical applications)
• parallelizable (problematic e.g. in simulations of a small system evolving for a long time)
• performing a sufficient number of flops per memory transfer (consider the slow PCI-E)

Do we optimize running time or power consumption?
• accelerators are usually faster, but also have higher power consumption
• how do we deal with hybrid systems (e.g. CPU, GPU and Xeon Phi)?
• this influences the decision what to buy as well as what to use (and which resources to leave in power-saving mode)

Algorithm Design

Parallelization
• we need to parallelize the computational problem
• we should be aware of the target architecture even at this stage (consider e.g. graph algorithms)

It is difficult to accelerate codes on a GPU
• if threads within a warp access rather random addresses in memory
• if threads within a warp diverge (by the nature of the algorithm)
• if the parallelism is insufficient in certain parts of the computation

How to Write Bug-Free Code

Test whether API and kernel calls are successful
• otherwise, errors can appear later...

Memory allocation on the GPU is quite deterministic
• if your kernel does not write any result, you get a result from a previous run
• clear output arrays for debugging purposes

Be aware of out-of-bounds shared memory accesses
• the kernel usually runs successfully, but one block interferes with another
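As a hedged sketch of the first advice, a checking wrapper along these lines is common practice (the macro name is an assumption):

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  // Check the result of a CUDA runtime call; for kernels, check the launch via
  // cudaGetLastError() and optionally synchronize to catch execution errors.
  #define CUDA_CHECK(call)                                                \
      do {                                                                \
          cudaError_t err = (call);                                       \
          if (err != cudaSuccess) {                                       \
              fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                      cudaGetErrorString(err), __FILE__, __LINE__);       \
              exit(EXIT_FAILURE);                                         \
          }                                                               \
      } while (0)

  // usage:
  //   CUDA_CHECK(cudaMemcpy(d_v, h_v, bytes, cudaMemcpyHostToDevice));
  //   someKernel<<<grid, block>>>(d_v);
  //   CUDA_CHECK(cudaGetLastError());
  //   CUDA_CHECK(cudaDeviceSynchronize());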
Optimization

Start with the most important optimizations and continue with less important ones (so that the effect of the less important optimizations is not hidden). In general, this order should work well:
• PCI-E transfer reduction/overlap
• global memory access (bandwidth, latency)
• access to other types of memory
• parallelism configuration (block size, amount of serial work per thread)
• divergence
• instruction optimization

It is a good idea to write your code so that it is configurable
• block size, serial iterations per thread, loop unrolling factor, used algorithm...
• use macros or templates to allow optimization during compilation

Interpretation of Algorithm Performance

Some optimizations may be hidden
• e.g. optimizing instructions cannot help when the code is bound by bad global memory access
• this can be reduced by applying the more important optimizations earlier
• use the profiler

The optimization space is not continuous
• due to the restricted amount of GPU resources
• e.g. improving the efficiency of scalar code by using one more register may decrease the performance by restricting GPU occupancy

Performance is data-dependent
• data size: partition camping, underutilized GPU
• data content: sparse data with varying structure

What is the real speedup over CPU?

Comparison of theoretical peaks is the basic metric
• however, the speedup can be lower
  • insufficient parallelism
  • inappropriate data structures, random access
  • PCI-E bottleneck (especially for multi-GPU algorithms)
• however, the speedup can also be higher
  • frequent usage of SFUs
  • complicated vectorization on the CPU
  • insufficient scaling on SMP (cache interference, NUMA)
  • different scaling of CPU and GPU with growing problem size

Searching for Bottlenecks

The amount of arithmetic operations and memory transfers tells us what is expected to limit the algorithm
• sometimes the bottleneck is not clear (overhead instructions, irregular memory access)
• code profiling: suitable for identifying issues with instruction throughput or a bad memory access pattern, but it is more difficult to identify the source of latency problems
• code modifications: more precise, but more laborious and not usable in all cases

Profiling

How close is the code to the hardware limits?
• the profiler shows the overall utilization of particular GPU subsystems, such as the caches, global memory, FP instructions etc.

Issue identification
• the profiler detects some issues, such as shared memory bank conflicts or code divergence

We can inspect the code in detail
• time spent on particular instructions / C for CUDA lines of code
• we need to compile the code with the flag -lineinfo

Code Modifications

Global memory performance
• we comment out the computation
• but we need to somehow use the loaded data (to prevent the compiler from removing the loads)
• we can check with the profiler that the same amount of data is transferred

Instruction performance
• we comment out the data movement
• but the resulting data needs to be stored (to prevent the compiler from removing the computation)
• but we do not want to store the data...
• we can move the data store into a condition which is evaluated as false at run time (but not at compile time)
• be aware of execution overhead in the case of fast kernels
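A hedged sketch of that false-condition trick for measuring instruction performance (the kernel and variable names are assumptions): the host always passes flag = 0, so nothing is stored at run time, but the compiler cannot prove that the flag is false and therefore cannot remove the computation.

  // Instruction-only variant of a kernel: the input load is replaced by a
  // value derived from the thread index and the store is guarded by 'flag',
  // which the host always passes as 0 (false at run time, unknown at compile
  // time), so the arithmetic being measured is not optimized away.
  __global__ void kernel_compute_only(float *out, int n, int flag)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          float result = (float)i;
          for (int k = 0; k < 64; k++)     // the computation being measured
              result = result * 0.5f + k;
          if (flag)                        // never true during the measurement
              out[i] = result;
      }
  }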
Code Modifications

Be aware of occupancy changes
• code modifications can release some resources
• we can restrict occupancy by allocating a dummy shared memory array (a sketch follows at the end of this section)

Interpretation of the measured times
• the original kernel execution time is close to the sum of the compute-only and memory-only kernel times: latency is an issue
• the compute-only or the memory-only kernel time dominates and is close to the original kernel time: the performance is bound by computation or by memory
• the compute-only and memory-only kernel times are both similar to the original kernel time: we need to optimize both

Approximation of the optimization effect
• when we already know about some performance issue
• when we want to know the effect of an optimization before we actually implement it
• we can modify the code without preserving the original functionality, but preserving the amount of work and removing the performance issue
• this cannot be done in all cases
• it may show us whether we really address the performance issue
• see the matrix transposition example
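To illustrate the occupancy-restriction trick mentioned above, a hedged sketch (the kernel body is only a placeholder and the 8 KB size is an arbitrary assumption): the modified kernel declares dynamically allocated shared memory, and the size requested at launch time limits how many blocks can reside on a multiprocessor.

  // The dynamic shared memory requested at launch is not used by the kernel;
  // it only lowers the number of blocks that fit on a multiprocessor, so the
  // occupancy of the modified kernel can be matched to the original one.
  __global__ void kernel_modified(float *out, int n)
  {
      extern __shared__ char dummy[];
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = (float)i;   // placeholder for the measured work
  }

  // launch with e.g. 8 KB of dummy shared memory per block:
  //   kernel_modified<<<grid, block, 8192>>>(d_out, n);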