Code optimization for GPUs
Jiří Filipovič, spring 2024

Parallelism

GPU Parallelism
Parallel algorithms need to be designed w.r.t. the parallelism available in the HW
• GPU: an array of SIMT multiprocessors working with shared memory
Decomposition for GPU
• coarse-grained decomposition of the problem into parts that do not need intensive communication
• fine-grained decomposition similar to vectorization (but SIMT is more flexible)

SIMT
A multiprocessor of G80 has one unit executing an instruction
• all 8 SPs have to execute the same instruction
• a new instruction is issued every 4 cycles
• 32 threads (a so-called warp) need to execute the same instruction; the warp size is fixed for all existing CUDA hardware
How about code branching?
• if different parts of a warp perform different instructions, they are serialized
• this decreases performance and should be avoided (a small example follows below)
The multiprocessor is thus (nearly) MIMD (Multiple-Instruction Multiple-Thread) from the programmer's perspective and SIMT (Single-Instruction Multiple-Thread) from the performance perspective.

Thread Properties
GPU threads are very lightweight compared to CPU threads
• their run time can be very short (even tens of instructions)
• there should be many of them
• they should not use a large amount of resources
Threads are aggregated into blocks
• all threads of a block always run on the same multiprocessor (multiple blocks can run on one multiprocessor)
• having a sufficient number of blocks is essential to achieve good scalability
The number of threads and thread blocks per multiprocessor is limited.
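The branch-serialization issue mentioned in the SIMT slide can be illustrated with a small kernel pair. This is a minimal sketch, not part of the original slides; the kernels and their names are made up for illustration.

__global__ void divergentBranch(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Odd and even lanes of the same warp take different branches,
    // so the two branches are executed serially.
    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

__global__ void warpUniformBranch(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // threadIdx.x / 32 is constant within a warp, so all 32 threads of a
    // warp take the same branch and no serialization occurs.
    if ((threadIdx.x / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}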
Memory Latency Masking
Memory has latency
• global memory has high latency (hundreds of cycles)
• registers and shared memory have read-after-write latency
Memory latency hiding is different from the CPU
• no instructions are executed out of order (but ILP can be exploited, since a load stalls only at the point where the loaded data are first needed)
• no or limited cache
When a warp waits for data from memory, another warp may be executed
• this allows memory latency hiding
• it requires executing more threads than the number of GPU cores
• thread scheduling and switching is implemented directly in HW, without overhead

Memory access optimization

Global Memory Access Optimization
The performance of global memory easily becomes a bottleneck
• global memory bandwidth is low relative to the arithmetic performance of the GPU (GT200 > 24 FLOPS/float, GF100 > 30, GK110 > 62, GM200 > 73, GP100 > 53, GV100 > 67, TU102 > 76, GA100 > 50, AD102 > 72, GH100 > 20, but > 295 with 32-bit tensor ops)
• 400–600 cycles of latency
The throughput can be significantly worse with a bad parallel access pattern
• the memory has to be accessed in a coalesced way
• using only a certain subset of memory regions should be avoided (partition camping)

Coalesced Memory Access (c.c. < 2.0)
A half-warp can transfer data using a single transaction, or one to two transactions when transferring a 128 B word
• it is necessary to use large words
• one memory transaction can transfer aligned 32 B, 64 B, or 128 B words
GPUs with c.c. < 1.2
• the accessed block has to begin at an address divisible by 16× the data size
• the k-th thread has to access the k-th block element
• some threads may not participate
• if these rules are not obeyed, each element is retrieved using a separate memory transaction

Coalesced Memory Access (c.c. < 2.0)
GPUs with c.c. ≥ 1.2 are less restrictive
• each transfer is split into 32 B, 64 B, or 128 B transactions so that all requests are served with the least number of transactions
• the order of threads can be arbitrarily permuted w.r.t. the transferred elements

Coalesced Memory Access (c.c. < 2.0)
Threads are aligned, the element block is contiguous, the order is not permuted – coalesced access on all GPUs
[Figure: aligned, contiguous, non-permuted access by one warp]
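A minimal sketch (not from the slides) contrasting the two access patterns discussed above; the kernel names are illustrative and the input is assumed to be large enough for the strided variant.

__global__ void coalescedRead(float *out, const float *in)
{
    // Consecutive threads read consecutive 4 B elements:
    // one warp touches a single aligned, contiguous 128 B block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void stridedRead(float *out, const float *in, int stride)
{
    // Consecutive threads read elements `stride` apart:
    // one warp touches many memory segments, so more transactions are
    // needed and the effective bandwidth drops.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}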
Unaligned Memory Access (c.c. < 2.0)
Threads of one warp access a contiguous block that is not aligned to a transaction boundary
[Figure: contiguous but unaligned access by one warp]

Unaligned Memory Access (c.c. < 2.0)
A similar case may result in a need for two transactions
[Figure: unaligned access spanning two aligned segments]

Unaligned Memory Access Performance (c.c. < 2.0)
Older GPUs perform the smallest possible transfer (32 B) for each element, thus reducing performance to 1/8.
Newer GPUs (c.c. ≥ 1.2) perform two transfers.
[Figure: achieved bandwidth vs. offset, 0–16]

Interleaved Memory Access Performance (c.c. < 2.0)
The bigger the spaces between elements, the bigger the performance drop on GPUs with c.c. ≥ 1.2 – the effect is rather dramatic.
[Figure: achieved bandwidth of GTX 280 and FX 5800 vs. stride]

Global Memory Access with Fermi (c.c. = 2.0)
Fermi has L1 and L2 cache
• L1: 256 B per line, 16 kB or 48 kB per multiprocessor in total
• L2: 32 B per line, 768 kB on the GPU in total
What are the advantages?
• more efficient programs with unpredictable data locality
• more efficient when shared memory is not used for some reason
• unaligned access – in principle no slowdown
• interleaved access – data needs to be used before it is flushed from the cache, otherwise the problem is the same as or bigger than with c.c. < 2.0 (L1 cache may be turned off to avoid overfetching)

Global Memory Access with "gaming" Kepler (c.c. = 3.0)
There is only the L2 cache for read/write global memory access
• L2: 32 B per line, up to 1.5 MB per GPU
• L1: used for local memory only, 16 kB, 32 kB or 48 kB in total

Global Memory Access with fully-featured Kepler, Maxwell and Pascal (3.5 ≤ c.c. ≤ 6.0)
Read-only data cache
• shared with textures
• the compiler tries to use it; we can help with __restrict__ and __ldg() (see the sketch below)
• slower than Fermi's L1
No L1 cache for local memory
• inefficient for programs heavily using local memory

Global Memory Access with Volta and newer (c.c. > 6.0)
The faster L1 is back
• the same hardware as shared memory, but for read-only buffers
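A minimal sketch (not from the slides) of helping the compiler route reads through the read-only data cache; the kernel name and the scale-and-store operation are illustrative.

__global__ void scaleCopy(float * __restrict__ out,
                          const float * __restrict__ in,
                          float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // const + __restrict__ lets the compiler use the read-only cache;
        // __ldg() (c.c. >= 3.5) forces the read-only path explicitly.
        out[i] = alpha * __ldg(&in[i]);
    }
}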
HW Organization of Shared Memory
Shared memory is organized into memory banks, which can be accessed in parallel
• c.c. 1.x: 16 banks, c.c. ≥ 2.0: 32 banks
• the memory space is mapped onto the banks in an interleaved way, with 32-bit or 64-bit (c.c. 3.x) granularity
• to use the full memory performance, we have to access data in different banks
• broadcast is implemented – if all threads access the same data

Bank Conflict
A bank conflict
• occurs when some threads in a warp/half-warp access data in the same memory bank, with several exceptions
  • threads access exactly the same data
  • threads access different half-words of a 64-bit word (c.c. 3.x)
• when it occurs, the memory access gets serialized
• the performance drop is proportional to the number of parallel operations that the memory has to perform to serve the request

Access without Conflicts
[Figure: conflict-free access pattern]

n-Way Conflicts
[Figure: n-way bank conflict access pattern]

Broadcast
[Figure: broadcast access pattern]

Access Patterns
Alignment is not needed; bank conflicts are not generated:

int x = s[threadIdx.x + offset];

Interleaving does not create conflicts if c is odd; for c.c. ≥ 3.0 there is also no conflict if c = 2 and 32-bit numbers are accessed:

int x = s[threadIdx.x * c];

Access to the same variable never generates conflicts on c.c. 2.x, while on 1.x only if the number of threads accessing the variable is a multiple of 16:

int x = s[threadIdx.x / c];

CPU–GPU Memory Transfers
Transfers between host and GPU memory
• need to be minimized (often at the cost of decreasing the efficiency of the computation on the GPU)
• may be accelerated using page-locked memory
• it is more efficient to transfer large blocks at once
• computations and memory transfers should be overlapped (see the sketch below)
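A minimal sketch (not from the slides) of page-locked allocation and of overlapping transfers with kernel execution using streams; the kernel `process`, the chunking into two halves, and the block size are made up for illustration.

#include <cuda_runtime.h>

__global__ void process(float *data, int n) {   // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run(float *result, int N) {                // assumes N is even
    float *host, *dev;
    cudaMallocHost(&host, N * sizeof(float));   // page-locked host memory
    cudaMalloc(&dev, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int chunk = N / 2;
    for (int c = 0; c < 2; c++) {
        // the copy of one chunk can overlap with the kernel of the other chunk
        cudaMemcpyAsync(dev + c*chunk, host + c*chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        process<<<(chunk + 255) / 256, 256, 0, s[c]>>>(dev + c*chunk, chunk);
    }
    cudaDeviceSynchronize();
}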
Matrix Transposition
From a theoretical perspective:
• a trivial problem
• a trivial parallelization
• trivially limited by the memory throughput (no arithmetic ops done)

__global__ void mtran(float *odata, float *idata, int n){
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  odata[x*n + y] = idata[y*n + x];
}

Performance
When running the code on a GeForce GTX 280 with a large enough matrix (4000 × 4000), the throughput is 5.3 GB/s.
Where's the problem?
Access to odata is interleaved. After a modification (copying instead of transposing the matrix):

odata[y*n + x] = idata[y*n + x];

the throughput is 112.4 GB/s. If idata were accessed in an interleaved way too, the resulting throughput would be 2.7 GB/s.

Removing Interleaving
The matrix can be processed per tiles
• we read a tile into shared memory row-wise
• we store its transposition into global memory row-wise
• thus both reading and writing are done without interleaving
What size of tiles should be used?
• let's consider square tiles for simplicity
• for aligned reading, the row size has to be a multiple of 16
• we can consider tile sizes of 16 × 16, 32 × 32, and 48 × 48 because of shared memory size limitations
• the best size can be determined experimentally

Tiled Transposition

__global__ void mtran_coalesced(float *odata, float *idata, int n){
  __shared__ float tile[TILE_DIM][TILE_DIM];
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int index_in = x + y*n;
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  int index_out = x + y*n;
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];
  __syncthreads();
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}
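For completeness, a host-side launch sketch for the kernel above (not shown in the slides); the constants correspond to the best configuration reported on the next slide, the wrapper name is made up, and n is assumed to be a multiple of TILE_DIM.

// assumed to be defined before the kernel above:
// #define TILE_DIM 32
// #define BLOCK_ROWS 8
void transposeLaunch(float *odata, float *idata, int n) {
    dim3 block(TILE_DIM, BLOCK_ROWS);        // 32 x 8 threads per block
    dim3 grid(n / TILE_DIM, n / TILE_DIM);   // one block per 32 x 32 tile
    mtran_coalesced<<<grid, block>>>(odata, idata, n);
}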
Performance
The highest performance was measured for a 32 × 32 tile size and a 32 × 8 thread block size – 75.1 GB/s
• that's significantly better, but still less than simple copying
• the kernel is more complex and contains synchronization
• we need to figure out whether we reached the maximum or there's still a problem somewhere
• if we only copy within the blocks, we get 94.9 GB/s
• something is still sub-optimal

Shared Memory
When reading from global memory, we write into shared memory row-wise:

tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];

When writing to global memory, we read from shared memory column-wise:

odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];

That is reading with a stride which is a multiple of 16, so the whole column resides in a single memory bank – thus creating a 16-way bank conflict.
A solution is padding:

__shared__ float tile[TILE_DIM][TILE_DIM + 1];

Performance
Now our implementation shows 93.4 GB/s
• as good as simple copying
• it seems we can't do much better for the given matrix
• beware of different input data sizes (partition camping)

Performance Summary
All optimizations were aimed only at better accommodating HW properties
• nevertheless, we got a 17.6× speedup
• when designing an algorithm, it is necessary to understand HW limitations
• otherwise we wouldn't have to develop specifically for GPUs...
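A minimal sketch (not from the slides) of how effective throughput figures like those above can be measured with CUDA events; it assumes grid, block, odata, idata, and n are set up as in the launch sketch earlier, and that the matrix is read and written exactly once.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
mtran_coalesced<<<grid, block>>>(odata, idata, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);
// n*n floats read + n*n floats written
double gbps = 2.0 * n * n * sizeof(float) / (ms * 1.0e6);
printf("throughput: %.1f GB/s\n", gbps);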
Instructions Speed

Instructions Performance
Some instructions are relatively slower than on CPUs
• integer division and modulo
• 32-bit integer multiplication on c.c. 1.x
• 24-bit integer multiplication on c.c. 2.x and newer
Some instructions are relatively faster than on CPUs
• lower-precision arithmetic performed by the SFUs
• __sinf(x), __cosf(x), __expf(x), __sincosf(x), __rsqrtf(x), etc.

Small loops have significant overhead
• jumps
• conditions
• control variable updates
• a significant part of the instructions may be pointer arithmetic
• low ILP
Loop unrolling is an option (see the sketch below)
• it may partially be done by the compiler
• we can unroll manually or use #pragma unroll
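A minimal sketch (not from the slides) of explicit unrolling; the kernel and the fixed trip count of four elements per thread are made up for illustration.

__global__ void axpyUnrolled(float *y, const float *x, float a)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    // The trip count is a compile-time constant, so the compiler can fully
    // unroll the loop and remove the jump, condition, and counter updates.
    #pragma unroll
    for (int j = 0; j < 4; j++)
        y[base + j] += a * x[base + j];
}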
Reduction

Vector Reduction
Let v be a vector of size n. We want to compute x = v_1 + v_2 + ... + v_n, i.e. the sum of all its elements.
C code (not very reasonable for floats):

int x = 0;
for (int i = 0; i < n; i++)
  x += v[i];

There is a flow dependency across iterations
• we cannot compute completely in parallel
• addition is (at least in theory :-)) associative
• so we do not need to add the numbers in sequential order

Parallel Algorithm
The sequential algorithm performs seven steps (for n = 8):
((((((v1 + v2) + v3) + v4) + v5) + v6) + v7) + v8
Addition is associative... so let's reorder the brackets:
((v1 + v2) + (v3 + v4)) + ((v5 + v6) + (v7 + v8))
We can work in parallel now:
• four additions in the first step
• two additions in the second step
• one addition in the third step
In summary, we perform n − 1 additions in log2 n parallel steps!

Parallel Algorithm
We have found a parallel algorithm
• the same number of additions as the serial algorithm
• in logarithmic time (if we have enough cores)
We add results of previous additions
• flow dependency across threads
• we need a global barrier

Naive Approach
The simplest scheme of the algorithm:
• for every even i, i < n, perform v[i] += v[i+1]
• halve n and repeat while n > 1
The performance is not ideal
• 2n numbers loaded from global memory
• n numbers stored to global memory
• log2 n kernel invocations
We have three memory accesses per one arithmetic operation and considerable kernel invocation overhead.

Exploiting Data Locality
We can add more than pairs during a single kernel call
• each block b_x loads m numbers into shared memory
• it reduces the input (in shared memory, in log2 m steps)
• it stores only one number: the partial sum of its m elements
Reduces both memory transfers and the number of kernel invocations
• number of loads over all invocations: n + n/m + n/m² + ...
• approximately n + n/m numbers read, n/m numbers written
• log_m n kernel invocations (a host-side sketch of this invocation loop follows Implementation 2 below)

Implementation 1

__global__ void reduce1(int *v){
  extern __shared__ int sv[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
  sv[tid] = v[i];
  __syncthreads();
  for (unsigned int s = 1; s < blockDim.x; s *= 2){
    if (tid % (2*s) == 0)
      sv[tid] += sv[tid + s];
    __syncthreads();
  }
  if (tid == 0)
    v[blockIdx.x] = sv[0];
}

Performance
Beware of the modulo operation. There is a high degree of divergence
• during the first iteration, only half of the threads are working
• during the second iteration, only a quarter of the threads are working
• etc.
Performance on GTX 280: 3.77 GB/s (0.94 GElem/s).

Implementation 2
We modify the indexing:

for (unsigned int s = 1; s < blockDim.x; s *= 2){
  int index = 2 * s * tid;
  if (index < blockDim.x)
    sv[index] += sv[index + s];
  __syncthreads();
}

Performance: 8.33 GB/s (2.08 GElem/s).
The code is free of modulo and divergence, but it generates shared memory bank conflicts.
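The slides mention log_m n kernel invocations but do not show the host loop; below is a minimal sketch of how reduce1 might be invoked repeatedly until a single value remains. This is an assumption, not code from the slides, and it assumes n is a power of the block size.

void reduceOnHost(int *v_dev, int n, int block) {
    // each pass reduces every group of `block` consecutive elements into one
    while (n > 1) {
        int blocks = n / block;                        // assumes n % block == 0
        reduce1<<<blocks, block, block * sizeof(int)>>>(v_dev);
        n = blocks;   // partial sums are written back to v_dev[0..blocks)
    }
    // v_dev[0] now holds the total sum
}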
Implementation 3
So we can try yet another indexing:

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1){
  if (tid < s)
    sv[tid] += sv[tid + s];
  __syncthreads();
}

No divergence and no conflicts. Performance: 16.34 GB/s (4.08 GElem/s).
Half of the threads do not compute anything, though...

Implementation 4
We can add numbers already while loading them from global memory:

unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sv[tid] = v[i] + v[i+blockDim.x];

Performance: 27.16 GB/s (6.79 GElem/s).
There is no problem with data access anymore, but the performance is still low – we will focus on instructions.

Implementation 5
The number of active threads decreases during the computation in shared memory
• in the last six iterations, only the last warp is active
• the warp is synchronized implicitly on GPUs with c.c. < 7.0, so we do not need __syncthreads()
• we need a volatile variable in this case
• the condition if (tid < s) does not spare any computation
So we can unroll the last warp...

Implementation 5

float mySum = sv[tid];   // this thread's current partial sum
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1){
  if (tid < s)
    sv[tid] = mySum = mySum + sv[tid + s];
  __syncthreads();
}
if (tid < 32){
  volatile float *s = sv;
  s[tid] = mySum = mySum + s[tid + 32];  // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 16];  // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 8];   // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 4];   // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 2];   // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 1];
}

We save time in all warps (the last warp is simpler; the others exit the for loop earlier).
Performance: 37.68 GB/s (9.42 GElem/s).

Implementation 5 (warp shuffle)
For c.c. 3.0 or greater, we can use warp shuffle instead:

if (tid < 32){
  mySum += sv[tid + 32];
  for (int offset = warpSize/2; offset > 0; offset /= 2)
    mySum += __shfl_down_sync(0xffffffff, mySum, offset);
}

This is safe on all GPUs, as it does not rely on implicit intra-warp synchronization.

Implementation 6
Can we unroll the for loop? If we know the number of iterations, we can unroll it
• the number of iterations depends on the block size
Can we implement it generically?
• the algorithm uses blocks of size 2^n
• the block size has an upper bound
• if we know the block size at compile time, we can pass it as a template parameter (a host-side dispatch sketch appears at the end of this section):

template <unsigned int blockSize>
__global__ void reduce6(int *v)

Implementation 6
Conditions using blockSize are evaluated at compile time:

if (blockSize >= 512){
  if (tid < 256) sv[tid] += sv[tid + 256];
  __syncthreads();
}
if (blockSize >= 256){
  if (tid < 128) sv[tid] += sv[tid + 128];
  __syncthreads();
}
if (blockSize >= 128){
  if (tid < 64) sv[tid] += sv[tid + 64];
  __syncthreads();
}

Performance: 50.64 GB/s (12.66 GElem/s).

Implementation 7
Can we implement a faster algorithm? Let's reconsider the complexity:
• log n parallel steps
• n − 1 additions
• time complexity for p threads running in parallel (using p processors): O(n/p + log n)
Cost of the parallel computation
• defined as the number of processors multiplied by the time complexity
• if we assign one thread to one data element, we get p = n
• and the cost is O(n · log n)
• which is not efficient

Implementation 7
Decreasing the cost
• we use O(n / log n) threads
• each thread performs O(log n) sequential steps
• after that, it performs O(log n) parallel steps
• the time complexity stays the same
• the cost is O(n)
What does it mean in practice?
• we reduce the overhead of the computation (e.g., integer arithmetic)
• it is an advantage if we have many more threads than needed to saturate the GPU

Implementation 7
We modify the loading into shared memory:

unsigned int gridSize = blockSize*2*gridDim.x;
sv[tid] = 0;
while (i < n){
  sv[tid] += v[i] + v[i+blockSize];
  i += gridSize;
}
__syncthreads();

Performance: 77.21 GB/s (19.3 GElem/s).
You can find these implementations in the CUDA SDK.
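As promised above, a minimal sketch (not from the slides) of dispatching the runtime block size to the reduce6 template on the host; the set of supported block sizes is an assumption.

void launchReduce6(int *v_dev, int blocks, int threads) {
    // the switch selects a compile-time specialization for the runtime block size
    size_t smem = threads * sizeof(int);
    switch (threads) {
        case 512: reduce6<512><<<blocks, 512, smem>>>(v_dev); break;
        case 256: reduce6<256><<<blocks, 256, smem>>>(v_dev); break;
        case 128: reduce6<128><<<blocks, 128, smem>>>(v_dev); break;
        case  64: reduce6< 64><<<blocks,  64, smem>>>(v_dev); break;
    }
}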
General Advice

Problem Choice
Before we start accelerating code, we should consider carefully whether it is meaningful. The accelerated code should be
• critical for application performance (profile... and profile on real data)
• large enough (GPUs are usually not ideal for relatively simple but latency-critical applications)
• parallelizable (problematic, e.g., in a simulation of a small system evolving for a long time)
• doing a sufficient number of flops per memory transfer (consider the slow PCI-E)

Parallelization
Vector addition
• simple to formulate as data-parallel code
• no synchronization
Reduction
• may seem sequential at first glance
• but can be computed in log n steps
• needs some thinking about how to parallelize

Code Divergence
Code divergence
• hurts performance when divergence occurs within a warp
• it can be easy to remove
  • reduction
• but also hard to remove
  • graph processing, many independent automata
• sometimes a different algorithm needs to be formulated

Scattered Memory Access
Scattered memory access
• if memory is not accessed in a coalesced way, the memory bandwidth is reduced
• often difficult to work around
  • graph processing
• sometimes we can investigate different data structures
  • sparse matrices
• with more regular structures, we can use shared memory
  • matrix transposition