Code optimization for GPUs
Jiří Filipovič, spring 2024

Parallelism

GPU Parallelism
Parallel algorithms need to be designed w.r.t. the parallelism available in the HW
• GPU: an array of SIMT multiprocessors working with shared memory
Decomposition for GPU
• coarse-grained decomposition of the problem into parts that do not need intensive communication
• fine-grained decomposition similar to vectorization (but SIMT is more flexible)

SIMT
A multiprocessor of G80 has one unit executing an instruction
• all 8 SPs have to execute the same instruction
• a new instruction is issued every 4 cycles
• 32 threads (a so-called warp) need to execute the same instruction; the warp size is fixed for all existing CUDA hardware
How about code branching?
• if different parts of a warp perform different instructions, they are serialized
• this decreases performance and should be avoided (a small example follows below)
The multiprocessor is thus (nearly) MIMD (Multiple-Instruction Multiple-Thread) from the programmer's perspective and SIMT (Single-Instruction Multiple-Thread) from the performance perspective.

Thread Properties
GPU threads are very lightweight compared to CPU threads
• their run time can be very short (even tens of instructions)
• there should be many of them
• they should not use a large amount of resources
Threads are aggregated into blocks
• all threads of a block always run on the same multiprocessor (multiple blocks can run on one multiprocessor)
• having a sufficient number of blocks is essential to achieve good scalability
The number of threads and thread blocks per multiprocessor is limited.
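The branch-serialization issue mentioned in the SIMT slide can be illustrated with a small kernel pair. This is a minimal sketch, not part of the original slides; the kernels and their names are made up for illustration.

__global__ void divergentBranch(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Odd and even lanes of the same warp take different branches,
    // so the two branches are executed serially.
    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}

__global__ void warpUniformBranch(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // threadIdx.x / 32 is constant within a warp, so all 32 threads of a
    // warp take the same branch and no serialization occurs.
    if ((threadIdx.x / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}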
Memory Latency Masking
Memory has latency
• global memory has high latency (hundreds of cycles)
• registers and shared memory have read-after-write latency
Memory latency hiding is different from the CPU
• no instructions are executed out of order (but ILP can be exploited, since a load stalls only at the point where the loaded data are first needed)
• no or limited cache
When a warp waits for data from memory, another warp may be executed
• this allows memory latency hiding
• it requires executing more threads than the number of GPU cores
• thread scheduling and switching is implemented directly in HW, without overhead

Memory access optimization

Global Memory Access Optimization
The performance of global memory easily becomes a bottleneck
• global memory bandwidth is low relative to the arithmetic performance of the GPU (GT200 > 24 FLOPS/float, GF100 > 30, GK110 > 62, GM200 > 73, GP100 > 53, GV100 > 67, TU102 > 76, GA100 > 50, AD102 > 72, GH100 > 20, but > 295 with 32-bit tensor ops)
• 400–600 cycles of latency
The throughput can be significantly worse with a bad parallel access pattern
• the memory has to be accessed in a coalesced way
• using only a certain subset of memory regions should be avoided (partition camping)

Coalesced Memory Access (c.c. < 2.0)
A half-warp can transfer data using a single transaction, or one to two transactions when transferring a 128 B word
• it is necessary to use large words
• one memory transaction can transfer aligned 32 B, 64 B, or 128 B words
GPUs with c.c. < 1.2
• the accessed block has to begin at an address divisible by 16× the data size
• the k-th thread has to access the k-th block element
• some threads may not participate
• if these rules are not obeyed, each element is retrieved using a separate memory transaction

Coalesced Memory Access (c.c. < 2.0)
GPUs with c.c. ≥ 1.2 are less restrictive
• each transfer is split into 32 B, 64 B, or 128 B transactions so that all requests are served with the least number of transactions
• the order of threads can be arbitrarily permuted w.r.t. the transferred elements

Coalesced Memory Access (c.c. < 2.0)
Threads are aligned, the element block is contiguous, the order is not permuted – coalesced access on all GPUs
[Figure: aligned, contiguous, non-permuted access by one warp]
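A minimal sketch (not from the slides) contrasting the two access patterns discussed above; the kernel names are illustrative and the input is assumed to be large enough for the strided variant.

__global__ void coalescedRead(float *out, const float *in)
{
    // Consecutive threads read consecutive 4 B elements:
    // one warp touches a single aligned, contiguous 128 B block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void stridedRead(float *out, const float *in, int stride)
{
    // Consecutive threads read elements `stride` apart:
    // one warp touches many memory segments, so more transactions are
    // needed and the effective bandwidth drops.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}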
Unaligned Memory Access (c.c. < 2.0)
Threads of one warp access a contiguous block that is not aligned to a transaction boundary
[Figure: contiguous but unaligned access by one warp]

Unaligned Memory Access (c.c. < 2.0)
A similar case may result in a need for two transactions
[Figure: unaligned access spanning two aligned segments]

Unaligned Memory Access Performance (c.c. < 2.0)
Older GPUs perform the smallest possible transfer (32 B) for each element, thus reducing performance to 1/8.
Newer GPUs (c.c. ≥ 1.2) perform two transfers.
[Figure: achieved bandwidth vs. offset, 0–16]

Interleaved Memory Access Performance (c.c. < 2.0)
The bigger the spaces between elements, the bigger the performance drop on GPUs with c.c. ≥ 1.2 – the effect is rather dramatic.
[Figure: achieved bandwidth of GTX 280 and FX 5800 vs. stride]

Global Memory Access with Fermi (c.c. = 2.0)
Fermi has L1 and L2 cache
• L1: 256 B per line, 16 kB or 48 kB per multiprocessor in total
• L2: 32 B per line, 768 kB on the GPU in total
What are the advantages?
• more efficient programs with unpredictable data locality
• more efficient when shared memory is not used for some reason
• unaligned access – in principle no slowdown
• interleaved access – data needs to be used before it is flushed from the cache, otherwise the problem is the same as or bigger than with c.c. < 2.0 (L1 cache may be turned off to avoid overfetching)

Global Memory Access with "gaming" Kepler (c.c. = 3.0)
There is only the L2 cache for read/write global memory access
• L2: 32 B per line, up to 1.5 MB per GPU
• L1: used for local memory only, 16 kB, 32 kB or 48 kB in total

Global Memory Access with fully-featured Kepler, Maxwell and Pascal (3.5 ≤ c.c. ≤ 6.0)
Read-only data cache
• shared with textures
• the compiler tries to use it; we can help with __restrict__ and __ldg() (see the sketch below)
• slower than Fermi's L1
No L1 cache for local memory
• inefficient for programs heavily using local memory

Global Memory Access with Volta and newer (c.c. > 6.0)
The faster L1 is back
• the same hardware as shared memory, but for read-only buffers
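A minimal sketch (not from the slides) of helping the compiler route reads through the read-only data cache; the kernel name and the scale-and-store operation are illustrative.

__global__ void scaleCopy(float * __restrict__ out,
                          const float * __restrict__ in,
                          float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // const + __restrict__ lets the compiler use the read-only cache;
        // __ldg() (c.c. >= 3.5) forces the read-only path explicitly.
        out[i] = alpha * __ldg(&in[i]);
    }
}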
HW Organization of Shared Memory
Shared memory is organized into memory banks, which can be accessed in parallel
• c.c. 1.x: 16 banks, c.c. ≥ 2.0: 32 banks
• the memory space is mapped onto the banks in an interleaved way, with 32-bit or 64-bit (c.c. 3.x) granularity
• to use the full memory performance, we have to access data in different banks
• broadcast is implemented – if all threads access the same data

Bank Conflict
A bank conflict
• occurs when some threads in a warp/half-warp access data in the same memory bank, with several exceptions
  • threads access exactly the same data
  • threads access different half-words of a 64-bit word (c.c. 3.x)
• when it occurs, the memory access gets serialized
• the performance drop is proportional to the number of parallel operations that the memory has to perform to serve the request

Access without Conflicts
[Figure: conflict-free access pattern]

n-Way Conflicts
[Figure: n-way bank conflict access pattern]

Broadcast
[Figure: broadcast access pattern]

Access Patterns
Alignment is not needed; bank conflicts are not generated:

int x = s[threadIdx.x + offset];

Interleaving does not create conflicts if c is odd; for c.c. ≥ 3.0 there is also no conflict if c = 2 and 32-bit numbers are accessed:

int x = s[threadIdx.x * c];

Access to the same variable never generates conflicts on c.c. 2.x, while on 1.x only if the number of threads accessing the variable is a multiple of 16:

int x = s[threadIdx.x / c];

CPU–GPU Memory Transfers
Transfers between host and GPU memory
• need to be minimized (often at the cost of decreasing the efficiency of the computation on the GPU)
• may be accelerated using page-locked memory
• it is more efficient to transfer large blocks at once
• computations and memory transfers should be overlapped (see the sketch below)
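A minimal sketch (not from the slides) of page-locked allocation and of overlapping transfers with kernel execution using streams; the kernel `process`, the chunking into two halves, and the block size are made up for illustration.

#include <cuda_runtime.h>

__global__ void process(float *data, int n) {   // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run(float *result, int N) {                // assumes N is even
    float *host, *dev;
    cudaMallocHost(&host, N * sizeof(float));   // page-locked host memory
    cudaMalloc(&dev, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int chunk = N / 2;
    for (int c = 0; c < 2; c++) {
        // the copy of one chunk can overlap with the kernel of the other chunk
        cudaMemcpyAsync(dev + c*chunk, host + c*chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        process<<<(chunk + 255) / 256, 256, 0, s[c]>>>(dev + c*chunk, chunk);
    }
    cudaDeviceSynchronize();
}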
Matrix Transposition
From a theoretical perspective:
• a trivial problem
• a trivial parallelization
• trivially limited by the memory throughput (no arithmetic ops done)

__global__ void mtran(float *odata, float *idata, int n){
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  odata[x*n + y] = idata[y*n + x];
}

Performance
When running the code on a GeForce GTX 280 with a large enough matrix (4000 × 4000), the throughput is 5.3 GB/s.
Where's the problem?
Access to odata is interleaved. After a modification (copying instead of transposing the matrix):

odata[y*n + x] = idata[y*n + x];

the throughput is 112.4 GB/s. If idata were accessed in an interleaved way too, the resulting throughput would be 2.7 GB/s.

Removing Interleaving
The matrix can be processed per tiles
• we read a tile into shared memory row-wise
• we store its transposition into global memory row-wise
• thus both reading and writing are done without interleaving
What size of tiles should be used?
• let's consider square tiles for simplicity
• for aligned reading, the row size has to be a multiple of 16
• we can consider tile sizes of 16 × 16, 32 × 32, and 48 × 48 because of shared memory size limitations
• the best size can be determined experimentally

Tiled Transposition

__global__ void mtran_coalesced(float *odata, float *idata, int n){
  __shared__ float tile[TILE_DIM][TILE_DIM];
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int index_in = x + y*n;
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  int index_out = x + y*n;
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];
  __syncthreads();
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}
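For completeness, a host-side launch sketch for the kernel above (not shown in the slides); the constants correspond to the best configuration reported on the next slide, the wrapper name is made up, and n is assumed to be a multiple of TILE_DIM.

// assumed to be defined before the kernel above:
// #define TILE_DIM 32
// #define BLOCK_ROWS 8
void transposeLaunch(float *odata, float *idata, int n) {
    dim3 block(TILE_DIM, BLOCK_ROWS);        // 32 x 8 threads per block
    dim3 grid(n / TILE_DIM, n / TILE_DIM);   // one block per 32 x 32 tile
    mtran_coalesced<<<grid, block>>>(odata, idata, n);
}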
Performance
The highest performance was measured for a 32 × 32 tile size and a 32 × 8 thread block size – 75.1 GB/s
• that's significantly better, but still less than simple copying
• the kernel is more complex and contains synchronization
• we need to figure out whether we reached the maximum or there's still a problem somewhere
• if we only copy within the blocks, we get 94.9 GB/s
• something is still sub-optimal

Shared Memory
When reading from global memory, we write into shared memory row-wise:

tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];

When writing to global memory, we read from shared memory column-wise:

odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];

That is reading with a stride which is a multiple of 16, so the whole column resides in a single memory bank – thus creating a 16-way bank conflict.
A solution is padding:

__shared__ float tile[TILE_DIM][TILE_DIM + 1];

Performance
Now our implementation shows 93.4 GB/s
• as good as simple copying
• it seems we can't do much better for the given matrix
• beware of different input data sizes (partition camping)

Performance Summary
All optimizations were aimed only at better accommodating HW properties
• nevertheless, we got a 17.6× speedup
• when designing an algorithm, it is necessary to understand HW limitations
• otherwise we wouldn't have to develop specifically for GPUs...
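A minimal sketch (not from the slides) of how effective throughput figures like those above can be measured with CUDA events; it assumes grid, block, odata, idata, and n are set up as in the launch sketch earlier, and that the matrix is read and written exactly once.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
mtran_coalesced<<<grid, block>>>(odata, idata, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);
// n*n floats read + n*n floats written
double gbps = 2.0 * n * n * sizeof(float) / (ms * 1.0e6);
printf("throughput: %.1f GB/s\n", gbps);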
Instructions Speed

Instructions Performance
Some instructions are relatively slower than on CPUs
• integer division and modulo
• 32-bit integer multiplication on c.c. 1.x
• 24-bit integer multiplication on c.c. 2.x and newer
Some instructions are relatively faster than on CPUs
• lower-precision arithmetic performed by the SFUs
• __sinf(x), __cosf(x), __expf(x), __sincosf(x), __rsqrtf(x), etc.

Small loops have significant overhead
• jumps
• conditions
• control variable updates
• a significant part of the instructions may be pointer arithmetic
• low ILP
Loop unrolling is an option (see the sketch below)
• it may partially be done by the compiler
• we can unroll manually or use #pragma unroll
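A minimal sketch (not from the slides) of explicit unrolling; the kernel and the fixed trip count of four elements per thread are made up for illustration.

__global__ void axpyUnrolled(float *y, const float *x, float a)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    // The trip count is a compile-time constant, so the compiler can fully
    // unroll the loop and remove the jump, condition, and counter updates.
    #pragma unroll
    for (int j = 0; j < 4; j++)
        y[base + j] += a * x[base + j];
}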
Reduction

Vector Reduction
Let v be a vector of size n. We want to compute x = v_1 + v_2 + ... + v_n, i.e. the sum of all its elements.
C code (not very reasonable for floats):

int x = 0;
for (int i = 0; i < n; i++)
  x += v[i];

There is a flow dependency across iterations
• we cannot compute completely in parallel
• addition is (at least in theory :-)) associative
• so we do not need to add the numbers in sequential order

Parallel Algorithm
The sequential algorithm performs seven steps (for n = 8):
((((((v1 + v2) + v3) + v4) + v5) + v6) + v7) + v8
Addition is associative... so let's reorder the brackets:
((v1 + v2) + (v3 + v4)) + ((v5 + v6) + (v7 + v8))
We can work in parallel now:
• four additions in the first step
• two additions in the second step
• one addition in the third step
In summary, we perform n − 1 additions in log2 n parallel steps!

Parallel Algorithm
We have found a parallel algorithm
• the same number of additions as the serial algorithm
• in logarithmic time (if we have enough cores)
We add results of previous additions
• flow dependency across threads
• we need a global barrier

Naive Approach
The simplest scheme of the algorithm:
• for every even i, i < n, perform v[i] += v[i+1]
• halve n and repeat while n > 1
The performance is not ideal
• 2n numbers loaded from global memory
• n numbers stored to global memory
• log2 n kernel invocations
We have three memory accesses per one arithmetic operation and considerable kernel invocation overhead.

Exploiting Data Locality
We can add more than pairs during a single kernel call
• each block b_x loads m numbers into shared memory
• it reduces the input (in shared memory, in log2 m steps)
• it stores only one number: the partial sum of its m elements
Reduces both memory transfers and the number of kernel invocations
• number of loads over all invocations: n + n/m + n/m² + ...
• approximately n + n/m numbers read, n/m numbers written
• log_m n kernel invocations (a host-side sketch of this invocation loop follows Implementation 2 below)

Implementation 1

__global__ void reduce1(int *v){
  extern __shared__ int sv[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
  sv[tid] = v[i];
  __syncthreads();
  for (unsigned int s = 1; s < blockDim.x; s *= 2){
    if (tid % (2*s) == 0)
      sv[tid] += sv[tid + s];
    __syncthreads();
  }
  if (tid == 0)
    v[blockIdx.x] = sv[0];
}

Performance
Beware of the modulo operation. There is a high degree of divergence
• during the first iteration, only half of the threads are working
• during the second iteration, only a quarter of the threads are working
• etc.
Performance on GTX 280: 3.77 GB/s (0.94 GElem/s).

Implementation 2
We modify the indexing:

for (unsigned int s = 1; s < blockDim.x; s *= 2){
  int index = 2 * s * tid;
  if (index < blockDim.x)
    sv[index] += sv[index + s];
  __syncthreads();
}

Performance: 8.33 GB/s (2.08 GElem/s).
The code is free of modulo and divergence, but it generates shared memory bank conflicts.
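The slides mention log_m n kernel invocations but do not show the host loop; below is a minimal sketch of how reduce1 might be invoked repeatedly until a single value remains. This is an assumption, not code from the slides, and it assumes n is a power of the block size.

void reduceOnHost(int *v_dev, int n, int block) {
    // each pass reduces every group of `block` consecutive elements into one
    while (n > 1) {
        int blocks = n / block;                        // assumes n % block == 0
        reduce1<<<blocks, block, block * sizeof(int)>>>(v_dev);
        n = blocks;   // partial sums are written back to v_dev[0..blocks)
    }
    // v_dev[0] now holds the total sum
}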
Implementation 3
So we can try yet another indexing:

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1){
  if (tid < s)
    sv[tid] += sv[tid + s];
  __syncthreads();
}

No divergence and no conflicts. Performance: 16.34 GB/s (4.08 GElem/s).
Half of the threads do not compute anything, though...

Implementation 4
We can add numbers already while loading them from global memory:

unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sv[tid] = v[i] + v[i+blockDim.x];

Performance: 27.16 GB/s (6.79 GElem/s).
There is no problem with data access anymore, but the performance is still low – we will focus on instructions.

Implementation 5
The number of active threads decreases during the computation in shared memory
• in the last six iterations, only the last warp is active
• the warp is synchronized implicitly on GPUs with c.c. < 7.0, so we do not need __syncthreads()
• we need a volatile variable in this case
• the condition if (tid < s) does not spare any computation
So we can unroll the last warp...

Implementation 5

float mySum = sv[tid];   // this thread's current partial sum
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1){
  if (tid < s)
    sv[tid] = mySum = mySum + sv[tid + s];
  __syncthreads();
}
if (tid < 32){
  volatile float *s = sv;
  s[tid] = mySum = mySum + s[tid + 32];  // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 16];  // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 8];   // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 4];   // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 2];   // __syncthreads();
  s[tid] = mySum = mySum + s[tid + 1];
}

We save time in all warps (the last warp is simpler; the others exit the for loop earlier).
Performance: 37.68 GB/s (9.42 GElem/s).

Implementation 5 (warp shuffle)
For c.c. 3.0 or greater, we can use warp shuffle instead:

if (tid < 32){
  mySum += sv[tid + 32];
  for (int offset = warpSize/2; offset > 0; offset /= 2)
    mySum += __shfl_down_sync(0xffffffff, mySum, offset);
}

This is safe on all GPUs, as it does not rely on implicit intra-warp synchronization.

Implementation 6
Can we unroll the for loop? If we know the number of iterations, we can unroll it
• the number of iterations depends on the block size
Can we implement it generically?
• the algorithm uses blocks of size 2^n
• the block size has an upper bound
• if we know the block size at compile time, we can pass it as a template parameter (a host-side dispatch sketch appears at the end of this section):

template <unsigned int blockSize>
__global__ void reduce6(int *v)

Implementation 6
Conditions using blockSize are evaluated at compile time:

if (blockSize >= 512){
  if (tid < 256) sv[tid] += sv[tid + 256];
  __syncthreads();
}
if (blockSize >= 256){
  if (tid < 128) sv[tid] += sv[tid + 128];
  __syncthreads();
}
if (blockSize >= 128){
  if (tid < 64) sv[tid] += sv[tid + 64];
  __syncthreads();
}

Performance: 50.64 GB/s (12.66 GElem/s).

Implementation 7
Can we implement a faster algorithm? Let's reconsider the complexity:
• log n parallel steps
• n − 1 additions
• time complexity for p threads running in parallel (using p processors): O(n/p + log n)
Cost of the parallel computation
• defined as the number of processors multiplied by the time complexity
• if we assign one thread to one data element, we get p = n
• and the cost is O(n · log n)
• which is not efficient

Implementation 7
Decreasing the cost
• we use O(n / log n) threads
• each thread performs O(log n) sequential steps
• after that, it performs O(log n) parallel steps
• the time complexity stays the same
• the cost is O(n)
What does it mean in practice?
• we reduce the overhead of the computation (e.g., integer arithmetic)
• it is an advantage if we have many more threads than needed to saturate the GPU

Implementation 7
We modify the loading into shared memory:

unsigned int gridSize = blockSize*2*gridDim.x;
sv[tid] = 0;
while (i < n){
  sv[tid] += v[i] + v[i+blockSize];
  i += gridSize;
}
__syncthreads();

Performance: 77.21 GB/s (19.3 GElem/s).
You can find these implementations in the CUDA SDK.
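As promised above, a minimal sketch (not from the slides) of dispatching the runtime block size to the reduce6 template on the host; the set of supported block sizes is an assumption.

void launchReduce6(int *v_dev, int blocks, int threads) {
    // the switch selects a compile-time specialization for the runtime block size
    size_t smem = threads * sizeof(int);
    switch (threads) {
        case 512: reduce6<512><<<blocks, 512, smem>>>(v_dev); break;
        case 256: reduce6<256><<<blocks, 256, smem>>>(v_dev); break;
        case 128: reduce6<128><<<blocks, 128, smem>>>(v_dev); break;
        case  64: reduce6< 64><<<blocks,  64, smem>>>(v_dev); break;
    }
}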
General Advice

Problem Choice
Before we start accelerating code, we should consider carefully whether it is meaningful. The accelerated code should be
• critical for application performance (profile... and profile on real data)
• large enough (GPUs are usually not ideal for relatively simple but latency-critical applications)
• parallelizable (problematic, e.g., in a simulation of a small system evolving for a long time)
• doing a sufficient number of flops per memory transfer (consider the slow PCI-E)

Parallelization
Vector addition
• simple to formulate as data-parallel code
• no synchronization
Reduction
• may seem sequential at first glance
• but can be computed in log n steps
• needs some thinking about how to parallelize

Code Divergence
Code divergence
• hurts performance when divergence occurs within a warp
• it can be easy to remove
  • reduction
• but also hard to remove
  • graph processing, many independent automata
• sometimes a different algorithm needs to be formulated

Scattered Memory Access
Scattered memory access
• if memory is not accessed in a coalesced way, the memory bandwidth is reduced
• often difficult to work around
  • graph processing
• sometimes we can investigate different data structures
  • sparse matrices
• with more regular structures, we can use shared memory
  • matrix transposition