GPU Hardware Performance II
Jiří Filipovič
Fall 2015

Recapitulation

Global memory
• a warp should access data in a coalesced way
• thread blocks should avoid partition camping

Shared memory
• threads within a warp should access different banks (or the same data, which is broadcast)

All memories
• sufficient occupancy is needed to hide memory latencies

Matrix Transposition

From a theoretical perspective:
• a trivial problem
• trivial parallelization
• trivially limited by memory throughput (no arithmetic is done)

__global__ void mtran(float *odata, float *idata, int n){
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  odata[x*n + y] = idata[y*n + x];
}

Performance

When running the code on a GeForce GTX 280 with a large enough matrix (4000 x 4000), the throughput is 5.3 GB/s. Where is the problem?

Access to odata is interleaved (strided). After a modification that copies instead of transposing,
odata[y*n + x] = idata[y*n + x];
the throughput is 112.4 GB/s. If idata were accessed in an interleaved way too, the resulting throughput would be 2.7 GB/s.

On Removing Interleaving

The matrix can be processed per tiles:
• we read a tile into shared memory row-wise
• we store its transposition into global memory row-wise
• thus both reading and writing are done without interleaving

What size of tiles should be used?
• let us consider square tiles for simplicity
• for aligned reading, the tile row size has to be a multiple of 16
• because of shared memory size limitations, we can consider tile sizes of 16 x 16, 32 x 32, and 48 x 48
• the best size can be determined experimentally

Tiled Transposition

__global__ void mtran_coalesced(float *odata, float *idata, int n){
  __shared__ float tile[TILE_DIM][TILE_DIM];
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int index_in = x + y*n;
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  int index_out = x + y*n;
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];
  __syncthreads();
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
}

Performance

The highest performance was measured for a 32 x 32 tile size and a 32 x 8 thread block size: 75.1 GB/s.
• that is significantly better, but still far from simple copying
• the kernel is more complex and contains synchronization
• we need to figure out whether we have reached the maximum or there is still a problem somewhere
• if we only copy within the blocks (no transposition), we get 94.9 GB/s (see the sketch below)
• something is still sub-optimal
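The 94.9 GB/s figure comes from a copy-only variant of the tiled kernel. A minimal sketch of such a benchmark kernel (not the lecture's exact code), assuming TILE_DIM = 32, BLOCK_ROWS = 8 and n divisible by TILE_DIM:

#define TILE_DIM   32
#define BLOCK_ROWS 8

// Same tiling and shared-memory staging as mtran_coalesced, but without the
// transposition, so it measures only the overhead of the kernel structure.
__global__ void mcopy_tiled(float *odata, const float *idata, int n)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int index = x + y * n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index + i * n];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index + i * n] = tile[threadIdx.y + i][threadIdx.x];
}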
Shared Memory

When reading from global memory, we write into shared memory row-wise:
tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*n];
When writing to global memory, we read from shared memory column-wise:
odata[index_out+i*n] = tile[threadIdx.x][threadIdx.y+i];
That is a strided read with a stride that is a multiple of 16; the whole column lies in a single memory bank, thus creating a 16-way bank conflict.

A solution is padding:
__shared__ float tile[TILE_DIM][TILE_DIM + 1];

Performance

Now our implementation shows 93.4 GB/s.
• as good as simple copying
• it seems we cannot do much better for the given matrix
• beware of different input data sizes (partition camping)

[Figure: measured throughput as a function of matrix size (500 to 4000).]

Performance Drops

The performance drops for some sizes, and the behavior is regular:
• for matrices whose size is a multiple of 512, we only get 19 GB/s
• for other matrices whose size is a multiple of 256, we only get 35 GB/s
• for other matrices whose size is a multiple of 128, we only get 62 GB/s

One memory region (partition) has the width of 2 tiles (256 B / 4 B per float = 64 floats, 32 floats per tile row). If we analyze the tile placement with respect to the matrix size, we learn that
• with sizes that are a multiple of 512, the tiles in the same column are in the same region
• with sizes that are a multiple of 256, each column lies in at most two regions
• with sizes that are a multiple of 128, each column lies in at most four regions
We have discovered partition camping.
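To make the analysis concrete: the 256 B region width is stated on the slide; the count of 8 partitions is an assumption here (a common description of GT200-class GPUs), not something these slides state. A tiny illustrative helper:

#include <stddef.h>

// Assumption: 8 memory partitions, each 256 B wide (GT200-class GPU).
__host__ __device__ int partition_of(size_t byte_offset)
{
    return (int)((byte_offset / 256) % 8);
}
// With a row pitch of 512 floats (2048 B = 8 * 256 B), moving one row down
// advances the address by exactly one full pass over all partitions, so every
// element of a column - and therefore every tile in a tile column - falls into
// the same partition.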
How to Remove Partition Camping?

We can pad matrices and avoid the bad matrix sizes.
• it complicates the implementation (all kernels accessing the matrix have to implement padding, or we need to convert the matrix)
• it occupies more memory

We can change the mapping of thread block IDs onto matrix tiles:
• a diagonal mapping ensures that accesses go to different regions (a full kernel sketch combining it with the padded tile follows below)
int blockIdx_y = blockIdx.x;
int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

Performance

The new implementation gives 80 GB/s.
• the performance does not drop where we saw it previously
• for matrix sizes that are a multiple of 128 it is still worse than the original implementation
• the algorithm is more complex
• we can use it only for the problematic data sizes
For a given problem, there may not be (and often there is not) an ideal algorithm for the whole range of input data sizes. It is necessary to benchmark, as not all problems are easily revealed just by looking at the code.

[Figure: throughput vs. matrix size (500 to 4000), comparing the diagonal mapping with the GMEM + SMEM optimization.]

Performance Summary

All optimizations were aimed only at better accommodating the HW properties:
• still, we got a 17.6x speedup
• when creating an algorithm, it is necessary to understand the HW limitations
• otherwise we would not have to develop specifically for GPUs - developing a good sequential algorithm would have been just fine...

Optimizations Effects

Beware of optimization effects:
• if we took 4096 x 4096 matrices instead of 4000 x 4000, the memory bank conflict removal would have brought only a marginal improvement
• only after removing partition camping does the effect of memory bank conflicts become visible
• thus it makes sense to go from more general/substantial optimizations to the less general ones
• if some (provably correct) optimization does not result in a performance increase, we need to analyze what limits the algorithm performance
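For reference, a sketch combining the padded tile with the diagonal block mapping discussed above. This is a reconstruction, not the lecture's exact code; it assumes a square n x n matrix with n divisible by TILE_DIM and a square grid:

#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void mtran_diagonal(float *odata, const float *idata, int n)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 avoids bank conflicts

    // diagonal reordering of block indices (valid for a square grid):
    // blocks of one tile column are spread over different memory partitions
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    int x = blockIdx_x * TILE_DIM + threadIdx.x;
    int y = blockIdx_y * TILE_DIM + threadIdx.y;
    int index_in = x + y * n;

    x = blockIdx_y * TILE_DIM + threadIdx.x;
    y = blockIdx_x * TILE_DIM + threadIdx.y;
    int index_out = x + y * n;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * n];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i * n] = tile[threadIdx.x][threadIdx.y + i];
}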
Instructions Speed

Processing of Instructions

Processing of instructions on a multiprocessor (compute capability 1.x):
• there are 8 SP cores and 2 SFU cores
• if SP and SFU instruction processing is not overlapped, the multiprocessor can process up to 8 instructions per cycle
• one warp is thus done in 4 or more cycles
• some instructions are significantly slower
• knowing the instruction processing times helps us design efficient code

Floating Point Operations

The GPU is designed as graphics hardware:
• graphics operations mostly use floating point numbers
• these are implemented efficiently on GPUs
• newer GPUs (c. c. ≥ 1.3) can work in double precision, while older ones support single precision only
• some arithmetic operations are used very frequently in graphics
  • the GPU implements them in HW
  • the HW implementation provides lower precision (not an issue for many applications)
  • they are distinguished by a prefix (e.g., __sinf vs. sinf)

Arithmetic Operations

Floating point operations
• addition and multiplication are very fast
• multiplication and addition may be combined into a single MAD instruction on c. c. 1.x
  • lower precision
  • takes 1 cycle on an SP
  • __fadd_rn() and __fmul_rn() may be used to prevent the compiler from generating MAD instructions
  • MAD is replaced by FMA on c. c. ≥ 2.0 (same speed, higher precision)
• 64-bit versions run at a lower rate: 1/8 (1.3), 1/2 (2.0), 1/12 (2.1), 1/24 (3.0), 1/3 (3.5), 1/32 (5.x)
• division is relatively slow; reciprocal is faster

Transcendental functions
• __sinf(x), __cosf(x), __expf(x)
• sinf(x), cosf(x), expf(x) are more precise, but an order of magnitude slower
• other operations with different speed/precision trade-offs are implemented, see the CUDA manual

Integer operations
• addition is as fast as for floating point
• 32-bit multiplication on c. c. 1.x runs at a reduced rate (2 per multiprocessor per cycle); __mul24(x, y) and __umul24(x, y) run at the full rate of 8 per cycle
• multiplication on c. c. 2.x is as fast as floating point operations; the 24-bit versions are slow there
• division and modulo are very slow, but if n is a power of 2, we can use
  • i / n is equivalent to i >> log2(n)
  • i % n is equivalent to i & (n - 1)
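A small device-function sketch illustrating the intrinsics and tricks listed above. The function itself is hypothetical; __sinf, __fmul_rn and __mul24 are standard CUDA intrinsics:

__device__ int fast_ops_demo(float x, int i, int n)   // n assumed to be a power of 2
{
    float s   = __sinf(x);          // fast HW sine, lower precision than sinf(x)
    float p   = __fmul_rn(s, x);    // multiply that is never fused into a MAD
    int   m24 = __mul24(i, 3);      // fast 24-bit multiply on c. c. 1.x
    int   d   = i >> 4;             // i / 16, since 16 = 2^4
    int   r   = i & (n - 1);        // i % n for n a power of 2
    return m24 + d + r + (int)p;
}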
Loops

Small loops have significant overhead:
• jumps
• conditions
• control variable updates
• a significant part of the instructions may be pointer arithmetic
• low ILP

Loop unrolling is an option:
• it may partially be done by the compiler
• we can unroll manually or use #pragma unroll

Other Instructions

Other common instructions are performed at the basic speed (i.e., corresponding to the number of SPs):
• comparison
• bit operations
• memory access instructions (given the limitations discussed earlier and the memory latency/bandwidth)
  • the offset may be a register value plus a constant for 32-bit addressing (higher overhead for 64-bit addressing)
• synchronization (unless we get blocked)

Beware of Shared Memory

If memory bank conflicts are avoided, shared memory is as fast as registers on c. c. 1.x.

But beware:
• instructions can work with only one operand in shared memory
• if more than one operand of an instruction resides in shared memory, an explicit load/store is necessary
• MAD instructions run slower (c. c. 1.x)
  • a + s[i]: 4 cycles per warp
  • a + a * s[i]: 5 cycles per warp
  • a + b * s[i]: 6 cycles per warp
• these details are not published by NVIDIA (they were revealed through measurements)
• interesting only for really critical code

Newer GPUs have relatively slower shared memory (compared to register speed):
• Fermi and Maxwell have lower bandwidth even if only one operand in shared memory is accessed
• Kepler uses only 1/2 of the available bandwidth for 32-bit accesses

C for CUDA Compilation

Device code can be compiled into PTX assembly and into binary files.
• PTX is an intermediate code; it does not correspond directly to GPU instructions
  • easier to read
  • harder to figure out what really happens on the GPU
• binary files may be disassembled using the cuobjdump tool
  • for GT200 and newer
  • decuda for older GPUs (may not work completely reliably)

Revision of Matrix Multiplication

Naive Implementation

__global__ void mmul(float *A, float *B, float *C, int n){
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  int y = blockIdx.y*blockDim.y + threadIdx.y;
  float tmp = 0;
  for (int k = 0; k < n; k++)
    tmp += A[y*n+k] * B[k*n+x];
  C[y*n + x] = tmp;
}
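A hypothetical host-side launch of the naive kernel, showing the two block shapes compared in the recapitulation below; d_A, d_B, d_C are assumed device pointers allocated elsewhere, and n is assumed to be divisible by 128:

void launch_naive_mmul(float *d_A, float *d_B, float *d_C, int n)
{
    // 128 x 1 blocks: a warp reads consecutive elements of B (coalesced)
    dim3 block(128, 1);
    dim3 grid(n / block.x, n / block.y);
    mmul<<<grid, block>>>(d_A, d_B, d_C, n);

    // For comparison, 1 x 128 blocks make a warp read A with a stride of n
    // floats (uncoalesced), which is the slow case in the recapitulation:
    // dim3 block(1, 128); dim3 grid(n / block.x, n / block.y);
}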
Recapitulation

Naive implementation
• each thread computes one element of the resulting matrix
• memory-bound
• theoretical peak 66.8 GFlops
• the performance depends on the thread arrangement - blocks 128 x 1: 36.6 GFlops, blocks 1 x 128: 3.9 GFlops

Now we understand the results:
• the theoretical maximum cannot be reached - the GPU memory is accessed in at least 32-byte chunks, so reading from A is not efficient
• blocks of 128 x 1 result in coalesced access to B, blocks of 1 x 128 result in strided access

We have implemented a tiled algorithm:
• each thread block reads tiles of A and B into shared memory, exploiting data locality (the data are moved into shared memory once and read many times)
• theoretical peak 568 GFlops, we have reached 198 GFlops
We can improve the implementation...

Tiled Algorithm

__global__ void mmul(float *A, float *B, float *C, int n){
  int bx = blockIdx.x;
  int by = blockIdx.y;
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  __shared__ float As[BLOCK][BLOCK];
  __shared__ float Bs[BLOCK][BLOCK];
  float Csub = 0.0f;
  for (int b = 0; b < n/BLOCK; b++){
    As[ty][tx] = A[(ty + by*BLOCK)*n + b*BLOCK+tx];
    Bs[ty][tx] = B[(ty + b*BLOCK)*n + bx*BLOCK+tx];
    __syncthreads();
    for (int k = 0; k < BLOCK; k++)
      Csub += As[ty][k] * Bs[k][tx];
    __syncthreads();
  }
  C[(ty + by*BLOCK)*n + bx*BLOCK+tx] = Csub;
}

Implementation Pitfalls

As[ty][tx] = A[(ty + by*BLOCK)*n + b*BLOCK+tx];
Bs[ty][tx] = B[(ty + b*BLOCK)*n + bx*BLOCK+tx];
C[(ty + by*BLOCK)*n + bx*BLOCK+tx] = Csub;
The global memory access is OK.

Csub += As[ty][k] * Bs[k][tx];
The shared memory access is also OK:
• if the thread block x-size is a multiple of the warp size, the value of As[ty][k] is broadcast
• the array Bs is read in contiguous rows, which is conflict-free

Theoretical Peak

Can we be more precise in the theoretical peak computation?
• we have used the theoretical peak of the GPU in MAD instructions (622 GFlops)
• now we know that MAD instructions with an operand in shared memory are 50% slower
• the more precise theoretical bound is 415 GFlops
• our implementation is still far from that

Performance Pitfalls

What causes the performance degradation?
• overhead of kernel execution and thread creation
  • matters mainly for fast kernels with few instructions per thread
  • threads can do more work serially
• instruction overhead
  • pointer arithmetic, loops
  • can be reduced
• synchronization
  • may or may not be an issue
• loads/stores in the computation
  • two operands in SMEM per one MAD instruction
If we count the performance bound for one load per MAD with an operand in SMEM, we get 244 GFlops.
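The 415 GFlops bound can be checked against the warp timings from the shared-memory slide (4 cycles per warp for a register-only MAD, 6 cycles with one shared-memory operand, i.e. 50% slower); this back-of-the-envelope step is mine, not spelled out on the slides:

622\,\mathrm{GFlops} \times \frac{4\ \text{cycles}}{6\ \text{cycles}} \approx 415\,\mathrm{GFlops}

The 244 GFlops bound additionally charges the explicit load needed when both operands sit in shared memory.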
Searching for a Better Implementation

Can the number of load instructions be decreased?
• exploiting data locality in shared memory decreases the global memory pressure
• exploiting data locality in registers decreases the shared memory pressure
• how to do it? we reduce the number of threads and assign more work to each of them

A thread block of size m x n will process a tile of size m x m, where m = n · k, k ∈ N (a concrete sketch follows after this list).
• a large m potentially increases the synchronization overhead
• a small m reduces shared memory locality
• a small n reduces the available parallelism
• we will find the values of m and n experimentally
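A minimal sketch of one such configuration, m = 32, n = 16 (the one that turns out best below): each thread computes two elements of C. This is a reconstruction under the assumption of square matrices whose order is divisible by 32, not the lecture's exact code:

#define TILE 32
#define HALF 16

// Launch with dim3 block(32, 16) and dim3 grid(n/32, n/32).
__global__ void mmul_2elems(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx  = threadIdx.x;                 // 0..31
    int ty  = threadIdx.y;                 // 0..15
    int col = blockIdx.x * TILE + tx;
    int row = blockIdx.y * TILE + ty;

    float Csub1 = 0.0f;
    float Csub2 = 0.0f;

    for (int b = 0; b < n / TILE; b++) {
        // each thread loads two elements of each tile (upper and lower half)
        As[ty][tx]        = A[row          * n + b * TILE + tx];
        As[ty + HALF][tx] = A[(row + HALF) * n + b * TILE + tx];
        Bs[ty][tx]        = B[(b * TILE + ty)        * n + col];
        Bs[ty + HALF][tx] = B[(b * TILE + ty + HALF) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++) {
            Csub1 += As[ty][k]        * Bs[k][tx];
            Csub2 += As[ty + HALF][k] * Bs[k][tx];
        }
        __syncthreads();
    }

    C[row          * n + col] = Csub1;
    C[(row + HALF) * n + col] = Csub2;
}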
Best results were found for m = 32, n = 16 (32 x 16 thread blocks working with 32 x 32 tiles).
• one load per two MAD instructions results in a theoretical bound of 311 GFlops
• we have measured 235.4 GFlops
• something is wrong

Code Disassembly

We focus on the inner loop
Csub1 += As[ty][k]*Bs[k][tx];
Csub2 += As[ty+16][k]*Bs[k][tx];
which is compiled into
mov.b32 $r0, s[$ofs4+0x0000]
add.b32 $ofs4, $ofs2, 0x00000180
mad.rn.f32 $r7, s[$ofs1+0x0008], $r0, $r7
mad.rn.f32 $r8, s[$ofs3+0x0008], $r0, $r8
The compiler was able to use constant offsets only for As:
• the strided access to Bs generates one load and one integer add

Removing the ADD Instruction

We store the data into Bs transposed and modify the inner loop:
Csub1 += As[ty][k]*Bs[tx][k];
Csub2 += As[ty+16][k]*Bs[tx][k];
After disassembling, we see there is no ADD instruction:
mov.b32 $r0, s[$ofs4+0x0008]
mad.rn.f32 $r6, s[$ofs3+0x0034], $r0, $r6
mad.rn.f32 $r8, s[$ofs1+0x0008], $r0, $r8
A new issue appears - memory bank conflicts
• solved by padding
Resulting speed: 276.2 GFlops.

Can We Reach Better Performance?

Our result is pretty close to the theoretical bound for one load per two MADs.
• to get better performance, the tiled algorithm has to be revised

The main issue is that we multiply two tiles stored in shared memory:
• load instructions have to be used together with the MAD instructions

Can we keep only one of the tiles in shared memory?

New Tiled Algorithm

We will iteratively perform rank-1 updates of a tile of C, using a column of A and a row of B (a sketch follows below):
• columns of A are read from shared memory
• rows of B can be read one after another, so we can keep the current one in a register
• the tile of C can be stored in registers
• we work with only one operand in shared memory, so explicit loads are not needed
• the theoretical bound is now given by the speed of a MAD instruction with an operand in shared memory: 415 GFlops
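A sketch of the rank-1 update scheme just described, written to match the configuration on the next slide (16 x 16 tiles of A in shared memory, a 64-wide strip of B in registers, a 64 x 16 tile of C per 64-thread block). This is a reconstruction, not the lecture's exact code; the tile orientation (16 rows x 64 columns of C) and the matrix order being divisible by 16 and 64 are my assumptions:

#define TA 16   // A tile: 16 x 16 in shared memory
#define TB 64   // width of the C tile = number of threads per block

// Launch with 64 threads per block and dim3 grid(n/TB, n/TA).
__global__ void mmul_rank1(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TA][TA];

    int tx   = threadIdx.x;                // 0..63
    int row0 = blockIdx.y * TA;            // first row of the C tile
    int col  = blockIdx.x * TB + tx;       // this thread's column of C

    float c[TA];                           // one column of the C tile in registers
    for (int i = 0; i < TA; i++) c[i] = 0.0f;

    for (int b = 0; b < n; b += TA) {
        // 64 threads cooperatively load the 16 x 16 tile of A (4 elements each)
        for (int i = tx; i < TA * TA; i += TB)
            As[i / TA][i % TA] = A[(row0 + i / TA) * n + b + i % TA];
        __syncthreads();

        // rank-1 updates: one row of B per iteration, held in a register
        for (int k = 0; k < TA; k++) {
            float bk = B[(b + k) * n + col];
            for (int i = 0; i < TA; i++)
                c[i] += As[i][k] * bk;     // one shared-memory operand per MAD
        }
        __syncthreads();
    }

    for (int i = 0; i < TA; i++)
        C[(row0 + i) * n + col] = c[i];
}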
The best-performing configuration:
• matrix A is read in 16 x 16 tiles, stored in shared memory
• matrix B is read in 64 x 1 tiles, stored in registers
• the tiles of matrix C have size 64 x 16 and are stored in registers

The measured performance is 375 GFlops.

Summary

Implementation                            performance    rel. speedup   abs. speedup
Naive, blocks 1 x 128                     3.9 GFlops     -              -
Naive, blocks 128 x 1                     36.6 GFlops    9.4x           9.4x
Tiled algorithm                           198 GFlops     5.4x           51x
Thread blocks 32 x 16, tiles 32 x 32      235 GFlops     1.19x          60x
Removing the ADD instruction              276 GFlops     1.17x          71x
Only one tile in shared memory            375 GFlops     1.36x          96x

• The most relevant optimizations are exploiting memory locality and basic memory access optimization.
• The finer optimizations are relatively challenging and important mainly for really performance-critical code.