OpenCL

Jiří Filipovič

Fall 2020

What is OpenCL?

• an open standard for heterogeneous systems programming
• low-level, derived from C, a HW abstraction very similar to CUDA

Advantages over CUDA
• can be used for a wide range of HW
• an open standard, independent of a single corporation

Disadvantages compared to CUDA
• more complex API (similar to the CUDA Driver API)
• often less mature implementations
• slower adoption of new HW features

Portability

One implementation can be compiled for different types of HW
• if we do not use extensions...

However, an implementation optimized for one type of HW may be very slow on another
• we need to re-optimize for different HW architectures

So, it is a standard for programming various types of HW, but we still need to write different kernels for different architectures.
• high importance of easily modifiable code or autotuning

Performance Portability

[Figure: SGEMM optimized for Fermi and Cypress, running on Fermi.]

Du et al.: From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming

Performance Portability

[Figure: SGEMM optimized for Fermi and Cypress, running on Cypress.]

Du et al.: From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming

Main Differences

OpenCL is not integrated into C/C++
• the OpenCL kernel is stored as a string, which is usually compiled during program execution
• the kernel cannot share code with the C/C++ codebase (user-defined types, common functions, etc.)

Kernels in OpenCL do not use pointers
• we cannot dereference, use pointer arithmetic, or link different buffers
• we can traverse a buffer by index, of course

OpenCL is strictly derived from C
• no C++ features

OpenCL uses queues for HW devices
• eases using multiple devices/streams

Queues can work out-of-order
• eases load balancing
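Since the kernel source lives in a plain string, it is typically embedded directly in the host code. A minimal sketch, using the vecadd kernel and the sProgramSource variable that appear on the following slides:

const char *sProgramSource =
    "__kernel void vecadd(__global float *a,\n"
    "                     __global float *b,\n"
    "                     __global float *c)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

In practice, the source is often read from a separate .cl file at runtime instead, which keeps the kernel editable without recompiling the host program.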
CUDA-OpenCL Dictionary

Main differences in terminology:

    CUDA                OpenCL
    multiprocessor      compute unit
    scalar processor    processing element
    thread              work-item
    thread block        work-group
    grid                ND range
    shared memory       local memory
    registers           private memory

Vector Addition - Kernel

CUDA:

__global__ void addvec(float *a, float *b, float *c)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

OpenCL:

__kernel void vecadd(__global float *a, __global float *b,
                     __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

Vector Addition - Host Code

To execute the kernel, we need to
• define a platform
  • device (at least one)
  • context
  • queues
• allocate and copy data
• compile the kernel code
• configure the kernel and execute it

Vector Addition - Platform Definition

cl_uint num_devices_returned;
cl_device_id cdDevice;
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &cdDevice,
    &num_devices_returned);

cl_context hContext;
hContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &err);

cl_command_queue hQueue;
hQueue = clCreateCommandQueue(hContext, cdDevice, 0, &err);

Vector Addition - Platform Definition

A platform can have multiple devices
• a device can be selected by type (e.g. a GPU)
• it can be selected by vendor
• we can also choose HW using finer information
  • number of cores
  • frequency
  • memory size
  • extensions (double precision, atomic operations, etc.)

Each device needs at least one queue
• it cannot be used otherwise

Vector Addition - Memory Allocation and Copy

cl_mem hdA, hdB, hdC;
hdA = clCreateBuffer(hContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    cnDimension * sizeof(cl_float), pA, 0);

There is no explicit copy: allocation and copying are performed lazily, i.e. at the time the data are needed. Consequently, the target device is not specified at memory allocation.
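If we want explicit control over the transfers, the runtime also offers clEnqueueWriteBuffer and clEnqueueReadBuffer. A minimal sketch under the variable names from these slides (pC as a host array for the results is an assumption):

/* Allocate without attaching a host pointer... */
hdA = clCreateBuffer(hContext, CL_MEM_READ_ONLY,
    cnDimension * sizeof(cl_float), NULL, &err);

/* ...and copy explicitly; CL_TRUE makes the call blocking. */
err = clEnqueueWriteBuffer(hQueue, hdA, CL_TRUE, 0,
    cnDimension * sizeof(cl_float), pA, 0, NULL, NULL);

/* After kernel execution, results are read back the same way. */
err = clEnqueueReadBuffer(hQueue, hdC, CL_TRUE, 0,
    cnDimension * sizeof(cl_float), pC, 0, NULL, NULL);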
Vector Addition - Kernel Execution

const size_t cnBlockSize = 512;
const size_t cnBlocks = 3;
const size_t cnDimension = cnBlocks * cnBlockSize;

cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1, &sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);

cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vecadd", 0);

clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void*)&hdA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void*)&hdB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void*)&hdC);

clEnqueueNDRangeKernel(hQueue, hKernel, 1, 0,
    &cnDimension, &cnBlockSize, 0, 0, 0);

Vector Addition - Cleanup

clReleaseKernel(hKernel);
clReleaseProgram(hProgram);
clReleaseMemObject(hdA);
clReleaseMemObject(hdB);
clReleaseMemObject(hdC);
clReleaseCommandQueue(hQueue);
clReleaseContext(hContext);

AMD VLIW GPU Architecture

Older processors
• Evergreen and Northern Islands

We will discuss the main differences between AMD and NVIDIA GPUs
• the rest is very similar

Main differences
• VLIW architecture
• two memory access modes: the fast path and the complete path
• less sensitive to misaligned access, more sensitive to the analogy of partition camping
• a wavefront (the warp analogy) has 64 threads

VLIW Architecture

VLIW
• the instruction word includes several independent operations
• static planning of instruction parallelism (dependencies are analyzed during compilation)
• allows a higher density of ALUs
• threads should execute code with sufficient instruction parallelism, and the compiler needs to recognize it
  • easier in typical graphics tasks than in general computing ones
• AMD GPUs implement VLIW-5 or VLIW-4, where one instruction slot is an SFU

Optimizations for VLIW

Explicit vectorization
• we work with vector variables (e.g. float4)
• generation of VLIW code is straightforward for the compiler

Implicit generation of VLIW
• we write scalar code
• the compiler tries to recognize independent instructions and create VLIW code
• we can help the compiler by unrolling and grouping the same operations performed in different iterations

(both approaches are sketched below)
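A minimal sketch of both approaches (hypothetical kernel names; the input size is assumed to be a multiple of four for the vector version):

/* Scalar code: the compiler must find independent instructions itself. */
__kernel void scale_scalar(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = 2.0f * in[i];
}

/* Explicit vectorization: each work-item processes four elements, so the
   four component-wise multiplications can fill the VLIW slots directly. */
__kernel void scale_vector(__global const float4 *in, __global float4 *out)
{
    int i = get_global_id(0);
    out[i] = 2.0f * in[i];
}

Note that the vector version is launched with a quarter of the work-items, each touching 128 bits of memory, which also helps the global memory access discussed below.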
Optimizations for VLIW

Issues with VLIW
• higher consumption of on-chip resources per thread (unrolling, vector types)
• we need independent instructions
  • problematic e.g. with conditions
  • together with the large wavefront, this makes VLIW highly sensitive to divergence

Global Memory Access

Fast path vs. complete path
• the fast path is significantly faster
• the fast path is used for loads/stores of 32-bit values
• the complete path is used for everything else (values of a different size, atomics)
• the compiler needs to explicitly use one of these paths
• the access path is the same for a whole buffer, so we can easily degrade the global memory bandwidth

Fast Path vs. Complete Path

__kernel void CopyComplete(__global const float *input,
                           __global float *output)
{
    int gid = get_global_id(0);
    if (gid < 0) {
        atom_add((__global int*)output, 1);
    }
    output[gid] = input[gid];
}

The condition if (gid < 0) is never true, but it forces the compiler to use the complete path.
• bandwidth difference on a Radeon HD 5870: 96 GB/s vs. 18 GB/s

Global Memory Access

Permutation of the thread-element mapping within a wavefront
• small penalization (< 10%)
• better than c.c. < 1.2

Faster access using 128 bits in a single instruction
• e.g. accessing float4
• 122 GB/s instead of 96 GB/s on the HD 5870 with the memory copy example

Memory Channels

Radeons of the 5000 series have memory channels interleaved by 256 bytes
• all threads within a wavefront should use the same channel
  • a wavefront accessing an aligned contiguous block of 32-bit elements (with an arbitrary permutation of the thread-element mapping) uses the same channel
• if multiple channels are accessed by a wavefront, the access is serialized
  • occurs e.g. with misaligned access

Bank and Channel Conflicts

Analogy of partition camping
• the global memory is accessed via banks and channels
• concurrently running work-groups should access it via different channels and different banks
  • the bandwidth is limited otherwise
• the arrangement of banks depends on the number of channels
  • for instance, with 8 channels the bank switches every 2 KB
• high penalization for accessing the same channel and the same bank (0.3 vs. 93 GB/s on a Radeon HD 5870)

Local Data Storage

Local Data Storage (LDS) is very similar to NVIDIA's shared memory
• composed of 32 or 16 banks
• a quarter-wavefront needs to access different banks simultaneously
  • otherwise bank conflicts appear
• in the case of 32 banks, we can efficiently use float2
• broadcast is supported for a single value (analogy of c.c. 1.x)

AMD GCN GPU Architecture

The current architecture, known as Graphics Core Next, is significantly different from the previous generations:
• no VLIW; a compute unit contains one scalar processor and four vector processors
• the code performed by threads is scalar (vectorized code is usually slower because of resource consumption)
• the penalization of conditions is lower compared to VLIW
• L1 cache for both reads and writes
• concurrent kernel invocations (sketched below)
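Concurrent kernel invocations can be expressed, e.g., with an out-of-order queue and events (see the queue slides above). A minimal sketch with hypothetical kernels hKernelA, hKernelB, hKernelC and assumed size_t launch sizes:

/* An out-of-order queue allows the runtime to overlap enqueued kernels. */
cl_command_queue hOooQueue = clCreateCommandQueue(hContext, cdDevice,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

cl_event evA, evB;
clEnqueueNDRangeKernel(hOooQueue, hKernelA, 1, 0, &globalA, &localA,
    0, NULL, &evA);
clEnqueueNDRangeKernel(hOooQueue, hKernelB, 1, 0, &globalB, &localB,
    0, NULL, &evB);

/* Ordering is enforced only where we ask for it, via the event wait list. */
cl_event deps[2] = { evA, evB };
clEnqueueNDRangeKernel(hOooQueue, hKernelC, 1, 0, &globalC, &localC,
    2, deps, NULL);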