<span id="page-0-0"></span>

メロメ メ都 メメ きょくきょう

重

 $299$ 

# **OpenCL**

#### Jiří Filipovič

#### Fall 2020

Jiří Filipovič [OpenCL](#page-24-0)

#### <span id="page-1-0"></span>What is OpenCL?

- an open standard for heterogeneous systems programming
- **.** low-level, derived from C, HW abstraction very similar to CUDA
- Advantages over CUDA
	- can be used for wide area of HW
	- open standard, independent on a single corporation

Disadvantages compared to CUDA

- more complex API (similar to CUDA Driver API)
- often less mature implementation
- **•** slower implementation of new HW features

つくい

<span id="page-2-0"></span>

One implementation can be compiled for different types of HW

 $\bullet$  if we do not use extensions  $\dots$ 

However, the implementation optimized for some type of HW may be very slow on another HW

• we need to re-optimize for different HW architectures So, it is the standard for programming of various types of HW, but we need to write different kernels for different architectures.

• high importance of easily modifiable code or autotuning

4 0 8

<span id="page-3-0"></span>

[Introduction](#page-1-0) [CUDA](#page-5-0) → OCL [AMD GPU Architecture](#page-14-0)

重

 $2Q$ 

### Performance Portability



Obrázek: SGEMM optimized for Fermi and Cypress, running on Fermi<sup>1</sup>.

<sup>1</sup> Du et al. From CUDA to OpenCL: Towards a Performance-portable S[olut](#page-2-0)i[on f](#page-4-0)[or](#page-2-0) [Mu](#page-3-0)[lti](#page-4-0)[-pl](#page-0-0)[at](#page-1-0)[fo](#page-4-0)[rm](#page-5-0)[GP](#page-1-0)[U](#page-4-0) Programming イロト イ押 トイモト イモト

<span id="page-4-0"></span>

**[Introduction](#page-1-0)**  $\overline{C}$  **[CUDA](#page-5-0) → OCL**  $\overline{OC}$   $\overline{$ 

重

 $2Q$ 

### Performance Portability



Obrázek: SGEMM optimized for Fermi a Cypress, running on Cypress<sup>2</sup>.

<sup>2</sup> Du et al. From CUDA to OpenCL: Towards a Performance-portable S[olut](#page-3-0)i[on f](#page-5-0)[or](#page-3-0) [Mu](#page-4-0)[lti](#page-5-0)[-pl](#page-0-0)[at](#page-1-0)[fo](#page-4-0)[rm](#page-5-0)[GP](#page-1-0)[U](#page-4-0) Programming イロメ イ御メ イヨメ イヨメ

### <span id="page-5-0"></span>Main Differences

- OpenCL is not integrated to  $C/C++$ 
	- the OpenCL kernel is stored as a string, which is usually compiled during program execution
	- kernel cannot share code with  $C/C++$  codebase (user-defined types, common functions etc.)
- Kernels in OpenCL do not use pointers
	- we cannot dereference, use pointer arithmetics, link different buffers
	- we can traverse the buffer by index, of course
- OpenCL is strictly derived from C
	- no  $C++$  stuff
- OpenCL uses queues for HW devices
	- eases using multiple devices/streams
- Queues can work out-of-order
	- **e** eases load balancing

**K ロ ⊁ K 伊 ⊁ K 店** 

∢ 重→

 $2Q$ 

≮ロト ⊀個 ▶ ⊀ 君 ▶ ⊀ 君 ▶

活

 $299$ 

# CUDA-OpenCL dictionary

Main differences in terminology



K ロ ▶ K @ ▶ K 결 ▶ K 결 ▶ │ 결

 $2Q$ 

### Vector Addition – Kernel

#### CUDA

```
__global__ void addvec ( float ∗a , float ∗b , float ∗c )
{
  int i = blockIdx x *blockDim x + threadIdx x;
 c[i] = a[i] + b[i];}
```
### OpenCL

```
__kernel void vecadd ( _ global float * a,  _ global float * b,
__global float ∗ c )
{
  int i = get\_global_id(0);
 c[i] = a[i] + b[i];}
```
 $2Q$ 

### Vector Addition – Host Code

To execute the kernel, we need

- to define a platform
	- device (at least one)
	- context
	- queues
- allocate and copy data
- compile the kernel code
- configure the kernel and execute it

4 0 8 4 A

K ロ ▶ K 御 ▶ K 君 ▶ K 君 ▶ ○ 君

 $2Q$ 

### Vector Addition – Platform Definition

```
cl uint num devices returned :
cl_device_id cdDevice ;
err = c1GetDeviceIBs (NULL, CL_DEVICE_TYPE_GPU, 1, 1)&cdDevice , &num_devices_returned ) ;
cl context hContext:
hContext = clCreateContext (0, 1, \&cdDevice, NULL, NULL, &err);
cl_command_queue hQueue ;
hQueue = clCreateCommandQueue(hContext, hDevice, 0, \&err);
```
### Vector Addition – Platform Definition

The platform can have more devices

- $\bullet$  can be selected by the type (e.g. a GPU)
- can be selected by vendor
- we can also choose HW using finer informations
	- number of cores
	- frequency
	- memory size
	- extensions (double precision, atomic operations etc.)
- Each device needs at least one queue
	- **•** cannot be used otherwise

### Vector Addition – Memory Allocation and Copy

```
cl_mem hdA , hdB , hdC ;
hdA = clCreateBuffer ( hContext , CL_MEM_READ_ONLY ,
  cnDimension * sizeof (c1_fload), pA, 0);. . .
```
There is no explicit copy – allocation and copy is performed in lazy fashion, i.e. in time when data are needed. Consequently, the target device is not defined in the memory allocation.

 $\left\{ \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \end{array} \right\}$  ,  $\left\{ \begin{array}{ccc} 0 & 0 & 0 \\ 0 & 0 & 0 \end{array} \right\}$ 

[Introduction](#page-1-0)  $\text{CUDA} \rightarrow \text{OCL}$  $\text{CUDA} \rightarrow \text{OCL}$  $\text{CUDA} \rightarrow \text{OCL}$  [AMD GPU Architecture](#page-14-0)

KID KARA KE KIER I E YOKO

### Vector Addition – Kernel Execution

```
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks ∗ cnBlockSize ;
cl_program hProgram ;
hProgram = clCreateProgramWithSource(hContext, 1, sProgramSource
  , 0, 0 ) ;
\text{clBuilderrogram}(\text{hProgram}, 0, 0, 0, 0, 0);
cl kernel hKernel :
hKernel = clCreateKernel (hProgram, "addvec", 0);
clSetKernelArg (hKernel, 0, sizeof ( cl_mem ), ( void *)&hdA );
clSetKernelArg (hKernel, 1, sizeof ( cl_mem ), ( void *)&hdB );
clSetKernelArg ( hKernel, 2, sizeof (c_1 mem), (void *)\&hdhdC);
clEnqueueNDRangeKernel ( hQueue , hKernel , 1 , 0 , &cnDimension
  , &cnBlockSize, 0, 0, 0);
```
 $| Introduction | CUDA      | OCL         | AMD GPU Architecture |
|--------------|-----------|-------------|----------------------|
| 0000         | 000000000 | 00000000000 |                      |$  $| Introduction | CUDA      | OCL         | AMD GPU Architecture |
|--------------|-----------|-------------|----------------------|
| 0000         | 000000000 | 00000000000 |                      |$  $| Introduction | CUDA      | OCL         | AMD GPU Architecture |
|--------------|-----------|-------------|----------------------|
| 0000         | 000000000 | 00000000000 |                      |$  $| Introduction | CUDA      | OCL         | AMD GPU Architecture |
|--------------|-----------|-------------|----------------------|
| 0000         | 000000000 | 00000000000 |                      |$  $| Introduction | CUDA      | OCL         | AMD GPU Architecture |
|--------------|-----------|-------------|----------------------|
| 0000         | 000000000 | 00000000000 |                      |$  $| Introduction | CUDA      | OCL         | AMD GPU Architecture |
|--------------|-----------|-------------|----------------------|
| 0000         | 000000000 | 00000000000 |                      |$  $| Introduction | CUDA      | OCL         | AMD GPU Architecture |
|--------------|-----------|-------------|----------------------|
| 0000         | 000000000 | 00000000000 |                      |$ 

 $2Q$ 

K 로 )

### Vector Addition – Cleanup

```
clReleaseKernel ( hKernel ) ;
clReleaseProgram ( hProgram ) ;
clReleaseMemObj ( hdA ) ;
clReleaseMemObj ( hdB ) ;
clReleaseMemObj ( hdC ) ;
clReleaseCommandQueue ( hQueue ) ;
clReleaseContext ( hContext ) ;
```
**K ロ ▶ | K 御 ▶ | K 舌** 

### <span id="page-14-0"></span>AMD VLIW GPU Architecture

Older processors

• Evergreen and Northern Islands

We will discuss main differences between AMD and NVIDIA GPU

• the rest is very similar

Main differences

- VLIW architecture
- two memory access modes the fast path and complete path
- **•** less sensitive to misaligned access, more sensitive to partition camping analogy
- wavefront (the warp analogy) has 64 threads

### VLIW Architecture

#### VLIW

- **•** the instruction word includes several independent operations
- static planning of instruction parallelism (dependencies analyzed during compilation)
- allows higher density of ALUs
- threads should perform a code with sufficient instruction parallelism and a compiler needs to recognize it
	- easier in typical graphics tasks than general computating ones
- AMD GPU implements VLIW-5 or VLIW-4, 1 instruction is **SFU**

# Optimizations for VLIW

Explicit vectorization

- we work with vector variables (e.g. float4)
- generation of VLIW is straightforward for the compiler

Implicit generation of VLIW

- we write a scalar code
- compiler tries to recognize independent instruction and create VLIW code
- we can help the compiler by unrolling and grouping the same operations performing different iterations

 $2Q$ 

## Optimizations for VLIW

Issues with VLIW

- higher consumption of on-chip resources per thread (unrolling, vector types)
- we need independent instructions
	- **•** problematic e.g. with conditions
- **•** together with large wavefront it is highly sensitive to divergence

4 0 F

## Global Memory Access

Fast path vs. complete path

- fast path is significantly faster
- $\bullet$  fast path is used for load/store of 32-bit values
- complete path is used for everything other (values of different size, atomics)
- the compiler needs to explicitly use one of those paths
	- access path is the same for the whole buffer, so we can degrade the global memory bandwidth easily

 $2Q$ 

メロメ メ御メ メミメ メミメー

#### Fast path vs. complete path

```
__kernel void
CopyComplete ( __ global const float * input, __ global float* output)
{
  int \text{gid} = \text{get_global_id}(0);if (gid < 0){
    atom\_add ((\_global int *) output, 1);}
  output [gid] = input [gid];
}
```
The condition if  $(gid < 0)$  is never true, but enforces using complete path.

• bandwidth difference on Radeon HD 5870: 96 GB/s vs. 18 GB/s

 $2Q$ 

# **Global Memory Access**

Permutation of thread-element mapping in wavefront

- small penalization  $(< 10\%)$
- $\bullet$  better than c.c.  $< 1.2$

Faster access using 128-bit in single instruction

- e.g. accessing float4
- 122 GB/s instead 96 GB/s using HD 5870 and the memory copy example

**K ロ ▶ K 倒 ▶** 

# Memory Channels

Radeons of 5000 series have memory channels interleaved by 256 bytes

- all threads within wavefront should use the same channel
	- wavefront accessing the aligned contiguous block of 32-bit elements (with arbitrary permutation of thread-element mapping) uses the same channel
- if multiple channels are accessed by wavefront, the access is serialized
	- occurs e.g. in misaligned access

# Bank and Channel Conflicts

Analogy of partition camping

- the global memory is accessed using banks and channels
- concurrent workgroups should access via different channels and different banks
	- **•** bandwidth is limited otherwise
- the arrangement of banks depends on the number of channels
	- for instance. 8 channels means that the bank switches every 2 KB
- high penalization of accessing the same channel and the same bank (0.3 vs. 93 GB/s on Radeon HD 5870)

## Local Data Storage

Local Data Storage (LDS) is very similar to NVIDIA's shared memory

- composed of 32 or 16 banks
- the quarter-wafefront needs to access different banks simultaneously
	- otherwise the bank conflicts appear
	- in the case of 32 banks we can efficiently use float2
- broadcast is supported for a single value (analogy of c.c.  $1.x$ )

 $4.17 \times$ 

# <span id="page-24-0"></span>AMD GCN GPU Architecture

Nowadays architecture, known as Graphic Core Next. Significantly different than previous generations

- no VLIW, compute unit contains one scalar processor and four vector processors
	- the code performed by threads is scalar (vectorized code usually slower because of resource consumption)
	- conditions penalization is lower compared to VLIW
- L1 cache for read and write
- **e** concurrent kernel invocations