About The Class

Motivation

GPU Architecture

C for CUDA

イロト イヨト イヨト イヨト

æ

Sample Code

# Introduction, CUDA Basics

Jiří Filipovič

Fall 2024

Jiří Filipovič Introduction, CUDA Basics

| About The Class | Motivation<br>00000000 | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------------------|------------------|------------|-------------|
| About the o     | lass                   |                  |            |             |

The class is focused on algorithm design and programming of *general purpose* computing applications on *many-core vector processors* 

| About The Class<br>●○○○○○○ | Motivation | GPU Architecture | C for CUDA | Sample Code |
|----------------------------|------------|------------------|------------|-------------|
| About the c                | lass       |                  |            |             |

The class is focused on algorithm design and programming of *general purpose* computing applications on *many-core vector processors* 

We will focus to CUDA GPUs first:

- C for CUDA is good for teaching (easy API, a lot of examples available, mature compilers and tools)
- restricted to NVIDIA GPUs and x86 CPUs (with PGI)

## About the class

The class is focused on algorithm design and programming of *general purpose* computing applications on *many-core vector processors* 

We will focus to CUDA GPUs first:

- C for CUDA is good for teaching (easy API, a lot of examples available, mature compilers and tools)
- restricted to NVIDIA GPUs and x86 CPUs (with PGI)

After learning CUDA, we focus to OpenCL

- programming model very similar to CUDA, easy to learn when you already know CUDA
- can be used with various HW devices
- $\bullet\,$  we will focus on code optimizations for x86, Intel MIC (Xeon Phi) and AMD GPUs

The class is practically oriented – besides efficient parallelization, we will focus on writing efficient code.

| About The Class<br>○●○○○○○ | Motivation | GPU Architecture | C for CUDA | Sample Code |
|----------------------------|------------|------------------|------------|-------------|
| What is offer              | red        |                  |            |             |

You will learn:

- architecture of NVIDIA and AMD GPUs, Xeon Phi
- architecture-aware design of data-parallel algorithms
- programming in C for CUDA and OpenCL
- performance tuning and profiling
- use cases

What is expected from you

During the semester, you will work on a practically oriented project

- important part of your total score in the class
- the same task for everybody, we will compare speed of your implementation
- 50 + 20 points of total score
  - working code: 25 points
  - efficient implementation: 25 points
  - speed of your code relative to your class mates: at most 20 points (only to improve your final grading, from 1 to 20 points will be granted for projects above average)

Exam (oral or written, depending on the number of students)

• 50 points

< ロ > < 同 > < 三 > < 三 >

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------|------------------|------------|-------------|
| Grading         |            |                  |            |             |

For those finishing by exam:

- A: 92–100
- B: 86–91
- C: 78–85
- D: 72–77
- E: 66–71
- F: 0-65 pts

For those finishing by colloquium:

• 50 pts



CUDA documentation (installed as a part of CUDA Toolkit, downloadable from *developer.nvidia.com*)

- CUDA C Programming Guide (most important properties of CUDA)
- CUDA C Best Practices Guide (more detailed document focusing on optimizations)
- CUDA Reference Manual (complete description of C for CUDA API)
- other useful documents (nvcc guide, PTX language description, library manuals, ...)

CUDA article series, Supercomputing for the Masses

• http://www.ddj.com/cpp/207200659

 About The Class
 Motivation
 GPU Architecture
 C for CUDA

 000000
 00000000
 00000000
 0000000

## Materials – OpenCL

- OpenCL 1.1 Specification
- AMD Accelerated Parallel Processing Programming Guide
- Intel OpenCL SDK Programming Guide
- Writing Optimal OpenCL Code with Intel OpenCL SDK

- Ben-Ari M., Principles of Concurrent and Distributed Programming, 2nd Ed. Addison-Wesley, 2006
- Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2004

< ロ > < 同 > < 三 > < 三 >

| Mativation      |                        | han at a sufawa  |            |             |
|-----------------|------------------------|------------------|------------|-------------|
| About The Class | Motivation<br>●0000000 | GPU Architecture | C for CUDA | Sample Code |





▲□▶ ▲□▶ ▲臣▶ ▲臣▶ 三臣 - のへで

 About The Class
 Motivation
 GPU Architecture
 C for CUDA
 Sample Code

 0000000
 00000000
 00000000
 000000000
 000000000

## Motivation – GPU memory bandwidth



ヘロア 人間 アメヨア 人間 アー

Ð,

OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?

- $\bullet$  well, it is more difficult comparing to writing serial C/C++ code...
- but can we compare it to serial code?

 About The Class
 Motivation
 GPU Architecture
 C for CUDA
 Sample Code

 Motivation - programming complexity
 Sample Code
 S

OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?

- $\bullet$  well, it is more difficult comparing to writing serial C/C++ code...
- but can we compare it to serial code?

#### Moore's Law

Number of transistors on a single chip doubles every 18 months

OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?

- $\bullet$  well, it is more difficult comparing to writing serial C/C++ code...
- but can we compare it to serial code?

#### Moore's Law

Number of transistors on a single chip doubles every 18 months

Corresponding growth of performance comes from

- **in the past:** frequency increase, instruction parallelism, out-of-order instruction processing, caches, etc.
- today: vector instructions, increase in number of cores

ヘロト ヘヨト ヘヨト ヘヨト

#### Motivation – paradigm change

Moore's Law consequences:

- in the past: changes were important for compiler developers; application developers didn't need to care much
- today: in order to utilize state-of-the-art processors, it is necessary to write parallel and vectorized code
  - it is necessary to find parallelism in the problem being solved. which is a task for a programmer, not for a compiler (at least for now)
  - writing efficient code for modern CPUs is similarly difficult as writing for GPUs

< ロ > < 同 > < 三 > < 三 >



Important problem from computational chemistry

- we have a molecule defined by position and charges of its atoms
- the goal is to compute charges at a 3D spatial grid around the molecule

In a given point of the grid, we have

$$V_i = \sum_j \frac{w_j}{4\pi\epsilon_0 r_{ij}}$$

Where  $w_j$  is charge of the *j*-th atom,  $r_{ij}$  is Euclidean distance between atom *j* and the grid point *i* and  $\epsilon_0$  is vacuum permittivity.



## Electrostatic Potential Map

Initial implementation

- suppose we know nothing about HW, just know C++  $\,$
- algorithm needs to process 3D grid such that it sums potential of all atoms for each grid point
- we will iterate over atoms in outer loop, as it allows to precompute positions of grid points and minimizes number of accesses into input/output array

 About The Class
 Motivation
 GPU Architecture
 C for CUDA
 Sample Code

 0000000
 00000000
 00000000
 000000000
 0000000000

#### Electrostatic Potential Map

```
void coulomb(const sAtom* atoms, const int nAtoms,
    const float gs, const int gSize, float *grid) {
 for (int a = 0; a < nAtoms; a++) {
    sAtom mvAtom = atoms[a]:
    for (int x = 0; x < gSize; x++) {
      float dx^2 = powf((float)x * gs - myAtom.x, 2.0f);
      for (int y = 0; y < gSize; y++) {
        float dy_2 = powf((float)y * gs - myAtom.y);
        for (int z = 0; z < gSize; z++) {
          float dz = (float)z * gs - myAtom.z;
          float e = myAtom.w / sqrtf(dx2 + dy2 + dz*dz);
          grid z*gSize*gSize + y*gSize + x += e;
     }
```

• □ ▶ • □ ▶ • □ ▶ • □ ▶



 naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)



- naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
- 476.9 Mevals/s when optimized cache:  $2.9 \times$  speedup



- naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
- 476.9 Mevals/s when optimized cache:  $2.9 \times$  speedup
- 2,577 Mevals/s when vectorized:  $15.6 \times$  speedup

- naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
- 476.9 Mevals/s when optimized cache:  $2.9 \times$  speedup
- 2,577 Mevals/s when vectorized:  $15.6 \times$  speedup
- **9,914** Mevals/s when parallelized:  $60.2 \times$  speedup

< ロ > < 同 > < 三 > < 三 >

Electrostatic Potential Map

Execution on 4-core CPU at 3.6GHz (Sandy Bridge) + GeForce GTX 1070 (Pascal)

- naive implementation 164.7 millions of atoms evaluated per second (MEvals/s)
- 476.9 Mevals/s when optimized cache:  $\textbf{2.9}\times$  speedup
- 2,577 Mevals/s when vectorized:  $15.6\times$  speedup
- **9,914** Mevals/s when parallelized:  $60.2 \times$  speedup
- 537,900 Mevals/s GPU version: 3266× speedup

GPU speedup over already tuned CPU code is  $54\times$ , but the optimization effort is similar for CPU and GPU. In this class, you will learn how to optimize the code.

< ロ > < 同 > < 三 > < 三 >

### Why are GPUs so powerful?

Types of Parallelism

- Task parallelism
  - decomposition of a task into the problems that may be processed in parallel
  - usually more complex tasks performing different actions
  - usually more frequent (and complex) synchronization
  - ideal for small number of high-performance processors
- Data parallelism
  - parallelism on the level of data structures
  - usually the same operations on many items of a data structure
  - finer-grained parallelism allows for simple construction of individual processors

### Why are GPUs so powerful?

From programmer's perspective

- some problems are rather data-parallel, some task-parallel (matrix multiplication vs. graph traversal)
- From hardware perspective
  - processors for data-parallel tasks may be simpler
  - it is possible to achieve higher arithmetic performance with the same size of a processor
  - simpler memory access patterns allow for high-throughput memory designs

About The Class

Motivation

GPU Architecture

C for CUDA

Sample Code

## **GPU** Architecture



| DRAM |  |  |  |  |  |  |
|------|--|--|--|--|--|--|
|      |  |  |  |  |  |  |
| GPU  |  |  |  |  |  |  |
|      |  |  |  |  |  |  |

ヘロン ヘロン ヘビン ヘビン

2

GPU Architecture

Main differences compared to CPU

- high parallelism: hundreds thousands threads needed to utilize high-end GPUs
- SIMT model: subsets of threads runs in lock-step mode
- distributed on-chip memory: subsets of threads shares their private memory
- restricted caching capabilities: small cache, often read-only

Algorithms usually need to be redesigned to be efficient on GPU.

GPU Architecture

Within the system:

- co-processor with dedicated memory (discrete GPU)
- asynchronous processing of instructions
- attached using PCI-E to the rest of the system (discrete GPU)

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------|------------------|------------|-------------|
| CUDA            |            |                  |            |             |

CUDA (Compute Unified Device Architecture)

- architecture for parallel computations developed by NVIDIA
- provides a new programming model, allows efficient implementation of general GPU computations
- may be used in multiple programming languages



| About The Class | Motivation | GPU Architecture<br>○○○○○●○○ | C for CUDA | Sample Code |
|-----------------|------------|------------------------------|------------|-------------|
| C80 Procos      | cor        |                              |            |             |

#### G80

- the first CUDA processor
- 16 multiprocessors
- each multiprocessor
  - 8 scalar processors
  - 2 units for special functions
  - threads are grouped into warps by 32
    - SIMT
  - up to 768 threads
    - HW for thread switching and scheduling
  - native synchronization within the multiprocessor

| About The Class | Motivation | GPU Architecture<br>○○○○○○●○ | C for CUDA | Sample Code |
|-----------------|------------|------------------------------|------------|-------------|
| G80 Memo        | ry Model   |                              |            |             |

Memory model

- 8192 registers shared among all threads of a multiprocessor
- 16 kB of shared memory
  - local within the multiprocessor
  - as fast as registry (under certain constraints)
- constant memory
  - cached, read-only
- texture memory
  - cached with 2D locality, read-only
- global memory
  - non cached, read-write
- data transfers between global memory and system memory through PCI-E

 About The Class
 Motivation
 GPU Architecture
 C for CUDA
 Sample Code

 0000000
 0000000
 0000000
 00000000
 000000000
 0000000000

#### G80 Processor



Jiří Filipovič Introduction, CUDA Basics

C for CUDA is an extension of C for parallel computations

- $\bullet\,$  explicit separation of host (CPU) and device (GPU) code
- thread hierarchy
- memory hierarchy
- synchronization mechanisms
- API

Thread hierarchy

- threads are organized into blocks
- blocks form a grid
- all threads from a block run on the same multiprocessor
- problem is decomposed into sub-problems that can be run independently in parallel (blocks)
- individual sub-problems are divided into small pieces that can be run cooperatively in parallel (threads)

scales well

| About The Class | Motivation | GPU Architecture | C for CUDA<br>○○●○○ | Sample Code |
|-----------------|------------|------------------|---------------------|-------------|
| Thread Hiera    | archy      |                  |                     |             |

|                                         | Grid         |                     |             |                |                                 |         |                                       |
|-----------------------------------------|--------------|---------------------|-------------|----------------|---------------------------------|---------|---------------------------------------|
|                                         | Block        | (0, 0)              | Block       | ( <b>1, 0)</b> | Block                           | (2, 0)  |                                       |
|                                         | <u>}}}}}</u> | *****               | <u>}}}}</u> | *****          | <u>}}}}</u>                     |         |                                       |
|                                         | Block        | (0, 1) <sup>-</sup> | Block       | (1, 1)         | Block                           | (2, 1)  |                                       |
| , e e e e e e e e e e e e e e e e e e e |              |                     |             |                | ,<br>,<br>,<br>,<br>,<br>,<br>, | · * * * | · · · · · · · · · · · · · · · · · · · |
|                                         |              | 1                   | Block       | (1, 1)         |                                 |         |                                       |
| Threa                                   | d (0, 0)     | Thread              | (1, 0)      | Thread         | (2, 0)                          | Thread  | (3, 0)                                |
| Threa                                   | d (0, 1)     | Thread              | (1, 1)      | Thread         | (2, 1)                          | Thread  | (3, 1)                                |
| Threa                                   | d (0, 2)     | Thread              | (1, 2)      | Thread         | (2, 2)                          | Thread  | (3, 2)                                |

・ロト・日本・日本・日本・日本・日本
| About The Class | Motivation | GPU Architecture | C for CUDA<br>○○○●○ | Sample Code |
|-----------------|------------|------------------|---------------------|-------------|
| Memory Hi       | erarchy    |                  |                     |             |

More memory types:

- different visibility
- different lifetime
- different speed and behavior
- brings good scalability

| About The Class | Motivation | GPU Architecture | C for CUDA<br>○○○○● | Sample Code |
|-----------------|------------|------------------|---------------------|-------------|
| Memory Hie      | rarchy     |                  |                     |             |



◆□ ▶ ◆□ ▶ ◆ □ ▶ ◆ □ ▶ ● □ ● ● ● ●

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code<br>●○○○○○○○○○ |
|-----------------|------------|------------------|------------|---------------------------|
| An Example      | – Sum of   | Vectors          |            |                           |

# We want to sum vectors a and b and store the result in vector c

ヘロア 人間 アメヨア 人間 アー

Ð,

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------|------------------|------------|-------------|
| An Example      | – Sum of   | Vectors          |            |             |

We want to sum vectors a and b and store the result in vector cWe need to find parallelism in the problem.

▲ 문 ▶ | ▲ 문 ▶



We want to sum vectors a and b and store the result in vector cWe need to find parallelism in the problem. Serial sum of vectors:

for (int i = 0; i < N; i++)
c[i] = a[i] + b[i];</pre>

イロト イポト イヨト イヨト



We want to sum vectors a and b and store the result in vector cWe need to find parallelism in the problem. Serial sum of vectors:

for (int i = 0; i < N; i++)
c[i] = a[i] + b[i];</pre>

Individual iterations are independent – it is possible to parallelize, scales with the size of the vector.

We want to sum vectors a and b and store the result in vector cWe need to find parallelism in the problem. Serial sum of vectors:

for (int i = 0; i < N; i++)
c[i] = a[i] + b[i];</pre>

Individual iterations are independent – it is possible to parallelize, scales with the size of the vector. i-th thread sums i-th component of the vector:

c[i] = a[i] + b[i];

How do we find id of the thread?

イロト イポト イヨト イヨト

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------|------------------|------------|-------------|
| Thread Hier     | archy      |                  |            |             |



◆□ ▶ ◆□ ▶ ◆ □ ▶ ◆ □ ▶ ● □ ● ● ● ●

# Thread and Block Identification

C for CUDA has built-in variables:

- threadIdx.{x, y, z} tells position of a thread in a block
- blockDim.{x, y, z} tells size of the block
- **blockIdx**.{**x**, **y**, **z**} tells position of the block in grid (z always equals 1)
- gridDim.{x, y, z} tells grid size (z always equals 1)

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------|------------------|------------|-------------|
| An Example      | – Sum of   | Vectors          |            |             |

> < E > < E >

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------|------------------|------------|-------------|
| An Example      | – Sum of \ | /ectors          |            |             |

```
int i = blockIdx.x*blockDim.x + threadIdx.x;
```

(A) (E) (A) (E) (A)

int i = blockIdx.x\*blockDim.x + threadIdx.x;

Whole function for parallel summation of vectors:

```
__global__ void addvec(float *a, float *b, float *c){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
```

イロト イポト イヨト イヨト

int i = blockIdx.x\*blockDim.x + threadIdx.x;

Whole function for parallel summation of vectors:

```
__global__ void addvec(float *a, float *b, float *c){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
```

The function defines so called kernel; we specify how meny threads and what structure will be run when calling.

Function Type Quantifiers

C syntax enhanced by quantifiers defining where the code is executed and from where it can be called:

- \_\_device\_\_ function is run on device (GPU) only and can be called from the device code only
- \_\_global\_\_ function is run on device (GPU) only and can be called from the host (CPU) code only
- \_\_host\_\_\_ function is run on host only and can be called from the host only
- \_\_host\_\_ and \_\_device\_\_ may be combined function is compiled for both then

< ロ > < 同 > < 三 > < 三 >

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code |
|-----------------|------------|------------------|------------|-------------|
|                 |            |                  |            |             |

-≣->

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code<br>○○○○●○○○○○ |
|-----------------|------------|------------------|------------|---------------------------|
|                 |            |                  |            |                           |

• allocate memory for vectors and fill it with data

| About The Class | Motivation<br>00000000 | GPU Architecture | C for CUDA | Sample Code<br>○○○○○●○○○○○ |
|-----------------|------------------------|------------------|------------|----------------------------|
|                 | /                      |                  |            |                            |

- allocate memory for vectors and fill it with data
- allocate memory on GPU

| About The Class | Motivation<br>0000000 | GPU Architecture                      | C for CUDA | Sample Code<br>○○○○○●○○○○○ |
|-----------------|-----------------------|---------------------------------------|------------|----------------------------|
|                 |                       | · · · · · · · · · · · · · · · · · · · |            |                            |

- allocate memory for vectors and fill it with data
- allocate memory on GPU
- copy vectors a and b to GPU

| About The Class                       | Motivation | GPU Architecture | C for CUDA | Sample Code<br>○○○○○●○○○○○ |
|---------------------------------------|------------|------------------|------------|----------------------------|
| · · · · · · · · · · · · · · · · · · · |            | /                |            |                            |

- allocate memory for vectors and fill it with data
- allocate memory on GPU
- copy vectors a and b to GPU
- compute the sum on GPU

| About The Class | Motivation | GPU Architecture | C for CUDA | Sample Code<br>○○○○○●○○○○○ |
|-----------------|------------|------------------|------------|----------------------------|
| 0000000         | 00000000   | 00000000         | 00000      | 0000000000                 |

- allocate memory for vectors and fill it with data
- allocate memory on GPU
- copy vectors a and b to GPU
- compute the sum on GPU
- store the result from GPU into c

|  | About The Class | Motivation<br>0000000 | GPU Architecture | C for CUDA | Sample Code |
|--|-----------------|-----------------------|------------------|------------|-------------|
|--|-----------------|-----------------------|------------------|------------|-------------|

- allocate memory for vectors and fill it with data
- allocate memory on GPU
- copy vectors a and b to GPU
- compute the sum on GPU
- store the result from GPU into c
- use the result in c :-)

When managed memory is used (requires GPU with computing capability 3.0 and CUDA 6.0 or better), steps written in italics are not required.

### CPU code that fills a and b and computes c

```
#include <stdio.h>
#define N 64
int main(){
  float *a. *b. *c:
  cudaMallocManaged(&a, N*sizeof(*a));
  cudaMallocManaged(&b, N*sizeof(*b));
  cudaMallocManaged(&c, N*sizeof(*c));
  for (int i = 0; i < N; i++) {
    a[i] = i;
   b[i] = i * 3;
  }
// GPU code will be here
  for (int i = 0; i < N; i++)
    printf("%f, ", c[i]);
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0:
```

イロト イヨト イヨト イヨト

2

Using managed memory, CUDA maintains memory transfers between CPU and GPU automatically.

• memory coherency is guaranteed

• GPU memory cannot be used when any GPU kernel is running Memory operations can be programmed explicitly

```
cudaMalloc(void** devPtr, size_t count);
cudaFree(void* devPtr);
cudaMemcpy(void* dst, const void* src, size_t count,
    enum cudaMemcpyKind kind);
```

< ロ > < 同 > < 三 > < 三 >

Running the kernel:

- kernel is called as a function; between the name and the arguments, there are triple angle brackets with specification of grid and block size
- we need to know block size and their count
- we will use 1D block and grid with fixed block size
- the size of the grid is determined in a way to compute the whole problem of vector sum

For vector size divisible by 32:

```
#define BLOCK 32
addvec<<</pre>N/BLOCK, BLOCK>>>(a, b, c);
```

How to solve a general vector size?

イロト イポト イヨト イヨト

 About The Class
 Motivation
 GPU Architecture
 C for CUDA
 Sample Code

 An Example – Sum of Vectors

#### We will modify the kernel source:

```
__global__ void addvec(float *a, float *b, float *c, int n){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}</pre>
```

And call the kernel with sufficient number of threads:

```
addvec \ll N/BLOCK + 1, BLOCK > >>(a, b, c, N);
```

Now we just need to compile it :-)

nvcc -o vecadd vecadd.cu

Where to work with CUDA?

- on a remote computer: airacuda.fi.muni.cz, barracuda.fi.muni.cz, accounts will be made
- your own machine: download and install CUDA toolkit and SDK from developer.nvidia.com

. . . . . . . .