GPU Acceleration of General Computing Tasks
Jiří Filipović
spring 2024

Motivation - arithmetic performance of GPUs

[Figure: theoretical GFLOP/s of CPUs and GPUs, April 2001 to December 2014]

Motivation - memory bandwidth of GPUs

[Figure: theoretical memory bandwidth in GB/s, 2003 to 2013; GPUs from GeForce FX 5900 and GeForce 6800 GT up to GeForce 780 Ti vs. CPUs from Northwood onward]

Motivation - programming complexity

OK, so GPUs are fast, but aren't they much more difficult to program?
• well, it's much more complicated than writing serial C++ code...
• but is it a fair comparison?

Moore's Law
The number of transistors that can be placed on a single chip doubles every 18 months.

The performance growth has been driven by:
• in the past: higher frequency, instruction-level parallelism, out-of-order instruction execution, etc.
• nowadays: wider vector instructions, more cores

Motivation - the paradigm shift

Consequences of Moore's Law:
• in the past: changes in processor architectures were relevant mainly to compiler developers
• nowadays: we need to explicitly parallelize and vectorize the code to keep scaling the performance
• this is still largely the developer's job; compilers have very limited capabilities here
• writing really efficient code is similarly difficult for GPUs and CPUs
What makes GPU powerful?

Parallelism types
• Task parallelism
  • the problem is decomposed into parallel tasks
  • tasks are typically complex and can perform different jobs
  • complex synchronization
  • best for a lower number of high-performance processors/cores
• Data parallelism
  • parallelism at the level of data structures
  • typically the same operation on multiple elements of a data structure
  • can be executed on simpler processors

Programmer's point of view
• some problems are more task-parallel, some more data-parallel (tree traversal vs. vector addition)

Hardware designer's point of view
• processors for data-parallel computations can be simpler, so we get more arithmetic power per square centimeter (i.e., for the same number of transistors)
• simpler memory access patterns allow building a memory with higher bandwidth

GPU Architecture

CPU vs. GPU
• hundreds of ALUs in tens of cores vs. tens of thousands of ALUs in tens of multiprocessors
• out-of-order vs. in-order execution
• MIMD and SIMD for short vectors vs. SIMT for long vectors
• big cache vs. small cache, often read-only

GPUs use more transistors for ALUs than for caches and instruction control => higher peak performance, less universal.

High-end GPU:
• a co-processor with dedicated memory
• asynchronous instruction execution
• connected to the rest of the system via PCI-E

CUDA (Compute Unified Device Architecture)
• an architecture for parallel computations developed by NVIDIA
• a programming model allowing to implement general programs on GPUs
• can be used with multiple programming languages

Processor G80
• the first CUDA processor
• contains 16 multiprocessors
• a multiprocessor
  • 8 scalar processors
  • 2 special function units
  • up to 768 threads
  • HW switching and scheduling of threads
• groups of 32 threads are organized into warps
• SIMT execution
• native synchronization within a multiprocessor
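To make the SIMT model concrete, here is a minimal sketch (not from the original slides; the kernel name is made up) of how a data-parallel operation maps onto threads and warps. It uses the CUDA built-in variables introduced later in this lecture:

    // Each thread handles one element. Threads 0..31 of a block form one
    // warp and execute each instruction in lockstep (SIMT); divergent
    // branches within a warp are serialized.
    __global__ void scale(float *data)
    {
        // global index computed from the built-in thread/block variables
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];
    }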
Memory model of G80
• 8192 registers shared among all threads within a multiprocessor
• 16 KB of shared memory
  • local within a multiprocessor
  • close to the registers' speed (under some circumstances)
• constant memory
  • cached, optimized for broadcast, read-only
• texture memory
  • cached, 2D spatial locality, read-only
• global memory
  • read-write, not cached
• transfers between system and global memory go via PCI-E

Example - vector addition

We need to compute the global position of the thread (using 1D blocks and grid):

    int i = blockIdx.x*blockDim.x + threadIdx.x;

The complete function for the parallel vector addition:

    __global__ void addvec(float *a, float *b, float *c)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i];
    }

The code defines a kernel (a parallel function executed on the GPU). When executing a kernel, the block size and the number of blocks have to be defined.

Function type qualifiers

The syntax of C is extended by function type qualifiers, determining from where the function can be called and where it is executed:
• __device__ functions are executed on the device (GPU) and called from device code
• __global__ functions are executed on the device and called from the host (CPU)
• __host__ functions are executed on the host and called from the host
• __host__ and __device__ can be combined; the function is then compiled for both host and device, and can be called from both
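To illustrate the qualifiers, here is a small sketch combining all three (our own example, not from the original slides; the function names are made up):

    // __host__ __device__: compiled for both CPU and GPU, callable from both
    __host__ __device__ float square(float x)
    {
        return x * x;
    }

    // __device__: runs on the GPU, callable only from device code
    __device__ float squarePlusOne(float x)
    {
        return square(x) + 1.0f;
    }

    // __global__: a kernel; runs on the GPU, launched from host code
    __global__ void applyKernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = squarePlusOne(data[i]);
    }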
Example - vector addition

For the complete computation of vector addition, we need to:
• allocate memory for the vectors, and fill it with some data
• allocate GPU memory
• copy vectors a and b to GPU memory
• compute the vector addition on the GPU
• copy the result back from GPU memory into c
• use c somehow :-)

When managed memory is used (supported from compute capability 3.0 and CUDA 6.0), we don't need to perform the explicit allocation and copy steps (the second, third, and fifth step above).

CPU code that fills a and b, and prints c:

    #include <stdio.h>

    #define N 64

    int main()
    {
        float *a, *b, *c;
        cudaMallocManaged(&a, N*sizeof(*a));
        cudaMallocManaged(&b, N*sizeof(*b));
        cudaMallocManaged(&c, N*sizeof(*c));

        for (int i = 0; i < N; i++) {
            a[i] = i;
            b[i] = i*2;
        }

        // placeholder for the GPU computation

        for (int i = 0; i < N; i++)
            printf("%f, ", c[i]);

        cudaFree(a);
        cudaFree(b);
        cudaFree(c);

        return 0;
    }
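The slides omit error handling for brevity. In real code, every CUDA runtime call returns a cudaError_t that is worth checking; a minimal sketch (the CHECK macro is our own helper, not part of the CUDA API):

    #include <stdio.h>
    #include <stdlib.h>

    // abort with a readable message if a CUDA runtime call fails
    #define CHECK(call)                                               \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                        cudaGetErrorString(err), __FILE__, __LINE__); \
                exit(EXIT_FAILURE);                                   \
            }                                                         \
        } while (0)

    // usage: CHECK(cudaMallocManaged(&a, N*sizeof(*a)));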
GPU memory management

We use managed memory, so CUDA automatically copies data between the CPU and the GPU:
• memory coherency is automatically ensured
• we cannot access managed memory while any GPU kernel is running (even if the kernel does not touch the buffer we want to use)

Alternatively, we can allocate and copy memory explicitly:

    cudaMalloc(void** devPtr, size_t count);
    cudaFree(void* devPtr);
    cudaMemcpy(void* dst, const void* src, size_t count,
               enum cudaMemcpyKind kind);

Example - vector addition

Kernel execution:
• the kernel is called like a C function; between the name and the arguments, there are triple angle brackets with the specification of grid and block size
• we need to know the block size and the number of blocks
• we will use a 1D block and grid with a fixed block size
• the size of the grid is determined so that the whole vector sum is computed

For a vector size divisible by 32:

    #define BLOCK 32
    addvec<<<N/BLOCK, BLOCK>>>(a, b, c);
    cudaDeviceSynchronize();

The synchronization after the kernel call ensures that c is accessed by the host code only after the kernel finishes.

Example - vector addition

How do we handle a general vector size? We modify the kernel source:

    __global__ void addvec(float *a, float *b, float *c, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

And call the kernel with a sufficient number of threads:

    addvec<<<(N+BLOCK-1)/BLOCK, BLOCK>>>(a, b, c, N);
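For comparison with the managed-memory version, here is a sketch of the same host code using the explicit cudaMalloc/cudaMemcpy API listed above. It is our own assembly of the pieces shown on the slides, not a slide from the original deck:

    #include <stdio.h>

    #define N 64
    #define BLOCK 32

    __global__ void addvec(float *a, float *b, float *c, int n)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        float a[N], b[N], c[N];   // host copies of the vectors
        float *da, *db, *dc;      // device copies

        for (int i = 0; i < N; i++) {
            a[i] = i;
            b[i] = i*2;
        }

        // allocate GPU memory
        cudaMalloc((void**)&da, N*sizeof(*da));
        cudaMalloc((void**)&db, N*sizeof(*db));
        cudaMalloc((void**)&dc, N*sizeof(*dc));

        // copy the input vectors to GPU memory
        cudaMemcpy(da, a, N*sizeof(*a), cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, N*sizeof(*b), cudaMemcpyHostToDevice);

        // compute the vector addition on the GPU
        addvec<<<(N+BLOCK-1)/BLOCK, BLOCK>>>(da, db, dc, N);

        // copy the result back; this cudaMemcpy waits for the kernel
        cudaMemcpy(c, dc, N*sizeof(*c), cudaMemcpyDeviceToHost);

        for (int i = 0; i < N; i++)
            printf("%f, ", c[i]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }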
Compilation

Now we just need to compile it :-)

    nvcc -o vecadd vecadd.cu

Thread-local memory

Registers
• the fastest memory, directly used by instructions
• local variables and intermediate results are stored in registers
  • if there are enough registers
  • if the compiler can determine array indices at compile time
• life-time of a thread

Local memory
• what cannot fit into registers goes to local memory
• physically stored in global memory, so it has longer latency and lower bandwidth
• life-time of a thread

Block-local memory

Shared memory
• the speed is close to registers
  • if there are no bank conflicts
  • typically requires some load/store instructions
• declared with __shared__
• can have a dynamic size (determined at kernel execution) if declared as extern without a specification of the array size
• life-time of a thread block

GPU-local memory

Global memory
• order-of-magnitude lower bandwidth compared to shared memory
• latency of hundreds of GPU cycles
• coalesced access is necessary for efficient use
• life-time of an application
• can be cached (depending on the GPU architecture)

Dynamic allocation with cudaMalloc, static allocation by using __device__.

Other memories
• constant memory
• texture memory
• system memory

Thread block-scope synchronization
• a native barrier
• has to be visited by all threads within a thread block
• only one instruction, very fast if it does not reduce parallelism
• __syncthreads()
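To show __shared__ and __syncthreads() working together, below is a sketch of a block-level tree reduction; it sums BLOCK elements per thread block and assumes BLOCK is a power of two. The kernel is our own illustration, not from the original slides:

    #define BLOCK 128

    // each thread block sums BLOCK consecutive elements of 'in'
    // and writes its partial sum to out[blockIdx.x]
    __global__ void blockSum(const float *in, float *out)
    {
        __shared__ float buf[BLOCK];   // block-local scratch space

        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x*BLOCK + tid];
        __syncthreads();               // buf is fully written by all threads

        // tree reduction in shared memory; halve the active threads each step
        for (int s = BLOCK/2; s > 0; s /= 2) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();           // the barrier must be reached by all threads
        }

        if (tid == 0)
            out[blockIdx.x] = buf[0];
    }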
Atomic operations
• perform read-modify-write operations on shared or global memory
• no interference with other threads
• for 32-bit and 64-bit integers (compute capability ≥ 1.2; float add with c.c. ≥ 2.0)
• arithmetic (Add, Sub, Exch, Min, Max, Inc, Dec, CAS) and bitwise (And, Or, Xor) operations

Synchronization of memory operations

The compiler can optimize accesses to shared and global memory by keeping intermediate results in registers, and it can reorder memory operations:
• __threadfence() and __threadfence_block() can be used to ensure that the data we store are visible to other threads
• variables declared as volatile are always read/written from/to global or shared memory

Thread-block synchronization

Communication between thread blocks:
• global memory is visible to all blocks
• but there are only weak possibilities to synchronize between blocks
  • in general there is no global barrier (one can be implemented if all blocks are persistent on the GPU)
  • atomic operations can solve some problems
  • a generic global barrier is available only via kernel invocation
• harder to program, but allows better scaling

Global synchronization via atomic operations

An alternative implementation of vector reduction:
• each thread block reduces a subvector
• the last running thread block adds the results of all thread blocks
• an implementation of a weak global barrier: after blocks 1..n-1 finish, block n continues

    __device__ unsigned int count = 0;
    __shared__ bool isLastBlockDone;

    __global__ void sum(const float* array, unsigned int N,
                        float* result)
    {
        float partialSum = calculatePartialSum(array, N);

        if (threadIdx.x == 0) {
            result[blockIdx.x] = partialSum;
            __threadfence();  // make the partial sum visible to all blocks
            unsigned int value = atomicInc(&count, gridDim.x);
            isLastBlockDone = (value == (gridDim.x - 1));
        }
        __syncthreads();

        if (isLastBlockDone) {
            float totalSum = calculateTotalSum(result);
            if (threadIdx.x == 0) {
                result[0] = totalSum;
                count = 0;
            }
        }
    }

(calculatePartialSum and calculateTotalSum stand for device functions that compute the per-block partial sum and the final sum over the partial results.)

Materials

CUDA documentation (part of the CUDA Toolkit, downloadable from developer.nvidia.com)
• CUDA C Programming Guide (CUDA essentials)
• CUDA C Best Practices Guide (more details on optimization)
• CUDA Reference Manual (complete C for CUDA API reference)
• a lot of other useful documents (nvcc manual, documentation of PTX and the assembly, documentation of various accelerated libraries, etc.)

CUDA, Supercomputing for the Masses
• http://www.ddj.com/cpp/207200659

Conclusion

Today, we learned:
• what CUDA is good for
• the basic GPU architecture
• basic C for CUDA programming

In the next lecture, we will focus on:
• how to write efficient GPU code