About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	ooooooooooo
Introduction, CUDA Basics
Jiří Fi li povič Fall 2015
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
•ooooooo	ooooo	ooooooooo	OOOOO	ooooooooooo
Language				
I will speak English if there is any foreign student in the class
• expect Slovak :-) But I understand that
• my English is not perfect
• your English may or may not be perfect
• so feel free to interrupt me and ask me, if you do not understand
If you do not feel comfortable to ask me in English
• ask me in Czech/Slovak a (but English is preferred)
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
osoooooo	ooooo	ooooooooo	OOOOO	ooooooooooo
About the	class			
The class is focused on algorithm design and programming of general purpose computing applications on vector processors
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
osoooooo	ooooo	ooooooooo	OOOOO	ooooooooooo
About the	class			
The class is focused on algorithm design and programming of general purpose computing applications on vector processors We will focus to CUDA GPU first:
• C for CUDA is good for teaching (easy API, lot of examples available, mature compilers and tools)
• GPUs are wide-spread and powerful
• restricted to NVIDIA GPUs and x86 CPUs
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
o»oooooo ooooo ooooooooo ooooo ooooooooooo
About the class
The class is focused on algorithm design and programming of general purpose computing applications on vector processors We will focus to CUDA GPU first:
• C for CUDA is good for teaching (easy API, lot of examples available, mature compilers and tools)
• GPUs are wide-spread and powerful
• restricted to NVIDIA GPUs and x86 CPUs After learning CUDA, we focus to OpenCL
• programming model very similar to CUDA, easy to learn when you already know CUDA
• can be used also with AMD GPUs, Intel MIC (Xeon Phi), Cell
• we will focus on code optimizations for x86, Intel MIC and AMD GPUs
The class is practically oriented - besides efficient parallelization, we will focus on writing efficient code.
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
00900000	ooooo	ooooooooo	OOOOO	ooooooooooo
What is	offered			
You will learn:
• architecture of NVIDIA and AMD GPUs, Xeon Phi
• architecture-aware design of data-parallel algorithms
• programming in C for CUDA and OpenCL
• performance tuning and profiling
• basic tools and libraries for CUDA GPUs
• use cases
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooaoooo	Motivation                   GPU Architecture ooooo ooooooooo	C for CUDA OOOOO	Sample Code ooooooooooo
What is	expected from you		
During the semester, you will work on a practically oriented project
• important part of your total score in the class
• the same task for everybody, we will compare speed of your implementation
• 50 + 20 points of total score
• working code: 25 points
• efficient implementation: 25 points
• speed of your code relative to your class mates: 20 points (only to improve your final grading)
Exam (oral or written, depending on the number of students)
• 50 points
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooo«ooo	ooooo	OOOOOOOOO	ooooo	OOOOOOOOOOO
Grading				
For those finishing by exam:
9	A:	92-100
a	B:	86-91
o	C:	78-85
0	D:	72-77
0	E:	66-71
O	F:	0-65 pts
For those finishing by colloquium: • 50 pts
fiJ ► < 1 ►     =        * -OQ.O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooaoo	ooooo	ooooooooo	OOOOO	ooooooooooo
Materials -	CUDA			
CUDA documentation (installed as a part of CUDA Toolkit, downloadable from developer.nvidia.com)
• CUDA C Programming Guide (most important properties of CUDA)
• CUDA C Best Practices Guide (more detailed document focusing on optimizations)
• CUDA Reference Manual (complete description of C for CUDA API)
• other useful documents (nvcc guide, PTX language description, library manuals, ...)
CUDA article series, Supercomputing for the Masses
• http: //www.ddj.com/cpp/207200659
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
00000090	ooooo	ooooooooo	OOOOO	ooooooooooo
Materials -	OpenCL			
• OpenCL 1.1 Specification
• AMD Accelerated Parallel Processing Programming Guide
• Intel OpenCL SDK Programming Guide
• Writing Optimal OpenCL Code with Intel OpenCL SDK
Jiří Filipovič        Introduction, CUDA Basics
About The Class 0000000«	Motivation                   GPU Architecture ooooo ooooooooo	C for CUDA ooooo	Sample Code ooooooooooo
Materials -	Parallel Programming		
• Ben-Ari M., Principles of Concurrent and Distributed Programming, 2nd Ed. Addison-Wesley, 2006
• Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill, Patterns for Parallel Programming, Addison-Wesley, 2004
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo «0000 ooooooooo ooooo ooooooooooo
Motivation - GPU arithmetic performance
Theoretical GFLOP/s
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo o«ooo ooooooooo ooooo ooooooooooo
Motivation - GPU memory bandwidth
Theoretical GB/s
GeForce 780 Ti
2003  2004  2005 2006   2007 2008  2009 2010 2011   2012 2013
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                   GPU Architecture                   C for CUDA oo«oo                    ooooooooo ooooo	Sample Code ooooooooooo
Motivation	- programming complexity	
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
• well, it is more difficult comparing to writing serial C/C++ code...
• but can we compare it to serial code?
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo oo«oo ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
• well, it is more difficult comparing to writing serial C/C++ code...
• but can we compare it to serial code?
Moore's Law
Number of transistors on a single chip doubles every 18 months
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo oo«oo ooooooooo ooooo ooooooooooo
Motivation - programming complexity
OK, GPUs are more powerful, but GPU programming is substantially more difficult, right?
• well, it is more difficult comparing to writing serial C/C++ code...
• but can we compare it to serial code?
Moore's Law
Number of transistors on a single chip doubles every 18 months
Corresponding growth of performance comes from
• in the past: frequency increase, parallelism of instructions, of-of-order instruction processing, caches, etc.
• today: vector instructions, increase in number of cores
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation 00090	GPU Architecture ooooooooo	C for CUDA ooooo	Sample Code ooooooooooo
Motivation	- paradigm	change		
Moore's Law consequences:
• in the past:changes were important for compiler developers; application developers didn't need to worry
• today: in order to utilize state-of-the-art processors, it is necessary to write parallel and vectorized code
• it is necessary to find parallelism in the problem being solved, which is a task for a programmer, not for a compiler (at least for now)
• writing efficient code for modern CPUs is similarly difficult as writing for GPUs
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                   GPU Architecture oooo» ooooooooo	C for CUDA OOOOO	Sample Code OOOOOOOOOOO
Motivation	- Applications		
Use of GPU for general computations is a dynamically developing field with broad applicability
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                   GPU Architecture oooo» ooooooooo	C for CUDA ooooo	Sample Code ooooooooooo
Motivation	- Applications		
Use of GPU for general computations is a dynamically developing field with broad applicability
• high-performance scientific calculations
• computational chemistry
• physical simulations
• image processing
• medicine
• and others. ..
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                   GPU Architecture oooo» ooooooooo	C for CUDA OOOOO	Sample Code ooooooooooo
Motivation	- Applications		
Use of GPU for general computations is a dynamically developing field with broad applicability
• high-performance scientific calculations
• computational chemistry
• physical simulations
• image processing
• medicine
• and others. ..
• performance-hungry home and desktop applications
• encoding/decoding of multimedia data
• game physics
• image editing, 3D rendering
• etc.
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                   GPU Architecture OOOOO »00000000	C for CUDA ooooo	Sample Code ooooooooooo
Why are	GPUs so powerful?		
Types of Parallelism
• Task parallelism
• decomposition of a task into the problems that may be processed in parallel
• usually more complex tasks performing different actions
• usually more frequent (and complex) synchronization
• ideal for small number of high-performance processors
• Data parallelism
• parallelism on the level of data structures
• usually the same operations on many items of a data structure
• finer-grained parallelism allows for simple construction of individual processors
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation                   GPU Architecture ooooo o»ooooooo	C for CUDA OOOOO	Sample Code OOOOOOOOOOO
Why are	GPUs so powerful?		
From programmer's perspective
• some problems are rather data-parallel, some task-parallel (graph traversal vs. edge processing)
From hardware perspective
• processors for data-parallel tasks may be simpler
• it is possible to achieve higher arithmetic performance with the same size of processor
• simpler memory access patterns allow for high-throughput memory designs
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation oooooooo ooooo	GPU Architecture ooaoooooo	C for CUDA OOOOO	Sample Code ooooooooooo
GPU Architecture			
Control ALU ALU
ALU ALU
CPU GPU
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation oooooooo ooooo	GPU Architecture 000*00000	C for CUDA ooooo	Sample Code ooooooooooo
GPU Architecture			
Main differences compared to CPU
• high parallelism: tens thousands threads needed to utilize high-end GPUs
• SIMT model: subsets of threads runs in lock-step mode
• distributed on-chip memory: subsets of threads shares their private memory
• restricted caching capabilities: small cache, often read-only Algorithms usually need to be redesigned to be efficient on GPU.
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo oooo«oooo ooooo ooooooooooo
GPU Architecture
Within the system:
• co-processor with dedicated memory (discrete GPU)
• asynchronous processing of instructions
• attached using PCI-E to the rest of the system (discrete GPU)
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooo^ooo ooooo ooooooooooo
CUDA
CUDA (Compute Unified Device Architecture)
• architecture for parallel computations developed by NVIDIA
• provides a new programming model, allows efficient implementation of general GPU computations
• may be used in multiple programming languages
c	OpenCL	Fortran	C++	DX11 Compute	■
					
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo oooooo«oo ooooo ooooooooooo
G80 Processor
G80
• first CUDA processor
• 16 multiprocessors
• each multiprocessor
• 8 scalar processors
• 2 units for special functions
• up to 768 threads
• HW for thread switching and scheduling
• threads are grouped into warps by 32
• SIMT
• native synchronization within the multiprocessor
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooo»o ooooo ooooooooooo
G80 Memory Model
Memory model
• 8192 registers shared among all threads of a multiprocessor
• 16 kB of shared memory
• local within the multiprocessor
• as fast as registry (under certain constraints)
• constant memory
• cached, read-only
• texture memory
• cached with 2D locality, read-only
• global memory
• non cached, read-write
• data transfers between global memory and system memory through PCI-E
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo oooooooo* ooooo ooooooooooo
G80 Processor
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation oooooooo ooooo	GPU Architecture ooooooooo	C for CUDA •oooo	Sample Code ooooooooooo
C for CUDA			
C for CUDA is extension of C for parallel computations
• explicit separation of host (CPU) and device (GPU) code
• thread hierarchy
• memory hierarchy
• synchronization mechanisms
• API
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation oooooooo ooooo	GPU Architecture ooooooooo	C for CUDA o«ooo	Sample Code ooooooooooo
Thread Hierarchy			
Thread hierarchy
• threads are organized into blocks
• blocks form a grid
• problem is decomposed into sub-problems that can be run independently in parallel (blocks)
• individual sub-problems are divided into small pieces that can be run cooperatively in parallel (threads)
• scales well
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooooo oo«oo ooooooooooo
Thread Hierarchy
Block (0,0)	Block (1,0)	Block (2,0)
ffflll	Jflfff	ffffff
	Block (1,1)	-Block (2,1)
Block (1,1)
Thread (0, 0)	Thread (1, 0) i	Thread (2, 0)	Thread (3, 0)
Thread (0,1)	Thread (1,1)	Thread (2,1)	Thread (3,1)
Thread (0, 2) I	Thread (1, 2) i	Thread (2, 2)	Thread (3, 2)
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	ooo«o	ooooooooooo
Memory	Hierarchy			
More memory types:
• different visibility
• different lifetime
• different speed and behavior
• brings good scalability
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	oooo»	ooooooooooo
Memory	Hierarchy			
Per-thread local memory
Per-block shared
Global memory
Block (0,1) Block (1,1)
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code •oooooooooo
An Example	- Sum	of Vectors		
We want to sum vectors a and b and store the result in vector c
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA ooooo	Sample Code •oooooooooo
An Example	- Sum	of Vectors		
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem.
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code •oooooooooo
An Example	- Sum	of Vectors		
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i — 0;   i < N; i++) c[i]  = a[i] + b[i];
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code •oooooooooo
An Example	- Sum	of Vectors		
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i — 0;   i < N; i++) c[i]  = a[i] + b[i];
Individual iterations are independent - it is possible to parallelize, scales with the size of the vector.
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code •oooooooooo
An Example	- Sum	of Vectors		
We want to sum vectors a and b and store the result in vector c We need to find parallelism in the problem. Serial sum of vectors:
for   (int  i — 0;   i < N; i++) c[i]  = a[i] + b[i];
Individual iterations are independent - it is possible to parallelize,
scales with the size of the vector.
i-th thread sums i-th component of the vector:
c[i]  = a[i]  + b[i];
How do we find which thread we are?
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooooo ooooo o«ooooooooo
Thread Hierarchy
Block (0,0)	Block (1,0)	Block (2,0)
ffflll	Jflfff	ffffff
	Block (1,1)	-Block (2,1)
Block (1,1)
Thread (0, 0)	Thread (1, 0) i	Thread (2, 0)	Thread (3, 0)
Thread (0,1)	Thread (1,1)	Thread (2,1)	Thread (3,1)
Thread (0, 2) I	Thread (1, 2) i	Thread (2, 2)	Thread (3, 2)
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooooo ooooo oo»oooooooo
Thread and Block Identification
C for CUDA has built-in variables:
• threadldx.jx, y, z} tells position of a thread in a block
• blockDim. (x, y, z} tells size of the block
• blockldx. (x, y, z} tells position of the block in grid (z always equals 1)
• gridDim.jx, y, z} tells grid size (z always equals 1)
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code 00090000000
An Example	- Sum	of Vectors		
Thus we calculate the position of the thread (grid and block are one-dimensional):
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code 00090000000
An Example	- Sum	of Vectors		
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooooo ooooo ooo»ooooooo
An Example - Sum of Vectors
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Whole function for parallel summation of vectors:
__global__  void  addvec(float  *a,   float  *b,   float *c){ int   i = blockldx.x*blockDim.x + threadldx.x; c[i]  = a[i] + b[i];
}
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code 00090000000
An Example	- Sum	of Vectors		
Thus we calculate the position of the thread (grid and block are one-dimensional):
int   i = blockldx.x*blockDim.x + threadldx.x;
Whole function for parallel summation of vectors:
__global__  void  addvec(float  *a,   float  *b,   float *c){ int   i = blockldx.x*blockDim.x + threadldx.x; c[i]  = a[i] + b[i];
}
The function defines so called kernel; we specify how meny threads and what structure will be run when calling.
Jiří Filipovič        Introduction, CUDA Basics
About The Class                   Motivation                   GPU Architecture oooooooo                      ooooo ooooooooo	C for CUDA OOOOO	Sample Code 00009000000
Function Type Quantifiers		
C syntax enhanced by quantifiers defining where the code is run and from where it may be called:
• __device__ function is run on device (GPU) only and may be called from the device code only
• __global__ function is run on device (GPU) only and may be called from the host (CPU) code only
• __host__ function is run on host only and may be called from the host only
• __host__ and __device__ may be combined - function is compiled for both then
-E -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	00000900000
The following steps are needed for the full computation:
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	00000900000
The following steps are needed for the full computation: • allocate memory for vectors and fill it with data
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	00000900000
The following steps are needed for the full computation:
• allocate memory for vectors and fill it with data
• allocate memory on GPU
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	00000900000
The following steps are needed for the full computation:
• allocate memory for vectors and fill it with data
• allocate memory on GPU
• copy vectors a a b to GPU
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	00000900000
The following steps are needed for the full computation:
• allocate memory for vectors and fill it with data
• allocate memory on GPU
• copy vectors a a b to GPU
• compute the sum on GPU
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	00000900000
The following steps are needed for the full computation:
• allocate memory for vectors and fill it with data
• allocate memory on GPU
• copy vectors a a b to GPU
• compute the sum on GPU
• store the result from GPU into c
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class	Motivation	GPU Architecture	C for CUDA	Sample Code
oooooooo	ooooo	ooooooooo	OOOOO	00000900000
The following steps are needed for the full computation:
• allocate memory for vectors and fill it with data
• allocate memory on GPU
• copy vectors a a b to GPU
• compute the sum on GPU
• store the result from GPU into c
• use the result in c :-)
When managed memory is used (requires GPU with computing capability 3.0 and CUDA 6.0 or better), steps written in italics are not required.
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooooo ooooo oooooo»oooo
An Example - Sum of Vectors
CPU code that fills a and b and computes c
#include <stdio.h> Sdefine  N 64 int   main(){
float  a[N] ,   b[N] ,   c [ N ] ; for   (int  i = 0;   i < N;   i++) { a[i]  = i; b [ i ]  = i *3;
}
// GPU code   will   be here
for   (int  i = 0;   i < N; i++)
printf("%f ,   " ,   c[i]); return 0;
}
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooooo ooooo ooooooo»ooo
GPU Memory Management
Using managed memory, CUDA maintains memory transfers between CPU and GPU automatically.
• memory coherency is guaranteed
• GPU memory cannot be used when any GPU kernel is running Memory operations can be programmed explicitly
cudaMalloc(void**   devPtr,   size_t count); cudaFree(void*   devPtr ) ;
cudaMemcpy(void*  dst,   const   void*   src,   size_t count, enum   cudaMemcpyKind kind);
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code oooooooo»oo
An Example	- Sum	of Vectors		
Running the kernel:
• kernel is called as a function; between the name and the arguments, there are triple angle brackets with specification of grid and block size
• we need to know block size and their count
• we will use ID block and grid with fixed block size
• the size of the grid is determined in a way to compute the whole problem of vector sum
For vector size divisible by 32:
Sdefine BLOCK 32
addvec«<N/BLOCK ,   BL0CK»>(d_a ,   d_b ,   d_c ) ;
How to solve a general vector size?
Jiří Filipovič        Introduction, CUDA Basics
About The Class oooooooo	Motivation ooooo	GPU Architecture ooooooooo	C for CUDA OOOOO	Sample Code ooooooooo»
An Example	- Sum	of Vectors		
We will modify the kernel source:
__global__   void   addvec(float   *a,   float   *b,   float   *c,   int n){ int   i = blockldx.x*blockDim.x + threadldx.x; if   (i < n)  c[i]  = a[i] + b[i];
}
And call the kernel with sufficient number of threads:
addvec«<N/BLOCK +  1,   BL0CK»>(d_a ,   d_b ,   d_c , N);
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics
About The Class Motivation GPU Architecture C for CUDA Sample Code
oooooooo ooooo ooooooooo ooooo oooooooooo*
An Example - Running It
Now we just need to compile it :-)
nvcc -o vecadd vecadd.cu
Where to work with CUDA?
• on a remote computer: airacuda.fi.muni.cz, accounts will be made
• your own machine: download and install CUDA toolkit and SDK from developer.nvidia.com
• source code used in lectures will be published as a part of course materials
-č -O^O
Jiří Filipovič        Introduction, CUDA Basics