#### CS-E400201 - Special Course in Computer Science D: Modern High-performance Computing Tools

#### Hybrid computing using GPUs

Maarit Käpylä <u>maarit.kapyla@aalto.fi</u>







The two trajectories resulting from the power wall

# Multicore processors (core==CPU)

# Multi-thread processors (e.g. processors with GPUs)



# **Schematic comparison**



#### Leading idea of large-scale computation

Execute sequential parts on CPUs and parallel parts faster on GPUs; Communications between GPUs using MPI



## Schematic model of a GPU



# **Memory hierarchy**

- Memory transfers between host and device global memory have the highest latency (as bad as 100x the smem); to be minimized
- Access to **shared memory** and **registers** have much lower latency
  - Registers are seen by single threads
  - Shared memory is for fast communication between threads in a block
- The sizes of the shared memory and registers are very limited.



Chapter 4 of Programming parallel computers teaches you how to make efficient code by optimizing the memory usage; please read through

**2** Aalto University School of Science https://ppc.cs.aalto.fi/ch4/v1/, .../v2 and .../v3

# The GPUs in Triton

Δ?

| Card          | total<br>amount   | nodes                      | arch | nitecture | compute<br>threads<br>per GPU | memory<br>per card                         | CUDA<br>compute<br>capability | Slurm<br>feature<br>name | Slurm gres<br>name  |  |
|---------------|-------------------|----------------------------|------|-----------|-------------------------------|--------------------------------------------|-------------------------------|--------------------------|---------------------|--|
| Tesla<br>K80* | 12                | gpu[20-<br>22]             | Кер  | oler      | 2x2496                        | 2x12GB                                     | 3.7                           | kepler                   | teslak80            |  |
| Tesla<br>P100 | 20                | gpu[23-<br>27]             | Pas  | cal       | 3854                          | 16GB                                       | 6.0                           | pascal                   | teslap100           |  |
| Tesla<br>V100 | 40                | gpu[1-<br>10]              | Vol  | ta        | 5120                          | 32GB                                       | 7.0                           | volta                    | v100                |  |
| Tesla<br>V100 | 40                | gpu[28-<br>37]             |      | ta        | 5120                          | 32GB                                       | 7.0                           | volta                    | v100                |  |
| Tesla<br>V100 | 16                | dgx[1-<br>7]               |      | ta        | 5120                          | 16GB                                       | 7.0                           | volta                    | v100                |  |
| Tesla<br>A100 | 28 gpu[11-<br>17] |                            | Am   | pere      | 7936                          | 80GB                                       | 8.0                           |                          | a100                |  |
| gpuam<br>Az   | d1 1              | Dell<br>PowerEdge<br>R7525 |      | 2021      | rome avx<br>avx2<br>mi100     | 2x8 core<br>AMD<br>EPYC<br>7262<br>@3.2GHz | 250GB<br>DDR4-3200            | EDR                      | 3x<br>MI100<br>32GB |  |

# The GPUs in Triton

Α?

| Card              | total<br>amount | nodes                  | architecture |      | compute<br>threads<br>per GPU | memory<br>per card                         | CUDA (*)<br>compute<br>capability | Slurm<br>feature<br>name | Slurm gres<br>name  |  |
|-------------------|-----------------|------------------------|--------------|------|-------------------------------|--------------------------------------------|-----------------------------------|--------------------------|---------------------|--|
| Tesla<br>K80*     | 12              | gpu[20-<br>22]         | Kepl         | er   | 2x2496                        | 2x12GB                                     | 3.7                               | kepler                   | teslak80            |  |
| Tesla<br>P100     | 20              | gpu[23-<br>27]         | Pascal       |      | 3854                          | 16GB                                       | 6.0                               | pascal                   | teslap100           |  |
| Tesla<br>V100     | 40              | gpu[1-<br>10]          |              | à    | 5120                          | 32GB                                       | 7.0                               | volta                    | v100                |  |
| Tesla<br>V100     | 40              | gpu[28-<br>37]         | Volta        |      | 5120                          | 32GB                                       | 7.0                               | volta                    | v100                |  |
| Tesla<br>V100     | 16              | dgx[1-<br>7]           | Volta        |      | 5120                          | 16GB                                       | 7.0 volta                         |                          | v100                |  |
| Tesla<br>A100     | 28              | 28 gpu[11-<br>17]      |              | ere  | 7936                          | 80GB                                       | 8.0                               |                          | a100                |  |
| gpuam<br>Aa<br>Sc | id1 1           | Dell<br>Power<br>R7525 | Edge         | 2021 | rome avx<br>avx2<br>mi100     | 2x8 core<br>AMD<br>EPYC<br>7262<br>@3.2GHz | 250GB<br>DDR4-3200                | EDR                      | 3x<br>MI100<br>32GB |  |

(\*) https://en.wikipedia.org/wiki/CUDA#Version\_features\_and\_specifications

## Kepler architecture



Aalto Uni School of

# Kepler SM

- Each SM has its own control units, registers, execution pipelines, caches
- Many cores per SM; how many is architecture dependent
- Special-function units (cos/sin/tan, etc.)
- Shared memory/L1 cache
- Thousands of 32-bit registers
- Double precision units with architecture variable ratio; in Kepler 3:1, nowadays more DPUs.

| SMX               | MX                                                                               |      |                   |                |      |      |                   |       |                |      |                   |      |                |      |      |      |         |       |     |
|-------------------|----------------------------------------------------------------------------------|------|-------------------|----------------|------|------|-------------------|-------|----------------|------|-------------------|------|----------------|------|------|------|---------|-------|-----|
| Instruction Cache |                                                                                  |      |                   |                |      |      |                   |       |                |      |                   |      |                |      |      |      |         |       |     |
| Warp Scheduler    |                                                                                  |      |                   | Warp Scheduler |      |      |                   |       | Warp Scheduler |      |                   |      | Warp Scheduler |      |      |      |         |       |     |
| Dispatch Dispatch |                                                                                  |      | Dispatch Dispatch |                |      |      | Dispatch Dispatch |       |                |      | Dispatch Dispatch |      |                |      | ch   |      |         |       |     |
|                   |                                                                                  |      |                   |                |      |      |                   |       |                |      |                   |      |                |      |      |      |         |       |     |
|                   | Register File (65,536 x 32-bit GK110)   (131,072 x32-bit GK210)                  |      |                   |                |      |      |                   |       |                |      |                   |      |                |      |      |      |         |       |     |
| +                 | +                                                                                | +    | +                 | +              | +    | +    | +                 | +     | +              | +    | +                 | +    | +              | +    | +    | +    | +       | +     | +   |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            |       | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core              | Core                                                                             | Core | DP Unit           | Core           | Core | Core | DP Uni            | LD/ST | SFU            | Core | Core              | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
|                   | (64 KB Shared Memory / L1 Cache GK110)   (128 KB Shared Memory / L1 Cache GK210) |      |                   |                |      |      |                   |       |                |      |                   |      |                |      |      |      |         |       |     |
|                   | 48 KB Read-Only Data Cache                                                       |      |                   |                |      |      |                   |       |                |      |                   |      |                |      |      |      |         |       |     |
|                   | Tex                                                                              |      | Tex               |                |      | Tex  |                   | Tex   |                | Tex  |                   |      | Tex            |      | Tex  |      |         | Tex   |     |
|                   | Tex                                                                              |      | Tex               |                |      | Tex  |                   | Te>   | (              |      | Tex               |      | Tex            | (    |      | Tex  |         | Tex   |     |



## **Pascal architecture**



# **Programming models**

- CUDA (NVIDIA)
- Radeon Open Compute (ROCm) (AMD)
- HIP
- OpenCL
- OpenACC OpenMP
- ...

For openCL examples, please refer to

https://ppc.cs.aalto.fi/ch4/v0opencl/



# **CUDA Execution model**

- Main program is executed by the CPU
- CPU needs to communicate with the GPU (Part I)
  - Upload the data to the GPU memory
  - Upload program to the GPU
- Wait (or do something useful) for the GPU to finish computations (Part 2)
- Fetch the results back from the GPU memory (Part 3)

hool of Science



Memory transfers between host and device can be the bottleneck

## **CUDA programming model**

#### Block

- threads that run on the same streaming multiprocessor (SM) form blocks;
- they communicate with each other through shared memory located on the SM;

#### Grid

• Blocks are grouped into a grid; both threads and blocks have a unique identification number

#### Kernel

- Is a function that gets executed in parallel on each thread;
- Are executed as a grid of thread blocks

How to Identify who is who and operating on which part of the data?





## **CUDA programming model**

#### Block

Phone number

- threads that run on the same streaming multiprocessor (SM) form blocks;
- they communicate with each other through shared memory located on the SM;
- Grid Phonebook
- Blocks are grouped into a grid; both threads and blocks have a unique identification number
   Kernel Call the number
- Is a function that gets executed in parallel on each thread;



Area code

#### Grid

### **CUDA concept of warps**

- Thread blocks are divided into warps; can be implementation dependent. In NVIDIA GPUs warps have 32 threads, in AMD's 64.
- Warps are physically executed in parallel on the SMs in "SIMD"-like manner.



https://ppc.cs.aalto.fi/ch4/v1/



# **Programming model in practise**

Let us illustrate the difference of a normal C program and a CUDA one by adding together to numbers

C program (full)

```
// Compute vector sum h_C =
h_A+h_B
void vecAdd(float* h_A, float*
h_B, float* h_C, int n)
{
for (int i = 0; i < n; i++)
    h_C[i] = h_A[i] + h_B[i];
}</pre>
```

```
int main() {
   // Memory allocation for h_A,
   h_B, and h_C // I/O to read h_A
   and h_B, N elements each ...
   vecAdd(h_A, h_B, h_C, N);
```



Cuda (host code)

```
// Compute vector sum h C = h A+h B
void vecAdd(float* h A, float* h B, float* h C, int
n) // "h "refers to host
int size = n * sizeof(float);
float *d A, *d B, *d C; //Pointers to device mem, hence start
with "d"
cudaMalloc((void **) &d A, size); // Allocating device mem
cudaMemcpy(d A,h A,size,cudaMemcpyHostToDevice);
IICopying data over to device mem
cudaMalloc((void **) &d B, size); // Same stuff for B
cudaMemcpy(d B,h B,size,cudaMemcpyHostToDevice);
cudaMalloc((void **) &d C, size); // Allocation C that'll hold the
result
vecAddKernel<<<256,256>>>(d A,d B,d C, n);
cudaMemcpy(h C,d C,size,cudaMemcpyDeviceToHost);
//Copying result to host
cudaFree(d A); cudaFree(d B); cudaFree(d C);
```

# Prgramming model in practise

CUDA (device code)

**Kernel function** 

- // Compute vector sum C = A+B
- **// Each thread performs one pair-wise addition**

```
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
```

int i = blockDim.x\*blockIdx.x + threadIdx.x;

if(i<n) C[i] = A[i] + B[i];

blockDim.x=dimensi on of the blocks requested blockIdx.x=Block ID amongst all blocks reserved threadIdx.x=Unique identified of the thread in a block

Allocation of global device memory

cudaMalloc((void\*\*) &DevPtr, size\_t size)

- Address of a pointer to the allocated object in device memory
- ° Size of allocated object in terms of bytes

cudaFree(DevPtr) Frees object from device global memory

° Pointer to freed object



Transferring data to/from device global memory

cudaMemcpy(void\* dst, const void\* src, size\_t count, cudaMemcpyKind

kind)

- o Pointer to destination
- o Pointer to source
- o Number of bytes copied
- o Type/Direction of transfer:

cudaMemcpyHostToDevice cudaMemcpyDeviceToHost





1. exec. config param.: Number of blocks

Aalto University School of Science 2. exec. config param.: Number of threads

Again works like a phonenumber: areacode-number

#### **Construction of the kernel function**

Cuda (device code)

Kernel function The order of execution is random

- // Compute vector sum C = A+B
- // Each thread performs one pair-wise addition

#### 

blockDim.x=dimensi on of the blocks requested blockIdx.x=Block ID amongst all blocks reserved threadIdx.x=Unique identified of the thread in a block

- int i = blockDim.x\*blockIdx.x + threadIdx.x;
- if(i<n) C[i] = A[i] + B[i];

Two *built-in variables* that enable threads to identify themselves amongst others and know their own data area.

With the ceil function we might have reserved extra threads, hence now we need to prevent their execution with this if



### **CUDA C keywords for function declaration.**

|                           | Executed on the: | Only callable from the: |
|---------------------------|------------------|-------------------------|
| device float DeviceFunc() | device           | device                  |
| global void KernelFunc()  | device           | host                    |
| host float HostFunc()     | host             | host                    |



### vecAdd is a bad candidate for a CUDA code

- You do a lot of data transfers between the host and device
- Very little computations
- Your CUDA code will be performing worse than a sequential code; there should always be more to compute than communicate to make a reasonable application on GPUs. Remember the ACC model!
- You are REALLY encouraged try this out.

GPU/vecAdd\_CPU.c GPU/vecAdd\_GPU.cu



#### Generalization to multidimensional grids

The autopsied example case was dealing with one-dimensional thread blocks. Generally, however, the exec. config params

KernelFunction<<<dimGrid, dimBlock>>>(...);

dimGrid and dimBlock are dim3 type, which is a C struct with three unsigned integer fields: *x*, *y*, and *z* specifying the sizes of the three dimensions. Less than three dimensions are chosen by setting the size of the unused dimensions to 1.

dim3 dimGrid(2, 2, 1); dim3 dimBlock(4, 2, 2);



#### Generalisation to multidimensional grids

dim3 dimGrid(2, 2, 1); device host dim3 dimBlock(4, 2, 2); Grid 1 Block Block KernelFunction<<<dimGrid, dimBlock>>>(...); Kernel 1 (0, 0)(0, 1) Launch of a kernel makes the Block Block following structures available (1, 0)(1, 1)blockDim.x, blockDim.y, blockDim.z threadIdx.x, threadIdx.y, threadIdx.z Grid 2 / blockldx.x, blockldx.y, blockldx.z Block (1,1) (1,0,0) (1,0,1) (1,0,2) (1,0,3) which tell the placement of the thread Kernel 2 in the hierarchy. Thread Thread Thread Thread (0,0,0)(0,0,1) (0,0,2) (0,0,3)Thread Thread Thread Thread (0,1,0) (0,1,1) (0,1,2)(0.1.3) Aalto University School of Science

Adapted from [1]

#### Brief intro to shared mem programming model

#### Static shared memory device code

```
__global___
void staticReverse(int *d, int n) {
    __shared___ int s[64];
    int t = threadIdx.x;
    int tr = n-t-1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
    }
```



Dynamic shared memory device code

#### Calling this kernel from the host:

dynamicReverse<<<1, n, n\*sizeof(int)>>>(d\_d, n);

Third execution configuration parameter allocating the shared memory

cudaMemcpy(d\_a, a, numBytes, cudaMemcpyHostToDevice); increment<<<1,N>>>(d\_a); cudaMemcpy(a, d\_a, numBytes, cudaMemcpyDeviceToHost);

- Kernel calls are asynchronous; after the kernel is launched, the code returns to the host
- CUDA calls are blocking or synchronous, such as cudaMemcpy
- All device operations run in a stream; if no stream is specified, the default (or "null") stream is used.



cudaMemcpy(d\_a, a, numBytes, cudaMemcpyHostToDevice); increment<<<1,N>>>(d\_a); DoSmtghOnHost(); cudaMemcpy(a, d\_a, numBytes, cudaMemcpyDeviceToHost);

- Overlapping host and device tasks is trivial due to the asynchronous nature of the kernel calls.
- How to make CUDA calls concurrently, f. ex. the computation and data transfers in the above example, requires further techniques with the concept of streams.



cudaStream\_t stream1; cudaError\_t result; result = cudaStreamCreate(&stream1); result = cudaStreamDestroy(stream1);

#### Non-default streams in CUDA are

- **Declared** (1<sup>st</sup> line),
- Created (3<sup>rd</sup> line), and
- **Destroyed** (4<sup>th</sup> line)

in host code as above.



 Async data transfers can be accomplished by CUDA functions such as cudaMemcpyAsync, cudaMemcpy2DAsync(), and

cudaMemcpy3DAsync(), where the 5<sup>th</sup> argument is the stream

identifier.

increment<<<1,N,0,streamX>>>(d\_a)

 Kernel calls to be executed on non-default stream will have to specify the stream identifier as the 4<sup>th</sup> argument. The third argument is to declare the allocation of shared memory, here none is requested, hence 0.



- 1. cudaDeviceSynchronize();
- 2. cudaStreamSynchronize(stream);
- 3. cudaEventSynchronize(event) (ADVANCED)
- Since all operations in non-default streams are non-blocking with respect to the host code, you need to synchronize the host code with stream operations.
- Ways relevant to us:
  - 1. the host code is blocked until **all previously issued operations on the device** have completed
  - 2. The host thread is blocked **until all previously issued operations in the specified stream** have completed



## How to run on multiple GPUs?

- Nowadays commonly possible, also in Triton.
- You ask for multiple GPUs using --gres=gpu:N, where N stands for number of requested GPUs. Use N>1 to reserve more than one GPU. See example codes and scripts in code git repo GPU/X. Here, for short:

#### srun -p courses -A courses --gres=gpu:teslap100:1 ./exec1 srun -p courses -A courses --gres=gpu:teslap100:4 ./exec2



How to setup a code for multiple GPUs and MPI? GPU/sheet6/src/main.cu, reduce-multi.cu and reducempi.cu

## How to run on multiple GPUs?

Two general cases:

- GPUs within a single network node: data transfers through peer-to-peer or shared host memory
  - peer-to-peer: cudaDeviceEnablePeerAccess(...), cudaDeviceCanAccessPeer(...), cudaMemcpyPeerAsync(...)
     [advanced, not needed to solve Sheet 6].
  - Host launches streams on different devices and collects the results.
- GPUs across network nodes
  - Communication through CUDA-aware MPI



## How to run on multiple GPUs?

cudaError\_t cudaGetDeviceCount(int\* count)

Returns the number of devices

#### cudaError\_t cudaSetDevice(int device)

Device on which the active host thread should execute

the device code.

#### cudaError\_t cudaGetDevice(int\* device)

Returns the device on which the active host thread executes the device code.



## **CUDA-aware MPI**

## The most likely case, as MPI tends to be SOOO complicated

//MPI rank 0 cudaMemcpy(s buf h,s buf d,size, cudaMemcpyDeviceToHost); MPI Send(s buf h, size, MPI CHAR, 1, 100, MPI COMM WORLD); //MPI rank 1 MPI Recv(r buf h,size,MPI CHAR,0, 100, MPI COMM WORLD, &status); cudaMemcpy(r\_buf\_d,r\_buf\_h,size, cudaMemcpyHostToDevice);

## Or can we perhaps do this, and life becomes wonderful?

#### //MPI rank 0

MPI\_Send(s\_buf\_d,size,MPI\_CHAR,1,100, MPI\_COMM\_WORLD);

#### //MPI rank 1

MPI\_Recv(r\_buf\_d,size,MPI\_CHAR,0,100, MPI\_COMM\_WORLD, &status);

#### Yes, this is how it works!!!!!!

Thanks to Unified Virtual Addressing (UVA) feature in CUDA; read more from [5]

Aalto University School of Science

srun -p courses -A courses --gres=gpu:teslap100:4 -n 4 -N 1 ./exec\_comp\_with\_MPI

# **Useful reading**

[1] David Kirk & Wen-Mei Whu: "Programming massively parallel processors", third edition, 2017, Morgan Kaufmann, Cambridge, USA

[2] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

[3] https://images.nvidia.com/content/pdf/tesla/whitepaper/pascalarchitecture-whitepaper.pdf

[4] https://ppc.cs.aalto.fi/ch4/v1/, .../v2 and .../v3

[5] https://developer.nvidia.com/blog/introductioncuda-aware-mpi/

