Using CUDA in C

This example demonstrates:

  1. how to compile a simple CUDA program

  2. how to request GPU resources and run the program

  3. how to monitor the GPU utilization

In this example we will use CUDA to offload a simple vector addition to a GPU, and we will verify that the code is actually executed on the device. We will compile and run the following code on Saga:

#include <cuda.h>
#include <cuda_runtime_api.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

// CUDA kernel, callable from host due to `__global__`
__global__ void add(const float* a, const float* b, float* c, const size_t n) {
    // Calculate the array index of this thread
    const int id = blockIdx.x * blockDim.x + threadIdx.x;
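    // The last block may extend past the end of the arrays, so guard the access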
    if (id < n) {
        c[id] = a[id] + b[id];
    }
}

int main(int argc, char* argv[]) {
    printf("ENTER MAIN\n");
    // Number of elements to compute over
    const size_t num_elements = 1000000;

    // Allocate memory that can be accessed both on host and device
    float* a;
    float* b;
    float* c;
    // Should ideally catch errors here, but skip for brevity
    cudaMallocManaged(&a, num_elements * sizeof(float));
    cudaMallocManaged(&b, num_elements * sizeof(float));
    cudaMallocManaged(&c, num_elements * sizeof(float));

    // Fill our input arrays, on host, with some data to calculate
    for (int i = 0; i < num_elements; i++) {
        a[i] = sinf(i) * sinf(i);
        b[i] = cosf(i) * cosf(i);
    }

    // Define how many threads to launch on CUDA device
    const int block_size = 1024; // Number of threads in each thread block
    // Number of thread blocks in a grid
    const int grid_size = (int) ceil((float) num_elements / block_size);
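    // e.g. 1000000 elements with 1024 threads per block gives ceil(1000000/1024) = 977 blocks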
	
    // Call CUDA kernel to run on device
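    // <<<grid_size, block_size>>> is the launch configuration: blocks in the grid and threads per block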
    add<<<grid_size, block_size>>>(a, b, c, num_elements);
    // Wait for computation before doing anything with data on host
    cudaDeviceSynchronize();

    // Should print 1.0 at all entries
    printf("c[0]  : %f\n", c[0]);
    printf("c[1]  : %f\n", c[1]);
    printf("c[42] : %f\n", c[42]);
	
    // Free memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);

    printf("EXIT SUCCESS\n");
    return EXIT_SUCCESS;
}

vec_add_cuda.cu
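As noted in the code, the CUDA calls should ideally be checked for errors. A common pattern is to wrap each call in a macro; the CUDA_CHECK macro below is our own sketch (not part of the CUDA API), relying only on the headers already included in the file above:

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                      \
        }                                                             \
    } while (0)

// Usage, e.g.:
// CUDA_CHECK(cudaMallocManaged(&a, num_elements * sizeof(float)));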

Note

The purpose of this example is not to understand the details in the code snippet above, but rather to have a working code example that we can compile, run and verify on a GPU.

Step 1: Compiling the code

In order to compile this code we need a CUDA-aware compiler, and on Saga we get this by loading a CUDA module (choosing here the most recent version at the time of writing):

[me@login.SAGA]$ module load CUDA/11.1.1-GCC-10.2.0

After the module is loaded you should have the nvcc CUDA compiler available:

[me@login.SAGA]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

We can now compile the code with the following command (never mind optimization flags etc.; they are not important at this point):

[me@login.SAGA]$ nvcc vec_add_cuda.cu -o vec_add_cuda

This command should hopefully finish without any errors or warnings. We can try to run the resulting executable (which we called vec_add_cuda):

[me@login.SAGA]$ ./vec_add_cuda
ENTER MAIN
Segmentation fault (core dumped)

But it fails, because we are still running on the login node, where no GPU hardware or drivers are available. The next step is thus to request GPU resources for running our program.

Note

In order to run a CUDA program you must have CUDA hardware drivers installed on the machine. On Saga, these are only available on the GPU compute nodes, not on the login nodes. However, the drivers are not necessary for the compilation step (only the CUDA library, which comes with module load CUDA/...), so this can be done on the login node.
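If you would rather have your program fail with a clear message than with a segmentation fault when no GPU is present, a minimal sketch (using the standard cudaGetDeviceCount runtime call; this check is not part of the example above) could look like:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

int main(void) {
    int devices = 0;
    // cudaGetDeviceCount fails if no driver is available;
    // devices == 0 means the driver found no usable GPU
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
        fprintf(stderr, "No CUDA device available\n");
        return EXIT_FAILURE;
    }
    printf("Found %d CUDA device(s)\n", devices);
    return EXIT_SUCCESS;
}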

Step 2: Running the code

We will first test the code in an interactive session, so we ask for a single GPU:

[me@login.SAGA]$ salloc --nodes=1 --gpus=1 --time=0:10:00 --mem=1G --partition=accel --account=<your-account>
salloc: Pending job allocation 4320527
salloc: job 4320527 queued and waiting for resources
salloc: job 4320527 has been allocated resources
salloc: Granted job allocation 4320527
salloc: Waiting for resource configuration
salloc: Nodes c7-8 are ready for job
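Once the allocation is granted, you can verify that you actually have access to a GPU using the nvidia-smi command, for instance by listing the devices visible to the job (the exact output depends on the node and driver version):

[me@c7-8]$ nvidia-smi -L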

Remember to load the CUDA module if it is not already loaded from Step 1. If all goes well, your program should now run and exit successfully:

[me@c7-8]$ ./vec_add_cuda
ENTER MAIN
c[0]  : 1.000000
c[1]  : 1.000000
c[42] : 1.000000
EXIT SUCCESS

We here see the expected output of $c[i] = \sin^2(i) + \cos^2(i) = 1$ for any $i$, which means that the code runs correctly.

Note

For this particular example we have actually now already verified that the code was executed on the GPU. As the code is written, there is no “fallback” implementation that runs on the CPU in case no GPU is found, which means that EXIT SUCCESS == “the code executed on the GPU”.
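For codes where such an implicit argument is not available, a minimal sketch of an explicit check (using standard CUDA runtime calls, replacing the plain launch in the code above) is:

// Check both the launch itself and the subsequent execution
add<<<grid_size, block_size>>>(a, b, c, num_elements);
cudaError_t launch_err = cudaGetLastError();       // errors from the launch
cudaError_t sync_err   = cudaDeviceSynchronize();  // errors during execution
if (launch_err != cudaSuccess || sync_err != cudaSuccess) {
    fprintf(stderr, "Kernel failed: %s\n",
            cudaGetErrorString(launch_err != cudaSuccess ? launch_err : sync_err));
    exit(EXIT_FAILURE);
}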

Step 3: Monitor the GPU utilization

We will now capture some statistics from the execution using the nvidia-smi tool, to verify that we are actually utilizing the GPU's capacity. To get a reasonable reading from this tool we need an application that runs for at least a few seconds, so we will first make the following change to our source code:

    for (int i = 0; i < 100000; i++) {
        // Call CUDA kernel to run on device
        add<<<grid_size, block_size>>>(a, b, c, num_elements);
        // Wait for computation before doing anything with data on host
        cudaDeviceSynchronize();
    }

loop_add_cuda.cu

i.e. we loop over the vector addition 100 000 times. This should hopefully give sufficient run time to be picked up by our tool. We then compile and run our new code with the following job script:

#!/bin/bash
#SBATCH --job-name=CUDA-test
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=1G
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1

## Set up job environment:
set -o errexit  # Exit the script on any error
set -o nounset  # Treat any unset variables as an error

module --quiet purge  # Reset the modules to the system default
module load CUDA/11.1.1-GCC-10.2.0
module list

# Compile our code
nvcc loop_add_cuda.cu -o loop_add_cuda

# Run our computation
./loop_add_cuda

exit 0

run.sh

Submit the job using sbatch (remember to set the --account option, and note that we are back on the login node):

[me@login.SAGA]$ sbatch run.sh
Submitted batch job 4320512

Wait for the job to finish and verify from the slurm-xxxxx.out file that the calculation still finished successfully, and that it ran for at least a few seconds.
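While waiting, you can inspect the state of the job with the standard Slurm tools, using the job ID printed by sbatch:

[me@login.SAGA]$ squeue -u $USER
[me@login.SAGA]$ sacct -j 4320512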

We can then add the following lines to the script in order to monitor the GPU utilization using nvidia-smi:

# Compile our code
nvcc loop_add_cuda.cu -o loop_add_cuda

# Setup monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory \
           --format=csv --loop=1 > "monitor-$SLURM_JOB_ID.csv" &
NVIDIA_MONITOR_PID=$!  # Capture PID of monitoring process

# Run our computation
./loop_add_cuda

# After computation stop monitoring
kill -SIGINT "$NVIDIA_MONITOR_PID"

exit 0

monitor.sh
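The --query-gpu option accepts many more fields than the ones used here; they are listed by nvidia-smi --help-query-gpu. As a sketch, the monitoring line could for instance also record memory usage and power draw:

# Also record memory usage and power draw (same setup as above)
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,power.draw \
           --format=csv --loop=1 > "monitor-$SLURM_JOB_ID.csv" &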

Submit the job:

[me@login.SAGA]$ sbatch monitor.sh
Submitted batch job 4320513

Wait for the job to complete and inspect the monitor-xxxx.csv file we just created:

[me@login.SAGA]$ cat monitor-4320513.csv
timestamp, utilization.gpu [%], utilization.memory [%]
2021/11/03 21:42:44.210, 0 %, 0 %
2021/11/03 21:42:45.211, 82 %, 76 %
2021/11/03 21:42:46.211, 82 %, 69 %
2021/11/03 21:42:47.211, 82 %, 69 %

We see here that the GPU utilization reached 82% of the GPU's capacity.
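If you want a single summary number rather than the raw samples, a small awk sketch (assuming the CSV format shown above; adjust the field handling otherwise) can average the utilization column, skipping the header:

[me@login.SAGA]$ awk -F', ' 'NR > 1 { gsub(/ %/, "", $2); sum += $2; n++ } END { print sum / n " % average GPU utilization" }' monitor-4320513.csv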