Using CUDA in C
This example demonstrates:
how to compile a simple CUDA program
how to request GPU resources and run the program
how to monitor the GPU utilization
In this example we will use CUDA to offload a simple vector addition to a GPU, and we will try to verify that the code is actually executed on the device. We will compile and run the following code on Saga:
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

// CUDA kernel, callable from host due to `__global__`
__global__ void add(const float* a, const float* b, float* c, const size_t n) {
    // Calculate the array index of this thread
    const int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        c[id] = a[id] + b[id];
    }
}

int main(int argc, char* argv[]) {
    printf("ENTER MAIN\n");

    // Number of elements to compute over
    const size_t num_elements = 1000000;

    // Allocate memory that can be accessed both on host and device
    float* a;
    float* b;
    float* c;
    // Should ideally catch errors here, but skip for brevity
    cudaMallocManaged(&a, num_elements * sizeof(float));
    cudaMallocManaged(&b, num_elements * sizeof(float));
    cudaMallocManaged(&c, num_elements * sizeof(float));

    // Fill our input arrays, on host, with some data to calculate
    for (int i = 0; i < num_elements; i++) {
        a[i] = sinf(i) * sinf(i);
        b[i] = cosf(i) * cosf(i);
    }

    // Define how many threads to launch on CUDA device
    const int block_size = 1024; // Number of threads in each thread block
    // Number of thread blocks in a grid
    const int grid_size = (int) ceil((float) num_elements / block_size);
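    // (Worked example, not in the original code: with num_elements = 1000000 and
    //  block_size = 1024 this gives grid_size = ceil(976.5625) = 977 thread blocks)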

    // Call CUDA kernel to run on device
    add<<<grid_size, block_size>>>(a, b, c, num_elements);

    // Wait for computation before doing anything with data on host
    cudaDeviceSynchronize();

    // Should print 1.0 at all entries
    printf("c[0] : %f\n", c[0]);
    printf("c[1] : %f\n", c[1]);
    printf("c[42] : %f\n", c[42]);

    // Free memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);

    printf("EXIT SUCCESS\n");
    return EXIT_SUCCESS;
}
Note
The purpose of this example is not to understand the details in the code snippet above, but rather to have a working code example that we can compile, run and verify on a GPU.
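As an aside: the cudaMallocManaged calls in the code above ignore the returned error codes for brevity. If you do want to catch such errors, a minimal sketch could look like the following (the CHECK macro is our own naming for illustration, not part of CUDA):
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

// Abort with a readable message if a CUDA runtime call returns an error
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage in the example above would then be, e.g.:
//   CHECK(cudaMallocManaged(&a, num_elements * sizeof(float)));
//   CHECK(cudaDeviceSynchronize());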
Step 1: Compiling the code
In order to compile this code we need a CUDA-aware compiler, and on Saga we get this
by loading a CUDA
module (choosing here the most recent version at the time of writing):
[me@login.SAGA]$ module load CUDA/11.1.1-GCC-10.2.0
After the module is loaded you should have the nvcc
CUDA compiler available:
[me@login.SAGA]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
We can now compile the code with the following command (never mind optimization flags etc.; they are not important at this point):
[me@login.SAGA]$ nvcc vec_add_cuda.cu -o vec_add_cuda
This command should hopefully finish without any error/warning. We can try to run the resulting executable (which we called vec_add_cuda):
[me@login.SAGA]$ ./vec_add_cuda
ENTER MAIN
Segmentation fault (core dumped)
But it will fail because we are still running on the login node, where no GPU hardware or drivers are available. The next step is thus to request GPU resources for running our program.
Note
In order to run a CUDA program you must have CUDA hardware drivers installed on the machine. On Saga, these are only available on the GPU compute nodes, not on the login nodes. However, the drivers are not necessary for the compilation step (only the CUDA library, which comes with module load CUDA/...), so this can be done on the login node.
Step 2: Running the code
We will first test the code in an interactive session, so we ask for a single GPU:
[me@login.SAGA]$ salloc --nodes=1 --gpus=1 --time=0:10:00 --mem=1G --partition=accel --account=<your-account>
salloc: Pending job allocation 4320527
salloc: job 4320527 queued and waiting for resources
salloc: job 4320527 has been allocated resources
salloc: Granted job allocation 4320527
salloc: Waiting for resource configuration
salloc: Nodes c7-8 are ready for job
Remember to load the CUDA
module if not already loaded from Step 1. You can also verify
that you actually have access to a GPU using the nvidia-smi
command. If all goes well,
your program should now run and exit successfully:
[me@c7-8]$ ./vec_add_cuda
ENTER MAIN
c[0] : 1.000000
c[1] : 1.000000
c[42] : 1.000000
EXIT SUCCESS
Here we see the expected output of $c[i] = \sin^2(i) + \cos^2(i) = 1$ for any $i$, which means that the code runs correctly.
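If we want to be more thorough than spot-checking three entries, we could let the host loop over the whole array before freeing the memory. This is only a sketch, not part of the original example, and the tolerance is an arbitrary choice:
// Optional sanity check (sketch): count entries that deviate from 1.0,
// placed in main() before the cudaFree calls
int errors = 0;
for (size_t i = 0; i < num_elements; i++) {
    if (fabsf(c[i] - 1.0f) > 1e-5f) {
        errors++;
    }
}
printf("Entries outside tolerance: %d\n", errors);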
Note
For this particular example we have actually already verified that the code was executed
on the GPU. As the code is written, there is no “fallback” implementation that runs
on the CPU in case no GPU is found, which means that EXIT SUCCESS implies that the code
was executed on the GPU.
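If you want a program to check explicitly that a GPU is visible, rather than relying on the lack of a CPU fallback, a small stand-alone sketch (not part of the original example) could use cudaGetDeviceCount:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

int main(void) {
    // Ask the CUDA runtime how many devices are visible to this process
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        printf("No CUDA device available: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    printf("Found %d CUDA device(s)\n", device_count);
    return EXIT_SUCCESS;
}
Compiled with nvcc, this should report at least one device on a GPU node, while on the login node it should print an error instead.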
Step 3: Monitor the GPU utilization
We will now try to capture some stats from the execution using the nvidia-smi tool
to verify that we were able to utilize a few percent of the GPU's capacity. To get a
reasonable reading from this tool we need an application that runs for at least a few
seconds, so we will first make the following change to our source code:
for (int i = 0; i < 100000; i++) {
    // Call CUDA kernel to run on device
    add<<<grid_size, block_size>>>(a, b, c, num_elements);

    // Wait for computation before doing anything with data on host
    cudaDeviceSynchronize();
}
i.e. we loop over the vector addition 100 000 times. This should hopefully give sufficient run time to be picked up by our tool. We then compile and run our new code with the following job script:
#!/bin/bash
#SBATCH --job-name=CUDA-test
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=1G
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load CUDA/11.1.1-GCC-10.2.0
module list
# Compile our code
nvcc loop_add_cuda.cu -o loop_add_cuda
# Run our computation
./loop_add_cuda
exit 0
Submit the job using sbatch
(remember to set the --account
option, and note that we are
back on the login
node):
[me@login.SAGA]$ sbatch run.sh
Submitted batch job 4320512
Wait for the job to finish and verify from the slurm-xxxxx.out
file that the
calculation still finished successfully, and that it ran for at least a few seconds.
We can then add the following lines to the script in order to monitor the GPU
utilization using nvidia-smi
:
# Compile our code
nvcc loop_add_cuda.cu -o loop_add_cuda
# Setup monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory \
--format=csv --loop=1 > "monitor-$SLURM_JOB_ID.csv" &
NVIDIA_MONITOR_PID=$! # Capture PID of monitoring process
# Run our computation
./loop_add_cuda
# After computation stop monitoring
kill -SIGINT "$NVIDIA_MONITOR_PID"
exit 0
Submit the job:
[me@login.SAGA]$ sbatch monitor.sh
Submitted batch job 4320513
Wait for the job to complete and inspect the monitor-xxxx.csv
file we just created:
[me@login.SAGA]$ cat monitor-4320513.csv
timestamp, utilization.gpu [%], utilization.memory [%]
2021/11/03 21:42:44.210, 0 %, 0 %
2021/11/03 21:42:45.211, 82 %, 76 %
2021/11/03 21:42:46.211, 82 %, 69 %
2021/11/03 21:42:47.211, 82 %, 69 %
We see here that the utilization reached 82% of the GPU's capacity.