Offloading to GPUs
In high-performance computing, offloading is the act of moving a computation from the main processor to one or more accelerators. In many cases the computation does not need to be explicitly programmed, but can be a standard for (or do in Fortran) loop.
This document shows how to use the standard compilers available on Saga and Betzy to offload computation to the attached GPUs. It is not intended as a comprehensive guide to offloading, but rather as a compendium of the compiler flags required by the different compilers. For guidance on the different programming models for offloading, please see our guides.
Below we have listed the necessary flags to enable GPU offloading for the different systems NRIS users have access to. Both Saga and Betzy are Nvidia systems, while LUMI is an AMD-based system. A brief description of their GPU architectures is given below.
Betzy has Nvidia A100 accelerators, which support CUDA compute capability 8.0. The generational identifier for the GPU is either sm_80 or cc80, depending on the compiler.
Saga has Nvidia P100 accelerators, which support CUDA compute capability 6.0. The generational identifier for the GPU is either sm_60 or cc60, depending on the compiler.
LUMI-G has AMD MI250X accelerators, which are supported by ROCm. The identifier for the GPU is gfx90a.
OpenMP
OpenMP gained support for accelerator offloading in version 4.0. Most compilers that support version 4.5 and above should be able to run on attached GPUs. However, the performance of the generated code can vary widely between compilers, so it is recommended to compare them.
If you are interested in learning more about OpenMP offloading, we have a beginner tutorial on the topic here.
Warning
NVHPC does not support OpenMP offloading on Saga, as the generation of GPUs on Saga is older than what NVHPC supports. NVHPC therefore supports OpenMP offloading only on Betzy.
The flags below enable OpenMP offloading with the different compilers; replace <XX>/<XXX> with the generational identifier given above (for example sm_80 and cc80 on Betzy, sm_60 and cc60 on Saga, and gfx90a on LUMI-G).

Clang (Nvidia GPUs):
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_<XX>

Clang (AMD GPUs):
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx<XXX>

GCC (Nvidia GPUs):
-fopenmp -foffload=nvptx-none="-misa=sm_35"

GCC (AMD GPUs):
-fopenmp -foffload=amdgcn-amdhsa="-march=gfx<XXX>"

NVHPC (Nvidia GPUs):
-mp=gpu -Minfo=mp,accel -gpu=cc<XX>
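To illustrate what these flags apply to, the following is a minimal sketch of a loop offloaded with OpenMP target directives. The array names, sizes and the simple vector addition are our own illustrative assumptions, not code from any NRIS system.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double* pa = a.data();
    double* pb = b.data();

    // Offload the loop: copy 'pb' to the device, copy 'pa' both ways,
    // and distribute the iterations over the GPU's teams and threads.
    #pragma omp target teams distribute parallel for \
        map(to: pb[0:n]) map(tofrom: pa[0:n])
    for (int i = 0; i < n; ++i) {
        pa[i] += pb[i];
    }

    std::printf("a[0] = %f\n", pa[0]);
    return 0;
}

On Betzy this could be compiled, for example, with NVHPC as nvc++ -mp=gpu -Minfo=mp,accel -gpu=cc80 example.cpp (assuming the relevant NVHPC module is loaded); the other compilers use the flags listed above with sm_80.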
OpenACC
OpenACC is another open standard for offloading to accelerators. Since OpenACC was initially developed by Nvidia, the best support for OpenACC is found using Nvidia's compilers. However, several other compilers also support OpenACC to some extent.
If you are interested in learning more about OpenACC offloading, we have a beginner tutorial on the topic here.
The corresponding OpenACC flags are listed below.

GCC (Nvidia GPUs):
-fopenacc -foffload=nvptx-none="-misa=sm_35"

GCC (AMD GPUs):
-fopenacc -foffload=amdgcn-amdhsa="-march=gfx<XXX>"

NVHPC (Nvidia GPUs):
-acc -Minfo=accel -gpu=cc<XX>
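For reference, here is a minimal sketch of the same loop written with OpenACC directives; again, the variable names and the computation are illustrative assumptions on our part.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double* pa = a.data();
    double* pb = b.data();

    // copyin: transfer 'pb' to the GPU; copy: transfer 'pa' to the GPU
    // and back to the host after the parallel loop has finished.
    #pragma acc parallel loop copyin(pb[0:n]) copy(pa[0:n])
    for (int i = 0; i < n; ++i) {
        pa[i] += pb[i];
    }

    std::printf("a[0] = %f\n", pa[0]);
    return 0;
}

With NVHPC this could be compiled as, for example, nvc++ -acc -Minfo=accel -gpu=cc80 example.cpp on Betzy.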
Standard Parallelism
Nvidia additionally supports offloading based on "Standard Parallelism", which is capable of accelerating C++ std::algorithms and Fortran's do concurrent loops. You can read more about accelerating Fortran using do concurrent in our guide.
NVHPC (Nvidia GPUs):
-stdpar=gpu -Minfo=stdpar
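As a minimal sketch of the C++ side, the parallel standard algorithm below can be offloaded by NVHPC when built with the flags above (for example nvc++ -stdpar=gpu -Minfo=stdpar example.cpp); the vector sizes and the element-wise addition are illustrative assumptions.

#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // A parallel standard algorithm: with -stdpar=gpu the compiler may
    // execute this transform on the GPU instead of the host CPU.
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), c.begin(),
                   [](double x, double y) { return x + y; });

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}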