Offloading to GPUs

In high-performance computing, offloading is the act of moving a computation from the main processor to one or more accelerators. In many cases the computation does not need to be explicitly programmed, but can be a standard `for` loop (or `do` loop in Fortran).
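As a concrete illustration, a loop like the following saxpy kernel is a typical candidate for offloading: each iteration is independent of the others. The function name and values are ours for illustration only.

```cpp
#include <cstddef>
#include <vector>

// A standard loop over array elements: every iteration is independent,
// which makes this loop a good candidate for GPU offloading.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = a * x[i] + y[i];
    }
    return out;
}
```

The sections below show how such a loop can be handed to the GPU with each programming model.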

This document shows how to use the standard compilers available on Saga and Betzy to offload computation to the attached GPUs. It is not intended as a comprehensive guide to offloading, but rather as a compendium of the compiler flags required by the different compilers. For guidance on the different programming models for offloading, please see our guides.

Below we have listed the flags necessary to enable GPU offloading on the different systems NRIS users have access to. Both Saga and Betzy are Nvidia systems, while LUMI is an AMD-based system.

A brief description of their GPU architectures is given in the tabs below.

Betzy has Nvidia A100 accelerators, which have compute capability 8.0. The generational identifier for the GPU is either `sm_80` or `cc80`, depending on the compiler.


OpenMP gained support for accelerator offloading in version 4.0. Most compilers that support OpenMP 4.5 or later can offload to attached GPUs. However, performance can vary widely between compilers, so it is recommended to compare them.

If you are interested in learning more about OpenMP offloading we have a beginner tutorial on the topic here.


NVHPC does not support OpenMP offloading on Saga as the generation of GPUs on Saga is older than what NVHPC supports. Thus, NVHPC only supports OpenMP offloading on Betzy.

```
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_<XX>
```
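As a minimal sketch, the loop above can be offloaded with OpenMP target directives. Compiled with the flags listed here the loop runs on the GPU; a compiler without offload support simply ignores the directives and runs the loop on the host. The function name is ours for illustration.

```cpp
#include <cstddef>
#include <vector>

// OpenMP offloading: 'target' moves execution to the device, the 'map'
// clauses copy data to and from it, and 'teams distribute parallel for'
// spreads the iterations across the GPU's threads.
std::vector<float> saxpy_omp(float a, const std::vector<float>& x,
                             const std::vector<float>& y) {
    const std::size_t n = x.size();
    std::vector<float> out(n);
    const float* xp = x.data();
    const float* yp = y.data();
    float* op = out.data();
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n], yp[0:n]) map(from: op[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        op[i] = a * xp[i] + yp[i];
    }
    return out;
}
```

Raw pointers are used inside the target region because `std::vector` itself cannot be mapped to the device; only the underlying arrays are transferred.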


OpenACC is another open standard for offloading to accelerators. Since OpenACC was initially developed by Nvidia, the best support for it is found in Nvidia's compilers. However, several other compilers also support OpenACC to some extent.

If you are interested in learning more about OpenACC offloading we have a beginner tutorial on the topic here.

```
-fopenacc -foffload=nvptx-none="-misa=sm_35"
```
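For comparison, a sketch of the same loop in OpenACC style, with our illustrative function name; the `copyin`/`copyout` clauses manage data movement, and compilers without OpenACC support ignore the directive and run the loop serially on the host:

```cpp
#include <cstddef>
#include <vector>

// OpenACC offloading: 'parallel loop' offloads the loop to the
// accelerator, copying the inputs in and the result out.
std::vector<float> saxpy_acc(float a, const std::vector<float>& x,
                             const std::vector<float>& y) {
    const std::size_t n = x.size();
    std::vector<float> out(n);
    const float* xp = x.data();
    const float* yp = y.data();
    float* op = out.data();
    #pragma acc parallel loop copyin(xp[0:n], yp[0:n]) copyout(op[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        op[i] = a * xp[i] + yp[i];
    }
    return out;
}
```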

Standard Parallelism

Nvidia additionally supports offloading based on "Standard Parallelism", which is capable of accelerating C++ standard algorithms (`std::transform`, `std::for_each`, etc.) and Fortran's `do concurrent` loops.

You can read more about accelerating Fortran using do concurrent in our guide.

```
-stdpar=gpu -Minfo=stdpar
```
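A sketch of the same loop expressed with a standard C++ parallel algorithm, using an illustrative function name of ours. Compiled with `nvc++ -stdpar=gpu` such algorithms are offloaded to the GPU; with other compilers the same code runs in parallel (or serially) on the CPU:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Standard parallelism: std::transform with a parallel execution
// policy. No directives are needed; the compiler decides where and
// how the algorithm is executed.
std::vector<float> saxpy_stdpar(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
    std::vector<float> out(x.size());
    std::transform(std::execution::par_unseq, x.begin(), x.end(),
                   y.begin(), out.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
    return out;
}
```

Because this is plain ISO C++, the same source compiles unchanged for CPU-only systems, which makes it an attractive portable option.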