# Calling GPU accelerated libraries

One of the best ways to get the benefit of GPU acceleration is to call an
external library that is already accelerated. All of the major GPU hardware
vendors create such libraries and the advantage of their use is that you will
get the best performance possible for the available hardware. Examples of GPU
accelerated libraries include BLAS (Basic Linear Algebra Subprograms) libraries
such as `cuBLAS`

from Nvidia, `rocBLAS`

from AMD and `oneMKL`

from
Intel.

One challenge with calling an external library is related to its integration with user accelerated code and how to compile the code so that everything is linked. To address these issues this tutorial will go through:

How to call different GPU accelerated libraries from both C/C++ and Fortran.

How to combine external accelerated libraries and custom offloading code.

Focusing on OpenACC and OpenMP offloading

How to compile your code so that the external libraries are linked.

## Calling `cuBLAS`

from OpenACC

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. - netlib

As noted in the introduction to this tutorial, all of the major GPU hardware
vendors provide specialised BLAS routines for their own hardware. These
libraries offers the best in class performance and thanks to the shared
interface, one can easily abstract over multiple libraries from different
vendors. Here we will show how to integrate OpenACC with `cuBLAS`

from
Nvidia. The `cuBLAS`

library is a BLAS
implementation for Nvidia GPUs which is compatible with the hardware found on
Saga and Betzy.

As an example we will use `cuBLAS`

to perform a simple vector addition and then
calculate the sum of the vector in our own custom loop. The example allows us
to show how to combine `cuBLAS`

and OpenACC, and our recommendation is to
always use BLAS libraries when performing mathematical computations.

```
/**
* Example program to show how to combine OpenACC and cuBLAS library calls
*/
#include <cublas_v2.h>
#include <math.h>
#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>
#define N 10000
int main() {
printf("Starting SAXPY + OpenACC program\n");
// Allocate vectors which we will use for computations
float* a = (float*) calloc(N, sizeof(float));
float* b = (float*) calloc(N, sizeof(float));
float sum = 0.0;
const float alpha = 2.0;
if (a == NULL || b == NULL) {
printf("Could not allocate compute vectors!");
return EXIT_FAILURE;
}
// Initialize input arrays, this is done on CPU host
printf(" Initializing vectors on CPU\n");
for (int i = 0; i < N; i++) {
a[i] = 1.0;
b[i] = 2.0;
}
// Create cuBLAS handle for interacting with cuBLAS routines
printf(" Creating cuBLAS handle\n");
cublasHandle_t handle;
cublasStatus_t status; // Variable to hold return status from cuBLAS routines
status = cublasCreate(&handle);
if (status != CUBLAS_STATUS_SUCCESS) {
printf("Could not initialize cuBLAS handle!\n");
return EXIT_FAILURE;
}
// Create OpenACC data region so that our compute vectors are accessible on
// GPU device for cuBLAS
printf(" Starting calculation\n");
#pragma acc data copy(b[0:N]) copyin(a[0:N])
{
// To allow cuBLAS to interact with our compute vectors we need to make
// them available as pointers. NOTE however that these pointers point to
// areas in the GPU memory so they cannot be dereferenced on the CPU,
// however, by using the 'host_data' directive we can use the pointers from
// CPU code passing them to other functions that require pointers to GPU
// memory
#pragma acc host_data use_device(a, b)
{
status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
printf("SAXPY failed!\n");
// NOTE we cannot exit here since this is within an accelerated region
}
}
// We can now continue to use a and b in OpenACC kernels and parallel loop
#pragma acc kernels
for (int i = 0; i < N; i++) {
sum += b[i];
}
}
// After the above OpenACC region has ended 'a' has not changed, 'b' contains
// the result of the SAXPY routine and 'sum' contains the sum over 'b'
// To ensure everything worked we can check that the sum is as we expected
if (fabs(sum - 4.0 * (float) N) < 0.001) {
printf(" Calculation produced the correct result of '4 * %d == %.0f'!\n", N, sum);
} else {
printf(" Calculation produced _incorrect_ result, expected '4 * %d == %.3f'\n", N, sum);
}
// Free cuBLAS handle
cublasDestroy(handle);
// Free computation vectors
free(a);
free(b);
// Indicate to caller that everything worked as expected
printf("Ending SAXPY + OpenACC program\n");
return EXIT_SUCCESS;
}
```

The main focus of our changes are in the following lines, where we call the SAXPY routine within the already established OpenACC data region.

```
#pragma acc data copy(b[0:N]) copyin(a[0:N])
{
// To allow cuBLAS to interact with our compute vectors we need to make
// them available as pointers. NOTE however that these pointers point to
// areas in the GPU memory so they cannot be dereferenced on the CPU,
// however, by using the 'host_data' directive we can use the pointers from
// CPU code passing them to other functions that require pointers to GPU
// memory
#pragma acc host_data use_device(a, b)
{
status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
printf("SAXPY failed!\n");
// NOTE we cannot exit here since this is within an accelerated region
}
}
// We can now continue to use a and b in OpenACC kernels and parallel loop
#pragma acc kernels
for (int i = 0; i < N; i++) {
sum += b[i];
}
}
```

In the above section one can see that we first create an OpenACC data region
(`#pragma acc data`

) so that our compute vectors are available on the GPU
device. Within this region we would normally have accelerated loops that do
calculations on the data, but when integrating with `cuBLAS`

we only need the
address of the memory (`#pragma acc host_data`

). After the SAXPY routine is
called we use the data to calculate the sum as a normal OpenACC kernel.

Combining `cuBLAS`

and OpenACC in this manner allows us to call accelerated
libraries without having to perform low-level memory handling as one would
normally do with such a library.

To compile this code we will first need to load a few modules.

```
$ module load NVHPC/21.11 CUDA/11.4.1
```

```
$ module load NVHPC/21.7 CUDA/11.4.1
```

We first load `NVHPC`

which contains the OpenACC C compiler (`nvc`

), then we
load `CUDA`

which contains the `cuBLAS`

library which we will need to link to.

To compile we can use the following:

```
$ nvc -acc -Minfo=acc -gpu=cc60 -lcublas -o cublas_acc cublas_openacc.c
```

```
$ nvc -acc -Minfo=acc -gpu=cc80 -lcublas -o cublas_acc cublas_openacc.c
```

Finally, we can run the program using the `srun`

command which works on both
Saga and Betzy:

```
$ srun --account=nn<XXXX>k --ntasks=1 --time=02:00 --mem=1G --partition=accel --gpus=1 ./cublas_acc
srun: job <NNNNNN> queued and waiting for resources
srun: job <NNNNNN> has been allocated resources
Starting SAXPY + OpenACC program
Initializing vectors on CPU
Creating cuBLAS handle
Starting calculation
Calculation produced the correct result of '4 * 10000 == 40000'!
Ending SAXPY + OpenACC program
```

## Calling `cuBLAS`

from OpenMP offloading

OpenMP support offloading to GPUs in the same way as OpenACC. We will therefore use the same example as above, but this time use OpenMP’s offloading capabilities.

Since the program has not changed much from above we have highlighted the major differences from the OpenACC version.

```
/**
* Example program to show how to combine OpenMP offload and cuBLAS library calls
*/
#include <cublas_v2.h>
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 10000
int main() {
printf("Starting SAXPY + OpenMP offload program\n");
// Allocate vectors which we will use for computations
float* a = (float*) calloc(N, sizeof(float));
float* b = (float*) calloc(N, sizeof(float));
float sum = 0.0;
const float alpha = 2.0;
if (a == NULL || b == NULL) {
printf("Could not allocate compute vectors!");
return EXIT_FAILURE;
}
// Initialize input arrays, this is done on CPU host
printf(" Initializing vectors on CPU\n");
for (int i = 0; i < N; i++) {
a[i] = 1.0;
b[i] = 2.0;
}
// Create cuBLAS handle for interacting with cuBLAS routines
printf(" Creating cuBLAS handle\n");
cublasHandle_t handle;
cublasStatus_t status; // Variable to hold return status from cuBLAS routines
status = cublasCreate(&handle);
if (status != CUBLAS_STATUS_SUCCESS) {
printf("Could not initialize cuBLAS handle!\n");
return EXIT_FAILURE;
}
// Create OpenMP data region so that our compute vectors are accessible on
// GPU device for cuBLAS
printf(" Starting calculation\n");
#pragma omp target data map(tofrom:b[0:N]) map(to:a[0:N])
{
// To allow cuBLAS to interact with our compute vectors we need to make
// them available as pointers. NOTE however that these pointers point to
// areas in the GPU memory so they cannot be dereferenced on the CPU,
// however, by using the 'host_data' directive we can use the pointers from
// CPU code passing them to other functions that require pointers to GPU
// memory
#pragma omp target data use_device_ptr(a, b)
{
status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
printf("SAXPY failed!\n");
// NOTE we cannot exit here since this is within an accelerated region
}
}
// We can now continue to use a and b in OpenMP offloading code
#pragma omp target teams distribute parallel for schedule(nonmonotonic:static,1) reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += b[i];
}
}
// After the above OpenMP region has ended 'a' has not changed, 'b' contains
// the result of the SAXPY routine and 'sum' contains the sum over 'b'
// To ensure everything worked we can check that the sum is as we expected
if (fabs(sum - 4.0 * (float) N) < 0.001) {
printf(" Calculation produced the correct result of '4 * %d == %.0f'!\n", N, sum);
} else {
printf(" Calculation produced _incorrect_ result, expected '4 * %d == %.3f'\n", N, sum);
}
// Free cuBLAS handle
cublasDestroy(handle);
// Free computation vectors
free(a);
free(b);
// Indicate to caller that everything worked as expected
printf("Ending SAXPY + OpenMP program\n");
return EXIT_SUCCESS;
}
```

As can be seen in the code above, our interaction with the `cuBLAS`

library did
not have to change, we only had to change the directives we used to make the
compute vectors available. As with OpenACC, in OpenMP we start by creating a
data region to make our compute vectors accessible to the GPU (done with
`#pragma omp target data map(...)`

). We then make the pointers to this data
available for our CPU code so that we can pass valid pointers to `cuBLAS`

(pointers made available with `#pragma omp target data use_device_ptr(...)`

).
Finally we show that we can also use the vectors we uploaded in custom
offloading loops.

To compile the above OpenMP code we first need to load the necessary modules:

```
$ module load Clang/13.0.1-GCCcore-11.2.0-CUDA-11.4.1
```

Since the GPUs on Saga are a couple of generation older we can’t use `NVHPC`

for OpenMP offloading. We instead use `Clang`

to show that it works on Saga as
well.

```
$ module load NVHPC/21.7 CUDA/11.4.1
```

And then we compile with:

```
$ clang -o cublas_omp cublas_omp.c -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_60 -lcublas
```

Since the GPUs on Saga are a couple of generation older we can’t use `NVHPC`

for OpenMP offloading. We instead use `Clang`

to show that it works on Saga as
well.

```
$ nvc -mp=gpu -Minfo=mp -gpu=cc80 -lcublas -o cublas_omp cublas_omp.c
```

Finally we can run the program with the following call to `srun`

(note that
this call works on both Saga and Betzy):

```
$ srun --account=nn<XXXX>k --ntasks=1 --time=02:00 --mem=1G --partition=accel --gpus=1 ./cublas_omp
srun: job <NNNNNN> queued and waiting for resources
srun: job <NNNNNN> has been allocated resources
Starting SAXPY + OpenMP offload program
Initializing vectors on CPU
Creating cuBLAS handle
Starting calculation
Calculation produced the correct result of '4 * 10000 == 40000'!
Ending SAXPY + OpenMP program
```

## Calling `cuFFT`

from OpenACC

### Summary

In this section we provide an overview on how to implement a GPU-accelerated library FFT (Fast Fourier Transform) in an OpenACC application and the serial version of the FFTW library. Here we distinguish between two GPU-based FFT libraries: `cuFFT`

and `cuFFTW`

. The `cuFFT`

library is the NVIDIA-GPU based design, while `cuFFTW`

is a porting version of the existing `FFTW`

library. In this tutorial, both libraries will be addressed with a special focus on the implementation of the `cuFFT`

library. Specifically, the aim of this tutorial is to:

Show how to incorporate the

`FFTW`

library in a serial code.Describe how to use the

`cuFFTW`

library.Show how to incorporate the

`cuFFT`

library in an OpenACC application interface.Describe briefly how to enable

`cuFFT`

to run on OpenACC stream.Describe the compilation process of

`FFTW`

and`cuFFT`

.

The implementation will be illustrated for a one-dimensional (1D) scenario and will be further described for 2D and 3D cases.

### Generality of FFT

In general, the implementation of an FFT library is based on three major steps as defined below:

Creating plans (initialization).

Executing plans (create a configuration of a FFT plan having a specified dimension and data type).

Destroying plans (to free the ressources associated with the FFT plans).

These steps necessitate specifying the direction, in which the FFT algorithm should be performed: forward or backward (or also inverse of FFT), and the dimension of the problem at hands as well as the precision (i.e. double or single precision); this is in addition to the nature of the data (real or complex) to be transformed.

In the following, we consider a one-dimensional (1D) scenario, in which the execution is specified for a double precision complex-to-complex transform plan in the forward and backward directions. The implementation is illustrated via a Fortran code. The latter can be adjusted to run calculations of a single precision as well as of real-to-real/complex transform and can be further extended to multi-dimension cases (i.e. 2D and 3D). We first start with the FFT implementation in a serial-CPU scheme and further extend it to a GPU-accelerated case. The implementation is illustrated for a simple example of a function defined in time-domain. Here we choose a sinus function (i.e. f(t)=sin(ωt) with ω is fixed at the value 2), and its FFT should result in a peak around the value ω=2 in the frequency domain.

### Implementation of `FFTW`

The implementation of the `FFTW`

library is shown below and a detailed description of the library can be found here.

As described in the code, one needs to initialize the FFT by creating plans. Executing the plans requires specifying the transform direction: *FFTWFORWARD* for the forward direction or *FFTWBACKWARD* for the backward direction (inverse FFT). These two parameters should be defined as an integer parameter. An alternative is to include the `fftw3.f`

file as a header (i.e. `include "fftw3.f"`

), which contains all parameters required for a general use of `FFTW`

. In the case the file is included, the value of the direction parameter does not need to be defined.

The argument *FFTW_MEASURE* in the function `dfftw_plan_dft_1d`

means that `FFTW`

measures the execution time of several FFTs in order to find the optimal way to compute the FFT, which might be time-consuming. An alternative is to use *FFTW_ESTIMATE*, which builds a reasonable plan without any computation. This procedure might be less optimal (see here for further details).

Note that when implementing the `FFTW`

library, the data obtained from the backward direction need to be normalized by dividing the output array by the size of the data, while those of forward direction do not. This is only valid when using the `FFTW`

library.

To check the outcome of the result in the forward direction, one can plot the function in the frequency-domain, which should display a peak around the value ω=+2 and -2 as the function is initially symmetric. By performing the backward FFT of the obtained function, one should obtain the initial function displayed in time-domain (i.e. sin(2t)). This checking procedure holds also when implementing a GPU version of the FFT library.

For completeness, porting the `FFTW`

library to `cuFFTW`

does not require modifications in the code - it is done by replacing the file `fftw3.h`

with `cufftw.h`

.

```
module parameter_kind
implicit none
public
integer, parameter :: FFTW_FORWARD=-1,FFTW_BACKWARD=+1
integer, parameter :: FFTW_MEASURE=0
integer, parameter :: sp = selected_real_kind(6, 37) !Single precision
integer, parameter :: dp = selected_real_kind(15, 307) !Double precision
integer, parameter :: fp = dp
real(fp), parameter :: pi = 4.0_fp*atan(1.0_fp),dt=0.25_fp
end module parameter_kind
program fftw_serial
use parameter_kind
implicit none
!include "fftw3.f"
integer, parameter :: nt=512
integer :: i,ierr
integer*8 :: plan_forward,plan_backward
complex(fp), allocatable :: in(:),out(:),f(:)
real(fp), allocatable :: t(:),w(:)
allocate(t(nt),w(nt)); allocate(f(nt))
call grid_1d(nt,t,w)
!Example of sine function
do i=1,nt
f(i) = cmplx(sin(2.0_fp*t(i)),0.0_fp)
enddo
print*,"--sum before FFT", sum(real(f(1:nt/2)))
!Creating 1D plans
allocate(in(nt),out(nt))
call dfftw_plan_dft_1d(plan_forward,nt,in,out,FFTW_FORWARD,FFTW_MEASURE)
call dfftw_plan_dft_1d(plan_backward,nt,in,out,FFTW_BACKWARD,FFTW_MEASURE)
!Forward FFT
in(:) = f(:)
call dfftw_execute_dft(plan_forward, in, out)
f(:) = out(:)
!Backward FFT
call dfftw_execute_dft(plan_backward, out, in)
!The data on the backforward are unnormalized, so they should be divided by N.
in(:) = in(:)/real(nt)
!Destroying plans
call dfftw_destroy_plan(plan_forward)
call dfftw_destroy_plan(plan_backward)
print*,"--sum iFFT", sum(real(in(1:nt/2)))
!Printing the FFT of sin(2t)
do i=1,nt/2
write(204,*)w(i),dsqrt(cdabs(f(i))**2)
enddo
deallocate(in); deallocate(out); deallocate(f)
end
subroutine grid_1d(nt,t,w)
use parameter_kind
implicit none
integer :: i,nt
real(fp) :: t(nt),w(nt)
!Defining a uniform temporal grid
do i=1,nt
t(i) = (-dble(nt-1)/2.0_fp + (i-1))*dt
enddo
!Defining a uniform frequency grid
do i=0,nt/2-1
w(i+1) = 2.0_fp*pi*dble(i)/(nt*dt)
enddo
do i=nt/2,nt-1
w(i+1) = 2.0_fp*pi*dble(i-nt)/(nt*dt)
enddo
end subroutine grid_1d
```

### Compilation process of `FFTW`

The `FFTW`

library should be linked with fftw3 (i.e. `-lfftw3`

) for the double precision, and fftw3f (i.e. `-lfftw3f`

) for the single precision case.

Here is an example of a module to be loaded.

On Saga:

```
$ module load FFTW/3.3.9-intel-2021a
```

The same module is available on Betzy.

To compile:

```
$ ifort -lfftw3 -o fftw.serial fftw_serial.f90
```

In the case of using the `cuFFTW`

library, the linking in the compilation syntaxt should be provided for both `cuFFT`

and `cuFFTW`

libraries.

### Implementation of `cuFFT`

We consider the same scenario as described in the previous section but this time the implementation involves the communication between a CPU-host and GPU-device by calling the `cuFFT`

library. The `cuFFT`

implementation is shown below.

Similarly to the `FFTW`

library, the implementation of the GPU-accelerated `cuFFT`

library is conceptually based on creating plans, executing and destroying them. The difficulty here however is how to call the `cuFFT`

library, which is written using a low-level programming model, from an OpenACC application interface. In this scenario, there are steps that are executed by the `cuFFT`

library and other steps are executed by OpenACC kernels. Executing all these steps requires sharing data. In other words, it requires making OpenACC aware of the GPU-accelerated `cuFFT`

library. This is done in OpenACC by specifying the directive `host_data`

together with the clause `use_device(list-of-arrays)`

. This combination permits to access the device address of the listed arrays in the `use_device()`

clause from the host. The arrays, which should be already present on the device memory, are in turn passed to the `cuFFT`

functions (i.e. `cufftExecZ2Z()`

in our example). The output data of these functions is not normalized, and thus it requires to be normalized by dividing by the size of the array. The normalisation may be followed by the function `cufftDestroy()`

to free all GPU resources associated with a `cuFFT`

plan and destroy the internal plan data structure.

It is worth noting that the `cuFFT`

library uses CUDA streams for an asynchronous execution, which is not the case for OpenACC. It is therefore necessary to make the `cuFFT`

runs on OpenACC streams. This is done by calling the routine `cufftSetStream()`

, which is part of the `cuFFT`

module. The routine includes the function `acc_get_cuda_stream()`

, which enables identifying the CUDA stream.

Note that the use of the OpenACC runtime routines and the `cuFFT`

routines requires including the header lines `use openacc`

and `use cufft`

.

The tables below summarize the calling functions in the case of a multi-dimension data having a simple or double complex data type (see here for more details).

```
module parameter_kind
implicit none
public
integer, parameter :: sp = selected_real_kind(6, 37) !Single precision
integer, parameter :: dp = selected_real_kind(15, 307) !Double precision
integer, parameter :: fp = dp
real(fp), parameter :: pi = 4.0_fp*atan(1.0_fp),dt=0.25_fp
end module parameter_kind
program cufft_acc
use parameter_kind
use cufft
use openacc
implicit none
integer, parameter :: nt=512
integer :: i,ierr,plan
complex(fp), allocatable :: in(:),out(:)
real(fp), allocatable :: t(:),w(:)
allocate(t(nt),w(nt)); allocate(in(nt),out(nt))
call grid_1d(nt,t,w)
!Example of a sinus function
do i=1,nt
in(i) = cmplx(sin(2.0_fp*t(i)),0.0_fp)
enddo
print*,"--sum before FFT", sum(real(in(1:nt/2)))
!cufftExecZ2Z executes a double precision complex-to-complex transform plan
ierr = cufftPlan1D(plan,nt,CUFFT_Z2Z,1)
!acc_get_cuda_stream: tells the openACC runtime to identify the CUDA
!stream used by CUDA
ierr = ierr + cufftSetStream(plan,acc_get_cuda_stream(acc_async_sync))
!$acc data copy(in) copyout(out)
!$acc host_data use_device(in,out)
ierr = ierr + cufftExecZ2Z(plan, in, out, CUFFT_FORWARD)
ierr = ierr + cufftExecZ2Z(plan, out, in, CUFFT_INVERSE)
!$acc end host_data
!$acc kernels
out(:) = out(:)/nt
in(:) = in(:)/nt
!$acc end kernels
!$acc end data
ierr = ierr + cufftDestroy(plan)
print*,""
if(ierr.eq.0) then
print*,"--Yep it works :)"
else
print*,"Nop it fails, I stop :("
endif
print*,""
print*,"--sum iFFT", sum(real(in(1:nt/2)))
!printing the fft of sinus
do i=1,nt/2
write(204,*)w(i),sqrt(cabs(out(i))**2)
enddo
deallocate(in); deallocate(out)
end
subroutine grid_1d(nt,t,w)
use parameter_kind
implicit none
integer :: i,nt
real(fp) :: t(nt),w(nt)
!Defining a uniform temporal grid
do i=1,nt
t(i) = (-dble(nt-1)/2.0_fp + (i-1))*dt
enddo
!Defining a uniform frequency grid
do i=0,nt/2-1
w(i+1) = 2.0_fp*pi*dble(i)/(nt*dt)
enddo
do i=nt/2,nt-1
w(i+1) = 2.0_fp*pi*dble(i-nt)/(nt*dt)
enddo
end subroutine grid_1d
```

Dimension |
Creating a FFT plan |
---|---|

1D |
cufftPlan1D( plan, nx, FFTtype,1) |

2D |
cufftPlan2d( plan, ny, nx, FFTtype) |

3D |
cufftPlan3d( plan, nz, ny, nx, FFTtype) |

**Table 1.** *Creating FFT plans in 1D, 2D and 3D dimensions. nx is the size of a 1D array, nx and ny the size of a 2D array, and nx, ny, nz define the size of a 3D array. The FFTtype specifies the data type stored as described in the Table 2.*

Precision of the transformed plan |
subroutine |
FFTtype |
---|---|---|

Double precision complex-to-complex |
cufftExecZ2Z( plan, in, out, direction ) |
”CUFFT_Z2Z” |

Single precision complex-to-complex |
cufftExecC2C( plan, in, out, direction ) |
”CUFFT_C2C” |

**Table 2.** *Executing a double precision/single-precision complex-to-complex transform plan in a FFT direction to be specified: “CUFFT_FORWARD” for forward FFT and “CUFFT_INVERSE” for backward FFT. The input data are stored in the array in, and the results of FFT for a specific direction are stored in the array out.*

### Compilation process of `cuFFT`

The `cuFFT`

library is part of the CUDA toolkit, and thus it is supported by the NVIDIA-GPU compiler. Therefore, the only modules are required to be load are NVHPC and CUDA modules.

Modules to be loaded:

```
$ module load NVHPC/21.11 CUDA/11.4.1
```

```
$ module load NVHPC/21.7 CUDA/11.4.1
```

We compile using the NVIDIA Fortran compiler `nvfortran`

. The compilation process requires linking the `cuFFT`

library (`-lcufft`

) and adding the CUDA version library to the syntax of the compilation (`-cudalib=cufft`

).

```
$ nvfortran -lcufft -cudalib=cufft -acc -Minfo=accel -o cufft.acc cufft_acc.f90
```

Here the flag `-acc`

enables OpenACC on NVIDIA-GPU. It is possible to specify the compute capability e.g. `-gpu=cc80`

for Betzy and `-gpu=cc60`

for Saga.

To run:

```
$ srun --partition=accel --gpus=1 --time=00:01:00 --account=nnXXXXX --qos=devel --mem-per-cpu=1G ./cufft.acc
```

# Conclusion

In conclusion, we have provided a description of the implementation of the GPU-accelerated `cuBLAS`

and `cuFFT`

libraries targeting NVIDIA-GPU. The implementation illustrates the capability of calling a GPU-accelerated library written in a low-level programming model from an OpenACC or OpenMP application interface. We have also documented the implementation of the `FFTW`

library for a serial case scenario and emphasized its porting version referred to as the `cuFFTW`

library. For the `FFTW`

and `cuFFT`

libraries, although the implementation has been done for a 1D problem, an extension to 2D and 3D scenarios is straightforward.