Calling GPU accelerated libraries
One of the best ways to benefit from GPU acceleration is to call an external library that is already accelerated. All of the major GPU hardware vendors provide such libraries, and the advantage of using them is that you get the best performance possible for the available hardware. Examples of GPU accelerated libraries include BLAS (Basic Linear Algebra Subprograms) libraries such as cuBLAS from Nvidia, rocBLAS from AMD and oneMKL from Intel.
One challenge with calling an external library is integrating it with your own accelerated code and compiling the code so that everything is linked correctly. To address these issues this tutorial will go through:
How to call different GPU accelerated libraries from both C/C++ and Fortran.
How to combine external accelerated libraries with custom offloading code, focusing on OpenACC and OpenMP offloading.
How to compile your code so that the external libraries are linked.
Calling cuBLAS from OpenACC
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. - netlib
As noted in the introduction to this tutorial, all of the major GPU hardware vendors provide specialised BLAS routines for their own hardware. These libraries offer best-in-class performance and, thanks to the shared interface, one can easily abstract over multiple libraries from different vendors. Here we will show how to integrate OpenACC with cuBLAS from Nvidia. The cuBLAS library is a BLAS implementation for Nvidia GPUs which is compatible with the hardware found on Saga and Betzy.
As an example we will use cuBLAS to perform a simple vector addition and then calculate the sum of the resulting vector in our own custom loop. The example shows how to combine cuBLAS and OpenACC; our recommendation is to always use BLAS libraries when performing mathematical computations.
/**
 * Example program to show how to combine OpenACC and cuBLAS library calls
 */
#include <cublas_v2.h>
#include <math.h>
#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000

int main() {
  printf("Starting SAXPY + OpenACC program\n");

  // Allocate vectors which we will use for computations
  float* a = (float*) calloc(N, sizeof(float));
  float* b = (float*) calloc(N, sizeof(float));
  float sum = 0.0;
  const float alpha = 2.0;

  if (a == NULL || b == NULL) {
    printf("Could not allocate compute vectors!");
    return EXIT_FAILURE;
  }

  // Initialize input arrays, this is done on the CPU host
  printf(" Initializing vectors on CPU\n");
  for (int i = 0; i < N; i++) {
    a[i] = 1.0;
    b[i] = 2.0;
  }

  // Create cuBLAS handle for interacting with cuBLAS routines
  printf(" Creating cuBLAS handle\n");
  cublasHandle_t handle;
  cublasStatus_t status; // Variable to hold return status from cuBLAS routines
  status = cublasCreate(&handle);
  if (status != CUBLAS_STATUS_SUCCESS) {
    printf("Could not initialize cuBLAS handle!\n");
    return EXIT_FAILURE;
  }

  // Create OpenACC data region so that our compute vectors are accessible on
  // the GPU device for cuBLAS
  printf(" Starting calculation\n");
  #pragma acc data copy(b[0:N]) copyin(a[0:N])
  {
    // To allow cuBLAS to interact with our compute vectors we need to make
    // them available as pointers. NOTE however that these pointers point to
    // areas in GPU memory, so they cannot be dereferenced on the CPU.
    // However, by using the 'host_data' directive we can use the pointers from
    // CPU code, passing them to other functions that require pointers to GPU
    // memory
    #pragma acc host_data use_device(a, b)
    {
      status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
      if (status != CUBLAS_STATUS_SUCCESS) {
        printf("SAXPY failed!\n");
        // NOTE we cannot exit here since this is within an accelerated region
      }
    }

    // We can now continue to use 'a' and 'b' in OpenACC kernels and parallel loops
    #pragma acc kernels
    for (int i = 0; i < N; i++) {
      sum += b[i];
    }
  }

  // After the above OpenACC region has ended 'a' has not changed, 'b' contains
  // the result of the SAXPY routine and 'sum' contains the sum over 'b'.
  // To ensure everything worked we can check that the sum is as we expected
  if (fabs(sum - 4.0 * (float) N) < 0.001) {
    printf(" Calculation produced the correct result of '4 * %d == %.0f'!\n", N, sum);
  } else {
    printf(" Calculation produced _incorrect_ result, expected '4 * %d == %.3f'\n", N, sum);
  }

  // Free cuBLAS handle
  cublasDestroy(handle);
  // Free computation vectors
  free(a);
  free(b);

  // Indicate to the caller that everything worked as expected
  printf("Ending SAXPY + OpenACC program\n");
  return EXIT_SUCCESS;
}
The main focus of our changes is in the following lines, where we call the SAXPY routine within the already established OpenACC data region.
#pragma acc data copy(b[0:N]) copyin(a[0:N])
{
  // To allow cuBLAS to interact with our compute vectors we need to make
  // them available as pointers. NOTE however that these pointers point to
  // areas in GPU memory, so they cannot be dereferenced on the CPU.
  // However, by using the 'host_data' directive we can use the pointers from
  // CPU code, passing them to other functions that require pointers to GPU
  // memory
  #pragma acc host_data use_device(a, b)
  {
    status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
    if (status != CUBLAS_STATUS_SUCCESS) {
      printf("SAXPY failed!\n");
      // NOTE we cannot exit here since this is within an accelerated region
    }
  }

  // We can now continue to use 'a' and 'b' in OpenACC kernels and parallel loops
  #pragma acc kernels
  for (int i = 0; i < N; i++) {
    sum += b[i];
  }
}
In the above section one can see that we first create an OpenACC data region (#pragma acc data) so that our compute vectors are available on the GPU device. Within this region we would normally have accelerated loops that do calculations on the data, but when integrating with cuBLAS we only need the address of the memory (#pragma acc host_data). After the SAXPY routine is called we use the data to calculate the sum in a normal OpenACC kernel. Combining cuBLAS and OpenACC in this manner allows us to call accelerated libraries without having to perform the low-level memory handling one would normally need with such a library.
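Following this recommendation one step further, even the reduction loop in the example could be replaced by a BLAS call. The snippet below is a minimal, untested sketch (not part of the original example) of how the #pragma acc kernels loop inside the data region might be swapped for cublasSasum, which sums absolute values and therefore gives the same result here because every element of 'b' is non-negative:

// Possible replacement for the '#pragma acc kernels' reduction loop above;
// this still sits inside the '#pragma acc data' region of the example.
// Assumption: all elements of 'b' are non-negative, so the sum of absolute
// values equals the plain sum.
#pragma acc host_data use_device(b)
{
  status = cublasSasum(handle, N, b, 1, &sum);
  if (status != CUBLAS_STATUS_SUCCESS) {
    printf("SASUM failed!\n");
  }
}

With the default cuBLAS pointer mode the result is written back to the host variable 'sum', so the correctness check after the data region works unchanged.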
To compile this code we first need to load a few modules. Depending on the system, load for example one of the following (the exact versions may differ):
$ module load NVHPC/21.11 CUDA/11.4.1
$ module load NVHPC/21.7 CUDA/11.4.1
We first load NVHPC, which contains the OpenACC C compiler (nvc), and then CUDA, which contains the cuBLAS library that we need to link against.
To compile we can use one of the following, where -gpu=cc60 targets the P100 GPUs on Saga and -gpu=cc80 targets the A100 GPUs on Betzy:
$ nvc -acc -Minfo=acc -gpu=cc60 -lcublas -o cublas_acc cublas_openacc.c
$ nvc -acc -Minfo=acc -gpu=cc80 -lcublas -o cublas_acc cublas_openacc.c
Finally, we can run the program using the srun command, which works on both Saga and Betzy:
$ srun --account=nn<XXXX>k --ntasks=1 --time=02:00 --mem=1G --partition=accel --gpus=1 ./cublas_acc
srun: job <NNNNNN> queued and waiting for resources
srun: job <NNNNNN> has been allocated resources
Starting SAXPY + OpenACC program
Initializing vectors on CPU
Creating cuBLAS handle
Starting calculation
Calculation produced the correct result of '4 * 10000 == 40000'!
Ending SAXPY + OpenACC program
Calling cuBLAS from OpenMP offloading
OpenMP supports offloading to GPUs in the same way as OpenACC. We will therefore use the same example as above, but this time with OpenMP's offloading capabilities.
The program is almost unchanged; the main differences from the OpenACC version are the directives used to manage data and launch the offloaded loop.
/**
 * Example program to show how to combine OpenMP offload and cuBLAS library calls
 */
#include <cublas_v2.h>
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000

int main() {
  printf("Starting SAXPY + OpenMP offload program\n");

  // Allocate vectors which we will use for computations
  float* a = (float*) calloc(N, sizeof(float));
  float* b = (float*) calloc(N, sizeof(float));
  float sum = 0.0;
  const float alpha = 2.0;

  if (a == NULL || b == NULL) {
    printf("Could not allocate compute vectors!");
    return EXIT_FAILURE;
  }

  // Initialize input arrays, this is done on the CPU host
  printf(" Initializing vectors on CPU\n");
  for (int i = 0; i < N; i++) {
    a[i] = 1.0;
    b[i] = 2.0;
  }

  // Create cuBLAS handle for interacting with cuBLAS routines
  printf(" Creating cuBLAS handle\n");
  cublasHandle_t handle;
  cublasStatus_t status; // Variable to hold return status from cuBLAS routines
  status = cublasCreate(&handle);
  if (status != CUBLAS_STATUS_SUCCESS) {
    printf("Could not initialize cuBLAS handle!\n");
    return EXIT_FAILURE;
  }

  // Create OpenMP data region so that our compute vectors are accessible on
  // the GPU device for cuBLAS
  printf(" Starting calculation\n");
  #pragma omp target data map(tofrom:b[0:N]) map(to:a[0:N])
  {
    // To allow cuBLAS to interact with our compute vectors we need to make
    // them available as pointers. NOTE however that these pointers point to
    // areas in GPU memory, so they cannot be dereferenced on the CPU.
    // However, by using the 'use_device_ptr' clause we can use the pointers
    // from CPU code, passing them to other functions that require pointers to
    // GPU memory
    #pragma omp target data use_device_ptr(a, b)
    {
      status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
      if (status != CUBLAS_STATUS_SUCCESS) {
        printf("SAXPY failed!\n");
        // NOTE we cannot exit here since this is within an accelerated region
      }
    }

    // We can now continue to use 'a' and 'b' in OpenMP offloading code
    #pragma omp target teams distribute parallel for schedule(nonmonotonic:static,1) reduction(+:sum)
    for (int i = 0; i < N; i++) {
      sum += b[i];
    }
  }

  // After the above OpenMP region has ended 'a' has not changed, 'b' contains
  // the result of the SAXPY routine and 'sum' contains the sum over 'b'.
  // To ensure everything worked we can check that the sum is as we expected
  if (fabs(sum - 4.0 * (float) N) < 0.001) {
    printf(" Calculation produced the correct result of '4 * %d == %.0f'!\n", N, sum);
  } else {
    printf(" Calculation produced _incorrect_ result, expected '4 * %d == %.3f'\n", N, sum);
  }

  // Free cuBLAS handle
  cublasDestroy(handle);
  // Free computation vectors
  free(a);
  free(b);

  // Indicate to the caller that everything worked as expected
  printf("Ending SAXPY + OpenMP program\n");
  return EXIT_SUCCESS;
}
As can be seen in the code above, our interaction with the cuBLAS library did not have to change; we only had to change the directives used to make the compute vectors available. As with OpenACC, in OpenMP we start by creating a data region to make our compute vectors accessible on the GPU (done with #pragma omp target data map(...)). We then make the device pointers to this data available to our CPU code so that we can pass valid pointers to cuBLAS (done with #pragma omp target data use_device_ptr(...)). Finally, we show that we can continue to use the vectors we uploaded in custom offloading loops.
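The same pattern also works with unstructured data mapping, which can be convenient when the allocation and the library call happen in different parts of the program. The sketch below is a minimal, untested variation on the example above (directive placement only, reusing the same variables):

// Map the vectors to the device once, e.g. right after they are initialized
#pragma omp target enter data map(to: a[0:N], b[0:N])

// ... later, possibly in a different function ...
#pragma omp target data use_device_ptr(a, b)
{
  status = cublasSaxpy(handle, N, &alpha, a, 1, b, 1);
}

// Copy the result back and release the device memory when we are done
#pragma omp target exit data map(from: b[0:N]) map(delete: a[0:N])

The structured map(tofrom: ...) region used in the example is simpler for a self-contained program; the unstructured form becomes useful once the data lives on the device across several library calls.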
To compile the above OpenMP code we first need to load the necessary modules and then compile. Since the GPUs on Saga are a couple of generations older, we cannot use NVHPC for OpenMP offloading there; we instead use Clang to show that offloading works on Saga as well:
$ module load Clang/13.0.1-GCCcore-11.2.0-CUDA-11.4.1
$ clang -o cublas_omp cublas_omp.c -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_60 -lcublas
On Betzy we can use NVHPC directly:
$ module load NVHPC/21.7 CUDA/11.4.1
$ nvc -mp=gpu -Minfo=mp -gpu=cc80 -lcublas -o cublas_omp cublas_omp.c
Finally we can run the program with the following call to srun (note that this call works on both Saga and Betzy):
$ srun --account=nn<XXXX>k --ntasks=1 --time=02:00 --mem=1G --partition=accel --gpus=1 ./cublas_omp
srun: job <NNNNNN> queued and waiting for resources
srun: job <NNNNNN> has been allocated resources
Starting SAXPY + OpenMP offload program
Initializing vectors on CPU
Creating cuBLAS handle
Starting calculation
Calculation produced the correct result of '4 * 10000 == 40000'!
Ending SAXPY + OpenMP program
Calling cuFFT from OpenACC
Summary
In this section we give an overview of how to use a GPU-accelerated FFT (Fast Fourier Transform) library from an OpenACC application, starting from a serial version based on the FFTW library. We distinguish between two GPU-based FFT libraries: cuFFT and cuFFTW. The cuFFT library is NVIDIA's native GPU FFT library, while cuFFTW is a porting layer for existing FFTW applications. In this tutorial both libraries will be addressed, with a special focus on the implementation of the cuFFT library. Specifically, the aim of this tutorial is to:
Show how to incorporate the FFTW library in a serial code.
Describe how to use the cuFFTW library.
Show how to incorporate the cuFFT library in an OpenACC application interface.
Describe briefly how to make cuFFT run on an OpenACC stream.
Describe the compilation process of FFTW and cuFFT.
The implementation will be illustrated for a one-dimensional (1D) scenario, and the extension to 2D and 3D cases will be outlined.
Generality of FFT
In general, the implementation of an FFT library is based on three major steps as defined below:
Creating plans (initialization).
Executing plans (performing the transform specified by the plan's dimension and data type).
Destroying plans (to free the resources associated with the FFT plans).
These steps require specifying the direction in which the FFT algorithm should be performed, forward or backward (also called the inverse FFT), the dimension of the problem at hand and the precision (double or single precision), in addition to the nature of the data (real or complex) to be transformed.
In the following we consider a one-dimensional (1D) scenario, in which the execution is specified for a double precision complex-to-complex transform plan in the forward and backward directions. The implementation is illustrated with a Fortran code; it can be adjusted to single precision as well as to real-to-complex transforms, and it can be further extended to multi-dimensional cases (2D and 3D). We first start with the FFT implementation in a serial CPU scheme and then extend it to a GPU-accelerated case. The implementation is illustrated for a simple function defined in the time domain. Here we choose a sine function, f(t)=sin(ωt) with ω fixed at 2, whose FFT should show a peak around ω=2 in the frequency domain.
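Before turning to the Fortran implementation used in the remainder of this section, the three steps above can be illustrated with a small sketch based on FFTW's C API. This sketch is only an illustration of the plan lifecycle (it is not one of the example programs of this tutorial) and uses the same sine example:

#include <stdio.h>
#include <math.h>
#include <fftw3.h>   /* for the GPU port, replace with <cufftw.h> (see below) */

#define NT 512

int main(void) {
  /* Allocate input/output arrays (complex, double precision) */
  fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * NT);
  fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * NT);

  /* Step 1: create plans (forward and backward) */
  fftw_plan fwd = fftw_plan_dft_1d(NT, in, out, FFTW_FORWARD,  FFTW_ESTIMATE);
  fftw_plan bwd = fftw_plan_dft_1d(NT, out, in, FFTW_BACKWARD, FFTW_ESTIMATE);

  /* Fill the input with the sine example, f(t) = sin(2t) */
  double dt = 0.25;
  for (int i = 0; i < NT; i++) {
    double t = (-(double)(NT - 1) / 2.0 + i) * dt;
    in[i][0] = sin(2.0 * t);  /* real part */
    in[i][1] = 0.0;           /* imaginary part */
  }

  /* Step 2: execute the plans */
  fftw_execute(fwd);   /* forward transform: in -> out */
  fftw_execute(bwd);   /* backward transform: out -> in (unnormalized) */

  /* The backward transform is unnormalized, so divide by the number of points */
  for (int i = 0; i < NT; i++) {
    in[i][0] /= NT;
    in[i][1] /= NT;
  }

  /* Step 3: destroy the plans and free resources */
  fftw_destroy_plan(fwd);
  fftw_destroy_plan(bwd);
  fftw_free(in);
  fftw_free(out);
  return 0;
}

The Fortran code below implements the same create/execute/destroy sequence through the dfftw_ wrapper routines.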
Implementation of FFTW
The implementation of the FFTW library is shown below; a detailed description of the library can be found in the FFTW documentation.
As described in the code, one needs to initialize the FFT by creating plans. Executing the plans requires specifying the transform direction: FFTW_FORWARD for the forward direction or FFTW_BACKWARD for the backward (inverse) direction. These two values should be defined as integer parameters. An alternative is to include the file fftw3.f as a header (i.e. include "fftw3.f"), which contains all parameters required for general use of FFTW; if the file is included, the direction parameters do not need to be defined manually.
The argument FFTW_MEASURE in the routine dfftw_plan_dft_1d means that FFTW measures the execution time of several FFTs in order to find the optimal way to compute the transform, which might be time-consuming. An alternative is FFTW_ESTIMATE, which builds a reasonable plan without running any computation, although the result might be less optimal (see the FFTW documentation for further details).
Note that FFTW computes unnormalized transforms: applying the forward and then the backward transform multiplies the input by the size of the array (for example, with nt=512 samples the round trip returns 512 times the original signal). The output of the backward transform therefore needs to be divided by the size of the data, while the forward transform is used as is.
To check the outcome of the result in the forward direction, one can plot the function in the frequency domain, which should display peaks around ω=+2 and ω=-2 (the magnitude spectrum of a real signal is symmetric). Performing the backward FFT of the obtained function should return the initial function in the time domain (i.e. sin(2t)). This checking procedure also holds when using a GPU version of the FFT library.
For completeness, porting the FFTW library to cuFFTW does not require modifications in the code - it is done by replacing the file fftw3.h with cufftw.h.
module parameter_kind
implicit none
public
integer, parameter :: FFTW_FORWARD=-1,FFTW_BACKWARD=+1
integer, parameter :: FFTW_MEASURE=0
integer, parameter :: sp = selected_real_kind(6, 37) !Single precision
integer, parameter :: dp = selected_real_kind(15, 307) !Double precision
integer, parameter :: fp = dp
real(fp), parameter :: pi = 4.0_fp*atan(1.0_fp),dt=0.25_fp
end module parameter_kind
program fftw_serial
use parameter_kind
implicit none
!include "fftw3.f"
integer, parameter :: nt=512
integer :: i,ierr
integer*8 :: plan_forward,plan_backward
complex(fp), allocatable :: in(:),out(:),f(:)
real(fp), allocatable :: t(:),w(:)
allocate(t(nt),w(nt)); allocate(f(nt))
call grid_1d(nt,t,w)
!Example of sine function
do i=1,nt
f(i) = cmplx(sin(2.0_fp*t(i)),0.0_fp)
enddo
print*,"--sum before FFT", sum(real(f(1:nt/2)))
!Creating 1D plans
allocate(in(nt),out(nt))
call dfftw_plan_dft_1d(plan_forward,nt,in,out,FFTW_FORWARD,FFTW_MEASURE)
call dfftw_plan_dft_1d(plan_backward,nt,in,out,FFTW_BACKWARD,FFTW_MEASURE)
!Forward FFT
in(:) = f(:)
call dfftw_execute_dft(plan_forward, in, out)
f(:) = out(:)
!Backward FFT
call dfftw_execute_dft(plan_backward, out, in)
!The data from the backward transform are unnormalized, so they should be divided by N.
in(:) = in(:)/real(nt)
!Destroying plans
call dfftw_destroy_plan(plan_forward)
call dfftw_destroy_plan(plan_backward)
print*,"--sum iFFT", sum(real(in(1:nt/2)))
!Printing the FFT of sin(2t)
do i=1,nt/2
write(204,*)w(i),dsqrt(cdabs(f(i))**2)
enddo
deallocate(in); deallocate(out); deallocate(f)
end
subroutine grid_1d(nt,t,w)
use parameter_kind
implicit none
integer :: i,nt
real(fp) :: t(nt),w(nt)
!Defining a uniform temporal grid
do i=1,nt
t(i) = (-dble(nt-1)/2.0_fp + (i-1))*dt
enddo
!Defining a uniform frequency grid
do i=0,nt/2-1
w(i+1) = 2.0_fp*pi*dble(i)/(nt*dt)
enddo
do i=nt/2,nt-1
w(i+1) = 2.0_fp*pi*dble(i-nt)/(nt*dt)
enddo
end subroutine grid_1d
Compilation process of FFTW
The FFTW library should be linked with fftw3 (i.e. -lfftw3) for the double precision case, and with fftw3f (i.e. -lfftw3f) for the single precision case.
Here is an example of a module to be loaded.
On Saga:
$ module load FFTW/3.3.9-intel-2021a
The same module is available on Betzy.
To compile:
$ ifort -lfftw3 -o fftw.serial fftw_serial.f90
When using the cuFFTW library, the compilation command must link both the cuFFT and cuFFTW libraries.
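As an example, the C sketch shown earlier in this section could be built in the two variants roughly as follows (the file name is hypothetical, and you may additionally need to point the compiler at the FFTW or CUDA include and library directories provided by the modules):
$ gcc fftw_sketch.c -lfftw3 -lm -o fftw_sketch
$ nvc fftw_sketch.c -lcufftw -lcufft -o cufftw_sketch
In the second case the include of fftw3.h in the source is replaced by cufftw.h, as described above.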
Implementation of cuFFT
We consider the same scenario as described in the previous section, but this time the implementation involves communication between the CPU host and the GPU device by calling the cuFFT library. The cuFFT implementation is shown below.
Similarly to the FFTW library, the implementation of the GPU-accelerated cuFFT library is conceptually based on creating plans, executing them and destroying them. The difficulty here, however, is how to call the cuFFT library, which is written in a low-level programming model, from an OpenACC application interface. In this scenario some steps are executed by the cuFFT library and others by OpenACC kernels, and executing all of these steps requires sharing data. In other words, it requires making OpenACC aware of the GPU-accelerated cuFFT library. This is done in OpenACC with the directive host_data together with the clause use_device(list-of-arrays). This combination makes the device addresses of the arrays listed in the use_device() clause available in host code. The arrays, which must already be present in device memory, are in turn passed to the cuFFT functions (cufftExecZ2Z() in our example). The output of these functions is unnormalized, so it needs to be normalized by dividing by the size of the array. The normalization may be followed by a call to cufftDestroy() to free all GPU resources associated with a cuFFT plan and destroy the internal plan data structure.
It is worth noting that cuFFT executes on a CUDA stream, which is not automatically the stream used by OpenACC. It is therefore necessary to make cuFFT run on the OpenACC stream. This is done by calling the routine cufftSetStream(), which is part of the cuFFT module, with the result of the OpenACC runtime function acc_get_cuda_stream(), which returns the CUDA stream used by OpenACC.
Note that the use of the OpenACC runtime routines and the cuFFT routines requires the statements use openacc and use cufft.
The tables below summarize the plan-creation and execution functions for multi-dimensional data with single or double precision complex data types (see the cuFFT documentation for more details).
module parameter_kind
implicit none
public
integer, parameter :: sp = selected_real_kind(6, 37) !Single precision
integer, parameter :: dp = selected_real_kind(15, 307) !Double precision
integer, parameter :: fp = dp
real(fp), parameter :: pi = 4.0_fp*atan(1.0_fp),dt=0.25_fp
end module parameter_kind
program cufft_acc
use parameter_kind
use cufft
use openacc
implicit none
integer, parameter :: nt=512
integer :: i,ierr,plan
complex(fp), allocatable :: in(:),out(:)
real(fp), allocatable :: t(:),w(:)
allocate(t(nt),w(nt)); allocate(in(nt),out(nt))
call grid_1d(nt,t,w)
!Example of a sine function
do i=1,nt
in(i) = cmplx(sin(2.0_fp*t(i)),0.0_fp)
enddo
print*,"--sum before FFT", sum(real(in(1:nt/2)))
!cufftExecZ2Z executes a double precision complex-to-complex transform plan
ierr = cufftPlan1D(plan,nt,CUFFT_Z2Z,1)
!acc_get_cuda_stream: tells the openACC runtime to identify the CUDA
!stream used by CUDA
ierr = ierr + cufftSetStream(plan,acc_get_cuda_stream(acc_async_sync))
!$acc data copy(in) copyout(out)
!$acc host_data use_device(in,out)
ierr = ierr + cufftExecZ2Z(plan, in, out, CUFFT_FORWARD)
ierr = ierr + cufftExecZ2Z(plan, out, in, CUFFT_INVERSE)
!$acc end host_data
!$acc kernels
out(:) = out(:)/nt
in(:) = in(:)/nt
!$acc end kernels
!$acc end data
ierr = ierr + cufftDestroy(plan)
print*,""
if(ierr.eq.0) then
print*,"--Yep it works :)"
else
print*,"Nop it fails, I stop :("
endif
print*,""
print*,"--sum iFFT", sum(real(in(1:nt/2)))
!printing the fft of sinus
do i=1,nt/2
write(204,*)w(i),sqrt(cabs(out(i))**2)
enddo
deallocate(in); deallocate(out)
end
subroutine grid_1d(nt,t,w)
use parameter_kind
implicit none
integer :: i,nt
real(fp) :: t(nt),w(nt)
!Defining a uniform temporal grid
do i=1,nt
t(i) = (-dble(nt-1)/2.0_fp + (i-1))*dt
enddo
!Defining a uniform frequency grid
do i=0,nt/2-1
w(i+1) = 2.0_fp*pi*dble(i)/(nt*dt)
enddo
do i=nt/2,nt-1
w(i+1) = 2.0_fp*pi*dble(i-nt)/(nt*dt)
enddo
end subroutine grid_1d
| Dimension | Creating a FFT plan |
|---|---|
| 1D | cufftPlan1D( plan, nx, FFTtype, 1 ) |
| 2D | cufftPlan2d( plan, ny, nx, FFTtype ) |
| 3D | cufftPlan3d( plan, nz, ny, nx, FFTtype ) |
Table 1. Creating FFT plans in 1D, 2D and 3D. nx is the size of a 1D array, nx and ny the size of a 2D array, and nx, ny, nz the size of a 3D array. FFTtype specifies the data type stored, as described in Table 2.
| Precision of the transformed plan | Subroutine | FFTtype |
|---|---|---|
| Double precision complex-to-complex | cufftExecZ2Z( plan, in, out, direction ) | "CUFFT_Z2Z" |
| Single precision complex-to-complex | cufftExecC2C( plan, in, out, direction ) | "CUFFT_C2C" |
Table 2. Executing a double precision/single precision complex-to-complex transform plan in a specified FFT direction: "CUFFT_FORWARD" for the forward FFT and "CUFFT_INVERSE" for the backward FFT. The input data are stored in the array in, and the result of the FFT for the specified direction is stored in the array out.
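The same calls can also be made from a C/OpenACC code. The following is a minimal, untested sketch of the pattern, mirroring the Fortran example above (it is not part of the original tutorial code): plan creation, attaching the OpenACC stream, host_data around the execution calls, normalization in an OpenACC kernel, and plan destruction.

#include <stdio.h>
#include <math.h>
#include <openacc.h>
#include <cufft.h>    /* cuFFT C API; also provides cufftDoubleComplex */

#define NT 512

int main(void) {
  cufftDoubleComplex in[NT], out[NT];
  double dt = 0.25;

  /* Same sine example as before, f(t) = sin(2t) */
  for (int i = 0; i < NT; i++) {
    double t = (-(double)(NT - 1) / 2.0 + i) * dt;
    in[i].x = sin(2.0 * t);  /* real part */
    in[i].y = 0.0;           /* imaginary part */
  }

  /* Create a 1D double precision complex-to-complex plan */
  cufftHandle plan;
  int err = cufftPlan1d(&plan, NT, CUFFT_Z2Z, 1);

  /* Let cuFFT work on the CUDA stream used by OpenACC
     (acc_get_cuda_stream returns a void* that is cast to cudaStream_t) */
  err += cufftSetStream(plan, (cudaStream_t) acc_get_cuda_stream(acc_async_sync));

  #pragma acc data copy(in) copyout(out)
  {
    /* Pass the device addresses of 'in' and 'out' to cuFFT */
    #pragma acc host_data use_device(in, out)
    {
      err += cufftExecZ2Z(plan, in, out, CUFFT_FORWARD);
      err += cufftExecZ2Z(plan, out, in, CUFFT_INVERSE);
    }

    /* The transforms are unnormalized, so divide by the number of points */
    #pragma acc kernels
    for (int i = 0; i < NT; i++) {
      in[i].x /= NT;  in[i].y /= NT;
      out[i].x /= NT; out[i].y /= NT;
    }
  }

  cufftDestroy(plan);
  if (err == CUFFT_SUCCESS) {
    printf("cuFFT calls succeeded\n");
  } else {
    printf("a cuFFT call failed\n");
  }
  return 0;
}

A compile line along the lines of nvc -acc -Minfo=accel -cudalib=cufft -o cufft_acc_sketch cufft_acc_sketch.c should work once the NVHPC and CUDA modules are loaded (the file name is hypothetical).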
Compilation process of cuFFT
The cuFFT library is part of the CUDA toolkit, and thus it is supported by the NVIDIA compilers. Therefore, the only modules required are the NVHPC and CUDA modules.
Modules to be loaded (the exact versions may differ between systems):
$ module load NVHPC/21.11 CUDA/11.4.1
$ module load NVHPC/21.7 CUDA/11.4.1
We compile using the NVIDIA Fortran compiler nvfortran. The compilation requires linking the cuFFT library (-lcufft) and adding the CUDA library flag (-cudalib=cufft).
$ nvfortran -lcufft -cudalib=cufft -acc -Minfo=accel -o cufft.acc cufft_acc.f90
Here the flag -acc enables OpenACC on NVIDIA GPUs. It is also possible to specify the compute capability, e.g. -gpu=cc80 for Betzy and -gpu=cc60 for Saga.
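For example, a complete compile line targeting Betzy might look as follows (untested; use -gpu=cc60 instead for Saga):
$ nvfortran -acc -gpu=cc80 -Minfo=accel -cudalib=cufft -o cufft.acc cufft_acc.f90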
To run:
$ srun --partition=accel --gpus=1 --time=00:01:00 --account=nnXXXXX --qos=devel --mem-per-cpu=1G ./cufft.acc
Conclusion
In conclusion, we have described how to use the GPU-accelerated cuBLAS and cuFFT libraries targeting NVIDIA GPUs. The examples illustrate how to call a GPU-accelerated library written in a low-level programming model from an OpenACC or OpenMP application interface. We have also documented the use of the FFTW library in a serial case and highlighted its GPU port, the cuFFTW library. Although the FFTW and cuFFT examples are implemented for a 1D problem, the extension to 2D and 3D scenarios is straightforward.