Offloading to GPU using Fortran 2008
Introduction
In 2010 the ISO standard Fortran 2008 introduced the do concurrent construct, which makes it possible to express loop-level parallelism in standard Fortran. Current compilers support multicore parallel execution, but at present only NVIDIA offers a compiler that supports offloading to their GPUs. The NVIDIA HPC Fortran compiler (formerly PGI Fortran) supports using the do concurrent construct to offload to NVIDIA GPU accelerators.
This provides a simple way of using accelerators without extra libraries and without deviating from the standard language; no compiler directives of any kind are needed. Because the code is plain standard Fortran, portability is not an issue: the program will use whatever means of parallel execution is available on the current platform, be it multicore CPUs or, as in this example, offloading to a highly parallel GPU.
The 2008 standard will be supported for the foreseeable future, just as many compilers still support the Fortran 66 standard.
This makes do concurrent a simple, user-friendly, future-proof and portable way of offloading to accelerators.
The example code used here can be found in the NRIS example repo.
Example code using SAXPY
Writing the actual Fortran 2008 standard code is surprisingly easy. Here is a simple example using SAXPY (Single precision A*X Plus Y, y = a*x + y).
There are two Fortran approaches, one using indexed addressing, element by element:
```fortran
do concurrent (i = 1:n)
   y(i) = y(i) + a*x(i)
end do
```
or the vector syntax introduced in the Fortran 90 standard:
```fortran
do concurrent (i = 1:1)
   y = y + a*x
end do
```
The vector syntax does not actually need a loop, but to make use of the parallel do concurrent construct the statement is wrapped in a loop that makes only a single pass.
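For completeness, a minimal self-contained program around the indexed variant could look like the following sketch (the array size n and the initial values are illustrative assumptions, not part of the original example):

```fortran
program saxpy
   use iso_fortran_env, only: real32
   implicit none
   integer, parameter :: n = 10000000
   real(real32), allocatable :: x(:), y(:)
   real(real32) :: a
   integer :: i

   allocate(x(n), y(n))
   a = 2.0_real32
   x = 1.0_real32
   y = 3.0_real32

   ! The compiler maps this loop to threads or to GPU kernels,
   ! depending on the -stdpar flag given at compile time.
   do concurrent (i = 1:n)
      y(i) = y(i) + a*x(i)
   end do

   print *, 'y(1) =', y(1)
end program saxpy
```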
The parallel loop can be compiled for a threaded multicore architecture using:
```
$ nvfortran -o saxpy.x -stdpar=multicore saxpy.f90
```
or for GPU offload by using:
```
$ nvfortran -o saxpy.x -stdpar=gpu saxpy.f90
```
As SAXPY is mostly data movement with little computation, the gain from using a GPU is small: copying the data from main memory to device memory is the limiting factor.
Example with more computation
Here we use an example with a bit more computation.
Using indexed syntax:
```fortran
do concurrent (i=1:M, j=1:M, k=1:M)
   Z(k,j,i) = X(k,j,i) * Z(k,j,i)**2.01_real64
   Z(k,j,i) = sin(log10(X(k,j,i) / Z(k,j,i)))
end do
```
or Fortran 90 vector syntax, where the loop has only one iteration:
```fortran
do concurrent (i=1:1)
   Z = X * Z**2.01_real64
   Z = sin(log10(X / Z))
end do
```
| Run | Run time [seconds] |
|---|---|
| Indexed syntax CPU 1 core | 14.5285 |
| Vector syntax CPU 1 core | 14.5234 |
| Indexed syntax GPU A100 | 0.4218 |
| Vector syntax GPU A100 | 0.4149 |
With more flops per byte transferred, the speedup from offloading to the GPU is higher; a speedup of about 34x over a single core is a nice improvement.
The nvfortran compiler is capable of generating offload code from the indexed addressing syntax as well as from the vector syntax.
Old legacy code example
We can also look at the matrix-matrix multiplication reference implementation (DGEMM), dated 8-February-1989. This is found at Basic Linear Algebra, level 3 matrix/matrix operations, or download the Fortran 77 reference implementation, which contains DGEMM and also contains the support files needed.
Download the legacy code and change the comment character to fit the Fortran 90 standard:

```
$ cat dgemm.f | sed 's/^\*/!/' > dgemm.f90
```
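The only change is in column one; as a purely illustrative example (these lines are not quoted from the actual file):

```fortran
*     a fixed-form Fortran 77 comment marked with * in column one
!     the same comment marked with !, which Fortran 90 also accepts
```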
The BLAS multiplication routines come in 4 flavors:

- S single (32 bit) precision
- D double (64 bit) precision
- C complex single precision
- Z complex double precision
Assume well behaved matrices and the operation C := alpha*A*B + beta*C, with a call to dgemm like:

```fortran
call dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
```
Locate the outermost loop of this multiplication, at about line 228 in the reference file. Change:

```fortran
      DO 90 J = 1,N
```

to:

```fortran
      DO concurrent (J = 1 : N)
```

and change the matching:

```fortran
   90 CONTINUE
```

to:

```fortran
      end do
```

This is all that is needed to offload to the GPU; the rest is up to the compiler.
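For orientation, the modified region then has roughly the shape sketched below. This is a simplified rendering of the 'n','n' case only, not the actual reference code (which also special-cases alpha and beta); the block construct gives every iteration its own temp:

```fortran
do concurrent (j = 1:n)        ! the columns of C are independent
   block
      double precision :: temp
      integer :: i, l
      c(1:m,j) = beta*c(1:m,j)
      do l = 1, k
         temp = alpha*b(l,j)
         do i = 1, m
            c(i,j) = c(i,j) + temp*a(i,l)
         end do
      end do
   end block
end do
```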
Building this code needs some extra files, lsame.f and xerrbla.c.
One example of how to build a threaded version for a multicore CPU could look like:

```
$ nvfortran -O3 -stdpar=multicore dgemm-test.f90 dgemm.f90 xerrbla.o lsame.f
```
or to build an offloaded GPU version:

```
$ nvfortran -O3 -stdpar=gpu dgemm-test.f90 dgemm.f90 xerrbla.o lsame.f
```
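The test driver dgemm-test.f90 is not reproduced here; a minimal sketch of such a driver (the matrix size N, the timing method and the Gflops/s arithmetic are our own choices) might look like:

```fortran
program dgemm_test
   use iso_fortran_env, only: real64, int64
   implicit none
   integer, parameter :: N = 2000
   real(real64), allocatable :: a(:,:), b(:,:), c(:,:)
   real(real64) :: alpha, beta, seconds
   integer(int64) :: c0, c1, crate

   allocate(a(N,N), b(N,N), c(N,N))
   call random_number(a)
   call random_number(b)
   c = 0.0_real64
   alpha = 1.0_real64
   beta  = 0.0_real64

   call system_clock(c0, crate)
   call dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
   call system_clock(c1)

   seconds = real(c1-c0, real64) / real(crate, real64)
   ! A matrix-matrix multiplication performs about 2*N**3 floating point operations.
   print *, 'Performance:', 2.0_real64*real(N,real64)**3 / seconds / 1.0e9_real64, 'Gflops/s'
end program dgemm_test
```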
| Run | Build flags | Cores / device | Performance |
|---|---|---|---|
| Reference f77 | -O3 | 1 | 4.41 Gflops/s |
| Reference f90 | -O3 -stdpar=multicore | 2 | 6.27 Gflops/s |
| Reference f90 | -O3 -stdpar=multicore | 16 | 24.67 Gflops/s |
| Reference f90 | -O3 -stdpar=gpu | RTX2080 | 43.81 Gflops/s |
| Reference f90 | -O3 -stdpar=gpu | A100 | 112.23 Gflops/s |
The results are stunning: changing only one line in the old legacy code, from do to do concurrent, speeds it up from 4.41 Gflops/s to 112.23 Gflops/s, a 25x increase in performance.
An interesting test is to compare this more than 30 year old reference code with a call to a modern library; the calling syntax is still the same. Scientific application Fortran code will probably behave like the 30 year old example, while optimized libraries generally show far higher performance.