Calling fortran routines from Python

Introduction

While Python is an effective language for development, it is not very fast at executing code. There are several tricks available to get high numerical performance, of which calling fortran routines is one.

While library functions in both numpy and scipy perform nicely in many cases, one often needs to write routines for which no library exists, either by writing them from scratch or by using fortran routines from co-workers or other sources. In either case it’s a good way of getting high performance for the time-consuming parts of a run.

The following topics are covered:

Plain fortran with GNU fortran (the default)
Fortran with calls to math library (MKL)
The Intel fortran compiler to compile your fortran source code
Optimising performance for fortran code, compiler flags.
Intel fortran and MKL
Intel fortran and multithreaded MKL
Python with multithreaded OpenMP fortran routines

Note

A short disclaimer: with regard to matrix matrix multiplication, the library in numpy is comparable in performance to the Intel MKL.

Note

Another disclaimer is that this has been tested on Saga. There might be some minor issues on Betzy, whose AMD processors do not have 512-bit AVX.

Using the numpy interface

The numpy package contains tools to facilitate calling fortran routines directly from Python. The utility f2py3 can be used directly or, more indirectly, by launching Python with the numpy.f2py module to process the fortran source code. In both cases the fortran code containing the subroutine definitions is compiled by a fortran compiler into object files, which are subsequently linked into a single shared object library file (an .so file).

A nice introduction by NTNU is available. It covers some basics and should be read as an introduction. Issues with array arguments and assumed shapes are explained there.

Modern fortran uses kind parameters, «magic» constants, to set attributes such as the size or range of variables. Their values can be any number; they often equal the number of bytes, but not always, so don’t rely on this. The named constants normally state the number of bits for a given variable. Kinds can also be requested through self-specified ranges and precisions with the help of the selected_int_kind and selected_real_kind functions.

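subroutine foo
	implicit none
	! kind constants must be declared as integer parameters
	integer, parameter :: int32 = selected_int_kind(8)
	integer, parameter :: int64 = selected_int_kind(16)
	integer, parameter :: real32  = selected_real_kind(p=6,r=20)
	integer, parameter :: real64  = selected_real_kind(p=15,r=307)
	
	integer(int32) :: int
	integer(int64) :: longint
end subroutine foo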

or a simpler solution is to use a standard fortran module:

subroutine foo
	use iso_fortran_env
	implicit none
	
	real(real32) :: float
	real(real64) :: longfloat
end subroutine foo

While the first one is more pedagogic, the second one is simpler, and iso_fortran_env contains a lot more information.

Python supports both 32 and 64 bit integers and floats. However, the mapping between the fortran kind specification and Python/numpy is not set by default. To map from the fortran standard naming to C naming, a map needs to be provided. The map file needs to reside in the working directory and must have the name .f2py_f2cmap. An example mapping fortran syntax to C syntax for simple integers and floats can look like:

dict(real=dict(real64='double', real32='float'),
     complex=dict(real32='complex_float', real64='complex_double'),
     integer=dict(int32='int', int64='long')
     )

This helps f2py3 map the fortran data types to the corresponding C data types. An alternative is to use the C mapping directly.
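Whether 'int' and 'long' in the map above really are 32 and 64 bits wide depends on the C ABI of the platform. A minimal sketch, using only the standard ctypes module, that prints the widths of the C types named in such a map. The dictionary c_types is an illustrative helper and not part of f2py; 'long_long' is included only for comparison.

```python
import ctypes

# C types named in the .f2py_f2cmap example, paired with their ctypes equivalents.
# The dictionary name and layout are illustrative, not part of f2py.
c_types = {
    "float": ctypes.c_float,
    "double": ctypes.c_double,
    "int": ctypes.c_int,
    "long": ctypes.c_long,
    "long_long": ctypes.c_longlong,
}

# Width in bits of each C type on the current platform.
widths = {name: 8 * ctypes.sizeof(ctype) for name, ctype in c_types.items()}

for name, bits in widths.items():
    print(f"{name:10s} {bits:3d} bits")
```

On 64-bit Linux 'long' is normally 64 bits wide. On a platform where it turns out to be 32 bits, 'long_long' would be the safer target for int64.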

For complex variables the same logic applies: the size is measured in bits and must fit two numbers (the real and imaginary parts) occupying 64 bits each, hence 128 bits.

x=np.zeros((n), dtype=np.complex128, order='F')
y=np.zeros((n), dtype=np.complex128, order='F')

and the corresponding fortran code, where each of the two parts is specified as 64 bits:

complex(real64), dimension(n), intent(in) :: x
complex(real64), dimension(n), intent(inout):: y
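The sizes can be checked from the Python side before anything is passed to fortran. A small sketch (the value of n is arbitrary) that confirms the element width and the fortran-contiguous layout of such an array:

```python
import numpy as np

n = 8  # arbitrary length, for illustration only

x = np.zeros(n, dtype=np.complex128, order='F')

bits_per_element = x.itemsize * 8        # real and imaginary parts together
bits_per_part = x.real.itemsize * 8      # the real part alone is a float64
is_fortran_ordered = x.flags['F_CONTIGUOUS']

print(bits_per_element, bits_per_part, is_fortran_ordered)
```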

The importance of keeping control over data types and their ranges can hardly be stressed more than by pointing to the Ariane-5 failure or, even worse because people were killed, the Therac-25 incident.

Compiling fortran code

To start using Python with fortran code a module needs to be loaded: module load Python/3.9.6-GCCcore-11.2.0

The command line to generate the Python importable module can be one of the following; the second can be used if f2py3 is not available.

f2py3 -c pi.f90 -m pi
python3 -m numpy.f2py -c pi.f90 -m pi

In both cases a module is generated which can be imported as a normal Python module. The -m pi argument gives the name of the module; here it’s identical to the name of the subroutine, but it doesn’t need to be.

A simple fortran routine to calculate Pi:

subroutine pi(p,n)
  use iso_fortran_env
  implicit none
  real(real64), intent(out) :: p
  integer(int64), intent(in) :: n
  
  integer(int64) :: j
  real(real64) :: h, x, sum
  
  sum=0.0_real64 ! set accumulating vars to 0.
  h = 1.0_real64/n
  do j = 1,n
     x = h*(j-0.5_real64)
     sum = sum + (4.0_real64/(1.0_real64+x*x))
  end do
  p = h*sum
  return
end subroutine pi

Be aware that the intent of the arguments is important. Also note that variables are not re-initialised on repeated calls, so set accumulating variables to zero in the body, not in the declaration. Once the routine is loaded into memory the variables reside in memory, and there is no magic initialisation for each subsequent call (look into the save attribute in fortran, which a variable initialised in its declaration receives implicitly).

This fortran routine can be called from a Python script like:

import pi

p=pi.pi(1000)

print("Pi calculated ",p)

With a result like:

Pi calculated  3.1415927369231227

We import the generated module; its name, pi, corresponds to the -m <name> argument, while the function called, pi, has the same name as the fortran routine.
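The result can be cross-checked without any compiled code. The following is a plain Python sketch of the same midpoint-rule sum; the function name pi_midpoint is chosen here for illustration and is not part of the generated module.

```python
import math

def pi_midpoint(n):
    """Approximate pi by the midpoint rule applied to 4/(1+x*x) on [0, 1]."""
    h = 1.0 / n
    total = 0.0  # reset on every call, like the accumulator in the fortran routine
    for j in range(1, n + 1):
        x = h * (j - 0.5)
        total += 4.0 / (1.0 + x * x)
    return h * total

p = pi_midpoint(1000)
print("Pi calculated ", p)
print("Error         ", abs(p - math.pi))
```

With n=1000 the value should agree with the fortran result above to well within rounding, and the remaining error against math.pi shrinks as n grows.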

Performance issues

While Python is easy to write and has many very nice features and applications, numerical performance is not among them.

In the following examples matrix matrix multiplication is used; it is a well known routine, which makes it a good candidate for performance comparison.

The following code is used to illustrate the performance using Python:

import time
import numpy as np

n = 100  # matrix size; illustrative value, the size behind the timing below is not stated

print("Matrix multiplication example")
x=np.zeros((n, n), dtype=np.float64, order='F')
y=np.zeros((n, n), dtype=np.float64, order='F')
z=np.zeros((n, n), dtype=np.float64, order='F')
x.fill(1.1)
y.fill(2.2)

start = time.perf_counter()
for j in range(n):
    for l in range(n):
        for i in range(n):
            z[i,j] = z[i,j] + x[i,l]*y[l,j]
print(f"Python code {time.perf_counter() - start:2.4f} secs")
print(z)

The following fortran code is used for matrix matrix multiplication:

subroutine mxm(a,b,c,n) 
  implicit none
  integer, parameter :: real64  = selected_real_kind(p=15,r=307)
  integer, parameter :: int32 = selected_int_kind(8)

  ! n is declared first because it is used in the array bounds below
  integer(int32), intent(in)  :: n
  real(real64), dimension(n,n), intent(in)  :: a,b
  real(real64), dimension(n,n), intent(inout) :: c
  integer(int32) :: i,j,l

  do j = 1,n
     do l = 1,n
        do i = 1,n
           c(i,j) = c(i,j) + a(i,l)*b(l,j)
        enddo
     enddo
  enddo
  
end subroutine mxm

To compare Python with fortran, build the module with the following command:

f2py3 --opt="-Ofast -fomit-frame-pointer -march=skylake-avx512"  -c mxm.f90 -m mxm

and run the Python script with python3 mxm.py

The Python script used to call the fortran code is:

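import time
import numpy as np
import mxm  # module built by f2py3 from mxm.f90

n = 1000  # matrix size; illustrative value, the size behind the timing below is not stated

a=np.zeros((n, n), dtype=np.float64, order='F')
b=np.zeros((n, n), dtype=np.float64, order='F')
c=np.zeros((n, n), dtype=np.float64, order='F')
a.fill(1.1)
b.fill(2.2)
start = time.perf_counter()
mxm.mxm(a,b,c,n)
print(f"f90 mxm {time.perf_counter() - start:2.4f} secs")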

The results are staggering: for the matrix matrix multiplication the simple fortran implementation performs over 2000 times faster than the Python code.

Language	Run time in seconds
Python	757.2706
f90	0.3099

This is expected, as the compiled fortran code is quite efficient while Python is interpreted.
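For a quick sanity check that does not need a compiled module, the interpreted loop can be compared with numpy's own matrix product at a deliberately small size. This is a sketch assuming n=50, an arbitrary choice and not the size behind the timings above; the variable names are illustrative.

```python
import time
import numpy as np

n = 50  # deliberately small so the interpreted triple loop finishes quickly

x = np.full((n, n), 1.1, dtype=np.float64, order='F')
y = np.full((n, n), 2.2, dtype=np.float64, order='F')

# Interpreted triple loop, same loop order as the Python example above.
z_loop = np.zeros((n, n), dtype=np.float64, order='F')
start = time.perf_counter()
for j in range(n):
    for l in range(n):
        for i in range(n):
            z_loop[i, j] += x[i, l] * y[l, j]
t_loop = time.perf_counter() - start

# Library call: numpy hands the product to its BLAS backend.
start = time.perf_counter()
z_lib = x @ y
t_lib = time.perf_counter() - start

print(f"Python loop  {t_loop:2.4f} secs")
print(f"numpy matmul {t_lib:2.4f} secs")
print("Results agree:", np.allclose(z_loop, z_lib))
```

Every element of the product equals n*1.1*2.2 here, which makes the two results easy to verify by eye.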

Using libraries, MKL

The Intel Math Kernel Library is assumed to be well known for its performance. It contains routines that, in most cases, exhibit very high performance. The routines are also for the most part threaded to take advantage of multiple cores.

In addition to the module already loaded (module load Python/3.9.6-GCCcore-11.2.0), one more module is needed to use Intel MKL: module load imkl/2022.2.1 (this module sets many environment variables; we use $MKLROOT to set the correct path for the MKL library files).

As f2py3 is a wrapper some extra information is needed to link with the MKL libraries. The simplest is to use static linking:

f2py3 --opt="-Ofast -fomit-frame-pointer -march=skylake-avx512"\
 ${MKLROOT}/lib/intel64/libmkl_gf_lp64.a\
 ${MKLROOT}/lib/intel64/libmkl_sequential.a\
 ${MKLROOT}/lib/intel64/libmkl_core.a\
 -c mxm.f90 -m mxm 

The above commands link in the dgemm routine from MKL.

subroutine mlib(a,b,c,n) 
  implicit none
  integer, parameter :: real32  = selected_real_kind(p=6,r=20)
  integer, parameter :: real64  = selected_real_kind(p=15,r=307)
  integer, parameter :: int32 = selected_int_kind(8)
  integer, parameter :: int64 = selected_int_kind(16)

  ! argument order (a,b,c,n) matches the call made from the Python script
  integer(int32), intent(in)  :: n
  real(real64), dimension(n,n), intent(in)  :: a,b
  ! c is read as well as written because beta=1, hence intent(inout)
  real(real64), dimension(n,n), intent(inout) :: c
  real(real64) :: alpha=1.0_real64, beta=1.0_real64
  
  ! dgemm computes c := alpha*a*b + beta*c
  call dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)

end subroutine mlib

and a Python script to call it:

import time
import numpy as np
import mxm  # module built by f2py3 from mxm.f90

n = 5000  # matrix size used for the timings below

a=np.zeros((n, n), dtype=np.float64, order='F')
b=np.zeros((n, n), dtype=np.float64, order='F')
c=np.zeros((n, n), dtype=np.float64, order='F')
a.fill(1.1)
b.fill(2.2)
start = time.perf_counter()
mxm.mlib(a,b,c,n)
print(f"mxm MKL lib {time.perf_counter() - start:2.4f} secs")

Running the Python script with n=5000 we get the results below.

Routine	Run time in seconds
Fortran code	88.566
MKL library	2.90
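dgemm computes C := alpha*A*B + beta*C, so with beta set to 1 the incoming contents of c are added to the product. A minimal numpy sketch of that update rule follows; the helper name dgemm_reference is hypothetical, and this is a reference formula, not a call into MKL.

```python
import numpy as np

def dgemm_reference(alpha, a, b, beta, c):
    """Return alpha*a@b + beta*c, the update that BLAS dgemm applies to c."""
    return alpha * (a @ b) + beta * c

n = 4  # small illustrative size
a = np.full((n, n), 1.1, order='F')
b = np.full((n, n), 2.2, order='F')
c = np.zeros((n, n), order='F')

once = dgemm_reference(1.0, a, b, 1.0, c)      # c starts at zero: plain product
twice = dgemm_reference(1.0, a, b, 1.0, once)  # beta=1 accumulates into c

print(once[0, 0], twice[0, 0])
```

This is why the array passed as c should be zeroed before the call when only the plain product is wanted.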

Using a different fortran compiler, Intel

While gfortran, used by default, generates nice executable code, it does not always match the Intel fortran compiler when it comes to performance. It might be beneficial to switch to the Intel compiler.

In order to have Python, the Intel compiler and MKL together, load the module: SciPy-bundle/2022.05-intel-2022a

Then we compile the fortran code:

f2py3  --fcompiler=intelem  --opt="-O3 -xcore-avx512"\
 -c mxm.f90 -m mxm

Running the same Python script with n=5000 we arrive at the following run times:

Compiler/library	Run time in seconds
GNU fortran	88.566
Intel ifort	9.5695

The Intel compiler is known for its performance when compiling the matrix matrix multiplication.

We can also use the MKL library in conjunction with the Intel compiler, but it’s a bit more work. First static linking:

f2py3  --fcompiler=intelem --opt="-O3 -xcore-avx512"\
 ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a\
 ${MKLROOT}/lib/intel64/libmkl_sequential.a\
 ${MKLROOT}/lib/intel64/libmkl_core.a\
 -c mxm.f90 -m mxm

Compiler/library	Run time in seconds
GNU fortran	88.566
Intel ifort	9.5695
MKL dgemm	2.712

It’s also possible to use dynamic linking:

f2py3  --fcompiler=intelem --opt="-O3 -xcore-avx512"\
 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_avx512\
 -c mxm.f90 -m mxm

Then just launch the script as before. Performance is comparable as it’s the same library.

Testing for even higher performance with the Intel compiler ifort, we can try more optimising flags (runs with n=10000):

ifort flags	Run time
Defaults (no flags given)	1122 secs.
-O2	1110 secs.
-O3	153 secs.
-O3 -xavx2	81.8 secs.
-O3 -xcore-avx512	72.5 secs.
-O3 -xcore-avx512 -qopt-zmm-usage=high	54.1 secs.
-Ofast -xcore-avx512 -qopt-zmm-usage=high	53.9 secs.
-Ofast -unroll -xcore-avx512 -qopt-zmm-usage=high -heap-arrays -fno-alias	53.7 secs.
-fast -unroll -xcore-avx512 -qopt-zmm-usage=high	53.6 secs.

Selecting the right flags can have a dramatic effect on performance. In addition, the optimal flags for one routine might not be right for another.

Using many cores with MKL library

As the MKL libraries are multithreaded they can be run on multiple cores.

To achieve this, just build using the multithreaded versions of the library. Using static linking:

f2py3  --fcompiler=intelem --opt="-O3 -xcore-avx512"\
 ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a\
 ${MKLROOT}/lib/intel64/libmkl_intel_thread.a\
 ${MKLROOT}/lib/intel64/libmkl_core.a\
 -c mxm.f90 -m mxm

or dynamic linking:

f2py3  --fcompiler=intelem --opt="-O3 -xcore-avx512"\
 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_avx512 -liomp5\
 -c mxm.f90 -m mxm

The OpenMP OMP_NUM_THREADS environment variable can then be used to control the number of cores to use.

This time we run the Python script with a somewhat larger size, n=10000, and with export OMP_NUM_THREADS=2 and larger values.

Threads	Run times in seconds
1	21.2914
2	12.5923
4	7.0082
8	4.1504

While scaling is not perfect, there is a significant speedup from using extra cores.
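How far from perfect the scaling is can be quantified from the table above. A short sketch computing speedup (single-thread time divided by N-thread time) and parallel efficiency (speedup divided by N); the variable names are illustrative.

```python
# Run times in seconds from the table above, keyed by OMP_NUM_THREADS.
run_times = {1: 21.2914, 2: 12.5923, 4: 7.0082, 8: 4.1504}

baseline = run_times[1]
speedup = {threads: baseline / t for threads, t in run_times.items()}
efficiency = {threads: s / threads for threads, s in speedup.items()}

print("Threads  Speedup  Efficiency")
for threads in run_times:
    print(f"{threads:7d}  {speedup[threads]:7.2f}  {efficiency[threads]:10.2f}")
```

An efficiency of 1.0 would mean perfect scaling; the values fall as threads are added.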

Using many cores with fortran with OpenMP

It’s possible to call fortran functions containing OpenMP directives and get a speedup by using several cores. This is a nice alternative when dealing with real world code for which no library exists.

Consider the following fortran OpenMP code:

subroutine piomp(p, n)
  use iso_fortran_env
  implicit none
  real(real64), intent(out) :: p
  integer(int64), intent(in) :: n
  integer(int64) :: i
  real(real64) ::  sum, x, h
  
  h = 1.0_real64/n
  sum = 0.0_real64
! This OpenMP directive informs the compiler to generate a multithreaded loop.
! x must be private, otherwise the threads overwrite each other's value.
!$omp parallel do private(i,x) reduction(+:sum)
  do i = 1,n
     x = h*(i-0.5_real64)
     sum = sum + (4.0_real64/(1.0_real64+x*x))
  enddo
!$omp end parallel do
  p = h*sum
end subroutine piomp

Building the module for Python using:

f2py3  --fcompiler=intelem --opt="-qopenmp -O3 -xcore-avx512"\
 -D__OPENMP -liomp5  -c pi.f90 -m pi

The OpenMP library is linked explicitly with -liomp5 (for GNU it’s -lgomp).

Running using the following Python script :

import time
import pi

n=50000000000

start = time.perf_counter()
p=pi.pi(n)
print("Pi calculated ",p," ",time.perf_counter() - start," seconds")

start = time.perf_counter()
p=pi.piomp(n)
print("Pi calculated ",p," ",time.perf_counter() - start," seconds")

Scaling performance is nice:

Cores	Run time in seconds
1	31.26
2	16.28
4	8.528
8	4.217
16	2.547
32	1.900
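A sweep over thread counts like the one in the table can be scripted rather than typed by hand. The sketch below sets OMP_NUM_THREADS for a child process from Python. The child command is only a stand-in that reports the value it received; in practice it would be the timing script itself, and the thread counts listed are an arbitrary selection.

```python
import os
import subprocess
import sys

# Stand-in for the real workload: a child Python process that only reports the
# thread count it was given. In practice this would be the timing script,
# for example [sys.executable, "mxm.py"].
child_cmd = [sys.executable, "-c",
             "import os; print(os.environ['OMP_NUM_THREADS'])"]

seen = []
for threads in (1, 2, 4, 8):
    # Copy the current environment and override only OMP_NUM_THREADS.
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    result = subprocess.run(child_cmd, env=env, capture_output=True,
                            text=True, check=True)
    seen.append(int(result.stdout.strip()))

print(seen)
```

Starting a fresh process per thread count ensures the variable is in place before the OpenMP runtime is initialised.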