Getting started with Betzy
This page is a quick introduction for the projects invited to participate in the pilot phase. Relevant information will eventually be merged into the main documentation.
Access during pilot
During the pilot phase, access to Betzy is allowed only from the UH-sector.
You may log in either by connecting to the university network via VPN or an SSH tunnel and from there to Betzy via SSH, or from one of the other HPC machines (Fram or Saga), or from NIRD.
The address for the Betzy login nodes is:
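For instance, a login from outside the university network could use an SSH jump through one of the other machines. A minimal sketch; replace the placeholder with the Betzy login address above and `myuser` with your username:

```bash
# Jump through Fram to reach Betzy; <betzy-login-address> is a placeholder
# for the login address above:
ssh -J myuser@fram.sigma2.no myuser@<betzy-login-address>
```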
Each of the compute nodes has 128 cores (with logical cores it looks like 256). There are two sockets, each with a 64-core AMD processor. The AMD processors are not recognized by the Intel compiler, which may cause issues (see below). The AMD processors support 256-bit vector instructions (AVX and AVX2). Each compute node has 256 GiB of memory split into 8 NUMA nodes (see `numactl -H`).
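A quick way to inspect this layout from a compute node, assuming standard Linux tools are available:

```bash
nproc                                # number of logical cores (256 with SMT)
lscpu | grep -E 'Socket|Core|NUMA'   # sockets, cores per socket, NUMA nodes
numactl -H                           # NUMA node layout and per-node memory
```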
Please use the regular support channel to submit requests and mark the case with Betzy.
Betzy uses the SLURM queuing system. For more information about the queue system on Betzy, please check out the job types page.
NOTE: During the pilot phase there will be no CPU core limitations imposed.
For generic information about job submission, please check out our documentation here.
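As an illustration, a minimal Betzy job script might look like the sketch below; the account name is a placeholder and the resource numbers are examples, not confirmed values:

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk    # placeholder project account
#SBATCH --ntasks=256         # e.g. two full nodes of 128 cores each
#SBATCH --time=01:00:00

mpirun ./a.out               # see the MPI section below for launcher details
```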
Betzy uses a DDN powered 2.5 PB Lustre parallel file system with 51 GB/s bandwidth and 500k+ metadata operations per second.
The Lustre file system is mounted under `/cluster`.
Betzy: NO BACKUP
Please note that NO BACKUP is taken during the pilot phase.
The `/cluster` file system might be reformatted at the end of the pilot phase, before the machine is placed into production.
NIRD project file systems - `/trd-project[1-4]` - will be mounted on the Betzy login nodes, similarly to the Fram and Saga HPC clusters.
As on Fram and Saga, scientific software on Betzy will be installed using the EasyBuild system, and the Lmod modules tool will be used for changing environment setup via modulefiles.
The two common toolchains, foss and intel, will be installed on Betzy.

foss:
- GCC compilers (gcc, g++, gfortran)
- Open MPI library
- OpenBLAS (including LAPACK) + ScaLAPACK
- FFTW library

intel:
- Intel compilers (icc, icpc, ifort)
- Intel MPI library
- Intel MKL library (including BLAS, LAPACK, ScaLAPACK, FFT)
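Once installed, a toolchain is loaded through Lmod in the usual way; the version string below is a hypothetical example:

```bash
module avail foss intel      # list available toolchain versions
module load foss/2020a       # hypothetical version string
```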
Compiler optimization needs to be investigated case by case. Aggressive optimization should be applied only to files where it makes a difference, as it increases the probability of the compiler generating wrong code or of exposing floating-point issues in the application. A set of suggested compiler flags is given below. These have been tested and used, but users are advised to read the documentation and try other combinations, as not all codes are alike.
GCC 9.3:
- `-O3 -march=znver2 -mtune=znver2` (recommended)
- `-O3 -march=znver2 -mtune=znver2 -mfma -mavx2 -m3dnow -fomit-frame-pointer`

Intel 2019b:
- `-O3 -march=core-avx2`
- `-O3 -xavx`
- `-O3 -xavx2` (recommended, for everything except main())
- `-O3 -xcore-avx2`
The above applies to GCC 9.3 and Intel 2019b. The choices are listed in order of increasing performance as obtained with a DGEMM matrix multiplication test (which showed significant performance improvements); this was also verified using one of the major benchmarks. Please note that ifort performs substantially better (2-4x) than gfortran on the dgemm.f test. Also, building the main routine of the program with -xcore-avx2, -xavx or -xavx2 is not recommended: building main() (C and Fortran) with these flags triggers the Intel processor run-time check, causing the application to abort. A safe build pattern is shown in the sketch below.
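A sketch of a build that follows these constraints, assuming a hypothetical application split into main.c and compute.c:

```bash
# Compile main() without the -x flags, so the Intel processor run-time
# check does not abort on the AMD nodes:
icc -O3 -c main.c
# The rest of the code gets the recommended AVX2 flags:
icc -O3 -xavx2 -c compute.c
icc -O3 -o app main.o compute.o
```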
Both Open MPI and Intel MPI are installed and supported, each built for both the GNU and Intel compilers. Experience has shown that performance varies between them.
For running with Open MPI, processor binding is beneficial; adding `-bind-to core` is generally a good idea.
For running with Intel MPI, the setting
- I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=128-255 (run only on physical cores)
is a good starting point, as shown in the sketch below.
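In a job script this could look like the following minimal sketch:

```bash
# Exclude the logical (SMT) cores 128-255 so ranks are pinned to
# physical cores only:
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=128-255
mpirun ./a.out
```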
With a correct SLURM job script file, the only command needed to launch MPI programs is:
- Open MPI: `mpirun -bind-to core ./a.out`
- Intel MPI: `mpirun ./a.out`
Running Hybrid models
For hybrid models it is important to set up SLURM to provide access to all available cores. An example could look like this:
```bash
#SBATCH --ntasks=2048
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4
#SBATCH --exclusive
```
This will request 32 MPI ranks per node and leave cores for 4 threads per rank, i.e. each of the 32 ranks can schedule 4 threads, yielding a total of 128 cores, which is the maximum number of cores on each compute node. The `--exclusive` flag is important: if it is not set, SLURM will only allow the 32 allocated cores to be used (placing all 4 threads of a rank onto one core). In order to have free access to all cores, `--exclusive` needs to be set.
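A launch sketch to go with the request above; the thread count matches `--cpus-per-task`, and the binary name is a placeholder:

```bash
# 4 OpenMP threads per MPI rank, matching --cpus-per-task=4:
export OMP_NUM_THREADS=4
mpirun ./hybrid_app    # hybrid_app is a hypothetical MPI+OpenMP binary
```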
MKL performs a run-time check to select the correct code path to invoke. On Betzy this test fails to find an Intel processor and hence selects a code path compatible with all x86-64 processors. Setting the environment variable MKL_DEBUG_CPU_TYPE=5 will force MKL to select a code path that uses AVX2 instructions, increasing performance significantly.
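In practice this means exporting the variable before running anything linked against MKL:

```bash
# Force MKL's dispatcher to take the AVX2 code path on the AMD nodes:
export MKL_DEBUG_CPU_TYPE=5
mpirun ./a.out
```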
It is possible to use the Intel MKL library with the GNU compilers. This requires some work to resolve the symbol names at link time and the library path at run time, but it can provide a nice performance boost to both linear algebra and Fourier transforms.
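As a sketch of what the link step might look like (the source file is a placeholder, and the library list follows MKL's sequential GNU Fortran interface; check Intel's MKL link-line advisor for your exact combination):

```bash
# mkl_gf_lp64 is MKL's interface layer for gfortran; MKLROOT is set by
# the MKL environment:
gfortran -O3 -march=znver2 myprog.f90 -o myprog \
    -L${MKLROOT}/lib/intel64 \
    -lmkl_gf_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
```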