Memory and Core Utilization
This page describes special features of the Gaussian installation on NRIS machines, as well as general Gaussian issues that are only vaguely documented elsewhere.
Gaussian over Infiniband
First, note that the installed Gaussian suites are currently Linda parallel versions, so they scale beyond a single node. In addition, our Gaussian installation uses a little trick: loading of the executable is intercepted before launch and an alternative socket library is loaded. We have also taken care of the rsh/ssh setup in our installation procedure, to avoid any dependency on a .tsnet.config file at the user level. This enables Gaussian to run over the Infiniband network using the native IB protocol, giving us two advantages:
The parallel fraction of the code scales to more cores.
The shared memory performance is significantly enhanced (small scale performance).
Running Gaussian in parallel on two or more nodes requires the additional keywords %LindaWorkers and %NProcShared in the Link 0 part of the input file. In addition, because of the interception trick, the specific IB network node addresses must be added to the input file. This is taken care of by a wrapper script (g16.ib) around the original binary in each individual version folder. Please use the examples below as starting points when submitting jobs.
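A minimal Link 0 section might look like the sketch below. The resource numbers are placeholders, and the %LindaWorkers line is shown with dummy host names, since the g16.ib wrapper inserts the actual IB node addresses for you:

%LindaWorkers=node-1,node-2
%NProcShared=16
%mem=2000MB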
Advised job submit syntax using the wrapper:
g16.ib $input.com > g16_$input.out
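A corresponding minimal job script could look roughly like this sketch; the account name, module version and resource numbers are placeholders, so check our job examples for the authoritative scripts:

#!/bin/bash
#SBATCH --account=nnXXXXk      # placeholder: your project account
#SBATCH --time=01:00:00
#SBATCH --nodes=2              # one Linda worker per node

# Module name/version is an assumption; check 'module avail Gaussian'
module purge
module load Gaussian/g16

input=mymolecule               # expects mymolecule.com in the submit directory
g16.ib $input.com > g16_$input.out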
Parallel scaling
Gaussian is a rather large program suite with a range of different binaries, and users need to verify whether the functionality they use is parallelized, and how it scales in terms of both core and memory utilization.
Core utilization
Due to the preload Infiniband trick, we have a somewhat more generous policy when it comes to allocating cores/nodes to Gaussian jobs.
We strongly advise users to first study the scaling of the code for a representative test system.
Please do not reuse scripts inherited from others without studying the performance and scaling of your own jobs. We recommend taking our Gaussian job examples for the NRIS machines as a starting point; one simple approach to a scaling study is sketched below.
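A straightforward way to do such a study is to submit the same input at a few different node counts and compare the wall times; a sketch, where run.sh is a hypothetical base job script:

for n in 1 2 4 8; do
    sbatch --nodes=$n --job-name=scaling-$n run.sh
done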
Due to its versatility, it is hard to generalize about how the Gaussian binaries scale. We do know that plain DFT jobs on rather uncomplicated molecules like caffeine scale easily up to the core-count limit on Saga, and into the range of 16 nodes on Fram. On the other hand, jobs on transition-metal-containing molecules like Fe-corroles scale only moderately beyond 2 full nodes on Saga. As a general guideline, the range of 2-4 nodes seems to give decent scaling behaviour for most Linda executables (the lXYZ.exel binaries; see the Gaussian home folder on each NRIS machine). Also note that, due to the different node-sharing policies on Fram and Saga, an additional Slurm flag is needed when running Gaussian jobs on Saga, as illustrated below.
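Since Saga, unlike Fram, lets jobs share nodes by default, the extra flag is presumably a request for exclusive node access; a sketch (consult the Saga job examples for the exact setting):

#SBATCH --exclusive            # reserve whole nodes for the Linda workers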
Memory utilization
The %mem allocation of memory in the Gaussian input file means two things:
In general, it means memory per node, shared between the NProcShared processes, and in addition to the memory allocated per process. This is also documented by Gaussian.
For parallel jobs, it also sets the memory allocation that each LindaWorker will make; from Linda 9 onwards, the separate memory allocation on the master node is no longer needed.
However, setting %mem to less than 80% of the physical memory (the exact number depends on the node, since we have standard, medium-memory and big-memory nodes; see Fram and Saga) is good practice, as it leaves room for system buffers, disk I/O caching and other overhead, and keeps the system from swapping more than necessary. This is especially true for post-SCF calculations. On top of this, the %mem setting also influences performance: too high makes the job run slower, too low makes the job fail.
Please reconsider the memory setting in your input if jobs fail. Our job example is set up with 500 MB (which is actually a bit on the small side); test jobs were run with 2000 MB. Memory demand also increases with the number of cores, which in turn limits the size of problems that can be run at a given core count. For a DFT calculation with 500-1500 basis functions, %mem=25GB can be a reasonable setting.
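It can help to see %mem next to the corresponding Slurm request; a sketch with assumed numbers, keeping the roughly 80% headroom discussed above:

#SBATCH --mem=32G          # memory requested per node (assumed value; shared-node clusters like Saga)
# ... while the Gaussian input then asks for somewhat less:
#   %mem=25GB              # per node, shared among the NProcShared processes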
Note that the “heap size” also eats from the memory pool of the node (see below).
Management of large files
On Fram
As commented on the Optimizing storage performance page, there is an issue with very large temporary files (termed RW files in Gaussian). It is advisable to slice them into smaller parts using the lfs setstripe command.
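For example, setting a stripe count on the scratch directory before the job writes its RW files could look like this (the stripe count of 8 is an assumption; see the storage page for recommended values):

lfs setstripe -c 8 $SCRATCH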
On Saga
The corresponding situation for Saga is described on the BeeGFS filesystem (Saga) page.
Important aspects of the Gaussian setup on NRIS machines
On Fram
On Fram, we have not allocated swap space on the nodes, which means that the heap size for the Linda processes in Gaussian is crucial for making parallel jobs run. The line
export GAUSS_LFLAGS2="--LindaOptions -s 20000000"
sets the heap size for the Linda communication in Gaussian. The setting of 20 GB (the number above) is sufficient for most calculations. This is double the standard amount for Gaussian, but after testing it seems necessary when allocating more than 4 nodes (more than 4 Linda workers), and it is sufficient up to 8 nodes. Above 8 nodes, the Linda communication seems to need 30 GB, which amounts to half of the physical memory on a standard node and reduces the amount available for %mem accordingly.
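In job-script terms, this suggests scaling the -s value with the node count; the numbers below follow the text above and should be treated as starting points:

# up to 8 nodes (and necessary above 4): 20 GB Linda heap, double the Gaussian default
export GAUSS_LFLAGS2="--LindaOptions -s 20000000"
# above 8 nodes: 30 GB, half the standard node memory; reduce %mem accordingly
# export GAUSS_LFLAGS2="--LindaOptions -s 30000000"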