Packaging smaller parallel jobs into one large job
There are several ways to package smaller parallel jobs into one large parallel job. The preferred way is to use Array Jobs. Here we want to present a more pedestrian alternative which can give a lot of flexibility, but can also be a little more complicated to get right.
Note that how to use this mechanism has changed since Slurm 19.05.x (and might change again later).
In this example we imagine that we wish to run a job with 5 MPI job steps at the same time, each using 4 tasks, thus totalling 20 tasks:
#!/bin/bash
#SBATCH --account=YourProject # Substitute with your project name
#SBATCH --job-name=parallel_tasks_cpu
#SBATCH --ntasks=20
#SBATCH --time=0-00:05:00
#SBATCH --mem-per-cpu=2000M
# Safety settings
set -o errexit
set -o nounset
# Load MPI module
module --quiet purge
module load OpenMPI/4.1.1-GCC-11.2.0
module list
# This is needed with the current version of Slurm (21.08.x):
export SLURM_JOB_NUM_NODES=1-$SLURM_JOB_NUM_NODES
# The set of parallel runs:
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
wait
Download the script: parallel_steps_cpu.sh
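To check that the five steps actually ran side by side, one way (not part of the example itself) is to list the job steps with sacct after the job has finished, where <jobid> is the ID printed by sbatch:
sacct -j <jobid> --format=JobID,JobName,Start,End,Elapsed,NTasks
Each step line (<jobid>.0, <jobid>.1, and so on) should then show overlapping Start and End times.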
Note that with Slurm 22.05.x and newer, instead of
export SLURM_JOB_NUM_NODES=1-$SLURM_JOB_NUM_NODES
one can use the slightly simpler
export SLURM_DISTRIBUTION=pack
or add -m pack / --distribution=pack to the srun command lines.
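With that alternative, the last part of the script above could look like this; a sketch only, using the pack distribution instead of the SLURM_JOB_NUM_NODES workaround:
# Pack the job steps instead of exporting SLURM_JOB_NUM_NODES:
export SLURM_DISTRIBUTION=pack
# The set of parallel runs:
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
wait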
This will work with any job type that hands out cpus and memory, i.e., where one specifies --mem-per-cpu. For instance:
sbatch --partition=bigmem parallel_steps_cpu.sh
For job types that hand out whole nodes, notably normal jobs on Fram and Betzy, one has to do it slightly differently. Here is an example that runs a normal job with 8 MPI job steps at the same time, each using 16 tasks, thus totalling 128 tasks:
#!/bin/bash
#SBATCH --account=YourProject # Substitute with your project name
#SBATCH --job-name=parallel_tasks_node
#SBATCH --nodes=4
#SBATCH --time=00:05:00
# Safety settings
set -o errexit
set -o nounset
# Load MPI module
module --quiet purge
module load OpenMPI/4.1.1-GCC-11.2.0
module list
# This is needed for job types that hand out whole nodes:
unset SLURM_MEM_PER_NODE
export SLURM_MEM_PER_CPU=1888 # This is for Fram. For Betzy, use 1952.
# This is needed with the current version of Slurm (21.08.x):
export SLURM_JOB_NUM_NODES=1-$SLURM_JOB_NUM_NODES
# The set of parallel runs:
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
wait
Download the script: parallel_steps_node.sh
For instance (on Fram):
sbatch parallel_steps_node.sh
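If each step should work on its own input, the repeated srun lines can also be generated in a loop. Below is a minimal sketch; the input files input-1.dat to input-8.dat and the assumption that my-binary takes an input file as argument are for illustration only:
# Launch one 16-task step per (hypothetical) input file, all in the background:
for i in $(seq 1 8); do
    srun --ntasks=16 --exact ./my-binary input-${i}.dat &
done
# Wait for all steps to finish before the job script exits:
wait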
A couple of notes:
- The wait command is important: the run script will only continue once all commands started with & have completed.
- It is possible to use mpirun instead of srun, although srun is recommended for OpenMPI.
- The export SLURM_MEM_PER_CPU=1888 and unset SLURM_MEM_PER_NODE lines prior to the srun lines are needed for jobs in the normal or optimist partitions on Fram and Betzy, because it is not possible to specify this to sbatch for such jobs. Alternatively, you can add --mem-per-cpu=1888 to the srun command lines (this only works with srun). 1888 MiB allows up to 32 tasks per node on Fram; if each task needs more than 1888 MiB per cpu, the number must be increased, and the number of tasks per node will be reduced. On Betzy, the corresponding number is 1952, which allows up to 128 tasks per node. See the sketch after these notes for one way to pick this number.
- This technique does not work with IntelMPI, at least not when using mpirun, which is currently the recommended way of running IntelMPI jobs.
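As a rough illustration of the arithmetic behind those numbers: 1888 MiB times 32 cpus is about 59 GiB usable per Fram node, and 1952 MiB times 128 cpus is about 244 GiB per Betzy node. The following sketch shows one way to derive a larger per-cpu value when running fewer, more memory-hungry tasks per node; it assumes one cpu per task, and the variable names are for illustration only:
# Usable memory per node in MiB (1888 x 32 on Fram; use 249856 on Betzy):
usable_mem_per_node=60416
# Desired number of tasks per node (fewer tasks give more memory each):
tasks_per_node=16
# Resulting per-cpu memory to export before the srun lines:
export SLURM_MEM_PER_CPU=$(( usable_mem_per_node / tasks_per_node ))
echo "SLURM_MEM_PER_CPU=$SLURM_MEM_PER_CPU"   # 3776 MiB in this example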