Packaging smaller parallel jobs into one large job

There are several ways to package smaller parallel jobs into one large parallel job. The preferred way is to use Array Jobs; a brief sketch of that approach is shown below. Here we present a more pedestrian alternative which gives a lot of flexibility, but which can also be a little more complicated to get right.
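
For comparison, here is a minimal sketch of what the array-job approach could look like for the first example below (five independent runs of a hypothetical ./my-binary, each with 4 tasks). The job name and the use of $SLURM_ARRAY_TASK_ID are illustrative assumptions, not part of the example scripts:

#!/bin/bash

#SBATCH --account=YourProject  # Substitute with your project name
#SBATCH --job-name=parallel_tasks_array
#SBATCH --array=1-5            # Five independent array elements
#SBATCH --ntasks=4             # Each element uses 4 tasks
#SBATCH --time=0-00:05:00
#SBATCH --mem-per-cpu=2000M

set -o errexit
set -o nounset

module --quiet purge
module load OpenMPI/4.1.1-GCC-11.2.0

# Each array element runs as a separate job; $SLURM_ARRAY_TASK_ID (here 1-5)
# can be used to select different input files or parameters.
srun ./my-binary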

Note that how to use this mechanism has changed since Slurm 19.05.x (and might change again later).

In this example we imagine that we wish to run a job with 5 MPI job steps at the same time, each using 4 tasks, for a total of 20 tasks:

#!/bin/bash

#SBATCH --account=YourProject  # Substitute with your project name
#SBATCH --job-name=parallel_tasks_cpu
#SBATCH --ntasks=20
#SBATCH --time=0-00:05:00
#SBATCH --mem-per-cpu=2000M

# Safety settings
set -o errexit
set -o nounset

# Load MPI module
module --quiet purge
module load OpenMPI/4.1.1-GCC-11.2.0
module list

# This is needed with Slurm 21.08.x and newer:
export SLURM_JOB_NUM_NODES=1-$SLURM_JOB_NUM_NODES

# The set of parallel runs:
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &
srun --ntasks=4 --exact ./my-binary &

wait

Download the script:

files/parallel_steps_cpu.sh
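
If you want to verify that the five steps really ran side by side, one way (not part of the script itself; <jobid> is a placeholder for the job ID printed by sbatch) is to look at the accounting records, which list one line per job step:

sacct -j <jobid>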

Note that with the currently installed versions of Slurm (22.05.x and newer), instead of

export SLURM_JOB_NUM_NODES=1-$SLURM_JOB_NUM_NODES

one can use the slightly simpler

export SLURM_DISTRIBUTION=pack

or add -m pack / --distribution=pack to the srun command lines.
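
For example, the first srun line in the script above would then read (the rest of the script is unchanged):

srun --ntasks=4 --exact --distribution=pack ./my-binary &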

This will work with any of the job types that hand out cpus and memory, i.e., the ones where one specifies --mem-per-cpu. For instance:

sbatch --partition=bigmem parallel_steps_cpu.sh

For job types that hand out whole nodes, notably the normal jobs on Fram and Betzy, one has to do it slightly differently. Here is an example that runs a normal job with 8 MPI job steps at the same time, each using 16 tasks, for a total of 128 tasks:

#!/bin/bash

#SBATCH --account=YourProject  # Substitute with your project name
#SBATCH --job-name=parallel_tasks_node
#SBATCH --nodes=4
#SBATCH --time=00:05:00

# Safety settings
set -o errexit
set -o nounset

# Load MPI module
module --quiet purge
module load OpenMPI/4.1.1-GCC-11.2.0
module list

# This is needed for job types that hand out whole nodes:
unset SLURM_MEM_PER_NODE
export SLURM_MEM_PER_CPU=1888  # This is for Fram.  For Betzy, use 1952.

# This is needed with Slurm 21.08.x and newer:
export SLURM_JOB_NUM_NODES=1-$SLURM_JOB_NUM_NODES

# The set of parallel runs:
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &
srun --ntasks=16 --exact ./my-binary &

wait

Download the script:

files/parallel_steps_node.sh

For instance (on Fram):

sbatch parallel_steps_node.sh

A few notes:

  • The wait command is important: it makes the script wait until all the srun commands started in the background with & have completed. Without it, the script would exit immediately, ending the job and killing any steps that were still running.

  • It is possible to use mpirun instead of srun, although srun is recommended for OpenMPI.

  • The export SLURM_MEM_PER_CPU=1888 and unset SLURM_MEM_PER_NODE lines prior to the srun lines are needed for jobs in the normal or optimist partitions on Fram and Betzy, because it is not possible to specify this to sbatch for such jobs. Alternatively, you can add --mem-per-cpu=1888 to the srun command lines (this only works with srun); see the sketch after this list. The value 1888 allows up to 32 tasks per node on Fram. If each task needs more than 1888 MiB per CPU, the number must be increased (and the number of tasks per node will be reduced). On Betzy, the corresponding number is 1952, which allows up to 128 tasks per node.

  • This technique does not work with IntelMPI, at least not when using mpirun, which is currently the recommended way of running IntelMPI jobs.
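
As mentioned in the third note above, a sketch of the per-step alternative to exporting SLURM_MEM_PER_CPU (the value 1888 is for Fram; use 1952 on Betzy):

srun --ntasks=16 --exact --mem-per-cpu=1888 ./my-binary &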