Packaging smaller parallel jobs into one large job

There are several ways to package smaller parallel jobs into one large parallel job. The preferred way is to use job arrays. Here we want to present a more pedestrian alternative which can give a lot of flexibility.
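
For comparison, a minimal job array sketch could look like the following (the --array range, binary name and input file naming are hypothetical, chosen only for illustration):

#!/bin/bash

#SBATCH --account=YourProject  # Substitute with your project name
#SBATCH --job-name=array_example
#SBATCH --array=1-5            # 5 independent array tasks (hypothetical range)
#SBATCH --ntasks=4             # each array task runs 4 MPI tasks
#SBATCH --time=0-00:05:00
#SBATCH --mem-per-cpu=5000M

set -o errexit
set -o nounset

module --quiet purge
module load OpenMPI/2.1.1-GCC-6.4.0-2.28
module list

# Each array task gets its own SLURM_ARRAY_TASK_ID, here used to pick an input file
srun ./my-binary input-${SLURM_ARRAY_TASK_ID}.dat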

In this example we imagine that we wish to run a job with 5 MPI job steps at the same time, each using 4 tasks, for a total of 20 tasks:

#!/bin/bash

#SBATCH --account=YourProject  # Substitute with your project name
#SBATCH --job-name=parallel_tasks_cpu
#SBATCH --ntasks=20
#SBATCH --time=0-00:05:00
#SBATCH --mem-per-cpu=5000M

# Safety settings
set -o errexit
set -o nounset

# Load MPI module
module --quiet purge
module load OpenMPI/2.1.1-GCC-6.4.0-2.28
module list

# The set of parallel runs:
srun --ntasks=4 --exclusive ./my-binary &
srun --ntasks=4 --exclusive ./my-binary &
srun --ntasks=4 --exclusive ./my-binary &
srun --ntasks=4 --exclusive ./my-binary &
srun --ntasks=4 --exclusive ./my-binary &

wait

Download the script: parallel_steps_cpu.sh (you might have to right-click and select Save Link As... or similar).
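
Since the srun lines are ordinary shell commands, they can also be generated in a loop; this is where much of the flexibility comes from. A hypothetical variation of the script above, giving each step its own input file (the file names are made up for illustration), could look like this:

# Launch the 5 steps in a loop, one (hypothetical) input file per step
for i in 1 2 3 4 5; do
    srun --ntasks=4 --exclusive ./my-binary input-${i}.dat &
done

wait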

This will work with any job type that hands out CPUs and memory, i.e. any job type where one specifies --mem-per-cpu. For instance:

sbatch --partition=bigmem parallel_steps_cpu.sh

For job types that hand out whole nodes, notably the normal jobs on Fram, one has to do it slightly differently. Here is an example that runs a normal job with 8 MPI job steps at the same time, each using 16 tasks, for a total of 128 tasks:

#!/bin/bash

#SBATCH --account=YourProject  # Substitute with your project name
#SBATCH --job-name=parallel_tasks_node
#SBATCH --nodes=4
#SBATCH --time=00:05:00

# Safety settings
set -o errexit
set -o nounset

# Load MPI module
module purge
module load OpenMPI/3.1.1-GCC-7.3.0-2.30
module list

# This is needed for job types that hand out whole nodes:
export SLURM_MEM_PER_CPU=1920

# The set of parallel runs:
srun --ntasks=16 --exclusive ./my-binary &
srun --ntasks=16 --exclusive ./my-binary &
srun --ntasks=16 --exclusive ./my-binary &
srun --ntasks=16 --exclusive ./my-binary &
srun --ntasks=16 --exclusive ./my-binary &
srun --ntasks=16 --exclusive ./my-binary &
srun --ntasks=16 --exclusive ./my-binary &
srun --ntasks=16 --exclusive ./my-binary &

wait

Download the script: parallel_steps_node.sh (you might have to right-click and select Save Link As... or similar).
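
As a quick sanity check of the numbers in this script (assuming 32 CPUs per node on Fram, consistent with the note about SLURM_MEM_PER_CPU below):

4 nodes x 32 CPUs per node  = 128 CPUs available
8 steps x 16 tasks per step = 128 tasks requested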

For instance (on Fram):

sbatch parallel_steps_node.sh
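
The individual job steps can be inspected with sacct (here <jobid> is a placeholder for the actual job ID of the submitted job):

sacct -j <jobid>    # lists the job together with each of its steps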

A couple of notes:

  • The wait command is important: the job script will only continue once all the commands started with & have completed.
  • It is possible to use mpirun instead of srun, although srun is recommended for OpenMPI.
  • The export SLURM_MEM_PER_CPU=1920 prior to the srun lines is needed for jobs in the normal or optimist partitions on Fram, because it is not possible to specify this to sbatch for such jobs. Alternatively, you can add --mem-per-cpu=1920 to the srun command lines, as sketched below this list (this only works with srun). 1920 MiB per CPU allows up to 32 tasks per node; if each task needs more than 1920 MiB per CPU, the number must be increased, and the number of tasks per node will be reduced.
  • This technique does not work with IntelMPI, at least not when using mpirun, which is currently the recommended way of running IntelMPI jobs.
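
The alternative mentioned above, specifying the memory on each srun line instead of exporting SLURM_MEM_PER_CPU, would look roughly like this (only the first two steps are shown):

# Alternative: pass the memory limit to each job step instead of exporting SLURM_MEM_PER_CPU
srun --ntasks=16 --mem-per-cpu=1920 --exclusive ./my-binary &
srun --ntasks=16 --mem-per-cpu=1920 --exclusive ./my-binary &
# ...and so on for the remaining steps

wait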
