Array Jobs
To run many instances of the same job, use the --array switch to sbatch. This is useful if you have many datasets which you want to process in the same way:
$ sbatch --array=from-to [other sbatch switches] YourScript
You can also put the --array switch in an #SBATCH line inside the script. from and to are the first and last task number. Each instance of YourScript can use the environment variable $SLURM_ARRAY_TASK_ID for selecting which data set to use, etc. (The queue system calls the instances “array tasks”.) For instance:
$ sbatch --array=1-100 MyScript
will run 100 instances of MyScript, setting the environment variable $SLURM_ARRAY_TASK_ID to 1, 2, …, 100 in turn.
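The same specification can be kept in the script itself instead of on the command line. A minimal sketch, assuming a Saga-style job where account, time and memory must be given (these placeholder values should be adjusted to your needs):
#!/bin/bash
#SBATCH --account=YourProject
#SBATCH --time=0:10:0
#SBATCH --mem-per-cpu=1G
#SBATCH --array=1-100
# $SLURM_ARRAY_TASK_ID is 1, 2, ..., 100 in the respective array task
echo "Processing data set number $SLURM_ARRAY_TASK_ID"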
Array job properties
Specifying task IDs
It is possible to specify the task IDs in other ways than from-to: it can be a single number, a range (from-to), a range with a step size (from-to:step), or a comma-separated list of these. Finally, adding %max at the end of the specification puts a limit on how many tasks will be allowed to run at the same time. A couple of examples:
Specification | Resulting task IDs
---|---
1,4,42 | 1, 4, 42
1-5 | 1, 2, 3, 4, 5
0-10:2 | 0, 2, 4, 6, 8, 10
32,56,100-200 | 32, 56, 100, 101, 102, …, 200
1-200%10 | 1, 2, …, 200, but maximum 10 running at the same time
Note
Spaces, decimal numbers or negative numbers are not allowed in the --array specification.
The queue system allows job arrays with at most 1,000 array tasks, but the maximal array task ID is 100,000 (thus --array=900-1100 is allowed).
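The different forms can also be combined. For illustration (the script name is a placeholder):
$ sbatch --array=0-10:2%3 MyScript   # tasks 0, 2, 4, 6, 8, 10, with at most 3 running at once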
Array job resources
The instances of an array job are independent; they have their own $SCRATCH (read more about storage locations here) and are treated like separate jobs. Thus any resource request in the Slurm script is available to each task individually.
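As a sketch of what this means in practice (YourProgram and the dataset naming are placeholders), every task can stage its input in its own scratch area without interfering with the other tasks:
# inside the job script: each array task has its own private $SCRATCH
cp dataset.$SLURM_ARRAY_TASK_ID $SCRATCH/   # stage the input for this task
cd $SCRATCH
YourProgram dataset.$SLURM_ARRAY_TASK_ID > $SLURM_SUBMIT_DIR/result.$SLURM_ARRAY_TASK_ID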
Canceling array jobs
To cancel all tasks of an array job, cancel the job ID that is returned by sbatch. One can also cancel individual tasks with scancel <array job ID>_<task ID>.
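For example, assuming sbatch returned the (hypothetical) job ID 123456:
$ scancel 123456      # cancels the whole array job, i.e. all tasks
$ scancel 123456_42   # cancels only array task 42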
Dependencies between array jobs
To handle dependencies between two or more array jobs one can use the --depend=aftercorr:<previous job ID> switch (regular dependencies can also be used, but we want to highlight this particular way since it can be beneficial with array jobs). This starts a dependent array task as soon as the corresponding array task of the previous job has completed. E.g. if we start an array job with --array=1-5 and then start a second array job with --array=1-5 --depend=aftercorr:<other job id>, then once task X of the first job is complete, the second job will start its task X, independently of the other tasks in the first or second job.
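A sketch of how this looks on the command line (the script names are placeholders, and 123456 stands for whatever job ID the first sbatch reports):
$ sbatch --array=1-5 first_step.sh    # suppose this returns job ID 123456
$ sbatch --array=1-5 --depend=aftercorr:123456 second_step.sh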
Example
A small, but complete example (for a normal job on Saga):
#!/bin/bash
#SBATCH --account=YourProject
#SBATCH --time=1:0:0
#SBATCH --mem-per-cpu=4G --ntasks=2
#SBATCH --array=1-200
set -o errexit # exit on errors
set -o nounset # treat unset variables as errors
module --quiet purge # clear any inherited modules
DATASET=dataset.$SLURM_ARRAY_TASK_ID
OUTFILE=result.$SLURM_ARRAY_TASK_ID
YourProgram $DATASET > $OUTFILE
Submit the script with sbatch minimal_array_job.sh. This job will process the datasets dataset.1, dataset.2, …, dataset.200 and put the results in result.1, result.2, …, result.200. Note that your dataset files have to be named dataset.1, dataset.2, etc. for this example to work; make sure that the names of your dataset files and the names in your script are the same. Each of the tasks will consist of two processes (--ntasks=2) and get a total of 8GB of memory (2 x --mem-per-cpu=4G).
If your files have inconsistent naming (for example “dataset_one”, “dataset_2”, “my_dataset”, etc.), you either have to rename your files or include code in your script to handle them. Here is one way to handle inconsistent names:
Warning
You need to have the same number of files in your dataset directory as the number of tasks you specify in the --array switch, i.e. count the number of files in your dataset directory and use that number in the --array switch. For example, to check how many CSV files are in the directory named data, use ls data/*.csv | wc -l in the terminal.
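A sketch of how that count can be fed straight into the submission (the glob and the script name are placeholders for your own files and script; the script below numbers its tasks from 0, so the last task ID is the count minus one):
$ N=$(ls data/* | wc -l)   # number of dataset files
$ sbatch --array=0-$((N - 1)) YourScript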
#!/bin/bash
#SBATCH --account=YourProject
#SBATCH --time=1:0:0
#SBATCH --mem-per-cpu=4G --ntasks=2
#SBATCH --array=0-199 # we start at 0 instead of 1 here, because
                      # bash arrays are indexed from 0, so the task ID
                      # can be used directly as an index into DATASETS below
set -o errexit # exit on errors
set -o nounset # treat unset variables as errors
module --quiet purge # clear any inherited modules
DATASETS=(data/*) # get all files in the directory named "data". Replace
# "data" with the path of your dataset directory.
FILE=${DATASETS[$SLURM_ARRAY_TASK_ID]}
FILENAME=$(basename "${FILE%.*}")   # file name without directory and extension
YourProgram "$FILE" > "${FILENAME}.out"
DATASETS=(data/*) will get all files in the directory named “data” and store them in a bash array. The array is indexed from 0, so the first file will be stored in DATASETS[0], the second in DATASETS[1], and so on. The SLURM_ARRAY_TASK_ID variable is set by the Slurm system to the task ID of the current task; since the job was submitted with --array=0-199, the task IDs match the array indices 0-199.
Tip
If, for example, your datasets are CSV files and the directory contains other file types, use DATASETS=(data/*.csv) instead.
Alternatively, you can save the names of your files in a text file and use the order of the filenames in the text file as an index. This is useful if you need the order of your files later or if you need to map the Slurm job output file to the correct dataset file.
Run, for example, these commands in the command line to create a text file with the names of your files:
$ DATASETS=(data/*)
$ printf "%s\n" "${DATASETS[@]}" > map_files.txt
And use the following example as your run script:
#!/bin/bash
#SBATCH --account=YourProject
#SBATCH --time=1:0:0
#SBATCH --mem-per-cpu=4G --ntasks=2
#SBATCH --array=0-199
set -o errexit # exit on errors
set -o nounset # treat unset variables as errors
module --quiet purge # clear any inherited modules
IDX=$((SLURM_ARRAY_TASK_ID + 1))      # sed counts lines from 1, while the task IDs start at 0
FILE=$(sed "${IDX}q;d" map_files.txt) # pick out line number $IDX of map_files.txt
FILENAME=$(basename "${FILE%.*}")
YourProgram "$FILE" > "${FILENAME}.out"
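Here sed "${IDX}q;d" map_files.txt prints line number $IDX of map_files.txt and then quits, which selects the filename belonging to this task. Since sed numbers lines from 1 while the task IDs in this example start at 0, the task ID is shifted by one before it is used as a line number.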
Tip
You can find a more extensive example here.