How to recover files before a job times out

Sometimes you may want to clean up the work directory or recover files for a restart before a job times out. This is perhaps most useful when using the $SCRATCH work directory (see Storage areas on HPC clusters).

In this example we ask Slurm to send a signal to our script 120 seconds before it times out to give us a chance to perform clean-up actions.

#!/bin/bash

# job name
#SBATCH --job-name=example

# replace this by your account
#SBATCH --account=YourAccount

#SBATCH --qos=devel
#SBATCH --ntasks=1
## Note: On Saga, you will also have to specify --mem-per-cpu

# we give this job 4 minutes
#SBATCH --time=0-00:04:00

# ask Slurm to send the USR1 signal 120 seconds before the end of the time limit
#SBATCH --signal=B:USR1@120

# define the handler function
# note that this is not executed here, but rather
# when the associated signal is sent
your_cleanup_function()
{
    echo "function your_cleanup_function called at $(date)"
    # do whatever cleanup you want here
}

# call your_cleanup_function once we receive the USR1 signal
trap 'your_cleanup_function' USR1

echo "starting calculation at $(date)"

# the calculation "computes" (in this case sleeps) for 1000 seconds
# but we asked Slurm for only 240 seconds, so it will not finish
# the "&" after the compute step and the "wait" are important:
# bash only runs the trap handler once it regains control, and "wait"
# is interrupted by the signal, whereas a foreground command would
# delay the handler until the command itself finished
sleep 1000 &
wait

Download the script:

files/timeout_cleanup.sh
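
To try this out, you can submit the script with sbatch and inspect the Slurm output file after the job has ended. The output file name below assumes the default slurm-<jobid>.out naming; adjust it if your script sets --output.

# submit the job script (assuming it was saved as timeout_cleanup.sh)
sbatch timeout_cleanup.sh

# after the job has ended, the output file should contain the line
# printed by your_cleanup_function shortly before the time limit:
#   starting calculation at ...
#   function your_cleanup_function called at ...
cat slurm-<jobid>.out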

Also note that jobs which use $SCRATCH as the work directory can use the savefile and cleanup commands to copy files back to the submit directory before the work directory is deleted (see the documentation on the job work directory).
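
As a minimal sketch, a job script using $SCRATCH might register files to be copied back like this. The file names are placeholders, and the exact behaviour of savefile and cleanup is described in the job work directory documentation.

# inside a job script that uses $SCRATCH as the work directory
# (file and program names below are placeholders)
cd $SCRATCH

# register result.dat to be copied back to the submit directory
# when the job finishes, before $SCRATCH is deleted
savefile result.dat

# run the calculation that produces result.dat
./my_calculation > result.dat

# "cleanup" can similarly register a command to run before the work
# directory is removed, e.g. (hypothetical example):
# cleanup "cp extra_output.log $SLURM_SUBMIT_DIR"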