How to recover files before a job times out
You may want to clean up the work directory or recover files for restart in case a job times out. This is most useful when the job uses the $SCRATCH work directory (see Storage areas on HPC clusters).
In this example we ask Slurm to send a signal to our script 120 seconds before the time limit is reached, to give us a chance to perform clean-up actions.
#!/bin/bash

# job name
#SBATCH --job-name=example

# replace this by your account
#SBATCH --account=YourAccount

#SBATCH --qos=devel
#SBATCH --ntasks=1
## Note: On Saga, you will also have to specify --mem-per-cpu

# we give this job 4 minutes
#SBATCH --time=0-00:04:00

# ask Slurm to send the USR1 signal 120 seconds before the end of the time limit
#SBATCH --signal=B:USR1@120

# define the handler function
# note that this is not executed here, but rather
# when the associated signal is sent
your_cleanup_function()
{
    echo "function your_cleanup_function called at $(date)"
    # do whatever cleanup you want here
}

# call your_cleanup_function once we receive the USR1 signal
trap 'your_cleanup_function' USR1

echo "starting calculation at $(date)"

# the calculation "computes" (in this case sleeps) for 1000 seconds,
# but we asked Slurm for only 240 seconds, so it will not finish
# the "&" after the compute step and the "wait" are important:
# bash only handles the trapped signal while the script itself is waiting,
# not while a foreground command is running
sleep 1000 &
wait
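As a concrete illustration, a cleanup handler often copies restart or result files from the scratch area back to the directory the job was submitted from. The following is only a sketch: the file names state.dat and result.dat are hypothetical examples, while $SLURM_SUBMIT_DIR is set by Slurm and $SCRATCH by the cluster environment.

your_cleanup_function()
{
    echo "copying files back at $(date)"
    # copy example restart/result files from the scratch work directory
    # back to the directory the job was submitted from;
    # "|| true" keeps the handler from aborting if a file is missing
    cp "$SCRATCH"/state.dat "$SCRATCH"/result.dat "$SLURM_SUBMIT_DIR"/ || true
}

Submit the script as usual with sbatch. With the 4-minute limit above, the handler should run roughly two minutes after the calculation starts, and its output will appear in the Slurm output file.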
Also note that jobs which use $SCRATCH as the work directory can use the savefile and cleanup commands to copy files back to the submit directory before the work directory is deleted (see the info about Job work directory).
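For illustration, a minimal sketch of how these commands are typically placed in a job script running in $SCRATCH, assuming the usage described in the Job work directory documentation; the file names my_results.dat and restart.dat are only examples:

# mark a file in the scratch work directory to be copied back to the
# submit directory when the job finishes (example file name)
savefile my_results.dat

# register a command to be run when the job exits, for example copying
# an additional restart file back to the submit directory
cleanup "cp restart.dat $SLURM_SUBMIT_DIR"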