Useful Slurm commands and tools for managing jobs

Slurm

Slurm, the job scheduler used on the HPC clusters, has a number of useful commands for managing jobs. Here is a growing collection of useful Slurm commands. Slurm commands can also be found in the official Slurm documentation.

Commands for sbatch

Note

There are two ways of giving sbatch a command. One way is to include the command in the job script by adding #SBATCH before the command (just like you would do for the required sbatch commands such as #SBATCH --job-name, #SBATCH --nodes, etc.) The other way is to give the command in the command line when submitting a job script. For example for the command --test-only, you would submit the job with sbatch --test-only job_script.sh.

--test-only

Validates the script and report about any missing information (misspelled input files, invalid arguments, etc.) and give an estimate of when the job will start running. Will not actually submit the job to the queue.

--gres=localscratch:<size>

A job on Fram or Saga can request a scratch area on local disk on the node it is running on to speed up I/O intensive jobs. This command is not useful for jobs running on more than one node. Currently, there are no special commands to ensure that files are copied back automatically, so one has to do that with cp commands or similar in the job script. More information on using this command is found here: Job scratch area on local disk.

Other Slurm commands

sstat and sacct

Job statistics can be found with sstat for running jobs and with sacct for completed jobs. In the command line, use sstat or sacct with the option -j followed by the job id number.

$ sstat -j JobId
$ sacct -j JobId

Both sstat and sacct have an option --format to select which fields to show. See the documentation on sstat here and on sacct here.

Tools

Here, a growing collection of useful tools available in the command line.

seff

seff is a nice tool which we can use on completed jobs. For example here we ask for a summary for the job number 4200691:

$ seff 4200691
Job ID: 4200691
Cluster: saga
User/Group: user/user
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:01
CPU Efficiency: 2.70% of 00:00:37 core-walltime
Job Wall-clock time: 00:00:37
Memory Utilized: 3.06 GB
Memory Efficiency: 89.58% of 3.42 GB

Slurm Job Script Generator

A tool for generating Slurm job scripts tailored for our HPC clusters: Slurm Job Script Generator