Resource usage of a running job
On Fram and Saga there is also a web tool for inspecting many aspects of jobs, such as processes, CPU load, memory consumption and network traffic, for both running and completed jobs (last 24 hours).
Remark: The stats are collected at the compute node level. This means that the stats are affected by all jobs running on the same compute node.
How to check whether your job is running
To check the job status of all your jobs, you can use squeue:
squeue -u MyUsername
You can also get a quick view of the status of a single job:
squeue -j JobId
JobId is the job id number that sbatch returns. To see more details about a job, use:
scontrol show job JobId
Both commands will show the job state (ST), and can show a job reason for why a job is pending. Job States describes a few of the more common ones.
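As an illustration, the state and reason can be extracted on their own for use in scripts. The sketch below assumes the standard squeue options -h (no header) and -o (output format) with the %T (state) and %r (reason) fields; the function name job_state is hypothetical and not part of Slurm.

```shell
# Hypothetical helper: print "STATE REASON" for a single job.
# $1 is the job id, as returned by sbatch.
job_state() {
    # -h drops the header line, %T is the full state name, %r the reason
    squeue -h -j "$1" -o "%T %r"
}
```

Calling job_state JobId on a waiting job would print something like PENDING followed by the pending reason.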
To view resource usage, use sstat while the job is running and sacct after it has finished:
sstat -j JobId
sacct -j JobId
Both sstat and sacct have an option --format to select which fields to show. See the documentation of the commands for the available fields and what they mean.
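A minimal sketch of how --format might be used follows. The field names (JobID, JobName, State, Elapsed, MaxRSS, AveCPU) are standard Slurm accounting fields, but which ones are useful depends on the job; the function names job_usage and job_usage_live are hypothetical wrappers introduced only for this example.

```shell
# Hypothetical helper: usage summary for a finished job ($1 is the job id).
job_usage() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,MaxRSS
}

# Hypothetical helper: current usage of a running job ($1 is the job id).
job_usage_live() {
    sstat -j "$1" --format=JobID,AveCPU,MaxRSS
}
```

For example, job_usage JobId prints one row per job step with the selected columns.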
When a job has finished, the output file contains some usage statistics from
Cancelling jobs and putting jobs on hold
You can cancel running or pending (waiting) jobs with scancel:
scancel JobId                # Cancel job with id JobId (as returned from sbatch)
scancel --user=MyUsername    # Cancel all your jobs
scancel --account=MyProject  # Cancel all jobs in MyProject
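The selection can also be narrowed further. As a sketch, the function below cancels only the jobs that are still waiting in the queue and leaves running jobs alone. It assumes scancel's standard --state option; the function name cancel_pending is hypothetical.

```shell
# Hypothetical helper: cancel only the pending jobs of one user.
# $1 is the username, e.g. MyUsername.
cancel_pending() {
    # --state restricts the operation to jobs in the given state
    scancel --user="$1" --state=PENDING
}
```

Running cancel_pending MyUsername would then leave any jobs that have already started untouched.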
The command scontrol can be used to further control pending or running jobs:
scontrol requeue JobId: Requeue a running job. The job will be stopped, and its state changed to pending.
scontrol hold JobId: Hold a pending job. This prevents the queue system from starting the job. The job reason will be set to
scontrol release JobId: Release a held job. This allows the queue system to start the job.
It is also possible to submit a job and put it on hold immediately:
sbatch --hold JobScript
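A held submission can be combined with a later release in a script. The sketch below assumes sbatch's --parsable option, which prints only the job id (followed by a semicolon and the cluster name on multi-cluster setups); the function name submit_held is hypothetical.

```shell
# Hypothetical helper: submit a job script on hold and print its job id.
# $1 is the path to the job script.
submit_held() {
    # cut removes the optional ";cluster" suffix from the --parsable output
    sbatch --parsable --hold "$1" | cut -d ';' -f 1
}
```

A typical use would be jobid=$(submit_held JobScript), followed by scontrol release "$jobid" once the job should be allowed to start.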