Job Types on Saga
Saga is designed to run serial and small (“narrow”) parallel jobs, in addition to GPU jobs. If you need to run “wider” parallel jobs, Fram is a better choice.
Warning
On Saga use srun, not mpirun
mpirun can get the number of tasks wrong and also lead to wrong task
placement. We don’t fully understand why this happens. When using srun
instead of mpirun or mpiexec, we observe correct task placement on Saga.
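The advice above can be sketched as a minimal job script. The account `nnXXXXk`, the job name, the program `./my_mpi_program` and all resource values are placeholders chosen for illustration, not recommendations.

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk      # placeholder project account
#SBATCH --job-name=mpi_test    # hypothetical job name
#SBATCH --time=00:10:00        # illustrative walltime
#SBATCH --ntasks=8             # illustrative number of MPI tasks
#SBATCH --mem-per-cpu=2G       # illustrative memory per cpu

set -o errexit   # stop the job script on the first error

# Launch with srun rather than mpirun or mpiexec, so that the task count
# and task placement come from the Slurm allocation requested above.
srun ./my_mpi_program          # hypothetical MPI executable
```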
The basic allocation units on Saga are cpu and memory. The details about how the billing units are calculated can be found in Projects and accounting.
Most jobs on Saga are normal jobs.
Jobs requiring a lot of memory (> 8 GiB/cpu) should run as bigmem or hugemem jobs.
Jobs that are very short, or implement checkpointing, can run as optimist jobs. These use resources that would otherwise be idle for a short time, and are requeued as soon as a non-optimist job needs the resources.
For development or testing, use a devel job.
Here is a more detailed description of the different job types on Saga:
Normal
Allocation units: cpus and memory
Job Limits:
maximum 256 units
Maximum walltime: 7 days
Priority: normal
Available resources:
200 nodes with 40 cpus and 178.5 GiB RAM
120 nodes with 52 cpus and 178.5 GiB RAM
Parameter for sbatch/salloc:
None; normal is the default.
Job Scripts: Normal
This is the default job type. Most jobs are normal jobs.
Bigmem
Allocation units: cpus and memory
Job Limits:
maximum 256 units
Maximum walltime: 14 days
Priority: normal
Available resources:
28 nodes with 40 cpus and 362 GiB RAM
8 nodes with 64 cpus and 3021 GiB RAM
Parameter for sbatch/salloc:
--partition=bigmem
Job Scripts: Bigmem and Hugemem
Bigmem jobs are meant for jobs that need a lot of memory (RAM), typically more than 8 GiB per cpu. (The normal nodes on Saga have at most about 4.5 GiB per cpu.)
Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.
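The 8 GiB per cpu rule of thumb can be checked with simple arithmetic before submitting. The helper below is a hypothetical sketch (the function name `suggest_partition` is not part of Saga or Slurm); it only distinguishes normal from bigmem and treats the threshold as a guideline, as the text does.

```bash
# Hypothetical helper: suggest a partition from the total memory (in GiB)
# and the number of cpus a job plans to request.
suggest_partition() {
    total_mem_gib=$1
    ncpus=$2
    # "More than 8 GiB per cpu" is the same as total memory > 8 * ncpus.
    if [ "$total_mem_gib" -gt $((8 * ncpus)) ]; then
        echo bigmem
    else
        echo normal
    fi
}

suggest_partition 64 4    # 16 GiB per cpu, prints "bigmem"
suggest_partition 16 4    # 4 GiB per cpu, prints "normal"
```

A job in the first case would then add `--partition=bigmem` to its sbatch or salloc parameters.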
Hugemem
Allocation units: cpus and memory
Job Limits:
maximum 256 units
Maximum walltime: 14 days
Priority: normal
Available resources:
2 nodes with 64 cpus and 6040 GiB RAM
Parameter for sbatch/salloc:
--partition=hugemem
Job Scripts: Bigmem and Hugemem
Hugemem jobs are meant for jobs that need even more memory (RAM) than bigmem jobs.
Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.
Please note that not all of the ordinary software modules will work on
the hugemem nodes, due to the different cpu type. If you encounter
any software-related issues, we are happy to help you at
support@nris.no. As an alternative, you can use the NESSI or
EESSI modules. These have been built to
support the cpus on the hugemem nodes. To activate them, run one of the following commands before you load any other modules:
NESSI: module load NESSI/2023.06
EESSI: module load EESSI/2023.06
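A hugemem job script using the NESSI modules might look like the sketch below. The account `nnXXXXk`, the module `MySoftware/1.0`, the program name and all resource values are placeholders for illustration only.

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk       # placeholder project account
#SBATCH --job-name=hugemem_test # hypothetical job name
#SBATCH --partition=hugemem
#SBATCH --time=01:00:00         # illustrative walltime
#SBATCH --ntasks=4              # illustrative task count
#SBATCH --mem-per-cpu=64G       # illustrative memory per cpu

set -o errexit   # stop the job script on the first error

# Activate the NESSI software stack first, then load the software itself.
module load NESSI/2023.06
module load MySoftware/1.0      # hypothetical module name

srun ./my_program               # hypothetical executable
```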
Accel
Allocation units: cpus, memory and GPUs
Job Limits:
maximum 256 units
Maximum walltime: 14 days
Priority: normal
Available resources: 8 nodes (max 7 per user) with 24 cpus, 364 GiB RAM and 4 P100 GPUs.
Parameter for sbatch/salloc:
--partition=accel
--gpus=N, --gpus-per-node=N or similar, with N being the number of GPUs
Job Scripts: Accel and A100
Accel jobs give access to the P100 GPUs.
Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.
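As an illustration of the parameters above, an accel job script requesting a single P100 GPU could be sketched as follows. The account `nnXXXXk`, the program name and the resource values are placeholders, and the availability of `nvidia-smi` on the GPU nodes is an assumption.

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk     # placeholder project account
#SBATCH --job-name=gpu_test   # hypothetical job name
#SBATCH --partition=accel
#SBATCH --gpus=1              # N = 1 GPU
#SBATCH --time=00:30:00       # illustrative walltime
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4G      # illustrative memory per cpu

set -o errexit   # stop the job script on the first error

# Show which GPU the job was given (assumes nvidia-smi is on the path).
nvidia-smi

srun ./my_gpu_program         # hypothetical GPU executable
```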
A100
Allocation units: cpus, memory and GPUs
Job Limits:
maximum 256 units
Maximum walltime: 14 days
Priority: normal
Available resources: 8 nodes (max 7 per user) with 32 cpus, 1,000 GiB RAM and 4 A100 GPUs.
Parameter for sbatch/salloc:
--partition=a100
--gpus=N, --gpus-per-node=N or similar, with N being the number of GPUs
Job Scripts: Accel and A100
A100 jobs give access to the A100 GPUs.
Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.
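The same parameters work with salloc for interactive use. The command below is a sketch; the account `nnXXXXk` is a placeholder and the task, memory and time values are only illustrative.

```bash
# Request one A100 GPU on one node for 30 minutes of interactive use.
# nnXXXXk is a placeholder account; adjust the resource values to your needs.
salloc --account=nnXXXXk --partition=a100 --nodes=1 --gpus-per-node=1 \
       --ntasks=1 --mem-per-cpu=8G --time=00:30:00
```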
Devel
Allocation units: cpus, memory and GPUs
Job Limits:
maximum 128 units per job
maximum 256 units in use at the same time
maximum 2 running jobs per user
Maximum walltime: 2 hours
Priority: high
Available resources: devel jobs can run on any node on Saga
Parameter for sbatch/salloc:
--qos=devel
Job Scripts: Devel
This is meant for small, short development or test jobs. Devel jobs get higher priority so that they run as soon as possible. In return, there are limits on the size and number of devel jobs.
Can be combined with --partition=accel, --partition=a100, --partition=bigmem or --partition=hugemem to increase priority, while the maximum wall time and job limits of devel jobs apply.
If you have temporary development needs that cannot be fulfilled by the devel job type, please contact us at support@nris.no.
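Options given on the sbatch command line take precedence over `#SBATCH` lines in the script, so an existing job script can be tried as a devel job without editing it. In this sketch, `my_job.sh` is a hypothetical script and the time value is illustrative.

```bash
# Submit an existing job script as a devel job.
# The requested time must stay within the 2-hour devel limit.
sbatch --qos=devel --time=00:20:00 my_job.sh
```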
Optimist
Allocation units: cpus and memory
Job Limits:
maximum 256 units
Maximum walltime: None. The jobs will start as soon as resources are available for at least 30 minutes, but can be requeued at any time, so there is no guaranteed minimum run time.
Priority: low
Available resources: optimist jobs can run on any node on Saga
Parameter for sbatch/salloc:
--qos=optimist
Job Scripts: Optimist
The optimist job type is meant for very short jobs, or jobs with checkpointing (i.e., they save state regularly, so they can restart from where they left off).
Optimist jobs get lower priority than other jobs, but will start as soon as there are free resources for at least 30 minutes. However, when a non-optimist job needs those resources, the optimist job is stopped and put back in the job queue. This can happen before the optimist job has run for 30 minutes, so there is no guaranteed minimum run time.
Therefore, all optimist jobs must use checkpointing, and access to run optimist jobs will only be given to projects that demonstrate that they can use checkpointing. If you want to run optimist jobs, send a request to support@nris.no.
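What checkpointing means in practice depends on the application. As a minimal sketch of the idea, the script below records the last finished step in a file and resumes from it when the job starts again. The function `run_with_checkpoint`, the file name `checkpoint.txt`, the step count and the account `nnXXXXk` are all hypothetical, and the remaining sbatch options (time, memory, tasks) are omitted for brevity.

```bash
#!/bin/bash
#SBATCH --account=nnXXXXk   # placeholder project account
#SBATCH --qos=optimist

# Run steps 1..total_steps, skipping those already recorded in the checkpoint file.
run_with_checkpoint() {
    checkpoint_file=$1
    total_steps=$2
    last_done=0
    # Resume after the last finished step if a non-empty checkpoint exists.
    if [ -s "$checkpoint_file" ]; then
        last_done=$(cat "$checkpoint_file")
    fi
    step=$((last_done + 1))
    while [ "$step" -le "$total_steps" ]; do
        echo "running step $step"            # placeholder for one unit of real work
        echo "$step" > "$checkpoint_file"    # record progress only after the step is done
        step=$((step + 1))
    done
}

run_with_checkpoint "${CHECKPOINT_FILE:-checkpoint.txt}" "${TOTAL_STEPS:-5}"
```

If the job is stopped and requeued after step 3 has been recorded, the next run starts at step 4 instead of repeating the earlier work.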