Job Types on Saga

Saga is designed to run serial and small (“narrow”) parallel jobs, in addition to GPU jobs. If you need to run “wider” parallel jobs, Fram is a better choice.

Warning

On Saga use srun, not mpirun

mpirun can get the number of tasks wrong and can also place tasks incorrectly. We do not fully understand why this happens. When using srun instead of mpirun or mpiexec, we observe correct task placement on Saga.
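
For illustration, a minimal sketch of how to launch an MPI program inside a job script on Saga (my_mpi_program is a placeholder):

    # launch MPI programs with srun, not mpirun/mpiexec
    srun ./my_mpi_program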

The basic allocation units on Saga are cpus and memory. The details of how the billing units are calculated can be found in Projects and accounting.

Most jobs on Saga are normal jobs.

Jobs requiring a lot of memory (> 8 GiB/cpu) should run as bigmem or hugemem jobs.

Jobs that are very short, or that implement checkpointing, can run as optimist jobs, which means they can use otherwise idle resources, but are requeued as soon as a non-optimist job needs those resources.

For development or testing, use a devel job.

Here is a more detailed description of the different job types on Saga:

Normal

  • Allocation units: cpus and memory

  • Job Limits:

    • maximum 256 units

  • Maximum walltime: 7 days

  • Priority: normal

  • Available resources:

    • 200 nodes with 40 cpus and 178.5 GiB RAM

    • 120 nodes with 52 cpus and 178.5 GiB RAM

  • Parameter for sbatch/salloc:

    • None, normal is the default

  • Job Scripts: Normal

This is the default job type. Most jobs are normal jobs.
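
As an illustration, a minimal normal job script might look like the following sketch (the project account nnXXXXk and the module and program names are placeholders):

    #!/bin/bash
    #SBATCH --account=nnXXXXk        # project account (placeholder)
    #SBATCH --job-name=myjob
    #SBATCH --time=0-01:00:00        # walltime; up to 7 days for normal jobs
    #SBATCH --ntasks=16              # number of tasks (cpus)
    #SBATCH --mem-per-cpu=4G         # normal nodes have ~4.5 GiB per cpu

    set -o errexit                   # exit the script on the first error
    module --quiet purge             # start with a clean module environment
    # module load MySoftware/1.2.3   # load needed modules (placeholder)

    srun ./my_program                # remember: srun, not mpirun, on Saga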

Bigmem

  • Allocation units: cpus and memory

  • Job Limits:

    • maximum 256 units

  • Maximum walltime: 14 days

  • Priority: normal

  • Available resources:

    • 28 nodes with 40 cpus and 362 GiB RAM

    • 8 nodes with 64 cpus and 3021 GiB RAM

  • Parameter for sbatch/salloc:

    • --partition=bigmem

  • Job Scripts: Bigmem and Hugemem

Bigmem jobs are meant for jobs that need a lot of memory (RAM), typically more than 8 GiB per cpu. (The normal nodes on Saga have slightly more than 4.5 GiB per cpu.)

Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.
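
As a sketch, a bigmem job script differs from a normal one mainly in the partition and memory request (account and program names are placeholders):

    #!/bin/bash
    #SBATCH --account=nnXXXXk        # project account (placeholder)
    #SBATCH --partition=bigmem
    #SBATCH --time=1-00:00:00        # up to 14 days for bigmem jobs
    #SBATCH --ntasks=4
    #SBATCH --mem-per-cpu=16G        # more than 8 GiB/cpu, hence bigmem

    srun ./my_program                # placeholder program name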

Hugemem

  • Allocation units: cpus and memory

  • Job Limits:

    • maximum 256 units

  • Maximum walltime: 14 days

  • Priority: normal

  • Available resources:

    • 2 nodes with 64 cpus and 6040 GiB RAM

  • Parameter for sbatch/salloc:

    • --partition=hugemem

  • Job Scripts: Bigmem and Hugemem

Hugemem jobs are meant for jobs that need even more memory (RAM) than bigmem jobs.

Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.

Please note that not all of the ordinary software modules will work on the hugemem nodes, due to their different cpu type. If you encounter any software-related issues, we are happy to help you at support@nris.no. As an alternative, you can use the NESSI or EESSI modules, which have been built to support the cpus on the hugemem nodes. To activate them, run module load NESSI/2023.06 (NESSI) or module load EESSI/2023.06 (EESSI) before you load other modules.
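
For example, to use the NESSI modules on a hugemem node, a job script could activate them like this (MySoftware/1.2.3 is a placeholder module):

    module load NESSI/2023.06        # or: module load EESSI/2023.06
    module load MySoftware/1.2.3     # then load the modules you need (placeholder)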

Accel

  • Allocation units: cpus, memory and GPUs

  • Job Limits:

    • maximum 256 units

  • Maximum walltime: 14 days

  • Priority: normal

  • Available resources: 8 nodes (max 7 per user) with 24 cpus, 364 GiB RAM and 4 P100 GPUs.

  • Parameter for sbatch/salloc:

    • --partition=accel

    • --gpus=N, --gpus-per-node=N or similar, with N being the number of GPUs

  • Job Scripts: Accel and A100

Accel jobs give access to the P100 GPUs.

Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.
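
A minimal sketch of an accel job script requesting one P100 GPU (account and program names are placeholders):

    #!/bin/bash
    #SBATCH --account=nnXXXXk        # project account (placeholder)
    #SBATCH --partition=accel
    #SBATCH --gpus=1                 # one P100 GPU
    #SBATCH --ntasks=1
    #SBATCH --mem-per-cpu=8G
    #SBATCH --time=02:00:00

    srun ./my_gpu_program            # placeholder program name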

A100

  • Allocation units: cpus, memory and GPUs

  • Job Limits:

    • maximum 256 units

  • Maximum walltime: 14 days

  • Priority: normal

  • Available resources: 8 nodes (max 7 per user) with 32 cpus, 1000 GiB RAM and 4 A100 GPUs.

  • Parameter for sbatch/salloc:

    • --partition=a100

    • --gpus=N, --gpus-per-node=N or similar, with N being the number of GPUs

  • Job Scripts: Accel and A100

A100 jobs give access to the A100 GPUs.

Can be combined with --qos=devel to get higher priority, but then the maximum wall time (2 hours) and resource limits of devel apply.
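
An A100 job script looks like the accel sketch above, with the partition changed, for example:

    #SBATCH --partition=a100
    #SBATCH --gpus-per-node=2        # e.g. two A100 GPUs on one node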

Devel

  • Allocation units: cpus, memory and GPUs

  • Job Limits:

    • maximum 128 units per job

    • maximum 256 units in use at the same time

    • maximum 2 running jobs per user

  • Maximum walltime: 2 hours

  • Priority: high

  • Available resources: devel jobs can run on any node on Saga

  • Parameter for sbatch/salloc:

    • --qos=devel

  • Job Scripts: Devel

This job type is meant for small, short development or test jobs. Devel jobs get higher priority so that they can start as soon as possible. In return, there are limits on the size and number of devel jobs.

Can be combined with either --partition=accel, --partition=bigmem or --partition=hugemem to increase priority, while the maximum wall time and job limits of devel jobs apply.
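
For example, a short interactive development session can be requested with salloc (nnXXXXk is a placeholder project account):

    # interactive devel session: 4 cpus, 30 minutes
    salloc --account=nnXXXXk --qos=devel --ntasks=4 --mem-per-cpu=2G --time=00:30:00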

If you have temporary development needs that cannot be fulfilled by the devel or short job types, please contact us at support@nris.no.

Optimist

  • Allocation units: cpus and memory

  • Job Limits:

    • maximum 256 units

  • Maximum walltime: None. Jobs will start as soon as resources are available for at least 30 minutes, but can be requeued at any time, so there is no guaranteed minimum run time.

  • Priority: low

  • Available resources: optimist jobs can run on any node on Saga

  • Parameter for sbatch/salloc:

    • --qos=optimist

  • Job Scripts: Optimist

The optimist job type is meant for very short jobs, or jobs with checkpointing (i.e., they save state regularly, so they can restart from where they left off).

Optimist jobs get lower priority than other jobs, but will start as soon as there are free resources for at least 30 minutes. However, when any other non-optimist job needs its resources, the optimist job is stopped and put back on the job queue. This can happen before the optimist job has run 30 minutes, so there is no guaranteed minimum run time.

Therefore, all optimist jobs must use checkpointing, and access to run optimist jobs will only be given to projects that demonstrate that they can use checkpointing. If you want to run optimist jobs, send a request to support@nris.no.
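
As a sketch, an optimist job script typically restarts from its latest checkpoint on each (re)start (account, file and program names are placeholders):

    #!/bin/bash
    #SBATCH --account=nnXXXXk        # project account (placeholder)
    #SBATCH --qos=optimist
    #SBATCH --ntasks=8
    #SBATCH --mem-per-cpu=4G

    # resume from the checkpoint if one exists (placeholder names)
    if [ -f checkpoint.dat ]; then
        srun ./my_program --restart checkpoint.dat
    else
        srun ./my_program
    fi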