Job Placement on Fram
The compute nodes on Fram are divided into four groups, called islands. Each island has about the same number of nodes. The InfiniBand network throughput (“speed”) within an island is higher than the throughput between islands. Some jobs need high network throughput between their nodes, and will usually run faster if they run within a single island.
Default Setup
Therefore, the queue system is configured to run each job within one island, if that does not delay the job too much. It works like this: When a job is submitted, the queue system lets the job wait until there are enough free resources so that it can run within one island. If this has not happened when the job has waited 7 days[1], the job will be started on more than one island.
Overriding the Setup
The downside of requiring that all nodes belonging to a job be in the same island is that the job might have to wait longer in the queue, especially if it needs many nodes. Some jobs do not need high network throughput between their nodes. For such jobs, you can override the setup, either for individual jobs or for all your jobs.
Individual Jobs
For individual jobs, you can use the switch --switches=N[@time] on the command line when submitting the job, where N is the maximal number of islands to use (1, 2, 3 or 4), and time (optional) is the maximum time to wait. See man sbatch for details. Two examples:
--switches=2 # Allow two islands
--switches=1@4-0:0:0 # Change max wait time to 4 days
The maximal possible wait time to specify is 28 days[1]. A longer time will silently be truncated to 28 days!
Note that putting this option in an #SBATCH line in the job script will not work (it will silently be overridden by the environment variables we set to get the default behaviour)!
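Because the option only takes effect on the command line, one way to avoid forgetting it is a small wrapper around sbatch. The following is a minimal sketch and not part of the Fram setup: the function name island_submit, the DRY_RUN variable and the script name myjob.sh are all hypothetical. By default it only prints the command it would run.

```shell
# Hypothetical helper: submit a job script with an island limit given on the
# command line. Usage: island_submit N MAXWAIT SCRIPT (MAXWAIT may be "").
island_submit() {
    n="$1"
    max_wait="$2"
    script="$3"

    # Build --switches=N or --switches=N@time
    opt="--switches=$n"
    if [ -n "$max_wait" ]; then
        opt="$opt@$max_wait"
    fi

    # DRY_RUN defaults to 1: print the command instead of submitting it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sbatch $opt $script"
    else
        sbatch "$opt" "$script"
    fi
}

island_submit 2 "" myjob.sh        # prints: sbatch --switches=2 myjob.sh
island_submit 1 4-0:0:0 myjob.sh   # prints: sbatch --switches=1@4-0:0:0 myjob.sh
```

Setting DRY_RUN=0 before calling the function would make it submit the job instead of printing the command.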
On the other hand, you might want to guarantee that your job never, ever, starts on more than one island. The easiest way to do that is to specify --constraint=[island1|island2|island3|island4] instead (this option can be used either on the command line or in the job script).
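Since the constraint, unlike --switches, can go in the job script, a script header could look roughly like the sketch below. It is deliberately incomplete: the other #SBATCH lines a real job needs (account, time, number of nodes and so on) are left out, and the final echo is only an illustrative way to see where the job landed. SLURM_JOB_NODELIST is set by Slurm inside a running job.

```shell
#!/bin/bash
# Sketch of a job script header. Other #SBATCH lines (account, time, nodes, ...)
# are omitted here and must be added for a real job.
# Require that all nodes of the job come from one and the same island:
#SBATCH --constraint=[island1|island2|island3|island4]

# Report where the job landed; falls back to "unknown" outside a Slurm job.
nodes="${SLURM_JOB_NODELIST:-unknown}"
echo "Job running on nodes: $nodes"
```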
Changing the Defaults
To change the default for your jobs, set the following environment variables:
SBATCH_REQ_SWITCH: Max number of islands for sbatch jobs.
SALLOC_REQ_SWITCH: Max number of islands for salloc jobs.
SRUN_REQ_SWITCH: Max number of islands for srun jobs.
SBATCH_WAIT4SWITCH: Max wait time for sbatch jobs.
SALLOC_WAIT4SWITCH: Max wait time for salloc jobs.
SRUN_WAIT4SWITCH: Max wait time for srun jobs.
salloc and srun jobs are interactive jobs; see Interactive jobs.
As above, the maximal possible wait
time to specify is 28 days[1], and any time longer than that will silently be
truncated. The change takes effect for jobs submitted after you change the
variables. For instance, to change the default to allow two islands, and wait
up to two weeks:
export SBATCH_REQ_SWITCH=2
export SALLOC_REQ_SWITCH=2
export SRUN_REQ_SWITCH=2
export SBATCH_WAIT4SWITCH=14-00:00:00
export SALLOC_WAIT4SWITCH=14-00:00:00
export SRUN_WAIT4SWITCH=14-00:00:00
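To check which values later submissions from the current shell will see, the variables can be listed as in the following sketch. The function name show_island_settings is hypothetical. It uses printenv, so a variable that is set but not exported shows up as "(unset)".

```shell
# Print the island-related variables as exported by the current shell.
# Variables that are not in the environment are shown as "(unset)".
show_island_settings() {
    for v in SBATCH_REQ_SWITCH SALLOC_REQ_SWITCH SRUN_REQ_SWITCH \
             SBATCH_WAIT4SWITCH SALLOC_WAIT4SWITCH SRUN_WAIT4SWITCH; do
        val=$(printenv "$v" || echo "(unset)")
        echo "$v=$val"
    done
}

show_island_settings
```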
Note that we do not recommend that you unset these variables. If you want your jobs to start on any nodes, whichever island they are on, simply set the *_REQ_SWITCH variables to 4. Specifically, if you unset the *_WAIT4SWITCH variables, they will default to 28 days[1]. Also, in the future we might change the underlying mechanism, in which case unsetting these variables will have no effect (but setting them will).
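As a small sketch of the recommendation above, the three *_REQ_SWITCH variables can be set to 4 in one loop instead of being unset. The loop variable name cmd is only illustrative.

```shell
# Allow jobs to start on any nodes, whichever island they are on:
# set the *_REQ_SWITCH variable for sbatch, salloc and srun to 4.
for cmd in SBATCH SALLOC SRUN; do
    export "${cmd}_REQ_SWITCH=4"
done
```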
Footnotes