PyTorch Software Options on Olivia

Olivia supports three distinct ways to run PyTorch workloads:

  1. Module path through the NRIS GPU module stack.

  2. Direct container path using Apptainer explicitly.

  3. EESSI path using the EESSI software stack.

The PyTorch overview page (PyTorch on Olivia) links to the scaling guides and supporting reference pages. This page focuses only on the software choices available to you.

For validated examples of all three approaches, see Single-GPU Implementation for PyTorch on Olivia (one GPU), Multi-GPU Implementation for PyTorch on Olivia (several GPUs on one node), and Multi-Node Implementation for PyTorch on Olivia (runs spanning several nodes).

Software Paths

The module path is the main user-facing solution on Olivia. It looks like a normal module workflow, but the runtime is container-backed underneath.
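
Because the runtime is container-backed, it can be useful to confirm from inside Python whether a process is actually running in a container. The sketch below is one hedged way to do that. It relies on the APPTAINER_CONTAINER and APPTAINER_NAME environment variables that Apptainer sets inside a running container. Whether the Olivia module wrapper starts your Python process inside the container is an assumption here, and container_info is a hypothetical helper name, not part of any Olivia tooling.

```python
import os


def container_info(env=None):
    """Return Apptainer details from the environment, or None outside a container.

    Assumption: the process was started by Apptainer, which exports
    APPTAINER_CONTAINER (image path) and APPTAINER_NAME (image name).
    """
    env = os.environ if env is None else env
    image = env.get("APPTAINER_CONTAINER")
    if not image:
        return None
    return {"image": image, "name": env.get("APPTAINER_NAME", "")}


if __name__ == "__main__":
    info = container_info()
    if info is None:
        print("No Apptainer container detected in this process.")
    else:
        print(f"Running inside Apptainer image: {info['image']}")
```

If the check reports no container while the PyTorch module is loaded, that does not necessarily indicate a problem. It may only mean the wrapper launches the container differently from what this sketch assumes.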

During the current rollout, examples may still include:

ml use /cluster/work/support/temporary_modules

This line adds a temporary module root and will be dropped from the examples once the PyTorch module is fully published.
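
To see whether the temporary root is active in a given session, you can inspect the module search path from Python. This is a minimal sketch, assuming an Lmod-style module system in which ml use records roots in the colon-separated MODULEPATH environment variable. The helper names module_roots and has_temporary_root are hypothetical.

```python
import os

# Temporary module root quoted in the rollout examples.
TEMPORARY_ROOT = "/cluster/work/support/temporary_modules"


def module_roots(modulepath=None):
    """Split a MODULEPATH-style string into normalized directory entries."""
    if modulepath is None:
        modulepath = os.environ.get("MODULEPATH", "")
    # Skip empty entries and normalize away trailing slashes.
    return [os.path.normpath(p) for p in modulepath.split(":") if p]


def has_temporary_root(modulepath=None, root=TEMPORARY_ROOT):
    """Report whether the temporary module root is on the search path."""
    return os.path.normpath(root) in module_roots(modulepath)


if __name__ == "__main__":
    state = "present" if has_temporary_root() else "absent"
    print(f"Temporary module root is {state} in MODULEPATH.")
```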

Typical loading currently looks like this:

ml reset
ml load NRIS/GPU
ml load NCCL/2.26.6-GCCcore-14.2.0-CUDA-12.8.0
ml use /cluster/work/support/temporary_modules
ml load PyTorch/2.8.0

In this stack:

  1. ml reset starts from a clean module environment.

  2. NRIS/GPU selects the ARM software stack used on Olivia GPU compute nodes.

  3. NCCL/2.26.6-GCCcore-14.2.0-CUDA-12.8.0 loads the GPU communication stack needed for multi-GPU and multi-node PyTorch jobs. On Olivia, this also brings in the supporting communication layer used with NCCL, including components such as libfabric and AWS OFI NCCL.

  4. ml use /cluster/work/support/temporary_modules adds the temporary module root used during the current rollout.

  5. PyTorch/2.8.0 loads the user-facing PyTorch module. Under the hood, this module is a wrapper around the container-backed PyTorch runtime.
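
After loading the stack described above, a short Python check can confirm what the runtime actually exposes. The sketch below is not an official Olivia validation script. It only uses standard PyTorch calls (torch.cuda.is_available, torch.cuda.device_count, torch.distributed.is_nccl_available) and returns early if torch cannot be imported. The function name pytorch_report is hypothetical.

```python
import importlib.util


def pytorch_report():
    """Collect basic facts about the PyTorch runtime visible to this process."""
    if importlib.util.find_spec("torch") is None:
        # torch is not importable, e.g. the module stack is not loaded.
        return {"torch_found": False}

    import torch
    import torch.distributed as dist

    return {
        "torch_found": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        # is_nccl_available only exists when distributed support is built in.
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }


if __name__ == "__main__":
    for key, value in pytorch_report().items():
        print(f"{key}: {value}")
```

Run it on a GPU compute node rather than a login node, since GPU visibility depends on where the process executes. A gpu_count of zero on a login node is expected.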

If you want to see the container-based launch model directly, the PyTorch guide pages also show the equivalent direct-container examples for single-GPU, multi-GPU, and multi-node runs.

The module path is the recommended default when the published module already covers your workflow.

If you need to add Python packages to the module path or the direct container path, see Adding Python Packages to PyTorch Containers.

If you want recommendations for where to keep overlays, models, datasets, and Hugging Face caches on Olivia, see Models, Datasets, Caches, and Overlays on Olivia.