Multi-Node Implementation for PyTorch on Olivia

This is part 3 of the PyTorch on Olivia guide. See Single-GPU Implementation for PyTorch on Olivia for the single-GPU setup and Multi-GPU Implementation for PyTorch on Olivia for the multi-GPU setup.

Multi-node training on Olivia requires a consistent NCCL-enabled module environment and a stable rendezvous endpoint shared by all nodes. The job script below handles both.

Learning Outcomes

By the end of this part, you will be able to:

  1. Launch PyTorch training across multiple nodes with torchrun.

  2. Configure the required module environment for distributed communication.

  3. Set rendezvous parameters correctly for a stable multi-node start.

Job Script for Multi-Node Training

Choose either module-based launch or direct container launch. The script below shows the module-based launch.

#!/bin/bash
#SBATCH --account=<project_number>
#SBATCH --job-name=resnet_multinode_mod
#SBATCH --output=multinode_module_%j.out
#SBATCH --error=multinode_module_%j.err
#SBATCH --time=01:00:00
#SBATCH --partition=accel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --mem=440G

# Abort on errors, unset variables, and failed pipeline stages.
set -euo pipefail

SCRIPT_DIR="/cluster/work/projects/<project_number>/<username>/pytorch_olivia"

# Load the GPU stack, NCCL, and the PyTorch module.
ml reset
ml load NRIS/GPU
ml load NCCL/2.26.6-GCCcore-14.2.0-CUDA-12.8.0
ml use /cluster/work/support/pytorch_module
ml load PyTorch/2.8.0

export PYTORCH_OVERLAY_MODE=ro

cd "${SCRIPT_DIR}"

# Expand the Slurm node list and use the first host as the rendezvous head node.
mapfile -t nodes < <(scontrol show hostnames "${SLURM_JOB_NODELIST}")
head_node="${nodes[0]}"
export RDZV_ENDPOINT="${head_node}:29500"

echo "Head node: ${head_node}"
echo "Rendezvous endpoint: ${RDZV_ENDPOINT}"

# srun starts one torchrun launcher per node (--ntasks-per-node=1);
# each launcher spawns one worker process per GPU on its node.
srun torchrun \
  --nnodes="${SLURM_JOB_NUM_NODES}" \
  --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
  --rdzv_id="${SLURM_JOB_ID}" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${RDZV_ENDPOINT}" \
  train_ddp.py --epochs 100 --batch-size 2048 --base-lr 0.04 --target-accuracy 0.95 --patience 2
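
Before launching, it can help to confirm how many worker processes torchrun is expected to start in total. The sketch below is a hypothetical pre-flight helper (the name `expected_world_size` is an assumption, not part of this guide). It multiplies the node count by the GPUs per node, which for the request above gives 2 x 4 = 8.

```shell
#!/bin/bash
# Hypothetical helper: total worker count = nodes x GPUs per node.
expected_world_size() {
  local nnodes="$1" gpus_per_node="$2"
  echo $(( nnodes * gpus_per_node ))
}

# Inside a job these values come from Slurm. The fallback values mirror the
# resource request in the job script and only make the sketch runnable
# outside a job.
nnodes="${SLURM_JOB_NUM_NODES:-2}"
gpus_per_node="${SLURM_GPUS_ON_NODE:-4}"
echo "Expected world size: $(expected_world_size "${nnodes}" "${gpus_per_node}")"
```

If the number printed here differs from the "Total GPUs used" line in the training summary, the resource request and the torchrun arguments are probably out of sync.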

The submit and monitor commands are identical for both launch modes; the example below uses the module-based script name.

sbatch multinode_module.sh
squeue -u $USER
tail -f multinode_module_<jobid>.out

Key Changes from Multi-GPU to Multi-Node

The multi-node-specific additions are:

Change                                                      Purpose
----------------------------------------------------------  ----------------------------------------------
#SBATCH --nodes=2 and #SBATCH --gpus-per-node=4             Requests resources on multiple nodes
ml load NRIS/GPU and ml load NCCL/2.26.6-...                Loads the distributed communication stack
Head-node hostname from SLURM_JOB_NODELIST                  Defines rendezvous endpoint for all processes
srun torchrun ... --rdzv_backend=c10d --rdzv_endpoint=...   Coordinates multi-node process-group formation

Note

The key difference from single-node multi-GPU is the rendezvous setup. Single-node uses --standalone, while multi-node requires explicit coordination via --rdzv_backend=c10d and --rdzv_endpoint pointing to the head node.
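
The choice between the two rendezvous styles described in the note can be sketched as a small shell function. This is only an illustration: `rdzv_flags` and its argument order are hypothetical, and the default port 29500 is taken from the job script above.

```shell
#!/bin/bash
# Hypothetical helper: print the torchrun rendezvous flags for a node count.
# Arguments: <nnodes> <head_node> <job_id> [port]
rdzv_flags() {
  local nnodes="$1" head_node="$2" job_id="$3" port="${4:-29500}"
  if [ "${nnodes}" -le 1 ]; then
    # Single node: torchrun sets up a local rendezvous on its own.
    echo "--standalone"
  else
    # Multiple nodes: every launcher must agree on one endpoint and job id.
    echo "--rdzv_id=${job_id} --rdzv_backend=c10d --rdzv_endpoint=${head_node}:${port}"
  fi
}

# Example with made-up host name and job id.
rdzv_flags 1 nodeA 12345
rdzv_flags 2 nodeA 12345
```

Expanding the result unquoted on the torchrun command line, for example `torchrun --nnodes=... $(rdzv_flags ...) train_ddp.py`, would in principle let one script cover both the single-node and the multi-node case.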

The final lines of the job output are shown below:

Epoch 95/100 completed in 0.771 seconds
Validation Loss: 1.1998, Validation Accuracy: 0.7101
Epoch Throughput: 63787.926 images/second
Epoch 96/100 completed in 0.759 seconds
Validation Loss: 1.1924, Validation Accuracy: 0.7090
Epoch Throughput: 64736.418 images/second
Epoch 97/100 completed in 0.770 seconds
Validation Loss: 1.1911, Validation Accuracy: 0.7092
Epoch Throughput: 63812.132 images/second
Epoch 98/100 completed in 0.763 seconds
Validation Loss: 1.1671, Validation Accuracy: 0.7128
Epoch Throughput: 64432.161 images/second
Epoch 99/100 completed in 0.756 seconds
Validation Loss: 1.1799, Validation Accuracy: 0.7160
Epoch Throughput: 64995.126 images/second
Epoch 100/100 completed in 0.767 seconds
Validation Loss: 1.2086, Validation Accuracy: 0.7082
Epoch Throughput: 64118.564 images/second

Training Summary:
Total training time: 77.784 seconds
Throughput: 63190.172 images/second
Number of nodes: 2
Number of GPUs per node: 4
Total GPUs used: 8
Training completed successfully.

With 8 GPUs across 2 nodes, throughput increased from ~5,100 images/second (single GPU) to ~63,000 images/second, roughly a 12x speedup. Training time dropped from ~16 minutes to ~1.3 minutes.
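
The figures in this paragraph can be reproduced from the training summary with a few lines of shell arithmetic. The helper name `ratio` is hypothetical, and the single-GPU values (~5,100 images/second, ~16 minutes) are the approximate numbers quoted above, so the results are approximate as well.

```shell
#!/bin/bash
# Hypothetical helper: divide two numbers and print one decimal place.
ratio() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

multi_node_throughput=63190.172   # images/second, from the training summary
single_gpu_throughput=5100        # images/second, approximate value quoted above
total_time_seconds=77.784         # from the training summary
total_gpus=8

echo "Speedup:            $(ratio "${multi_node_throughput}" "${single_gpu_throughput}")x"   # 12.4x
echo "Training time:      $(ratio "${total_time_seconds}" 60) minutes"                        # 1.3 minutes
echo "Per-GPU throughput: $(ratio "${multi_node_throughput}" "${total_gpus}") images/second" # 7898.8
```

The per-GPU throughput of the 8-GPU run (about 7,900 images/second) is higher than the single-GPU figure, which suggests the two runs differ in more than GPU count (for example in per-GPU batch size), so the 12x value is best read as an end-to-end comparison rather than a pure scaling measurement.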

Success criteria for Part 3:

  • Log shows Head node: with the head-node hostname and Rendezvous endpoint: with that hostname and port 29500

  • Final summary reports Number of nodes: 2 and Total GPUs used: 8

  • Training completes without rendezvous or NCCL startup errors
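
These criteria can be checked mechanically once the job has finished. The sketch below is a hypothetical helper (`check_multinode_log` is not part of this guide) that looks for the fixed strings printed by the job script and by the training summary shown earlier. It covers only the positive criteria; rendezvous or NCCL error messages would still need to be looked for in the `.err` file by hand.

```shell
#!/bin/bash
# Hypothetical helper: verify that a job output file contains the expected lines.
# Prints one "MISSING: ..." line per absent string and returns non-zero if any.
check_multinode_log() {
  local out_file="$1" status=0 pattern
  local expected=(
    "Head node: "
    "Number of nodes: 2"
    "Total GPUs used: 8"
    "Training completed successfully."
  )
  for pattern in "${expected[@]}"; do
    # -F: match the text literally, -q: no output, only the exit status.
    if ! grep -qF -- "${pattern}" "${out_file}"; then
      echo "MISSING: ${pattern}"
      status=1
    fi
  done
  return "${status}"
}

# Usage inside the submit directory (replace <jobid> with the real job id):
#   check_multinode_log multinode_module_<jobid>.out && echo "Part 3 checks passed"
```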