Multi-Node Implementation for PyTorch on Olivia
This is part 3 of the PyTorch on Olivia guide. See Single-GPU Implementation for PyTorch on Olivia for single-GPU and Multi-GPU Implementation for PyTorch on Olivia for multi-GPU setup.
Multi-node training on Olivia requires a consistent NCCL-enabled module environment and a stable rendezvous endpoint shared by all nodes. Each job script below handles both.
Learning Outcomes
By the end of this part, you can:
Launch PyTorch training across multiple nodes with torchrun.
Configure the required module environment for distributed communication.
Set rendezvous parameters correctly for a stable multi-node start.
Job Script for Multi-Node Training
Choose one of three launch modes: module-based launch, direct container launch, or EESSI-based launch.
Module-based launch (multinode_module.sh):
#!/bin/bash
#SBATCH --account=<project_number>
#SBATCH --job-name=resnet_multinode_mod
#SBATCH --output=multinode_module_%j.out
#SBATCH --error=multinode_module_%j.err
#SBATCH --time=01:00:00
#SBATCH --partition=accel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --mem=440G

set -euo pipefail

SCRIPT_DIR="/cluster/work/projects/<project_number>/<username>/pytorch_olivia"

# Load the NCCL-enabled module environment.
ml reset
ml load NRIS/GPU
ml load NCCL/2.26.6-GCCcore-14.2.0-CUDA-12.8.0
ml use /cluster/work/support/pytorch_module
ml load PyTorch/2.8.0

export PYTORCH_OVERLAY_MODE=ro

cd "${SCRIPT_DIR}"

# Use the first node of the allocation as the rendezvous host.
mapfile -t nodes < <(scontrol show hostnames "${SLURM_JOB_NODELIST}")
head_node="${nodes[0]}"
export RDZV_ENDPOINT="${head_node}:29500"

echo "Head node: ${head_node}"
echo "Rendezvous endpoint: ${RDZV_ENDPOINT}"

# One torchrun launcher per node; each launcher starts one worker per GPU.
srun torchrun \
    --nnodes="${SLURM_JOB_NUM_NODES}" \
    --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
    --rdzv_id="${SLURM_JOB_ID}" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${RDZV_ENDPOINT}" \
    train_ddp.py --epochs 100 --batch-size 2048 --base-lr 0.04 --target-accuracy 0.95 --patience 2
Direct container launch (multinode_container.sh):
#!/bin/bash
#SBATCH --account=<project_number>
#SBATCH --job-name=resnet_multinode_ctr
#SBATCH --output=multinode_container_%j.out
#SBATCH --error=multinode_container_%j.err
#SBATCH --time=01:00:00
#SBATCH --partition=accel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --mem=440G

set -euo pipefail

CONTAINER_PATH="/cluster/work/support/container/pytorch_nvidia_25.06_arm64.sif"
SCRIPT_DIR="/cluster/work/projects/<project_number>/<username>/pytorch_olivia"
TRAINING_SCRIPT="train_ddp.py --epochs 100 --batch-size 2048 --base-lr 0.04 --target-accuracy 0.95 --patience 2"

ml reset
ml load NRIS/GPU
ml load NCCL/2.26.6-GCCcore-14.2.0-CUDA-12.8.0

# Host paths of the communication libraries that are bind-mounted into the container.
LIBFABRIC_LIB_PATH="${EBROOTLIBFABRIC}/lib"
LIBFABRIC_INCLUDE_PATH="${EBROOTLIBFABRIC}/include"
NCCL_ROOT_PATH="${EBROOTNCCL}"
AWS_OFI_NCCL_LIB_PATH="${EBROOTAWSMINOFIMINNCCL}/lib"
CXI_LIB_PATH="/usr/lib64"

# Keep Hugging Face and Torch caches inside the project directory.
HF_ROOT="${SCRIPT_DIR}/hf_cache"
mkdir -p "${HF_ROOT}/hub" "${HF_ROOT}/datasets" "${HF_ROOT}/torch"

cd "${SCRIPT_DIR}"

# Resolve the IP address of the first node and use it as the rendezvous host.
mapfile -t nodes < <(scontrol show hostnames "${SLURM_JOB_NODELIST}")
head_node="${nodes[0]}"
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "${head_node}" hostname --ip-address | awk '{print $1}')
rdzv_endpoint="${head_node_ip}:29500"

echo "Head node: ${head_node}"
echo "Head node IP: ${head_node_ip}"
echo "Rendezvous endpoint: ${rdzv_endpoint}"

# Inside the container: expose the bind-mounted libraries, then launch torchrun.
srun apptainer exec --nv \
    --bind "${SCRIPT_DIR}:${SCRIPT_DIR}" \
    --bind "${LIBFABRIC_LIB_PATH}:/opt/libfabric/lib" \
    --bind "${LIBFABRIC_INCLUDE_PATH}:/opt/libfabric/include" \
    --bind "${NCCL_ROOT_PATH}:/opt/nccl" \
    --bind "${AWS_OFI_NCCL_LIB_PATH}:/opt/aws-ofi-nccl/lib" \
    --bind "${CXI_LIB_PATH}:${CXI_LIB_PATH}" \
    --pwd "${SCRIPT_DIR}" \
    --env FI_PROVIDER="${FI_PROVIDER:-cxi}" \
    --env FI_CXI_RX_MATCH_MODE="${FI_CXI_RX_MATCH_MODE:-hybrid}" \
    --env NCCL_PROTO="${NCCL_PROTO:-^LL128}" \
    --env LIBFABRIC_HOME="/opt/libfabric" \
    --env NCCL_HOME="/opt/nccl" \
    --env AWS_OFI_NCCL_HOME="/opt/aws-ofi-nccl" \
    --env RDZV_ENDPOINT="${rdzv_endpoint}" \
    --env RDZV_ID="${SLURM_JOB_ID}" \
    --env SLURM_JOB_NUM_NODES="${SLURM_JOB_NUM_NODES}" \
    --env SLURM_GPUS_ON_NODE="${SLURM_GPUS_ON_NODE}" \
    --env TRAINING_SCRIPT="${TRAINING_SCRIPT}" \
    --env HF_HOME="${HF_ROOT}" \
    --env HF_HUB_CACHE="${HF_ROOT}/hub" \
    --env HF_DATASETS_CACHE="${HF_ROOT}/datasets" \
    --env TRANSFORMERS_CACHE="${HF_ROOT}/hub" \
    --env TORCH_HOME="${HF_ROOT}/torch" \
    "${CONTAINER_PATH}" \
    bash -lc '
        export LD_LIBRARY_PATH="${LIBFABRIC_HOME}/lib:${NCCL_HOME}/lib:${AWS_OFI_NCCL_HOME}/lib:/usr/lib64:${LD_LIBRARY_PATH}"
        export CPATH="${LIBFABRIC_HOME}/include:${CPATH:-}"
        torchrun \
            --nnodes="${SLURM_JOB_NUM_NODES}" \
            --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
            --rdzv_id="${RDZV_ID}" \
            --rdzv_backend=c10d \
            --rdzv_endpoint="${RDZV_ENDPOINT}" \
            ${TRAINING_SCRIPT}
    '
EESSI-based launch (multinode_eessi.sh):
#!/bin/bash
#SBATCH --account=<project_number>
#SBATCH --job-name=resnet_multinode_eessi
#SBATCH --output=multinode_eessi_%j.out
#SBATCH --error=multinode_eessi_%j.err
#SBATCH --time=01:00:00
#SBATCH --partition=accel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=72
#SBATCH --mem=440G

set -euo pipefail

SCRIPT_DIR="/cluster/work/projects/<project_number>/<username>/pytorch_olivia"

# Load PyTorch and torchvision from the EESSI software stack.
ml reset
module load EESSI/2025.06
module load PyTorch/2.7.1-foss-2024a-CUDA-12.6.0
module load torchvision/0.22.0-foss-2024a-CUDA-12.6.0

cd "${SCRIPT_DIR}"

# Use the first node of the allocation as the rendezvous host.
mapfile -t nodes < <(scontrol show hostnames "${SLURM_JOB_NODELIST}")
head_node="${nodes[0]}"
export RDZV_ENDPOINT="${head_node}:29500"

echo "Head node: ${head_node}"
echo "Rendezvous endpoint: ${RDZV_ENDPOINT}"

srun torchrun \
    --nnodes="${SLURM_JOB_NUM_NODES}" \
    --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
    --rdzv_id="${SLURM_JOB_ID}" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${RDZV_ENDPOINT}" \
    train_ddp.py --epochs 100 --batch-size 2048 --base-lr 0.04 --target-accuracy 0.95 --patience 2
The submit and monitor commands follow the same pattern for all three launch modes; only the script and log file names differ.
Module-based launch:
sbatch multinode_module.sh
squeue -u $USER
tail -f multinode_module_<jobid>.out
Direct container launch:
sbatch multinode_container.sh
squeue -u $USER
tail -f multinode_container_<jobid>.out
EESSI-based launch:
sbatch multinode_eessi.sh
squeue -u $USER
tail -f multinode_eessi_<jobid>.out
Key Changes from Multi-GPU to Multi-Node
The multi-node-specific additions are:
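| Change | Purpose |
|---|---|
| `#SBATCH --nodes=2` | Requests resources on multiple nodes |
| `ml load NCCL/2.26.6-GCCcore-14.2.0-CUDA-12.8.0` | Loads the distributed communication stack |
| Head-node hostname from `scontrol show hostnames` | Defines rendezvous endpoint for all processes |
| `--rdzv_backend=c10d` and `--rdzv_endpoint` | Coordinates multi-node process-group formation |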
Note
The key difference from single-node multi-GPU is the rendezvous setup. Single-node uses --standalone, while multi-node requires explicit coordination via --rdzv_backend=c10d and --rdzv_endpoint pointing to the head node.
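Once the rendezvous succeeds, torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker process. The source of `train_ddp.py` is not shown in this part, so the following is only a minimal sketch of how a training script could read those variables and compare them with the Slurm allocation. The names `WorkerInfo`, `read_worker_info`, and `expected_world_size` are hypothetical helpers introduced for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerInfo:
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this node; typically selects the GPU
    world_size: int  # total number of worker processes in the job


def read_worker_info(env: Mapping[str, str] = os.environ) -> WorkerInfo:
    """Read the variables that torchrun sets for each worker (hypothetical helper)."""
    return WorkerInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


def expected_world_size(env: Mapping[str, str] = os.environ) -> int:
    """World size implied by the Slurm allocation: nodes x GPUs per node."""
    return int(env["SLURM_JOB_NUM_NODES"]) * int(env["SLURM_GPUS_ON_NODE"])


# Illustrative environment for the last worker of a 2-node, 4-GPU-per-node job.
demo_env = {
    "RANK": "7",
    "LOCAL_RANK": "3",
    "WORLD_SIZE": "8",
    "SLURM_JOB_NUM_NODES": "2",
    "SLURM_GPUS_ON_NODE": "4",
}
info = read_worker_info(demo_env)
assert info.world_size == expected_world_size(demo_env)
print(info)
```

In a real run the functions would be called without arguments so that they read `os.environ`. A mismatch between the two world sizes is a hint that not all nodes joined the rendezvous.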
The output of this job script is shown below:
Epoch 95/100 completed in 0.771 seconds
Validation Loss: 1.1998, Validation Accuracy: 0.7101
Epoch Throughput: 63787.926 images/second
Epoch 96/100 completed in 0.759 seconds
Validation Loss: 1.1924, Validation Accuracy: 0.7090
Epoch Throughput: 64736.418 images/second
Epoch 97/100 completed in 0.770 seconds
Validation Loss: 1.1911, Validation Accuracy: 0.7092
Epoch Throughput: 63812.132 images/second
Epoch 98/100 completed in 0.763 seconds
Validation Loss: 1.1671, Validation Accuracy: 0.7128
Epoch Throughput: 64432.161 images/second
Epoch 99/100 completed in 0.756 seconds
Validation Loss: 1.1799, Validation Accuracy: 0.7160
Epoch Throughput: 64995.126 images/second
Epoch 100/100 completed in 0.767 seconds
Validation Loss: 1.2086, Validation Accuracy: 0.7082
Epoch Throughput: 64118.564 images/second
Training Summary:
Total training time: 77.784 seconds
Throughput: 63190.172 images/second
Number of nodes: 2
Number of GPUs per node: 4
Total GPUs used: 8
Training completed successfully.
With 8 GPUs across 2 nodes, throughput increased from ~5,100 images/second (single GPU) to ~63,000 images/second, a roughly 12x speedup. Training time dropped from ~16 minutes to ~1.3 minutes.
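The speedup figure can be reproduced from the numbers quoted in this part. The sketch below uses the approximate single-GPU throughput (~5,100 images/second) and the summary values from the log above; the function names `speedup` and `scaling_efficiency` are illustrative, not part of any library.

```python
def speedup(baseline_throughput: float, measured_throughput: float) -> float:
    """Ratio of measured throughput to the baseline throughput."""
    return measured_throughput / baseline_throughput


def scaling_efficiency(speedup_factor: float, num_gpus: int) -> float:
    """Speedup divided by GPU count; 1.0 corresponds to linear scaling."""
    return speedup_factor / num_gpus


single_gpu_throughput = 5100.0     # approximate figure quoted in the text
multi_node_throughput = 63190.172  # "Throughput" from the training summary
total_training_seconds = 77.784    # "Total training time" from the summary
num_gpus = 8

s = speedup(single_gpu_throughput, multi_node_throughput)
print(f"Speedup: {s:.1f}x")
print(f"Scaling efficiency: {scaling_efficiency(s, num_gpus):.2f}")
print(f"Training time: {total_training_seconds / 60:.1f} minutes")
```

An efficiency above 1.0 on 8 GPUs usually means the two runs are not strictly comparable, for example because of different batch sizes or data-loading settings. That is an assumption to verify against the single-GPU part, not a conclusion drawn from this log.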
Success criteria for Part 3:
Log shows Head node: and the rendezvous endpoint (the container launch also prints the resolved head-node IP).
Final summary reports Number of nodes: 2 and Total GPUs used: 8.
Training completes without rendezvous or NCCL startup errors.
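These checks can be automated against the job log. The following is a rough sketch: the function name `check_part3_log` is hypothetical, the error detection is a simple keyword heuristic rather than an exhaustive list of rendezvous or NCCL messages, and the hostname `node-a` in the sample is made up.

```python
import re
from typing import Dict


def check_part3_log(text: str, nodes: int = 2, total_gpus: int = 8) -> Dict[str, bool]:
    """Check job output against the Part 3 success criteria (heuristic sketch)."""
    lines = text.splitlines()
    # Heuristic: any line mentioning an error together with NCCL or rendezvous.
    startup_error = any(
        "error" in line.lower()
        and ("nccl" in line.lower() or "rendezvous" in line.lower())
        for line in lines
    )
    return {
        "head_node_logged": any(line.startswith("Head node:") for line in lines),
        "node_count_ok": re.search(
            rf"^\s*Number of nodes:\s*{nodes}\s*$", text, re.MULTILINE
        ) is not None,
        "gpu_count_ok": re.search(
            rf"^\s*Total GPUs used:\s*{total_gpus}\s*$", text, re.MULTILINE
        ) is not None,
        "completed": "Training completed successfully." in text,
        "no_startup_errors": not startup_error,
    }


# Sample log text; "node-a" is an invented hostname used only for illustration.
sample_log = """Head node: node-a
Rendezvous endpoint: node-a:29500
Number of nodes: 2
Number of GPUs per node: 4
Total GPUs used: 8
Training completed successfully.
"""
results = check_part3_log(sample_log)
print(results)
print("All criteria met:", all(results.values()))
```

Because the job scripts write stdout and stderr to separate files, pass the concatenated contents of the `.out` and `.err` files so that startup errors are not missed.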