Multi-Node Implementation for PyTorch on Olivia

This is part 3 of the PyTorch on Olivia guide. See PyTorch on Olivia for the single-GPU setup and Multi-GPU Implementation for PyTorch on Olivia for the multi-GPU setup.

Multi-node training on Olivia requires proper configuration of NCCL with the OFI plugin and libfabric for the Slingshot interconnect. The job script below handles these configurations.
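
Before submitting, it can be worth verifying that the host-side libraries the job script binds into the container are actually present. A minimal check, using the same host paths as the job script below (the fi_info step assumes the libfabric utilities are available in your environment):

# Check that the host libraries bound into the container exist
ls /opt/cray/libfabric/1.22.0/lib64/libfabric.so*
ls /usr/lib64/libcxi.so.1
ls /cluster/work/projects/nn9997k/software/nccl

# Optional: confirm libfabric exposes the Slingshot (cxi) provider
fi_info -p cxi | head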

Job Script for Multi-Node Training

 1  #!/bin/bash
 2  #SBATCH --account=<project_number>
 3  #SBATCH --job-name=resnet_multinode
 4  #SBATCH --output=multinode_%j.out
 5  #SBATCH --error=multinode_%j.err
 6  #SBATCH --time=01:00:00
 7  #SBATCH --partition=accel
 8  #SBATCH --nodes=2
 9  #SBATCH --ntasks-per-node=1
10  #SBATCH --gpus-per-node=4
11  #SBATCH --cpus-per-task=72
12  #SBATCH --mem=440G
13
14  # Path to the container
15  CONTAINER_PATH="/cluster/work/support/container/pytorch_nvidia_25.06_arm64.sif"
16
17  # Training script and its arguments (exported into the container)
18  export APPTAINERENV_TRAINING_SCRIPT="train_ddp.py --epochs 100 --batch-size 2048 --base-lr 0.04 --target-accuracy 0.95 --patience 2"
19
20  # Paths to libfabric, NCCL, the NVIDIA HPC SDK libraries, and libcxi on the host
21  HOST_LIBFABRIC_LIB_PATH=/opt/cray/libfabric/1.22.0/lib64
22  HOST_LIBFABRIC_INCLUDE_PATH=/opt/cray/libfabric/1.22.0/include
23  HOST_NCCL_PATH=/cluster/work/projects/nn9997k/software/nccl
24  HOST_NVIDIA_HPC_LIB_PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/compilers/lib
25  HOST_CXI_LIB_PATH=/usr/lib64  # Directory containing libcxi.so.1
26
27
28  # Explicitly specify the full path to torchrun
29  export APPTAINERENV_TORCHRUN_PATH="/usr/local/bin/torchrun"
30
31  # Debugging: enable NCCL logs
32  #export APPTAINERENV_NCCL_DEBUG=INFO
33  #export APPTAINERENV_NCCL_DEBUG_SUBSYS=ALL
34
35  # Get the head node and its IP address
36  nodes=( $(scontrol show hostnames $SLURM_JOB_NODELIST) )
37  head_node=${nodes[0]}
38  export APPTAINERENV_head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | awk '{print $1}')
39  echo "Head Node: $head_node"
40  echo "Head Node IP: $APPTAINERENV_head_node_ip"
41
42  # Pass SLURM variables explicitly to the container
43  export APPTAINERENV_SLURM_JOB_NUM_NODES=$SLURM_JOB_NUM_NODES
44  export APPTAINERENV_SLURM_GPUS_ON_NODE=$SLURM_GPUS_ON_NODE
45
46  # Start GPU utilization monitoring in the background
47  GPU_LOG_FILE="multinode.log"
48  echo "Starting GPU utilization monitoring..."
49  nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,utilization.memory,memory.total,memory.used --format=csv -l 5 > $GPU_LOG_FILE &
50
51  # Run the training script with torchrun inside the container
52  srun apptainer exec --nv \
53    --bind $HOST_LIBFABRIC_LIB_PATH:/opt/libfabric/lib \
54    --bind $HOST_LIBFABRIC_INCLUDE_PATH:/opt/libfabric/include \
55    --bind $HOST_NCCL_PATH:/opt/nccl \
56    --bind $HOST_CXI_LIB_PATH:/usr/lib64 \
57    --bind $HOST_NVIDIA_HPC_LIB_PATH:/opt/nvidia/hpc_sdk/lib \
58    --env LIBFABRIC_HOME=/opt/libfabric \
59    --env NCCL_HOME=/opt/nccl \
60    --env NVIDIA_HPC_HOME=/opt/nvidia/hpc_sdk \
61    --env head_node_ip=$APPTAINERENV_head_node_ip \
62    --env TRAINING_SCRIPT="$APPTAINERENV_TRAINING_SCRIPT" \
63    --env TORCHRUN_PATH="$APPTAINERENV_TORCHRUN_PATH" \
64    --env RDZV_ID=$SLURM_JOB_ID \
65    --env SLURM_JOB_NUM_NODES=$APPTAINERENV_SLURM_JOB_NUM_NODES \
66    --env SLURM_GPUS_ON_NODE=$APPTAINERENV_SLURM_GPUS_ON_NODE \
67    $CONTAINER_PATH \
68    bash -c 'export LD_LIBRARY_PATH=$LIBFABRIC_HOME/lib:$NCCL_HOME/lib:$NVIDIA_HPC_HOME/lib:/usr/lib64:$LD_LIBRARY_PATH; \
69    export CPATH=$LIBFABRIC_HOME/include:$CPATH; \
70    $TORCHRUN_PATH \
71    --nnodes=$SLURM_JOB_NUM_NODES \
72    --nproc_per_node=$SLURM_GPUS_ON_NODE \
73    --rdzv_id=$RDZV_ID \
74    --rdzv_backend=c10d \
75    --rdzv_endpoint=$head_node_ip:29500 \
76    $TRAINING_SCRIPT'
77
78  # Stop GPU utilization monitoring
79  echo "Stopping GPU utilization monitoring..."
80  pkill -f "nvidia-smi --query-gpu"
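
Before launching the full two-node run, a quick check that the container's PyTorch build sees the GPUs can save a failed job. A minimal sketch, assuming an interactive allocation on a GPU node is obtained with salloc (adjust to your usual workflow) and using the same container path as the script:

salloc --account=<project_number> --partition=accel --nodes=1 --gpus-per-node=4 --time=00:10:00
srun apptainer exec --nv /cluster/work/support/container/pytorch_nvidia_25.06_arm64.sif \
  python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"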

Key Changes from Multi-GPU to Multi-Node

The table below summarizes the multi-node-specific additions, referenced by the script line numbers above:

Lines    Change                      Purpose
8        --nodes=2                   Request multiple nodes
10       --gpus-per-node=4           GPUs per node (instead of --gpus=4)
18       --batch-size 2048           Larger batch size for 8 GPUs
21-25    Host library paths          Paths to libfabric, NCCL, and CXI libraries on the host
35-40    Head node discovery         Get the head node IP address for the rendezvous
43-44    Pass SLURM variables        Export node and GPU counts to the container
52-67    srun with --bind/--env      Mount host libraries and set the container environment
70-76    torchrun with rendezvous    Use --rdzv_backend=c10d and --rdzv_endpoint for multi-node coordination

Note

The key difference from single-node multi-GPU is the rendezvous setup. Single-node uses --standalone, while multi-node requires explicit coordination via --rdzv_backend=c10d and --rdzv_endpoint pointing to the head node.
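
To make the difference concrete, the two launch modes look roughly like this (illustrative only; the multi-node flags mirror the job script above, and the training arguments are omitted):

# Single node, multi-GPU: local rendezvous, no endpoint needed
torchrun --standalone --nproc_per_node=4 train_ddp.py

# Multi-node: the same command runs on every node (one srun task per node) and
# all nodes rendezvous at the head node's IP on an agreed port (29500 here)
torchrun --nnodes=2 --nproc_per_node=4 \
         --rdzv_id=$SLURM_JOB_ID \
         --rdzv_backend=c10d \
         --rdzv_endpoint=$head_node_ip:29500 \
         train_ddp.py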

The end of the output from this job is shown below:

Epoch 95/100 completed in 0.771 seconds
Validation Loss: 1.1998, Validation Accuracy: 0.7101
Epoch Throughput: 63787.926 images/second
Epoch 96/100 completed in 0.759 seconds
Validation Loss: 1.1924, Validation Accuracy: 0.7090
Epoch Throughput: 64736.418 images/second
Epoch 97/100 completed in 0.770 seconds
Validation Loss: 1.1911, Validation Accuracy: 0.7092
Epoch Throughput: 63812.132 images/second
Epoch 98/100 completed in 0.763 seconds
Validation Loss: 1.1671, Validation Accuracy: 0.7128
Epoch Throughput: 64432.161 images/second
Epoch 99/100 completed in 0.756 seconds
Validation Loss: 1.1799, Validation Accuracy: 0.7160
Epoch Throughput: 64995.126 images/second
Epoch 100/100 completed in 0.767 seconds
Validation Loss: 1.2086, Validation Accuracy: 0.7082
Epoch Throughput: 64118.564 images/second

Training Summary:
Total training time: 77.784 seconds
Throughput: 63190.172 images/second
Number of nodes: 2
Number of GPUs per node: 4
Total GPUs used: 8
Training completed successfully.
Stopping GPU utilization monitoring...

With 8 GPUs across 2 nodes, the throughput increased from ~5,100 images/second (single GPU) to ~63,000 images/second—a 12x speedup. Training time dropped from ~16 minutes to just ~1.3 minutes.
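
The nvidia-smi log written to multinode.log (which only covers the GPUs on the node where the batch script itself runs) can be used to check how busy the GPUs were. A minimal sketch, assuming the CSV column order from the --query-gpu list in the job script:

# Average GPU and memory utilization per GPU index from the nvidia-smi CSV log;
# columns: timestamp, index, name, utilization.gpu, utilization.memory, memory.total, memory.used
awk -F',' '$4 ~ /[0-9]/ {
      gsub(/[ %]/, "", $2); gsub(/[ %]/, "", $4); gsub(/[ %]/, "", $5)
      gpu[$2] += $4; mem[$2] += $5; n[$2]++
    }
    END {
      for (i in gpu) {
        printf "GPU %s: average GPU util %.1f%%, average memory util %.1f%%\n", i, gpu[i]/n[i], mem[i]/n[i]
      }
    }' multinode.log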