Multi-Node Implementation for PyTorch on Olivia
This is part 3 of the PyTorch on Olivia guide. See PyTorch on Olivia for the single-GPU setup and Multi-GPU Implementation for PyTorch on Olivia for the multi-GPU setup.
Multi-node training on Olivia requires proper configuration of NCCL with the OFI plugin and libfabric for the Slingshot interconnect. The job script below handles these configurations.
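Before digging into interconnect settings when something fails, it can help to first confirm that the container's PyTorch build sees the GPUs and ships with NCCL. The short check below is an illustrative sketch you could run interactively inside the container (for example in an `apptainer shell --nv` session on a GPU node); it is not part of the job script.

```python
# Quick sanity check of the container's PyTorch build (illustrative, not part of the job script)
import torch
import torch.distributed as dist

print("CUDA available :", torch.cuda.is_available())
print("Visible GPUs   :", torch.cuda.device_count())
print("NCCL available :", dist.is_nccl_available())
if torch.cuda.is_available() and dist.is_nccl_available():
    print("NCCL version   :", torch.cuda.nccl.version())
    print("Device 0       :", torch.cuda.get_device_name(0))
```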
Job Script for Multi-Node Training
```
 1  #!/bin/bash
 2  #SBATCH --account=<project_number>
 3  #SBATCH --job-name=resnet_multinode
 4  #SBATCH --output=multinode_%j.out
 5  #SBATCH --error=multinode_%j.err
 6  #SBATCH --time=01:00:00
 7  #SBATCH --partition=accel
 8  #SBATCH --nodes=2
 9  #SBATCH --ntasks-per-node=1
10  #SBATCH --gpus-per-node=4
11  #SBATCH --cpus-per-task=72
12  #SBATCH --mem=440G
13
14  # Path to the container
15  CONTAINER_PATH="/cluster/work/support/container/pytorch_nvidia_25.06_arm64.sif"
16
17  # Path to the training script
18  export APPTAINERENV_TRAINING_SCRIPT="train_ddp.py --epochs 100 --batch-size 2048 --base-lr 0.04 --target-accuracy 0.95 --patience 2"
19
20  # Set the libfabric and nccl path from the host
21  HOST_LIBFABRIC_LIB_PATH=/opt/cray/libfabric/1.22.0/lib64
22  HOST_LIBFABRIC_INCLUDE_PATH=/opt/cray/libfabric/1.22.0/include
23  HOST_NCCL_PATH=/cluster/work/projects/nn9997k/software/nccl
24  HOST_NVIDIA_HPC_LIB_PATH=/opt/nvidia/hpc_sdk/Linux_aarch64/24.11/compilers/lib
25  HOST_CXI_LIB_PATH=/usr/lib64  # Directory containing libcxi.so.1
26
27
28  # Explicitly specify the full path to torchrun
29  export APPTAINERENV_TORCHRUN_PATH="/usr/local/bin/torchrun"
30
31  # Debugging: Enable NCCL logs
32  #export APPTAINERENV_NCCL_DEBUG=INFO
33  #export APPTAINERENV_NCCL_DEBUG_SUBSYS=ALL
34
35  # Get the head node and its IP address
36  nodes=( $(scontrol show hostnames $SLURM_JOB_NODELIST) )
37  head_node=${nodes[0]}
38  export APPTAINERENV_head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | awk '{print $1}')
39  echo "Head Node: $head_node"
40  echo "Head Node IP: $APPTAINERENV_head_node_ip"
41
42  # Pass SLURM variables explicitly to the container
43  export APPTAINERENV_SLURM_JOB_NUM_NODES=$SLURM_JOB_NUM_NODES
44  export APPTAINERENV_SLURM_GPUS_ON_NODE=$SLURM_GPUS_ON_NODE
45
46  # Start GPU utilization monitoring in the background
47  GPU_LOG_FILE="multinode.log"
48  echo "Starting GPU utilization monitoring..."
49  nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,utilization.memory,memory.total,memory.used --format=csv -l 5 > $GPU_LOG_FILE &
50
51  # Run the training script with torchrun inside the container
52  srun apptainer exec --nv \
53      --bind $HOST_LIBFABRIC_LIB_PATH:/opt/libfabric/lib \
54      --bind $HOST_LIBFABRIC_INCLUDE_PATH:/opt/libfabric/include \
55      --bind $HOST_NCCL_PATH:/opt/nccl \
56      --bind $HOST_CXI_LIB_PATH:/usr/lib64 \
57      --bind $HOST_NVIDIA_HPC_LIB_PATH:/opt/nvidia/hpc_sdk/lib \
58      --env LIBFABRIC_HOME=/opt/libfabric \
59      --env NCCL_HOME=/opt/nccl \
60      --env NVIDIA_HPC_HOME=/opt/nvidia/hpc_sdk \
61      --env head_node_ip=$APPTAINERENV_head_node_ip \
62      --env TRAINING_SCRIPT="$APPTAINERENV_TRAINING_SCRIPT" \
63      --env TORCHRUN_PATH="$APPTAINERENV_TORCHRUN_PATH" \
64      --env RDZV_ID=$SLURM_JOB_ID \
65      --env SLURM_JOB_NUM_NODES=$APPTAINERENV_SLURM_JOB_NUM_NODES \
66      --env SLURM_GPUS_ON_NODE=$APPTAINERENV_SLURM_GPUS_ON_NODE \
67      $CONTAINER_PATH \
68      bash -c 'export LD_LIBRARY_PATH=$LIBFABRIC_HOME/lib:$NCCL_HOME/lib:$NVIDIA_HPC_HOME/lib:/usr/lib64:$LD_LIBRARY_PATH; \
69      export CPATH=$LIBFABRIC_HOME/include:$CPATH; \
70      $TORCHRUN_PATH \
71      --nnodes=$SLURM_JOB_NUM_NODES \
72      --nproc_per_node=$SLURM_GPUS_ON_NODE \
73      --rdzv_id=$RDZV_ID \
74      --rdzv_backend=c10d \
75      --rdzv_endpoint=$head_node_ip:29500 \
76      $TRAINING_SCRIPT'
77
78  # Stop GPU utilization monitoring
79  echo "Stopping GPU utilization monitoring..."
80  pkill -f "nvidia-smi --query-gpu"
```
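The nvidia-smi monitor writes its CSV samples to multinode.log on the node where the batch script itself runs (the first node of the allocation). After the job, a small script such as the sketch below, which assumes the column order of the --query-gpu list above, can summarize average utilization per GPU.

```python
# Summarize mean GPU utilization from the nvidia-smi CSV log (illustrative sketch)
import csv
from collections import defaultdict

samples = defaultdict(list)  # GPU index -> list of utilization percentages

with open("multinode.log", newline="") as f:
    for row in csv.reader(f):
        # Columns: timestamp, index, name, utilization.gpu, utilization.memory,
        #          memory.total, memory.used
        if len(row) < 7 or row[0].strip().startswith("timestamp"):
            continue  # skip the header and any partial rows
        parts = row[3].split()  # e.g. " 87 %" -> ["87", "%"]
        if parts and parts[0].isdigit():
            samples[row[1].strip()].append(int(parts[0]))

for gpu, values in sorted(samples.items()):
    print(f"GPU {gpu}: mean utilization {sum(values) / len(values):.1f}% "
          f"over {len(values)} samples")
```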
Key Changes from Multi-GPU to Multi-Node
The table below summarizes the multi-node-specific additions; the line numbers refer to the job script above:
| Lines | Change | Purpose |
|---|---|---|
| 8 | --nodes=2 | Request multiple nodes |
| 10 | --gpus-per-node=4 | GPUs per node |
| 18 | --batch-size 2048 | Larger batch for 8 GPUs |
| 21-25 | Host library paths | Paths to libfabric, NCCL, CXI on host |
| 35-40 | Head node discovery | Get head node IP for rendezvous |
| 43-44 | Pass SLURM vars | Export node/GPU counts to container |
| 52-67 | Bind mounts and --env flags | Mount host libraries and set environment |
| 70-76 | torchrun rendezvous flags | Use --rdzv_backend=c10d and --rdzv_endpoint instead of --standalone |
Note
The key difference from single-node multi-GPU is the rendezvous setup. Single-node uses --standalone, while multi-node requires explicit coordination via --rdzv_backend=c10d and --rdzv_endpoint pointing to the head node.
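The training script itself does not need to change between the two modes: torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for every worker process in both standalone and c10d rendezvous. The sketch below shows the kind of initialization a torchrun-launched DDP script typically contains; it is illustrative and not the actual contents of train_ddp.py.

```python
# Typical initialization for a torchrun-launched DDP script (illustrative sketch)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp() -> int:
    """Initialize the default process group using the env vars set by torchrun."""
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node (0-3 here)
    torch.cuda.set_device(local_rank)
    # env:// initialization reads RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT from torchrun
    dist.init_process_group(backend="nccl")
    return local_rank

if __name__ == "__main__":
    local_rank = setup_ddp()
    # Any model wrapped in DDP all-reduces gradients across all ranks (8 in this example)
    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} using local GPU {local_rank}")
    dist.destroy_process_group()
```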
The output of this job script is shown below:
```
Epoch 95/100 completed in 0.771 seconds
Validation Loss: 1.1998, Validation Accuracy: 0.7101
Epoch Throughput: 63787.926 images/second
Epoch 96/100 completed in 0.759 seconds
Validation Loss: 1.1924, Validation Accuracy: 0.7090
Epoch Throughput: 64736.418 images/second
Epoch 97/100 completed in 0.770 seconds
Validation Loss: 1.1911, Validation Accuracy: 0.7092
Epoch Throughput: 63812.132 images/second
Epoch 98/100 completed in 0.763 seconds
Validation Loss: 1.1671, Validation Accuracy: 0.7128
Epoch Throughput: 64432.161 images/second
Epoch 99/100 completed in 0.756 seconds
Validation Loss: 1.1799, Validation Accuracy: 0.7160
Epoch Throughput: 64995.126 images/second
Epoch 100/100 completed in 0.767 seconds
Validation Loss: 1.2086, Validation Accuracy: 0.7082
Epoch Throughput: 64118.564 images/second
Training Summary:
Total training time: 77.784 seconds
Throughput: 63190.172 images/second
Number of nodes: 2
Number of GPUs per node: 4
Total GPUs used: 8
Training completed successfully.
Stopping GPU utilization monitoring...
```
With 8 GPUs across 2 nodes, the throughput increased from ~5,100 images/second (single GPU) to ~63,000 images/second—a 12x speedup. Training time dropped from ~16 minutes to just ~1.3 minutes.
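As a quick check, the reported figures are consistent with each other: dividing the multi-node throughput by the single-GPU throughput reproduces both the ~12x speedup and the ~16 minute single-GPU training time. The numbers below are taken from this guide's output; your own runs may differ slightly.

```python
# Back-of-the-envelope check of the reported scaling (figures from this guide)
single_gpu_throughput = 5100.0     # images/s, single-GPU run (approximate)
multi_node_throughput = 63190.172  # images/s, 2 nodes x 4 GPUs (training summary above)
multi_node_time_s = 77.784         # seconds, total multi-node training time

speedup = multi_node_throughput / single_gpu_throughput
print(f"Throughput speedup: {speedup:.1f}x")                                    # ~12.4x
print(f"Implied single-GPU time: {multi_node_time_s * speedup / 60:.1f} min")   # ~16 min
```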