PyTorch on Olivia

In this guide, we test PyTorch on the Olivia system, whose compute nodes use the Aarch64 architecture. To do this, we use PyTorch wheels built for this architecture.

Note: This documentation is a work in progress and is intended for testing purposes. If you encounter any issues or something does not work as expected, please let us know. Furthermore, this approach relies on pip installation, which can degrade file system performance. As a result, it will no longer be permitted after the pilot phase.

Key Considerations:

Different Architectures:

The login node and the compute node on Olivia have different architectures. The login node uses the x86_64 architecture, while the compute node uses Aarch64. This means we cannot install software directly on the login node and expect it to work on the compute node.
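
A quick way to confirm which architecture a given node uses is to ask Python itself. The snippet below is a minimal sketch using only the standard library; the helper name is_aarch64 is made up for illustration. Run on the login node it should report x86_64, and inside a job on a compute node it should report aarch64.

```python
import platform


def is_aarch64():
    """Return True if the current machine reports an Aarch64 (ARM64) CPU."""
    return platform.machine().lower() in {"aarch64", "arm64"}


# Print the raw machine string together with the interpreted result.
print(f"machine: {platform.machine()}, aarch64: {is_aarch64()}")
```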

Internet Connectivity on Compute Nodes:

At the time of testing, the compute nodes did not have direct internet access. Consequently, we had to install the required PyTorch wheels within a virtual environment using a job script. This ensured that the installation occurred on the compute node during the execution of the job script.

However, it is now possible to access the internet directly from the compute nodes by configuring proxies, as demonstrated below:

export http_proxy=http://10.63.2.48:3128/
export https_proxy=http://10.63.2.48:3128/
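
To verify from Python that the proxy variables are actually visible inside a job, something like the following can be used. It is a minimal sketch using only the standard library: urllib.request.getproxies() returns the proxy mapping that urllib picks up from the environment, and the helper names active_proxies and missing_proxy_variables are hypothetical.

```python
import os
import urllib.request


def active_proxies():
    """Return the proxy mapping urllib would use, keyed by scheme (e.g. 'http')."""
    return urllib.request.getproxies()


def missing_proxy_variables(env=os.environ):
    """List which of the two proxy variables from this guide are unset or empty."""
    return [name for name in ("http_proxy", "https_proxy") if not env.get(name)]


print("Active proxies:", active_proxies())
print("Missing variables:", missing_proxy_variables())
```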

CUDA Version:

Running the nvidia-smi command on the compute nodes reports CUDA Version 12.7. This is the highest CUDA version supported by the installed driver, so we need to ensure that the PyTorch wheels we use are built for a CUDA version compatible with it.
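
Once a wheel is installed, one way to check that it fits is to ask PyTorch which CUDA version it was built for. The following is a minimal sketch, not part of the original test setup. The helper name describe_torch_build is made up for illustration, and it relies only on torch.__version__, torch.version.cuda and torch.cuda.is_available(). If torch is not installed, it reports that instead of failing.

```python
import importlib.util


def describe_torch_build():
    """Summarize the installed PyTorch build, or report that torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch_build())
```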

You can download the necessary PyTorch wheels for your project from the PyTorch nightly builds. The following packages are needed:

torch

pytorch-triton

torchvision
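
When downloading wheels by hand, it is easy to pick a file built for the wrong architecture. The sketch below checks the platform tag, which is the last dash-separated field of a wheel file name. It is a simplified illustration rather than a full parser, the helper names are hypothetical, and the file names in the example are invented purely to show the naming scheme.

```python
def wheel_platform_tag(filename):
    """Return the platform tag, i.e. the last dash-separated field of a wheel name."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    return filename[: -len(".whl")].split("-")[-1]


def wheel_targets_machine(filename, machine):
    """True if the wheel is pure Python ('any') or its platform tag names `machine`."""
    tag = wheel_platform_tag(filename)
    return tag == "any" or tag.endswith(machine.lower())


# Hypothetical file names, used only to illustrate the naming scheme.
examples = [
    "example_pkg-1.0-cp311-cp311-linux_aarch64.whl",
    "example_pkg-1.0-cp311-cp311-manylinux_2_28_x86_64.whl",
    "example_pkg-1.0-py3-none-any.whl",
]
for name in examples:
    print(name, wheel_targets_machine(name, "aarch64"))
```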

Training a ResNet Model with the Fashion-MNIST Dataset

To test Olivia’s capabilities with real-world workloads, we will train a ResNet model using the Fashion-MNIST dataset. The testing will be conducted under the following scenarios:

Single GPU
Multiple GPUs
Multiple Nodes

The primary goal of this exercise is to verify that we can successfully run training tasks on Olivia. As such, we will not delve into the specifics of neural network training in this documentation. A separate guide will be prepared to cover those details.

Single GPU Implementation

To train the ResNet model on a single GPU, we used the following files. The first is the main Python script, which trains the model and reports the metrics for each epoch.

# resnet.py
"""
This script trains a WideResNet model on the Fashion MNIST dataset without using Distributed Data Parallel (DDP).
The goal is to provide a simple, single-GPU implementation of the training process.
"""
import os
import time
import torch
import torch.nn as nn
import torch.optim as optim
# Import custom modules
from data.dataset_utils import load_fashion_mnist
from models.wide_resnet import WideResNet
from training.train_utils import train, test
from utils.device_utils import get_device
# Define paths
shared_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), "../shared"))
images_dir = os.path.join(shared_dir, "images")
os.makedirs(images_dir, exist_ok=True)
# Hyperparameters
BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 0.01
TARGET_ACCURACY = 0.85
PATIENCE = 2
def train_resnet_without_ddp(batch_size, epochs, learning_rate, device):
    """
    Trains a WideResNet model on the Fashion MNIST dataset without DDP.
    Args:
        batch_size (int): Batch size for training.
        epochs (int): Number of epochs to train.
        learning_rate (float): Learning rate for the optimizer.
        device (torch.device): Device to run training on (CPU or GPU).
    Returns:
        None
    """
    print(f"Training WideResNet on Fashion MNIST with Batch Size: {batch_size}")
    # Training variables
    val_accuracy = []
    total_time = 0
    # Load the dataset
    train_loader, test_loader = load_fashion_mnist(batch_size=batch_size)
    # Initialize the WideResNet Model
    num_classes = 10
    model = WideResNet(num_classes).to(device)
    # Define the loss function and optimizer
    loss_fn = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(epochs):
        t0 = time.time()
        # Train the model for one epoch
        train(model, optimizer, train_loader, loss_fn, device)
        # Calculate epoch time
        epoch_time = time.time() - t0
        total_time += epoch_time
        # Compute throughput (images per second)
        images_per_sec = len(train_loader) * batch_size / epoch_time
        # Compute validation accuracy and loss
        v_accuracy, v_loss = test(model, test_loader, loss_fn, device)
        val_accuracy.append(v_accuracy)
        # Print metrics
        print("Epoch = {:2d}: Epoch Time = {:5.3f}, Validation Loss = {:5.3f}, Validation Accuracy = {:5.3f}, Images/sec = {:5.3f}, Cumulative Time = {:5.3f}".format(
            epoch + 1, epoch_time, v_loss, v_accuracy, images_per_sec, total_time
        ))
        # Early stopping
        if len(val_accuracy) >= PATIENCE and all(acc >= TARGET_ACCURACY for acc in val_accuracy[-PATIENCE:]):
            print('Early stopping after epoch {}'.format(epoch + 1))
            break
    # Final metrics
    print("\nTraining complete. Final Validation Accuracy = {:5.3f}".format(val_accuracy[-1]))
    print("Total Training Time: {:5.3f} seconds".format(total_time))
def main():
    # Set the compute device
    device = get_device()
    # Train the WideResNet model
    train_resnet_without_ddp(batch_size=BATCH_SIZE, epochs=EPOCHS, learning_rate=LEARNING_RATE, device=device)
if __name__ == "__main__":
    main()

This file contains the data utility functions used for preparing and managing the dataset. Note that you need to download the Fashion-MNIST dataset manually and place it in the directory referenced in the script.

# dataset_utils.py
import torch
import torchvision
import torchvision.transforms as transforms



def load_fashion_mnist(batch_size, train_subset_size=10000, test_subset_size=10000):
    """
    Loads and preprocesses the Fashion-MNIST dataset.
    Args:
        batch_size (int): Batch size for training and testing.
        train_subset_size (int): Number of training samples to use (default: 10,000). We used this subset for initial testing.
        test_subset_size (int): Number of testing samples to use (default: 10,000). We used this subset for initial testing.
    Returns:
        train_loader, test_loader: Data loaders for training and testing.
    """
    # Define transformations
    transform = transforms.Compose([transforms.ToTensor()])
    # Load full Datasets
    # Note that, if there is no internet access on the compute node, we need to manually download the 
    # datasets and then use it, for which we need to set download=False
    train_set = torchvision.datasets.FashionMNIST("/cluster/work/users/<user_name>/deepLearning/private/shared/data", download=False, transform=transform)
    test_set = torchvision.datasets.FashionMNIST("/cluster/work/users/<user_name>/deepLearning/private/shared/data", download=False, train=False, transform=transform)
    # Create subsets
    train_subset = torch.utils.data.Subset(train_set, list(range(0, train_subset_size)))
    test_subset = torch.utils.data.Subset(test_set, list(range(0, test_subset_size)))
    # Create the data loaders
    train_loader = torch.utils.data.DataLoader(train_subset, batch_size=batch_size, drop_last=True)
    test_loader = torch.utils.data.DataLoader(test_subset, batch_size=batch_size, drop_last=True)
    return train_loader, test_loader


def load_fashion_mnist_fulldataset(batch_size):
    # Define transformations
    transform = transforms.Compose([transforms.ToTensor()])
    # Load full datasets
    train_set = torchvision.datasets.FashionMNIST("/cluster/work/users/<user_name>/deepLearning/private/shared/data", download=False, transform=transform)
    test_set = torchvision.datasets.FashionMNIST("/cluster/work/users/<user_name>/deepLearning/private/shared/data", download=False, train=False, transform=transform)
    # Create the data loaders
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, drop_last=True, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, drop_last=True, shuffle=False)
    return train_loader, test_loader

This file includes the device utility functions, which handle device selection and management for training (e.g., selecting the appropriate GPU).

# device_utils.py
import torch

def get_device():
    """
    Determine the compute device (GPU or CPU).
    Returns:
        torch.device: The device to use for the computations.
    """

    return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

This file contains the implementation of the WideResNet model architecture.

# wide_resnet.py
import torch.nn as nn

# Standard convolution block: convolution followed by batch normalization and ReLU
class cbrblock(nn.Module):
    def __init__(self, input_channels, output_channels):
        super(cbrblock, self).__init__()
        self.cbr = nn.Sequential(nn.Conv2d(input_channels, output_channels, kernel_size=3, stride=(1,1), padding='same', bias=False),nn.BatchNorm2d(output_channels), nn.ReLU())
    def forward(self, x):
        return self.cbr(x)
    
# Basic residual block
class conv_block(nn.Module):
    def __init__(self, input_channels, output_channels, scale_input):
        super(conv_block, self).__init__()
        self.scale_input = scale_input
        if self.scale_input:
            self.scale = nn.Conv2d(input_channels,output_channels, kernel_size=1, stride=(1,1), padding='same')
        self.layer1 = cbrblock(input_channels, output_channels)
        self.dropout = nn.Dropout(p=0.01)
        self.layer2 = cbrblock(output_channels, output_channels)
    def forward(self,x):
        residual = x
        out = self.layer1(x)
        out = self.dropout(out)
        out = self.layer2(out)
        if self.scale_input:
            residual = self.scale(residual)
        return out + residual

# WideResnet model
class WideResNet(nn.Module):
    def __init__(self, num_classes):
        super(WideResNet, self).__init__()
        nChannels = [1, 16, 160, 320, 640]
        self.input_block = cbrblock(nChannels[0], nChannels[1])
        self.block1 = conv_block(nChannels[1], nChannels[2], scale_input=True)
        self.block2 = conv_block(nChannels[2], nChannels[2], scale_input=False)
        self.pool1 = nn.MaxPool2d(2)
        self.block3 = conv_block(nChannels[2], nChannels[3], scale_input=True)
        self.block4 = conv_block(nChannels[3], nChannels[3], scale_input=False)
        self.pool2 = nn.MaxPool2d(2)
        self.block5 = conv_block(nChannels[3], nChannels[4], scale_input=True)
        self.block6 = conv_block(nChannels[4], nChannels[4], scale_input=False)
        # Global Average pooling
        self.pool = nn.AvgPool2d(7)
        # Fully connected layer
        self.flat = nn.Flatten()
        self.fc = nn.Linear(nChannels[4], num_classes)

    def forward(self, x):
        out = self.input_block(x)
        out = self.block1(out)
        out = self.block2(out)
        out = self.pool1(out)
        out = self.block3(out)
        out = self.block4(out)
        out = self.pool2(out)
        out = self.block5(out)
        out = self.block6(out)
        out = self.pool(out)
        out = self.flat(out)
        out = self.fc(out)
        return out
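
The final pooling layer uses a 7x7 window because of how the image size shrinks through the network. Fashion-MNIST images are 28x28 pixels, the convolutions use padding='same' with stride 1 and so keep the size, and each MaxPool2d(2) halves it. The following pure-Python sketch traces that arithmetic. The function name trace_spatial_size is made up for illustration and the code does not run the model itself.

```python
def trace_spatial_size(size=28):
    """Follow the height/width of a square input through WideResNet's pooling layers."""
    steps = [("input", size)]
    # Convolutions with padding='same' and stride 1 leave the size unchanged.
    size = size // 2  # pool1: MaxPool2d(2)
    steps.append(("pool1", size))
    size = size // 2  # pool2: MaxPool2d(2)
    steps.append(("pool2", size))
    size = (size - 7) // 7 + 1  # pool: AvgPool2d(7), stride defaults to the kernel size
    steps.append(("pool", size))
    return steps


for name, size in trace_spatial_size():
    print(f"{name:>5}: {size} x {size}")
```

With a 1x1 spatial size left after the last pooling step, flattening yields one value per channel, which matches the nChannels[4] input features of the final linear layer.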

Finally, this file provides the utility functions for training the model for one epoch and for evaluating it on the test data.

# train_utils.py
import torch 
def train(model, optimizer, train_loader, loss_fn, device):
    """
    Trains the model for one epoch.
    Args:
        model(torch.nn.Module): The model to train.
        optimizer(torch.optim.Optimizer): Optimizer for updating model parameters.
        train_loader(torch.utils.data.DataLoader): DataLoader for training data.
        loss_fn (torch.nn.Module): Loss function.
        device (torch.device): Device to run training on (CPU or GPU).
    """
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        # Forward pass
        outputs = model(images)
        loss = loss_fn(outputs, labels)
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()



def test(model, test_loader, loss_fn, device):
    """
    Evaluates the model on the validation dataset.
    Args:
        model(torch.nn.Module): The model to evaluate.
        test_loader (torch.utils.data.DataLoader): DataLoader for validation data.
        loss_fn (torch.nn.Module): Loss function.
        device (torch.device): Device to run evaluation on (CPU or GPU).
    Returns:
        tuple: Validation accuracy and validation loss.
    """

    model.eval()
    total_labels = 0
    correct_labels = 0
    loss_total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            # Forward pass
            outputs = model(images)
            loss = loss_fn(outputs, labels)
            # Compute accuracy and loss
            predictions = torch.max(outputs, 1)[1]
            total_labels += len(labels)
            correct_labels += (predictions == labels).sum().item()
            loss_total += loss.item()

    v_accuracy = correct_labels / total_labels
    v_loss = loss_total / len(test_loader)
    return v_accuracy, v_loss

Job Script for Single GPU Training

To run the training on a single GPU, we use the following job script. The accel partition is used to access GPU resources. After loading the Python module from cray-python, you can create a virtual environment and install all the required wheels for your program using pip.

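To launch the training, we use the torchrun utility from PyTorch. torchrun starts the requested number of worker processes and sets the environment variables (such as RANK, LOCAL_RANK, and WORLD_SIZE) that distributed scripts rely on, which makes it straightforward to scale to multiple GPUs or nodes.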
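
A script started by torchrun can find out its place in the job from the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below reads them with fallbacks, so that the same code also runs when started with plain python. The helper name read_torchrun_env and the fallback values are assumptions made for illustration.

```python
import os


def read_torchrun_env(env=os.environ):
    """Read the rank information set by torchrun, falling back to single-process values."""
    return {
        "rank": int(env.get("RANK", 0)),  # global rank across all nodes
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # rank within this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


print(read_torchrun_env())
```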

#!/bin/bash
#SBATCH --job-name=simple_nn_training
#SBATCH --account=<project_number>
#SBATCH --output=singlenode.out
#SBATCH --error=singlenode.err
#SBATCH --time=00:10:00
#SBATCH --partition=accel
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --gpus-per-node=1
# Load required modules
module load cray-python/3.11.7

# Create and activate virtual environment in compute node's local storage
VENV_PATH="$SCRATCH/pytorch_venv"  # Using compute node's scratch space
python -m venv $VENV_PATH
source $VENV_PATH/bin/activate

# Install PyTorch from the wheel (offline installation)
WHEEL_DIR="/cluster/work/projects/<project_number>/<user_name>/PyTorch/torch_wheels"
pip install --no-index --find-links=$WHEEL_DIR $(ls $WHEEL_DIR/*.whl | tr '\n' ' ')


# Set PYTHONPATH to include the shared directory
export PYTHONPATH=/cluster/work/projects/<project_number>/<user_name>/PyTorch/private/shared:$PYTHONPATH


torchrun --standalone --nnodes=1 --nproc_per_node=1 ../resnet.py

deactivate

Output of the training is shown below:

Training WideResNet on Fashion MNIST with Batch Size: 32
Epoch =  1: Epoch Time = 2.963, Validation Loss = 0.524, Validation Accuracy = 0.799, Images/sec = 3369.181, Cumulative Time = 2.963
Epoch =  2: Epoch Time = 2.623, Validation Loss = 0.440, Validation Accuracy = 0.837, Images/sec = 3806.211, Cumulative Time = 5.586
Epoch =  3: Epoch Time = 2.601, Validation Loss = 0.422, Validation Accuracy = 0.849, Images/sec = 3839.022, Cumulative Time = 8.187
Epoch =  4: Epoch Time = 2.553, Validation Loss = 0.393, Validation Accuracy = 0.862, Images/sec = 3909.946, Cumulative Time = 10.741
Epoch =  5: Epoch Time = 2.553, Validation Loss = 0.424, Validation Accuracy = 0.859, Images/sec = 3911.137, Cumulative Time = 13.293
Early stopping after epoch 5

Training complete. Final Validation Accuracy = 0.859
Total Training Time: 13.293 seconds
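
The early-stopping message in this output follows from the rule in resnet.py: training stops once the last PATIENCE validation accuracies all reach TARGET_ACCURACY. The sketch below replays that rule on the accuracies printed above. The function name early_stop_epoch is made up for illustration.

```python
def early_stop_epoch(accuracies, target, patience):
    """Return the 1-based epoch at which training stops early, or None if it never does."""
    for epoch in range(1, len(accuracies) + 1):
        recent = accuracies[max(0, epoch - patience):epoch]
        if len(recent) >= patience and all(acc >= target for acc in recent):
            return epoch
    return None


# Validation accuracies copied from the single-GPU output above.
accuracies = [0.799, 0.837, 0.849, 0.862, 0.859]
print(early_stop_epoch(accuracies, target=0.85, patience=2))  # prints 5
```

Epoch 4 does not trigger the stop because the accuracy of epoch 3 (0.849) is still just below the target.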

Multi-GPU Implementation

To scale our training to multiple GPUs, we will utilize PyTorch’s Distributed Data Parallel (DDP) framework. DDP allows us to efficiently scale training across multiple GPUs and even across multiple nodes.

For this, we need to modify the main Python script to include the DDP implementation. The updated script works for both scenarios: multiple GPUs within a single node and multiple nodes.

# resnetddp.py
import os
import time
import argparse
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
from torch.utils.data.distributed import DistributedSampler
from data.dataset_utils import load_fashion_mnist_fulldataset
from models.wide_resnet import WideResNet
from training.train_utils import train, test
# Parse input arguments
parser = argparse.ArgumentParser(description='Fashion MNIST DDP example',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--batch-size', type=int, default=512, help='Input batch size for training')
parser.add_argument('--epochs', type=int, default=5, help='Number of epochs to train')
parser.add_argument('--base-lr', type=float, default=0.01, help='Learning rate for single GPU')
parser.add_argument('--target-accuracy', type=float, default=0.85, help='Target accuracy to stop training')
parser.add_argument('--patience', type=int, default=2, help='Number of epochs that meet target before stopping')
args = parser.parse_args()
def ddp_setup():
    """Set up the distributed environment."""
    init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
def prepare_dataloader(dataset, batch_size):
    """Prepare DataLoader with DistributedSampler."""
    sampler = DistributedSampler(dataset, drop_last=False)  # Ensure no data is dropped
    dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return dataloader, sampler
def main_worker():
    ddp_setup()
    # Get the local rank and device
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    device = torch.device(f"cuda:{local_rank}")


    # Log initialization info
    if global_rank == 0:
        print(f"Training started with {world_size} processes across {world_size // torch.cuda.device_count()} nodes.")
        print(f"Using {torch.cuda.device_count()} GPUs per node.")

    
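    # Load the dataset.
    # Note: load_fashion_mnist_fulldataset returns plain DataLoaders, so each process
    # iterates over the full dataset; prepare_dataloader above is the DistributedSampler variant.
    train_loader, test_loader = load_fashion_mnist_fulldataset(batch_size=args.batch_size)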
    # Create the model and wrap it with DDP
    num_classes = 10
    model = WideResNet(num_classes).to(device)
    model = DDP(model, device_ids=[local_rank])
   
    # Define loss function and optimizer
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=args.base_lr)
    val_accuracy = []
    total_time = 0
    # Training loop
    for epoch in range(args.epochs):
        if global_rank == 0:
            print(f"\nEpoch {epoch + 1}/{args.epochs}")
        # Train the model for one epoch
        t0 = time.time()
        train(model, optimizer, train_loader, loss_fn, device)
        # Synchronize all processes
        torch.distributed.barrier()
        epoch_time = time.time() - t0
        total_time += epoch_time
        # Compute validation accuracy and loss
        v_accuracy, v_loss = test(model, test_loader, loss_fn, device)
        # Average validation metrics across all GPUs
        v_accuracy_tensor = torch.tensor(v_accuracy).to(device)
        v_loss_tensor = torch.tensor(v_loss).to(device)
        torch.distributed.all_reduce(v_accuracy_tensor, op=torch.distributed.ReduceOp.AVG)
        torch.distributed.all_reduce(v_loss_tensor, op=torch.distributed.ReduceOp.AVG)
        # Print metrics only from the main process
        if global_rank == 0:
            print(f"Epoch {epoch + 1} completed in {epoch_time:.3f} seconds")
            print(f"Validation Loss: {v_loss_tensor.item():.4f}, Validation Accuracy: {v_accuracy_tensor.item():.4f}")
        # Early stopping
        val_accuracy.append(v_accuracy_tensor.item())
        if len(val_accuracy) >= args.patience and all(acc >= args.target_accuracy for acc in val_accuracy[-args.patience:]):
            if global_rank == 0:
                print(f"Target accuracy reached. Early stopping after epoch {epoch + 1}.")
            break
    
    # Log total training time and summary
    if global_rank == 0:
        print("\nTraining Summary:")
        print(f"Total training time: {total_time:.3f} seconds")
        print(f"Number of nodes: {world_size // torch.cuda.device_count()}")
        print(f"Number of GPUs per node: {torch.cuda.device_count()}")
        print(f"Total GPUs used: {world_size}")
        print("Training completed successfully.")
    # Clean up the distributed environment
    destroy_process_group()
if __name__ == '__main__':
    main_worker()
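
The script defines prepare_dataloader with a DistributedSampler, whose job is to give each process a different slice of the dataset. The pure-Python sketch below illustrates the partitioning idea for the unshuffled case: pad the index list to a multiple of the number of processes, then let each rank take every world_size-th index. It is a simplified illustration, not PyTorch's implementation, and the function name shard_indices is hypothetical.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Indices one rank would see: pad to a multiple of world_size, then stride."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # repeat leading samples as padding
    return indices[rank:total:world_size]


# Ten samples split across four processes.
for rank in range(4):
    print(rank, shard_indices(10, world_size=4, rank=rank))
```

With shuffling enabled, which is the default for DistributedSampler, the order is permuted before the split, and sampler.set_epoch(epoch) is usually called at the start of each epoch so that the permutation changes between epochs.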

Job Script for Multi-GPU Training

To run the training on multiple GPUs, we can use the same job script as before, but request more GPUs and start one process per GPU.

When using torchrun on a single node, you need to include the --standalone argument. This argument is not required for a multi-node setup. The full job script is given below:

#!/bin/bash
#SBATCH --job-name=resnet_training_singlenode
#SBATCH --account=<project_number>
#SBATCH --output=singlenode.out
#SBATCH --error=singlenode.err
#SBATCH --time=00:10:00
#SBATCH --partition=accel
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --gpus-per-node=2
# Load required modules
module load cray-python/3.11.7

# Create and activate virtual environment in compute node's local storage
VENV_PATH="$SCRATCH/pytorch_venv"  # Using compute node's scratch space
python -m venv $VENV_PATH
source $VENV_PATH/bin/activate

# Install PyTorch from the wheel (offline installation)
WHEEL_DIR="/cluster/work/projects/<project_number>/<user_name>/PyTorch/torch_wheels"
pip install --no-index --find-links=$WHEEL_DIR $(ls $WHEEL_DIR/*.whl | tr '\n' ' ')


# Set PYTHONPATH to include the shared directory
export PYTHONPATH=/cluster/work/projects/<project_number>/<user_name>/PyTorch/private/shared:$PYTHONPATH


torchrun --standalone --nnodes=1 --nproc_per_node=2 ../resnetddp.py  --epochs 10 --batch-size 512

deactivate

Output of the training is shown below:

Training started with 2 processes across 1 nodes.
Using 2 GPUs per node.

Epoch 1/10
Epoch 1 completed in 7.350 seconds
Validation Loss: 0.5492, Validation Accuracy: 0.7971

Epoch 2/10
Epoch 2 completed in 7.150 seconds
Validation Loss: 0.4315, Validation Accuracy: 0.8414

Epoch 3/10
Epoch 3 completed in 7.090 seconds
Validation Loss: 0.3536, Validation Accuracy: 0.8744

Epoch 4/10
Epoch 4 completed in 7.112 seconds
Validation Loss: 0.3402, Validation Accuracy: 0.8759
Target accuracy reached. Early stopping after epoch 4.

Training Summary:
Total training time: 28.702 seconds
Number of nodes: 1
Number of GPUs per node: 2
Total GPUs used: 2
Training completed successfully.

Multi-Node Setup

Setting up training across multiple nodes is relatively straightforward since we use the same Python script as in the multi-GPU implementation. The main difference lies in using a different job script, which is provided below.

For multi-node jobs, a few key considerations are important:

Communication Interface: You need to specify the communication interface to enable proper communication between nodes.
Master Node: The master node must be designated to handle coordination and communication across nodes.
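
Before setting the communication interface, it can help to check which network interfaces a compute node actually exposes. The sketch below lists them with the standard library. The helper names are made up for illustration, and hsn0 is the interface name used in the job script in this guide.

```python
import socket


def available_interfaces():
    """Names of the network interfaces visible on this host."""
    return [name for _, name in socket.if_nameindex()]


def has_interface(name):
    """True if an interface with the given name exists on this host."""
    return name in available_interfaces()


print(available_interfaces())
print("hsn0 present:", has_interface("hsn0"))
```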

We use srun to start one torchrun launcher on each node. Each launcher then starts the worker processes for its node, and the launchers find each other through the rendezvous endpoint on the head node.
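
To make the relation between nodes, processes per node, and ranks concrete, the sketch below enumerates the ranks for a given job shape. It assumes that every node runs the same number of worker processes and that global ranks are numbered node by node. The function name rank_layout is hypothetical.

```python
def rank_layout(nnodes, nproc_per_node):
    """List (node_rank, local_rank, global_rank) for every worker process."""
    layout = []
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            global_rank = node_rank * nproc_per_node + local_rank
            layout.append((node_rank, local_rank, global_rank))
    return layout


# Two nodes with four processes each give a world size of eight.
for node_rank, local_rank, global_rank in rank_layout(nnodes=2, nproc_per_node=4):
    print(f"node {node_rank}, local rank {local_rank} -> global rank {global_rank}")
```

The local rank selects the GPU within a node, while the global rank identifies the process across the whole job.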

Job Script for Multi-Node Training

#!/bin/bash
#SBATCH --job-name=resnet_training_multinode
#SBATCH --account=<project_number>
#SBATCH --output=multinode.out
#SBATCH --error=multinode.err
#SBATCH --time=01:00:00
#SBATCH --partition=accel
#SBATCH --nodes=2  # Request 2 nodes
#SBATCH --ntasks-per-node=1  # One task per node
#SBATCH --cpus-per-task=72
#SBATCH --mem-per-gpu=120G
#SBATCH --gpus-per-node=4  # 4 GPUs per node

# Load required modules
module load cray-python/3.11.7

# Create and activate virtual environment
VENV_PATH="$SCRATCH/pytorch_venv"
python -m venv $VENV_PATH
source $VENV_PATH/bin/activate
# Install PyTorch from the wheel (offline installation)
WHEEL_DIR="/cluster/work/projects/<project_number>/<user_name>/PyTorch/torch_wheels"
pip install --no-index --find-links=$WHEEL_DIR $(ls $WHEEL_DIR/*.whl | tr '\n' ' ')

# Set PYTHONPATH to include the shared directory
export PYTHONPATH=/cluster/work/projects/<project_number>/<user_name>/PyTorch/private/shared:$PYTHONPATH

# Set NCCL environment variables for debugging and communication
# export NCCL_DEBUG=INFO  # Use it to see details
export NCCL_SOCKET_IFNAME=hsn0  # Replace with hsn1 if needed

# Get the head node and its IP address
nodes=( $(scontrol show hostnames $SLURM_JOB_NODELIST) )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address | awk '{print $1}')
echo "Head Node: $head_node"
echo "Head Node IP: $head_node_ip"

# Run the Python script using torchrun
srun torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --rdzv_id=$RANDOM \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$head_node_ip:29500 \
  ../resnetddp.py
# Deactivate the virtual environment
deactivate

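Below is the output from a training run across two nodes. Note that this particular run used 2 GPUs per node (4 processes in total) and 10 epochs, so it does not correspond exactly to the 4-GPUs-per-node settings in the job script above.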

Training started with 4 processes across 2 nodes.
Using 2 GPUs per node.

Epoch 1/10
Epoch 1 completed in 10.208 seconds
Validation Loss: 0.5929, Validation Accuracy: 0.7752

Epoch 2/10
Epoch 2 completed in 9.450 seconds
Validation Loss: 0.4266, Validation Accuracy: 0.8477

Epoch 3/10
Epoch 3 completed in 9.385 seconds
Validation Loss: 0.4440, Validation Accuracy: 0.8397

Epoch 4/10
Epoch 4 completed in 9.400 seconds
Validation Loss: 0.4783, Validation Accuracy: 0.8431

Epoch 5/10
Epoch 5 completed in 9.436 seconds
Validation Loss: 0.3952, Validation Accuracy: 0.8575

Epoch 6/10
Epoch 6 completed in 9.526 seconds
Validation Loss: 0.2980, Validation Accuracy: 0.8931
Target accuracy reached. Early stopping after epoch 6.

Training Summary:
Total training time: 57.405 seconds
Number of nodes: 2
Number of GPUs per node: 2
Total GPUs used: 4
Training completed successfully.