TensorFlow on GPU

This guide provides an introduction to using machine learning libraries on GPUs. The examples request Sigma2 resources with a GPU via --partition=accel --gpus=1 (see the note at the end of this guide for the A100 partition).

Note

When loading modules, pressing <tab> gives you autocompletion options.

Note

It can be useful to do an initial module reset to ensure nothing from previous experiments is loaded before loading modules for the first time.

Note

Modules are regularly updated, so if you would like a newer version than what is listed here, use module avail | less to browse all available packages.

Below we will go through the main approaches for using TensorFlow:

  • Using the preinstalled TensorFlow module

  • Loading TensorFlow from EESSI

  • Using a container

Using the preinstalled TensorFlow module

This is the easiest and recommended option for most users. The preinstalled modules are optimized for the cluster hardware, pre-configured with correct dependencies (cuDNN, NCCL, etc.), and tested to work with the available GPU drivers.

Finding available versions

To discover which TensorFlow versions are installed:

# Show all available versions with descriptions
$ module spider TensorFlow

# Show versions you can load directly
$ module avail TensorFlow

# Show dependencies loaded by a specific module
$ module show TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Load an available TensorFlow version
$ module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

Note

Modules with CUDA in the name are GPU-enabled. Modules without CUDA run on CPU only.

EESSI

EESSI (European Environment for Scientific Software Installations) provides a curated collection of scientific software that is built once and distributed globally via CernVM-FS. Software is optimized for different CPU architectures and works seamlessly across systems.

Loading TensorFlow from EESSI

First, load the EESSI environment:

$ module load EESSI/2023.06

Then load TensorFlow:

$ module load TensorFlow/2.13.0-foss-2023a

To see all available TensorFlow versions in EESSI:

$ module spider TensorFlow

For more information, see EESSI documentation.

Containers

If you need a specific TensorFlow version not available as a module or via EESSI, you can use a container. This gives you complete control over the software environment.

Using Apptainer/Singularity

The clusters support Apptainer/Singularity, which can run Docker images in HPC environments. For GPU workloads, NVIDIA provides optimized TensorFlow containers on the NGC catalog.

Tip

When choosing a container, ensure it supports:

  • GPU (look for tags with gpu or check the description)

  • The amd64 architecture (x86_64)

Pulling a container

# Pull a TensorFlow container from NVIDIA NGC
$ apptainer pull docker://nvcr.io/nvidia/tensorflow:23.09-tf2-py3

This creates a .sif file (Singularity Image Format) in your current directory.

Warning

Container images can be large (several GB). Store them in your project area rather than your home directory to avoid quota issues.

Verifying the container

# Check TensorFlow version
$ apptainer exec tensorflow_23.09-tf2-py3.sif python -c "import tensorflow as tf; print(tf.__version__)"

Running jobs with containers

In your Slurm job script, use apptainer exec with the --nv flag to enable GPU support:

apptainer exec --nv \
    --bind /cluster/projects/nnXXXXk:/cluster/projects/nnXXXXk \
    tensorflow_23.09-tf2-py3.sif python mnist.py

Key options:

  • --nv: Enables NVIDIA GPU support inside the container

  • --bind: Mounts directories from the host system into the container

For more details on containers, see Running containers.

Loading data

There are two ways to load data:

  • Uploading data from your local machine using rsync

  • Loading built-in datasets

Loading data using rsync

Use rsync to upload data from your computer:

rsync -zh --info=progress2 -r /path/to/dataset/folder <username>@saga.sigma2.no:~/.

For large amounts of data it is recommended to upload to your project area rather than your home area, to avoid filling your home quota.

To retrieve the path to the dataset, we can use Python's os module to read the relevant environment variable, like so:

# ...Somewhere in your experiment python file...
# Load 'os' module to get access to environment and path functions
import os

# Path to dataset
dataset_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'], 'dataset')

If you use a container, you must define your dataset path like this:

dataset_path = os.path.join(os.environ['PWD'], 'dataset')
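If the same script should work both with modules and inside a container, a small helper can pick whichever variable is available. This is a sketch using only the two environment variables shown above; the 'dataset' folder name is just an example:

```python
import os

def dataset_dir(name='dataset'):
    """Resolve the dataset folder relative to the job's submit directory.

    Falls back to the container working directory ('PWD'), and finally to
    the current directory, when the Slurm variable is not set.
    """
    base = os.environ.get('SLURM_SUBMIT_DIR') or os.environ.get('PWD') or os.getcwd()
    return os.path.join(base, name)

dataset_path = dataset_dir()
```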

Loading built-in datasets

First we need to download the dataset on the login node. Ensure that the correct modules are loaded, then open an interactive Python session with python and run:

import tensorflow as tf

tf.keras.datasets.mnist.load_data()

This will download and cache the MNIST dataset which we can use for training models. Load the data in your training file like so:

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

Saving model data

For saving model data and weights we suggest the TensorFlow built-in checkpointing and save functions.

Below follows an example of how to load built-in data and save weights.

#!/usr/bin/env python

# Assumed to be 'mnist.py'

import os
import tensorflow as tf

# Construct storage path from '$SLURM_SUBMIT_DIR' and the job ID
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                            os.environ['SLURM_JOB_ID'])

# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255., x_test / 255.

def create_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  # The model ends in softmax, so it outputs probabilities,
                  # not logits
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    return model

# Create and display summary of model
model = create_model()
# Output, such as from the following command, is written to the '.out' file
# produced by 'sbatch'
model.summary()

# Save model in TensorFlow format
model.save(os.path.join(storage_path, "model"))

# Create checkpointing of weights
ckpt_path = os.path.join(storage_path, "checkpoints", "mnist-{epoch:04d}.ckpt")
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=ckpt_path,
        save_weights_only=True,
        verbose=1)

# Save initial weights
model.save_weights(ckpt_path.format(epoch=0))

# Train model with checkpointing
model.fit(x_train[:1000], y_train[:1000],
          epochs=50,
          callbacks=[ckpt_callback],
          validation_data=(x_test[:1000], y_test[:1000]),
          verbose=0)

The Python script above can be run with the following job script:

#!/usr/bin/bash

# Assumed to be 'mnist_test.sh'

#SBATCH --account=<your_account>
#SBATCH --job-name=<creative_job_name>
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=8G
## The following line can be omitted to run on CPU alone
#SBATCH --partition=accel --gpus=1
#SBATCH --time=00:30:00

# Reset modules and load TensorFlow
module reset
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
# List loaded modules for reproducibility
module list

# Run python script
python $SLURM_SUBMIT_DIR/mnist.py

Once these two files are located on a Sigma2 resource, we can submit the job with:

$ sbatch mnist_test.sh

In your code it is important to load the latest checkpoint if available, which can be retrieved with:

# Load weights from a previous run
ckpt_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                        "<job_id to load checkpoints from>",
                        "checkpoints")
latest = tf.train.latest_checkpoint(ckpt_dir)

# Create a new model instance
model = create_model()

# Load the previously saved weights if they exist
if latest:
    model.load_weights(latest)
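When resuming, you may also want Keras to continue counting epochs from where the previous run stopped. Since tf.train.latest_checkpoint returns a path following the 'mnist-{epoch:04d}.ckpt' pattern used above, the epoch number can be parsed from the filename. The helper below is a sketch, not part of the TensorFlow API:

```python
import os
import re

def epoch_from_checkpoint(ckpt_path):
    """Extract the epoch number from a path like '.../mnist-0007.ckpt'."""
    match = re.search(r'-(\d+)\.ckpt$', os.path.basename(ckpt_path))
    return int(match.group(1)) if match else 0
```

Pass the result to model.fit, e.g. model.fit(..., epochs=50, initial_epoch=epoch_from_checkpoint(latest)), so that epoch numbering (and therefore the checkpoint filenames) continue from the previous run.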

Using TensorBoard

TensorBoard is a useful utility for comparing different runs and monitoring progress during training. To use it on Sigma2 resources we need to write the log data to our storage area during the run, and a few additional steps are required to connect to and view the board.

We will continue to use the MNIST example from above. The following changes are needed to enable TensorBoard.

# In the 'mnist.py' script
import datetime

# We will store the 'TensorBoard' logs in the folder where the 'mnist_test.sh'
# file was launched, under 'logs/fit'. In your own code we recommend that you
# give these folders names that you will recognize; the last folder uses the
# time when the program was started to separate related runs
log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                       "logs",
                       "fit",
                       datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# Change the last line, where we fit our data in the example above, to also
# include the TensorBoard callback
model.fit(x_train[:1000], y_train[:1000],
          epochs=50,
          # Change here:
          callbacks=[ckpt_callback, tensorboard_callback],
          validation_data=(x_test[:1000], y_test[:1000]),
          verbose=0)

Once you have started a job with the above code embedded, or have a previous run which created a TensorBoard log, it can be viewed as follows.

  1. Copy the log files to your local machine and ensure TensorBoard is installed.

  2. Run tensorboard --logdir=/path/to/logs/fit --port=0.

  3. In the output from the above command, note which port TensorBoard has started on; the last line should look something like: TensorBoard 2.1.0 at http://localhost:44124/ (Press CTRL+C to quit).

  4. Click the link provided in the output to view the results.

Advanced topics

Using multiple GPUs

Since all of the GPU machines on Saga have four GPUs, it can be beneficial for some workloads to distribute the work over more than one device at a time. This can be accomplished with tf.distribute.MirroredStrategy.

#!/usr/bin/env python

# Assumed to be `mnist_multiple_gpus.py`

import datetime
import os
import tensorflow as tf

# Construct storage path from '$SLURM_SUBMIT_DIR' and the job ID
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                            os.environ['SLURM_JOB_ID'])

## --- NEW ---
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

# Calculate the global batch size
# For your own experiments you will likely need to tune
# BATCH_SIZE_PER_REPLICA based on testing to find the optimal size
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.
## --- NEW ---
# NOTE: We need to create a 'Dataset' so that we can process the data in
# batches
train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shuffle(60000)
                 .repeat()
                 .batch(BATCH_SIZE))

def create_model():
    model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(512, activation='relu'),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10, activation='softmax')
            ])
    model.compile(optimizer='adam',
                  # The model ends in softmax, so it outputs probabilities,
                  # not logits
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    return model

# Create and display summary of model
## --- NEW ---
with strategy.scope():
    model = create_model()
# Output, such as from the following command, is written to the '.out' file
# produced by 'sbatch'
model.summary()
log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                       "logs",
                       "fit",
                       datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# Save model in TensorFlow format
model.save(os.path.join(storage_path, "model"))

# Create checkpointing of weights
ckpt_path = os.path.join(storage_path, "checkpoints", "mnist-{epoch:04d}.ckpt")
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=ckpt_path,
        save_weights_only=True,
        verbose=1)

# Save initial weights
model.save_weights(ckpt_path.format(epoch=0))

# Train model with checkpointing
model.fit(train_dataset,
          epochs=50,
          steps_per_epoch=70,
          callbacks=[ckpt_callback, tensorboard_callback])

To request two GPUs, modify the Slurm script:

#!/usr/bin/bash

#SBATCH --account=<your_account> 
#SBATCH --job-name=<creative_job_name>
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=8G
#SBATCH --partition=accel --gpus=2
#SBATCH --time=00:30:00

# Reset modules and load tensorflow
module reset
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
# List loaded modules for reproducibility
module list

# Run python script
srun python $SLURM_SUBMIT_DIR/mnist_multiple_gpus.py

Distributed training on multiple nodes

To utilize more than four GPUs we turn to the Horovod project, which supports several different machine learning libraries and is capable of utilizing MPI. Horovod handles the communication between nodes and performs gradient averaging across workers.
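Conceptually, each worker computes gradients on its own mini-batch, and Horovod's allreduce then averages them so every worker applies the same update. A toy, pure-Python sketch of that averaging step (not actual Horovod code, which performs this efficiently over MPI/NCCL):

```python
def allreduce_average(per_worker_grads):
    """Average one flattened gradient across workers, element-wise."""
    n_workers = len(per_worker_grads)
    return [sum(component) / n_workers for component in zip(*per_worker_grads)]

# Gradients from two workers for a two-parameter model
print(allreduce_average([[1.0, 2.0], [3.0, 4.0]]))  # -> [2.0, 3.0]
```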

The following example demonstrates how to adapt the MNIST training script for multi-node training with Horovod. Key modifications include initializing Horovod, pinning each MPI rank to a specific GPU, scaling the learning rate with the number of workers, and using Horovod’s distributed optimizer and callbacks for gradient averaging and synchronized checkpointing:

#!/usr/bin/env python

# Assumed to be 'mnist_hvd.py'

import datetime
import os
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod.
hvd.init()

# List the visible GPUs so that each MPI rank can be pinned to one
gpus = tf.config.experimental.list_physical_devices('GPU')
if hvd.rank() == 0:
    print(f"Found the following GPUs: '{gpus}'")
# Allow memory growth on GPU, required by Horovod
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Since multiple GPUs might be visible to a rank it is important to bind
# each rank to a single GPU
if gpus:
    # Each task sees only one GPU due to '--gpus-per-task=1', so we always
    # use index 0; if all GPUs on a node are visible to every rank, bind
    # with 'gpus[hvd.local_rank()]' instead
    print(f"Rank '{hvd.local_rank()}/{hvd.rank()}' using GPU: '{gpus[0]}'")
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
else:
    print(f"No GPU(s) configured for rank '{hvd.local_rank()}/{hvd.rank()}'!")

# Construct storage path from '$SLURM_SUBMIT_DIR' and the job ID
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                            os.environ['SLURM_JOB_ID'])

# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.

# Create dataset for batching
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.repeat().shuffle(10000).batch(128)

# Define learning rate as a function of number of GPUs
scaled_lr = 0.001 * hvd.size()


def create_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    # Horovod: adjust learning rate based on number of GPUs.
    opt = tf.optimizers.Adam(scaled_lr)
    model.compile(optimizer=opt,
                  # The model ends in softmax, so it outputs probabilities,
                  # not logits
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'],
                  experimental_run_tf_function=False)
    return model


# Create and display summary of model
model = create_model()
# Output, such as from the following command, is written to the '.out' file
# produced by 'sbatch'
if hvd.rank() == 0:
    model.summary()

# Create a list of callbacks so we can separate callbacks based on rank
callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other
    # processes.  This is necessary to ensure consistent initialization of all
    # workers when training is started with random weights or restored from a
    # checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),

    # Horovod: average metrics among workers at the end of every epoch.
    #
    # Note: This callback must be in the list before the ReduceLROnPlateau,
    # TensorBoard or other metrics-based callbacks.
    hvd.callbacks.MetricAverageCallback(),

    # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to
    # worse final accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 *
    # hvd.size()` during the first three epochs. See
    # https://arxiv.org/abs/1706.02677 for details.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=3,
                                             initial_lr=scaled_lr,
                                             verbose=1),
]

# Only perform the following actions on rank 0 to avoid clashes between workers
if hvd.rank() == 0:
    # Tensorboard support
    log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                           "logs",
                           "fit",
                           datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                          histogram_freq=1)
    # Save model in TensorFlow format
    model.save(os.path.join(storage_path, "model"))
    # Create checkpointing of weights
    ckpt_path = os.path.join(storage_path,
                             "checkpoints",
                             "mnist-{epoch:04d}.ckpt")
    ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=ckpt_path,
        save_weights_only=True,
        verbose=0)
    # Save initial weights
    model.save_weights(ckpt_path.format(epoch=0))
    callbacks.extend([tensorboard_callback, ckpt_callback])

verbose = 1 if hvd.rank() == 0 else 0
# Train model with checkpointing, using the batched dataset created above
model.fit(dataset,
          steps_per_epoch=500 // hvd.size(),
          epochs=100,
          callbacks=callbacks,
          verbose=verbose)

The following Slurm script runs a distributed training job using Horovod and TensorFlow on Saga. It requests 2 nodes with 4 GPUs each and distributes the workload across 8 tasks (1 task per GPU). The script also ensures that the necessary modules and environment variables are properly configured.

#!/usr/bin/bash

#SBATCH --account=<account_name>  # Replace with your account name
#SBATCH --job-name=<job_name>  # Replace with a descriptive job name
#SBATCH --partition=accel
#SBATCH --nodes=2                # Request 2 nodes
#SBATCH --ntasks=8               # Total number of tasks (4 tasks per node)
#SBATCH --gpus-per-task=1        # 1 GPU per task
#SBATCH --mem-per-cpu=8G         # Memory per CPU
#SBATCH --time=00:30:00          # Time limit

# Reset modules and load TensorFlow and Horovod
module reset
module load CMake/3.23.1-GCCcore-11.3.0
module load Horovod/0.28.1-foss-2022a-CUDA-11.7.0-TensorFlow-2.11.0
# List loaded modules for reproducibility
module list
# Export settings expected by Horovod and MPI
export OMPI_MCA_btl="^openib"  # Disable InfiniBand if not configured
export OMPI_MCA_pml="ob1"      # Use the "ob1" communication layer
export HOROVOD_MPI_THREADS_DISABLE=1  # Disable MPI threading for Horovod
export OMPI_MCA_btl_base_verbose=100  # Debug MPI communication

# Run the Python script using srun
srun python $SLURM_SUBMIT_DIR/mnist_hvd.py

Note

To use A100 GPUs on Saga, replace --partition=accel with --partition=a100 and add module --force swap StdEnv Zen2Env to your job script (more information here).