TensorFlow on GPU
This guide provides an introduction to use machine learning libraries on GPU.
The examples ask for Sigma2 recources with GPU. This is achieved with the --partition=accel --gpus=1 or --partition=a100 --gpus=1 (Accel and A100).
Note
When loading modules pressing <tab> gives you
autocomplete options.
Note
It can be useful to do an initial module reset to
ensure nothing from previous experiments is loaded before
loading modules for the first time.
Note
Modules are regularly updated so if you would like a newer
version, than what is listed above, use module avail | less to browse all available packages.
Below we will go through the main approaches for using TensorFlow:
Using the preinstalled TensorFlow module
Loading TensorFlow from EESSI
Using a container
Using the preinstalled TensorFlow module
This is the easiest and recommended option for most users. The preinstalled modules are optimized for the cluster hardware, pre-configured with correct dependencies (cuDNN, NCCL, etc.), and tested to work with the available GPU drivers. For a complete example Slurm job script using preinstalled modules on Saga, see example Slurm job script.
Finding available versions
To discover which TensorFlow versions are installed:
# Show all available versions with descriptions
$ module spider TensorFlow
# Show versions you can load directly
$ module avail TensorFlow
# Show dependencies loaded by a specific module
$ module show TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
# Load an available TensorFlow version
$ module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
Note
Modules with CUDA in the name are GPU-enabled. Modules without CUDA run on CPU only.
EESSI
EESSI (European Environment for Scientific Software Installations) provides a curated collection of scientific software that is built once and distributed globally via CernVM-FS. Software is optimized for different CPU architectures and works seamlessly across systems.
Loading TensorFlow from EESSI
First, load the EESSI environment:
$ module load EESSI/2025.06
Then load TensorFlow:
$ module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
Note
Remember to load a TensorFlow version with CUDA support.
To run a TensorFlow job with EESSI, you must use a100 GPUs. Below follows a template for a Slurm script using EESSI.
#!/usr/bin/bash
#SBATCH --account=<your_account>
#SBATCH --job-name=<your_job_name>
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=8G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=a100 --gpus=1
#SBATCH --time=00:30:00
module reset
module load Zen2Env
module load EESSI/2025.06
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
# list modules for reproducibility
module list
# Run python script
python $SLURM_SUBMIT_DIR/<your_python_script>.py
For more information, see EESSI documentation.
Containers
If you need a specific TensorFlow version not available as a module or via EESSI, you can use a container. This gives you complete control over the software environment.
Using Apptainer/Singularity
The clusters support Apptainer/Singularity, which can run Docker images in HPC environments. For GPU workloads, NVIDIA provides optimized TensorFlow containers on the NVIDIA NGC catalog.
Tip
When choosing a container, ensure it supports:
GPU (look for tags with
gpuor check the description)The
amd64architecture (x86_64)
Pulling a container
# Pull a TensorFlow container from NVIDIA NGC
$ apptainer pull docker://nvcr.io/nvidia/tensorflow:21.09-tf2-py3
This creates a .sif file (Singularity Image Format) in your current directory.
Warning
Container images can be large (several GB). Store them in your project area rather than your home directory to avoid quota issues.
Verifying the container
To verify the container you must first start an interactive session:
$ srun --account=nnXXXXk --partition=accel --gpus=1 --mem=8G --time=00:10:00 --pty bash
Next, check TensorFlow version and verify GPU visibility:
$ apptainer run --nv tensorflow_21.09-tf2-py3.sif \
python -c "import tensorflow as tf; print(tf.__version__);print('Physical GPUs:', tf.config.list_physical_devices('GPU'))"
Running jobs with containers
In your Slurm job script, use apptainer exec with the --nv flag to enable GPU support:
apptainer exec --nv \
--bind /cluster/projects/nnXXXXk:/cluster/projects/nnXXXXk \
tensorflow_21.09-tf2-py3.sif python mnist.py
Key options:
--nv: Enables NVIDIA GPU support inside the container--bind: Mounts directories from the host system into the container
Note
The most recent containers work only on newer GPUs (like a100), whereas older containers can also be used on p100 GPUs (accel partition).
For more details on containers, see Running containers.
Loading data
There are two ways to load data:
Uploading data from your local machine using
rsyncLoading built-in datasets
Loading data using rsync
Use rsync to upload data from your computer:
rsync -zh --info=progress2 -r /path/to/dataset/folder <username>@saga.sigma2.no:~/.
For large amounts of data it is recommended to load into your Project area to avoid filling your home area.
To retrieving the path to the dataset, we can utilize python’s os
module to access the variable, like so:
# ...Somewhere in your experiment python file...
# Load 'os' module to get access to environment and path functions
import os
# Path to dataset
dataset_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'], 'dataset')
If you use a container, you must define your dataset path like this:
dataset_path = os.path.join(os.environ['PWD'], 'dataset')
Loading built-in datasets
The Keras library has the MNIST dataset built-in and provides a convenient function load it automatically. This saves time from having to download and preproces the data manually.
First we will need to download the dataset on the login node. Ensure that the
correct modules are loaded. Next open up an interactive python session with
python, then:
tf.keras.datasets.mnist.load_data()
This will download and cache the MNIST dataset which we can use for training models. Load the data in your training file like so:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
Saving model data
For saving model data and weights we suggest the TensorFlow built-in
checkpointing and save functions. Below follows an example of how to load built-in data and save weights.
#!/usr/bin/env python
import tensorflow as tf
import os
# Access storage path for '$SLURM_SUBMIT_DIR'
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
os.environ['SLURM_JOB_ID'])
# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255., x_test / 255.
def create_model():
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
model.compile(optimizer='adam',
loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
return model
# Create and display summary of model
model = create_model()
# Output, such as from the following command, is outputted into the '.out' file
# produced by 'sbatch'
model.summary()
# Save model in TensorFlow format
model.save(os.path.join(storage_path, "model"))
# Create checkpointing of weights
ckpt_path = os.path.join(storage_path, "checkpoints", "mnist-{epoch:04d}.ckpt")
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
filepath=ckpt_path,
save_weights_only=True,
verbose=1)
# Save initial weights
model.save_weights(ckpt_path.format(epoch=0))
# Train model with checkpointing
model.fit(x_train[:1000], y_train[:1000],
epochs=50,
callbacks=[ckpt_callback],
validation_data=(x_test[:1000], y_test[:1000]),
verbose=0)
The Python script above can be run with the following job script:
#!/usr/bin/bash
# Assumed to be 'mnist_test.sh'
#SBATCH --account=<your_account>
#SBATCH --job-name=<creative_job_name>
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=8G
## The following line can be omitted to run on CPU alone
#SBATCH --partition=accel --gpus=1
#SBATCH --time=00:30:00
# Purge modules and load tensorflow
module reset
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
# List loaded modules for reproducibility
module list
# Run python script
python $SLURM_SUBMIT_DIR/mnist.py
Once these two files are located on a Sigma2 resource we can run it with:
$ sbatch mnist_test.sh
In your code it is important to load the latest checkpoint if available, which can be retrieved with:
# Load weights from a previous run
ckpt_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
"<job_id to load checkpoints from>",
"checkpoints")
latest = tf.train.latest_checkpoint(ckpt_dir)
# Create a new model instance
model = create_model()
# Load the previously saved weights if they exist
if latest:
model.load_weights(latest)
Using TensorBoard
TensorBoard is a nice utility for comparing different runs and viewing progress during optimization. To enable this on Sigma2 resources we will need to write data into our home area and some steps are necessary for connecting and viewing the board.
We will continue to use the MNIST example from above. The following changes are needed to enable TensorBoard.
# In the 'mnist.py' script
import datetime
# We will store the 'TensorBoard' logs in the folder where the 'mnist_test.sh'
# file was launched and create a folder like 'logs/fit'. In your own code we
# recommended that you give these folders names that you will recognize,
# the last folder uses the time when the program was started to separate related
# runs
log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
"logs",
"fit",
datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
# Change the last line, where we fit our data in the example above, to also
# include the TensorBoard callback
model.fit(x_train[:1000], y_train[:1000],
epochs=50,
# Change here:
callbacks=[ckpt_callback, tensorboard_callback],
validation_data=(x_test[:1000], y_test[:1000]),
verbose=0)
Once you have started a job with the above code embedded, or have a previous run
which created a TensorBoard log, it can be viewed as follows.
Copy the log files into your local machine and ensure
TensorBoardis installed.Run
tensorboard --logdir=/path/to/logs/fit --port=0.In the output from the above command note which port
TensorBoardhas started on, the last line should look something like:TensorBoard 2.1.0 at http://localhost:44124/ (Press CTRL+C to quit).Click the link provided in the output to view the results.
Advanced topics
Using multiple GPUs
Since all of the GPU machines on Saga have four GPUs it can be beneficial for some workloads to distribute the work over more than one device at a time. This can be accomplished with the tf.distribute.MirroredStrategy.
To achieve actual speedup when using multiple GPUs, it is critical to use a data pipeline that keeps the GPUs busy. In the example below, we use .prefetch(tf.data.AUTOTUNE) to allow the CPU to prepare data while the GPUs are training, and we scale the batch size by the number of GPUs.
#!/usr/bin/env python
# Assumed to be `mnist_multiple_gpus.py`
import datetime
import os
import tensorflow as tf
# Access storage path for '$SLURM_SUBMIT_DIR'
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
os.environ['SLURM_JOB_ID'])
# --- 1. Define Distribution Strategy ---
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")
# --- 2. Scale Batch Size ---
# The global batch size should be multiplied by the number of GPUs
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
# --- 3. Create Dataset (OUTSIDE the strategy scope) ---
mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.
# prefetch(tf.data.AUTOTUNE) is critical for performance on HPC systems
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))\
.shuffle(60000)\
.repeat()\
.batch(BATCH_SIZE)\
.prefetch(tf.data.AUTOTUNE)
# Calculate correct steps to cover the whole 60k dataset
STEPS_PER_EPOCH = 60000 // BATCH_SIZE
# --- 4. Create Model within Strategy Scope ---
with strategy.scope():
def create_model():
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
model.compile(optimizer='adam',
loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
# FIX: Explicitly state sparse_categorical_accuracy
metrics=['sparse_categorical_accuracy'])
return model
model = create_model()
# Display summary of model
model.summary()
# --- 5. Logging and Checkpointing ---
log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'], "logs", "fit",
datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
# Create checkpointing of weights
ckpt_path = os.path.join(storage_path, "checkpoints", "mnist-{epoch:04d}.ckpt")
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
filepath=ckpt_path,
save_weights_only=True,
verbose=1)
# --- 6. Train the model ---
model.fit(train_dataset,
epochs=50,
steps_per_epoch=STEPS_PER_EPOCH,
callbacks=[ckpt_callback, tensorboard_callback])
# Save the *trained* model in TensorFlow format at the very end
model.save(os.path.join(storage_path, "trained_model"))
To ask for two GPUs, the following Slurm options need to be set:
--nodes=1: Keeps all GPUs on the same node soMirroredStrategycan communicate between them via fast NVLink rather than the network.--ntasks=1: A single Python process manages both GPUs;MirroredStrategyhandles the intra-process GPU coordination internally.--gpus=2: Requests two GPUs forMirroredStrategyto distribute across.--cpus-per-task=12: Allocates enough CPU cores to keep both GPUs fed with preprocessed data without bottlenecking.--mem-per-cpu=8G: With 12 CPUs this gives 96 GB of RAM in total — sufficient for large datasets and model weights.
#!/usr/bin/bash
#SBATCH --account=<your_account>
#SBATCH --job-name=<your_job_name>
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=accel
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus=2
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=8G
#SBATCH --time=00:30:00
# Reset modules and load tensorflow
module reset
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
# list modules for reproducibility
module list
# Run python script
srun python $SLURM_SUBMIT_DIR/mnist_multiple_gpus.py
Distributed training on multiple nodes
To utilize more than four GPUs we will turn to the Horovod project
which supports several different machine learning libraries and is capable of
utilizing MPI. Horovod is responsible for communicating between different
nodes and perform gradient computation, averaged over the different nodes.
The following example shows the key changes needed to adapt a single-GPU training
script for multi-node training with Horovod:
hvd.init(): Initializes Horovod and the underlying MPI communication.GPU pinning: Each process explicitly binds to its assigned GPU with
set_visible_devices(gpus[0], 'GPU'), ensuring no two ranks share a device.Scaled learning rate: The learning rate is multiplied by
hvd.size()(number of workers) to compensate for the larger effective batch size across all GPUs.hvd.DistributedOptimizer: Wraps the standard optimizer to average gradients across all workers after each step.Horovod callbacks:
BroadcastGlobalVariablesCallbackensures all workers start from the same initial weights;MetricAverageCallbackaverages metrics across workers;LearningRateWarmupCallbackgradually ramps up the learning rate to avoid instability at the start.Rank 0 only for I/O: Checkpointing, logging, and model saving are done only on rank 0 to avoid all workers writing to the same files simultaneously.
Scaled
steps_per_epoch: Divided byhvd.size()since each worker only sees a fraction of the data per epoch.
#!/usr/bin/env python
# Assumed to be 'mnist_hvd.py'
import datetime
import os
import tensorflow as tf
import horovod.tensorflow.keras as hvd
# Initialize Horovod.
hvd.init()
# Extract number of visible GPUs in order to pin them to MPI process
gpus = tf.config.experimental.list_physical_devices('GPU')
if hvd.rank() == 0:
print(f"Found the following GPUs: '{gpus}'")
# Allow memory growth on GPU, required by Horovod
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
# Bind the rank to a given GPU
if gpus:
# Each task sees only 1 GPU due to --gpus-per-task=1, so we always use index 0
print(f"Rank '{hvd.local_rank()}/{hvd.rank()}' using GPU: '{gpus[0]}'")
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
else:
print(f"No GPU(s) configured for ({hvd.local_rank()}/{hvd.rank()})!")
# Access storage path for '$SLURM_SUBMIT_DIR'
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
os.environ['SLURM_JOB_ID'])
# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.
# Create dataset for batching
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.repeat().shuffle(10000).batch(128)
# Define learning rate as a function of number of GPUs
scaled_lr = 0.001 * hvd.size()
def create_model():
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10)
])
# Horovod: adjust learning rate based on number of GPUs.
opt = tf.optimizers.Adam(scaled_lr)
# Horovod: wrap the optimizer to average gradients across all workers
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt,
loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'],
experimental_run_tf_function=False)
return model
# Create and display summary of model
model = create_model()
if hvd.rank() == 0:
model.summary()
# Create list of callbacks
callbacks = [
hvd.callbacks.BroadcastGlobalVariablesCallback(0),
hvd.callbacks.MetricAverageCallback(),
hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=3,
initial_lr=scaled_lr,
verbose=1 if hvd.rank() == 0 else 0),
]
# Only perform the following actions on rank 0
if hvd.rank() == 0:
log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
"logs",
"fit",
datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
histogram_freq=1)
model.save(os.path.join(storage_path, "model"))
ckpt_path = os.path.join(storage_path, "checkpoints", "mnist-{epoch:04d}.ckpt")
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
filepath=ckpt_path,
save_weights_only=True,
verbose=0)
model.save_weights(ckpt_path.format(epoch=0))
callbacks.extend([tensorboard_callback, ckpt_callback])
# Set verbosity to 1 only for the master node
verbose = 1 if hvd.rank() == 0 else 0
# Train model using the Dataset object
model.fit(dataset,
steps_per_epoch=500 // hvd.size(),
epochs=100,
callbacks=callbacks,
verbose=verbose)
The following Slurm script runs the distributed training job across 2 nodes. Key resource options:
--nodes=2: Requests 2 nodes, giving access to up to 8 GPUs in total (4 per node on Sagaaccelnodes).--ntasks=8: One MPI task per GPU — Horovod maps each task to one GPU.--gpus-per-task=1: Assigns exactly one GPU to each task.--cpus-per-task=6: 6 CPUs per task maps evenly to the 24 CPU cores available on Sagaaccelnodes.--mem-per-cpu=8G: Gives 48 GB of RAM per task and 192 GB per node, well within the 384 GB node limit.
Three environment variables configure the MPI and Horovod communication layer:
OMPI_MCA_btl="^openib": Disables InfiniBand transport, necessary on some setups to prevent MCA context errors.OMPI_MCA_pml="ob1": Selects theob1point-to-point management layer, consistent with disabling InfiniBand.HOROVOD_MPI_THREADS_DISABLE=1: Disables MPI thread support in Horovod, required by certain MPI implementations.
#!/usr/bin/bash
#SBATCH --account=<your_account>
#SBATCH --job-name=<your_job_name>
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=accel
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=6
#SBATCH --mem-per-cpu=8G
#SBATCH --time=00:30:00
# Reset modules and load TensorFlow and Horovod
module reset
module load CMake/3.23.1-GCCcore-11.3.0
module load Horovod/0.28.1-foss-2022a-CUDA-11.7.0-TensorFlow-2.11.0
# Export settings expected by Horovod and MPI
export OMPI_MCA_btl="^openib"
export OMPI_MCA_pml="ob1"
export HOROVOD_MPI_THREADS_DISABLE=1
# Run the Python script using srun
srun python $SLURM_SUBMIT_DIR/mnist_hvd.py
Note
To use a100 GPUs on Saga, replace --partition=accel with --partition=a100 and add module --force swap StdEnv Zen2Env to your job script
(more information here).