Using TensorFlow in Python
In this example we will try to utilize the TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1 library to execute a very simple computation on the GPU. We could do the following interactively in Python, but we will instead use a Slurm script, which makes it a bit more reproducible and in some sense a bit easier, since we don't have to sit and wait for the interactive session to start.
We will use the following simple calculation in Python and TensorFlow to test the GPUs. Save it as gpu_intro.py, since that is the file name the Slurm scripts below expect:
#!/usr/bin/env python3
import tensorflow as tf
# Test if there are any GPUs available
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# Have TensorFlow log where computations are run
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
# Print result
print(c)
To run this we will first have to create a Slurm script in which we request resources. A good place to start is with a basic job script (see Job Scripts). Use the following to create submit_cpu.sh (remember to substitute your project number under --account):
#!/bin/bash
#SBATCH --job-name=TestGPUOnSaga
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=4G
#SBATCH --qos=devel
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
module list
python gpu_intro.py
If we just run the above Slurm script with sbatch submit_cpu.sh, the output (found in the same directory where you executed the sbatch command, with a name like slurm-<job-id>.out) will contain several errors as TensorFlow attempts to communicate with the GPU. However, the program will still run and give the following successful output:
Num GPUs Available: 0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
So the above eventually ran fine, but did not report any GPUs. The reason is of course that we never asked for any GPUs in the first place. To remedy this we will change the Slurm script to include the --partition=accel and --gpus=1 flags, and save it as submit_gpu.sh:
#!/bin/bash
#SBATCH --job-name=TestGPUOnSaga
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=4G
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
module list
python gpu_intro.py
We should now see the following output:
Num GPUs Available: 1
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
However, with complicated libraries such as TensorFlow we are still not guaranteed that the above actually ran on the GPU. TensorFlow produces some output that can verify this, but we will also check it manually, since that approach can be applied more generally.
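If you also want this check to come from inside the script itself, one option is to pin the computation to the GPU explicitly, so that the job fails loudly instead of silently falling back to the CPU. The following is only a minimal sketch of such a variant of gpu_intro.py, not part of the example above; it uses the standard tf.config, tf.debugging and tf.device APIs:
#!/usr/bin/env python3
import tensorflow as tf

# Fail early if TensorFlow cannot see any GPU at all
if not tf.config.list_physical_devices('GPU'):
    raise RuntimeError("No GPU visible to TensorFlow")

# Log which device each operation is placed on
tf.debugging.set_log_device_placement(True)

# Explicitly place the computation on the first GPU; TensorFlow
# raises an error if the operations cannot run there
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)

print(c)
Explicit placement trades flexibility for certainty: the script stops with an error on a node without a visible GPU rather than quietly running on the CPU.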
Monitoring the GPUs
To do this monitoring we will start nvidia-smi before our job and let it run while we use the GPU. We will change the submit_gpu.sh Slurm script above into submit_monitor.sh, shown below:
#!/bin/bash
#SBATCH --job-name=TestGPUOnSaga
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=4G
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
module list
# Setup monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory \
--format=csv --loop=1 > "gpu_util-$SLURM_JOB_ID.csv" &
NVIDIA_MONITOR_PID=$! # Capture PID of monitoring process
# Run our computation
python gpu_intro.py
# After computation stop monitoring
kill -SIGINT "$NVIDIA_MONITOR_PID"
Note
The query used to monitor the GPU can be further extended by adding additional parameters to the --query-gpu flag. The available fields are listed in the nvidia-smi documentation and by nvidia-smi --help-query-gpu.
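As an illustration, and not part of the script above, the monitoring line could for example also record memory use, temperature and power draw; memory.used, temperature.gpu and power.draw are standard nvidia-smi query fields, but check the list on your system before relying on them:
# Extended monitoring sketch: same structure as in submit_monitor.sh,
# with a few extra query fields
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,temperature.gpu,power.draw \
    --format=csv --loop=1 > "gpu_util-$SLURM_JOB_ID.csv" &
NVIDIA_MONITOR_PID=$!  # Capture PID of monitoring process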
Run this script with sbatch submit_monitor.sh and check that the output file gpu_util-<job id>.csv actually contains some data. We can then use this data to ensure that we are actually using the GPU as intended. Pay particular attention to utilization.gpu, which shows the percentage of time the GPU was busy doing computations. It is not expected that this will always be 100%, since we also need to transfer data, but the average should be quite high.
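To get a quick summary of the log, a small Python sketch like the one below can compute the average GPU utilization from the CSV file. It is only an illustration: the script name gpu_util_summary.py is made up here, and the column label utilization.gpu [%] assumes the default CSV header written by the nvidia-smi command above.
#!/usr/bin/env python3
# gpu_util_summary.py - summarise a gpu_util-<job id>.csv file
import csv
import sys

filename = sys.argv[1]

utilization = []
with open(filename, newline='') as f:
    # skipinitialspace handles the blank after each comma in nvidia-smi's CSV
    reader = csv.DictReader(f, skipinitialspace=True)
    for row in reader:
        # Values look like "42 %", so strip the unit before converting
        utilization.append(float(row['utilization.gpu [%]'].strip(' %')))

if utilization:
    print(f"Samples: {len(utilization)}")
    print(f"Average GPU utilization: {sum(utilization) / len(utilization):.1f} %")
else:
    print(f"No samples found in {filename}")
You would run it as, for example, python gpu_util_summary.py gpu_util-<job id>.csv after the job has finished.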