Using TensorFlow in Python
In this example we will try to utilize the TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1 library to execute a very simple computation on the GPU. We could do the following interactively in Python, but we will instead use a Slurm script, which makes it a bit more reproducible and in some sense a bit easier, since we don't have to sit and wait for the interactive session to start.
We will use the following simple calculation in Python and TensorFlow to test the GPUs. Save it as gpu_intro.py, since that is the file name the Slurm scripts below expect:
#!/usr/bin/env python3
import tensorflow as tf
# Test if there are any GPUs available
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
# Have TensorFlow log where computations are run
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
# Print result
print(c)
To run this we will first have to create a Slurm script in which we request resources. A good place to start is with a basic job script (see Job Scripts). Use the following to create submit_cpu.sh (remember to substitute your project number under --account):
#!/bin/bash
#SBATCH --job-name=TestGPUOnSaga
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=4G
#SBATCH --qos=devel
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
module list
python gpu_intro.py
If we just run the above Slurm script with sbatch submit_cpu.sh, the output (found in the same directory where you executed the sbatch command, with a name like slurm-<job-id>.out) will contain several errors as TensorFlow attempts to communicate with the GPU. However, the program will still run and give the following successful output:
Num GPUs Available: 0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
So the above eventually ran fine, but did not report any GPUs. The reason is of course that we never asked for any GPUs in the first place. To remedy this we will change the Slurm script to include the --partition=accel and --gpus=1 flags, and save it as submit_gpu.sh:
#!/bin/bash
#SBATCH --job-name=TestGPUOnSaga
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=4G
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
module list
python gpu_intro.py
We should now see the following output:
Num GPUs Available: 1
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
However, with complicated libraries such as TensorFlow we are still not guaranteed that the above actually ran on the GPU. TensorFlow produces some output that can verify this, but we will also check it manually, since that approach can be applied more generally.
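If you also want this check to come from inside the script itself, one option is to pin the computation to the GPU explicitly, so that the job fails loudly instead of silently falling back to the CPU. The following is only a minimal sketch of such a variant of gpu_intro.py, not part of the example above; it uses the standard tf.config, tf.debugging and tf.device APIs:
#!/usr/bin/env python3
import tensorflow as tf

# Fail early if TensorFlow cannot see any GPU at all
if not tf.config.list_physical_devices('GPU'):
    raise RuntimeError("No GPU visible to TensorFlow")

# Log which device each operation is placed on
tf.debugging.set_log_device_placement(True)

# Explicitly place the computation on the first GPU; TensorFlow
# raises an error if the operations cannot run there
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)

print(c)
Explicit placement trades flexibility for certainty: the script stops with an error on a node without a visible GPU rather than quietly running on the CPU.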
Monitoring the GPUs
To do this monitoring we will start nvidia-smi before our job and let it run while we use the GPU. We will change the submit_gpu.sh Slurm script above into submit_monitor.sh, shown below:
#!/bin/bash
#SBATCH --job-name=TestGPUOnSaga
#SBATCH --account=nn<XXXX>k
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=4G
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
module list
# Setup monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory \
--format=csv --loop=1 > "gpu_util-$SLURM_JOB_ID.csv" &
NVIDIA_MONITOR_PID=$! # Capture PID of monitoring process
# Run our computation
python gpu_intro.py
# After computation stop monitoring
kill -SIGINT "$NVIDIA_MONITOR_PID"
Note
The query used to monitor the GPU can be further extended by adding additional parameters to the --query-gpu flag. The available fields are listed in the nvidia-smi documentation and by nvidia-smi --help-query-gpu.
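As an illustration, and not part of the script above, the monitoring line could for example also record memory use, temperature and power draw; memory.used, temperature.gpu and power.draw are standard nvidia-smi query fields, but check the list on your system before relying on them:
# Extended monitoring sketch: same structure as in submit_monitor.sh,
# with a few extra query fields
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,temperature.gpu,power.draw \
    --format=csv --loop=1 > "gpu_util-$SLURM_JOB_ID.csv" &
NVIDIA_MONITOR_PID=$!  # Capture PID of monitoring process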
Run this script with sbatch submit_monitor.sh and check that the output file gpu_util-<job id>.csv actually contains some data. We can then use this data to ensure that we are actually using the GPU as intended. Pay particular attention to utilization.gpu, which shows the percentage of time the GPU was busy doing computations. It is not expected that this will always be 100%, since we also need to transfer data, but the average should be quite high.
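To get a quick summary of the log, a small Python sketch like the one below can compute the average GPU utilization from the CSV file. It is only an illustration: the script name gpu_util_summary.py is made up here, and the column label utilization.gpu [%] assumes the default CSV header written by the nvidia-smi command above.
#!/usr/bin/env python3
# gpu_util_summary.py - summarise a gpu_util-<job id>.csv file
import csv
import sys

filename = sys.argv[1]

utilization = []
with open(filename, newline='') as f:
    # skipinitialspace handles the blank after each comma in nvidia-smi's CSV
    reader = csv.DictReader(f, skipinitialspace=True)
    for row in reader:
        # Values look like "42 %", so strip the unit before converting
        utilization.append(float(row['utilization.gpu [%]'].strip(' %')))

if utilization:
    print(f"Samples: {len(utilization)}")
    print(f"Average GPU utilization: {sum(utilization) / len(utilization):.1f} %")
else:
    print(f"No samples found in {filename}")
You would run it as, for example, python gpu_util_summary.py gpu_util-<job id>.csv after the job has finished.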