Understanding the job output file
When a job begins to run, Slurm creates a file called slurm-<job id>.out. All output, both standard output and standard error, is sent to this file. In this way we can monitor what our application is doing while it is running under Slurm. Slurm also ensures that, no matter how many tasks or nodes the job uses, everything printed to standard output is collected into this one file.
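If you want to follow the output while the job is running, you can watch the file as it grows. Below is a minimal sketch; replace the placeholder with your own job ID. The file name itself can be changed with the --output option in the Slurm script, where %j expands to the job ID:
# Follow the output file of a running job (Ctrl-C to stop following)
tail -f slurm-<job id>.out
# Optionally, inside the Slurm script, choose a different file name:
#SBATCH --output=my-job-%j.out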
Tip
When contacting support with a job error, always attach both the Slurm script used to submit the job and the Slurm output file. This helps us understand the error and reduces miscommunication.
To illustrate the output we will use the following Slurm script:
#!/bin/bash
#SBATCH --account=nn<XXXX>k
#SBATCH --job-name=TestGPUOutput
#SBATCH --time=12:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2G
#SBATCH --partition=accel
#SBATCH --gpus=1
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load hipSYCL/0.9.1-gcccuda-2020b
module list # For easier debugging
# Activate local virtual environment
source env/bin/activate
# Run our application test on compute node with GPU
pytest
This script does nothing special and follows the best practices for a Slurm script as described in our introduction to Slurm scripts.
We ran the script on Saga, which produced an output file called slurm-4677199.out.
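For completeness, the script was submitted in the usual way. The sketch below assumes the script is saved as run.slurm (the job name shown in the statistics later on this page):
sbatch run.slurm        # Slurm replies with "Submitted batch job 4677199"
squeue -u $USER         # check that the job is pending or running
ls slurm-4677199.out    # the output file appears once the job has started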
The anatomy of the output file is as follows (we will explain each part in detail below):
Header created when the application launched
Application specific output
CPU and memory statistics
Disk read/write statistics
(Optional) GPU statistics
Output header
Below, we have attached the first 15 lines of the output from our sample application above, which contain some shared information and some job-specific output. The first highlighted line shows which node(s) the job ran on and when it started; this is always added to the output. Next, we have highlighted the beginning of the module list output. This output can be very useful for checking that the job does what you expect, and it lets you see which software versions you were using.
Starting job 4677199 on c7-2 at Thu Dec 16 09:31:53 CET 2021
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) StdEnv
Currently Loaded Modules:
1) StdEnv (S)
2) GCCcore/10.2.0
3) zlib/1.2.11-GCCcore-10.2.0 (H)
4) binutils/2.35-GCCcore-10.2.0 (H)
5) GCC/10.2.0
6) CUDAcore/11.1.1
7) CUDA/11.1.1-GCC-10.2.0
After this follows the application output, which we will not show here as it is application-specific. However, using standard output to log what your application is doing at different times can be a good way to better understand how your application is running on the HPC machines.
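One simple way to do this is to print timestamped messages around the main steps in the Slurm script. A small sketch (the echo lines are our own addition, not part of the example script above):
echo "$(date): starting pytest"
pytest
echo "$(date): pytest finished"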
CPU statistics
Note
Slurm collects statistics every second, which means that for applications that only run for a short time the CPU, memory, disk and GPU statistics can be a bit misleading. Keep this in mind when developing your Slurm script and when scaling up.
Once the application finishes running, successfully or not, some statistics about resource usage are printed. The first of these are the CPU and memory statistics. Using our example from above, here are the CPU and memory statistics.
Task and CPU usage stats:
JobID JobName AllocCPUS NTasks MinCPU MinCPUTask AveCPU Elapsed ExitCode
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- --------
4677199 run.slurm 16 00:01:05 0:0
4677199.bat+ batch 16 1 00:02:11 0 00:02:11 00:01:05 0:0
4677199.ext+ extern 16 1 00:00:00 0 00:00:00 00:01:05 0:0
Memory usage stats:
JobID MaxRSS MaxRSSTask AveRSS MaxPages MaxPagesTask AvePages
------------ ---------- ---------- ---------- -------- -------------- ----------
4677199
4677199.bat+ 150344K 0 150344K 0 0 0
4677199.ext+ 0 0 0 0 0 0
In the above output we can see a few different things, but let's first explain the meaning of batch and extern. When you submit a Slurm script, the script itself is counted as batch by Slurm. That means that any resources used by your Slurm script are accounted for under this heading. If you run a command directly within the Slurm script, the time this command used is accounted for under batch. Looking back at our Slurm script above, we can see that the main application was pytest. We could alternatively have used srun pytest, which would create a new line in the output accounting for everything under the srun call (multiple srun calls are accounted for as separate line items in the output). extern covers everything outside these two usages and should be fairly small. One example of extern usage is SSH-ing to the node where your code is running to inspect it; the resources used during that inspection would be accounted for under extern.
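As a small sketch of the difference, the two variants below would be accounted for differently; the first is what the example script does, while the second adds a separate step line per srun call:
# Accounted for under the "batch" step:
pytest

# Accounted for as its own job step, one line per srun call:
srun pytest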
Below, we have highlighted the batch step in both the CPU and memory statistics; this is most likely where you will find the most useful information. In the CPU statistics we can see that we allocated 16 CPU cores (AllocCPUS) for 1 task (NTasks), that the script ran for about 1 minute (Elapsed), but that the 16 CPU cores ran for around 2 minutes in total (AveCPU). Lastly, we can see that the job ended successfully, since the ExitCode is 0:0 (an exit code of 0 means success on Linux).
Task and CPU usage stats:
JobID JobName AllocCPUS NTasks MinCPU MinCPUTask AveCPU Elapsed ExitCode
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- --------
4677199 run.slurm 16 00:01:05 0:0
4677199.bat+ batch 16 1 00:02:11 0 00:02:11 00:01:05 0:0
4677199.ext+ extern 16 1 00:00:00 0 00:00:00 00:01:05 0:0
Memory usage stats:
JobID MaxRSS MaxRSSTask AveRSS MaxPages MaxPagesTask AvePages
------------ ---------- ---------- ---------- -------- -------------- ----------
4677199
4677199.bat+ 150344K 0 150344K 0 0 0
4677199.ext+ 0 0 0 0 0 0
The other line we highlighted above is the memory footprint of our application. The most pertinent information here is the maximum memory used (MaxRSS), which is the maximum memory used by any single task and not the total memory footprint of the application as a whole, and the average memory used (AveRSS), which is the average Resident Set Size over all tasks in the given job step (for example, if you run srun several times in a job, each call gets its own AveRSS, and the job script itself is counted as the .batch job step). Above, we can see that our application used a maximum of 150344 KiB (or around 147 MiB). This information is very useful, since it can be used to optimize your Slurm script to request less memory (and thus be scheduled more quickly).
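The same numbers can also be retrieved after the job has finished with sacct, which is handy when the output file is no longer at hand. A minimal sketch, using the example job ID from this page and an arbitrary selection of fields:
sacct -j 4677199 --format=JobID,JobName,AllocCPUS,NTasks,AveCPU,Elapsed,MaxRSS,AveRSS,ExitCode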
Disk statistics
The next section of the output shows our application's disk usage. This information is important because it can help you reduce unnecessary disk access, which is comparatively slow, especially for many small files.
Disk usage stats:
JobID MaxDiskRead MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteTask AveDiskWrite
------------ ------------ --------------- -------------- ------------ ---------------- --------------
4677199
4677199.bat+ 8.55M 0 8.55M 0.05M 0 0.05M
4677199.ext+ 0.00M 0 0.00M 0 0 0
As we can see, our test application reads a bit of data, but writes very little.
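The disk numbers can likewise be queried with sacct after the job has finished; a minimal sketch using the same example job ID:
sacct -j 4677199 --format=JobID,MaxDiskRead,MaxDiskWrite,AveDiskRead,AveDiskWrite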
GPU statistics
Note
GPU statistics are only output if your Slurm script requests one or more GPUs.
Currently, only Saga supports automatic GPU metric collection. We are working on enabling the same support on Betzy. One alternative that works on both Saga and Betzy, but requires manual intervention, is described in our CUDA introduction.
To give some indication of how your application interacts with GPUs, the following GPU statistics are collected during a run. The first thing to note in the GPU output is that statistics are collected for every GPU requested, and each GPU is displayed separately. Below we have highlighted the information of most interest.
GPU usage stats:
Successfully retrieved statistics for job: 4677199.
+------------------------------------------------------------------------------+
| GPU ID: 0 |
+====================================+=========================================+
|----- Execution Stats ------------+-----------------------------------------|
| Start Time | Thu Dec 16 09:31:54 2021 |
| End Time | Thu Dec 16 09:32:58 2021 |
| Total Execution Time (sec) | 64.4 |
| No. of Processes | 1 |
+----- Performance Stats ----------+-----------------------------------------+
| Energy Consumed (Joules) | 3104 |
| Power Usage (Watts) | Avg: 103.465, Max: 172.91, Min: 34.02 |
| Max GPU Memory Used (bytes) | 466616320 |
| SM Clock (MHz) | Avg: 1258, Max: 1328, Min: 1189 |
| Memory Clock (MHz) | Avg: 715, Max: 715, Min: 715 |
| SM Utilization (%) | Avg: 50, Max: 100, Min: 0 |
| Memory Utilization (%) | Avg: 8, Max: 17, Min: 0 |
| PCIe Rx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
| PCIe Tx Bandwidth (megabytes) | Avg: N/A, Max: N/A, Min: N/A |
+----- Event Stats ----------------+-----------------------------------------+
| Single Bit ECC Errors | 0 |
| Double Bit ECC Errors | 0 |
| PCIe Replay Warnings | 0 |
| Critical XID Errors | 0 |
+----- Slowdown Stats -------------+-----------------------------------------+
| Due to - Power (%) | 0 |
| - Thermal (%) | 0 |
| - Reliability (%) | Not Supported |
| - Board Limit (%) | Not Supported |
| - Low Utilization (%) | Not Supported |
| - Sync Boost (%) | 0 |
+-- Compute Process Utilization ---+-----------------------------------------+
| PID | 7392 |
| Avg SM Utilization (%) | 15 |
| Avg Memory Utilization (%) | 2 |
+----- Overall Health -------------+-----------------------------------------+
| Overall Health | Healthy |
+------------------------------------+-----------------------------------------+
Max GPU Memory Used is quite self-explanatory, but can be important to check (maybe your application can trade higher memory usage for more performance). SM Utilization describes how well our application used the GPU compute resources. If the maximum value here is low, this could be an indication that your application is not utilizing the requested GPUs. For our example application not all steps utilized the GPU, so we get a medium average utilization. The other variable to pay extra attention to is Memory Utilization, which describes how well our application utilized the GPU memory bandwidth. For longer-running applications, optimizing memory transfers is one of the first optimizations that should be undertaken when tuning for GPUs.
Note
The information under PID tends to not be as accurate as the other information, so take it with a grain of salt.
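If you want to watch GPU utilization yourself while the job runs (for example on a system where the automatic statistics are not yet available), one option is to sample nvidia-smi in the background from the Slurm script. This is only a rough, self-contained sketch of the idea (see the CUDA introduction mentioned above for the supported approach); the file name gpu_usage.csv and the 10 second sampling interval are arbitrary choices:
# Sample GPU utilization every 10 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
    --format=csv --loop=10 > gpu_usage.csv &
NVIDIA_SMI_PID=$!

pytest                  # run the actual application

kill $NVIDIA_SMI_PID    # stop the sampling once the application is done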