Understanding the job output file

When a job begins to run, it creates a file called slurm-<job id>.out. All output, both standard output and standard error, is sent to this file, which lets us monitor what our application is doing while it runs under Slurm. Slurm also ensures that, no matter how many tasks or nodes the job uses, everything printed to standard output is collected into this single output file.
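For example, one simple way to follow the output of a running job from the login node (replace <job id> with the ID reported by sbatch) is:

# Follow the output file of a running job as new lines are written
tail -f slurm-<job id>.out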

Tip

When contacting support about a job error, always attach the Slurm script used to submit the job as well as the Slurm output file. This helps us understand the error and reduces miscommunication.

To illustrate the output we will use the following Slurm script:

#!/bin/bash

#SBATCH --account=nn<XXXX>k
#SBATCH --job-name=TestGPUOutput
#SBATCH --time=12:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2G

#SBATCH --partition=accel
#SBATCH --gpus=1

set -o errexit  # Exit the script on any error
set -o nounset  # Treat any unset variables as an error

module --quiet purge  # Reset the modules to the system default
module load hipSYCL/0.9.1-gcccuda-2020b
module list     # For easier debugging

# Activate local virtual environment
source env/bin/activate
# Run our application test on compute node with GPU
pytest

This script does nothing special and follows the best practices for Slurm scripts described in our introduction to Slurm scripts. We ran the script on Saga, which produced an output file called slurm-4677199.out.
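For reference, the job ID in the file name is the one reported by sbatch when the script is submitted. Assuming the script above is saved as run.slurm, this looks roughly like:

# Submit the script; the reported job ID is reused in the output file name
sbatch run.slurm
# -> Submitted batch job 4677199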

The anatomy of the output file is as follows (we will explain each part in detail below):

  1. Header created when the application launched

  2. Application specific output

  3. CPU and memory statistics

  4. Disk read/write statistics

  5. (Optional) GPU statistics

Output header

Below, we have attached the first 15 lines of the output from our sample application above, which contain some shared information and some job-specific output. The first highlighted line shows which node(s) the job ran on and when it started; this is always added to the output. Next, we have highlighted the beginning of the module list output. This output is very useful for checking that the job does what you expect, and it lets you see which software versions you were using.

Starting job 4677199 on c7-2 at Thu Dec 16 09:31:53 CET 2021

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) StdEnv

Currently Loaded Modules:
  1) StdEnv                           (S)
  2) GCCcore/10.2.0
  3) zlib/1.2.11-GCCcore-10.2.0       (H)
  4) binutils/2.35-GCCcore-10.2.0     (H)
  5) GCC/10.2.0
  6) CUDAcore/11.1.1
  7) CUDA/11.1.1-GCC-10.2.0
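If you only want to re-check this header for a finished job, for instance to confirm the module versions, something like the following on the login node is enough:

# Show only the first lines of the output file, containing the header
head -n 15 slurm-4677199.out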

After this follows the application output, which we will not show here as it is application-specific. However, logging to standard output what your application is doing at different points in time can be a good way to better understand how it is running on the HPC machines.
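As a minimal sketch of such logging, you could add a few echo statements (the messages here are just examples) around the main steps of the Slurm script; they end up in the same output file as the rest of the output:

# Hypothetical log messages around the main step of the job script
echo "$(date): starting tests"
pytest
echo "$(date): tests finished"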

CPU statistics

Note

Slurm collects statistics every second, which means that for applications that run for only a short amount of time, the CPU, memory, disk and GPU statistics can be a bit misleading. Keep this in mind when developing your Slurm script and when scaling up.

Once the application finishes running, whether successfully or not, some statistics about resource usage are printed. The first of these are the CPU and memory statistics. Using our example from above, they look as follows:

Task and CPU usage stats:
       JobID    JobName  AllocCPUS   NTasks     MinCPU MinCPUTask     AveCPU    Elapsed ExitCode 
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- -------- 
4677199       run.slurm         16                                             00:01:05      0:0 
4677199.bat+      batch         16        1   00:02:11          0   00:02:11   00:01:05      0:0 
4677199.ext+     extern         16        1   00:00:00          0   00:00:00   00:01:05      0:0 

Memory usage stats:
       JobID     MaxRSS MaxRSSTask     AveRSS MaxPages   MaxPagesTask   AvePages 
------------ ---------- ---------- ---------- -------- -------------- ---------- 
4677199                                                                          
4677199.bat+    150344K          0    150344K        0              0          0 
4677199.ext+          0          0          0        0              0          0 

In the above output we can see a few different things, but let's first explain the meaning of batch and extern. When you submit a Slurm script, the script itself is counted as batch by Slurm. That means that any resources used by your Slurm script are accounted for under this heading; if you run a command directly within the Slurm script, the time this command used is accounted for under batch. Looking back at our Slurm script above, we can see that the main application was pytest. We could alternatively have used srun pytest, which would create a new line in the output accounting for everything under that srun call (multiple srun calls are accounted for as separate line items in the output). extern is everything outside these two usages and should be fairly small. One example of extern usage is SSH-ing to the node where your code is running to inspect it; the resources used during the inspection are then accounted for under extern.
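As a sketch, splitting the work into several srun calls (the test paths here are hypothetical) would give each call its own numbered job step in the statistics, in addition to the batch and extern lines:

# Each srun call becomes its own job step (e.g. 4677199.0 and 4677199.1)
# and is accounted for on its own line in the statistics
srun pytest tests/unit
srun pytest tests/integration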


Below, we have highlighted the batch step in both the CPU and memory statistics; this is most likely where you will find the most useful information. In the CPU statistics we can see that we allocated 16 CPU cores (AllocCPUS) for 1 task (NTasks), the script ran for about 1 minute of wall time (Elapsed), and the CPU cores together used around 2 minutes of CPU time (AveCPU). Lastly, we can see that the job ended successfully, since the ExitCode is 0:0 (an exit code of 0 means success on Linux).

Task and CPU usage stats:
       JobID    JobName  AllocCPUS   NTasks     MinCPU MinCPUTask     AveCPU    Elapsed ExitCode 
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- -------- 
4677199       run.slurm         16                                             00:01:05      0:0 
4677199.bat+      batch         16        1   00:02:11          0   00:02:11   00:01:05      0:0 
4677199.ext+     extern         16        1   00:00:00          0   00:00:00   00:01:05      0:0 

Memory usage stats:
       JobID     MaxRSS MaxRSSTask     AveRSS MaxPages   MaxPagesTask   AvePages 
------------ ---------- ---------- ---------- -------- -------------- ---------- 
4677199                                                                          
4677199.bat+    150344K          0    150344K        0              0          0 
4677199.ext+          0          0          0        0              0          0 

The other line we highlighted above concerns the memory footprint of our application. The most pertinent information here is the maximum memory used (MaxRSS), which is the maximum memory used by any single task, not the total memory footprint of the application as a whole, and the average memory used (AveRSS), which is the average Resident Set Size over all tasks in the given job step (for example, if you run srun several times in a job, each call gets its own AveRSS, and the job script itself is counted as the batch job step). Above, we can see that our application used a maximum of 150344 KiB (around 147 MiB). This information is very important, as it can be used to optimize your Slurm script to request less memory (and thus be scheduled more quickly).
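If the output file is no longer at hand, the same CPU and memory statistics can usually be retrieved from the Slurm accounting database with sacct. A minimal sketch, using the job ID from our example:

# Query the accounting database for CPU and memory statistics of a finished job
sacct -j 4677199 --format=JobID,JobName,AllocCPUS,NTasks,AveCPU,MaxRSS,AveRSS,Elapsed,ExitCode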

Disk statistics

The next section shows our application's disk usage. This information is important because it can help you reduce unnecessary disk access, which is slow, especially when reading or writing many small files.

Disk usage stats:
       JobID  MaxDiskRead MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteTask   AveDiskWrite 
------------ ------------ --------------- -------------- ------------ ---------------- -------------- 
4677199                                                                                               
4677199.bat+        8.55M               0          8.55M        0.05M                0          0.05M 
4677199.ext+        0.00M               0          0.00M            0                0              0 

As we can see, our test application reads a bit of data, but writes very little.
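The corresponding disk fields can also be queried with sacct after the job has finished, for example:

# Query the accounting database for the disk statistics of a finished job
sacct -j 4677199 --format=JobID,MaxDiskRead,AveDiskRead,MaxDiskWrite,AveDiskWrite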

GPU statistics

Note

GPU statistics are only output if your Slurm script requests one or more GPUs.

Currently, only Saga supports automatic GPU metric collection. We are working on enabling the same support on Betzy. One alternative that works on both Saga and Betzy, but requires manual intervention, is described in our CUDA introduction.
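A minimal sketch of such manual monitoring, assuming nvidia-smi is available on the GPU node (this is not necessarily identical to the approach in the CUDA introduction), is to sample the GPU in the background from within the Slurm script:

# Sample GPU utilization and memory every 10 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
    --format=csv --loop=10 > "gpu_usage-${SLURM_JOB_ID}.csv" &
MONITOR_PID=$!

# ... run the GPU application here ...

# Stop the monitoring before the job script exits
kill "${MONITOR_PID}"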

To give some indication of how your application interacts with GPUs, the following GPU statistics are collected during a run. The first thing to note in the GPU output is that statistics are collected for every GPU requested, and each GPU is displayed separately. Below, we have highlighted the information of most interest.

GPU usage stats:
Successfully retrieved statistics for job: 4677199. 
+------------------------------------------------------------------------------+
| GPU ID: 0                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                         | Thu Dec 16 09:31:54 2021                |
| End Time                           | Thu Dec 16 09:32:58 2021                |
| Total Execution Time (sec)         | 64.4                                    |
| No. of Processes                   | 1                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 3104                                    |
| Power Usage (Watts)                | Avg: 103.465, Max: 172.91, Min: 34.02   |
| Max GPU Memory Used (bytes)        | 466616320                               |
| SM Clock (MHz)                     | Avg: 1258, Max: 1328, Min: 1189         |
| Memory Clock (MHz)                 | Avg: 715, Max: 715, Min: 715            |
| SM Utilization (%)                 | Avg: 50, Max: 100, Min: 0               |
| Memory Utilization (%)             | Avg: 8, Max: 17, Min: 0                 |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Single Bit ECC Errors              | 0                                       |
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | Not Supported                           |
|        - Board Limit (%)           | Not Supported                           |
|        - Low Utilization (%)       | Not Supported                           |
|        - Sync Boost (%)            | 0                                       |
+--  Compute Process Utilization  ---+-----------------------------------------+
| PID                                | 7392                                    |
|     Avg SM Utilization (%)         | 15                                      |
|     Avg Memory Utilization (%)     | 2                                       |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

Max GPU Memory Used is quite self-explanatory, but can be important to check (maybe your application can trade higher memory usage for more performance). SM Utilization describes how well our application used the GPU compute resources. If the maximum value here is low, this could be an indication that your application is not utilizing the requested GPUs. For our example application, not all steps utilized the GPU, so we get a medium average utilization. The other variable to pay extra attention to is Memory Utilization, which describes how well our application utilized the GPU memory bandwidth. For longer-running applications, optimizing memory transfers is one of the first optimizations that should be undertaken to improve GPU performance.

Note

The information under PID tends not to be as accurate as the other information, so take it with a grain of salt.