How to check the performance and scaling using Arm Performance Reports
Arm Performance Reports is a simple-to-use performance evaluation tool that produces a clear, single-file report giving a high-level overview of an application's performance characteristics.
It can report CPU time spent on various types of instructions (e.g., floating-point), communication time (MPI), multi-threading level and thread synchronization overheads, memory bandwidth, and IO performance. Such a report can help spot bottlenecks in the code and highlight potential optimization directions, and it can also suggest simple changes in how the code should be run to better utilize the resources. Typical examples of such suggestions are:
The CPU performance appears well-optimized for numerical computation. The biggest gains may now come from running at larger scales.
or
Significant time is spent on memory accesses. Use a profiler to identify time-consuming loops and check their cache performance.
A successful Arm Performance Reports run produces two files, an HTML summary and a text summary, as in this example:
example_128p_4n_1t_2020-05-23_18-04.html
example_128p_4n_1t_2020-05-23_18-04.txt
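After several profiling runs the working directory can contain many such report pairs. The following is a minimal sketch of a helper that prints the newest text report in a directory. The function name `latest_report` is hypothetical, and the glob pattern is only inferred from the example file names above, so adjust it if your reports are named differently.

```shell
# Hypothetical helper (not part of Arm Performance Reports):
# print the most recently modified text report in a directory.
# The pattern *_*p_*n_*t_*.txt is an assumption based on the
# example file names shown above.
latest_report() {
    ls -t "${1:-.}"/*_*p_*n_*t_*.txt 2>/dev/null | head -n 1
}

# Usage (in the directory where the job ran):
#   cat "$(latest_report)"
```

The text report is convenient for a quick look on the login node; the HTML report contains the same summary in a form meant for a browser.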
Do I need to recompile the code?
You can use Arm Performance Reports on dynamically linked binaries without recompilation. However, you may have to recompile statically linked binaries (for this please consult the official documentation).
Due to a bug in older versions of OpenMPI, on Fram Arm Performance Reports works only with OpenMPI version 3.1.3 and newer. If you have compiled your application with OpenMPI 3.1.1, you don’t need to recompile it. Simply load the 3.1.3 module - those versions are compatible.
Profiling a batch script
Consider the following example job script, which stands in for the computation you wish to profile:
#!/bin/bash -l
# all your SBATCH directives
#SBATCH --account=myaccount
#SBATCH --job-name=without-apr
#SBATCH --time=0-00:05:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
# recommended bash safety settings
set -o errexit # make bash exit on any error
set -o nounset # treat unset variables as errors
srun ./myexample.x # <- we will need to modify this line
To profile the code you don’t need to modify any of the #SBATCH directives. All we need to do is load the Arm-PerfReports/20.0.3 module and put perf-report in front of the srun command:
#!/bin/bash -l
# ...
# we kept the top of the script unchanged
# we added this line:
module load Arm-PerfReports/20.0.3
# we added these two lines:
echo "set sysroot /" > gdbfile
export ALLINEA_DEBUGGER_USER_FILE=gdbfile
# we added perf-report in front of srun
perf-report srun ./myexample.x
This works the same way on Saga, Fram, and Betzy.
In other words, add 3 lines and replace srun or mpirun -n ${SLURM_NTASKS} by perf-report srun. That’s it.
Why are these two lines needed?
echo "set sysroot /" > gdbfile
export ALLINEA_DEBUGGER_USER_FILE=gdbfile
We have a Slurm plug-in that (deliberately) detaches a job from the global mount namespace in order to create private, bind-mounted versions of /tmp and /var/tmp for each job. This is done both so that jobs cannot see other jobs’ /tmp and /var/tmp, and so that we avoid filling up the global /tmp and /var/tmp (since we allow more than one job per compute node, we cannot clean these directories after each job, because we don’t know which job created the files). However, for perf-report to work with this setup we need to set GDB’s sysroot to /.
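If you profile from several scripts, the two lines can be wrapped in a small function. This is a sketch only: the name `setup_gdbfile` is hypothetical, and writing the file under an absolute path is a convenience chosen here, not a requirement stated above.

```shell
# Hypothetical helper: write the GDB settings file into a directory
# (default: the current directory) and point the profiler at it.
setup_gdbfile() {
    dir="${1:-$PWD}"
    # Same content as in the job script above.
    echo "set sysroot /" > "$dir/gdbfile"
    export ALLINEA_DEBUGGER_USER_FILE="$dir/gdbfile"
}

# Usage in a job script, after loading the module:
#   setup_gdbfile
#   perf-report srun ./myexample.x
```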
Profiling on an interactive compute node
To run interactive tests, submit an interactive job to Slurm using srun (not salloc).
First obtain an interactive compute node (adjust “myaccount”), on Saga:
$ srun --nodes=1 --ntasks-per-node=4 --mem-per-cpu=1G --time=00:30:00 --qos=devel --account=myaccount --pty bash -i
or Fram:
$ srun --nodes=1 --ntasks-per-node=32 --time=00:30:00 --qos=devel --account=myaccount --pty bash -i
or Betzy:
$ srun --nodes=1 --ntasks-per-node=128 --time=00:30:00 --qos=devel --account=myaccount --pty bash -i
Once you get the interactive node, you can run the profile:
$ module load Arm-PerfReports/20.0.3
$ echo "set sysroot /" > gdbfile
$ export ALLINEA_DEBUGGER_USER_FILE=gdbfile
$ perf-report srun ./myexample.x
Use cases and pitfalls
We demonstrate some pitfalls of profiling, and show how one can use profiling to reason about the performance of real-world codes.
STREAM benchmark (measures the memory bandwidth)
LINPACK benchmark (measures the floating-point capabilities)
OSU benchmark (measures the interconnect performance)
Known issues
Arm Performance Reports may fail if too many processes are started on a single node, because of the default ulimit -u value (4096). This can be fixed by setting ulimit -u to a higher number, e.g., by adding the line ulimit -u 40960 to your job script.
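If the job script uses `set -o errexit` as in the example above, a plain `ulimit -u 40960` aborts the job when the requested value exceeds the hard limit. The following sketch caps the request at the hard limit and only warns on failure. The function name `pick_nproc_limit` is hypothetical, and the value 40960 is taken from the text above.

```shell
# Hypothetical helper: choose the value to pass to `ulimit -u`,
# capping the wanted value at the hard limit.
pick_nproc_limit() {
    wanted="$1"
    hard="$2"
    if [ "$hard" = "unlimited" ] || [ "$wanted" -le "$hard" ]; then
        echo "$wanted"
    else
        echo "$hard"
    fi
}

# Set the soft limit to 40960, or to the hard limit if that is lower.
# Warn instead of failing, so that errexit does not stop the job here.
ulimit -u "$(pick_nproc_limit 40960 "$(ulimit -H -u)")" \
    || echo "warning: could not change ulimit -u" >&2
```

Place this before the perf-report line so that the new limit is inherited by the profiled processes.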
There seems to be a compatibility issue between Arm Performance Reports and the Intel/20XX modules. If you are using one of these modules and have trouble with Arm Performance Reports, you might want to test alternative modules. In some cases, loading the Arm Performance Reports module after the Intel module might fix the issue.
What if the job timed out?
The job has to finish within the allocated time for the report to be generated. So if the job times out, there is a risk that no report is generated.
If you run a job that always times out by design (in other words, the job never terminates itself but is terminated by Slurm), there is a workaround, provided you are running the profile on Fram on no more than 64 cores.
As an example, suppose we profile the following job:
# ...
#SBATCH --time=0-01:00:00
# ...
module load Arm-PerfReports/20.0.3
echo "set sysroot /" > gdbfile
export ALLINEA_DEBUGGER_USER_FILE=gdbfile
perf-report srun ./myexample.x # <- we will need to change this
Let’s imagine the above example code (./myexample.x) always times out, and we expect it to time out after 1 hour (see --time above). In this case we get no report.
To get a report on Fram, we can do this instead:
# ...
#SBATCH --time=0-01:00:00
# ...
module load Arm-MAP/20.0.3 # <- we changed this line
echo "set sysroot /" > gdbfile
export ALLINEA_DEBUGGER_USER_FILE=gdbfile
# we changed this line and tell map to stop the code after 3500 seconds
map --profile --stop-after=3500 srun ./myexample.x
We told map to stop the code after 3500 seconds, which leaves some time before the one-hour limit to generate a map file. This run will generate a file with a .map suffix. From this file we can generate the performance reports on the command line (no need to run this as a batch script):
$ module load Arm-PerfReports/20.0.3
$ perf-report prefix.map # <- replace "prefix.map" with the actual map file
The latter command generates the .txt and .html reports.
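The 3500 seconds used above were chosen by hand: the 3600-second time limit minus a margin of 100 seconds. The following sketch derives the value from a time limit written in the D-HH:MM:SS form used in the example. The function name `stop_after_seconds` and the size of the margin are assumptions made for illustration; how much margin map really needs depends on the job.

```shell
# Hypothetical helper (not part of Slurm or Arm MAP): convert a time
# limit of the form D-HH:MM:SS into seconds and subtract a margin.
stop_after_seconds() {
    limit="$1"    # e.g. 0-01:00:00
    margin="$2"   # seconds to leave for writing the .map file
    # awk splits on '-' and ':' and treats the fields as decimal numbers,
    # so leading zeros such as "08" are handled correctly.
    echo "$limit" | awk -F'[-:]' -v m="$margin" \
        '{ print (($1 * 24 + $2) * 3600 + $3 * 60 + $4) - m }'
}

# Usage in the job script (gives 3500 for the example above):
#   map --profile --stop-after="$(stop_after_seconds 0-01:00:00 100)" srun ./myexample.x
```

Keeping the time limit and the margin in one place avoids forgetting to update the map line when the --time directive changes.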