How to check the performance and scaling
Arm Performance Reports
Arm Performance Reports is a simple-to-use performance evaluation tool that produces a clear, single-file report giving a high-level overview of an application's performance characteristics.
It can report CPU time spent on various types of instructions (e.g., floating-point), communication time (MPI), multi-threading level and thread synchronization overheads, memory bandwidth, and IO performance. Such a report can help spot certain bottlenecks in the code and highlight potential optimization directions, and can also suggest simple changes in how the code is executed to better utilize the resources. Some typical examples of the suggestions are:
The CPU performance appears well-optimized for numerical computation. The biggest gains may now come from running at larger scales.
Significant time is spent on memory accesses. Use a profiler to identify time-consuming loops and check their cache performance.
To profile a statically linked binary, you need to recompile
You can use Arm Performance Reports on dynamically linked binaries without recompilation. Statically linked binaries, however, may have to be recompiled; consult the official documentation for details.
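If it is unclear whether an existing executable is dynamically or statically linked, one rough check is to look for a PT_INTERP program header in the ELF file. Dynamically linked executables normally carry one, because it names the dynamic loader. The following is a minimal sketch under that assumption, not part of Arm Performance Reports; the function name `has_interp_segment` is hypothetical, and the result is only a heuristic for ordinary Linux executables.

```python
import struct

PT_INTERP = 3  # program header type that names the dynamic loader


def has_interp_segment(data: bytes) -> bool:
    """Return True if the ELF image in `data` has a PT_INTERP program header.

    Heuristic: True suggests a dynamically linked executable,
    False suggests a statically linked one.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"    # EI_DATA: 1 = little, 2 = big endian
    if is_64bit:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 0x20)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x36)
    else:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 0x1C)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x2A)
    # p_type is the first 4-byte field of every program header entry.
    for i in range(e_phnum):
        (p_type,) = struct.unpack_from(endian + "I", data, e_phoff + i * e_phentsize)
        if p_type == PT_INTERP:
            return True
    return False
```

For example, `has_interp_segment(open("./my_app", "rb").read())` would inspect a hypothetical executable `./my_app`; a result of `False` hints that the binary is statically linked and may need the recompilation step described in the official documentation.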
A successful run will produce two files
A successful Arm Performance Reports run will produce two files, an HTML summary and a text file summary, as in this example:
Using Arm Performance Reports on Fram and Saga
Use cases and pitfalls
We demonstrate some pitfalls of profiling, and show how one can use profiling to reason about the performance of real-world codes.