Performance Analysis and Tuning

Understanding application performance on modern HPC architectures is a complex task. Several factors can limit performance: I/O speed, CPU speed, memory latency and bandwidth, thread binding and memory allocation on NUMA architectures, and communication cost, both in threaded shared-memory applications and in MPI-based codes.

Sometimes performance can be improved without recompiling the code, e.g., by arranging the worker threads or MPI ranks more efficiently, or by using more or fewer CPU cores. In other cases it may be necessary to investigate the hardware performance counters in depth and rewrite (parts of) the code. Either way, specialized tools make it simpler to identify the bottlenecks and decide what needs to be done. Here we describe some of the tools available on Fram.
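Before reaching for a profiler, a quick way to judge whether a job benefits from more or fewer CPU cores is to compare wall-clock times at a few core counts. The following is a minimal sketch of that arithmetic, not a tool provided on Fram: the function name parallel_efficiency and the timings are hypothetical and serve only as an illustration.

```python
def parallel_efficiency(timings):
    """Return {cores: (speedup, efficiency)} relative to the smallest core count.

    timings maps a core count to the measured wall-clock time in seconds.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    result = {}
    for cores in sorted(timings):
        # Speedup: how much faster than the baseline run.
        speedup = base_time / timings[cores]
        # Efficiency: speedup divided by the relative increase in cores.
        efficiency = speedup * base_cores / cores
        result[cores] = (speedup, efficiency)
    return result


# Hypothetical wall-clock times in seconds (illustrative, not measured on Fram).
example_timings = {1: 100.0, 2: 50.0, 4: 40.0}

for cores, (speedup, eff) in parallel_efficiency(example_timings).items():
    print(f"{cores:3d} cores: speedup {speedup:.2f}, efficiency {eff:.2f}")
```

An efficiency well below 1 at higher core counts suggests that the extra cores are mostly waiting (on memory, communication, or I/O), which is the point at which the tools described here become useful.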