Profiling using PerfTools
In this section, we discuss the HPE Cray PerfTools performance
measurement and analysis tool, accessed via various perftools modules.
These tools are also referred to as CrayPat-lite or CrayPat (from an
earlier name Performance Analysis Toolkit).
PerfTools can be used with compiled programs (typically C/C++ or Fortran), interpreted python, and for executables where the source code is not available. The PerfTools modules admit a very general and flexible approach to profiling, depending on the level of user experience. We start with the least invasive approach, and then describe methods requiring increasing levels of user intervention.
Generating a profile with pat_run
For an existing executable, or for a python script, a simple profile
can be generated with minimal intervention using pat_run. This also
works for executables for which the source code is not available,
provided they are dynamically linked.
Existing executable
- Make sure the perftools-base module is loaded. This provides the common underlying functionality of PerfTools, and should be present by default.
- Run the application in the usual way, but insert pat_run just before the executable name:
  srun --ntasks=36 pat_run ./myapp.x
- At the start of execution, a new directory will appear in the current working directory which contains the profiling data files. The default directory name is that of the executable followed by a unique series of numbers for each profiling instance, e.g., myapp.x+2048522-1010655545s.
- To produce a summary report of the profiling data, run pat_report with the new directory name as the argument, e.g.,
  pat_report myapp.x+2048522-1010655545s
The typical output will provide a summary, and a series of tables which are discussed in more detail in the following sections.
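The experiment directory from the most recent run can be located by name rather than copied by hand. The following is a minimal sketch, assuming the default naming described above (the executable name followed by + and a string of numbers). The function name newest_experiment is hypothetical and is not part of PerfTools.

```shell
# Print the most recently modified experiment directory for an executable.
# Assumes the default naming <executable>+<numbers>; the function name is
# illustrative only and is not provided by PerfTools.
# Note: the pattern also matches directories from instrumented variants of
# the same executable (names beginning <executable>+...).
newest_experiment() {
    exe=$1
    # ls -dt lists the matching directories newest first; keep the first one
    newest=$(ls -dt -- "${exe}"+*/ 2>/dev/null | head -n 1)
    if [ -z "${newest}" ]; then
        echo "no experiment directory found for ${exe}" >&2
        return 1
    fi
    # Strip the trailing slash added by the directory-only pattern
    printf '%s\n' "${newest%/}"
}
```

It could then be used as: pat_report "$(newest_experiment myapp.x)"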
Typical PerfTools summary output using pat_run
CrayPat/X: Version 25.03.0 Revision 5ef9ac5bb rhel9.5_x86_64 01/22/25 23:09:16
Number of PEs (MPI ranks): 36
Number of Nodes: 1
Numbers of PEs per Node: 36
Numbers of Threads per PE: 1
Number of Cores per Socket: 144
Execution start time: Wed Dec 3 14:28:07 2025
System name and speed: cs-n0000 2.326 GHz (nominal)
AMD Turin CPU Family: 26 Model: 17 Stepping: 0
Core Performance Boost: 288 PEs have CPB capability
...
...
Python
Python profiling is supported when the cray-python module is loaded.
For example,
module load cray-python
srun --ntasks=36 pat_run $(which python3) myapp.py
Here $(which python3) expands to the full path of the python interpreter, which is passed to pat_run.
Again, a summary report can be generated with pat_report, e.g.:
module load cray-python
pat_report myapp.py+2048522-1010655545s
Profiling following re-compilation
To allow PerfTools access to fuller information about a compiled program,
re-compilation is required to allow the introduction of instrumentation.
There are a number of ways to do this using PerfTools. A lightweight
starting point is to use the perftools-lite module.
perftools-lite
- Make sure the perftools-base module is loaded (it should be present by default), and load perftools-lite in addition:
  module load perftools-lite
- Compile your application normally (make sure any existing objects are removed). A message will appear at the link stage indicating that the executable has been instrumented. For example:
  cc -o myapp.x myapp.c
  INFO: creating the PerfTools-instrumented executable 'myapp.x' (lite-samples)
  You must use the compiler wrappers cc, CC or ftn.
- Run the new executable by submitting a job in the usual way. There are no special additions related to profiling once the executable exists. E.g.,
  srun --ntasks=36 ./myapp.x
- At the start of execution, a new directory will be created to hold the profiling data files. The directory name is based on the executable name and a unique string of numbers for each profiling run, e.g., myapp.x+1607079-1010655545s. Note the s at the end, indicating this was a sampling exercise.
- When the job finishes executing, a summary profile report will be directed to standard output (i.e., at the end of the job's output file).
The perftools report is structured as a series of tables which are designed to be self-explanatory.
Typical PerfTools sampling output via perftools-lite
...
Summary output omitted
...
Table 1: Sample Profile by Function
Samp% | Samp | Imb. | Imb. | Group
| | Samp | Samp% | Function
| | | | PE=HIDE
100.0% | 6,113.2 | -- | -- | Total
|-----------------------------------------------------------------
| 92.5% | 5,654.4 | -- | -- | USER
||----------------------------------------------------------------
|| 22.4% | 1,371.3 | 86.7 | 6.1% | function_a
|| 11.0% | 674.6 | 61.4 | 8.6% | function_b
|| 8.9% | 545.9 | 59.1 | 10.0% | function_c
|| 8.2% | 502.8 | 26.2 | 5.1% | function_d
|| 7.8% | 474.5 | 18.5 | 3.9% | function_e
||================================================================
| 3.9% | 235.7 | -- | -- | MPI
||----------------------------------------------------------------
|| 3.6% | 222.8 | 213.2 | 50.3% | MPI_Waitall
||================================================================
...
...
For perftools-lite the default profiling is a sampling exercise,
where a statistical picture of performance is obtained based on the
proportion of samples taken in different parts of the program.
The report should include the default sampling interval:
Sampling interval: 10000 microsecs
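Because each sample stands for one sampling interval, a sample count can be converted into an approximate time. A rough sketch of the arithmetic follows, using figures from Table 1 above and the default interval. The function name is hypothetical, it assumes the Samp column is an average per PE, and the result is only an estimate because sampling is statistical.

```shell
# Convert a sample count to approximate seconds: samples * interval / 1e6.
# The interval is in microseconds and defaults to 10000, as in the report.
# The function name is illustrative only.
samples_to_seconds() {
    awk -v n="$1" -v us="${2:-10000}" \
        'BEGIN { printf "%.1f\n", n * us / 1000000 }'
}

samples_to_seconds 6113.2    # Total:      prints 61.1
samples_to_seconds 1371.3    # function_a: prints 13.7
```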
perftools-lite-events
In the bare perftools-lite sampling approach, a statistical picture
of performance is obtained. If more detailed information is required,
an event-based approach can be employed. This is typically based on
the time-stamp of events such as the entry to and exit from a
particular function. This comes at the cost of higher overhead in time
taken and the size of the report files generated.
To prepare an executable for event profiling, follow the same process as for sampling, that is:
- Make sure the perftools-base module is loaded and load the perftools-lite-events module:
  module load perftools-lite-events
- Compile your application normally. For example:
  ftn -o myapp.x myapp.f90
  INFO: creating the PerfTools-instrumented executable 'myapp.x' (lite-events)
- Run the new executable by submitting a job in the usual way.
- Analyse the data. Again, a summary will appear at the end of execution in the standard output. As this is now event-based, additional information such as the exact number of calls to a given function can be presented.
Example PerfTools event output via perftools-lite-events
...
Summary output omitted
...
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls | Group
| | Time | Time% | | Function=[MAX10]
| | | | | PE=HIDE
| | | | | Thread=HIDE
100.0% | 15.801060 | -- | -- | 10,958,605.4 | Total
|-----------------------------------------------------------------------------
| 89.9% | 14.203748 | -- | -- | 10,932,614.3 | USER
||----------------------------------------------------------------------------
|| 13.9% | 2.191225 | 0.094256 | 4.3% | 1,621,433.3 | function_a
|| 11.3% | 1.781757 | 0.029626 | 1.7% | 100.0 | function_b.LOOP@li.164
|| 8.3% | 1.311064 | 0.045393 | 3.5% | 1,621,433.3 | function_c
|| 6.7% | 1.054684 | 0.050999 | 4.8% | 100.0 | function_d.LOOP@li.264
||============================================================================
| 5.4% | 0.856573 | -- | -- | 23,434.0 | MPI
||----------------------------------------------------------------------------
|| 5.0% | 0.792062 | 0.297903 | 28.5% | 313.0 | MPI_Waitall
||============================================================================
| 2.5% | 0.388409 | -- | -- | 2,030.0 | OMP
|=============================================================================
Profiling with events can generate large amounts of data, so it is best to start with a small problem size of short duration. Additional measures to reduce the overhead of profiling by targeting specific parts of a program are also discussed below.
Viewing profiling data
Using pat_report
The default reports produced by perftools-lite and perftools-lite-events
give information on a relatively small number of the most significant
routines in the instrumented program in terms of samples, or time taken.
The pat_report
utility can be used to interrogate the profiling data to give additional
information, particularly when event tracing is used.
The report format can be controlled with the -O flag to pat_report.
A number of examples are:
-O calltree: Show top-down call tree with inclusive times (or samples).
-O callers: Show the calls leading to the routines that have a high use in the report (bottom-up).
-O callers+src: Append the relevant source code line numbers in the callers list.
-O load_balance: Show load-balance statistics for the high-use routines in the program. Parallel processes with minimum, maximum and median times for routines will be displayed. Only available with event profiling.
-O mpi_callers: Show MPI message statistics. Only available with event profiling.
Other pat_report options include:
-T: Set threshold for reporting to zero; this will show all functions called by the program.
-v: Give verbose information and suggestions in the Table notes.
See man pat_report for further information.
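Several report formats are often wanted from the same data. A minimal sketch of a helper that writes one text file per format is given below. The function name and output file names are hypothetical; the helper relies only on the -O option described above and on shell redirection of the report, which pat_report writes to standard output.

```shell
# Write one pat_report text file per report format for a given experiment
# directory. The function name and file naming are illustrative only.
make_reports() {
    expdir=$1
    for opt in calltree callers callers+src; do
        # The report goes to standard output, so redirect it to a file
        pat_report -O "${opt}" "${expdir}" > "${expdir}.${opt}.txt" || return 1
    done
}
```

It could be called as: make_reports myapp.x+2048522-1010655545s. For event-based data, load_balance and mpi_callers could be added to the list of formats.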
Using the Apprentice GUI
A graphical user interface to PerfTools results is provided by Apprentice, for which a suitable X-windows connection will be required.
Apprentice is invoked with, e.g.,
module load perftools-base
app3 myapp.x+606388-1010655545t
where myapp.x+606388-1010655545t is the relevant profiling directory.
The text report (cf. pat_report) or various graphical representations can
be explored.
General perftools instrumentation
The perftools-lite and perftools-lite-events modules provide a simple
way to generate sampling and event-based profiles, respectively. However,
for a large production run, event-based profiling might come with an unduly
large overhead. In this situation, it would be desirable to be able to
combine the low overhead of the sampling approach with the detail generated
by the event-based profile. This can be done using the general perftools
module.
Sampling via pat_build
- Ensure the perftools-base module is loaded, and load the perftools module:
  module load perftools
- Compile or re-compile your code using the compiler wrappers (cc, CC or ftn). Object files (or libraries) need to be made available to PerfTools to be able to build an instrumented executable for profiling. This may mean that the compile and link stages need to be separated, e.g.,
  cc -c myapp.c
  cc -o myapp.x myapp.o
- To instrument the binary, use the pat_build command. This will generate a new executable with +pat appended, e.g.:
  pat_build myapp.x
  will generate a new executable myapp.x+pat (it will leave the original unchanged).
- Run the new executable with the +pat extension to generate a sampling result. This will produce a new directory with the raw sampling results, e.g., myapp.x+pat+540878-1010655545s.
- At this point the sampling results directory will contain a single subdirectory with the raw results (typically xf-files). Use pat_report to generate a report in the usual way. This will also create a new file in the results directory called build-options.apa.
Targeted event profiling
What we can now do is to use the sampling information produced by the
program with the +pat extension to generate an event based profile
which consists of only those routines identified in the sampling as
significant. This reduces the overhead of the event profiling.
- Generate a further executable using pat_build from the build options file produced at the previous sampling report stage, e.g.:
  pat_build -O myapp.x+pat+540878-1010655545s/build-options.apa
  The new executable will have the extension +apa, i.e., myapp.x+apa (for automatic program analysis).
- Run the new executable with the +apa extension to produce a new event-based profile. This will create a new results directory, e.g., myapp.x+apa+933004-1010655545t. The t at the end of the directory name indicates this is a trace, or event-based profile.
- A report on the new event-based profile can now be generated in the usual way using pat_report.
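A common slip in this workflow is to run pat_build -O before pat_report has been run on the sampling results, so that build-options.apa does not yet exist. The following sketch guards against that. The function name build_from_apa is hypothetical, and the check on the final character assumes the directory naming described in this section (s for sampling, t for trace).

```shell
# Build the +apa executable from the build-options.apa file found in a
# sampling experiment directory. The function name is illustrative only.
build_from_apa() {
    expdir=${1%/}
    # Only sampling experiments (names ending in s) provide the APA file
    case "${expdir}" in
        *s) ;;
        *)  echo "${expdir} does not look like a sampling experiment" >&2
            return 1 ;;
    esac
    apa="${expdir}/build-options.apa"
    if [ ! -f "${apa}" ]; then
        echo "${apa} not found: run pat_report on ${expdir} first" >&2
        return 1
    fi
    pat_build -O "${apa}"
}
```

It could be called as: build_from_apa myapp.x+pat+540878-1010655545s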
Manual event specification
If the automatic program analysis does not produce the correct events,
pat_build can be used to specify explicitly the set to be collected.
There are a number of options to do this, e.g.:
- pat_build -w myapp.x selects all functions/events.
- pat_build -w -g mpi myapp.x selects a group of functions (here MPI calls). Other groups include libsci, lapack, omp (for OpenMP runtime API functions).
- pat_build -T function1 myapp.x selects a named function (this is the mangled name for Fortran and C++).
See the manual page for pat_build for further details.
Hardware counter groups
Profiling will collect a default set of hardware counters which will appear in the report, e.g.,
PAPI_TOT_CYC 2,664,774,564,580 # Total number of CPU cycles
PAPI_TOT_INS 5,310,155,653,321 # Instructions completed
PAPI_TLB_DM 125,957,932 # Translation lookaside buffer misses
PAPI_FP_OPS 3,437,814,054,580 # Floating point operations
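A different group of counters can be selected at run time by setting the environment variable PAT_RT_PERFCTR before running the instrumented executable, e.g.,
export PAT_RT_PERFCTR=fp_stats
The individual PAPI counters available on the system can be listed with
papi_avail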
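Counter values are often easier to interpret as ratios. As a rough sketch, instructions per cycle can be computed from the PAPI_TOT_INS and PAPI_TOT_CYC values in the example report above. The function name is hypothetical and the ratio is not something the source report is stated to print.

```shell
# Instructions per cycle from two counter values as printed in the report.
# Thousands separators (commas) are stripped before the division.
# The function name is illustrative only.
instructions_per_cycle() {
    ins=$(printf '%s' "$1" | tr -d ',')
    cyc=$(printf '%s' "$2" | tr -d ',')
    awk -v i="${ins}" -v c="${cyc}" 'BEGIN { printf "%.2f\n", i / c }'
}

# PAPI_TOT_INS and PAPI_TOT_CYC from the example report
instructions_per_cycle 5,310,155,653,321 2,664,774,564,580    # prints 1.99
```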
Further information
Additional information is available interactively via pat_help, and via
man pages for pat_report, pat_build etc.