Intel VTune
Profiling using VTune
Intel VTune allows profiling of compiled codes, and is particularly suited to analysing high performance applications involving threads (OpenMP), and MPI (or some combination thereof).
Using VTune is a two-stage process. First, an application is compiled using an appropriate Intel compiler and run in a "collection" phase. The results are stored to file, and may then be inspected interactively via the VTune GUI.
Collection
Compile the application in the normal way, and run a batch job in exclusive mode to ensure the node is not shared with other jobs. An example is given below.
Collection of performance data is based on a collect
option, which
defines which set of hardware counters are monitered in a given run. As
not all counters are available at the same time, a number of different
collections are available. A different one may be relevant if interested
in different aspects of performance. Some standard options are:
vtune -collect=performance-snapshot
may be used to product a text
summary of performance (typically to standard output), which can be used
as a basis for further investigation.
vtune -collect=hotspots
produces a more detailed analysis which can be
used to inspect time taken per function and per line of code.
vtune -collect=hpc-performance
may be useful for HPC codes.
vtune --collect=meory-access
will provide figures for memory-related
measures including application memory bandwidth.
Use vtune --help collect
for a full summary of collection options.
Note that not all options are available (e.g., prefer NVIDIA profiling
for GPU codes).
Example SLURM script
Here we give an example of profiling an application which has been
compiled with Intel 20.4 and requests the memory-access
collection. We
assume the application involves OpenMP threads, but no MPI.
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --partition=standard
#SBATCH --qos=standard
export OMP_NUM_THREADS=18
# Load relevant (cf. compile-time) Intel options
module load intel-20.4/compilers
module load intel-20.4/vtune
vtune -collect=memory-access -r results-memory ./my_application
Profiling will generate a certain amount of additional text information;
this appears on standard output. Detailed profiling data will be stored
in various files in a sub-directory, the name of which can be specified
using the -r
option.
Notes
-
Older Intel compilers use
amplxe-cl
instead ofvtune
as the command for collection. Some existing features still reflect this older name. Older versions do not offer the "performance-snapshot" collection option. -
Extra time should be allowed in the wall clock time limit to allow for processing of the profiling data by
vtune
at the end of the run. In general, a short run of the application (a few minutes at most) should be tried first. -
A warning may be issued:
amplxe: Warning: Access to /proc/kallsyms file is limited. Consider changing /proc/sys/kernel/kptr_restrict to 0 to enable resolution of OS kernel and kernel modules symbols.
This may be safely ignored.
-
A warning may be issued:
amplxe: Warning: The specified data limit of 500 MB is reached. Data collection is stopped. amplxe: Collection detached.
This can be safely ignored, as a working result will still be obtained. It is possible to increase the limit via the
-data-limit
option (500 MB is the default). However, larger data files can take an extremely long time to process in the report stage at the end of the run, and so the option is not recommended. -
For Intel 20.4, the
--collect=hostspots
option has been observed to be problematic. We suggest it is not used.
Profiling an MPI code
Intel VTune can also be used to profile MPI codes. It is recommended
that the relavant Intel MPI module is used for compilation. The
following example uses Intel 18 with the older amplxe-cl
command:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --partition=standard
#SBATCH --qos=standard
export OMP_NUM_THREADS=18
module load intel-mpi-18
module load intel-compilers-18
module load intel-vtune-18
mpirun -np 4 -ppn 2 amplxe-cl -collect hotspots -r vtune-hotspots \
./my_application
Note that the Intel MPI launcher mpirun
is used, and this precedes the
VTune command. The example runs a total of 4 MPI tasks (-np 4
) with
two tasks per node (-ppn 2
). Each task runs 18 OpenMP threads.
Viewing the results
We recommend that the latest version of the VTune GUI is used to view results; this can be run interactively with an appropriate X connection. The latest version is available via
$ module load oneapi
$ module load vtune/latest
$ vtune-gui
From the GUI, navigate to the appropriate results file to load the analysis. Note that the latest version of VTune will be able to read results generated with previous versions of the Intel compilers.