Intel VTune
Profiling using VTune
Intel VTune profiles compiled codes, and is particularly suited to analysing high-performance applications that use threads (OpenMP), MPI, or a combination of the two.
Using VTune is a two-stage process. First, an application is compiled using an appropriate Intel compiler and run in a "collection" phase. The results are stored to file, and may then be inspected interactively via the VTune GUI.
Collection
Compile the application in the normal way, and run a batch job in exclusive mode to ensure the node is not shared with other jobs. An example is given below.
Collection of performance data is controlled by a collect option, which
defines which set of hardware counters is monitored in a given run. As
not all counters are available at the same time, a number of different
collections are provided, and the appropriate one depends on which aspect
of performance is of interest. Some standard options are:
- vtune -collect=performance-snapshot may be used to produce a text
  summary of performance (typically to standard output), which can be used
  as a basis for further investigation.
- vtune -collect=hotspots produces a more detailed analysis which can be
  used to inspect time taken per function and per line of code.
- vtune -collect=hpc-performance may be useful for HPC codes.
- vtune -collect=memory-access will provide figures for memory-related
  measures including application memory bandwidth.
Use vtune --help collect for a full summary of collection options.
Note that not all collection options are available; for GPU codes, for
example, NVIDIA profiling tools are preferred.
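As a minimal sketch of how several collections might be run one after another, the helper below builds one vtune command line per collection, each with its own results directory. The function name vtune_cmd, the results-<collection> naming scheme and ./my_application are illustrative assumptions, and only the -collect and -r options described in this section are used. The commands are printed rather than executed, so they can be checked before being placed in a batch script.

```shell
# Hypothetical helper: print the vtune command line for one collection.
# $1 is the collection name, $2 is the application to profile.
vtune_cmd() {
    echo "vtune -collect=$1 -r results-$1 $2"
}

# Print (but do not run) one command per collection of interest.
for collection in performance-snapshot hpc-performance memory-access; do
    vtune_cmd "${collection}" ./my_application
done
```

Because not all counters are available at the same time, each collection needs its own run of the application, which is why the sketch produces a separate command and results directory for each.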
Example SLURM script
Here we give an example of profiling an application which has been
compiled with Intel 20.4 and requests the memory-access collection. We
assume the application involves OpenMP threads, but no MPI.
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --partition=standard
#SBATCH --qos=standard
export OMP_NUM_THREADS=18
# Load the Intel modules matching those used at compile time
module load intel-20.4/compilers
module load intel-20.4/vtune
vtune -collect=memory-access -r results-memory ./my_application
Profiling will generate a certain amount of additional text information;
this appears on standard output. Detailed profiling data will be stored
in various files in a sub-directory, the name of which can be specified
using the -r option.
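One way to keep the results of separate runs apart is to include the job ID in the directory name passed to -r. The sketch below assumes this is desirable and uses the SLURM_JOB_ID environment variable, which SLURM sets inside a batch job. The helper name result_dir and the fallback label "interactive" are illustrative choices, not part of VTune or SLURM.

```shell
# Hypothetical helper: build a results directory name that is distinct for
# each batch job. SLURM sets SLURM_JOB_ID inside a job; outside the batch
# system the arbitrary label "interactive" is used instead.
result_dir() {
    echo "results-$1-${SLURM_JOB_ID:-interactive}"
}

# Print the collection command that would replace the fixed directory name.
echo "vtune -collect=memory-access -r $(result_dir memory) ./my_application"
```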
Notes
- Older Intel compilers use amplxe-cl instead of vtune as the command for collection. Some existing features, such as the prefix on warning messages, still reflect this older name. Older versions do not offer the "performance-snapshot" collection option.
- Extra time should be allowed in the wall clock time limit to allow for processing of the profiling data by vtune at the end of the run. In general, a short run of the application (a few minutes at most) should be tried first.
- A warning may be issued:
  amplxe: Warning: Access to /proc/kallsyms file is limited. Consider changing /proc/sys/kernel/kptr_restrict to 0 to enable resolution of OS kernel and kernel modules symbols.
  This may be safely ignored.
- A warning may be issued:
  amplxe: Warning: The specified data limit of 500 MB is reached. Data collection is stopped.
  amplxe: Collection detached.
  This can be safely ignored, as a working result will still be obtained. It is possible to increase the limit via the -data-limit option (500 MB is the default). However, larger data files can take an extremely long time to process in the report stage at the end of the run, and so the option is not recommended.
- For Intel 20.4, the -collect=hotspots option has been observed to be problematic. We suggest it is not used.
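Since the collection command is named vtune in newer releases and amplxe-cl in older ones, a script intended to work with either might select whichever is present. The following is a minimal sketch under that assumption; the function name find_vtune_cmd and the variable VTUNE_CMD are hypothetical, and the check relies only on the standard shell builtin command -v.

```shell
# Print the name of the first VTune collection command found on PATH:
# "vtune" for newer releases, "amplxe-cl" for older ones.
# Returns non-zero if neither is available (e.g. no module is loaded).
find_vtune_cmd() {
    for cmd in vtune amplxe-cl; do
        if command -v "${cmd}" >/dev/null 2>&1; then
            echo "${cmd}"
            return 0
        fi
    done
    return 1
}

if VTUNE_CMD=$(find_vtune_cmd); then
    echo "Using ${VTUNE_CMD} for collection"
else
    echo "No VTune command found; load the relevant module first" >&2
fi
```

The newer name is tried first, so an environment that provides both commands would use vtune.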
Profiling an MPI code
Intel VTune can also be used to profile MPI codes. It is recommended
that the relevant Intel MPI module is used for compilation. The
following example uses Intel 18 with the older amplxe-cl command:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --partition=standard
#SBATCH --qos=standard
export OMP_NUM_THREADS=18
module load intel-mpi-18
module load intel-compilers-18
module load intel-vtune-18
mpirun -np 4 -ppn 2 amplxe-cl -collect hotspots -r vtune-hotspots \
./my_application
Note that the Intel MPI launcher mpirun is used, and this precedes the
VTune command. The example runs a total of 4 MPI tasks (-np 4) with
two tasks per node (-ppn 2). Each task runs 18 OpenMP threads.
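The relationship between the counts in this example can be checked with a little shell arithmetic. The sketch below derives the total number of MPI tasks from the node count and the tasks per node, together with the number of threads running on each node. The variable names NODES, TASKS_PER_NODE, TOTAL_TASKS and THREADS_PER_NODE are illustrative; the values are those of the example script.

```shell
# Values taken from the example batch script.
NODES=2              # --nodes=2
TASKS_PER_NODE=2     # -ppn 2
OMP_NUM_THREADS=18   # OpenMP threads per MPI task

# Total MPI tasks (the value given to -np) and threads running on each node.
TOTAL_TASKS=$((NODES * TASKS_PER_NODE))
THREADS_PER_NODE=$((TASKS_PER_NODE * OMP_NUM_THREADS))

echo "mpirun -np ${TOTAL_TASKS} -ppn ${TASKS_PER_NODE}"
echo "threads per node: ${THREADS_PER_NODE}"
```

With these values the sketch gives 4 tasks in total and 36 threads per node, so the combination of tasks and threads should be chosen to fit the cores available on each node.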
Viewing the results
We recommend using the latest version of the VTune GUI to view results; it can be run interactively given an appropriate X connection. The latest version is available via:
$ module load oneapi
$ module load vtune/latest
$ vtune-gui
From the GUI, navigate to the appropriate results file to load the analysis. Note that the latest version of VTune is able to read results collected with the VTune releases that accompany previous versions of the Intel compilers.