Profiling

Examples

Profiling on a Cuda device

Compile Parthenon with debug symbols (currently the default when not specifically specifying a build type in cmake)
Build the connector (which allows named regions and kernel names to be properly shown in the profiler rather than the long names deduced from the templates)

git clone https://github.com/kokkos/kokkos-tools.git
cd kokkos-tools/profiling/nvprof-connector
export CUDA_ROOT=/path/to/cuda
make

Now there should be a kp_nvprof_connector.so file.

Collect the data (using Nsight Systems as nvprof and nvvp are deprecated)

# "enable" the Kokkos profiling tool
export KOKKOS_PROFILE_LIBRARY=/path/to/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so
# (for now) make sure to just use a single thread (for a build with OpenMP host space)
export OMP_NUM_THREADS=1

# collect the actual data
nsys profile --delay=60 --duration=10 -o my_profile /path/to/example/advection/advection-example -i ${PARTHENON_ROOT}/tst/regression/test_suites/advection_performance/parthinput.advection_performance parthenon/mesh/nx1=256 parthenon/mesh/nx2=256  parthenon/mesh/nx3=256 parthenon/meshblock/nx1=32 parthenon/meshblock/nx2=32 parthenon/meshblock/nx3=32  parthenon/time/nlim=1000

A sample output may look like

-----------------------------------------------------------
KokkosP: NVTX Analyzer Connector (sequence is 0, version: 20171029)
-----------------------------------------------------------

Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=1.1718750000000000e-03 zone-cycles/wsec = 0.00e+00
cycle=1 time=1.1718750000000000e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 1.22e+06
cycle=2 time=2.3437499999999999e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 4.91e+18
cycle=3 time=3.5156249999999997e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 4.47e+06
cycle=4 time=4.6874999999999998e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 4.47e+06
cycle=5 time=5.8593750000000000e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 4.47e+06
cycle=6 time=7.0312500000000002e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 4.47e+06
cycle=7 time=8.2031250000000003e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 4.47e+06
Collecting data...
cycle=8 time=9.3749999999999997e-03 dt=1.1718750000000000e-03 zone-cycles/wsec = 4.01e+06
cycle=9 time=1.0546874999999999e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 3.78e+06
cycle=10 time=1.1718749999999998e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 3.81e+06
cycle=11 time=1.2890624999999998e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 3.79e+06
cycle=12 time=1.4062499999999997e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 3.80e+06
Processing events...
cycle=13 time=1.5234374999999996e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 2.94e+06
cycle=14 time=1.6406249999999997e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 3.21e+06
Capturing symbol files...
cycle=15 time=1.7578124999999997e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 3.10e+06
cycle=16 time=1.8749999999999996e-02 dt=1.1718750000000000e-03 zone-cycles/wsec = 3.35e+06
Saving temporary "/tmp/nsys-report-7f08-9ec3-9820-3ff4.qdstrm" file to disk...

Sent signal 15 (Terminated) to target application's process group ID = 22342.
Use the kill option to modify this behavior.
Creating final output files...

Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-7f08-9ec3-9820-3ff4.qdrep"
Report file moved to "/path/to/my_profile.qdrep"
-----------------------------------------------------------
KokkosP: Finalization of NVTX Connector. Complete.
-----------------------------------------------------------

Now there should be a my_profile.qdrep file (that should be copied to system with a desktop environment).

Analyze the data

Start the GUI (nsys-ui)
Go to Tools - Options and set "Rename CUDA Kernels by NVTX" to "Yes" to get the Kokkos labels shown on the GUI.
Open the my_profile.qdrep file. The output should look similar to

Note that if the backtraces for hostcode are not shown properly, this is likely related to an old version of the Nsights suite. This got fixed and with the newer versions (>2020.4) it has been confirmed to work for Parthenon.

Profiling with tau

This example is on RZAnsel.

N.b.: this is a work in progress

Prerequisites

It appears that you have to build your own version of tau with spack, so that tau and parthenon are built with the same compilers (compiler families?). This example uses gcc-7.3.1 and spectrum-mpi-2020-08-19:

Add gcc-7.3.1 to your ~/.spack/linux/compilers.yaml:

- compiler:
  spec: gcc@7.3.1
  paths:
    cc: /usr/tce/packages/gcc/gcc-7.3.1/bin/gcc
    cxx: /usr/tce/packages/gcc/gcc-7.3.1/bin/g++
    f77: /usr/tce/packages/gcc/gcc-7.3.1/bin/gfortran
    fc: /usr/tce/packages/gcc/gcc-7.3.1/bin/gfortran
  flags: {}
  operating_system: rhel7
  target: ppc64le
  modules: [gcc/7.3.1]
  environment: {}
  extra_rpaths: []

Set your ~/.spack/packages.yaml to know about spectrum-mpi, and to prefer gcc over clang:

packages:
  all:
     compiler: [gcc, clang]
# to prefer gcc to clang and only use spectrum mpi.
     providers:
         mpi: [spectrum-mpi]
  spectrum-mpi:
    version: [2020.08.19]
    modules:
     spectrum-mpi@2020.08.19: spectrum-mpi/2020.08.19
    buildable: False

Verify concretized spec: spack spec tau+mpi
and install: spack install tau+mpi

Dynamic instrumentation (`tau_exec`)

Build Parthenon. I don't know if you have to do anything special (perhaps turn on debugging symbols).
Load tau: spack load -r tau+mpi
Use tau_exec to collect profile data, possibly with some environment variables set that control what information is collected
- TAU_TRACK_MESSAGES=1, TAU_COMM_MATRIX=1 -- seem to be needed to collect MPI communication info to create the communication matrix
- TAU_PROFILE=1 -- tell tau to collect profiling info, results in a bunch of profile.<n>.<m>.<o> files (on by default?)
- TAU_TRACE=1 -- tell tau to collect traces, results in a bunch of events.<n>.edf and tautrace.<n>.<m>.<o>.trc files
- example invocation:
```
jsrun -p 2 -g 1 -c 2 -M "-gpu" tau_exec ../../bld/example/advection/advection-example -i ../../example/advection/parthinput.advection parthenon/time/nlim=10 parthenon/mesh/nx1=256 parthenon/mesh/nx2=128 parthenon/mesh/nx3=64 parthenon/meshblock/nx1=32 parthenon/meshblock/nx2=32 parthenon/meshblock/nx3=32 --kokkos-num-devices=1
```
The profile.* files can be read with pprof (prints out a text file) or paraprof (visual profiler):
- pprof > my_profile.txt
- paraprof, then use the File->load option to open the files. If run in the same dir as the profile.* files, it loaded them automatically for me; if not, be sure to select the file type as 'Tau profiles' so that all are read.
The *.trc and *.edf files can be visualized with a number of tools, but usually need to be combined and then converted to a different file format.
- to combine them, run tau_treemerge.pl in the directory containing all those files (for one run); this gives you the combined files as tau.edf and tau.trc
- to convert them to 'slog2' format so that jumpshot can open them, use tau2slog2: tau2slog2 tau.trc tau.edf -o tau.slog2 (n.b. this produced an error when I ran it, but also produced a sizable output file, so...)

Profiling with hpctoolkit

It appears that hpctoolkit may be an easy way to collect hardware counter data (e.g. for manual roofline model construction, etc.)

THIS IS A DRAFT, TO BE UPDATED/COMPLETED LATER

The actually helpful User Guide: HPCToolkit-users-manual.pdf

on rzAnsel:

module load hpctoolkit
lalloc 1 -W 200
hpcrun -L $MYWS/repos/parthenon/build-buffer-in-one/example/advection/advection-example ##Gives available counters, not just PAPI, but not all profilable
papi_event_chooser PAPI_TOT_INS PAPI_TOT_CYC PAPI_L3_DCR PAPI_FP_OPS ##make sure these papi events can be collected at the same time
lrun -n 2 -g 1 -c 1 -M "-gpu" hpcrun -e PAPI_TOT_INS -e PAPI_TOT_CYC -e PAPI_L3_DCR -e PAPI_FP_OPS $MYWS/repos/parthenon/build-buffer-in-one/example/advection/advection-example -i $MYWS/repos/parthenon/runs/parthinput.advection parthenon/time/nlim=10 parthenon/mesh/nx1=512 parthenon/mesh/nx2=256 parthenon/mesh/nx3=256
hpcstruct $MYWS/repos/parthenon/build-buffer-in-one/example/advection/advection-example ##takes a min or so
hpcprof -S advection-example.hpcstruct -I $MYWS/repos/parthenon/example/advection/ hpctoolkit-advection-example-measurements-626723
hpcviewer hpctoolkit-advection-example-database-626723 ##need working x11 windows
# Can do, e.g.:
#  -  select 'flat view'
#     -  navigate down to a function/loop of interest, see source code
#  -  remove all the inclusive '(I)' columns from view
#  -  calculate IPC (instructions per cycle): select 'f(x)' button, follow instructions

Profiling using Kokkos tools

Out of the box usage (space time stack as example)

Load/setup environment modules
Compile Parthenon (compiling with debug symbols is not required)
Build the profiling tool (Here the space-time stack, other tools are available in different directories of the same repo.)

Also note that we're not cloning the kokkos:kokkos-tools repo by a fork where @pgrete merged a couple WIP features in (specifically #113 and #114), lowered the threshold (fraction of time used) for reporting data, and added support to disable allocation tracking.

git clone https://github.com/pgrete/kokkos-tools.git
git checkout develop
cd kokkos-tools/profiling/space-time-stack
make

Now there should be a kp_space_time_stack.so.so file.

Enable profiling tool

export KOKKOS_PROFILE_LIBRARY=/path/to/kokkos-tools/profiling/space-time-stack/kp_space_time_stack.so

Note that using tools that while the profiling interface itself is very lightweight, enabling tools (by setting the KOKKOS_PROFILE_LIBRARY environment variable) is likely to affect performance due to additional fences in the code (that are required to get correct timing results). Therefore, to measure the "actual" performance (e.g., the aggregated zone-cycles/s) a run with no profiling library (unset KOKKOS_PROFILE_LIBRARY) should be used.

Gather performance data

The following command corresponds to our "short" regression test configuration without writing outputs.

/usr/bin/mpiexec -n 4 /PATH/TO/PARTHENON/BUILD/example/advection/advection-example -i /PATH/TO/PARTHENON/ROOT/tst/regression/test_suites/output_hdf5/parthinput.advection parthenon/job/problem_id=advection_3d parthenon/mesh/numlevel=2 parthenon/mesh/nx1=32 parthenon/meshblock/nx1=8 parthenon/mesh/nx2=32 parthenon/meshblock/nx2=8 parthenon/mesh/nx3=32 parthenon/meshblock/nx3=8 parthenon/time/integrator=rk1 Advection/cfl=0.3 parthenon/output0/dt=-1.0 

Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=4.6874999999999998e-03 zone-cycles/wsec = 0.00e+00
cycle=1 time=4.6874999999999998e-03 dt=4.6874999999999998e-03 zone-cycles/wsec = 7.91e+05
cycle=2 time=9.3749999999999997e-03 dt=4.6874999999999998e-03 zone-cycles/wsec = 1.33e+07
cycle=3 time=1.4062499999999999e-02 dt=4.6874999999999998e-03 zone-cycles/wsec = 1.73e+07
...
cycle=214 time=1.0000000000000000e+00 dt=3.1250000000078160e-03 zone-cycles/wsec = 1.16e+07

Driver completed.
time=1.00e+00 cycle=214
tlim=1.00e+00 nlim=-1

Number of MeshBlocks = 204; 588  created, 448 destroyed during this simulation.

walltime used = 2.00e+01
zone-cycles/wallsecond = 1.24e+06

Following the Parthenon output, the processed profiling output is shown:

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 2.05e+01 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
=================== 
|-> 2.00e+01 sec 97.3% 2.3% 0.0% 0.1% 7.89e+03 1 Driver_Main [region]
|   |-> 1.78e+01 sec 86.9% 0.2% 0.1% 0.0% 4.53e+03 214 LoadBalancingAndAdaptiveMeshRefinement [region]
|   |   |-> 1.77e+01 sec 86.0% 0.2% 0.0% 0.0% 4.58e+03 30 RedistributeAndRefineMeshBlocks [region]
|   |   |   |-> 7.64e+00 sec 37.2% 0.2% 13.9% 99.8% 6.55e+03 30 Step 7: Construct new MeshBlockList [region]
|   |   |   |-> 6.91e+00 sec 33.7% 0.1% 45.3% 0.3% 4.36e+03 30 Mesh::Initialize [region]
|   |   |   |   |-> 6.77e+00 sec 33.0% 0.0% 46.3% 99.8% 0.00e+00 1770 Task_ReceiveAndSetBoundariesWithWait [region]
|   |   |   |   |-> 5.37e-02 sec 0.3% 1.0% 5.7% 1.0% 9.05e+04 1800 Task_FillDerived [region]
|   |   |   |   |   |-> 2.72e-02 sec 0.1% 0.7% 4.5% 37.2% 5.96e+04 1800 PostFillDerived [region]
|   |   |   |   |-> 4.98e-02 sec 0.2% 7.1% 3.7% 72.6% 5.07e+05 1770 Task_SendBoundaryBuffers_MeshBlockData [region]
|   |   |   |-> 3.07e+00 sec 14.9% 0.0% 71.2% 100.0% 3.23e+01 30 Step 8: Recv data and unpack [region]
|   |   |-> 1.94e-01 sec 0.9% 0.0% 15.4% 100.0% 0.00e+00 214 UpdateMeshBlockTree [region]
|   |-> 2.10e+00 sec 10.2% 20.4% 1.5% 12.4% 3.58e+04 214 MultiStage_Step [region]
|   |   |-> 4.22e-01 sec 2.1% 0.0% 14.2% 6.3% 0.00e+00 3289 Task_ReceiveBoundaryBuffers_MeshData [region]
|   |   |   |-> 3.95e-01 sec 1.9% 0.0% 10.8% 100.0% 0.00e+00 191929 Task_ReceiveBoundaryBuffers_MeshBlockData [region]
|   |   |-> 2.98e-01 sec 1.4% 0.0% 1.3% 1.2% 0.00e+00 12041 Task_FillDerived [region]
|   |   |   |-> 1.68e-01 sec 0.8% 0.0% 1.8% 28.7% 0.00e+00 12041 PostFillDerived [region]
|   |   |   |   |-> 1.20e-01 sec 0.6% 0.0% 1.5% 100.0% 0.00e+00 12041 advection_package::PostFill [region]
|   |   |   |-> 7.05e-02 sec 0.3% 0.0% 0.4% 54.5% 0.00e+00 12041 PreFillDerived [region]
|   |   |   |   |-> 3.21e-02 sec 0.2% 0.0% 0.8% 100.0% 0.00e+00 12041 advection_package::PreFill [region]
|   |   |   |-> 5.60e-02 sec 0.3% 0.0% 1.0% 49.6% 0.00e+00 12041 FillDerived [region]
|   |   |       |-> 2.82e-02 sec 0.1% 0.0% 0.9% 100.0% 0.00e+00 12041 advection_package::SquareIt [region]
|   |   |-> 2.72e-01 sec 1.3% 30.3% 1.9% 0.3% 3.04e+03 214 Task_SendBoundaryBuffers_MeshData [region]
|   |   |   |-> 1.07e-01 sec 0.5% 8.3% 6.0% 84.5% 5.74e+03 214 Create bndinfo array [region]
|   |   |   |-> 8.06e-02 sec 0.4% 0.0% 5.8% 100.0% 0.00e+00 214 Set complete and/or start sending via MPI [region]
|   |   |   |-> 7.34e-02 sec 0.4% 100.0% 3.3% ------ 214 SendBoundaryBuffers [for]
|   |   |-> 2.41e-01 sec 1.2% 89.8% 1.2% 10.2% 1.50e+05 12041 Task_Advection_CalculateFluxes [region]
|   |   |   |-> 8.18e-02 sec 0.4% 100.0% 1.4% ------ 12041 x3 flux [for]
|   |   |   |-> 7.41e-02 sec 0.4% 100.0% 1.5% ------ 12041 x2 flux [for]
|   |   |   |-> 6.04e-02 sec 0.3% 100.0% 0.5% ------ 12041 x1 flux [for]
|   |   |-> 1.41e-01 sec 0.7% 0.2% 23.9% 25.9% 1.36e+04 428 Task_WeightedSumData [region]
|   |   |   |-> 1.04e-01 sec 0.5% 0.0% 1.9% 100.0% 0.00e+00 428 WeightedSumData [region]
|   |   |-> 1.07e-01 sec 0.5% 89.8% 5.2% 1.5% 2.59e+03 214 Task_SetBoundaries_MeshData [region]
|   |   |   |-> 9.47e-02 sec 0.5% 100.0% 2.1% ------ 214 SetBoundaries [for]
|   |   |-> 9.08e-02 sec 0.4% 0.0% 7.1% 5.5% 0.00e+00 12041 Task_ProlongateBoundaries [region]
|   |   |   |-> 5.64e-02 sec 0.3% 0.0% 7.1% 70.5% 0.00e+00 12041 RestrictBoundaries [region]
|   |   |   |-> 2.84e-02 sec 0.1% 0.0% 8.6% 37.9% 0.00e+00 12041 ProlongateBoundaries [region]
|   |   |-> 7.05e-02 sec 0.3% 0.0% 5.9% 100.0% 0.00e+00 214 FluxDivergenceMesh [region]
|   |   |-> 5.86e-02 sec 0.3% 0.0% 5.7% 100.0% 0.00e+00 12041 Task_StartReceiving [region]
|   |   |-> 3.34e-02 sec 0.2% 70.4% 0.3% 29.6% 3.61e+05 12041 Task_Tag_Block [region]
|   |   |   |-> 2.35e-02 sec 0.1% 100.0% 0.9% ------ 12041 advection check refinement [reduce]
|   |   |-> 3.18e-02 sec 0.2% 0.0% 13.2% 97.2% 0.00e+00 20526 Task_ReceiveFluxCorrection [region]
|   |   |-> 3.08e-02 sec 0.1% 0.0% 4.1% 100.0% 0.00e+00 12041 Task_ClearBoundary [region]
|   |   |-> 2.13e-02 sec 0.1% 4.9% 8.4% 86.7% 2.76e+05 12041 Task_SendFluxCorrection [region]
|-> 4.25e-01 sec 2.1% 0.3% 6.6% 0.8% 1.04e+04 1 Mesh::Initialize [region]
|   |-> 3.87e-01 sec 1.9% 0.2% 0.8% 0.0% 9.05e+03 2 LoadBalancingAndAdaptiveMeshRefinement [region]
|   |   |-> 3.83e-01 sec 1.9% 0.2% 0.1% 0.0% 9.13e+03 1 RedistributeAndRefineMeshBlocks [region]
|   |   |   |-> 3.55e-01 sec 1.7% 0.2% 7.2% 99.8% 8.00e+03 1 Step 7: Construct new MeshBlockList [region]
|   |   |   |-> 2.84e-02 sec 0.1% 0.2% 137.7% 1.1% 2.33e+04 1 Mesh::Initialize [region]
|   |   |   |   |-> 2.61e-02 sec 0.1% 0.0% 149.5% 99.5% 0.00e+00 30 Task_ReceiveAndSetBoundariesWithWait [region]

BOTTOM-UP TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <number of calls> <name> [type]
=================== 
|-> 3.19e+01 sec 38.8% 0.0% 0.0% 0.0% 0.00e+00 31 Step 7: Construct new MeshBlockList [region]
|   |-> 3.19e+01 sec 38.8% 0.0% 0.0% 0.0% 0.00e+00 31 RedistributeAndRefineMeshBlocks [region]
|       |-> 3.19e+01 sec 38.8% 0.0% 0.0% 0.0% 0.00e+00 31 LoadBalancingAndAdaptiveMeshRefinement [region]
|           |-> 3.05e+01 sec 37.1% 0.0% 0.0% 0.0% 0.00e+00 30 Driver_Main [region]
|           |-> 1.42e+00 sec 1.7% 0.0% 0.0% 0.0% 0.00e+00 1 Mesh::Initialize [region]
|-> 2.72e+01 sec 33.2% 0.0% 0.0% 0.0% 0.00e+00 1846 Task_ReceiveAndSetBoundariesWithWait [region]
|   |-> 2.72e+01 sec 33.2% 0.0% 0.0% 0.0% 0.00e+00 1846 Mesh::Initialize [region]
|       |-> 2.71e+01 sec 33.1% 0.0% 0.0% 0.0% 0.00e+00 1800 RedistributeAndRefineMeshBlocks [region]
|       |   |-> 2.71e+01 sec 33.1% 0.0% 0.0% 0.0% 0.00e+00 1800 LoadBalancingAndAdaptiveMeshRefinement [region]
|       |       |-> 2.70e+01 sec 32.9% 0.0% 0.0% 0.0% 0.00e+00 1770 Driver_Main [region]
|       |       |-> 1.04e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 30 Mesh::Initialize [region]
|-> 1.23e+01 sec 14.9% 0.0% 0.0% 0.0% 0.00e+00 31 Step 8: Recv data and unpack [region]
|   |-> 1.23e+01 sec 14.9% 0.0% 0.0% 0.0% 0.00e+00 31 RedistributeAndRefineMeshBlocks [region]
|       |-> 1.23e+01 sec 14.9% 0.0% 0.0% 0.0% 0.00e+00 31 LoadBalancingAndAdaptiveMeshRefinement [region]
|           |-> 1.23e+01 sec 14.9% 0.0% 0.0% 0.0% 0.00e+00 30 Driver_Main [region]
|-> 1.58e+00 sec 1.9% 0.0% 0.0% 0.0% 0.00e+00 191929 Task_ReceiveBoundaryBuffers_MeshBlockData [region]
|   |-> 1.58e+00 sec 1.9% 0.0% 0.0% 0.0% 0.00e+00 191929 Task_ReceiveBoundaryBuffers_MeshData [region]
|       |-> 1.58e+00 sec 1.9% 0.0% 0.0% 0.0% 0.00e+00 191929 MultiStage_Step [region]
|           |-> 1.58e+00 sec 1.9% 0.0% 0.0% 0.0% 0.00e+00 191929 Driver_Main [region]
|-> 1.04e+00 sec 1.3% 0.0% 0.0% 0.0% 0.00e+00 214 MultiStage_Step [region]
|   |-> 1.04e+00 sec 1.3% 0.0% 0.0% 0.0% 0.00e+00 214 Driver_Main [region]
|-> 7.90e-01 sec 1.0% 0.0% 0.0% 0.0% 0.00e+00 216 UpdateMeshBlockTree [region]
|   |-> 7.90e-01 sec 1.0% 0.0% 0.0% 0.0% 0.00e+00 216 LoadBalancingAndAdaptiveMeshRefinement [region]
|       |-> 7.78e-01 sec 0.9% 0.0% 0.0% 0.0% 0.00e+00 214 Driver_Main [region]
|-> 5.48e-01 sec 0.7% 0.0% 0.0% 0.0% 0.00e+00 13887 advection_package::PostFill [region]
|   |-> 5.48e-01 sec 0.7% 0.0% 0.0% 0.0% 0.00e+00 13887 PostFillDerived [region]
|       |-> 5.48e-01 sec 0.7% 0.0% 0.0% 0.0% 0.00e+00 13887 Task_FillDerived [region]
|           |-> 4.78e-01 sec 0.6% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|           |   |-> 4.78e-01 sec 0.6% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 4.17e-01 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 428 WeightedSumData [region]
|   |-> 4.17e-01 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 428 Task_WeightedSumData [region]
|       |-> 4.17e-01 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 428 MultiStage_Step [region]
|           |-> 4.17e-01 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 428 Driver_Main [region]
|-> 3.97e-01 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 245 Create bndinfo array [region]
|   |-> 3.60e-01 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 214 Task_SendBoundaryBuffers_MeshData [region]
|   |   |-> 3.60e-01 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 214 MultiStage_Step [region]
|   |       |-> 3.60e-01 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 214 Driver_Main [region]
|-> 3.79e-01 sec 0.5% 100.0% 0.0% ------ 214 SetBoundaries [for]
|   |-> 3.79e-01 sec 0.5% 100.0% 0.0% 0.0% 0.00e+00 214 Task_SetBoundaries_MeshData [region]
|       |-> 3.79e-01 sec 0.5% 100.0% 0.0% 0.0% 0.00e+00 214 MultiStage_Step [region]
|           |-> 3.79e-01 sec 0.5% 100.0% 0.0% 0.0% 0.00e+00 214 Driver_Main [region]
|-> 3.27e-01 sec 0.4% 100.0% 0.0% ------ 12041 x3 flux [for]
|   |-> 3.27e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 12041 Task_Advection_CalculateFluxes [region]
|       |-> 3.27e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|           |-> 3.27e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 3.22e-01 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 214 Set complete and/or start sending via MPI [region]
|   |-> 3.22e-01 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 214 Task_SendBoundaryBuffers_MeshData [region]
|       |-> 3.22e-01 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 214 MultiStage_Step [region]
|           |-> 3.22e-01 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 214 Driver_Main [region]
|-> 2.96e-01 sec 0.4% 100.0% 0.0% ------ 12041 x2 flux [for]
|   |-> 2.96e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 12041 Task_Advection_CalculateFluxes [region]
|       |-> 2.96e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|           |-> 2.96e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 2.94e-01 sec 0.4% 100.0% 0.0% ------ 214 SendBoundaryBuffers [for]
|   |-> 2.94e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 214 Task_SendBoundaryBuffers_MeshData [region]
|       |-> 2.94e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 214 MultiStage_Step [region]
|           |-> 2.94e-01 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 214 Driver_Main [region]
|-> 2.82e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 214 FluxDivergenceMesh [region]
|   |-> 2.82e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 214 MultiStage_Step [region]
|       |-> 2.82e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 214 Driver_Main [region]
|-> 2.51e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 13887 Task_StartReceiving [region]
|   |-> 2.35e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|   |   |-> 2.35e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 2.50e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 13920 PostFillDerived [region]
|   |-> 2.50e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 13920 Task_FillDerived [region]
|       |-> 1.92e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|       |   |-> 1.92e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 2.42e-01 sec 0.3% 100.0% 0.0% ------ 12041 x1 flux [for]
|   |-> 2.42e-01 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 12041 Task_Advection_CalculateFluxes [region]
|       |-> 2.42e-01 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|           |-> 2.42e-01 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 2.08e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 13920 PreFillDerived [region]
|   |-> 2.08e-01 sec 0.3% 0.0% 0.0% 0.0% 0.00e+00 13920 Task_FillDerived [region]
|       |-> 1.54e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|       |   |-> 1.54e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 1.83e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13887 RestrictBoundaries [region]
|   |-> 1.83e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13887 Task_ProlongateBoundaries [region]
|       |-> 1.59e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|       |   |-> 1.59e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 1.62e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13920 FillDerived [region]
|   |-> 1.62e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13920 Task_FillDerived [region]
|       |-> 1.11e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|       |   |-> 1.11e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 1.53e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 1846 Task_SendBoundaryBuffers_MeshBlockData [region]
|   |-> 1.53e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 1846 Mesh::Initialize [region]
|       |-> 1.47e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 1800 RedistributeAndRefineMeshBlocks [region]
|       |   |-> 1.47e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 1800 LoadBalancingAndAdaptiveMeshRefinement [region]
|       |       |-> 1.44e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 1770 Driver_Main [region]
|-> 1.47e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13887 advection_package::PreFill [region]
|   |-> 1.47e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13887 PreFillDerived [region]
|       |-> 1.47e-01 sec 0.2% 0.0% 0.0% -0.0% 0.00e+00 13887 Task_FillDerived [region]
|           |-> 1.28e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|           |   |-> 1.28e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 1.46e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 428 Task_WeightedSumData [region]
|   |-> 1.46e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 428 MultiStage_Step [region]
|       |-> 1.46e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 428 Driver_Main [region]
|-> 1.36e-01 sec 0.2% 0.0% 0.0% -0.0% 0.00e+00 13887 Task_ClearBoundary [region]
|   |-> 1.23e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|   |   |-> 1.23e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 1.28e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13887 advection_package::SquareIt [region]
|   |-> 1.28e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13887 FillDerived [region]
|       |-> 1.28e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 13887 Task_FillDerived [region]
|           |-> 1.13e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|           |   |-> 1.13e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 1.24e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 20526 Task_ReceiveFluxCorrection [region]
|   |-> 1.24e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 20526 MultiStage_Step [region]
|       |-> 1.24e-01 sec 0.2% 0.0% 0.0% 0.0% 0.00e+00 20526 Driver_Main [region]
|-> 1.13e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 188054 RestrictCellCenteredValues3d [region]
|-> 1.09e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 1 Driver_Main [region]
|-> 1.05e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 3289 Task_ReceiveBoundaryBuffers_MeshData [region]
|   |-> 1.05e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 3289 MultiStage_Step [region]
|       |-> 1.05e-01 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 3289 Driver_Main [region]
|-> 9.88e-02 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 Task_Advection_CalculateFluxes [region]
|   |-> 9.88e-02 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|       |-> 9.88e-02 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 9.48e-02 sec 0.1% 100.0% 0.0% ------ 12087 advection check refinement [reduce]
|   |-> 9.40e-02 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 12041 Task_Tag_Block [region]
|   |   |-> 9.40e-02 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 12041 MultiStage_Step [region]
|   |       |-> 9.40e-02 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 12041 Driver_Main [region]
|-> 9.38e-02 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 32 Mesh::Initialize [region]
|-> 8.40e-02 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 1816 SearchAndSetNeighbors [region]
|   |-> 8.30e-02 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 1800 RedistributeAndRefineMeshBlocks [region]
|   |   |-> 8.30e-02 sec 0.1% 0.0% 0.0% 0.0% 0.00e+00 1800 LoadBalancingAndAdaptiveMeshRefinement [region]

KOKKOS HOST SPACE:
=================== 
MAX MEMORY ALLOCATED: 18129.5 kB
MPI RANK WITH MAX MEMORY: 2
ALLOCATIONS AT TIME OF HIGH WATER MARK:

KOKKOS CUDA SPACE:
=================== 
MAX MEMORY ALLOCATED: 0.0 kB
MPI RANK WITH MAX MEMORY: 0
ALLOCATIONS AT TIME OF HIGH WATER MARK:

Host process high water mark memory consumption: 55944 kB
  Max: 55944, Min: 54792, Ave: 55321 kB

END KOKKOS PROFILING REPORT.

This tells us, for example, that we are spending the majority of the total runtime (86.9%) in this example in the LoadBalancingAndAdaptiveMeshRefinement region (| |-> 1.78e+01 sec 86.9%* 0.2% 0.1% 0.0% 4.53e+03 214 LoadBalancingAndAdaptiveMeshRefinement [region]).

Profiling regions are the regions we manually define/introduced using Kokkos::Profiling::pushRegion("name string");. In addition the Kokkos parallel regions are also automatically profiled and shown (indicated by a [for] or [reduce] instead of the [region] in the output).

Note that if a region (or parallel regions) uses less than 0.1% of the total runtime it is not displayed (in case you wonder about missing regions).

Also note that the default loop pattern on CPU in Parthenon is using a non-Kokkos for for simd for style. Therefore the [for] related entry is missing and a [region] style entry is shown (as we manually add the region information in the abstraction layter with the same name, see here. In other words, comparing a GPU to a CPU output will show the same "kernels" but on CPUs kernels may be in the [region] or [for] category whereas on devices kernels will always be in the [for] category.

Really note Tracking allocation with the Kokkos tools currently introduces a significant overhead (up to the point that the run freezes on initialization) so that the output from the profiling regions may give a false impression of the hot spots. Thus, there's new "feature" to turn allocation tracking of (merge #388 in your branch and set export KOKKOS_TRACK_ALLOC_OFF=1 before running a test problem.

Compare profiling data using json output

With the "patched" space-time profiling tool of step 2 and the changes of the pgrete/bench-scripts branch pulled in (to allow disabling of allocation tracking), we can now collect the performance data in json format. For completeness, the commands required are (excluding compilation)

# "load" profiling library
export KOKKOS_PROFILE_LIBRARY=/path/to/kokkos-tools/profiling/space-time-stack/kp_space_time_stack.so

# Enable JSON output
export KOKKOS_PROFILE_EXPORT_JSON=1

# Disble allocation tracking in Parthenon
export KOKKOS_TRACK_ALLOC_OFF=1

# Run test
/usr/bin/mpiexec -n 4 /PATH/TO/PARTHENON/BUILD/example/advection/advection-example -i /PATH/TO/PARTHENON/ROOT/tst/regression/test_suites/output_hdf5/parthinput.advection parthenon/job/problem_id=advection_3d parthenon/mesh/numlevel=2 parthenon/mesh/nx1=32 parthenon/meshblock/nx1=8 parthenon/mesh/nx2=32 parthenon/meshblock/nx2=8 parthenon/mesh/nx3=32 parthenon/meshblock/nx3=8 parthenon/time/integrator=rk1 Advection/cfl=0.3 parthenon/output0/dt=-1.0

In addition to the standard profiling text output on the terminal, there is now a new spacetime.json file containing the profiling data in json format.

Note currently the filename is fixed to spacetime.json and the content overriden on each invocation.

The structure of the entry is as follows

{
"average-time" : 3.29e-01,           # average walltime in seconds per call
"percent" : 14.1,                    # percentage of walltime spent relative to the _entire_ run
"percent-kokkos" : 6.9,              # percentage of time spent in Kokkos in this entry
"imbalance" : 6.8,                   # MPI imbalance
"remainder" : 75.7,
"kernels-per-second" : 5.22e+05,
"number-of-calls" : 12041,
"name" : "Task_SendBoundaryBuffers", # name or the Kokkos/par_for kernel or region
"parent-id" : "0x55e57c6a96a0",      # the unique id of the higher level region, needed to reconstruct a nested call hierarchy such as a top down tree
"id" : "0x55e57cdfb480",             # a unique id
"kernel-type" : "region"             # type could be "region", "for", "reduce", "scan", or "copy"
}

An example of how create a comparative figure of two profiles can be found here

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Profiling

Examples

Profiling on a Cuda device

Profiling with tau

Prerequisites

Dynamic instrumentation (`tau_exec`)

Profiling with hpctoolkit

Profiling using Kokkos tools

Out of the box usage (space time stack as example)

Clone this wiki locally

Profiling

Examples

Profiling on a Cuda device

Profiling with tau

Prerequisites

Dynamic instrumentation (tau_exec)

Profiling with hpctoolkit

Profiling using Kokkos tools

Out of the box usage (space time stack as example)

Clone this wiki locally

Dynamic instrumentation (`tau_exec`)