
2020.07.06 Meeting Notes: Performance Call


See discussion at: https://github.com/lanl/parthenon/issues/189

With 32^3 blocks vs. 256^3 blocks, there is roughly a 3x overhead in zone cycles when switching to a "big" buffer-packing kernel.

It's worth noting the "big" kernel doesn't allow for per-variable MPI asynchrony.

Some ideas:

  • Potentially communicate "variable packs" instead - gets you some MPI asynchrony
  • There may be enough mesh blocks that you don't need per-variable communication
  • Could potentially pack variables into a single message to make up for the loss of overlap (see the sketch after this list)
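
To make the trade-off concrete, here is a minimal sketch; the struct, function names, and tags below are hypothetical and not Parthenon's actual API. One message per variable keeps per-variable asynchrony, while concatenating every variable into one buffer trades that overlap for fewer, larger messages.

```cpp
#include <mpi.h>

#include <cstddef>
#include <vector>

// Hypothetical: one entry per variable, each holding an already-packed boundary buffer.
struct VarBuffer {
  const double *data;
  int count;
};

// One message per variable: maximum overlap of packing and communication,
// but one send (and one request to poll) for every variable.
void SendPerVariable(const std::vector<VarBuffer> &vars, int dest, int tag_base,
                     MPI_Comm comm, std::vector<MPI_Request> &reqs) {
  for (std::size_t v = 0; v < vars.size(); ++v) {
    MPI_Request req;
    MPI_Isend(vars[v].data, vars[v].count, MPI_DOUBLE, dest,
              tag_base + static_cast<int>(v), comm, &req);
    reqs.push_back(req);
  }
}

// All variables concatenated into one message: less per-variable overlap,
// but far fewer (and larger) messages on the wire. `scratch` must stay alive
// until the request completes.
void SendPacked(const std::vector<VarBuffer> &vars, int dest, int tag, MPI_Comm comm,
                std::vector<double> &scratch, MPI_Request &req) {
  scratch.clear();
  for (const auto &v : vars) {
    scratch.insert(scratch.end(), v.data, v.data + v.count);
  }
  MPI_Isend(scratch.data(), static_cast<int>(scratch.size()), MPI_DOUBLE, dest, tag,
            comm, &req);
}
```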

Low hanging fruit:

  • Verify that data access/writes are coalesced in pgrete/pack-in-one
  • FindMeshBlock performance rears its head when targeting smaller mesh blocks. There is already a fix in the development version of Athena; @pgrete will share a patch file in https://github.com/lanl/parthenon/pull/213

Considerations:

  • Streams, and therefore MeshBlocks, cannot be shared between threads; Kokkos has issues with sharing streams across threads

Jim:

  • If he were redoing Athena today, he would pack all variables into a single buffer for MPI rather than how it is done currently

Next Steps:

  • Optimize a "big kernel" variable pack for cell-centered variables such that we get good performance with a uniform mesh on a single grid

Threads to pull on:

  • Use Kokkos hierarchical parallelism to implement the "big kernel" for packing variables (see the sketch after this list)
  • Coalescing read/writes for buffer packing routines
  • How similar are cell-centered and face-centered variables?
  • Adding micro-benchmarks
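
One way to structure that big kernel, sketched below under assumed names (BufferInfo, PackAllBuffers, and the index layout are illustrative, not Parthenon's real types): a Kokkos::TeamPolicy assigns each boundary buffer to a team, and the team's threads fill that buffer cooperatively, so a single kernel launch packs every variable while the inner loop still walks contiguous i indices for coalescing.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical description of one boundary buffer to fill (illustrative only).
struct BufferInfo {
  Kokkos::View<double ***, Kokkos::LayoutRight> var; // one variable on one block (k, j, i)
  Kokkos::View<double *> buf;                        // its communication buffer
  int ks, ke, js, je, is, ie;                        // inclusive cell range to pack
};

// Pack every buffer in a single kernel launch: one team per buffer, the team's
// threads cooperatively copy that buffer's cells. `infos` is assumed to live in
// device memory (filled on the host via a mirror and deep_copy).
void PackAllBuffers(const Kokkos::View<BufferInfo *> &infos) {
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;

  Kokkos::parallel_for(
      "PackAllBuffers",
      team_policy(static_cast<int>(infos.extent(0)), Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &team) {
        const BufferInfo &b = infos(team.league_rank());
        const int nk = b.ke - b.ks + 1;
        const int nj = b.je - b.js + 1;
        const int ni = b.ie - b.is + 1;
        // Flatten (k, j, i) so neighboring work items touch neighboring i
        // indices, keeping reads and writes contiguous in memory.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nk * nj * ni),
                             [&](const int idx) {
                               const int k = idx / (nj * ni);
                               const int j = (idx / ni) % nj;
                               const int i = idx % ni;
                               b.buf(idx) = b.var(b.ks + k, b.js + j, b.is + i);
                             });
      });
}
```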