
2020.07.06 Meeting Notes: Performance Call


See discussion at: https://github.com/lanl/parthenon/issues/189

With 32^3 blocks vs. 256^3 blocks, there is roughly a 3x overhead in zone cycles when switching to a "big" buffer-packing kernel.

It's worth noting the "big" kernel doesn't allow for per-variable MPI asynchrony.

Some ideas:

  • Potentially communicate "variable packs" instead - gets you some MPI asynchrony
  • There may be enough mesh blocks that you don't need per-variable communication
  • Could potentially pack variables into a single message to make up for the loss of overlap (see the sketch after this list)
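
To make the trade-off concrete, here is a minimal sketch; the struct, function names, and tags below are hypothetical and not Parthenon's actual API. One message per variable keeps per-variable asynchrony, while concatenating every variable into one buffer trades that overlap for fewer, larger messages.

```cpp
#include <mpi.h>

#include <cstddef>
#include <vector>

// Hypothetical: one entry per variable, each holding an already-packed boundary buffer.
struct VarBuffer {
  const double *data;
  int count;
};

// One message per variable: maximum overlap of packing and communication,
// but one send (and one request to poll) for every variable.
void SendPerVariable(const std::vector<VarBuffer> &vars, int dest, int tag_base,
                     MPI_Comm comm, std::vector<MPI_Request> &reqs) {
  for (std::size_t v = 0; v < vars.size(); ++v) {
    MPI_Request req;
    MPI_Isend(vars[v].data, vars[v].count, MPI_DOUBLE, dest,
              tag_base + static_cast<int>(v), comm, &req);
    reqs.push_back(req);
  }
}

// All variables concatenated into one message: less per-variable overlap,
// but far fewer (and larger) messages on the wire. `scratch` must stay alive
// until the request completes.
void SendPacked(const std::vector<VarBuffer> &vars, int dest, int tag, MPI_Comm comm,
                std::vector<double> &scratch, MPI_Request &req) {
  scratch.clear();
  for (const auto &v : vars) {
    scratch.insert(scratch.end(), v.data, v.data + v.count);
  }
  MPI_Isend(scratch.data(), static_cast<int>(scratch.size()), MPI_DOUBLE, dest, tag,
            comm, &req);
}
```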

Low hanging fruit:

  • Verify that data access/writes are coalesced in pgrete/pack-in-one
  • FindMeshBlock performance rears its head when targeting smaller mesh blocks. There is already a fix in the development version of Athena; @pgrete will share a patch file in https://github.com/lanl/parthenon/pull/213

Considerations:

  • Streams, and therefore MeshBlocks, cannot be shared between threads; Kokkos has issues with sharing streams across threads

Jim:

  • If he were redoing Athena today, he would pack all variables into a single buffer for MPI rather than how it is done currently

Next Steps:

  • Optimize a "big kernel" variable pack for cell-centered variables such that we get good performance with a uniform mesh on a single grid

Threads to pull on:

  • Use Kokkos hierarchical parallelism to implement the "big kernel" for packing variables (see the sketch after this list)
  • Coalescing read/writes for buffer packing routines
  • How similar are cell-centered and face-centered variables?
  • Adding micro-benchmarks
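
One way to structure that big kernel, sketched below under assumed names (BufferInfo, PackAllBuffers, and the index layout are illustrative, not Parthenon's real types): a Kokkos::TeamPolicy assigns each boundary buffer to a team, and the team's threads fill that buffer cooperatively, so a single kernel launch packs every variable while the inner loop still walks contiguous i indices for coalescing.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical description of one boundary buffer to fill (illustrative only).
struct BufferInfo {
  Kokkos::View<double ***, Kokkos::LayoutRight> var; // one variable on one block (k, j, i)
  Kokkos::View<double *> buf;                        // its communication buffer
  int ks, ke, js, je, is, ie;                        // inclusive cell range to pack
};

// Pack every buffer in a single kernel launch: one team per buffer, the team's
// threads cooperatively copy that buffer's cells. `infos` is assumed to live in
// device memory (filled on the host via a mirror and deep_copy).
void PackAllBuffers(const Kokkos::View<BufferInfo *> &infos) {
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;

  Kokkos::parallel_for(
      "PackAllBuffers",
      team_policy(static_cast<int>(infos.extent(0)), Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type &team) {
        const BufferInfo &b = infos(team.league_rank());
        const int nk = b.ke - b.ks + 1;
        const int nj = b.je - b.js + 1;
        const int ni = b.ie - b.is + 1;
        // Flatten (k, j, i) so neighboring work items touch neighboring i
        // indices, keeping reads and writes contiguous in memory.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nk * nj * ni),
                             [&](const int idx) {
                               const int k = idx / (nj * ni);
                               const int j = (idx / ni) % nj;
                               const int i = idx % ni;
                               b.buf(idx) = b.var(b.ks + k, b.js + j, b.is + i);
                             });
      });
}
```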