
2021.03.04 Meeting Notes


Agenda

  • Individual/group updates
  • Scaling update
  • IO (performance and default basename handling)
  • Review non-WIP PRs

Individual/group Updates

LANL CS

Joshua Brown

Mostly reviewing pull requests. Debugging CI issues.

Jonas

Looking into dumping output for sparse variables.

Galen

Figured out what was wrong with the build on RZAnsel; the problem was with the .cshrc file.

LANL Physics

Josh Dolence

Mostly working on the physics side of things, but noted the importance of tiling when using Kokkos' MDRangePolicy: significant performance degradation can occur if the wrong tile sizes are used (see the sketch below). https://github.com/lanl/parthenon/issues/466
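
(Not from the meeting: a minimal Kokkos sketch of where tile sizes enter an MDRangePolicy. The array dimensions and tile sizes are illustrative assumptions, not recommended settings.)

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nk = 64, nj = 64, ni = 64;  // illustrative block size
    Kokkos::View<double ***> u("u", nk, nj, ni);

    // Default tiling: Kokkos chooses tile sizes for the active backend.
    Kokkos::parallel_for(
        "fill_default",
        Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {nk, nj, ni}),
        KOKKOS_LAMBDA(const int k, const int j, const int i) {
          u(k, j, i) = 1.0;
        });

    // Explicit tiling: the third initializer list sets tile sizes per
    // dimension. A poor choice (e.g. tiles that break contiguous access
    // in the fastest index on GPUs) is the kind of thing that causes the
    // slowdown referenced in issue #466.
    Kokkos::parallel_for(
        "fill_tiled",
        Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {nk, nj, ni},
                                               {1, 1, ni}),
        KOKKOS_LAMBDA(const int k, const int j, const int i) {
          u(k, j, i) = 2.0;
        });
  }
  Kokkos::finalize();
  return 0;
}
```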

Jonah

Has been working on physics codes. The table interpolation machinery and tabulated equation of state have been made open source.

Ben

Still working on the particles PR.

AthenaPK

Philipp

  • Found and fixed "invalid free" bug
  • Various quality of life improvements
    • "soft disable" outputs
    • split init of Parthenon manager
    • allow reuse of Parthenon testing framework in downstream codes
  • Updated AthenaPK (following Jim's test code) to use two-register integration and various reconstruction methods
  • Gave talk at CSE21
  • More scaling tests (see later topic)

Forrest

Will begin work on implementing a static mesh in Parthenon.

Jim

Has been playing around with physics in two different test codes.

Scaling Update

Galen now has performance numbers on Sierra matching what Phil was getting. He has submitted a request for higher priority on Sierra and has scaled the advection test up to 16,000 GPUs. He will also be testing on Trinity with just over 500,000 ranks, using a single rank per physical core.

There appears to be an issue when multiple ranks are assigned to each GPU: doing so does not appear to improve performance. Galen will test that configuration.

Phil ran weak scaling on a uniform grid on Summit; the results did not look as good as Athena's. This seems to be related to the current setup of the driver, which does not yet effectively overlap communication with computation. Using only 64³ mesh blocks on 500,012 GPUs, the slowdown was less than a factor of 3.

Will begin looking at the restriction and prolongation approach.

IO

Phil raised the question of how to specify the name of the outputs; the default is "parthenon" plus the job ID, which is not appropriate for downstream codes.

Another issue is the performance of the IO. Currently, restart files use parallel HDF5, which is not as performant as raw MPI-IO; some benchmarking should be done to see what the difference is.

Josh Brown will look into this.
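
(Not from the meeting: a minimal sketch of the kind of raw collective MPI-IO write such a benchmark would compare the parallel-HDF5 restart path against. The file name, buffer size, and one-contiguous-block-per-rank layout are illustrative assumptions, not Parthenon's actual restart layout.)

```cpp
#include <mpi.h>

#include <cstdio>
#include <vector>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  const int n_local = 1 << 20;  // doubles per rank (illustrative)
  std::vector<double> data(n_local, static_cast<double>(rank));

  // Each rank writes one contiguous block into a shared file.
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "bench.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  const MPI_Offset offset =
      static_cast<MPI_Offset>(rank) * n_local * sizeof(double);

  // Time the collective write plus the close (which flushes the file).
  const double t0 = MPI_Wtime();
  MPI_File_write_at_all(fh, offset, data.data(), n_local, MPI_DOUBLE,
                        MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
  const double t1 = MPI_Wtime();

  if (rank == 0) {
    std::printf("wrote %d doubles/rank on %d ranks in %.3f s\n", n_local,
                nranks, t1 - t0);
  }
  MPI_Finalize();
  return 0;
}
```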
