Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make AMR and sparse restarts bitwise exact #1073

Merged
merged 12 commits into from
May 15, 2024

Conversation

lroberts36
Copy link
Collaborator

@lroberts36 lroberts36 commented May 10, 2024

PR Summary

This PR saves the derefinement count for each meshblock and the deallocation count for each sparse variable in the restart output (as well as in regular HDF5 output). This should ensure that restarted simulations give bitwise exact results to simulations that were run from start to finish.

Previously, AMR simulations that were restarted had the deallocation count of all meshblocks reset to zero. This meant that after restart, a block could be deallocated a few steps after it would have been deallocated if there was no restart. This resulted in differences in simulation results with and without restarts many orders of magnitude larger than machine precision, although still relatively small compared to characteristic scales of the problem. An example of this difference is in issue #1072 (the problem detailed there is fixed by this PR).

Similarly, the deallocation count of sparse variables was not stored and was reset to zero on restart. This was a known issue, and the restart regression test set the global deallocation count large enough that variables were never deallocated. This PR also stores the deallocation count of variables in the restart file and sets them to the correct values on restart. The restart test now works with a deallocation count of 5.

Also, MeshRefinement::DereferenceCount is renamed to MeshRefinement::DeallocationCount since dereference count doesn't really make sense.

WARNING: This PR does not solve the edge case described in issue #1077

PR Checklist

  • Code passes cpplint
  • Adds a test for any bugs fixed.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@lroberts36 lroberts36 added the bug Something isn't working label May 10, 2024
@lroberts36 lroberts36 linked an issue May 10, 2024 that may be closed by this pull request
@lroberts36 lroberts36 changed the title Bugfix: Make AMR and sparse restarts work Bugfix: Make AMR and sparse restarts bitwise exact May 13, 2024
@lroberts36 lroberts36 changed the title Bugfix: Make AMR and sparse restarts bitwise exact Make AMR and sparse restarts bitwise exact May 13, 2024
Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the catch! Some nitpicks below

src/mesh/mesh.cpp Outdated Show resolved Hide resolved
src/outputs/parthenon_hdf5.cpp Show resolved Hide resolved
src/outputs/parthenon_hdf5.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@pdmullen pdmullen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/outputs/restart.hpp Outdated Show resolved Hide resolved
src/outputs/restart.hpp Outdated Show resolved Hide resolved
Co-authored-by: Patrick Mullen <33095288+pdmullen@users.noreply.github.com>
@lroberts36 lroberts36 enabled auto-merge May 14, 2024 17:16
@lroberts36 lroberts36 disabled auto-merge May 14, 2024 17:24
@lroberts36 lroberts36 enabled auto-merge May 14, 2024 18:47
@lroberts36 lroberts36 mentioned this pull request May 14, 2024
13 tasks
@lroberts36 lroberts36 disabled auto-merge May 14, 2024 19:27
Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for the fix!
Feel free to merge as is if the "restart from old files" is not an issue.

src/outputs/parthenon_hdf5.cpp Outdated Show resolved Hide resolved
src/outputs/parthenon_hdf5.cpp Outdated Show resolved Hide resolved
src/outputs/restart_hdf5.cpp Show resolved Hide resolved
src/parthenon_manager.cpp Show resolved Hide resolved
tst/regression/test_suites/restart/restart.py Show resolved Hide resolved
@lroberts36 lroberts36 enabled auto-merge May 15, 2024 18:54
@lroberts36 lroberts36 merged commit dfda73b into develop May 15, 2024
49 checks passed
pgrete added a commit that referenced this pull request May 29, 2024
…#1073) (#1089)

* Fix deref couter reading from previous restarts

* Fix logic.

* make a separate dataset for derefinement count

* don't set derefinement count to large number in restart test

* Add warning

* ignore derefinement_count in phdf_diff

* only warn on rank 0

* remove hdf debug output

---------

Co-authored-by: Luke Roberts <lfroberts@lanl.gov>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some results of restarts are not bitwise exact
4 participants