
WIP: Fix MPI Communicator destruction (again) #945

Open
wants to merge 1 commit into develop
Conversation

bprather
Collaborator

Remember #841? It's back!

A refresher: in CI tests, there was a semi-common (30-50%?) race condition between Parthenon's use of MPI and HDF5's use of MPI while writing the final outputs and cleaning up. Parthenon would MPI_Comm_free a communicator, and HDF5 would attempt an MPI_Comm_dup at the same time, which apparently trips up some MPI implementations.

I found (without explanation or extensive testing) that I could make that race condition less frequent by switching the destructor of Reduction objects to use MPI_Comm_disconnect instead of MPI_Comm_free. The difference is synchronicity: MPI_Comm_free is asynchronous, setting the local handle to null without actually destroying the communicator until every process has freed it. It "disconnects" from the communicator, if you will. MPI_Comm_disconnect, by contrast, blocks until it can confirm that the communicator has been destroyed. That is, it "frees" it (the MPI standard is so straightforward!).
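To make the distinction concrete, here's a minimal sketch (not Parthenon's actual Reduction class, just a stand-in) of a destructor that owns a duplicated communicator and could tear it down either way:

```c++
#include <mpi.h>

// Stand-in for a Reduction-like object that owns its own communicator.
struct ReductionLike {
  MPI_Comm comm = MPI_COMM_NULL;

  explicit ReductionLike(MPI_Comm parent) {
    // Each object gets a private duplicate, as a reducer would.
    MPI_Comm_dup(parent, &comm);
  }

  ~ReductionLike() {
    if (comm != MPI_COMM_NULL) {
      // Asynchronous: marks the communicator for deallocation and nulls the
      // local handle; MPI destroys it once every rank has freed it.
      MPI_Comm_free(&comm);
      // Synchronous alternative: blocks until pending communication on the
      // communicator has completed and it can actually be destroyed.
      // MPI_Comm_disconnect(&comm);
    }
  }
};

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  {
    ReductionLike r(MPI_COMM_WORLD);
  } // destructor runs here, safely before MPI_Finalize
  MPI_Finalize();
  return 0;
}
```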

In any case, after reorganizing communicators in KHARMA, I've found that when freeing a lot of communicators together, MPI_Comm_disconnect is very likely to just... hang forever. This happens in, for example, the destructor of Mesh, which calls the destructor of Packages_t, which calls every package's destructor; in codes that follow our best practices, that frees every communicator at once. So I imagine it will become a problem for other folks as well, manifesting as a run that finishes, prints statistics, and then just... doesn't return. That's especially bad in batch jobs, where a run that would otherwise exit cleanly keeps burning core-hours without printing anything suspicious.
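For illustration, a hedged sketch of the teardown chain described above; MyPackageData is a hypothetical stand-in for per-package state, not Parthenon API:

```c++
#include <mpi.h>
#include <vector>

// Hypothetical package state owning several reducers' communicators, all of
// which are released together when the Mesh -> Packages_t -> package
// destructor chain unwinds.
struct MyPackageData {
  std::vector<MPI_Comm> comms;

  MyPackageData(MPI_Comm parent, int nreducers) {
    for (int i = 0; i < nreducers; ++i) {
      MPI_Comm c;
      MPI_Comm_dup(parent, &c);
      comms.push_back(c);
    }
  }

  ~MyPackageData() {
    // With MPI_Comm_disconnect here, each call blocks collectively; releasing
    // many communicators back-to-back like this is where the hang shows up.
    for (auto &c : comms) {
      if (c != MPI_COMM_NULL) MPI_Comm_free(&c);
    }
  }
};

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  {
    MyPackageData pkg(MPI_COMM_WORLD, 8); // e.g. eight reducers in one package
  } // all eight communicators are released here, in one burst
  MPI_Finalize();
  return 0;
}
```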

This PR just reverts the destructor to MPI_Comm_free, which fixes the hang in KHARMA and lets all tests pass on my machine, with no warnings, double-frees, etc. If it fails CI, we can talk about other sorts of hacks for keeping HDF5 out of our way while we destroy the Mesh.

Teaching sand to think might have been a mistake, but letting it talk to itself was a catastrophe.

@Yurlungur
Collaborator

I think the restriction here, if we use MPI_Comm_free, is that reducers cannot be stored in the driver; they have to be held in objects that have the same lifespan as (or shorter than) the Mesh?

@bprather
Collaborator Author

bprather commented Sep 29, 2023

Not quite. free and disconnect nominally do exactly the same thing: destroy the communicator. The only difference is synchronicity. Since both the Driver and Mesh are torn down before MPI_Finalize is called, there's not really a danger of lingering communicators here, so long as one scopes them in something.
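As a sketch of what "scopes them in something" could look like (ScopedComm is an illustrative RAII wrapper, not a Parthenon type), the communicator dies with its owner, well before MPI_Finalize:

```c++
#include <mpi.h>

// Illustrative RAII owner: the communicator is freed when the owner goes out
// of scope, which by construction is before MPI_Finalize.
class ScopedComm {
 public:
  explicit ScopedComm(MPI_Comm parent) { MPI_Comm_dup(parent, &comm_); }
  ~ScopedComm() {
    if (comm_ != MPI_COMM_NULL) MPI_Comm_free(&comm_);
  }
  ScopedComm(const ScopedComm &) = delete;
  ScopedComm &operator=(const ScopedComm &) = delete;
  MPI_Comm get() const { return comm_; }

 private:
  MPI_Comm comm_ = MPI_COMM_NULL;
};

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  {
    ScopedComm reduction_comm(MPI_COMM_WORLD);
    int local = 1, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, reduction_comm.get());
  } // communicator freed here, before MPI_Finalize below
  MPI_Finalize();
  return 0;
}
```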

In a world without HDF5 and with perfect thread safety, we'd call free whenever we wanted as we tear down the Mesh object -- or we'd call disconnect and wait a little longer. But in our situation, free caused issues with HDF5, crashing a call to MPI_Comm_dup somewhere inside HDF5, at least in the CI runs. However, disconnect is causing me much worse problems than free did, hence my proposing the switch back.

Until anyone else runs into this issue, or until I hit the old race condition again, I'm fine letting this PR sit here as WIP (it's not like it's going to fall out of date!). If it is going to be upstreamed, I should dig into reproducing the race condition we saw in #841 and verify that newer Parthenon/HDF5/production simulations won't hit it.
