Question: Shared memory tests #316

Open
jphaupt opened this issue Jul 8, 2021 · 1 comment

Comments

jphaupt commented Jul 8, 2021

I have a project that uses shared memory through both MPI and OpenMP. So far my only exposure to pFUnit is from the demos repo and a YouTube video, but it seems like a great option for testing my program. I have two questions:

  • The project readme file mentioned "limited support for OpenMP" but I cannot find information on this elsewhere. Is there documentation for how to use pFUnit with OpenMP (or at least demos -- I found those for MPI very helpful)?
  • pFUnit has a neat way to test parallel programs with a mock MPI using the npes option, which by my understanding sets the number of jobs to run. It looks like this calls mpirun on my workstation. Is there any way to simulate several nodes, e.g. 2 nodes running 2 processes each for a total of 4 processes (preferably in a "simulated" way so that I can still use my workstation)? I guess not, as it seems a big ask. Is there any workaround? (My program exhibits an error when I run it on several nodes, but no error on one node, even with the same number of processes. I would like to use pFUnit to track down the problem.)

P.S. please redirect if this is not the place to ask this.

@jphaupt jphaupt changed the title Shared memory tests Question: Shared memory tests Jul 8, 2021

tclune (Member) commented Jul 19, 2021

Sorry for the delay in responding.

The support for OpenMP in pFUnit is minimal in the sense that there is no special mechanism to identify the particular thread where a failure occurs, nor to set up an OpenMP parallel region for you. (Though it might be straightforward to extend the framework to do either of these.)

Rather, the (current) expectation is that the layer to be tested creates and completes its own parallel region within that layer. The "limited" support is that there are OpenMP directives in the bit of code where exceptions are accumulated so that if failures happen on multiple threads you don't have them all hitting the same memory at the same time. This should be quite sufficient for testing most OpenMP instrumented procedures.
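The accumulation pattern described above can be illustrated outside Fortran. The following is a minimal Python sketch, not pFUnit's actual code: the names `FailureLog`, `check_positive`, and `run_parallel_region` are hypothetical, and a `threading.Lock` stands in for the role that an OpenMP critical section would play around the shared list of failures.

```python
import threading


class FailureLog:
    """Hypothetical stand-in for a framework's shared list of test failures."""

    def __init__(self):
        self._lock = threading.Lock()
        self._messages = []

    def record(self, message):
        # Only one thread at a time may append, so concurrent failures
        # cannot corrupt the shared list.
        with self._lock:
            self._messages.append(message)

    def messages(self):
        with self._lock:
            return list(self._messages)


def check_positive(values, log, thread_id):
    """Work done by one thread: report every non-positive value as a failure."""
    for v in values:
        if v <= 0:
            log.record(f"thread {thread_id}: expected positive value, got {v}")


def run_parallel_region(chunks):
    """The code under test opens and closes its own parallel region."""
    log = FailureLog()
    threads = [
        threading.Thread(target=check_positive, args=(chunk, log, i))
        for i, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log.messages()


failures = run_parallel_region([[1, 2, -3], [4, 0, 6], [7, 8, 9]])
print(sorted(failures))
```

The order in which failures arrive depends on thread scheduling, which is why the example sorts them before printing.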

But, if one is using OpenMP in a particularly sophisticated manner, one might want the framework to handle more as I mentioned above. E.g., requesting the same test to be run on 1, 2, ... n threads, much as is possible with MPI. If you are interested in helping to evaluate such extensions, I can work to create them.

With regard to multiple nodes: There is no special support in pFUnit, and failures such as you describe can be tricky to diagnose. Often the problem is the MPI implementation itself rather than your own code. E.g., we have had situations where the use of 1-sided MPI calls exposed bugs in various flavors of MPI when used in a multi-node context. Sometimes there are environment variables that allow the MPI to work properly in that context. All such issues are, unfortunately, outside the scope of pFUnit.

From the perspective of your own source code, "legal" use of MPI should not be able to detect the difference in the number of nodes involved, except in some very special cases where MPI knows about shared memory segments, or if you are somehow otherwise creating subcommunicators that are associated with given nodes (e.g., using hostname to identify neighboring processes on a node). If you think there are errors in your code that are exposed by those sorts of issues, then advanced uses of "mocking" MPI might be able to help, but that is probably not worth the effort compared to traditional debugging.
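One lightweight form of such mocking is to isolate the logic that groups processes by node and feed it made-up host names, so that a "2 nodes x 2 processes" layout can be checked on a single workstation. The Python sketch below is an assumption about how that logic might look, not code from pFUnit or from the project in question: `node_local_layout` is a hypothetical helper, and no MPI calls are made. In a real program the computed color would typically be passed to `MPI_Comm_split`, or the grouping would be delegated to `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`.

```python
def node_local_layout(hostnames):
    """Group ranks by host name.

    hostnames[i] is the host name reported by rank i (injected here rather
    than queried from MPI). Returns one (color, local_rank) pair per rank,
    where color identifies the node and local_rank counts ranks on that node.
    """
    color_of_host = {}
    next_local_rank = {}
    layout = []
    for host in hostnames:
        # The first time a host is seen it gets the next unused color.
        color = color_of_host.setdefault(host, len(color_of_host))
        local_rank = next_local_rank.get(color, 0)
        next_local_rank[color] = local_rank + 1
        layout.append((color, local_rank))
    return layout


# Four processes on one workstation versus two processes on each of two nodes.
one_node = node_local_layout(["ws", "ws", "ws", "ws"])
two_nodes = node_local_layout(["node0", "node0", "node1", "node1"])
print(one_node)
print(two_nodes)
```

If code downstream of this grouping behaves differently for the two layouts, that difference can be exercised in an ordinary single-node test. It will not reproduce bugs that live in the MPI implementation or the interconnect.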

You can of course still use pFUnit to create various tests around the code and use them for triangulation. But you'll need to run the tests on a multi-node cluster unless/until you can expose a bit that fails on a single node.

Happy to speculate further if you want to provide more details.
