failed test: "unit_tests: Can pull variables from container based on Metadata" #585

Closed
kurtsansom opened this issue Aug 24, 2021 · 21 comments

Comments

@kurtsansom
Collaborator

I am trying to figure out why this unit test fails:

I am using a branch that has 'develop' merged into pgrete/expose-flux-div.

Any hints would be helpful.

ctest --rerun-failed --output-on-failure
Test project /home/kurt.sansom/build/athena/athenapk/build/parthenon
    Start 24: "unit_tests:Can pull variables from containers based on Metadata"
1/2 Test #24: "unit_tests:Can pull variables from containers based on Metadata" ...Subprocess aborted***Exception:   0.57 sec
Kokkos::View ERROR: attempt to access inaccessible memory space
Filters: Can pull variables from containers based on Metadata

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
unit_tests is a Catch v2.11.1 host application.
Run with -? for options                                                                          
                                                
-------------------------------------------------------------------------------
Can pull variables from containers based on Metadata
      Given: A Container with a set of variables initialized to zero
-------------------------------------------------------------------------------
/home/kurt.sansom/build/athena/athenapk/athenapk/external/parthenon/tst/unit/test_meshblock_data_iterator.cpp:72
...............................................................................
                                                                                                 
/home/kurt.sansom/build/athena/athenapk/athenapk/external/parthenon/tst/unit/test_meshblock_data_iterator.cpp:100: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases: 1 | 1 failed
assertions: 4 | 3 passed | 1 failed

@Yurlungur
Collaborator

Hmm that's a weird one. The test that's failing is this one:
https://github.com/lanl/parthenon/blob/develop/tst/unit/test_meshblock_data_iterator.cpp#L70

But likely the error is much deeper... in the VariablePack or in the Variable machinery. It might help to print some statements in the above-mentioned function to see exactly where the error is coming from.

@pgrete
Collaborator

pgrete commented Aug 30, 2021

@kurtsansom I see that you're working with a version coming from AthenaPK.
I occasionally have some hot-fixes in the Parthenon submodule used in AthenaPK (when working on adding/integrating features that require changes on both ends).
Which commit of AthenaPK are you working with?
I can try to reproduce tomorrow.

PS: Feel free to tag me directly for any AthenaPK related issues here.

@kurtsansom
Collaborator Author

Apologies @pgrete for not returning to this sooner. I am working off the main athenapk branch. I was looking into the expose-flux-div branch with the upstream dev branch merged in. It looks like that branch has been updated, so I will bring in those changes to see if that changes anything.

@kurtsansom changed the title from failed test: "unit_tests: Can pull varaibles from container based on Metadata" to failed test: "unit_tests: Can pull variables from container based on Metadata" on Sep 14, 2021
@kurtsansom
Collaborator Author

kurtsansom commented Sep 14, 2021

I pulled in the latest changes to the branch and Parthenon still chokes on that test. Is there a way to check where the computation is taking place, i.e., is it trying to run the test on a GPU? What do you suggest to narrow it down? (Tips on how to debug would be helpful; I am still relatively new to Kokkos.)
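One low-tech way to answer the "where does it run" question is to look at which Kokkos backends the build was configured with. The sketch below assumes the usual `Kokkos_ENABLE_<BACKEND>:BOOL=...` entries in the build directory's `CMakeCache.txt`; the helper name `list_kokkos_backends` is made up for illustration. When CUDA is among the enabled backends, Kokkos normally selects it as the default execution space, so the test's kernels would typically run on the GPU, and an "inaccessible memory space" error usually points to host code touching device memory.

```shell
# Hypothetical helper: print the Kokkos backends switched ON in a CMakeCache.txt.
# Usage: list_kokkos_backends path/to/build/CMakeCache.txt
list_kokkos_backends() {
  # Cache entries look like: Kokkos_ENABLE_CUDA:BOOL=ON
  grep -E '^Kokkos_ENABLE_(CUDA|HIP|OPENMP|SERIAL):BOOL=(ON|TRUE|1)$' "$1" |
    sed -E 's/^Kokkos_ENABLE_([A-Z]+):.*/\1/' |
    sort
}
```

Running the test binary directly, the way CTest invokes it (`unit_tests "Can pull variables from containers based on Metadata"`), also keeps the Catch output free of CTest's line wrapping.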

@pgrete
Collaborator

pgrete commented Sep 15, 2021

@kurtsansom would you mind sharing your full configuration, i.e., what hardware you are running on, what cmake line you used to configure your build, and the output of the configure step?
This can help me narrow down the issue, as I was not able to reproduce the error.
I tried builds on both a CPU and a GPU, and the test passes:

$ ctest -V -R "unit_tests:Can pull variables from containers based"
UpdateCTestConfiguration  from :/mnt/home/gretephi/src/athenapk/build-test/DartConfiguration.tcl
Parse Config file:/mnt/home/gretephi/src/athenapk/build-test/DartConfiguration.tcl
 Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/opt/software/CMake/3.17.1/bin/cmake
UpdateCTestConfiguration  from :/mnt/home/gretephi/src/athenapk/build-test/DartConfiguration.tcl
Parse Config file:/mnt/home/gretephi/src/athenapk/build-test/DartConfiguration.tcl
Test project /mnt/home/gretephi/src/athenapk/build-test
Run command: bash /mnt/home/gretephi/src/athenapk/external/parthenon/scripts/device_check.sh 1 /opt/software/OpenMPI/4.0.3-gcccuda-2020a/bin/mpiexec 4

****************************************** Beginning Precheck ******************************************

Number of GPUs detected per node: 4
Number of GPUs per node, requested in tests: 1

Number of nodes detected: 1
Number of MPI procs, requested in tests: 4
Number of MPI procs per node: 4

******************************************  Ending Precheck   ******************************************

Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 24
    Start 24: unit_tests:Can pull variables from containers based on Metadata

24: Test command: /mnt/home/gretephi/src/athenapk/build-test/parthenon/tst/unit/unit_tests "Can pull variables from containers based on Metadata"
24: Test timeout computed to be: 1500
24: Filters: Can pull variables from containers based on Metadata
24: ===============================================================================
24: All tests passed (34 assertions in 1 test case)
24: 
1/1 Test #24: unit_tests:Can pull variables from containers based on Metadata ...   Passed    0.88 sec

The following tests passed:
	unit_tests:Can pull variables from containers based on Metadata

100% tests passed, 0 tests failed out of 1

Label Time Summary:
meshblockdataiterator    =   0.88 sec*proc (1 test)
unit_tests               =   0.88 sec*proc (1 test)

Total Test time (real) =   0.91 sec

@kurtsansom
Collaborator Author

Hardware/OS

  • Operating System: CentOS Linux 7 (Core)
  • CPE OS Name: cpe:/o:centos:centos:7
  • Kernel: Linux 3.10.0-1160.36.2.el7.x86_64
  • NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4
  • 2 Quadro RTX 5000 (Turing architecture)

Compiler / Libraries / Environment

  • gcc 10.3.0, nvcc cuda_11.4.r11.4/compiler.30188945_0
  • spack: hdf5 +cxx+fortran+hl+mpi+tools, openmpi@4.1.1, clang-format llvm@12.0.1
  • anaconda env for python libraries

I am working on getting a pared-down cmake configuration to provide.

@kurtsansom
Collaborator Author

I tried running on another OS, platform, and GPU, and it appears to fail there too. I will investigate and provide more detail.

@kurtsansom
Collaborator Author

kurtsansom commented Sep 15, 2021

Here is the AthenaPK branch that is using Parthenon: https://gitlab.com/kayarre/athenapk/-/tree/test_branch

I am testing the branch from the pull request #558

@kurtsansom
Collaborator Author

I tried using cuda+openmpi+hdf5, all installed into a spack environment, but ran into a weird gcc 10.3.0 ICE (internal compiler error). I will have to try again tomorrow, but here is the configuration I attempted.

# This is a Spack Environment file.
#
# It describes a set of packages to be installed, along with
# configuration settings.
spack:
  # add package specs to the `specs` list
  specs: [hdf5+cxx+fortran+hl+mpi+tools ^openmpi+cuda+cxx, openmpi+cuda+cxx]
  view: true

cmake -DHDF5_DIR=/home/kurt.sansom/tools/spack/var/spack/environments/athenapk/.spack-env/view/share/cmake/hdf5 \
      -DCATCH_BUILD_TESTING:BOOL=OFF \
      -DBUILD_TESTING:BOOL=ON \
      -DCMAKE_INSTALL_PREFIX:PATH=/home/kurt.sansom/build/athena/athenapk/install \
      -DAthenaPK_ENABLE_TESTING:BOOL=OFF \
      -DBLACK:FILEPATH=/home/kurt.sansom/miniconda3/envs/athenapk/bin/black \
      -DCLANG_FORMAT:FILEPATH=/home/kurt.sansom/tools/spack/opt/spack/linux-centos7-cascadelake/gcc-10.3.0/llvm-12.0.1-dj3n5onm2z2qt5eajbs4atrj6nrws66k/bin/clang-format \
      -DCMAKE_CUDA_COMPILER:STRING=/home/kurt.sansom/tools/spack/var/spack/environments/athenapk/.spack-env/view/bin/nvcc \
      -DCUDAToolkit_INCLUDE_DIR:FILEPATH=/home/kurt.sansom/tools/spack/var/spack/environments/athenapk/.spack-env/view/targets/x86_64-linux/include \
      -DCMAKE_CXX_COMPILER:STRING=/home/kurt.sansom/tools/spack/opt/spack/linux-centos7-cascadelake/gcc-10.3.0/gcc-10.3.0-5ohqqvljcsohjccahbiydifzjctwifh5/bin/g++ \
      -DCMAKE_C_COMPILER:STRING=/home/kurt.sansom/tools/spack/opt/spack/linux-centos7-cascadelake/gcc-10.3.0/gcc-10.3.0-5ohqqvljcsohjccahbiydifzjctwifh5/bin/gcc \
      -DCMAKE_CXX_FLAGS_DEBUG:STRING=-g \
      -DMPIEXEC_EXECUTABLE:FILEPATH=/home/kurt.sansom/tools/spack/var/spack/environments/athenapk/.spack-env/view/bin/mpiexec \
      -DMPI_CXX_COMPILER:FILEPATH=/home/kurt.sansom/tools/spack/var/spack/environments/athenapk/.spack-env/view/bin/mpicxx \
      -DKokkos_ENABLE_CUDA:BOOL=ON \
      -DKokkos_ENABLE_CUDA_LAMBDA:BOOL=ON \
      -DKokkos_ENABLE_DEBUG:BOOL=ON \
      -DKokkos_CUDA_DIR:FILEPATH=/home/kurt.sansom/tools/spack/var/spack/environments/athenapk/.spack-env/view/targets/x86_64-linux/lib \
      -DCUDAToolkit_BIN_DIR:FILEPATH=/home/kurt.sansom/tools/spack/var/spack/environments/athenapk/.spack-env/view/targets/x86_64-linux/bin \
      -DKokkos_ARCH_TURING75:BOOL=ON \
      -DKokkos_CXX_STANDARD:STRING=17 \
      -DKokkos_ENABLE_DEBUG:BOOL=ON \
      -DKokkos_ENABLE_EXAMPLES:BOOL=OFF \
      -DKokkos_ENABLE_OPENMP:BOOL=ON \
      -DKokkos_ENABLE_SERIAL:BOOL=ON \
      -DKokkos_ENABLE_TESTS:BOOL=ON \
      -DNUM_GPU_DEVICES_PER_NODE:STRING=2 \
      -DPARTHENON_ENABLE_CPP17:BOOL=ON \
      -DPARTHENON_ENABLE_GPU_MPI_CHECKS:BOOL=ON \
      -DPARTHENON_ENABLE_INTEGRATION_TESTS:BOOL=ON \
      -DPARTHENON_ENABLE_UNIT_TESTS:BOOL=ON \
      -DPARTHENON_ENABLE_PYTHON_MODULE_CHECK:BOOL=ON \
      -DPARTHENON_DISABLE_EXAMPLES:BOOL=ON \
      -DPARTHENON_DISABLE_OPENMP:BOOL=OFF \
      -DPARTHENON_ENABLE_PYTHON_MODULE_CHECK:BOOL=ON \
      -DAthenaPK_ENABLE_TESTING:BOOL=OFF \
      -B /home/kurt.sansom/build/athena/athenapk/build/ \
      -S /home/kurt.sansom/build/athena/athenapk/athenapk/

@pgrete
Collaborator

pgrete commented Sep 15, 2021

Thanks for the additional info.
Here's what I got so far (also on a CentOS machine and using a cmake line very similar to yours):

  1. I was able to reproduce the internal compiler error with gcc 10.3.0. It does not show up with the other gcc versions I tried, e.g., gcc 10.2.0 or 9.3.0.
    It seems to be Kokkos related, as it also happens for pure Kokkos builds. I have now reported this here: "Internal compiler error with Cuda/11.4 GCC/10.3.0", kokkos/kokkos#4334

  2. With gcc 10.2.0 or 9.3.0 I now see the following compile error:

[ 94%] Building CXX object parthenon/tst/unit/CMakeFiles/unit_tests.dir/test_state_descriptor.cpp.o
/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(312): error: qualified name is not allowed

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(312): error: expected a ";"

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(316): error: qualified name is not allowed

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(316): error: expected a ";"

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(317): error: identifier "expected_flags" is undefined

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(317): error: identifier "actual_flags" is undefined

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(317): error: identifier "expected_flags" is undefined

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(317): error: identifier "actual_flags" is undefined

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(39): warning: variable "<unnamed>::autoRegistrar1" was declared but never referenced

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(61): warning: variable "<unnamed>::autoRegistrar8" was declared but never referenced

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(78): warning: variable "<unnamed>::autoRegistrar13" was declared but never referenced

/mnt/home/gretephi/src/athenapk/external/parthenon/tst/unit/test_state_descriptor.cpp(294): warning: variable "<unnamed>::autoRegistrar51" was declared but never referenced

I will bisect your cmake options tomorrow, as I don't see the issue with my configuration.

  3. Your athenapk repository seems to be private (I cannot access it/don't see it). Did you make any modifications to the Parthenon submodule that could result in the test failing (as I haven't been able to reproduce the failed test yet)?

@kurtsansom
Collaborator Author

I didn’t realize my athenapk fork was private by default, should be public now.

@pgrete
Collaborator

pgrete commented Sep 16, 2021

I think I found the bug.
#include <set> was missing at the top of test_state_descriptor.cpp.
I assume this didn't show up before because, with the options we use for testing, that header is pulled in indirectly through some other include.
I have now fixed this in the parthenon::pgrete/expose-flux-div branch and was also able to verify that the regression test passes.
It'd be great if you could confirm that this fix also works on your end.
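A missing include that only compiles because another header happens to pull it in is easy to miss until the configuration changes. As a rough sketch, a grep-based check like the one below can flag a file that mentions `std::set` without including `<set>` itself. The function name `includes_set_if_used` is hypothetical, and the check assumes the test file uses `std::set` directly (the thread only states that the include was missing); it is a heuristic, not a substitute for a real include checker.

```shell
# Hypothetical lint: succeed only if a file that mentions std::set also includes <set>.
# Usage: includes_set_if_used path/to/file.cpp
includes_set_if_used() {
  if grep -q 'std::set' "$1"; then
    # The file uses std::set, so require a direct #include <set>.
    grep -Eq '^[[:space:]]*#[[:space:]]*include[[:space:]]*<set>' "$1"
  fi
  # Files that never mention std::set pass (exit status 0).
}
```

For example, `includes_set_if_used tst/unit/test_state_descriptor.cpp || echo "missing #include <set>"` would print the message only when the usage is present and the include is not.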

Out of curiosity: Did you specify the options above (e.g., enabling some test but disabling others) on purpose or just during debugging attempts?
I'm asking as my typical cmake line is significantly shorter and I'm wondering if there are some default values set (in AthenaPK) that don't meet your use case.

@kurtsansom
Collaborator Author

kurtsansom commented Sep 16, 2021

@pgrete thank you so much. I fiddled with the test switches to isolate what I thought mattered most at the moment (i.e., if the unit tests don't work, does running the regression tests matter?).

I generally try to keep the environment clean so I don't get myself into a bind with clashes and dependency hell. This means having to explicitly tell cmake where things are. I generally use ccmake and fill things iteratively and there isn't a workflow I know of that makes it easy to go from that to the cmake command line arguments (it would be nice!)
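One possible way to go from a ccmake session back to command-line arguments is to read the cache that ccmake wrote. The sketch below assumes input lines in the `NAME:TYPE=VALUE` form that `cmake -N -L <build-dir>` prints and that `CMakeCache.txt` stores; the helper name `cache_to_flags` is made up for illustration, and values containing spaces would still need quoting by hand.

```shell
# Hypothetical helper: turn "NAME:TYPE=VALUE" cache lines read from stdin
# into one -D argument per line.
cache_to_flags() {
  # Keep only cache entries (drops comments, blank lines, and headers),
  # skip INTERNAL/STATIC entries that cmake manages itself,
  # then prefix each remaining entry with -D.
  grep -E '^[A-Za-z_][A-Za-z0-9_.-]*:[A-Z]+=' |
    grep -vE ':(INTERNAL|STATIC)=' |
    sed -e 's/^/-D/'
}
```

A usage sketch: `cmake -N -L /path/to/build | cache_to_flags` lists the non-advanced options as `-D` arguments, which can then be trimmed down to the handful that differ from the defaults.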

I am also using spack to make it easier to get the hdf5 libraries, which then led to using a newer compiler.

If you have an example cmake line with fewer options I would be happy to try that; the one above is what I came up with from the ccmake iterations.

UPDATE: had some trouble redoing my environment and haven't gotten it tested yet but will keep working on it.

@pgrete
Collaborator

pgrete commented Sep 17, 2021

I generally try to keep the environment clean so I don't get myself into a bind with clashes and dependency hell. This means having to explicitly tell cmake where things are.

I understand. My approach here is to use a consistent module/spack environment, i.e., calling module purge; module load ... and then relying on the defaults that cmake picks up.
So far this has been working fine for me (though I see the appeal/advantages of explicitly specifying everything).

I generally use ccmake and fill things iteratively and there isn't a workflow I know of that makes it easy to go from that to the cmake command line arguments (it would be nice!)
If you have an example cmake line with fewer options I would be happy to try that; the one above is what I came up with from the ccmake iterations.

Have you seen the "machine configuration" capability of Parthenon?
For example, you can create a default configuration (or even multiple ones) depending on your local environment (cf., https://github.com/lanl/parthenon/blob/develop/cmake/machinecfg/Summit.cmake) and then override options from the command line.

In other words, you can create a simple file /home/user/mycfg.cmake containing just

set(CMAKE_BUILD_TYPE "Release" CACHE STRING "Release build")
set(Kokkos_ARCH_SKX ON CACHE BOOL "Use Skylake arch")
set(Kokkos_ARCH_VOLTA70 ON CACHE BOOL "Use Volta GPU arch")
set(Kokkos_ENABLE_CUDA ON CACHE BOOL "Enable Cuda")

then either do a cmake -DMACHINE_CFG=/home/user/mycfg.cmake .. or set the MACHINE_CFG environment variable (which will be picked up automatically) and use a plain cmake ...

As mentioned above, all command line options take precedence, so if I want a debug build I simply call cmake -DCMAKE_BUILD_TYPE=Debug .. and only that option will be overridden from the defaults set in the file.

@pgrete
Collaborator

pgrete commented Sep 17, 2021

Oh, one more thing. We try to set some "reasonable" defaults in AthenaPK's CMakeLists.txt that configure Parthenon (e.g., with respect to tests or performance) so that only very few (to none) Parthenon options should be required for a typical AthenaPK build.
If you have input on whether our defaults are "reasonable" or if you'd have expected/picked something different, let us know.

@kurtsansom
Collaborator Author

@pgrete, thank you for the input about configurations. I usually just make a shell script, but I like the cmake approach. I pared down the cmake options and am currently testing out the develop branch. The unit test is now passing.

@Yurlungur
Collaborator

Glad to hear your issue is resolved, @kurtsansom . I'm going to close this issue, but feel free to re-open it if you run into more trouble.

@dholladay00
Collaborator

I am seeing warnings similar to what is shown above warning: variable "<unnamed>::autoRegistrar1" was declared but never referenced, do you know how to get rid of them? My guess is it's a catch2 weirdness.

@pgrete
Collaborator

pgrete commented Mar 31, 2023

Are you just seeing warnings or also hitting the ICEs?

@pgrete
Collaborator

pgrete commented Mar 31, 2023

We might need to bump catch2, as the warnings seem to have been fixed in more recent versions; see catchorg/Catch2#2427

@dholladay00
Collaborator

It's just the warning, and it went away with a newer version of gcc. I suspect a newer catch2, a newer gcc, or both will fix it, so it seems like this is catch's fault and I won't worry about it anymore.
