Various runtime issues running numerical simulation #1917

Open
PhilipDeegan opened this issue Apr 13, 2024 · 26 comments

Comments

@PhilipDeegan

Hi there,

I work on a somewhat complicated project for modeling astrophysical systems.

I am trying to use uftrace to log the call stack, and I'm running into some segfaults and similar issues.

We have many tests and run with ASAN and the like, so this segfault is somewhat unusual to me, but I can't rule out that it is indeed a problem with our code rather than uftrace. We use pybind11 and execute some Python scripts via the embedded interpreter from our native binary entrypoint, in case that might be causing some issues.

Running uftrace on our native binaries results in the following cases:

> uftrace record  ./build/src/phare/phare-exe src/phare/phare_init_small.py

WARN: process crashed by signal 11: Segmentation fault (si_code: 128)
WARN:  if this happens only with uftrace, please consider -e/--estimate-return option.

WARN: Backtrace from uftrace  ( x86_64 dwarf python3 tui perf sched )
WARN: =====================================
...
segfault

with -e

> uftrace record -e  ./build/src/phare/phare-exe src/phare/phare_init_small.py

WARN: Segmentation fault: address not mapped (addr: 0x55670bc7bec0)
WARN: Backtrace from uftrace  ( x86_64 dwarf python3 tui perf sched )
WARN: =====================================
...
segfault in a different place

If I use this command (picked up from)

uftrace record -d /tmp/uftrace.data -vvv --logfile=/tmp/uftrace.log --no-libcall  ./build/src/phare/phare-exe src/phare/phare_init_small.py

uftrace exits with -1 after a second or so, leaving the child process running (but taking 0% CPU, so not being scheduled)

Any suggestions are welcome

Thanks

@PhilipDeegan
Author

for reference

./configure --prefix=$PWD
uftrace detected system features:
...         prefix: /home/p/git/uftrace
...         libelf: [ on  ] - more flexible ELF data handling
...          libdw: [ on  ] - DWARF debug info support
...      libpython: [ on  ] - python tracing & scripting support
...      libluajit: [ OFF ] - luajit scripting support
...    libncursesw: [ on  ] - TUI support
...   cxa_demangle: [ on  ] - full demangler support with libstdc++
...     perf_event: [ on  ] - perf (PMU) event support
...       schedule: [ on  ] - scheduler event support
...       capstone: [ OFF ] - full dynamic tracing support
...  libtraceevent: [ OFF ] - kernel tracing support
...      libunwind: [ OFF ] - stacktrace support (optional for debugging)

> git log -1
commit 17df844f1488a9a90e218b0a2ba19d1936e4cfb2 (HEAD -> master, origin/master, origin/HEAD)
Merge: 6ea5ba2 e37be98
Author: Namhyung Kim <namhyung@gmail.com>
Date:   Wed Apr 10 20:45:06 2024 -0700

    Merge branch 'misc-update'
    
    Use uftrace_basename() to keep the input string untouched like in the
    GNU version.
    
    A recent change in musl libc introduced an issue with the basename(3)
    that where we can find the declaration (string.h vs. libgen.h).  It
    also brings a subtle difference in the implementation.  Let's be clear
    with our own implementation.
    
    Fixed: #1909
    Signed-off-by: Namhyung Kim <namhyung@gmail.com>

> gcc -v
...
gcc version 12.2.0 (Debian 12.2.0-14)

@namhyung
Owner

Sorry for the late reply. Can you please share the log file? It'd be hard to know without more information.

Is there anything special about your program? I guess it's written in C/C++ and Python. Did you build with -pg or something for the C/C++ part? Or do you want to trace the Python part only?
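
For reference, a rough sketch of the two modes being asked about; the gcc/uftrace flags are standard, but the exact CMake invocation is only an assumption about how the project is configured:

# C/C++ tracing: build the native code with -pg, then record directly
$ CFLAGS="-pg" CXXFLAGS="-pg" cmake -B build && make -C build -j4
$ uftrace record ./build/src/phare/phare-exe src/phare/phare_init_small.py

# Python-only tracing: if the script can run standalone and uftrace was built
# with libpython (as in the configure output above), something like this may work
$ uftrace record python3 src/phare/phare_init_small.py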

@honggyukim
Collaborator

Hi @PhilipDeegan,

I'm trying to reproduce your problem, so I've tried to build your project as follows.

$ sudo apt install libhdf5-dev

$ git clone --recursive https://github.com/PHAREHUB/PHARE.git
$ cd PHARE
$ cmake -B build
$ make -C build -j4

But the execution itself without uftrace isn't successful.

$ ./build/src/phare/phare-exe src/phare/phare_init_small.py

                  _____   _    _            _____   ______
                 |  __ \ | |  | |    /\    |  __ \ |  ____|
                 | |__) || |__| |   /  \   | |__) || |__
                 |  ___/ |  __  |  / /\ \  |  _  / |  __|
                 | |     | |  | | / ____ \ | | \ \ | |____
                 |_|     |_|  |_|/_/    \_\|_|  \_\|______|

creating python data provider
python input detected, building with python provider...
reading user inputs...terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ModuleNotFoundError: No module named 'pyphare.pharein'
Aborted (core dumped)

Is there anything I'm missing for the execution?

If this can be reproduced from our side, then we might be able to help you better.

@PhilipDeegan
Author

PhilipDeegan commented Apr 19, 2024

Gentlemen

@namhyung
No worries, I appreciate the response; I will see about getting you the logs.
I have compiled my own code with -pg, but we have dependencies (libhdf5/libmpi) that are not built with -pg.
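
As a hedged aside, functions inside libhdf5/libmpi won't appear in the trace anyway since they aren't instrumented with -pg; only their PLT entry points get recorded as library calls, and those can be dropped with --no-libcall to cut overhead, e.g.:

$ uftrace record --no-libcall ./build/src/phare/phare-exe src/phare/phare_init_small.py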

@honggyukim
from the source root directory, you should export the following

export PYTHONPATH="$PWD:$PWD/build:$PWD/pyphare"

This is done automatically for you via cmake/ctest, but running directly from the CLI needs the export.
That should let you run everything.
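
Putting that together, a full run from the source root would look roughly like this (assuming the build directory is ./build, as elsewhere in this thread):

export PYTHONPATH="$PWD:$PWD/build:$PWD/pyphare"
./build/src/phare/phare-exe src/phare/phare_init_small.py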

@PhilipDeegan
Author

(clicked wrong button posting the previous comment...)

@PhilipDeegan
Author

@honggyukim

I'm not sure libhdf5-dev is the parallel version; it might not make a difference, but I can't be sure either.

typically I would do sudo apt-get install libopenmpi-dev libhdf5-openmpi-dev

@honggyukim
Collaborator

honggyukim commented Apr 19, 2024

I'm not sure libhdf5-dev is the parallel version, it might not make a difference

It looks like it runs with a single thread.

The execution output looks as follows.

$ ./build/src/phare/phare-exe src/phare/phare_init_small.py
                  _____   _    _            _____   ______
                 |  __ \ | |  | |    /\    |  __ \ |  ____|
                 | |__) || |__| |   /  \   | |__) || |__
                 |  ___/ |  __  |  / /\ \  |  _  / |  __|
                 | |     | |  | | / ____ \ | | \ \ | |____
                 |_|     |_|  |_|/_/    \_\|_|  \_\|______|

creating python data provider
python input detected, building with python provider...
reading user inputs...src.phare.phare_init_small
validating dim=1
done!
At :/home/honggyu/work/PHARE/subprojects/samrai/source/SAMRAI/hier/PatchHierarchy.cpp line :326 message: PHARE_hierarchy:  Using zero `proper_nesting_buffer' values.
/home/honggyu/work/PHARE/subprojects/highfive/include/highfive/bits/H5ReadWrite_misc.hpp: 152 [WARN] /t/0.0000000000/pl0/p0#0/density": data has higher floating point precision than hdf5 dataset on write: Float64 -> Float32
/home/honggyu/work/PHARE/subprojects/highfive/include/highfive/bits/H5ReadWrite_misc.hpp: 152 [WARN] /t/0.0000000000/pl0/p0#0/bulkVelocity_z": data has higher floating point precision than hdf5 dataset on write: Float64 -> Float32
/home/honggyu/work/PHARE/subprojects/highfive/include/highfive/bits/H5ReadWrite_misc.hpp: 152 [WARN] /t/0.0000000000/pl0/p0#0/bulkVelocity_y": data has higher floating point precision than hdf5 dataset on write: Float64 -> Float32
/home/honggyu/work/PHARE/subprojects/highfive/include/highfive/bits/H5ReadWrite_misc.hpp: 152 [WARN] /t/0.0000000000/pl0/p0#0/bulkVelocity_x": data has higher floating point precision than hdf5 dataset on write: Float64 -> Float32
/home/honggyu/work/PHARE/subprojects/highfive/include/highfive/bits/H5ReadWrite_misc.hpp: 152 [WARN] /t/0.0000000000/pl0/p0#0/EM_E_z": data has higher floating point precision than hdf5 dataset on write: Float64 -> Float32
/home/honggyu/work/PHARE/subprojects/highfive/include/highfive/bits/H5ReadWrite_misc.hpp: 152 [WARN] /t/0.0000000000/pl0/p0#0/EM_E_y": data has higher floating point precision than hdf5 dataset on write: Float64 -> Float32
    ...

After a while, the program finished normally. So I'm trying it with uftrace record, but it takes much longer and I see it's still running. I will wait a bit more and then see what I can do.

typically I would do sudo apt-get install libopenmpi-dev libhdf5-openmpi-dev

I will try that for the next try.

@honggyukim
Collaborator

Hmm.. the uftrace record run finished after 993 seconds, while the original execution took only about 26 seconds.

It looks like the execution finished with some issues, as follows.

$ uftrace record -P. -t 100us -v ./build/src/phare/phare-exe src/phare/phare_init_small.py
        ...
/home/honggyu/work/PHARE/subprojects/highfive/include/highfive/bits/H5ReadWrite_misc.hpp: 152 [WARN] /t/0.2300000000/pl1/p0#0/EM_B_x": data has higher floating point precision than hdf5 dataset on write: Float64 -> Float32
HDF5-DIAG: Error detected in HDF5 (1.10.7) thread 1:
  #000: ../../../src/H5D.c line 152 in H5Dcreate2(): unable to create dataset
    major: Dataset                                                                                                                                       
    minor: Unable to initialize object 
  #001: ../../../src/H5Dint.c line 338 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: ../../../src/H5L.c line 1605 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: ../../../src/H5L.c line 1846 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: ../../../src/H5Gtraverse.c line 848 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: ../../../src/H5Gtraverse.c line 624 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: ../../../src/H5L.c line 1641 in H5L__link_cb(): name already exists
    major: Links
    minor: Object already exists
terminate called after throwing an instance of 'HighFive::DataSetException'
  what():  Failed to create the dataset "/t/0.2400000000/pl0/p0#0/density": (Links) Object already exists
mcount: mcount trace finished
mcount: exit from libmcount
WARN: child terminated by signal: 6: Aborted

@honggyukim
Collaborator

Anyway, the above abnormal record still shows some traces, and its uftrace tui output looks as follows.

  TOTAL TIME : FUNCTION                                                                                                                                  
   16.031  m : (1) phare-exe                                                                                                                             
   16.031  m : (1) main                                                                                                                                  
  594.675 ms :  ├▶(1) PHARE::SamraiLifeCycle::SamraiLifeCycle                                                                                            
             :  │                                                                                                                                        
   15.306 ms :  ├▶(1) fromCommandLine                                                                                                                    
             :  │                                                                                                                                        
  932.576 ms :  ├▶(1) PHARE::initializer::PythonDataProvider::read                                                                                       
             :  │                                                                                                                                        
   68.648 ms :  ├▶(1) PHARE::getSimulator                                                                                                                
             :  │                                                                                                                                        
    1.391  s :  ├▶(1) PHARE::Simulator::initialize                                                                                                       
             :  │                                                                                                                                        
    5.907  s :  ├▶(241) PHARE::Simulator::dump                                                                                                           
             :  │                                                                                                                                        
   16.022  m :  └─(240) PHARE::Simulator::advance                                                                                                        
   16.022  m :    (240) PHARE::amr::Integrator::advance                                                                                                  
   16.022  m :    (240) SAMRAI::algs::TimeRefinementIntegrator::advanceHierarchy                                                                         
   16.022  m :    (240) SAMRAI::algs::TimeRefinementIntegrator::advanceRecursivelyForRefinedTimestepping                                                 
    4.045  m :     ├▶(240) PHARE::solver::MultiPhysicsIntegrator::advanceLevel                                                                           
             :     │                                                                                                                                     
   11.030  m :     ├─(240) SAMRAI::algs::TimeRefinementIntegrator::advanceRecursivelyForRefinedTimestepping                                              
   11.030  m :     │ (960) PHARE::solver::MultiPhysicsIntegrator::advanceLevel                                                                           
    3.715  s :     │  ├▶(240) PHARE::amr::HybridMessenger::firstStep                                                                                     
             :     │  │                                                                                                                                  
   11.023  m :     │  └─(960) PHARE::solver::SolverPPC::advanceLevel                                                                                     
   22.376  s :     │     ├▶(960) PHARE::solver::SolverPPC::predictor1_                                                                                   
             :     │     │                                                                                                                               
    9.046  m :     │     ├─(1920) PHARE::solver::SolverPPC::moveIons_                                                                                    
    8.053  m :     │     │  ├─(1920) PHARE::core::IonUpdater::updatePopulations                                                                          
    4.027  m :     │     │  │  ├─(960) PHARE::core::IonUpdater::updateAndDepositDomain_                                                                  
    3.012  m :     │     │  │  │  ├─(960) PHARE::core::BorisPusher::move                                                                                 
   12.308  s :     │     │  │  │  │ (960) PHARE::core::BorisPusher::prePushStep_                                                                         
             :     │     │  │  │  │                                                                                                                      
   55.400  s :     │     │  │  │  ├─(960) PHARE::core::Interpolator::operator()                                                                          
             :     │     │  │  │  │                                                                                                                      
   18.783  s :     │     │  │  │  └▶(960) PHARE::core::IonUpdater::updateAndDepositDomain_::$_3::operator()                                              
             :     │     │  │  │                                                                                                                         
    4.026  m :     │     │  │  └▶(960) PHARE::core::IonUpdater::updateAndDepositAll_                                                                     
             :     │     │  │                                                                                                                            
   22.395  s :     │     │  ├▶(1920) PHARE::amr::HybridMessenger::fillIonPopMomentGhosts                                                                 
             :     │     │  │                                                                                                                            
   19.750  s :     │     │  └▶(1920) PHARE::amr::HybridMessenger::fillIonMomentGhosts                                                                    
             :     │     │                                                                                                                               
   22.403  s :     │     ├▶(960) PHARE::solver::SolverPPC::predictor2_                                                                                   
             :     │     │                                                                                                                               
   29.995  s :     │     ├▶(960) PHARE::solver::SolverPPC::corrector_                                                                                    
             :     │     │                                                                                                                               
  226.751 ms :     │     └─(22) PHARE::solver::SolverPPC::average_                                                                                       
             :     │                                                                                                                                     
    6.176  s :     └▶(240) PHARE::solver::MultiPhysicsIntegrator::standardLevelSynchronization   

@PhilipDeegan
Author

@honggyukim

Trying it for myself with uftrace HEAD, it exits almost immediately:

uftrace record -P. -t 100us -v ./build/src/phare/phare-exe src/phare/phare_init_small.py
uftrace: running uftrace v0.15.2-52-g17df8 ( x86_64 dwarf python3 tui perf sched )
uftrace: checking binary ./build/src/phare/phare-exe
uftrace: using /path/to/lib/uftrace/libmcount.so library for tracing
uftrace: creating 4 thread(s) for recording
mcount: initializing mcount library
wrap: dlopen is called for 'libdebuginfod.so.1'
dynamic: dynamic patch type: phare-exe: 1 (pg)
dynamic: dynamic patch stats for 'phare-exe'
dynamic:    total:    50717
dynamic:  patched:        0 ( 0.00%)
dynamic:   failed:        0 ( 0.00%)
dynamic:  skipped:    50717 (100.00%)
dynamic: no match:        0
plthook: setup PLT hooking "./build/src/phare/phare-exe"
mcount: mcount setup done
mcount: new session started: 09610e440efbd0dc: phare-exe
WARN: child terminated by signal: 5: Trace/breakpoint trap
uftrace: reading uftrace.data/task.txt file
uftrace: flushing /uftrace-09610e440efbd0dc-55624-000

>$ echo $?
2

>$ uftrace tui
WARN: cannot open record data: uftrace.data: No data available

@PhilipDeegan
Author

for reference

> ldd build/src/phare/phare-exe
	linux-vdso.so.1 (0x00007fff5ccc0000)
	libphare_initializer.so => /path/to/build/src/initializer/libphare_initializer.so (0x00007f252a2de000)
	libmpi_cxx.so.40 => /usr/lib64/openmpi/lib/libmpi_cxx.so.40 (0x00007f252a2c5000)
	libhdf5.so.200 => /usr/lib64/openmpi/lib/libhdf5.so.200 (0x00007f2529e00000)
	libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40 (0x00007f2529cd8000)
	libpython3.12.so.1.0 => /opt/py/python-3.12.2/lib/libpython3.12.so.1.0 (0x00007f2529600000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f2529200000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f252951f000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f252a28b000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f252901e000)
	libopen-rte.so.40 => /usr/lib64/openmpi/lib/libopen-rte.so.40 (0x00007f2529466000)
	libopen-pal.so.40 => /usr/lib64/openmpi/lib/libopen-pal.so.40 (0x00007f2528f71000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f252a26f000)
	libhwloc.so.15 => /lib64/libhwloc.so.15 (0x00007f252a20c000)
	libevent_core-2.1.so.7 => /lib64/libevent_core-2.1.so.7 (0x00007f2529ca1000)
	libevent_pthreads-2.1.so.7 => /lib64/libevent_pthreads-2.1.so.7 (0x00007f252a207000)
	libsz.so.2 => /lib64/libsz.so.2 (0x00007f252a1fd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f252a2fc000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00007f2528f39000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f2529c97000)

@honggyukim
Collaborator

The dynamic: skipped: 50717 (100.00%) line means that none of the functions were actually patched, so uftrace records nothing. Your log also shows it crashed for some reason, with WARN: child terminated by signal: 5: Trace/breakpoint trap.

Did you compile your binary with the -pg option? The -P option is needed only when the -pg flag isn't used.
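
Roughly, the two cases look like this (same binary path as above); note the configure output earlier in the thread lists capstone as OFF, which is described there as "full dynamic tracing support", so the -P route may be limited on that build:

# built with -pg: functions are already instrumented, no -P needed
$ uftrace record ./build/src/phare/phare-exe src/phare/phare_init_small.py

# built without -pg: ask uftrace to patch functions dynamically
$ uftrace record -P. ./build/src/phare/phare-exe src/phare/phare_init_small.py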

@PhilipDeegan
Author

oh, I had used -pg, let me try again

@honggyukim
Collaborator

honggyukim commented Apr 19, 2024

typically I would do sudo apt-get install libopenmpi-dev libhdf5-openmpi-dev

I've tried it again with

$ sudo apt purge libhdf5-dev
$ sudo apt-get install libopenmpi-dev libhdf5-openmpi-dev

Then I tried the cmake configure step again, but it fails as follows.

-- HDF5 C compiler wrapper is unable to compile a minimal HDF5 program.
CMake Error at /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find HDF5 (missing: HDF5_LIBRARIES HDF5_INCLUDE_DIRS) (found
  version "")
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.22/Modules/FindHDF5.cmake:1009 (find_package_handle_standard_args)
  subprojects/samrai/cmake/thirdparty/SetupSAMRAIThirdParty.cmake:104 (find_package)
  subprojects/samrai/cmake/CMakeBasics.cmake:1 (include)
  subprojects/samrai/CMakeLists.txt:85 (include)

Should I keep libhdf5-dev? Or is there any special flag needed for cmake?

@PhilipDeegan
Author

Hmmm, it should work, assuming you don't have some cache somewhere still pointing to the old version

see on our GHA we just do the following

@PhilipDeegan
Author

Without -pg, it exits immediately, yet now without so many functions skipped:

uftrace record -P. -t 100us -v ./build/src/phare/phare-exe src/phare/phare_init_small.py
uftrace: running uftrace v0.15.2-52-g17df8 ( x86_64 dwarf python3 tui perf sched )
uftrace: checking binary ./build/src/phare/phare-exe
uftrace: removing uftrace.data.old directory
uftrace: using /home/deegan/git/uftrace/lib/uftrace/libmcount.so library for tracing
uftrace: creating 4 thread(s) for recording
mcount: initializing mcount library
wrap: dlopen is called for 'libdebuginfod.so.1'
dynamic: dynamic patch type: phare-exe: 0 (none)
dynamic: dynamic patch stats for 'phare-exe'
dynamic:    total:    50714
dynamic:  patched:        0 ( 0.00%)
dynamic:   failed:    50276 (99.13%)
dynamic:  skipped:      438 ( 0.86%)
dynamic: no match:        0
plthook: setup PLT hooking "/home/deegan/git/phare/stage/build/src/phare/phare-exe"
mcount: mcount setup done
WARN: child terminated by signal: 5: Trace/breakpoint trap

I will add for completeness: I am currently on Fedora, whereas I saw the segfaults on Debian. I will be on Debian again later to confirm.

@honggyukim
Collaborator

Hmmm, it should work, assuming you don't have some cache somewhere still pointing to the old version

I didn't use ccache and I also tried it from a clean build directory, but it shows the same problem.

I will add for completeness, I am currently on fedora, when I saw the segfaults on debian. I will be on debian again later to confirm

I haven't tested on Fedora recently. Do you see that other simple programs can be traced with uftrace?

@PhilipDeegan
Author

PhilipDeegan commented Apr 19, 2024

I haven't tested on fedora these days. Do you see other simple programs can be traced with uftrace?

It looks OK generally:

uftrace -P. echo yes
yes
# DURATION     TID     FUNCTION
   1.413 us [ 59875] | getenv();
   0.081 us [ 59875] | strrchr();
  37.509 us [ 59875] | setlocale();
   0.751 us [ 59875] | bindtextdomain();
   0.301 us [ 59875] | textdomain();
   0.161 us [ 59875] | __cxa_atexit();
   0.040 us [ 59875] | strcmp();
   0.030 us [ 59875] | strcmp();
   2.755 us [ 59875] | fputs_unlocked();
   6.372 us [ 59875] | __overflow();
   0.151 us [ 59875] | __fpending();
   0.190 us [ 59875] | fileno();
   0.141 us [ 59875] | __freading();
   0.030 us [ 59875] | __freading();
   0.190 us [ 59875] | fflush();
   0.922 us [ 59875] | fclose();
   0.030 us [ 59875] | __fpending();
   0.040 us [ 59875] | fileno();
   0.030 us [ 59875] | __freading();
   0.030 us [ 59875] | __freading();
   0.050 us [ 59875] | fflush();
   0.281 us [ 59875] | fclose();

@PhilipDeegan
Author

Oh, our third-party dependency might be caching the old hdf5.

You can try the following from the project root:

rm -rf subprojects/samrai
cmake -B build
make -C build -j4

@honggyukim
Collaborator

Hmm, the build script downloads samrai again after it's removed as you suggested.

$ rm -rf subprojects/samrai

$ cmake -B build2
SAMRAI NOT FOUND
Cloning into '/home/honggyu/work/uftrace/git/new/PHARE/subprojects/samrai'...
Submodule '.radiuss-ci' (https://github.com/LLNL/radiuss-ci.git) registered for path '.radiuss-ci'
Submodule 'blt' (https://github.com/LLNL/blt.git) registered for path 'blt'
Cloning into '/home/honggyu/work/uftrace/git/new/PHARE/subprojects/samrai/.radiuss-ci'...
Cloning into '/home/honggyu/work/uftrace/git/new/PHARE/subprojects/samrai/blt'...
Submodule path '.radiuss-ci': checked out 'f4e490e571af8341c1305362f0894856c4ff9ad4'
Submodule path 'blt': checked out '058b312f8a5ef305e12a4380deaa13d618eff54e'
    ...
-- HDF5 C compiler wrapper is unable to compile a minimal HDF5 program.
CMake Error at /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find HDF5 (missing: HDF5_LIBRARIES HDF5_INCLUDE_DIRS) (found
  version "")
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.22/Modules/FindHDF5.cmake:1009 (find_package_handle_standard_args)
  subprojects/samrai/cmake/thirdparty/SetupSAMRAIThirdParty.cmake:104 (find_package)
  subprojects/samrai/cmake/CMakeBasics.cmake:1 (include)
  subprojects/samrai/CMakeLists.txt:85 (include)

Then it still fails with the same error as above.

@PhilipDeegan
Author

Hmm, you can tell cmake where HDF5 is via cmake -DHDF5_ROOT=/path/to/hdf5

where the directory would be something like /usr/lib64/openmpi

when the library exists at /usr/lib64/openmpi/lib/libhdf5.so

You might not have a symlink in /usr/lib64 or something.

You can check with sudo updatedb && locate libhdf5, assuming you've installed mlocate via sudo apt-get install mlocate.
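
Condensed into one sequence (the HDF5_ROOT value is only an example and depends on where locate finds libhdf5 on your machine):

$ sudo apt-get install mlocate
$ sudo updatedb && locate libhdf5
$ cmake -B build -DHDF5_ROOT=/usr/lib64/openmpi
$ make -C build -j4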

@honggyukim
Collaborator

I get this after removing libhdf5-dev and then installing the parallel versions as follows.

$ sudo apt purge libhdf5-dev
$ sudo apt-get install libopenmpi-dev libhdf5-openmpi-dev

If this wasn't what you meant, then I will try again later. It's getting late, so I need to sleep now.

@PhilipDeegan
Author

@honggyukim no worries, thanks for your attention 🛌

@honggyukim
Collaborator

honggyukim commented Apr 20, 2024

Hi, I've tried it after installing libhdf5-dev again, then built it with the -pg option.

Then I see it takes a bit less than 2 minutes, as follows.

$ time ./build.pg/src/phare/phare-exe src/phare/phare_init_small.py
        ...
real    1m55.083s
user    1m54.220s
sys     0m0.369s

I've tried it with uftrace record; the record was successful, but it took more than 12 minutes even though I applied a 1ms time filter, as follows.

$ uftrace info
# system information
# ==================
# program version     : v0.15.2-52-g17df8 ( x86_64 dwarf python3 luajit tui perf sched dynamic kernel )
# recorded on         : Sat Apr 20 22:37:12 2024
# cmdline             : uftrace record -t 1ms --no-libcall ./build.pg/src/phare/phare-exe src/phare/phare_init_small.py
# cpu info            : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
# number of cpus      : 6 / 6 (online / possible)
# memory info         : 5.9 / 15.5 GB (free / total)
# system load         : 1.66 / 1.87 / 1.88 (1 / 5 / 15 min)
# kernel version      : Linux 6.6.0
# hostname            : bing
# distro              : "Ubuntu 22.04.1 LTS"
#
# process information
# ===================
# number of tasks     : 2
# task list           : 24124(phare-exe), 24127(orted)
# exe image           : /home/honggyu/work/PHARE/build.pg/src/phare/phare-exe
# build id            : d33efc56e41fe90c646a35996a749e4ca22019b8
# pattern             : regex
# exit status         : exited with code: 0
# elapsed time        : 759.843381581 sec
# cpu time            : 0.728 / 758.384 sec (sys / user)
# context switch      : 1171 / 24060 (voluntary / involuntary)
# max rss             : 82596 KB
# page fault          : 44 / 38031 (major / minor)
# disk iops           : 9688 / 6528 (read / write)

The record looks fine, but I'm just wondering whether the original execution is timing sensitive, since it takes much longer when uftrace is attached.
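
If the slowdown itself is the concern, a hedged option is to raise the time filter and keep --no-libcall, or exclude one of the hottest functions seen in the output above with -N; the hooks still fire, so this reduces rather than removes the overhead:

$ uftrace record -t 5ms --no-libcall ./build.pg/src/phare/phare-exe src/phare/phare_init_small.py
$ uftrace record -t 1ms --no-libcall -N 'PHARE::core::BorisPusher::move' ./build.pg/src/phare/phare-exe src/phare/phare_init_small.py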

@honggyukim
Collaborator

Part of the report output looks as follows.

$ uftrace report
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
   12.039  m    1.940 ms           1  main
   12.034  m    1.673 ms         300  PHARE::Simulator::advance
   12.034  m  234.057 us         300  PHARE::amr::Integrator::advance
   12.034  m  237.298 us         300  SAMRAI::algs::TimeRefinementIntegrator::advanceHierarchy
   12.034  m   22.145 ms         600  SAMRAI::algs::TimeRefinementIntegrator::advanceRecursivelyForRefinedTimestepping
   12.029  m   47.242 ms        1500  PHARE::solver::MultiPhysicsIntegrator::advanceLevel
   12.023  m  734.672 ms        1500  PHARE::solver::SolverPPC::advanceLevel
   10.051  m   10.429  s        3000  PHARE::solver::SolverPPC::moveIons_
   10.000  m   59.659 ms        3600  PHARE::core::IonUpdater::updatePopulations
    7.054  m    7.025  m        7201  PHARE::core::BorisPusher::move
    5.000  m  523.614 ms        1800  PHARE::core::IonUpdater::updateAndDepositDomain_
    4.059  m  598.109 ms        1800  PHARE::core::IonUpdater::updateAndDepositAll_
    3.046  m    2.024  m       17489  PHARE::core::Interpolator::operator()
    1.018  m  112.107 ms       22215  PHARE::amr::RefinerPool::fill
    1.017  m   11.396  s       22216  PHARE::amr::Refiner::fill
    1.006  m    6.840  s       42919  SAMRAI::xfer::RefineSchedule::fillData
   59.661  s   55.352  s       43213  SAMRAI::xfer::RefineSchedule::recursiveFill
   35.161  s    4.539 ms        4501  PHARE::amr::HybridMessenger::fillCurrentGhosts
        ...

@PhilipDeegan
Author

That's nice @honggyukim, but it's not really representative of what I'm doing, which is with parallel MPI.

If you check the output of ldd build/src/phare/phare-exe you can see whether it is similar to what I posted before.

It still fails for me:

uftrace  ./build/src/phare/phare-exe src/phare/phare_init_small.py
WARN: child terminated by signal: 5: Trace/breakpoint trap
# DURATION     TID     FUNCTION
            [ 26842] | _GLOBAL__sub_I_signal_handler() {
            [ 26842] |   __static_initialization_and_destruction_0() {
            [ 26842] |     std::pair::pair() {
   0.070 us [ 26842] |       std::forward();
 849.608 us [ 26842] |       /* linux:schedule */
  53.140 ms [ 26842] |       /* linux:schedule */
  17.438 ms [ 26842] |       /* linux:schedule */
            [ 26842] |       /* linux:task-exit */

uftrace stopped tracing with remaining functions
================================================
task: 26842
[2] std::pair::pair
[1] __static_initialization_and_destruction_0
[0] _GLOBAL__sub_I_signal_handler
