impi/2021.9.0-intel-compilers-2023.1.0 sanity check fails when using RPATH due to missing libfabric #20295

Open
cgross95 opened this issue Apr 4, 2024 · 0 comments

cgross95 commented Apr 4, 2024

I'm trying to build impi/2021.9.0-intel-compilers-2023.1.0, and it fails during the sanity check. I believe this is because the build has RPATH enabled and libfabric is not being found at run time.

The sanity check builds a small test program with mpicc -cc=icx ... -o mpi_test. The compilation succeeds, but running the result with mpirun -n 8 .../mpi_test fails without much useful output:

== 2024-04-04 16:41:21,155 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/tools/build_log.py:111 in caller_info): Sanity check failed: sanity check command mpirun -n 8 /tmp/eessi-build.gkHxuLQPcO/easybuild/build/impi/2021.9.0/intel-compilers-2023.1.0/mpi_test exited with code 143 (output: Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1546)..............: 
MPIDI_OFI_mpi_init_hook(1480): 
(unknown)(): Other MPI error
...
Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1546)..............: 
MPIDI_OFI_mpi_init_hook(1480): 
(unknown)(): Other MPI error
) (at easybuild/framework/easyblock.py:3661 in _sanity_check_step)
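For anyone trying to reproduce this outside EasyBuild, the failing step boils down to roughly the following (the mpi_test.c below is a stand-in I wrote for this report; EasyBuild generates its own test source, so the exact contents may differ):

cat > mpi_test.c << 'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* MPI_Init is where the runs above abort with the PMPI_Init error */
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -cc=icx mpi_test.c -o mpi_test
mpirun -n 8 ./mpi_test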

Some digging (reproducing the install environment and running with I_MPI_DEBUG=30 mpirun -v -n 2 .../mpi_test) shows me:

[1] MPI startup(): failed to load libfabric: libfabric.so.1: cannot open shared object file: No such file or directory
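
To see where the dynamic loader actually searches for libfabric.so.1, glibc's loader debug output can help (a diagnostic sketch, run in the same environment; the output is very verbose, hence the grep):

# Trace the loader's library search at run time and keep only the libfabric lines
LD_DEBUG=libs ./mpi_test 2>&1 | grep -i fabric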

I originally thought the problem might be that mpicc -cc=icx was not using the RPATH-wrapped icx, since readelf -d .../mpi_test shows:

Dynamic section at offset 0x2dc8 contains 28 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libmpifort.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib]
...
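
Neither of those two RUNPATH directories is the libfabric directory (the libfabric libraries ship under .../mpi/2021.9.0/libfabric/lib, which shows up in the RUNPATH further below), so as a quick check whether libfabric.so.1 is visible from them at all (paths copied from the output above):

# Check whether the RUNPATH directories of the original binary contain libfabric
ls /opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release | grep -i fabric
ls /opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib | grep -i fabric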

I then tried forcing a recompilation of the test through the RPATH wrapper, which gave me:

Dynamic section at offset 0x2dc8 contains 28 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libmpifort.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libmpi.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/lib:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/lib64:$ORIGIN:$ORIGIN/../lib:$ORIGIN/../lib64:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib:/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/UCX/1.14.1-GCCcore-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/numactl/2.0.16-GCCcore-12.3.0/lib:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/tbb/2021.9.0/lib/intel64/gcc4.8:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/compiler/lib/intel64_lin:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/lib/x64:/opt/software-current/2023.06/x86_64/generic/software/intel-compilers/2023.1.0/compiler/2023.1.0/linux/lib]
...

but this gave the same error (even though the RUNPATH now includes the path to the libfabric libraries).
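
This time the RUNPATH does include the libfabric directory, and libfabric.so.1 is indeed present on disk there (path taken from the RUNPATH above):

# Confirm libfabric.so.1 exists in the libfabric directory listed in the new RUNPATH
ls -l /opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib/libfabric.so.1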

Eventually running with

LD_LIBRARY_PATH=/opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/libfabric/lib mpirun -n 2 ./mpi_test

succeeds with no errors.
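
One more observation that may matter: libfabric.so.1 is not among the NEEDED entries of mpi_test shown above, so it is presumably opened by libmpi at MPI startup rather than linked directly. A sketch of how to check that (assuming libmpi.so.12 resolves from the release directory listed in the RUNPATH above):

# Check whether libfabric shows up anywhere in the binary's link-time dependency tree
ldd ./mpi_test | grep -i fabric
# Inspect libmpi's own dynamic section for NEEDED and RPATH/RUNPATH entries
readelf -d /opt/software-current/2023.06/x86_64/generic/software/impi/2021.9.0-intel-compilers-2023.1.0/mpi/2021.9.0/lib/release/libmpi.so.12 | grep -E 'NEEDED|RPATH|RUNPATH'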

So I'm not sure why the executable is not picking up the libfabric libraries when compiled with RPATH. Any help would be greatly appreciated! As a side note, this is being done on top of EESSI, so if there's anything relevant there that I can share, please let me know.
