boost_adaptbx: method import_ext causing segmentation fault #796

Open
dermen opened this issue Oct 4, 2022 · 4 comments
@dermen
Contributor

dermen commented Oct 4, 2022

TL;DR: the call sys.setdlopenflags(0x100|0x2) in import_ext causes a segfault through an interaction between Boost.Python, mpi4py, and Eigen that I do not yet understand.

This is quite involved. I have a patch for it, but I would like to get to the bottom of what's going on.

Assume one has their own Boost.Python extension module, tester_ext.cpp:

#include <cstdio>   // printf
#include <vector>
#include <Eigen/Dense>
#include <Eigen/StdVector>
#include <boost/python.hpp>
#include <mpi4py/mpi4py.h>

typedef Eigen::Matrix<double,3,1> vec3;
typedef std::vector<vec3,Eigen::aligned_allocator<vec3> > eigVec3_vec;

int test(bool fix_segfault){

    vec3 vec(1,1,1);
    eigVec3_vec vecs;
    if (fix_segfault)
        vecs.reserve(1);
    vecs.push_back(vec);
    printf("OK\n");
    return 1;
}


BOOST_PYTHON_MODULE(tester_ext)
{
  if (import_mpi4py() < 0) return;
  boost::python::def("run_test", test);
}

Let's also assume that one wishes to use an existing extension module from cctbx (another_ext), whose source code is given by another.cpp:

#include <vector>
#include <Eigen/Dense>
#include <Eigen/StdVector>

class another{ 
  another();
  ~another(){};
};

another::another(){
    std::vector<Eigen::Vector3d,Eigen::aligned_allocator<Eigen::Vector3d> > vecs; 
    Eigen::Vector3d vec(0,0,0);
    vecs.push_back(vec);
}

and whose extension wrapper is another_ext.cpp:

#include <cstdio>   // printf
#include <boost/python.hpp>

BOOST_PYTHON_MODULE(another_ext)
{
  printf("import another\n");
}

After everything is built, running the following Python script with the --makeSegfault flag triggers the segfault:

import sys
from argparse import ArgumentParser
parser = ArgumentParser()
parser.add_argument("--makeSegfault", action="store_true")
parser.add_argument("--fixSegfault", action="store_true")
args = parser.parse_args()

import boost_adaptbx.boost.python as bp
if args.makeSegfault:
    bp.import_ext("another_ext")
    import tester_ext
else:  # switching the import order avoids the segfault; I don't know why
    import tester_ext
    bp.import_ext("another_ext")

tester_ext.run_test(args.fixSegfault)
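
To compare the flag combinations without watching the terminal, the script can be driven from a small harness that inspects the child's return code. This is only a sketch: `classify_exit` and `compare_import_orders` are hypothetical helper names, and it assumes a POSIX system, where `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys


def classify_exit(cmd):
    """Run cmd and classify how the child process ended."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # On POSIX, subprocess reports death by signal N as return code -N.
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    if proc.returncode == 0:
        return "ok"
    return "exit code %d" % proc.returncode


def compare_import_orders(script, interpreter=sys.executable):
    """Run the reproduction script once per flag combination."""
    results = {}
    for flags in ((), ("--makeSegfault",), ("--makeSegfault", "--fixSegfault")):
        results[flags] = classify_exit([interpreter, script] + list(flags))
    return results
```

For example, `compare_import_orders("repro.py", interpreter="libtbx.python")` would run the three variants, where `repro.py` is an assumed file name for the script above.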

This issue appears to be platform dependent: tested at NERSC, it segfaults on Cori GPU but not on Perlmutter. Replacing bp.import_ext("another_ext") with a plain import another_ext avoids the segfault regardless of the --makeSegfault flag. Commenting out the line if (import_mpi4py() < 0) return; in tester_ext.cpp also prevents it, again regardless of the flag, as does commenting out vecs.push_back(vec) in another.cpp. Lastly (see the build script below), if another.o is left out of the link step that produces another_ext.so, the segfault is not triggered.
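
Since a plain import and bp.import_ext behave differently here, it may help to import a module under explicitly chosen dlopen flags and compare the outcomes. The following is a minimal sketch, not boost_adaptbx's implementation: `dlopen_flags` and `import_with_flags` are hypothetical names, and `sys.getdlopenflags`/`sys.setdlopenflags` exist only on Unix.

```python
import contextlib
import importlib
import os
import sys


@contextlib.contextmanager
def dlopen_flags(flags):
    """Temporarily change the flags CPython passes to dlopen(), then restore them."""
    previous = sys.getdlopenflags()
    sys.setdlopenflags(flags)
    try:
        yield
    finally:
        # Restore even if the import raises, so later imports are unaffected.
        sys.setdlopenflags(previous)


def import_with_flags(name, flags):
    """Import a module by name while the given dlopen flags are active."""
    with dlopen_flags(flags):
        return importlib.import_module(name)
```

With this, `import_with_flags("another_ext", os.RTLD_NOW | os.RTLD_GLOBAL)` and `import_with_flags("another_ext", os.RTLD_NOW | os.RTLD_LOCAL)` could be tried in turn to see whether global symbol visibility alone is enough to trigger the crash.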

Example build script for python3.8:

#!/bin/bash

CPRE=/path/to/cctbx/conda_base
CCTBX_MOD=/path/to/cctbx/modules

EIG_INC=-I${CCTBX_MOD}/eigen
CONDA_INC=-I${CPRE}/include
PY_INC=-I${CPRE}/include/python3.8
MPI4PY_INC=$(libtbx.python -c "import mpi4py;print('-I'+mpi4py.get_include())")

CONDA_LIB=-L${CPRE}/lib

g++ -c another.cpp  $EIG_INC $CONDA_INC -lboost_python38 -lboost_system  -lboost_numpy38  -lstdc++ -fPIC -O3   -o another.o  

g++ -c another_ext.cpp  $EIG_INC $CONDA_INC  -lboost_python38 -lboost_system  -lboost_numpy38  -lstdc++ -fPIC -O3   -o another_ext.o  

g++ -shared another_ext.o another.o $CONDA_LIB  -lboost_numpy38  -lboost_python38 -o another_ext.so

mpic++ -c tester_ext.cpp  $EIG_INC $CONDA_INC $PY_INC $MPI4PY_INC  -lboost_python38 -lboost_system  -lboost_numpy38  -lstdc++ -fPIC -O3   -o tester_ext.o  

mpic++ -shared tester_ext.o $CONDA_LIB -lboost_numpy38  -lboost_python38 -o tester_ext.so
@dermen
Contributor Author

dermen commented Oct 4, 2022

Note: the relevant line in boost_adaptbx is sys.setdlopenflags(0x100|0x2). Commenting out that line prevents the segfault.
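
For readers unfamiliar with the literal, here is a small sketch that decodes it. The numeric values are the ones glibc's `<dlfcn.h>` uses on Linux; other platforms use different numbers. `decode_dlopen_flags` is a hypothetical helper, not part of any library.

```python
# Flag values from glibc's <dlfcn.h> on Linux (assumption: other platforms differ).
GLIBC_DLOPEN_FLAGS = {
    "RTLD_LAZY": 0x001,
    "RTLD_NOW": 0x002,
    "RTLD_NOLOAD": 0x004,
    "RTLD_DEEPBIND": 0x008,
    "RTLD_GLOBAL": 0x100,
    "RTLD_NODELETE": 0x1000,
}


def decode_dlopen_flags(value):
    """Return the names of the glibc dlopen flags set in value."""
    return [name for name, bit in GLIBC_DLOPEN_FLAGS.items() if value & bit]


print(decode_dlopen_flags(0x100 | 0x2))  # ['RTLD_NOW', 'RTLD_GLOBAL']
```

On Linux the same value can be written portably as `os.RTLD_GLOBAL | os.RTLD_NOW`, which avoids hard-coding the numbers.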

@bkpoon
Member

bkpoon commented Oct 4, 2022

What happens if bp.import_ext is used to import tester_ext?

@ndevenish
Contributor

I had a recent SEGV issue in BOOST_PYTHON_MODULE that I tracked down to a compiler version bug, though this looks different.

FWIW I've never liked sys.setdlopenflags(0x100|0x2). I think it's loading things with RTLD_GLOBAL to compensate for the fact that the build scripts mostly don't link with e.g. -lboost_python38 (which you are doing here, so it is redundant); it shouldn't be necessary. I vaguely recall reading an early Boost.Python thread where RWGK realised this, but it was too late to change in cctbx.

That said, I'm not 100% sure it's the same issue but I've also seen several bugs arise from linking both -lpython3.8. I don't think you are supposed to link to libpython (e.g. the manylinux instructions), because any python interpreter that is loading your dyld will already have it loaded, and it can cause problems if - as in conda, for instance - the python you are using doesn't have a libpython (conda IIRC builds it statically) and so linking to it can cause the system libpython to be picked up instead. Obviously linking to multiple python symbol sets (even ~ the same version) is a recipe for problems, and we've definitely accidentally run into this a couple of times.

In fact, if it is this problem, the RTLD_GLOBAL flag possibly makes sense, because it could be clobbering the already-global symbols that the running interpreter is using?
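
One way to check for duplicate symbol sets is to list which shared libraries are actually mapped into the running interpreter. The sketch below assumes Linux, where `/proc/self/maps` lists the mappings; `mapped_libraries` and `libraries_in_this_process` are hypothetical helper names. A statically linked libpython will not show up as a separate file.

```python
import os
import re


def mapped_libraries(maps_text, pattern):
    """Return sorted mapped file paths whose basename matches pattern."""
    regex = re.compile(pattern)
    paths = set()
    for line in maps_text.splitlines():
        # Each line reads: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if path.startswith("/") and regex.search(os.path.basename(path)):
                paths.add(path)
    return sorted(paths)


def libraries_in_this_process(pattern):
    """Apply mapped_libraries to the current process (Linux only)."""
    with open("/proc/self/maps") as handle:
        return mapped_libraries(handle.read(), pattern)
```

Calling `libraries_in_this_process(r"libpython|libboost_python")` after both extension imports would show whether more than one copy of either library has been loaded.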

@dermen
Contributor Author

dermen commented Oct 4, 2022

> What happens if bp.import_ext is used to import tester_ext?

@bkpoon, I still get the segfault.

> That said, I'm not 100% sure it's the same issue but I've also seen several bugs arise from linking both -lpython3.8

@ndevenish Thanks for the tip, I was unaware of this! However, after removing the -lpython3.8 flags and rebuilding, I can still reproduce the segfault. I have removed those flags from the example build script above.
