io_uring support leads to crashes on ppcle64 machines. #4283

Open
simotek opened this issue Jan 12, 2024 · 12 comments

@simotek

simotek commented Jan 12, 2024

  • Version: 1.45
  • Platform: openSUSE Tumbleweed ppc64le (kernel 6.6.7-1-default)

We have been tracking a segfault in the cmake test suite [1], [2] on ppc64le systems and have traced it back to the introduction of io_uring support in libuv (#3952).

  1. https://bugzilla.opensuse.org/show_bug.cgi?id=1218365#c12
  2. https://gitlab.kitware.com/cmake/cmake/-/issues/25500
@santigimeno
Member

Happy to take a look, but I lack the hardware to test. I would appreciate hints on how to set up a virtual environment that reproduces the issue.

@glaubitz

Access to POWER machines can be obtained either through the GCC Compile Farm [1] or OpenPOWER at OSUOSL [2].

If you want to use the GCC Compile Farm, I can install any package dependencies on the machine gcc203 of which I am an admin.

[1] https://gcc.gnu.org/wiki/CompileFarm
[2] https://osuosl.org/services/powerdev/request_hosting/

@santigimeno
Member

Thanks for the info! Have you been able to reproduce the issue in the GCC Compile Farm? Reading https://gitlab.kitware.com/cmake/cmake/-/issues/25500#note_1468171 it's not clear to me. Or is the suggestion to run an openSUSE Tumbleweed VM on those boxes?

@glaubitz

I have been able to reproduce it on Debian unstable which is what's running on gcc203.

@santigimeno
Member

Thanks. I requested access to the GCC Compile Farm. I'll get back to you in case I need assistance once/if I have access.

@bnoordhuis
Member

@libuv/aix (since we don't have a dedicated ppc team) probably also relevant to IBM's business interests? You may want to take a look.

The backtrace from the cmake issue is strange:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  <signal handler called>
#2  0x00007ffff75b4d54 in epoll_pwait () from /lib/powerpc64le-linux-gnu/libc.so.6
#3  0x00007ffff7c5fd38 in uv__io_poll (loop=0x100cfd100, timeout=-1) at ./src/unix/linux.c:1359
#4  0x00007ffff7c44984 in uv_run (loop=0x100cfd100, mode=UV_RUN_ONCE) at ./src/unix/core.c:447

The process is sleeping in epoll_pwait() and is apparently interrupted by a signal, which then segfaults the process because the signal handler jumps (or points?) to the zero address? Assuming it's not some gdb artifact, I'm not sure how to make sense of that.
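
For context on the three frames above the null address: a signal delivered while a thread is blocked in epoll_pwait() runs its handler on top of the blocked call, and the call then fails with EINTR. The following is a minimal sketch of that normal, non-crashing sequence on Linux. It is not libuv's code and it does not reproduce this bug. The function name interrupted_epoll_wait_demo, the choice of SIGALRM, and the 50 ms timer are assumptions made for illustration, and a finite 5 s timeout replaces the timeout=-1 seen in the backtrace so the sketch cannot hang.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/time.h>
#include <unistd.h>

/* Set by the handler so the caller can confirm it actually ran. */
static volatile sig_atomic_t handler_ran = 0;

static void on_alarm(int signo) {
  (void) signo;
  handler_ran = 1;
}

/* Hypothetical demo helper (not part of libuv).
 * Returns 1 if epoll_pwait() was interrupted by the signal handler (EINTR),
 * 0 if it returned for any other reason, and -1 on a setup error. */
int interrupted_epoll_wait_demo(void) {
  struct sigaction sa;
  struct itimerval timer;
  struct epoll_event ev;
  int epfd;
  int rc;
  int saved_errno;

  handler_ran = 0;

  /* Install a plain handler for SIGALRM. */
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = on_alarm;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGALRM, &sa, NULL) != 0)
    return -1;

  /* An empty epoll set: nothing can become ready, so the wait blocks. */
  epfd = epoll_create1(0);
  if (epfd < 0)
    return -1;

  /* One-shot timer: deliver SIGALRM after 50 ms. */
  memset(&timer, 0, sizeof(timer));
  timer.it_value.tv_usec = 50 * 1000;
  if (setitimer(ITIMER_REAL, &timer, NULL) != 0) {
    close(epfd);
    return -1;
  }

  /* Block here; the handler runs "inside" this call when the signal lands. */
  rc = epoll_pwait(epfd, &ev, 1, 5000, NULL);
  saved_errno = errno;
  close(epfd);

  return rc == -1 && saved_errno == EINTR && handler_ran == 1;
}
```

In the healthy case the handler returns and epoll_pwait() reports EINTR. The backtrace in this issue differs in that control ends up at address 0 after the signal frame instead of returning.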

@glaubitz

> Thanks. I requested access to the GCC Compile Farm. I'll get back to you in case I need assistance once/if I have access.

Sure. Let me know when you need a specific package or kernel installed.

@richardlau
Contributor

> @libuv/aix (since we don't have a dedicated ppc team) probably also relevant to IBM's business interests? You may want to take a look.

@abmusse Is this something you could look at? Just to be clear, this is a Linux on ppc64le issue (not AIX nor IBM i).

@bnoordhuis Could we add @abmusse to the libuv/aix team (I don't have the ability to do so)? He's already on libuv/ibmi and has been expanding his involvement beyond IBM i to include other platforms. He works for IBM.

The ppc64le machines we have in the Node.js CI are all RHEL 8 or CentOS 7, neither of which supports io_uring. (FWIW, io_uring is available as a Technology Preview in RHEL 9.3 but is disabled by default, and we currently aren't running RHEL 9 on our VMs.)

@santigimeno
Member

> Could we add @abmusse to the libuv/aix team (I don't have the ability to do so)? He's already on libuv/ibmi and has been expanding his involvement beyond IBM i to include other platforms. He works for IBM.

@richardlau I just added him to the team.

@abmusse
Contributor

abmusse commented Jan 12, 2024

> @abmusse Is this something you could look at? Just to be clear, this is a Linux on ppc64le issue (not AIX nor IBM i).

@richardlau I will have a look at this one! Adding it to my backlog.

bradking added a commit to bradking/libuv that referenced this issue Jan 12, 2024
Since `io_uring` support was added, libuv's signal handler randomly
segfaults on ppc64le when interrupting `epoll_pwait`.  Disable it
pending further investigation.

Issue: libuv#4283
bradking added a commit to bradking/libuv that referenced this issue Jan 12, 2024
Since `io_uring` support was added, libuv's signal handler randomly
segfaults on ppc64 when interrupting `epoll_pwait`.  Disable it
pending further investigation.

Issue: libuv#4283
santigimeno pushed a commit that referenced this issue Jan 13, 2024
Since `io_uring` support was added, libuv's signal handler randomly
segfaults on ppc64 when interrupting `epoll_pwait`.  Disable it
pending further investigation.

Issue: #4283
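
The commit message above describes turning io_uring off on 64-bit POWER builds. A minimal sketch of how such a per-architecture compile-time guard can look follows; it is not the actual patch. The function name sketch_io_uring_allowed is hypothetical, and using the compiler-predefined macros __powerpc64__ and __ppc64__ to detect the target is an assumption about how the check could be written.

```c
/* Hypothetical helper (not libuv's real function).
 * Returns 0 when the binary was compiled for 64-bit POWER, where the
 * io_uring code path is treated as unsafe pending investigation,
 * and 1 on every other target. */
int sketch_io_uring_allowed(void) {
#if defined(__powerpc64__) || defined(__ppc64__)
  return 0;  /* Assumed guard: skip io_uring, fall back to the epoll path. */
#else
  return 1;  /* Other architectures keep io_uring enabled. */
#endif
}
```

Because the decision is made by the preprocessor, the guard costs nothing at run time, but it also cannot be overridden without rebuilding.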
@santigimeno
Member

> I have been able to reproduce it on Debian unstable which is what's running on gcc203.

Finally got access to cfarm203, built cmake 3.28 from source, and ran `bin/cmake -P Tests/CMakeTests/MathTest.cmake` 10K times in a loop, but saw no crashes. @glaubitz do you have the specific steps that were used to reproduce the crash? Thanks
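
For anyone who wants to repeat this kind of loop, here is a minimal sketch of a harness that reruns a command and stops at the first run killed by a signal. It is an illustration only, not the script that was actually used on cfarm203. The function name run_until_signal and its return convention are assumptions.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] with argv up to `iterations` times.
 * Returns the number of the signal that killed the first crashing run,
 * 0 if no run was killed by a signal, or -1 if fork/waitpid failed. */
int run_until_signal(char* const argv[], int iterations) {
  int i;

  for (i = 0; i < iterations; i++) {
    int status;
    pid_t pid = fork();

    if (pid < 0)
      return -1;

    if (pid == 0) {
      /* Child: replace the process image with the command under test. */
      execvp(argv[0], argv);
      _exit(127);  /* Only reached if exec failed. */
    }

    /* Parent: wait for the child, retrying if the wait is interrupted. */
    while (waitpid(pid, &status, 0) < 0) {
      if (errno != EINTR)
        return -1;
    }

    if (WIFSIGNALED(status)) {
      fprintf(stderr, "iteration %d: killed by signal %d\n",
              i, WTERMSIG(status));
      return WTERMSIG(status);
    }
  }

  return 0;
}
```

Calling it with an argument vector of `bin/cmake`, `-P`, `Tests/CMakeTests/MathTest.cmake` and 10000 iterations would mirror the loop described above; a return value of 11 on Linux would indicate a SIGSEGV.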

@glaubitz

@santigimeno I started a two-stage build of LLVM-17.
