
Router crash at network scale #1463

Open
kgiusti opened this issue Apr 12, 2024 · 2 comments

kgiusti commented Apr 12, 2024

Attempting to grow the router network will eventually cause a router crash:

*** SKUPPER-ROUTER FATAL ERROR ***
Version: 2.4.2-rh-1
Signal: 11 SIGSEGV
Process ID: 1 (skrouterd)
Thread ID: 15 (wrkr_0)

Backtrace:
[0] IP: 0x00007f8b27ddcdf0 (/lib64/libc.so.6 + 0x0000000000054df0)
    Registers:
      RAX: 0x00000000ffffffff RDI: 0x00007f8b24a18bf0 R11: 0x00007f8b24a19550
      RBX: 0x0000000000001388 RBP: 0x00007f8b24a1a0f0 R12: 0x00005627ed9eaa80
      RCX: 0x00007f8b27ecd2db R8:  0x0000000000000000 R13: 0x0000000000000000
      RDX: 0x0000000000000000 R9:  0x0000000000000000 R14: 0x0000000000000000
      RSI: 0x0000000000000002 R10: 0x00000000000000cb R15: 0x00005627edc58640
      SP:  0x00007f8b24a19400

[1] IP: 0x00005627ed929b8f (skrouterd + 0x00000000000c5b8f)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x00005627ed9eaa80 R11: 0xc59df5176bf549ed
      RBX: 0x0000000000001388 RBP: 0x00007f8b24a1a0f0 R12: 0x00005627ed9eaa80
      RCX: 0x0000000000000000 R8:  0x00007f8afc086fb0 R13: 0x0000000000000000
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x0000000000000000
      RSI: 0x0000000000000000 R10: 0x00007f8b27f393e0 R15: 0x00005627edc58640
      SP:  0x00007f8b24a1a0d0

[2] IP: 0x00005627ed925c0b (skrouterd + 0x00000000000c1c0b)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x00005627ed9eaa80 R11: 0xc59df5176bf549ed
      RBX: 0x00005627edc58640 RBP: 0x00007f8b24a1a120 R12: 0x00007f8ae81b4988
      RCX: 0x0000000000000000 R8:  0x00007f8afc086fb0 R13: 0x00007f8b04067c48
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f8b04067d88
      RSI: 0x0000000000000000 R10: 0x00007f8b27f393e0 R15: 0x00005627edc58640
      SP:  0x00007f8b24a1a100

[3] IP: 0x00005627ed92ddbe (skrouterd + 0x00000000000c9dbe)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x00005627ed9eaa80 R11: 0xc59df5176bf549ed
      RBX: 0x00007f8ae8983448 RBP: 0x00007f8b24a1a290 R12: 0x0000000000000000
      RCX: 0x0000000000000000 R8:  0x00007f8afc086fb0 R13: 0x00007f8af1c24de0
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f8af1f52210
      RSI: 0x0000000000000000 R10: 0x00007f8b27f393e0 R15: 0x00005627edc58640
      SP:  0x00007f8b24a1a130

[4] IP: 0x00007f8b27e27802 (/lib64/libc.so.6 + 0x000000000009f802)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x00005627ed9eaa80 R11: 0xc59df5176bf549ed
      RBX: 0x00007f8b24a1b640 RBP: 0x0000000000000000 R12: 0x00007f8b24a1b640
      RCX: 0x0000000000000000 R8:  0x00007f8afc086fb0 R13: 0x0000000000000002
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f8b27e27530
      RSI: 0x0000000000000000 R10: 0x00007f8b27f393e0 R15: 0x0000000000000000
      SP:  0x00007f8b24a1a2a0

[5] IP: 0x00007f8b27dc7314 (/lib64/libc.so.6 + 0x000000000003f314)
    Registers:
      RAX: 0x0000000000000000 RDI: 0x00005627ed9eaa80 R11: 0xc59df5176bf549ed
      RBX: 0x00007ffda1a3a350 RBP: 0x0000000000000000 R12: 0x00007f8b24a1b640
      RCX: 0x0000000000000000 R8:  0x00007f8afc086fb0 R13: 0x0000000000000002
      RDX: 0x0000000000000001 R9:  0x0000000000000000 R14: 0x00007f8b27e27530
      RSI: 0x0000000000000000 R10: 0x00007f8b27f393e0 R15: 0x0000000000000000
      SP:  0x00007f8b24a1a340

*** END ***

It is unclear exactly how large the user had scaled the network before the crash, but they were trying for 128 nodes, which is well beyond what we test in CI.

kgiusti commented Apr 12, 2024

Analysis of the dump:

  1. thread_run is calling qd_connection_free() while handling a TRANSPORT_CLOSED event.
  2. qd_connection_free() finds the connection has a connector, so it schedules the connector for reconnect.
  3. qd_timer_schedule(connector->timer, 1000) is called.
  4. connector->timer is 0 (R13, which should hold the pointer to the timer, is zero in the unwind).
  5. SEGV when dereferencing the null timer pointer.

kgiusti commented Apr 12, 2024

Hmmm... it looks like the management code that deletes the connector does set the timer pointer to zero. However, it also sets the connector state to DELETED, all while holding the connector lock. At the point where the crash occurred, that same lock is held and the state is checked, so it seems it should be impossible to attempt to schedule the timer there.

One somewhat troubling point is that the timer is freed outside of the lock, which means the timer may fire after connector->timer has been zeroed. But even that doesn't seem to line up with the crash, since the timer callback also verifies the connector state under the lock.
